linkExtractor

Type: function

By default, the crawler adds URLs to the queue based on pathsToMatch, fileTypesToMatch, and exclusionPatterns. To customize this behavior, define a linkExtractor function. This function runs on every crawled page and returns a list of URLs as strings.

Examples

JavaScript

{
  linkExtractor: ({ $, url, defaultExtractor }) => {
    // The dot is escaped so it matches a literal "." in the hostname
    if (/example\.com\/doc\//.test(url.href)) {
      // For all pages under /doc/, only queue the first link found
      return defaultExtractor().slice(0, 1);
    }
    // Otherwise, use the default logic (queue all links found)
    return defaultExtractor();
  },
}

JavaScript

{
  linkExtractor: ({ $, url, defaultExtractor }) => {
    // Turn off link discovery, except for URLs listed in sitemap.xml
    return /sitemap\.xml/.test(url.href) ? defaultExtractor() : [];
  },
}

JavaScript

{
  linkExtractor: ({ $ }) => {
    // Access the DOM and queue the href of the first .my-link element, if any
    const href = $('.my-link').attr('href');
    return href ? [href] : [];
  },
}
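
The href values read from the DOM can be relative, empty, or missing. The following is a minimal sketch of a helper that normalizes them against the page URL before they are returned, assuming the crawler expects absolute URLs (this page does not say either way). The name toAbsoluteUrls is hypothetical and is not part of the crawler API.

```javascript
// Hypothetical helper, not part of the crawler API.
// Resolves raw href values against the page URL and drops unusable entries.
function toAbsoluteUrls(hrefs, baseUrl) {
  const urls = [];
  for (const href of hrefs) {
    // Skip undefined, non-string, or empty values
    if (typeof href !== 'string' || href.trim() === '') continue;
    try {
      // The URL constructor resolves relative references against baseUrl
      urls.push(new URL(href, baseUrl).href);
    } catch (err) {
      // Skip values that cannot be parsed as a URL
    }
  }
  // Remove duplicates while keeping first-seen order
  return [...new Set(urls)];
}
```

Inside a linkExtractor, it could be called as toAbsoluteUrls([$('.my-link').attr('href')], url), since url is the URL object of the crawled page.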

Parameters

$

object

A Cheerio instance with the HTML of the crawled page.

defaultExtractor

function

The default extractor function. Calling it with no arguments returns the list of URLs that the crawler would queue by default.

url

URL

URL object of the crawled page.
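
To illustrate how url and defaultExtractor can be combined, here is a sketch that keeps only the default links on the same host as the crawled page. The name sameHostLinkExtractor and the stand-in arguments are hypothetical. The sketch assumes, as the examples above suggest, that defaultExtractor() returns an array of URL strings.

```javascript
// Sketch: keep only the default links that stay on the crawled page's host.
const sameHostLinkExtractor = ({ url, defaultExtractor }) =>
  defaultExtractor().filter((link) => {
    try {
      // Resolve against the page URL so relative links are handled too
      return new URL(link, url).hostname === url.hostname;
    } catch (err) {
      return false; // ignore strings that are not valid URLs
    }
  });

// Local check with stand-in arguments (the crawler supplies the real ones).
const result = sameHostLinkExtractor({
  url: new URL('https://example.com/doc/intro'),
  defaultExtractor: () => [
    'https://example.com/doc/setup',
    'https://other.example.org/page',
  ],
});
console.log(result); // only the example.com link remains
```

In a crawler configuration, this would be passed as { linkExtractor: sameHostLinkExtractor }.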
