  • Type: function
By default, the crawler adds URLs to the queue based on pathsToMatch, fileTypesToMatch, and exclusionPatterns. To customize this behavior, define a linkExtractor function. This function runs on every crawled page and returns a list of URLs as strings.

Examples

JavaScript
{
  linkExtractor: ({ $, url, defaultExtractor }) => {
    if (/example\.com\/doc\//.test(url.href)) {
      // For all pages under /doc, only queue the first link found
      return defaultExtractor().slice(0, 1);
    }
    // Otherwise, use the default logic (queue all found links)
    return defaultExtractor();
  },
}
JavaScript
{
  linkExtractor: ({ $, url, defaultExtractor }) => {
    // This turns off link discovery, except for URLs listed in sitemap.xml
    return /sitemap\.xml/.test(url.href) ? defaultExtractor() : [];
  },
}
JavaScript
{
  linkExtractor: ({ $ }) => {
    // Access the DOM and extract what you specify
    return [$('.my-link').attr('href')];
  },
}

Parameters

$ (object)
A Cheerio instance with the HTML of the crawled page.

defaultExtractor (function)
The default extractor function; calling it returns the list of links the crawler would queue by default.

url (URL)
URL object of the crawled page.
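
Combining these parameters, here is a sketch that resolves relative hrefs selected with $ against the page's url before returning them (the .pagination selector is hypothetical, and the standard URL constructor is assumed to be available):

JavaScript
{
  linkExtractor: ({ $, url }) => {
    // Collect hrefs from a hypothetical .pagination block and resolve them
    // against the crawled page's URL so relative links become absolute
    return $('.pagination a')
      .map((i, el) => $(el).attr('href'))
      .get()
      .filter(Boolean)
      .map((href) => new URL(href, url.href).href);
  },
}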