By default, the crawler adds URLs to the queue based on pathsToMatch, fileTypesToMatch, and exclusionPatterns.
To customize this behavior, define a linkExtractor function.
This function runs on every crawled page and returns a list of URLs as strings.
Examples
{
  linkExtractor: ({ $, url, defaultExtractor }) => {
    // The dot is escaped so it matches a literal "." in the host name
    if (/example\.com\/doc\//.test(url.href)) {
      // For all pages under /doc, only queue the first found link
      return defaultExtractor().slice(0, 1);
    }
    // Otherwise, use the default logic (queue all found links)
    return defaultExtractor();
  },
}
{
  linkExtractor: ({ $, url, defaultExtractor }) => {
    // This turns off link discovery, except for URLs listed in sitemap.xml
    return /sitemap\.xml/.test(url.href) ? defaultExtractor() : [];
  },
}
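{
  linkExtractor: ({ $ }) => {
    // Access the DOM and queue the href of every element matching .my-link.
    // Elements without an href are skipped, so the result contains only strings.
    return $('.my-link')
      .map((i, el) => $(el).attr('href'))
      .get();
  },
}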
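Because linkExtractor is a plain function, it can be tried out locally with stubbed arguments before it goes into the crawler configuration. The following is a minimal sketch, not part of the crawler API: the name sameHostLinkExtractor and the sample URLs are hypothetical, and it assumes that defaultExtractor() returns URL strings (relative ones are resolved against the page URL as a precaution).

```javascript
// Hypothetical extractor: keep only links on the same host as the crawled page,
// strip fragments, and drop duplicates.
// Assumption: defaultExtractor() returns an array of URL strings.
const sameHostLinkExtractor = ({ url, defaultExtractor }) => {
  const seen = new Set();
  for (const link of defaultExtractor()) {
    let parsed;
    try {
      // Resolve relative links against the URL of the crawled page
      parsed = new URL(link, url.href);
    } catch {
      continue; // Skip values that can't be parsed as URLs
    }
    if (parsed.host !== url.host) continue; // Different host: don't queue
    parsed.hash = ''; // "#section" variants point to the same page
    seen.add(parsed.href);
  }
  return [...seen];
};

// Local check with stubbed arguments (no crawler required)
const result = sameHostLinkExtractor({
  url: new URL('https://example.com/doc/intro'),
  defaultExtractor: () => [
    'https://example.com/doc/a#top',
    'https://example.com/doc/a',
    'https://other.example.org/b',
    '/doc/c',
  ],
});
console.log(result);
// → [ 'https://example.com/doc/a', 'https://example.com/doc/c' ]
```

In a crawler configuration, the function would be assigned to the linkExtractor key in the same way as in the examples above.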
Parameters
$: A Cheerio instance with the HTML of the crawled page.
defaultExtractor: The default extractor function. Calling it returns the links that the crawler would queue by default.
url: URL object of the crawled page.