Updates the configuration of the specified crawler. Every time you update the configuration, a new version is created.
Basic authentication header of the form Basic <encoded-value>, where <encoded-value> is the base64-encoded string username:password.
Crawler ID. Universally unique identifier (UUID) of the crawler.
"e0f6db8a-24f5-4092-83a4-1b2c6cb6d809"
Crawler configuration to update.
You can only update top-level configuration properties.
To update a nested configuration, such as actions.recordExtractor,
you must provide the complete top-level object (for example, actions).
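For illustration only, here's a minimal sketch of such an update request. It assumes the endpoint is PATCH https://crawler.algolia.com/api/1/crawlers/{crawler_id}/config and uses placeholder credentials and action values (pathsToMatch and the record extractor body aren't defined in this reference):

// Sketch: updates top-level properties only. To change actions.recordExtractor,
// the complete actions array is sent. URL, credentials, and action values are placeholders.
const crawlerId = "e0f6db8a-24f5-4092-83a4-1b2c6cb6d809";

const response = await fetch(
  `https://crawler.algolia.com/api/1/crawlers/${crawlerId}/config`, // assumed endpoint
  {
    method: "PATCH",
    headers: {
      Authorization: "Basic <encoded-value>", // base64-encoded username:password
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      rateLimit: 4,
      actions: [
        {
          indexName: "crawler_example",                 // placeholder index name
          pathsToMatch: ["https://www.example.com/**"], // assumed property name
          recordExtractor: { /* complete extractor definition */ },
        },
      ],
    }),
  },
);
console.log(response.status); // 200 (OK) on success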
A list of actions.
1 - 30 elements
Algolia application ID where the crawler creates and updates indices.
Determines the number of concurrent tasks per second that can run for this configuration.
A higher rate limit means more crawls per second. Algolia prevents system overload by ensuring the number of URLs added in the last second and the number of URLs being processed is less than the rate limit:
max(new_urls_added, active_urls_processing) <= rateLimit
Start with a low value (for example, 2) and increase it if you need faster crawling.
Be aware that a high rateLimit can have a huge impact on bandwidth cost and server resource consumption.
The number of pages processed per second depends on the average time it takes to fetch, process, and upload a URL.
For a given rateLimit, if fetching, processing, and uploading URLs takes (on average):

- One second, the crawler processes rateLimit pages per second.
- Four seconds, the crawler processes rateLimit / 4 pages per second.

In the latter case, increasing rateLimit improves performance, up to a point.
However, if the processing time remains at four seconds, increasing rateLimit won't increase the number of pages processed per second.
1 <= x <= 100
4
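As a rough sketch of the relationship described above (assuming throughput is limited both by rateLimit and by the average per-URL time):

// Illustration only: estimates pages per second from rateLimit and the
// average time (in seconds) to fetch, process, and upload one URL.
function estimatedPagesPerSecond(rateLimit: number, avgSecondsPerUrl: number): number {
  return Math.min(rateLimit, rateLimit / avgSecondsPerUrl);
}

estimatedPagesPerSecond(8, 1); // 8: one second per URL gives rateLimit pages per second
estimatedPagesPerSecond(8, 4); // 2: four seconds per URL gives rateLimit / 4 pages per second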
The Algolia API key the crawler uses for indexing records. If you don't provide an API key, one will be generated by the Crawler when you create a configuration.
The API key must have:

- These ACLs: search, addObject, deleteObject, deleteIndex, settings, editSettings, listIndexes, browse.
- Access to all indices matching the indexPrefix. For example, if the prefix is crawler_, the API key must have access to crawler_*.

Don't use your Admin API key.
URLs to exclude from crawling.
Up to 100 elements
Use micromatch for negation, wildcards, and more.
[
  "https://www.example.com/excluded",
  "!https://www.example.com/this-one-url",
  "https://www.example.com/exclude/**"
]
References to external data sources for enriching the extracted records.
Up to 10 elements
For more information, see Enrich extracted records with external data.
The Crawler treats extraUrls the same as startUrls.
Specify extraUrls if you want to differentiate the URLs you manually added to fix site crawling from those you initially specified in startUrls.
Up to 9999 elements
Determines if the crawler should extract records from a page with a canonical URL.
If ignoreCanonicalTo is set to true, all canonical URLs are ignored.
Determines if the crawler should follow links with a nofollow directive.
If true, the crawler will ignore the nofollow directive and crawl links on the page.
The crawler always ignores links that don't match your configuration settings.
ignoreNoFollowTo applies to:

- Links on pages whose robots meta tag contains nofollow or none.
- Links whose rel attribute contains the nofollow directive.

Whether to ignore the noindex robots meta tag.
If true, pages with this meta tag will be crawled.
Whether the crawler should follow rel="prev" and rel="next" pagination links in the <head> section of an HTML page.
If true, the crawler ignores the pagination links. If false, the crawler follows the pagination links.
Query parameters to ignore while crawling.
All URLs with the matching query parameters are treated as identical. This prevents indexing URLs that just differ by their query parameters.
Up to 9999 elements
Use wildcards to match multiple query parameters.
["ref", "utm_*"]
Whether to ignore rules defined in your robots.txt file.
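As a hedged sketch, the ignore* options described above might be combined like this in a configuration. The property names that aren't spelled out in this reference (ignoreNoIndex, ignorePaginationAttributes, ignoreQueryParams, ignoreRobotsTxtRules) are assumptions based on the descriptions:

// Sketch with placeholder values; property names not shown in this reference are assumptions.
const ignoreOptions = {
  ignoreCanonicalTo: true,          // ignore all canonical URLs
  ignoreNoFollowTo: true,           // crawl links despite nofollow directives
  ignoreNoIndex: false,             // respect the noindex robots meta tag
  ignorePaginationAttributes: true, // ignore rel="prev" and rel="next" links
  ignoreQueryParams: ["ref", "utm_*"],
  ignoreRobotsTxtRules: false,      // keep honoring robots.txt
};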
A prefix for all indices created by this crawler. It's combined with the indexName for each action to form the complete index name.
64"crawler_"
Crawler index settings.
These index settings are only applied during the first crawl of an index.
Any subsequent changes won't be applied to the index. Instead, make changes to your index settings in the Algolia dashboard.
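A minimal sketch, assuming initialIndexSettings is keyed by the full index name (including the indexPrefix) and accepts regular Algolia index settings:

// Assumption: settings are keyed by index name and use standard Algolia index settings.
// They're applied only during the first crawl of that index.
const initialIndexSettings = {
  crawler_example: {
    searchableAttributes: ["title", "description", "content"],
    customRanking: ["desc(popularity)"],
  },
};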
Function for extracting URLs from links on crawled pages.
For more information, see the linkExtractor documentation.
Authorization method and credentials for crawling protected content.
The Crawler supports these authentication methods:
- OAuth 2.0 (oauthRequest). The Crawler uses OAuth 2.0 client credentials to obtain an access token for authentication.
- Basic authentication. The Crawler extracts the Set-Cookie response header from the login page, stores that cookie, and sends it in the Cookie header when crawling all pages defined in the configuration. This cookie is retrieved only at the start of each full crawl. If it expires, it isn't automatically renewed. The Crawler can obtain the session cookie in one of two ways:
  - A direct request (fetchRequest). The Crawler sends a direct request with your credentials to the login endpoint, similar to a curl command.
  - A browser-based request (browserRequest). The Crawler emulates a web browser by loading the login page, entering the credentials, and submitting the login form as a real user would.

OAuth 2.0

The crawler supports the OAuth 2.0 client credentials grant flow:

- The access token is sent in the Authorization header. This token is only fetched at the beginning of each complete crawl. If it expires, it isn't automatically renewed.
- Client authentication passes the credentials (client_id and client_secret) in the request body.
- The Azure AD v1.0 provider is supported.
{
  "url": "https://example.com/secure/login-with-post",
  "requestOptions": {
    "method": "POST",
    "headers": {
      "Content-Type": "application/x-www-form-urlencoded"
    },
    "body": "id=my-id&password=my-password",
    "timeout": 5000
  }
}
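The example above shows a direct request to a login endpoint. For the OAuth 2.0 client credentials flow described earlier, a login block might look like the following sketch. The property names (oauthRequest, accessTokenRequest, grantType, clientId, clientSecret) are assumptions for illustration and aren't confirmed by this reference:

// Hypothetical OAuth 2.0 login block; all property names are assumptions.
// The Crawler exchanges these client credentials for an access token at the
// start of each full crawl and sends it in the Authorization header.
const login = {
  oauthRequest: {
    accessTokenRequest: {
      url: "https://example.com/oauth/token",
      grantType: "client_credentials",
      clientId: "my-client-id",
      clientSecret: "my-client-secret",
    },
  },
};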
Determines the maximum path depth of crawled URLs.
Path depth is calculated based on the number of slash characters (/) after the domain (starting at 1).
For example:
- http://example.com = 1
- http://example.com/ = 1
- http://example.com/foo = 1
- http://example.com/foo/ = 2
- http://example.com/foo/bar = 2
- http://example.com/foo/bar/ = 3

URLs added with startUrls and sitemaps aren't checked for maxDepth.
1 <= x <= 100
5
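The depth calculation can be illustrated with a small helper. This is only a sketch of the documented rule (count the slashes after the domain, starting at 1), not the crawler's implementation:

// Sketch of the documented rule: depth is the number of "/" characters after
// the domain, with a minimum of 1. Not the crawler's actual implementation.
function pathDepth(url: string): number {
  const path = new URL(url).pathname; // "http://example.com" normalizes to "/"
  const slashes = (path.match(/\//g) ?? []).length;
  return Math.max(1, slashes);
}

pathDepth("http://example.com");          // 1
pathDepth("http://example.com/foo/");     // 2
pathDepth("http://example.com/foo/bar");  // 2
pathDepth("http://example.com/foo/bar/"); // 3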
Limits the number of URLs your crawler processes.
Change it to a low value, such as 100, for quick crawling tests.
Change it to a higher explicit value for full crawls to prevent the crawler from getting "lost" in complex site structures.
Because the Crawler works on many pages simultaneously, maxUrls doesn't guarantee finding the same pages each time it runs.
1 <= x <= 15000000
250
If true, use a Chrome headless browser to crawl pages.
Because crawling JavaScript-based web pages is slower than crawling regular HTML pages, you can apply this setting to a specific list of pages. Use micromatch to define URL patterns, including negations and wildcards.
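A hedged sketch of the two usages implied above, assuming the property is named renderJavaScript and accepts either a boolean or a list of micromatch URL patterns:

// Assumption: renderJavaScript accepts either a boolean or micromatch URL patterns.
const renderAllPages = { renderJavaScript: true };

const renderOnlyMatchingPages = {
  renderJavaScript: [
    "https://www.example.com/spa/**",    // render these JavaScript-based pages
    "!https://www.example.com/spa/help", // but not this one
  ],
};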
Lets you add options to HTTP requests made by the crawler.
Checks to ensure the crawl was successful.
For more information, see the Safety checks documentation.
Whether to back up your index before the crawler overwrites it with new records.
Schedule for running the crawl.
Instead of manually starting a crawl each time, you can set up a schedule for automatic crawls.
Use the visual UI or add the schedule parameter to your configuration.
schedule uses Later.js syntax to specify when to crawl your site.
Here are some key things to keep in mind when using Later.js syntax with the Crawler:
"every weekday at 12:00 pm"
Sitemaps with URLs from where to start crawling.
Up to 9999 elements
URLs from where to start crawling.
Up to 9999 elements
OK
Universally unique identifier (UUID) of the task.
"98458796-b7bb-4703-8b1b-785c1080b110"