Authorizations
Basic authentication header of the form Basic <encoded-value>, where <encoded-value> is the base64-encoded string username:password.
Path Parameters
Crawler ID.
"e0f6db8a-24f5-4092-83a4-1b2c6cb6d809"
Body
Crawler configuration to update.
You can only update top-level configuration properties. To update a nested configuration, such as actions.recordExtractor, you must provide the complete top-level object, such as actions.
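For example, to change only the crawler's actions while leaving the rest of the configuration untouched, send the complete actions array. The sketch below is illustrative: the index name and URL pattern are placeholders, and a real action typically also includes its recordExtractor.
{
  "actions": [
    {
      "indexName": "crawler_pages",
      "pathsToMatch": ["https://www.example.com/**"]
    }
  ]
}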
Crawler configuration.
A list of actions.
1 - 30 elements
Algolia application ID where the crawler creates and updates indices.
Determines the number of concurrent tasks per second that can run for this configuration.
A higher rate limit means more crawls per second. Algolia prevents system overload by ensuring that the number of URLs added in the last second and the number of URLs currently being processed don't exceed the rate limit:
max(new_urls_added, active_urls_processing) <= rateLimit
Start with a low value (for example, 2) and increase it if you need faster crawling. Be aware that a high rateLimit can have a huge impact on bandwidth cost and server resource consumption.
The number of pages processed per second depends on the average time it takes to fetch, process, and upload a URL. For a given rateLimit, if fetching, processing, and uploading URLs takes (on average):
- Less than a second, your crawler processes up to rateLimit pages per second.
- Four seconds, your crawler processes up to rateLimit / 4 pages per second.
In the latter case, increasing rateLimit improves performance, but only up to a point: once the four-second processing time becomes the limiting factor, raising rateLimit further won't increase the number of pages processed per second.
1 <= x <= 100
4
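For example, with the default rateLimit of 4 and an average of four seconds to fetch, process, and upload each URL, throughput tops out around one page per second (4 / 4). A minimal sketch of the setting in a configuration:
{
  "rateLimit": 4
}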
The Algolia API key the crawler uses for indexing records. If you don't provide an API key, one will be generated by the Crawler when you create a configuration.
The API key must have:
- These rights and restrictions: search, addObject, deleteObject, deleteIndex, settings, editSettings, listIndexes, browse.
- Access to the correct set of indices, based on the crawler's indexPrefix. For example, if the prefix is crawler_, the API key must have access to crawler_*.
Don't use your Admin API key.
URLs to exclude from crawling.
100
Use micromatch for negation, wildcards, and more.
[
"https://www.example.com/excluded",
"!https://www.example.com/this-one-url",
"https://www.example.com/exclude/**"
]
References to external data sources for enriching the extracted records.
10
For more information, see Enrich extracted records with external data.
The Crawler treats extraUrls the same as startUrls. Specify extraUrls if you want to distinguish URLs you manually added to fix site crawling from those you initially specified in startUrls.
9999
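A minimal sketch of how the two lists might be used together (the URLs are illustrative):
{
  "startUrls": ["https://www.example.com"],
  "extraUrls": ["https://www.example.com/page-the-crawler-misses"]
}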
Determines if the crawler should extract records from a page with a canonical URL.
If ignoreCanonicalTo is set to:
- true, all canonical URLs are ignored.
- One or more URL patterns, the crawler will ignore the canonical URL if it matches a pattern.
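For example, to ignore canonical URLs only for a subset of pages, you can provide URL patterns (the pattern below is illustrative):
{
  "ignoreCanonicalTo": ["https://www.example.com/news/**"]
}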
Determines if the crawler should follow links with a nofollow directive.
If true, the crawler will ignore the nofollow directive and crawl links on the page. The crawler always ignores links that don't match your configuration settings.
ignoreNoFollowTo applies to:
- Links that are ignored because the robots meta tag contains nofollow or none.
- Links with a rel attribute containing the nofollow directive.
Whether to ignore the noindex robots meta tag. If true, pages with this meta tag will be crawled.
Whether the crawler should follow rel="prev" and rel="next" pagination links in the <head> section of an HTML page.
- If true, the crawler ignores the pagination links.
- If false, the crawler follows the pagination links.
Query parameters to ignore while crawling.
All URLs with the matching query parameters are treated as identical. This prevents indexing URLs that just differ by their query parameters.
9999
Use wildcards to match multiple query parameters.
["ref", "utm_*"]
Whether to ignore rules defined in your robots.txt file.
A prefix for all indices created by this crawler. It's combined with the indexName for each action to form the complete index name.
64
"crawler_"
Crawler index settings.
These index settings are only applied during the first crawl of an index.
Any subsequent changes won't be applied to the index. Instead, make changes to your index settings in the Algolia dashboard.
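As a sketch, assuming an index named crawler_pages, the settings are keyed by index name; the specific settings shown here are illustrative:
{
  "initialIndexSettings": {
    "crawler_pages": {
      "searchableAttributes": ["title", "description", "content"],
      "customRanking": ["desc(popularity)"]
    }
  }
}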
Function for extracting URLs from links on crawled pages.
For more information, see the linkExtractor documentation.
Authorization method and credentials for crawling protected content.
The Crawler supports these authentication methods:
- Basic authentication. The Crawler obtains a session cookie from the login page.
- OAuth 2.0 authentication (oauthRequest). The Crawler uses OAuth 2.0 client credentials to obtain an access token for authentication.
Basic authentication
The Crawler extracts the Set-Cookie response header from the login page, stores that cookie, and sends it in the Cookie header when crawling all pages defined in the configuration.
This cookie is retrieved only at the start of each full crawl. If it expires, it isn't automatically renewed.
The Crawler can obtain the session cookie in one of two ways:
- HTTP request authentication (fetchRequest). The Crawler sends a direct request with your credentials to the login endpoint, similar to a curl command.
- Browser-based authentication (browserRequest). The Crawler emulates a web browser by loading the login page, entering the credentials, and submitting the login form as a real user would.
OAuth 2.0
The Crawler supports the OAuth 2.0 client credentials grant flow:
- It performs an access token request with the provided credentials.
- It stores the fetched token in an Authorization header.
- It sends the token when crawling site pages.
This token is only fetched at the beginning of each complete crawl. If it expires, it isn't automatically renewed. Client authentication passes the credentials (client_id and client_secret) in the request body.
The Azure AD v1.0 provider is supported.
Information for making an HTTP request for authorization.
The following example shows the HTTP request (fetchRequest) method:
{
"url": "https://example.com/secure/login-with-post",
"requestOptions": {
"method": "POST",
"headers": {
"Content-Type": "application/x-www-form-urlencoded"
},
"body": "id=my-id&password=my-password",
"timeout": 5000
}
}
Determines the maximum path depth of crawled URLs.
Path depth is calculated based on the number of slash characters (/) after the domain (starting at 1).
For example:
- http://example.com: depth 1
- http://example.com/: depth 1
- http://example.com/foo: depth 1
- http://example.com/foo/: depth 2
- http://example.com/foo/bar: depth 2
- http://example.com/foo/bar/: depth 3
URLs added with startUrls and sitemaps aren't checked for maxDepth.
1 <= x <= 100
5
Limits the number of URLs your crawler processes.
Change it to a low value, such as 100, for quick crawling tests. Set it to a higher explicit value for full crawls to prevent the crawler from getting "lost" in complex site structures.
Because the Crawler works on many pages simultaneously, maxUrls doesn't guarantee finding the same pages each time it runs.
1 <= x <= 15000000
250
If true, use a Chrome headless browser to crawl pages.
Because crawling JavaScript-based web pages is slower than crawling regular HTML pages, you can apply this setting to a specific list of pages. Use micromatch to define URL patterns, including negations and wildcards.
Whether to render all pages.
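For example, to render only a JavaScript-heavy section of a site, you might restrict rendering to matching URL patterns (the pattern below is illustrative):
{
  "renderJavaScript": ["https://www.example.com/app/**"]
}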
Lets you add options to HTTP requests made by the crawler.
Checks to ensure the crawl was successful.
For more information, see the Safety checks documentation.
Whether to back up your index before the crawler overwrites it with new records.
Schedule for running the crawl.
Instead of manually starting a crawl each time, you can set up a schedule for automatic crawls.
Use the visual UI or add the schedule parameter to your configuration. schedule uses Later.js syntax to specify when to crawl your site.
Here are some key things to keep in mind when using Later.js syntax with the Crawler:
- The interval between two scheduled crawls must be at least 24 hours.
- To crawl daily, use "every 1 day" instead of "everyday" or "every day".
- If you don't specify a time, the crawl can happen any time during the scheduled day.
- Specify times in the UTC (GMT+0) timezone.
- Include minutes when specifying a time. For example, "at 3:00 pm" instead of "at 3pm".
- Use "at 12:00 am" to specify midnight, not "at 00:00 am".
"every weekday at 12:00 pm"
Sitemaps with URLs from where to start crawling.
9999
URLs from where to start crawling.
9999
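A minimal sketch of starting points for a crawl (the URLs are illustrative):
{
  "startUrls": ["https://www.example.com"],
  "sitemaps": ["https://www.example.com/sitemap.xml"]
}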
Response
OK
Universally unique identifier (UUID) of the task.
"98458796-b7bb-4703-8b1b-785c1080b110"