PATCH /1/crawlers/{id}/config
Update crawler configuration
curl --request PATCH \
  --url https://crawler.algolia.com/api/1/crawlers/{id}/config \
  --header 'Authorization: Basic <encoded-value>' \
  --header 'Content-Type: application/json' \
  --data '{
  "actions": [
    {
      "autoGenerateObjectIDs": true,
      "cache": {
        "enabled": true
      },
      "discoveryPatterns": [
        "https://www.algolia.com/**"
      ],
      "fileTypesToMatch": [
        "html",
        "pdf"
      ],
      "hostnameAliases": {
        "dev.example.com": "example.com"
      },
      "indexName": "algolia_website",
      "name": "<string>",
      "pathAliases": {
        "example.com": {
          "/foo": "/bar"
        }
      },
      "pathsToMatch": [
        "https://www.algolia.com/**"
      ],
      "recordExtractor": {
        "__type": "function",
        "source": "<string>"
      },
      "schedule": "<string>",
      "selectorsToMatch": [
        ".products",
        "!.featured"
      ]
    }
  ],
  "apiKey": "<string>",
  "appId": "<string>",
  "exclusionPatterns": [
    "https://www.example.com/excluded",
    "!https://www.example.com/this-one-url",
    "https://www.example.com/exclude/**"
  ],
  "externalData": [
    "testCSV"
  ],
  "extraUrls": [
    "<string>"
  ],
  "ignoreCanonicalTo": true,
  "ignoreNoFollowTo": true,
  "ignoreNoIndex": true,
  "ignorePaginationAttributes": true,
  "ignoreQueryParams": [
    "ref",
    "utm_*"
  ],
  "ignoreRobotsTxtRules": true,
  "indexPrefix": "crawler_",
  "initialIndexSettings": {},
  "linkExtractor": {
    "__type": "function",
    "source": "({ $, url, defaultExtractor }) => {\n  if (/example.com\\/doc\\//.test(url.href)) {\n    // For all pages under `/doc`, only extract the first found URL.\n    return defaultExtractor().slice(0, 1)\n  }\n  // For all other pages, use the default.\n  return defaultExtractor()\n}\n"
  },
  "login": {
    "url": "https://example.com/secure/login-with-post",
    "requestOptions": {
      "method": "POST",
      "headers": {
        "Content-Type": "application/x-www-form-urlencoded"
      },
      "body": "id=my-id&password=my-password",
      "timeout": 5000
    }
  },
  "maxDepth": 5,
  "maxUrls": 250,
  "rateLimit": 4,
  "renderJavaScript": true,
  "requestOptions": {
    "proxy": "<string>",
    "timeout": 30000,
    "retries": 3,
    "headers": {
      "Accept-Language": "fr-FR",
      "Authorization": "Bearer Aerehdf==",
      "Cookie": "session=1234"
    }
  },
  "safetyChecks": {
    "beforeIndexPublishing": {
      "maxLostRecordsPercentage": 10,
      "maxFailedUrls": 123
    }
  },
  "saveBackup": true,
  "schedule": "every weekday at 12:00 pm",
  "sitemaps": [
    "https://example.com/sitemap.xyz"
  ],
  "startUrls": [
    "https://www.example.com"
  ]
}'
{
  "taskId": "98458796-b7bb-4703-8b1b-785c1080b110"
}

Authorizations

Authorization
string
header
required

Basic authentication header of the form Basic <encoded-value>, where <encoded-value> is the base64-encoded string username:password.
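For example, you can generate the encoded value from a shell (a sketch; replace username and password with your own credentials):

echo -n 'username:password' | base64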

Path Parameters

id
string
required

Crawler ID.

Example:

"e0f6db8a-24f5-4092-83a4-1b2c6cb6d809"

Body

application/json

Crawler configuration to update. You can only update top-level configuration properties. To update a nested property, such as actions.recordExtractor, you must provide the complete top-level object, in this case actions.

Crawler configuration.
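For example, to change an action's recordExtractor, send the complete actions array rather than only the nested property. A minimal sketch of the relevant part of the request body; the index name, path pattern, and extractor body are illustrative:

{
  "actions": [
    {
      "indexName": "algolia_website",
      "pathsToMatch": ["https://www.algolia.com/**"],
      "recordExtractor": {
        "__type": "function",
        "source": "({ url }) => [{ objectID: url.href }]"
      }
    }
  ]
}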

actions
object[]
required

A list of actions.

Required array length: 1 - 30 elements
appId
string
required

Algolia application ID where the crawler creates and updates indices.

rateLimit
integer
required

Determines the number of concurrent tasks per second that can run for this configuration.

A higher rate limit means more crawls per second. Algolia prevents system overload by ensuring the number of URLs added in the last second and the number of URLs being processed is less than the rate limit:

max(new_urls_added, active_urls_processing) <= rateLimit

Start with a low value (for example, 2) and increase it if you need faster crawling. Be aware that a high rateLimit can have a huge impact on bandwidth cost and server resource consumption.

The number of pages processed per second depends on the average time it takes to fetch, process, and upload a URL. For a given rateLimit, if fetching, processing, and uploading a URL takes (on average):

  • Less than a second, your crawler processes up to rateLimit pages per second.
  • Four seconds, your crawler processes up to rateLimit / 4 pages per second.

In the latter case, increasing rateLimit improves performance, up to a point. However, if the processing time remains at four seconds, increasing rateLimit won't increase the number of pages processed per second.
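To illustrate, with rateLimit set to 4:

Average time per URL: 1 second  → up to 4 pages per second
Average time per URL: 2 seconds → up to 2 pages per second (rateLimit / 2)
Average time per URL: 4 seconds → up to 1 page per second (rateLimit / 4)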

Required range: 1 <= x <= 100
Example:

4

apiKey
string

The Algolia API key the crawler uses for indexing records. If you don't provide an API key, one will be generated by the Crawler when you create a configuration.

The API key must have:

  • These rights and restrictions: search, addObject, deleteObject, deleteIndex, settings, editSettings, listIndexes, browse
  • Access to the correct set of indices, based on the crawler's indexPrefix. For example, if the prefix is crawler_, the API key must have access to crawler_*.

Don't use your Admin API key.
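As a sketch, you can create such a key with Algolia's Add API key endpoint. Replace YOUR_APP_ID and YOUR_ADMIN_API_KEY with your own values; the description is illustrative:

curl --request POST \
  --url https://YOUR_APP_ID.algolia.net/1/keys \
  --header 'X-Algolia-Application-Id: YOUR_APP_ID' \
  --header 'X-Algolia-API-Key: YOUR_ADMIN_API_KEY' \
  --header 'Content-Type: application/json' \
  --data '{
  "acl": ["search", "addObject", "deleteObject", "deleteIndex", "settings", "editSettings", "listIndexes", "browse"],
  "indexes": ["crawler_*"],
  "description": "Algolia Crawler key"
}'

The key returned by this request, not your Admin API key, is the value to use for apiKey.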

exclusionPatterns
string[]

URLs to exclude from crawling.

Maximum length: 100

Use micromatch for negation, wildcards, and more.

Example:
[
"https://www.example.com/excluded",
"!https://www.example.com/this-one-url",
"https://www.example.com/exclude/**"
]
externalData
string[]

References to external data sources for enriching the extracted records.

Maximum length: 10
extraUrls
string[]

The Crawler treats extraUrls the same as startUrls. Specify extraUrls if you want to differentiate URLs you added manually to fix site crawling from those you initially specified in startUrls.

Maximum length: 9999
ignoreCanonicalTo

Determines if the crawler should extract records from a page with a canonical URL.

If ignoreCanonicalTo is set to:

  • true, all canonical URLs are ignored.
  • One or more URL patterns, the crawler ignores the canonical URL if it matches one of the patterns.
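For example, both of these forms are valid in the configuration (the pattern is illustrative):

"ignoreCanonicalTo": true

"ignoreCanonicalTo": ["https://www.example.com/blog/**"]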
ignoreNoFollowTo
boolean

Determines if the crawler should follow links with a nofollow directive. If true, the crawler will ignore the nofollow directive and crawl links on the page.

The crawler always ignores links that don't match your configuration settings. ignoreNoFollowTo applies to:

  • Links that are ignored because the robots meta tag contains nofollow or none.
  • Links with a rel attribute containing the nofollow directive.
ignoreNoIndex
boolean

Whether to ignore the noindex robots meta tag. If true, pages with this meta tag will be crawled.

ignorePaginationAttributes
boolean
default:true

Whether the crawler should ignore rel="prev" and rel="next" pagination links in the <head> section of an HTML page.

  • If true, the crawler ignores the pagination links.
  • If false, the crawler follows the pagination links.
ignoreQueryParams
string[]

Query parameters to ignore while crawling.

All URLs with the matching query parameters are treated as identical. This prevents indexing URLs that just differ by their query parameters.

Maximum length: 9999

Use wildcards to match multiple query parameters.

Example:
["ref", "utm_*"]
ignoreRobotsTxtRules
boolean

Whether to ignore rules defined in your robots.txt file.

indexPrefix
string

A prefix for all indices created by this crawler. It's combined with the indexName for each action to form the complete index name.

Maximum length: 64
Example:

"crawler_"

initialIndexSettings
object

Crawler index settings.

These index settings are only applied during the first crawl of an index.

Any subsequent changes won't be applied to the index. Instead, make changes to your index settings in the Algolia dashboard.
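A minimal sketch, assuming initialIndexSettings maps each index name to the Algolia index settings applied on its first crawl; the index name and settings are illustrative:

{
  "initialIndexSettings": {
    "crawler_algolia_website": {
      "searchableAttributes": ["title", "description", "content"],
      "customRanking": ["desc(popularity)"]
    }
  }
}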

linkExtractor
object

Function for extracting URLs from links on crawled pages.

For more information, see the linkExtractor documentation.

login
object

Authorization method and credentials for crawling protected content.

The Crawler supports these authentication methods:

  • Basic authentication. The Crawler obtains a session cookie from the login page.
  • OAuth 2.0 authentication (oauthRequest). The Crawler uses OAuth 2.0 client credentials to obtain an access token for authentication.

Basic authentication

The Crawler extracts the Set-Cookie response header from the login page, stores that cookie, and sends it in the Cookie header when crawling all pages defined in the configuration.

This cookie is retrieved only at the start of each full crawl. If it expires, it isn't automatically renewed.

The Crawler can obtain the session cookie in one of two ways:

  • HTTP request authentication (fetchRequest). The Crawler sends a direct request with your credentials to the login endpoint, similar to a curl command.
  • Browser-based authentication (browserRequest). The Crawler emulates a web browser by loading the login page, entering the credentials, and submitting the login form as a real user would.

OAuth 2.0

The crawler supports the OAuth 2.0 client credentials grant flow:

  1. It requests an access token with the provided credentials.
  2. It stores the fetched token in an Authorization header.
  3. It sends this header when crawling the site's pages.

This token is only fetched at the beginning of each complete crawl. If it expires, it isn't automatically renewed.

Client authentication passes the credentials (client_id and client_secret) in the request body. The Azure AD v1.0 provider is supported.

Information for making an HTTP request for authorization.

Example:
{
"url": "https://example.com/secure/login-with-post",
"requestOptions": {
"method": "POST",
"headers": {
"Content-Type": "application/x-www-form-urlencoded"
},
"body": "id=my-id&password=my-password",
"timeout": 5000
}
}
maxDepth
integer

Determines the maximum path depth of crawled URLs.

Path depth is calculated based on the number of slash characters (/) after the domain (starting at 1). For example:

  • 1 http://example.com
  • 1 http://example.com/
  • 1 http://example.com/foo
  • 2 http://example.com/foo/
  • 2 http://example.com/foo/bar
  • 3 http://example.com/foo/bar/

URLs added with startUrls and sitemaps aren't checked against maxDepth.

Required range: 1 <= x <= 100
Example:

5

maxUrls
integer

Limits the number of URLs your crawler processes.

Set it to a low value, such as 100, for quick crawling tests. Set it to a higher explicit value for full crawls to prevent the crawler from getting "lost" in complex site structures. Because the Crawler works on many pages simultaneously, maxUrls doesn't guarantee finding the same pages each time it runs.

Required range: 1 <= x <= 15000000
Example:

250

renderJavaScript

If true, the crawler uses a headless Chrome browser to render and crawl all pages.

Because crawling JavaScript-based web pages is slower than crawling regular HTML pages, you can instead apply this setting to a specific list of pages. Use micromatch to define URL patterns, including negations and wildcards.
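For example, both of these forms are accepted (the pattern is illustrative):

"renderJavaScript": true

"renderJavaScript": ["https://www.example.com/dynamic/**"]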

requestOptions
object

Lets you add options to HTTP requests made by the crawler.

safetyChecks
object

Checks to ensure the crawl was successful.

For more information, see the Safety checks documentation.

saveBackup
boolean

Whether to back up your index before the crawler overwrites it with new records.

schedule
string

Schedule for running the crawl.

Instead of manually starting a crawl each time, you can set up a schedule for automatic crawls. Use the visual UI or add the schedule parameter to your configuration.

schedule uses Later.js syntax to specify when to crawl your site. Here are some key things to keep in mind when using Later.js syntax with the Crawler:

  • The interval between two scheduled crawls must be at least 24 hours.
  • To crawl daily, use "every 1 day" instead of "everyday" or "every day".
  • If you don't specify a time, the crawl can happen any time during the scheduled day.
  • Specify times in the UTC (GMT+0) timezone.
  • Include minutes when specifying a time. For example, "at 3:00 pm" instead of "at 3pm".
  • Use "at 12:00 am" to specify midnight, not "at 00:00 am".
Example:

"every weekday at 12:00 pm"

sitemaps
string[]

Sitemaps with URLs from which to start crawling.

Maximum length: 9999
startUrls
string[]

URLs from which to start crawling.

Maximum length: 9999

Response

OK

taskId
string
required

Universally unique identifier (UUID) of the task.

Example:

"98458796-b7bb-4703-8b1b-785c1080b110"
