Authorizations
Basic authentication header of the form Basic <encoded-value>, where <encoded-value> is the base64-encoded string username:password.
Path Parameters
Crawler ID.
"e0f6db8a-24f5-4092-83a4-1b2c6cb6d809"
Body
Crawler configuration to update.
You can only update top-level configuration properties. To update a nested configuration, such as actions.recordExtractor, you must provide the complete top-level object, such as actions.
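For example, to change only the crawler's actions while leaving the rest of the configuration untouched, send the complete actions array. The sketch below is illustrative: the index name and URL pattern are placeholders, and a real action typically also includes its recordExtractor.
{
  "actions": [
    {
      "indexName": "crawler_pages",
      "pathsToMatch": ["https://www.example.com/**"]
    }
  ]
}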
Crawler configuration.
A list of actions.
1 - 30 elements
Algolia application ID where the crawler creates and updates indices.
Determines the number of concurrent tasks per second that can run for this configuration.
A higher rate limit means more crawls per second. Algolia prevents system overload by ensuring that the number of URLs added in the last second and the number of URLs currently being processed don't exceed the rate limit:
max(new_urls_added, active_urls_processing) <= rateLimit
Start with a low value (for example, 2) and increase it if you need faster crawling. Be aware that a high rateLimit can have a huge impact on bandwidth cost and server resource consumption.
The number of pages processed per second depends on the average time it takes to fetch, process, and upload a URL. For a given rateLimit, if fetching, processing, and uploading URLs takes (on average):
- Less than a second, your crawler processes up to rateLimit pages per second.
- Four seconds, your crawler processes up to rateLimit / 4 pages per second.
In the latter case, increasing rateLimit improves performance, but only up to a point: once the four-second processing time becomes the limiting factor, raising rateLimit further won't increase the number of pages processed per second.
1 <= x <= 100
4
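For example, with the default rateLimit of 4 and an average of four seconds to fetch, process, and upload each URL, throughput tops out around one page per second (4 / 4). A minimal sketch of the setting in a configuration:
{
  "rateLimit": 4
}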
The Algolia API key the crawler uses for indexing records. If you don't provide an API key, one will be generated by the Crawler when you create a configuration.
The API key must have:
- These rights and restrictions: search, addObject, deleteObject, deleteIndex, settings, editSettings, listIndexes, browse.
- Access to the correct set of indices, based on the crawler's indexPrefix. For example, if the prefix is crawler_, the API key must have access to crawler_*.
Don't use your Admin API key.
URLs to exclude from crawling.
100
Use micromatch for negation, wildcards, and more.
[
"https://www.example.com/excluded",
"!https://www.example.com/this-one-url",
"https://www.example.com/exclude/**"
]
References to external data sources for enriching the extracted records.
10
For more information, see Enrich extracted records with external data.
The Crawler treats extraUrls the same as startUrls. Specify extraUrls if you want to distinguish URLs you manually added to fix site crawling from those you initially specified in startUrls.
9999
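A minimal sketch of how the two lists might be used together (the URLs are illustrative):
{
  "startUrls": ["https://www.example.com"],
  "extraUrls": ["https://www.example.com/page-the-crawler-misses"]
}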
Determines if the crawler should extract records from a page with a canonical URL.
If ignoreCanonicalTo is set to:
- true, all canonical URLs are ignored.
- One or more URL patterns, the crawler will ignore the canonical URL if it matches a pattern.
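For example, to ignore canonical URLs only for a subset of pages, you can provide URL patterns (the pattern below is illustrative):
{
  "ignoreCanonicalTo": ["https://www.example.com/news/**"]
}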
Determines if the crawler should follow links with a nofollow directive.
If true, the crawler will ignore the nofollow directive and crawl links on the page. The crawler always ignores links that don't match your configuration settings.
ignoreNoFollowTo applies to:
- Links that are ignored because the robots meta tag contains nofollow or none.
- Links with a rel attribute containing the nofollow directive.
Whether to ignore the noindex robots meta tag. If true, pages with this meta tag will be crawled.
Whether the crawler should follow rel="prev" and rel="next" pagination links in the <head> section of an HTML page.
- If true, the crawler ignores the pagination links.
- If false, the crawler follows the pagination links.
Query parameters to ignore while crawling.
All URLs with the matching query parameters are treated as identical. This prevents indexing URLs that just differ by their query parameters.
9999
Use wildcards to match multiple query parameters.
["ref", "utm_*"]
Whether to ignore rules defined in your robots.txt file.
A prefix for all indices created by this crawler. It's combined with the indexName for each action to form the complete index name.
64
"crawler_"
Crawler index settings.
These index settings are only applied during the first crawl of an index.
Any subsequent changes won't be applied to the index. Instead, make changes to your index settings in the Algolia dashboard.
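As a sketch, assuming an index named crawler_pages, the settings are keyed by index name; the specific settings shown here are illustrative:
{
  "initialIndexSettings": {
    "crawler_pages": {
      "searchableAttributes": ["title", "description", "content"],
      "customRanking": ["desc(popularity)"]
    }
  }
}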
Function for extracting URLs from links on crawled pages.
For more information, see the linkExtractor documentation.
Authorization method and credentials for crawling protected content.
The Crawler supports these authentication methods:
- Basic authentication. The Crawler obtains a session cookie from the login page.
- OAuth 2.0 authentication (oauthRequest). The Crawler uses OAuth 2.0 client credentials to obtain an access token for authentication.
Basic authentication
The Crawler extracts the Set-Cookie response header from the login page, stores that cookie, and sends it in the Cookie header when crawling all pages defined in the configuration.
This cookie is retrieved only at the start of each full crawl. If it expires, it isn't automatically renewed.
The Crawler can obtain the session cookie in one of two ways:
- HTTP request authentication (fetchRequest). The Crawler sends a direct request with your credentials to the login endpoint, similar to a curl command.
- Browser-based authentication (browserRequest). The Crawler emulates a web browser by loading the login page, entering the credentials, and submitting the login form as a real user would.
OAuth 2.0
The Crawler supports the OAuth 2.0 client credentials grant flow:
- It performs an access token request with the provided credentials.
- It stores the fetched token in an Authorization header.
- It sends the token when crawling site pages.
This token is only fetched at the beginning of each complete crawl. If it expires, it isn't automatically renewed. Client authentication passes the credentials (client_id and client_secret) in the request body.
The Azure AD v1.0 provider is supported.
Information for making an HTTP request for authorization.
The following example shows the HTTP request (fetchRequest) method:
{
"url": "https://example.com/secure/login-with-post",
"requestOptions": {
"method": "POST",
"headers": {
"Content-Type": "application/x-www-form-urlencoded"
},
"body": "id=my-id&password=my-password",
"timeout": 5000
}
}
Determines the maximum path depth of crawled URLs.
Path depth is calculated based on the number of slash characters (/) after the domain (starting at 1).
For example:
- http://example.com: depth 1
- http://example.com/: depth 1
- http://example.com/foo: depth 1
- http://example.com/foo/: depth 2
- http://example.com/foo/bar: depth 2
- http://example.com/foo/bar/: depth 3
URLs added with startUrls and sitemaps aren't checked for maxDepth.
1 <= x <= 100
5
Limits the number of URLs your crawler processes.
Change it to a low value, such as 100, for quick crawling tests. Set it to a higher explicit value for full crawls to prevent the crawler from getting "lost" in complex site structures.
Because the Crawler works on many pages simultaneously, maxUrls doesn't guarantee finding the same pages each time it runs.
1 <= x <= 15000000
250
If true, use a Chrome headless browser to crawl pages.
Because crawling JavaScript-based web pages is slower than crawling regular HTML pages, you can apply this setting to a specific list of pages. Use micromatch to define URL patterns, including negations and wildcards.
Whether to render all pages.
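For example, to render only a JavaScript-heavy section of a site, you might restrict rendering to matching URL patterns (the pattern below is illustrative):
{
  "renderJavaScript": ["https://www.example.com/app/**"]
}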
Lets you add options to HTTP requests made by the crawler.
Checks to ensure the crawl was successful.
For more information, see the Safety checks documentation.
Whether to back up your index before the crawler overwrites it with new records.
Schedule for running the crawl.
Instead of manually starting a crawl each time, you can set up a schedule for automatic crawls.
Use the visual UI or add the schedule parameter to your configuration. schedule uses Later.js syntax to specify when to crawl your site.
Here are some key things to keep in mind when using Later.js syntax with the Crawler:
- The interval between two scheduled crawls must be at least 24 hours.
- To crawl daily, use "every 1 day" instead of "everyday" or "every day".
- If you don't specify a time, the crawl can happen any time during the scheduled day.
- Specify times in the UTC (GMT+0) timezone.
- Include minutes when specifying a time. For example, "at 3:00 pm" instead of "at 3pm".
- Use "at 12:00 am" to specify midnight, not "at 00:00 am".
"every weekday at 12:00 pm"
Sitemaps with URLs from where to start crawling.
9999
URLs from where to start crawling.
9999
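A minimal sketch of starting points for a crawl (the URLs are illustrative):
{
  "startUrls": ["https://www.example.com"],
  "sitemaps": ["https://www.example.com/sitemap.xml"]
}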
Response
OK
Universally unique identifier (UUID) of the task.
"98458796-b7bb-4703-8b1b-785c1080b110"