  • Type: Action []
  • Required
A single action defines:
  • The URLs to crawl
  • The extraction process for those websites
  • The indices to which the extracted records are added
A single web page can match multiple actions. In this case, your crawler creates a record for each matched action.
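As a rough sketch of the multiple-match case, the configuration below defines two actions whose pathsToMatch patterns cover the same URLs, so a matching page would produce a record for each action. The index names and record fields are illustrative assumptions, and `$` is assumed to behave like a jQuery-style selector function.

```javascript
// Hypothetical configuration: both actions match the same pages,
// so one crawled page yields a record for each action.
const config = {
  actions: [
    {
      indexName: 'blog_posts', // assumed index name
      pathsToMatch: ['https://blog.algolia.com/**'],
      recordExtractor: ({ url, $ }) => [
        { objectID: url.href, title: $('title').text() },
      ],
    },
    {
      indexName: 'blog_headings', // assumed index name
      pathsToMatch: ['https://blog.algolia.com/**'],
      recordExtractor: ({ url, $ }) => [
        { objectID: url.href, heading: $('h1').first().text() },
      ],
    },
  ],
};
```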

Examples

JavaScript
{
  actions: [
    {
      indexName: 'dev_blog_algolia',
      name: 'dev_blog', // required because schedule is set
      pathsToMatch: ['https://blog.algolia.com/**'],
      fileTypesToMatch: ['pdf'],
      autoGenerateObjectIDs: false,
      schedule: 'every 1 day',
      recordExtractor: ({ url, $, contentLength, fileType, dataSources }) => {
        // ...
      }
    },
  ],
}

Parameters

Action

indexName
string
required
Index name targeted by this action. This value is appended to the indexPrefix if specified.
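To illustrate how the prefix combines with the action's index name, here is a minimal sketch. It assumes indexPrefix is set at the top level of the configuration next to actions, and the prefix value is hypothetical.

```javascript
// Assumption: indexPrefix sits at the top level of the configuration.
const config = {
  indexPrefix: 'crawler_', // hypothetical prefix
  actions: [
    {
      indexName: 'dev_blog_algolia',
      pathsToMatch: ['https://blog.algolia.com/**'],
      recordExtractor: () => [],
    },
  ],
};

// The action's indexName follows the prefix in the final index name.
const targetIndex = (config.indexPrefix || '') + config.actions[0].indexName;
console.log(targetIndex); // crawler_dev_blog_algolia
```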
pathsToMatch
string[]
required
URL patterns for web pages to which this action should apply. The patterns are evaluated using the micromatch library. You can use wildcard characters, negation, and more.
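A short sketch of wildcard patterns follows. The second pattern and its URL are hypothetical, and the comments describe usual glob behavior rather than a verified result from the crawler.

```javascript
const action = {
  indexName: 'dev_blog_algolia',
  pathsToMatch: [
    // `**` is intended to match any number of path segments under the blog root.
    'https://blog.algolia.com/**',
    // Hypothetical pattern: in glob syntax, `*` typically stays within one path segment.
    'https://blog.algolia.com/authors/*',
  ],
  recordExtractor: ({ url }) => [{ objectID: url.href }],
};
```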
recordExtractor
function
required
A JavaScript function that extracts content from a web page and turns it into Algolia records. The function should return a JSON array, which may be empty. An empty array means the page is skipped. Example:
JavaScript
recordExtractor: ({ url, $, contentLength, fileType }) => {
  return [
    {
      url: url.href,
      text: $("p").html(),
      // ... anything you want
    },
  ];
  // return []; skips the page
},
name
string
Unique identifier of this action. Required if schedule is set.
schedule
string
How often to perform a crawl. For more information, see schedule.
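A minimal sketch of a scheduled action: because schedule is set, name is included as well. The name value and the trivial extractor are placeholders, not recommended settings.

```javascript
const scheduledAction = {
  name: 'blog-daily', // hypothetical identifier; required because schedule is set
  indexName: 'dev_blog_algolia',
  pathsToMatch: ['https://blog.algolia.com/**'],
  schedule: 'every 1 day',
  recordExtractor: ({ url }) => [{ objectID: url.href }],
};
```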
selectorsToMatch
string[]
CSS selectors to identify content to index. The crawler includes only pages that contain at least one element matching these selectors. You can use wildcards and negation. For example, !.main ignores pages that contain an element with the main class.
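The following sketch combines a positive and a negated selector. The selector names are assumptions, and how the crawler combines several selectors in one array is not spelled out here, so treat the comments as the intended reading only.

```javascript
const action = {
  indexName: 'dev_blog_algolia',
  pathsToMatch: ['https://blog.algolia.com/**'],
  selectorsToMatch: [
    'article', // intended: keep pages that contain an <article> element
    '!.main',  // intended: ignore pages that contain an element with the main class
  ],
  recordExtractor: ({ url, $ }) => [
    { objectID: url.href, text: $('article').text() },
  ],
};
```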
fileTypesToMatch
string[]
default:["html"]
Type of files to index. For a list of supported file types, see Non-HTML documents. Non-HTML file types are first converted to HTML with Apache Tika, then processed as an HTML page.
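Since non-HTML files are converted to HTML before extraction, the same selector-based extractor can in principle serve both file types. The sketch below records the fileType alongside the text. The index name and the choice of the body selector are assumptions.

```javascript
const action = {
  indexName: 'docs_and_pdfs', // assumed index name
  pathsToMatch: ['https://blog.algolia.com/**'],
  fileTypesToMatch: ['html', 'pdf'],
  recordExtractor: ({ url, $, fileType }) => [
    {
      objectID: url.href,
      fileType, // lets you tell HTML pages from converted PDFs in the index
      text: $('body').text().trim(),
    },
  ],
};
```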
autoGenerateObjectIDs
bool
default:true
Whether to generate object IDs for records that don’t already have an objectID field. If false, extracted records without object IDs throw an error.
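With autoGenerateObjectIDs set to false, each record needs its own objectID. One possible approach, sketched below, uses the page URL as the ID. This assumes url is a URL object (as in url.href above) and that one record per page is enough.

```javascript
const action = {
  indexName: 'dev_blog_algolia',
  pathsToMatch: ['https://blog.algolia.com/**'],
  autoGenerateObjectIDs: false,
  recordExtractor: ({ url, $ }) => [
    {
      objectID: url.href, // explicit ID, since none is generated automatically
      title: $('title').text(),
    },
  ],
};
```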