- Type: Action[]
- Required

Each action defines:

- The URLs to crawl
- The extraction process for those websites
- The indices to which the extracted records are added
Examples
JavaScript
{
actions: [
{
indexName: 'dev_blog_algolia',
pathsToMatch: ['https://blog.algolia.com/**'],
fileTypesToMatch: ['pdf'],
autoGenerateObjectIDs: false,
schedule: 'every 1 day',
recordExtractor: ({ url, $, contentLength, fileType, dataSources }) => {
...
}
},
],
}
Parameters
Action
indexName
Index name targeted by this action.
This value is appended to the indexPrefix, if specified.
pathsToMatch
URL patterns for web pages to which this action should apply.
The patterns are evaluated using the micromatch library.
You can use wildcard characters, negation, and more.
recordExtractor
A JavaScript function to extract content from a web page and turn it into Algolia records.
The function should return a JSON array, which may be empty.
An empty array means the page is skipped.
Example:
JavaScript
recordExtractor: ({ url, $, contentLength, fileType }) => {
return [
{
url: url.href,
text: $("p").html(),
// ... anything you want
},
];
// return []; skips the page
};
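Since pathsToMatch patterns follow micromatch syntax, wildcards and negation can be combined. A sketch of what such patterns can look like; the URLs below are placeholders for illustration, not part of the original configuration:

```javascript
{
  // Illustrative micromatch-style patterns only;
  // the URLs are hypothetical examples.
  pathsToMatch: [
    "https://example.com/docs/**",           // every page under /docs/
    "https://example.com/blog/*.html",       // direct children of /blog/
    "!https://example.com/docs/internal/**", // negation: skip this subtree
  ],
}
```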
$
A Cheerio instance with the HTML of the crawled page.
contentLength
Number of bytes of the crawled page.
dataSources
External data sources for the crawled page.
Each key corresponds to an externalData object.
Example:
JavaScript
{
dataSources: {
dataSourceId1: { data1: "val1", data2: "val2" },
dataSourceId2: { data1: "val1", data2: "val2" },
},
}
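A sketch of how a recordExtractor might merge external data into its records. The data source ID and its fields are hypothetical names for illustration only:

```javascript
// Sketch: merging external data into an extracted record.
// `dataSourceId1` is a hypothetical data source ID.
recordExtractor: ({ url, $, dataSources }) => {
  return [
    {
      url: url.href,
      title: $("head title").text(),
      // Spread the fields of the external data source into the record
      ...dataSources.dataSourceId1,
    },
  ];
},
```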
fileType
File type of the crawled page or document.
helpers
Helper functions for extracting content and turning it into Algolia records.
A function that extracts content from pages identified as articles.
Articles are pages with the og:type meta tag: <meta property="og:type" content="article"/>
or with one of these JSON-LD schema types: Article, NewsArticle, Report, or BlogPosting.
Example:
JavaScript
recordExtractor: ({ url, $, helpers }) => {
return helpers.article({ url, $ });
};
The article helper returns an object with the following properties:
TypeScript
{
/**
* The object's unique identifier,
* in this case, the article's URL
*/
objectID: string,
/**
* The article's URL (without parameters or hashes)
*/
url: string,
/**
* The language the page content is written in
* - html[attr=lang]
*/
lang?: string,
/**
* The article's headline (selected from one of the following in order of preference):
* - meta[property="og:title"]
* - meta[name="twitter:title"]
* - head > title
* - First <h1>
*/
headline: string,
/**
* The article's description (selected from one of the following in order of preference):
* - meta[name="description"]
* - meta[property="og:description"]
* - meta[name="twitter:description"]
*/
description?: string,
/**
* Article keywords
* - meta[name="keywords"]
* - The `keywords` field of the JSON-LD Article object: https://schema.org/Article
*/
keywords: string[],
/**
* Article tags
* - meta[property="article:tag"]
*/
tags: string[],
/**
* The image associated with the article (selected from one of the following in order of preference):
* - meta[property="og:image"]
* - meta[name="twitter:image"]
*/
image?: string,
/**
* The article's author
* - meta[property="article:author"]
* - The `author` field of the JSON-LD Article object: https://schema.org/Article
*/
authors?: string[],
/**
* Article publication date
* - meta[property="article:published_time"]
* - The `datePublished` field of the JSON-LD Article object: https://schema.org/Article
*/
datePublished?: string,
/**
* The date when the article was last modified
* - meta[property="article:modified_time"]
* - The `dateModified` field of the JSON-LD Article object: https://schema.org/Article
*/
dateModified?: string,
/**
* The article category
* - meta[property="article:section"]
* - The `category` field of the JSON-LD Article object: https://schema.org/Article
*/
category?: string,
/**
* The article's content (body copy)
*/
content?: string,
}
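The objectID and url fields above use the article's URL without parameters or hashes. A plain JavaScript illustration of that normalization with the standard URL API (a sketch, not the crawler's actual code):

```javascript
// Sketch: normalizing a URL by removing query parameters and hashes,
// mirroring the "without parameters or hashes" behavior described above.
function canonicalUrl(href) {
  const u = new URL(href);
  u.search = ""; // drop ?query=params
  u.hash = "";   // drop #fragments
  return u.toString();
}
```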
A function that extracts code snippets from a page.
It searches for HTML pre elements on the page and also extracts the programming language in which the code snippet is written.
If the crawler finds several code snippets on a page, this helper returns a list of those snippets.
Example:
JavaScript
recordExtractor: ({ url, $, helpers }) => {
  // Options can override the defaults shown below,
  // for example: helpers.codeSnippets({ tag: "pre" })
  const code = helpers.codeSnippets();
  return { code };
};
The codeSnippets helper returns an array of code objects with the following properties:
TypeScript
{
/**
* The content of the code snippet
* - pre, code, etc. default: pre
*/
content: string;
/**
* The code snippet's language (if found)
* - pre[class^=language-] default: language-
*/
languageClassPrefix?: string;
/**
* The URL to the nearest sibling <a> tag
* - pre + a
*/
codeUrl?: string;
}[]
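The languageClassPrefix convention above maps a class such as language-js to the language js. A minimal sketch of that lookup in plain JavaScript (an illustration, not the helper's actual implementation):

```javascript
// Sketch: deriving a snippet's language from an element's class attribute,
// assuming classes follow the default `language-` prefix convention.
function snippetLanguage(className, prefix = "language-") {
  const match = className
    .split(/\s+/)
    .find((cls) => cls.startsWith(prefix));
  return match ? match.slice(prefix.length) : undefined;
}
```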
A function that extracts content and formats it to be compatible with DocSearch.
It creates an optimal number of records for relevancy and hierarchy.
You can also use it without DocSearch or to index non-documentation content.
For more examples, see Record extractor in the DocSearch documentation.
Example:
JavaScript
recordExtractor: ({ url, $, helpers }) => {
return helpers.docsearch({
aggregateContent: true,
indexHeadings: true,
recordVersion: "v3",
recordProps: {
lvl0: {
selectors: "header h1",
},
lvl1: "article h2",
lvl2: "article h3",
lvl3: "article h4",
lvl4: "article h5",
lvl5: "article h6",
content: "main p, main li",
},
});
};
Main DocSearch record properties.
Selector for the main category or section of the crawled page.
Example:
JavaScript
{
lvl0: {
selectors: ".page-category",
defaultValue: "documentation"
},
}
Selector for the main title of the crawled page.
Example:
JavaScript
{
// ...
lvl1: "head > title",
}
Selectors for the main content elements of the crawled page.
JavaScript
{
// ...
content: "body > p, main li",
}
Attribute for increasing or decreasing the relevance of this record.
You can pass any number as a string, including negative numbers.
JavaScript
{
// ...
pageRank: "30",
}
Selector for level 2 headings (h2) on the page.
Selector for level 3 headings (h3) on the page.
Selector for level 4 headings (h4) on the page.
Selector for level 5 headings (h5) on the page.
Selector for level 6 headings (h6) on the page.
Extra attributes to add to the extracted records.
Example:
JavaScript
{
myCustomAttribute: ".myCustomClass",
ogDesc: {
selectors: "head meta[name='og:desc']",
defaultValue: "Default description",
},
}
Whether to automatically merge sibling elements separated by a line break.
For example, with aggregateContent set to true, <p>Foo</p><p>Bar</p> creates one record.
With aggregateContent set to false, two records are created.
Whether to create records for headings.
If false, only records for the content level are created.
You can provide an object with from and to properties to limit the range of extracted heading levels.
Example:
JavaScript
{
  indexHeadings: false,
  // or, to limit the extracted heading levels:
  // indexHeadings: { from: 4, to: 6 },
}
Version of the extracted records.
Not correlated with the DocSearch version and may increase independently.
- v2: compatible with DocSearch version 2.
- v3: compatible with DocSearch version 3.
A function that extracts text content from pages, regardless of their type or category.
Example:
JavaScript
recordExtractor: ({ url, $, helpers }) => {
return helpers.page({
url,
$,
recordProps: {
title: "head title",
content: "body",
},
});
};
The page helper returns an object with the following properties:
TypeScript
{
/**
* The object's unique identifier
*/
objectID: string;
/**
* The page's URL
*/
url: string;
/**
* The URL's hostname
* - http://example.com/ = example.com
*/
hostname: string;
/**
* The URL's path, everything after the hostname
*/
path: string;
/**
* The URL depth, based on the number of slashes after the domain
* - http://example.com/ = 1
* - http://example.com/about = 1
* - http://example.com/about/ = 2
* - etc.
*/
depth: number;
/**
* The page's file type.
* One of: html, xml, json, pdf, doc, xls, ppt, odt, ods, odp, email
*/
fileType: FileType;
/**
* The page length in bytes
*/
contentLength: number;
/**
* The page title
* - head > title
*/
title?: string;
/**
* The page's description
* - meta[name=description]
*/
description?: string;
/**
* The page's keywords
* - meta[name="keywords"]
*/
keywords?: string[];
/**
* The image associated with the page
* - meta[property="og:image"]
*/
image?: string;
/**
* The page headers
* - h1 and h2 tags content
*/
headers?: string[];
/**
* The page's content (body copy)
*/
content: string;
}
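The depth rule above counts slashes after the domain. A minimal sketch of that calculation in plain JavaScript (an illustration of the documented rule, not the crawler's implementation):

```javascript
// Sketch: URL depth as the number of slashes after the domain,
// per the rule documented above.
function urlDepth(href) {
  const { pathname } = new URL(href); // e.g. "/", "/about", "/about/"
  return (pathname.match(/\//g) || []).length;
}
```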
A function that extracts content from pages with one of the following JSON-LD schema types:
Product, DietarySupplement, Drug, IndividualProduct, ProductCollection, ProductGroup, ProductModel, SomeProducts, or Vehicle.
Example:
JavaScript
recordExtractor: ({ url, $, helpers }) => {
return helpers.product({ url, $ });
};
The product helper returns an object with the following properties:
TypeScript
{
/**
* The object's unique identifier,
* in this case, the product page's URL
*/
objectID: string,
/**
* The product page's URL (without parameters or hashes)
*/
url: string,
/**
* The language the page content is written in
* - html[attr=lang]
*/
lang?: string,
/**
* The product name
* - The `name` field of the JSON-LD Product object: https://schema.org/Product
*/
name?: string,
/**
* The product SKU
* - The `sku` field of the JSON-LD Product object: https://schema.org/Product
*/
sku: string,
/**
* The product's description
* - The `description` field of the JSON-LD Product object: https://schema.org/Product
*/
description?: string,
/**
* The image associated with the product
* - The `image` field of the JSON-LD Product object: https://schema.org/Product
*/
image?: string,
/**
* The product's price (selected from one of the following in order of preference):
* - The `offers.price` field of the JSON-LD Product object: https://schema.org/Product
* - The `offers.highPrice` field of the JSON-LD Product object: https://schema.org/Product
* - The `offers.lowPrice` field of the JSON-LD Product object: https://schema.org/Product
*/
price?: string,
/**
* The product's currency
* - The `offers.priceCurrency` field of the JSON-LD Product object: https://schema.org/Product
*/
currency?: string,
/**
* The product category
* - The `category` field of the JSON-LD Product object: https://schema.org/Product
*/
category?: string,
}
A function that extracts text from an HTML page and splits it into one or more records.
This reduces the size of individual records.
In the following example, crawling a long HTML page returns an array of records that don't exceed the limit of 1,000 bytes per record.
Example:
JavaScript
recordExtractor: ({ url, $, helpers }) => {
const baseRecord = {
url,
title: $("head title").text().trim(),
};
const records = helpers.splitContentIntoRecords({
baseRecord,
$elements: $("body"),
maxRecordBytes: 1000,
textAttributeName: "text",
orderingAttributeName: "part",
});
// You can still alter produced records
// afterwards, if needed.
return records;
};
Each extracted record looks similar to this:
JavaScript
[
{
url: "http://test.com/index.html",
title: "Home - Test.com",
part: 0,
text: "Welcome on test.com, the best resource to",
},
{
url: "http://test.com/index.html",
title: "Home - Test.com",
part: 1,
text: "find interesting content online.",
},
];
To prevent duplicate results when searching for a word that appears in multiple records from the same URL, set distinct to true in your index settings, set url as attributeForDistinct, and add a custom ranking from the first record on the page to the last:
JavaScript
initialIndexSettings: {
"my-index": {
distinct: true,
attributeForDistinct: "url",
searchableAttributes: ["title", "text"],
customRanking: ["asc(part)"],
},
}
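The core idea behind splitContentIntoRecords can be sketched in plain JavaScript: accumulate text into chunks whose UTF-8 size stays under a byte budget. This is an illustration of the technique only, not the crawler's actual algorithm:

```javascript
// Sketch: split text into chunks of at most `maxBytes` UTF-8 bytes,
// breaking on word boundaries. Illustration only.
function splitByBytes(text, maxBytes) {
  const byteLength = (s) => new TextEncoder().encode(s).length;
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  let current = "";
  for (const word of words) {
    const candidate = current ? current + " " + word : word;
    if (byteLength(candidate) > maxBytes && current) {
      chunks.push(current); // current chunk is full; start a new one
      current = word;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```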
$elements
A Cheerio selector for the HTML elements from which to extract the content.
baseRecord
Attributes (and their values) to add to all extracted records.
maxRecordBytes
Maximum number of bytes per record.
To avoid errors, see Record size limits.
textAttributeName
Attribute name for storing the text of each extracted record.
orderingAttributeName
Attribute name for storing the sequential number of each extracted record.
name
Unique identifier of this action.
Required if schedule is set.
selectorsToMatch
CSS selectors to identify content to index.
The crawler includes only pages that contain at least one element matching these selectors.
You can use wildcards and negation.
For example, !.main ignores pages with an element with the main class.
fileTypesToMatch
Types of files to index.
For a list of supported file types, see Non-HTML documents.
Non-HTML file types are first converted to HTML with Apache Tika, then processed as HTML pages.
autoGenerateObjectIDs
Whether to generate object IDs for records that don't already have an objectID field.
If false, extracted records without object IDs throw an error.