When you create a new crawler, a configuration file is automatically generated. This file helps the crawler understand what information to collect from different parts of your site, like product pages or blog articles. Following the first crawl, you can refine this configuration file to collect more comprehensive or granular details. To target specific content for extraction, you can use meta tags, CSS selectors, or helpers.
Be aware that any change to the site’s structure and attributes might affect crawling. Use attributes that are less likely to change, even if other aspects of the site do. If site attributes do change, you must update the crawler’s configuration to keep data extraction working.

Helpers

The Crawler can use helpers to extract supported JSON-LD attributes from the Article and Product schemas. To identify JSON-LD attributes on your site, use an online schema markup validator, such as validator.schema.org. For example, to analyze the blog post jsonld.com/jsonld-webpage-vs-website, copy that URL into the validator. You’ll find the author information in the WebPage schema.

Author information in JSON-LD attributes of a blog post
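In a crawler configuration, a helper is called inside an action’s recordExtractor. The following is a minimal sketch, assuming a helpers.article() helper is available in your crawler version and that the index name and path pattern match your own setup; the exact call signature may differ:

```javascript
// Sketch: a crawler action that delegates extraction of supported
// JSON-LD Article attributes to a built-in helper.
// indexName, pathsToMatch, and the helpers.article() signature
// are assumptions for illustration.
new Crawler({
  // ...other crawler settings elided...
  actions: [
    {
      indexName: "blog",
      pathsToMatch: ["https://example.com/blog/**"],
      recordExtractor: ({ url, $, helpers }) => {
        // Returns records built from the page's Article schema markup
        return helpers.article();
      },
    },
  ],
});
```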

Unsupported JSON-LD attributes

To extract unsupported JSON-LD attributes and schemas in your pages, define a custom helper by adding the following instructions to your crawler’s configuration:
JavaScript
let jsonld;
const node = $('script[type="application/ld+json"]').get(0);
try {
  jsonld = JSON.parse(node.firstChild.data);
} catch (err) {
  // Log the offending script node in the console for debugging
  console.log(node);
}
For more information, see Troubleshooting data extraction issues.
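Once parsed, attributes from the JSON-LD object can be copied into the record your extractor returns. The following standalone sketch shows the same parse-with-fallback pattern outside the crawler context, using a hypothetical Article payload with a dateModified attribute:

```javascript
// Minimal sketch: parse a JSON-LD payload defensively and read
// attributes from it. The payload below is a hypothetical example.
const raw = `{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Example post",
  "dateModified": "2024-01-15"
}`;

let jsonld = {};
try {
  jsonld = JSON.parse(raw);
} catch (err) {
  // Fall back to an empty object if the script tag holds invalid JSON
  console.error("Invalid JSON-LD:", err.message);
}

// Copy the attributes you need into the record
const record = {
  headline: jsonld.headline,
  dateModified: jsonld.dateModified,
};
console.log(record);
```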

Meta tags

For specifying the content a crawler should extract, meta tags are often a good start. This is because meta tags are less likely to change during site updates or redesigns, minimizing the need to update the crawler configuration. However, meta tags might not include everything you want. If you’re trying to get information that’s not in a meta tag, like a blog post’s author, you can use CSS selectors or helpers instead.

Extract data with meta tags

You can set up actions in your crawler’s configuration file to look for specific meta tags on your pages. The following example captures a blog post’s description by finding that meta tag in the head section of the page’s HTML and creating an action to tell the crawler how to extract it:

description: $("meta[name=description]").attr("content"),

The description meta tag in the head section of the page

Change what you collect

Add or remove Algolia record attributes by modifying the corresponding instruction in the configuration’s recordExtractor. For example:
  • To stop capturing the description attribute for blog posts in the next crawl, delete the line description: $("meta[name=description]").attr("content"), from the recordExtractor.
  • To add an attribute, add a line with the relevant meta tag selector to the recordExtractor.
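Put together, a recordExtractor that collects several meta tags might look like the following sketch. The keywords attribute is a hypothetical addition alongside the description attribute from the example above:

```javascript
// Sketch: extracting meta tags inside a crawler action's recordExtractor.
// The "keywords" attribute is a hypothetical addition for illustration.
recordExtractor: ({ url, $ }) => {
  return [
    {
      url: url.href,
      title: $("head title").text(),
      description: $("meta[name=description]").attr("content"),
      keywords: $("meta[name=keywords]").attr("content"),
    },
  ];
},
```

To stop capturing an attribute, delete its line; to capture a new one, add a line with the matching selector.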

CSS selectors

You can use CSS selectors to pinpoint the information you want to extract from the body of a page. For example, to extract the author name from an MDN blog post, the appropriate CSS selector is .author.

Author information in the CSS classes of a blog post

To extract this, add the following to your configuration’s recordExtractor:

author: $(".author").text(),

For more information, see Debug CSS selectors.
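A recordExtractor can combine meta tags and CSS selectors in the same record. A sketch, assuming the same description meta tag and .author class as above:

```javascript
// Sketch: mixing a meta tag and a body CSS selector in one record.
recordExtractor: ({ url, $ }) => {
  return [
    {
      url: url.href,
      description: $("meta[name=description]").attr("content"),
      // .text() concatenates the text of every element matching .author
      author: $(".author").text(),
    },
  ];
},
```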