When you create a new crawler, a configuration file is automatically generated. This file helps the crawler understand what information to collect from different parts of your site, like product pages or blog articles. Following the first crawl, you can refine this configuration file to collect more comprehensive or granular details. To target specific content for extraction, you can use meta tags, CSS selectors, or helpers.
Be aware that any change to the site’s structure and attributes might affect crawling. Use attributes that are less likely to change, even if other aspects of the site do. If site attributes do change, you must update the crawler’s configuration to keep data extraction working.

Helpers

The Crawler can use helpers to extract supported JSON-LD attributes from the Article and Product schemas. To identify JSON-LD attributes on your site, use an online schema markup validator, such as validator.schema.org. For example, to analyze the blog post jsonld.com/jsonld-webpage-vs-website, copy that URL into the validator. You’ll find the author information in the WebPage schema.

Author information in JSON-LD attributes of a blog post
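In a crawler configuration, a helper is called inside an action’s recordExtractor. The following is a minimal sketch, assuming a helpers.article() helper is available in your crawler version and that the index name and path pattern match your own setup; the exact call signature may differ:

```javascript
// Sketch: a crawler action that delegates extraction of supported
// JSON-LD Article attributes to a built-in helper.
// indexName, pathsToMatch, and the helpers.article() signature
// are assumptions for illustration.
new Crawler({
  // ...other crawler settings elided...
  actions: [
    {
      indexName: "blog",
      pathsToMatch: ["https://example.com/blog/**"],
      recordExtractor: ({ url, $, helpers }) => {
        // Returns records built from the page's Article schema markup
        return helpers.article();
      },
    },
  ],
});
```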

Unsupported JSON-LD attributes

To extract unsupported JSON-LD attributes and schemas in your pages, define a custom helper by adding the following instructions to your crawler’s configuration:
JavaScript
let jsonld;
const node = $('script[type="application/ld+json"]').get(0);
try {
  jsonld = JSON.parse(node.firstChild.data);
} catch (err) {
  // Log the offending script node in the console for debugging
  console.log(node);
}
For more information, see Troubleshooting data extraction issues.
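Once parsed, attributes from the JSON-LD object can be copied into the record your extractor returns. The following standalone sketch shows the same parse-with-fallback pattern outside the crawler context, using a hypothetical Article payload with a dateModified attribute:

```javascript
// Minimal sketch: parse a JSON-LD payload defensively and read
// attributes from it. The payload below is a hypothetical example.
const raw = `{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Example post",
  "dateModified": "2024-01-15"
}`;

let jsonld = {};
try {
  jsonld = JSON.parse(raw);
} catch (err) {
  // Fall back to an empty object if the script tag holds invalid JSON
  console.error("Invalid JSON-LD:", err.message);
}

// Copy the attributes you need into the record
const record = {
  headline: jsonld.headline,
  dateModified: jsonld.dateModified,
};
console.log(record);
```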

Meta tags

For specifying the content a crawler should extract, meta tags are often a good start. This is because meta tags are less likely to change during site updates or redesigns, minimizing the need to update the crawler configuration. However, meta tags might not include everything you want. If you’re trying to get information that’s not in a meta tag, like a blog post’s author, you can use CSS selectors or helpers instead.

Extract data with meta tags

You can set up actions in your crawler’s configuration file to look for specific meta tags on your pages. The following example captures a blog post’s description by finding that meta tag in the head section of the page’s HTML and creating an action to tell the crawler how to extract it:

description: $("meta[name=description]").attr("content"),

The description meta tag in the head section of the page

Change what you collect

Add or remove Algolia record attributes by modifying the corresponding instruction in the configuration’s recordExtractor. For example:
  • To stop capturing the description attribute for blog posts in the next crawl, delete the line description: $("meta[name=description]").attr("content"), from the recordExtractor.
  • To add an attribute, add a line with the relevant meta tag selector to the recordExtractor.
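Put together, a recordExtractor that collects several meta tags might look like the following sketch. The keywords attribute is a hypothetical addition alongside the description attribute from the example above:

```javascript
// Sketch: extracting meta tags inside a crawler action's recordExtractor.
// The "keywords" attribute is a hypothetical addition for illustration.
recordExtractor: ({ url, $ }) => {
  return [
    {
      url: url.href,
      title: $("head title").text(),
      description: $("meta[name=description]").attr("content"),
      keywords: $("meta[name=keywords]").attr("content"),
    },
  ];
},
```

To stop capturing an attribute, delete its line; to capture a new one, add a line with the matching selector.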

CSS selectors

You can use CSS selectors to pinpoint the information you want to extract from the body of a page. For example, to extract the author name from an MDN blog post, the appropriate CSS selector is .author.

Author information in the CSS classes of a blog post

To extract this, add the following to your configuration’s recordExtractor:

author: $(".author").text(),

For more information, see Debug CSS selectors.
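A recordExtractor can combine meta tags and CSS selectors in the same record. A sketch, assuming the same description meta tag and .author class as above:

```javascript
// Sketch: mixing a meta tag and a body CSS selector in one record.
recordExtractor: ({ url, $ }) => {
  return [
    {
      url: url.href,
      description: $("meta[name=description]").attr("content"),
      // .text() concatenates the text of every element matching .author
      author: $(".author").text(),
    },
  ];
},
```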