In a recordExtractor, the most important parameter is the Cheerio instance ($).
Cheerio is a server-side implementation of jQuery.
The Crawler uses it to expose the page’s DOM so you can extract the content you want using Cheerio’s Selectors API.
While Cheerio provides extensive documentation,
you may need to experiment with its syntax to successfully crawl your pages.
This guide outlines common techniques for building records from your site’s content.
Common extraction techniques
The following [helpers](/doc/uild a UI hierarchy/crawler/getting-started/concepts/#helpers) may be useful for extracting content from your pages.
Extract content from metadata elements
To get content from meta elements, parse their content attribute.
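The following is a minimal sketch of this technique, not the Crawler's reference example. The function name `extractMetaRecord`, the chosen meta names, and the record attributes are illustrative assumptions; `$` and `url` are assumed to be the Cheerio instance and the `URL` object that the Crawler passes to a recordExtractor.

```javascript
// Sketch of a recordExtractor that reads <meta> tags.
// Assumption: the Crawler supplies `$` (Cheerio) and `url` (a URL object).
function extractMetaRecord({ url, $ }) {
  // .attr() returns undefined when no element matches, so fall back to null.
  const description = $('meta[name="description"]').attr('content') ?? null;
  const image = $('meta[property="og:image"]').attr('content') ?? null;

  // A recordExtractor returns an array of records.
  return [{ objectID: url.href, description, image }];
}
```

You could then reference the function from an action, for example `recordExtractor: extractMetaRecord`.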
Extract data from JSON-LD
To get content from supported JSON-LD attributes, parse the JSON-LD script element embedded in the page.
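One way to do this, sketched under assumptions: the page embeds a single `script` element of type `application/ld+json`, and the fields of interest are `headline` and `author.name` (schema.org names picked for illustration, not a list of what the Crawler supports). The function name `extractJsonLdRecord` is hypothetical.

```javascript
// Sketch: read schema.org fields from the first JSON-LD <script> on the page.
// Assumption: the Crawler supplies `$` (Cheerio) and `url` (a URL object).
function extractJsonLdRecord({ url, $ }) {
  // .html() returns the raw script content, or null if nothing matched.
  const raw = $('script[type="application/ld+json"]').first().html();

  let data = {};
  try {
    // JSON.parse throws on malformed input, so guard it.
    data = JSON.parse(raw) ?? {};
  } catch (err) {
    data = {};
  }

  return [
    {
      objectID: url.href,
      headline: data.headline ?? null,
      authorName: data.author?.name ?? null,
    },
  ];
}
```

Guarding the parse step keeps one malformed page from failing the whole extraction; missing fields simply become `null`.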
Get text from several CSS selectors
To get content from several CSS selectors, query them all and retrieve an array of content.
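As a rough illustration, the sketch below collects the text of every heading matched by a comma-separated selector list. The selectors, the `headings` attribute, and the function name `extractHeadings` are assumptions chosen for the example.

```javascript
// Sketch: gather text from several selectors into one array.
// Assumption: the Crawler supplies `$` (Cheerio) and `url` (a URL object).
function extractHeadings({ url, $ }) {
  const headings = $('h1, h2, h3')
    // Cheerio's map callback receives (index, element), like jQuery.
    .map((i, el) => $(el).text().trim())
    // .get() converts the Cheerio collection into a plain array.
    .get()
    // Drop headings that contain only whitespace.
    .filter(Boolean);

  return [{ objectID: url.href, headings }];
}
```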
Build a UI hierarchy
The InstantSearch UI libraries provide a hierarchicalMenu widget for displaying hierarchical information.
This widget expects your records to follow a specific format.
If your site shows a breadcrumb, you can turn it into a hierarchy in your records.
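The sketch below shows one way to build such a hierarchy, assuming the breadcrumb is a `ul.breadcrumb` list and that the widget reads levels named `lvl0`, `lvl1`, and so on, where each level repeats its ancestors joined by ` > `. The markup, the `category` attribute, and both function names are hypothetical.

```javascript
// Assumed breadcrumb markup (illustrative only):
// <ul class="breadcrumb"><li>Home</li><li>Docs</li><li>Crawler</li></ul>

// Turn ['Home', 'Docs'] into { lvl0: 'Home', lvl1: 'Home > Docs' }.
function buildHierarchy(crumbs, separator = ' > ') {
  const hierarchy = {};
  crumbs.forEach((_, i) => {
    // Each level contains the full path down to that level.
    hierarchy[`lvl${i}`] = crumbs.slice(0, i + 1).join(separator);
  });
  return hierarchy;
}

// Assumption: the Crawler supplies `$` (Cheerio) and `url` (a URL object).
function extractBreadcrumbRecord({ url, $ }) {
  const crumbs = $('ul.breadcrumb li')
    .map((i, el) => $(el).text().trim())
    .get();

  return [{ objectID: url.href, category: buildHierarchy(crumbs) }];
}
```

With records shaped like this, the widget would be configured with the attributes `category.lvl0`, `category.lvl1`, and so on.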
Index separate indices based on content
To add records to separate indices, create several actions, each targeting a separate indexName.
You can then decide which pages each action processes by specifying the pathsToMatch parameter.
Sometimes you need to check the page content to determine which action should process it.
For example, if you have a separate index for each language, use the html tag’s lang attribute to determine which index to use.
In that case, both actions can match the same pages, and each one crawls or skips a page depending on its lang attribute.
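A minimal sketch of this setup follows. The index names, the `https://example.com/**` pattern, and the factory function `makeLanguageAction` are assumptions made for illustration; returning an empty array is used here to mean "produce no records for this page".

```javascript
// Sketch: one action per language, both matching the same URLs.
// Assumption: the Crawler supplies `$` (Cheerio) and `url` (a URL object).
function makeLanguageAction(lang, indexName) {
  return {
    indexName,
    pathsToMatch: ['https://example.com/**'],
    recordExtractor: ({ url, $ }) => {
      // Skip pages written in another language by returning no records.
      if ($('html').attr('lang') !== lang) {
        return [];
      }
      return [{ objectID: url.href, title: $('head title').text().trim() }];
    },
  };
}

// Hypothetical index names, one per language.
const actions = [
  makeLanguageAction('en', 'english'),
  makeLanguageAction('fr', 'french'),
];
```

Generating the actions from a small factory keeps the language check in one place, so adding a language only means adding one entry to the array.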
Split content
For better performance and relevance, split long content into several records.
Split PDF files
The Crawler transforms PDF documents into HTML with Apache Tika and exposes it to you with Cheerio.
Use the HTML tab of the URL Tester to see the extracted HTML.
Based on the structure of the resulting HTML, you should be able to separate the content into individual records.
Basic PDF splitting
The HTML that Tika generates is often mainly composed of p tags, meaning that $('p').text() returns the complete text of your PDF.
Since PDF documents tend to be long and there’s a size limit for Algolia records, wrap such text with the splitContentIntoRecords helper.
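The sketch below shows how such a call might look. The option names (`baseRecord`, `$elements`, `maxRecordBytes`, `textAttributeName`, `orderingAttributeName`) reflect my understanding of the helper's options and should be checked against the Crawler's helper reference; the byte limit and the function name `extractPdfRecords` are arbitrary choices for the example.

```javascript
// Sketch: split the paragraphs of a converted PDF into several records.
// Assumption: the Crawler supplies `$`, `url`, and `helpers`.
function extractPdfRecords({ url, $, helpers }) {
  return helpers.splitContentIntoRecords({
    // Attributes copied into every generated record.
    baseRecord: { url: url.href },
    // Elements whose text should be split.
    $elements: $('p'),
    // Illustrative size cap per record, in bytes.
    maxRecordBytes: 1000,
    // Attribute that receives each chunk of text.
    textAttributeName: 'text',
    // Attribute that receives the chunk's position in the document.
    orderingAttributeName: 'part',
  });
}
```

Because every chunk shares the same `url`, you may want to deduplicate results on that attribute at query time so a single PDF doesn't fill the result list.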
Advanced PDF splitting
PDF generation tools often create files with minimal structure.
It’s typical to encounter div tags as a way to define individual pages.
For example, this document has the following structure when transformed into HTML:
If you append #page=n to the end of a URL pointing to a PDF document, the browser opens it on that page.
By generating one record per page, you can redirect users to the page of the document that matches their search, which further improves their experience.
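A minimal sketch of per-page records, assuming the converted HTML wraps each page in a `div` with the class `page` (check the URL Tester output for your own files). The function name `extractPdfPageRecords` and the record attributes are hypothetical.

```javascript
// Assumed structure of the converted PDF (illustrative only):
// <div class="page"><p>Text of page 1</p></div>
// <div class="page"><p>Text of page 2</p></div>

// Assumption: the Crawler supplies `$` (Cheerio) and `url` (a URL object).
function extractPdfPageRecords({ url, $ }) {
  return $('div.page')
    .toArray()
    .map((el, index) => {
      // Pages are numbered from 1 in the #page=n fragment.
      const pageUrl = `${url.href}#page=${index + 1}`;
      return {
        objectID: pageUrl,
        url: pageUrl,
        page: index + 1,
        content: $(el).text().trim(),
      };
    })
    // Filter after numbering so blank pages don't shift page numbers.
    .filter((record) => record.content.length > 0);
}
```

A single page can still exceed the record size limit, so for dense documents you may need to split each page's text further.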
Split pages using URI fragments
If your pages contain URI fragments, it’s a good idea to have your records point to them.
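As an illustration, the sketch below creates one record per section, assuming each section starts with an `h2` that carries an `id` and runs until the next `h2`. The markup, the record attributes, and the function name `extractSectionRecords` are assumptions.

```javascript
// Assumed page markup (illustrative only):
// <h2 id="install">Install</h2><p>Run the installer.</p>
// <h2 id="usage">Usage</h2><p>Start the app.</p>

// Assumption: the Crawler supplies `$` (Cheerio) and `url` (a URL object).
function extractSectionRecords({ url, $ }) {
  return $('h2[id]')
    .toArray()
    .map((el) => {
      const $heading = $(el);
      // Resolving '#id' against the page URL replaces any existing fragment.
      const sectionUrl = new URL(`#${$heading.attr('id')}`, url.href).href;
      return {
        objectID: sectionUrl,
        url: sectionUrl,
        title: $heading.text().trim(),
        // Text of the sibling elements up to the next h2.
        content: $heading.nextUntil('h2').text().trim(),
      };
    });
}
```

Using the fragment URL as the `objectID` keeps each section's record distinct while still pointing search results at the right place on the page.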