Learn how to extract data from document formats such as PDF with your Algolia crawler.
The Crawler can extract data from files like PDF and Word documents.
To do this, it uses Apache Tika to extract a document’s content and transform it into a basic HTML file.
Because it’s difficult to translate non-HTML documents into HTML, there are limitations:
A PDF document can break if it's exported with an unknown font.
The transformed HTML has little semantic value: headings, paragraphs, and lists in the original document might not be marked up in the HTML, which makes good relevance hard to achieve.
Document indexing is slower than classic HTML indexing.
To enable document extraction, add the fileTypesToMatch
parameter to at least one of your crawler’s actions.
For a list of supported file types, see Supported file types.
The document’s transformed HTML is stored in the recordExtractor.$ parameter.
The file type is stored in the recordExtractor.fileType parameter.
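As a minimal sketch of how these parameters fit together, the action below opts in to PDF extraction and builds one record per document. The index name, URL pattern, and record attributes are placeholders chosen for illustration, and the sketch assumes that `$` behaves like a Cheerio-style selector function and that `url` exposes an `href` property.

```javascript
// Hypothetical crawler action: names and patterns are illustrative only.
const documentAction = {
  indexName: "documents_example",
  pathsToMatch: ["https://www.example.com/**"],
  // Opt in to document extraction; "html" is listed so regular pages still match.
  fileTypesToMatch: ["html", "pdf"],
  recordExtractor: ({ url, $, fileType }) => {
    // `$` queries the HTML that was produced from the original document.
    const title = $("title").text().trim();
    const content = $("body").text().replace(/\s+/g, " ").trim();
    // `fileType` tells you which kind of document this record came from.
    return [{ objectID: url.href, fileType, title, content }];
  },
};
```

Storing `fileType` on the record is one way to let you filter or rank documents differently from regular pages at query time.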
The file type email includes all documents related to email.
The Crawler supports the Outlook Mail Message (.msg) format.
For example, Tika converts a sample .msg email into the following HTML:
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta name="date" content="2017-06-01T15:24:31Z" />
  <meta name="Message:To-Email" content="to@domain.com" />
  <meta name="dc:description" content="this is a mail to test msg file" />
  <meta name="subject" content="this is a mail to test msg file" />
  <meta name="dc:creator" content="from@domain.com" />
  <meta name="Message:From-Email" content="from@domain.com" />
  <meta name="dcterms:created" content="2017-06-01T15:24:31Z" />
  <meta name="Message-To" content="to@domain.com" />
  <meta name="dcterms:modified" content="2017-06-01T15:24:31Z" />
  <meta name="Last-Modified" content="2017-06-01T15:24:31Z" />
  <meta name="Message-Recipient-Address" content="to@domain.com" />
  <meta name="Message:Raw-Header:X-Unsent" content="1" />
  <meta name="Message:Raw-Header:Subject" content="this is a mail to test msg file" />
  <meta name="meta:mapi-message-class" content="NOTE" />
  <meta name="Message:To-Display-Name" content="to@domain.com" />
  <meta name="Last-Save-Date" content="2017-06-01T15:24:31Z" />
  <meta name="Message:Raw-Header:MIME-Version" content="1.0" />
  <meta name="meta:save-date" content="2017-06-01T15:24:31Z" />
  <meta name="dc:title" content="this is a mail to test msg file" />
  <meta name="Message:Raw-Header:Message-ID" content="&lt;c58b1b52f61f4789ba40339c6e993440&gt;" />
  <meta name="modified" content="2017-06-01T15:24:31Z" />
  <meta name="Content-Type" content="application/vnd.ms-outlook" />
  <meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />
  <meta name="X-Parsed-By" content="org.apache.tika.parser.microsoft.OfficeParser" />
  <meta name="creator" content="from@domain.com" />
  <meta name="Message:Raw-Header:From" content="from@domain.com" />
  <meta name="meta:author" content="from@domain.com" />
  <meta name="meta:creation-date" content="2017-06-01T15:24:31Z" />
  <meta name="meta:mapi-from-representing-email" content="from@domain.com" />
  <meta name="Creation-Date" content="2017-06-01T15:24:31Z" />
  <meta name="Message-Cc" content="" />
  <meta name="Message-Bcc" content="" />
  <meta name="meta:mapi-from-representing-name" content="from@domain.com" />
  <meta name="Message:Raw-Header:To" content="to@domain.com" />
  <meta name="Message:From-Name" content="from@domain.com" />
  <meta name="Author" content="from@domain.com" />
  <meta name="Message-From" content="from@domain.com" />
  <meta name="Message:To-Name" content="" />
  <title>this is a mail to test msg file</title>
</head>
<body>
  <h1>this is a mail to test msg file</h1>
  <dl>
    <dt>From</dt>
    <dd>from@domain.com</dd>
    <dt>To</dt>
    <dd>to@domain.com</dd>
    <dt>Recipients</dt>
    <dd>to@domain.com</dd>
  </dl>
  <div class="message-body">
    <p>This message was sent using a msg file </p>
  </div>
</body>
</html>
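Because the email's metadata ends up in `meta` tags and its text in the `message-body` element, a record extractor can read both with ordinary selectors. The following is a sketch only: `extractEmailRecord` and the record attribute names are hypothetical, and it assumes `$` is a Cheerio-style selector function supporting `attr()` and `text()`, and that `url` exposes an `href` property.

```javascript
// Hypothetical helper: turns the transformed email HTML into a single record.
function extractEmailRecord({ url, $, fileType }) {
  // Only handle documents that were identified as email.
  if (fileType !== "email") return [];

  // Read the `content` attribute of a <meta> tag by its `name`.
  const meta = (name) => $(`meta[name="${name}"]`).attr("content") || "";

  return [
    {
      objectID: url.href,
      subject: meta("subject"),
      from: meta("Message:From-Email"),
      to: meta("Message:To-Email"),
      sentAt: meta("dcterms:created"),
      // The message text lives in <div class="message-body">.
      body: $("div.message-body").text().trim(),
    },
  ];
}
```

You could pass a function like this as the `recordExtractor` of an action whose `fileTypesToMatch` includes `email`. Returning an empty array for other file types means no records are produced for them by this extractor.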