Within a search engine, tokenization is the process of splitting text into “tokens”, both during querying and indexing.
Tokens are the basic units for finding matches between queries and records.
Algolia’s tokenizer divides characters into two classes: non-separators and separators.
Non-separators are alphanumeric characters, and separators are non-alphanumeric characters like spaces and hyphens (-).
Turning a string into tokens (tokenizing) happens character-by-character.
The tokenizer identifies the longest groups of contiguous characters belonging to the same class (separator or non-separator),
and creates a token for each group.
For example, the string Hello, World! results in four tokens:
Hello (non-separator)
, (with a trailing space) (separator)
World (non-separator)
! (separator)
Hello and World are composed of non-separator characters, while , (with a trailing space) and ! are composed of separators.
Only non-separator characters are indexed, and thus searchable, by default. In the example above, only Hello and World are indexed. Whether a user searches for Hello, World! or hello world, any record with these tokens is a match.
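The grouping rule described above can be sketched in a few lines of Python. This is an illustration only, not Algolia's implementation: the function names tokenize and indexed_terms are hypothetical, Python's str.isalnum stands in for the engine's definition of "alphanumeric", and lowercasing is an assumption used to model the case-insensitive match from the example.

```python
from itertools import groupby


def tokenize(text: str) -> list[tuple[str, str]]:
    """Group contiguous characters of the same class into one token each.

    Hypothetical helper: str.isalnum approximates "alphanumeric" here.
    """
    return [
        ("".join(group), "non-separator" if is_alnum else "separator")
        for is_alnum, group in groupby(text, key=str.isalnum)
    ]


def indexed_terms(text: str) -> list[str]:
    """Keep only non-separator tokens, which are the ones indexed by default.

    Lowercasing is an assumption that models case-insensitive matching.
    """
    return [tok.lower() for tok, kind in tokenize(text) if kind == "non-separator"]


for token, kind in tokenize("Hello, World!"):
    print(repr(token), kind)
# 'Hello' non-separator
# ', ' separator
# 'World' non-separator
# '!' separator
```

Under these assumptions, indexed_terms("Hello, World!") and indexed_terms("hello world") both produce the terms hello and world, which is why either query matches the same records.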
You can customize what characters are indexed using separatorsToIndex.
Including a character in this setting has these consequences:
It’s tokenized as a non-separator.
It’s not combined with adjacent characters.
The tokenizer always puts the character alone in its own token,
even if it appears next to other non-separators, or even next to itself.
It’s indexed.
For example, if separatorsToIndex is set to #@ (hash and at sign), then the string #@lgolia!! is tokenized as:
# (non-separator)
@ (non-separator)
lgolia (non-separator)
!! (separator)
Since # and @ are included in separatorsToIndex,
the tokens #, @, and lgolia are indexed.
Even though they appear next to each other, # and @ are separate tokens.
Now, when a user searches for #, @, or LGOLIA!!, this record matches.
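To make the effect of separatorsToIndex concrete, here is a minimal Python sketch of a tokenizer that follows the three consequences listed above. It is not Algolia's code: the function tokenize and its separators_to_index parameter are hypothetical names chosen to mirror the setting, and str.isalnum is assumed to approximate the engine's notion of alphanumeric characters.

```python
def tokenize(text: str, separators_to_index: str = "") -> list[tuple[str, str]]:
    """Tokenize text, giving each character in separators_to_index its own token.

    Hypothetical sketch of the behavior described in the text.
    """
    tokens: list[tuple[str, str]] = []
    can_extend = False  # True when the last token may absorb the next character
    for ch in text:
        if ch in separators_to_index:
            # Treated as a non-separator, but always alone in its token.
            tokens.append((ch, "non-separator"))
            can_extend = False
            continue
        kind = "non-separator" if ch.isalnum() else "separator"
        if can_extend and tokens[-1][1] == kind:
            # Same class as the previous character: grow the current token.
            tokens[-1] = (tokens[-1][0] + ch, kind)
        else:
            tokens.append((ch, kind))
        can_extend = True
    return tokens


for token, kind in tokenize("#@lgolia!!", separators_to_index="#@"):
    print(repr(token), kind)
# '#' non-separator
# '@' non-separator
# 'lgolia' non-separator
# '!!' separator
```

The can_extend flag is what keeps # and @ from merging with each other or with lgolia, even though all three are classed as non-separators.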
Although each character in separatorsToIndex becomes its own token,
when it’s adjacent to a non-separator token, the order of the tokens is preserved.
For example, if @ is included in separatorsToIndex,
then the string alice@wonderland is interpreted as alice @ wonderland (all tokens must be adjacent, in this order).
The phrase alice @ wonderland (with spaces in between) has the same tokens, but with no restrictions on order.
A search for alice@wonderland returns records with alice@wonderland and alice @ wonderland (with spaces),
but not records with wonderland @ alice or alice was @ wonderland.
When tokens must occur in a particular order,
it’s known as a sequence expression.
Algolia always creates sequence expressions when alphanumeric characters surround a hyphen (-),
even if the hyphen isn’t included in separatorsToIndex.
For example, the term real-time creates a sequence expression.
The query real-time matches records with real time and real-time,
but not real [...] time, time real, or time [...] real ([...] indicates other words in the string).
The query real time, without a hyphen, matches any records with those two words,
regardless of order or proximity.
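The difference between a sequence expression and an ordinary multi-word query can be sketched as two matching modes. The Python below is a simplified model, not the engine's algorithm: indexed_terms, matches_sequence, and matches_any_order are hypothetical helpers, lowercasing and str.isalnum are assumptions, and the sketch doesn't decide by itself which mode a query uses (per the text, that depends on hyphens and separatorsToIndex characters in the query).

```python
def indexed_terms(text: str, separators_to_index: str = "") -> list[str]:
    """Return the indexed tokens of text, in order (hypothetical helper)."""
    terms: list[str] = []
    current = ""
    for ch in text.lower():
        if ch.isalnum() and ch not in separators_to_index:
            current += ch
            continue
        if current:
            terms.append(current)
            current = ""
        if ch in separators_to_index:
            terms.append(ch)  # indexed separators are tokens of their own
    if current:
        terms.append(current)
    return terms


def matches_sequence(query: list[str], record: list[str]) -> bool:
    """Sequence expression: all query tokens adjacent and in the same order."""
    n = len(query)
    return any(record[i:i + n] == query for i in range(len(record) - n + 1))


def matches_any_order(query: list[str], record: list[str]) -> bool:
    """Ordinary query: every token present, regardless of order or proximity."""
    return set(query) <= set(record)


query = indexed_terms("real-time")  # treated as a sequence expression
print(matches_sequence(query, indexed_terms("real time")))      # True
print(matches_sequence(query, indexed_terms("time real")))      # False
print(matches_any_order(query, indexed_terms("time for real")))  # True
```

The same helpers reproduce the earlier example: with separators_to_index="@", the query alice@wonderland matches alice @ wonderland as a sequence, but not wonderland @ alice or alice was @ wonderland.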
Sequence expression matching relies on word positions:
all tokens must be adjacent.
Indexing only keeps the positions of the first 1,000 words of every attribute.
For all words beyond this limit,
sequence expression matching doesn’t work.