Within a search engine, tokenization is the process of splitting text into “tokens”, both during querying and indexing. Tokens are the basic units for finding matches between queries and records.

Separators and non-separators

Algolia’s tokenizer divides characters into two classes: non-separators and separators. Non-separators are alphanumeric characters, and separators are non-alphanumeric characters like spaces and hyphens (-). Turning a string into tokens (tokenizing) happens character-by-character. The tokenizer identifies the longest groups of contiguous characters belonging to the same class (separator or non-separator), and creates a token for each group. For example, the string Hello, World! results in four tokens:
  • Hello (non-separator)
  • , (with a trailing space) (separator)
  • World (non-separator)
  • ! (separator)
Hello and World consist of non-separator characters, while , (with a trailing space) and ! consist of separators. By default, only non-separator tokens are indexed, and thus searchable. In the example above, only Hello and World are indexed. Whether a user searches for Hello, World! or hello world, any record with these tokens matches.
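The grouping rule above can be sketched in a few lines of Python. This is an illustrative approximation, not Algolia's implementation: the function names are hypothetical, `str.isalnum` stands in for the "alphanumeric" test, and lowercasing is an assumption made so that Hello and hello compare equal.

```python
from itertools import groupby


def tokenize(text):
    """Split text into maximal runs of characters of the same class."""
    tokens = []
    # groupby yields one group per run of consecutive characters
    # that share the same isalnum() result.
    for is_alnum, run in groupby(text, key=str.isalnum):
        kind = "non-separator" if is_alnum else "separator"
        tokens.append(("".join(run), kind))
    return tokens


def indexed_tokens(text):
    """Keep only non-separator tokens (lowercasing is an assumption)."""
    return [tok.lower() for tok, kind in tokenize(text) if kind == "non-separator"]


print(tokenize("Hello, World!"))
# [('Hello', 'non-separator'), (', ', 'separator'),
#  ('World', 'non-separator'), ('!', 'separator')]
print(indexed_tokens("Hello, World!") == indexed_tokens("hello world"))  # True
```

Both queries reduce to the same indexed tokens, which is why either one matches the record.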

Index separators

You can customize what characters are indexed using separatorsToIndex. Including a character in this setting has these consequences:
  • It’s tokenized as a non-separator.
  • It’s not combined with adjacent characters. The tokenizer always puts the character alone in its own token, even if it appears next to other non-separators, or even next to itself.
  • It’s indexed.
For example, if separatorsToIndex is set to #@ (hash and at sign), then the string #@lgolia!! is tokenized as:
  • # (non-separator)
  • @ (non-separator)
  • lgolia (non-separator)
  • !! (separator)
Since # and @ are included in separatorsToIndex, the tokens #, @, and lgolia are indexed. Even though they appear next to each other, # and @ are separate tokens. Now, when a user searches for #, @, or LGOLIA!!, this record matches.
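As a rough illustration of these three consequences, the sketch below tokenizes a string while treating each character listed in a `separators_to_index` argument as its own non-separator token. The function name and argument are hypothetical and only mirror the setting described above; this isn't Algolia's tokenizer.

```python
def tokenize_with_indexed_separators(text, separators_to_index=""):
    """Tokenize text; characters in separators_to_index always stand alone."""
    special = set(separators_to_index)
    tokens = []
    for ch in text:
        if ch in special:
            # Indexed separators are non-separators and never merge,
            # not even with a neighboring copy of themselves.
            tokens.append((ch, "non-separator"))
            continue
        kind = "non-separator" if ch.isalnum() else "separator"
        previous = tokens[-1] if tokens else None
        if previous and previous[1] == kind and previous[0][0] not in special:
            # Same class as the previous token: extend that run.
            tokens[-1] = (previous[0] + ch, kind)
        else:
            tokens.append((ch, kind))
    return tokens


print(tokenize_with_indexed_separators("#@lgolia!!", "#@"))
# [('#', 'non-separator'), ('@', 'non-separator'),
#  ('lgolia', 'non-separator'), ('!!', 'separator')]
```

With an empty `separators_to_index`, the function falls back to the default behavior, where # and @ are plain separators.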

Sequence expressions

Although characters in separatorsToIndex are tokenized on their own, their order is preserved when they’re adjacent to a non-separator token. For example, if @ is included in separatorsToIndex, then the string alice@wonderland is interpreted as alice @ wonderland (all tokens must be adjacent, in this order). The phrase alice @ wonderland (with spaces in between) has the same tokens, but with no restrictions on order. A search for alice@wonderland returns records with alice@wonderland and alice @ wonderland (with spaces), but not records with wonderland @ alice or alice was @ wonderland. When tokens must occur in a particular order, it’s known as a sequence expression.

Algolia always creates sequence expressions when alphanumeric characters surround a hyphen (-), even if the hyphen isn’t included in separatorsToIndex. For example, the term real-time creates a sequence expression. The query real-time matches records with real time and real-time, but not real [...] time, time real, or time [...] real ([...] indicates other words in the string). The query real time, without a hyphen, matches any records with those two words, regardless of order or proximity.
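The matching behavior described here can be modeled with a small Python sketch. It rests on several simplifying assumptions that aren't part of the original text: tokens are compared exactly after lowercasing (no prefix or typo tolerance), a query chunk glued together by hyphens or indexed separators forms one sequence expression, and all function names are hypothetical.

```python
def indexed_words(text, separators_to_index=""):
    """Lowercased indexed tokens of text, in order of appearance."""
    words, current = [], ""
    for ch in text.lower():
        if ch.isalnum():
            current += ch
            continue
        if current:
            words.append(current)
            current = ""
        if ch in separators_to_index:
            words.append(ch)  # indexed separators are one-character tokens
    if current:
        words.append(current)
    return words


def query_groups(query, separators_to_index=""):
    """Split a query into groups; a multi-token group is a sequence expression."""
    glue = set(separators_to_index) | {"-"}
    # Any other separator breaks the query into independent chunks.
    cleaned = "".join(c if c.isalnum() or c in glue else " " for c in query)
    groups = [indexed_words(chunk, separators_to_index) for chunk in cleaned.split()]
    return [g for g in groups if g]


def matches(query, record, separators_to_index=""):
    """True if every group appears in the record as an adjacent, ordered run."""
    rec = indexed_words(record, separators_to_index)
    for group in query_groups(query, separators_to_index):
        n = len(group)
        if not any(rec[i:i + n] == group for i in range(len(rec) - n + 1)):
            return False
    return True


print(matches("alice@wonderland", "alice @ wonderland", "@"))      # True
print(matches("alice@wonderland", "alice was @ wonderland", "@"))  # False
print(matches("real-time", "time real"))                           # False
print(matches("real time", "time real"))                           # True
```

Single-token groups only need to appear somewhere in the record, which reproduces the "regardless of order or proximity" behavior of the query real time.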

Sequence expressions limitation

Sequence expression matching relies on word position: all tokens must be adjacent. Indexing only keeps the positions of the first 1,000 words of each attribute. For words beyond this limit, sequence expression matching doesn’t work.

Mitigation and solution

To mitigate the issue, you can:
  • Transform the query, for example from real-time to real time
  • Use smaller records
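Both mitigations can be approximated on the client side. The sketch below is a hypothetical helper pair, not part of any Algolia client: `relax_hyphens` rewrites hyphens surrounded by alphanumeric characters into spaces before the query is sent, and `may_exceed_position_limit` gives a rough warning for attributes with more than 1,000 words. The way words are counted here (runs of alphanumeric characters) is an assumption and may differ from how the engine counts them.

```python
import re

# A hyphen with an alphanumeric character on each side.
# [^\W_] matches any word character except the underscore.
HYPHEN_BETWEEN_ALNUM = re.compile(r"(?<=[^\W_])-(?=[^\W_])")


def relax_hyphens(query):
    """Turn real-time into real time so no sequence expression is created."""
    return HYPHEN_BETWEEN_ALNUM.sub(" ", query)


def may_exceed_position_limit(attribute_text, limit=1000):
    """Rough check: does the attribute hold more words than the limit?"""
    return len(re.findall(r"[^\W_]+", attribute_text)) > limit


print(relax_hyphens("real-time"))         # real time
print(relax_hyphens("state-of-the-art"))  # state of the art
```

Relaxing the query trades precision for recall: real time also matches records where the two words are far apart or in reverse order.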