- Lowercase all characters
- Remove all diacritics, for example, accents
- Remove punctuation within words, for example, apostrophes
- Manage punctuation between words
- Use word separators, for example, spaces
- Transform traditional Chinese to modern
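The first steps in this list can be sketched in a few lines of Python. This is only an illustration, not Algolia's implementation: the function name `normalize_basic` is hypothetical, and replacing punctuation between words with a space is an assumption about what "manage" means here. Diacritics and Chinese conversion are left out of this sketch.

```python
import re

def normalize_basic(text):
    """Illustrative only: lowercase, drop apostrophes inside words,
    and treat any other punctuation as a word separator (assumption)."""
    text = text.lower()
    # Remove straight or curly apostrophes that sit between two word characters.
    text = re.sub(r"(?<=\w)['’](?=\w)", "", text)
    # Replace remaining punctuation with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())

print(normalize_basic("Don't Panic, it’s FINE!"))  # dont panic its fine
```

Running the same function over both a record and a query makes the two comparable, which is the point of normalizing in the first place.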
When is normalization applied?
Normalization happens at both indexing and query time, ensuring consistency in how your data is represented as well as matched. Normalization is language-agnostic and you can’t turn it off. You can change the default normalization by providing a custom normalization for your index.
Character-based normalization
Algolia uses Unicode (UTF-16), which handles every known language. Character-based normalization means reducing the full Unicode character set to a smaller, more consistent subset of characters.
Diacritics
By default, Algolia removes diacritics from words. For example: é becomes e, ø becomes o, and で becomes て.
If this causes issues, you can specify characters that keep their diacritics with the keepDiacriticsOnCharacters parameter. Characters passed to this parameter aren’t normalized.
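As a rough local model of this behavior, the sketch below strips combining marks after Unicode decomposition and skips any character listed in a `keep` string. It is not Algolia's code: `strip_diacritics`, `keep`, and the small `EXTRA_FOLDS` table are hypothetical names, and the table is needed only because ø has no canonical decomposition in Unicode.

```python
import unicodedata

# Characters without a canonical decomposition need an explicit mapping.
# This tiny table is illustrative, not exhaustive.
EXTRA_FOLDS = {"ø": "o", "Ø": "O"}

def strip_diacritics(text, keep=""):
    """Remove diacritics from every character that is not listed in `keep`."""
    # Compose first so each accented letter is a single code point.
    text = unicodedata.normalize("NFC", text)
    out = []
    for ch in text:
        if ch in keep:
            out.append(ch)  # plays the role of keepDiacriticsOnCharacters
            continue
        ch = EXTRA_FOLDS.get(ch, ch)
        decomposed = unicodedata.normalize("NFD", ch)
        # Drop combining marks (accents, the Japanese voicing mark, ...).
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)

print(strip_diacritics("é ø で"))            # e o て
print(strip_diacritics("é ø で", keep="ø"))  # e ø て
```

In an index configuration, the equivalent would be a setting such as `"keepDiacriticsOnCharacters": "ø"`; how that setting is sent depends on the API client you use.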
Word separators
The engine uses the space character (among other techniques) to detect word boundaries. Spacing is a fairly reliable method of word detection (tokenization), but not every language relies exclusively on it. Some languages concatenate and compound words (agglutinated words), and others string words together without using spaces (CJK). In those cases, spacing doesn’t go far enough: most words are detected, but some within compound words aren’t. You can improve word detection with dictionaries, which let the engine spot the “words within the words”.
Word-based normalization
You can’t turn off the following techniques:
- Splitting. Split words when they’re combined: “jamesbrown” matches with “James Brown”. Words are only split if there aren’t any typos.
- Concatenation. Combine words that are separated by a space: “entert ainment” matches with “entertainment”. Words are only combined if there aren’t any typos.
- Acronyms. Separator characters between letters are removed. D.N.A is considered the same as DNA.
- Hyphenated words. If each separated component is three or more letters, each component is treated as a standalone word (off-campus → off + campus + offcampus).
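The acronym and hyphenation rules can be sketched as a small token expander. The helper `expand_token` is hypothetical, and two details are assumptions rather than documented behavior: only the period is treated as an acronym separator, and a hyphenated word with a component shorter than three letters is returned unchanged.

```python
import re

def expand_token(token):
    """Return the list of words a single token is indexed or searched as."""
    token = token.lower()
    # Acronym: single letters separated by periods, such as "D.N.A" or "u.s.".
    if re.fullmatch(r"[a-z](?:\.[a-z])+\.?", token):
        return [token.replace(".", "")]
    # Hyphenated word: if every component has three or more letters,
    # keep each component and also the concatenated form.
    parts = token.split("-")
    if len(parts) > 1 and all(len(part) >= 3 for part in parts):
        return parts + ["".join(parts)]
    # Assumption: anything else is left as-is.
    return [token]

print(expand_token("D.N.A"))       # ['dna']
print(expand_token("off-campus"))  # ['off', 'campus', 'offcampus']
```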
Normalization for logogram-based languages (CJK)
Word detection
Some languages don’t use spaces to delimit words. Without words, search is limited to sequential, character-based matching. This is a serious limitation, because it rules out important and basic search features, such as inverse word matching (“red shirt” / “shirt red”), matching non-contiguous words (“chocolate cookies” finds “chocolate chip cookies”), the use of ANDs and ORs, and Rules.
To detect words in CJK text, Algolia follows a two-step process:
- Use the ICU (International Components for Unicode) library to find words. This library is based on the MeCab dictionary, enriched with data from Wiktionary.
- If that fails, use a sequential character-based search.
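To make the two-step idea concrete, here is a toy segmenter that tries a dictionary first and falls back to single characters. It stands in for the ICU step with a greedy longest-match lookup against a made-up two-word dictionary, so it shows the shape of the process rather than how ICU or Algolia actually segment text. `TOY_DICTIONARY` and `segment` are hypothetical names.

```python
# A made-up dictionary; a real one holds many thousands of entries.
TOY_DICTIONARY = {"東京", "大学"}

def segment(text, dictionary=TOY_DICTIONARY, max_len=4):
    """Greedy longest-match segmentation with a per-character fallback."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink the window.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A window of size 1 is the character-based fallback.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens

print(segment("東京大学"))    # ['東京', '大学']
print(segment("東京の大学"))  # ['東京', 'の', '大学']
```

Once the text is split into words, features such as inverse word matching work the same way they do for space-delimited languages.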
Language-specific dictionaries for CJK words
The engine can detect when a user is entering CJK characters, but it can’t detect the exact CJK language. This means that, whenever CJK is detected, the engine applies a generic CJK logic to separate logograms. This is often fine, but if you want the engine to apply language-specific dictionaries, use the queryLanguages setting.
For example, with queryLanguages, you can specify Chinese (“zh”) in the first position to ensure the use of a Chinese dictionary in finding words.
Note that you can change the dictionary dynamically with each search, enabling multilingual support.
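As an illustration, the settings and search parameters might look like the following. They are shown as plain Python dictionaries that mirror the JSON payloads; how they are sent depends on the API client you use, and the query text and choice of languages are only examples.

```python
# Index-level default: try the Chinese dictionary first, then Japanese.
index_settings = {"queryLanguages": ["zh", "ja"]}

# Hypothetical per-search override, for example for a Japanese-speaking user.
search_params = {"query": "東京大学", "queryLanguages": ["ja"]}
```

Because queryLanguages can be passed with each search, the same index can serve users who type in different CJK languages.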
Traditional to standard Chinese character conversion
As part of the normalization process, all traditional characters are converted into their modern Unicode counterparts.
Normalization for Arabic languages
Short vowels removal
Arabic languages make extensive use of diacritics to give hints on pronunciation. Yet it’s not uncommon to omit them when typing, which can hurt searches over text that includes them. Algolia ignores diacritics by default, but these are a bit different, because the Unicode Standard treats them as full-fledged characters. Algolia processes the most common of these diacritics and ignores them in both indices and queries. Consequently, searching with or without them yields the same results. These diacritics are ignored:
- Fathah
- Kasrah
- Dammah
- Sukun
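A minimal sketch of this step, assuming the four marks listed above map to the Unicode code points U+064E (fathah), U+0650 (kasrah), U+064F (dammah), and U+0652 (sukun). The function name `strip_short_vowels` is hypothetical, and this is not Algolia's implementation.

```python
# Code points for the four ignored marks listed above.
IGNORED_MARKS = {
    "\u064E",  # fathah
    "\u0650",  # kasrah
    "\u064F",  # dammah
    "\u0652",  # sukun
}

def strip_short_vowels(text):
    """Drop the ignored Arabic marks so vocalized and bare text compare equal."""
    return "".join(ch for ch in text if ch not in IGNORED_MARKS)

vocalized = "\u0643\u064E\u062A\u064E\u0628\u064E"  # kaf, ta, ba, each with a fathah
bare = "\u0643\u062A\u0628"                         # the same letters without marks
print(strip_short_vowels(vocalized) == bare)  # True
```

Applying the same function to records and to queries is what makes a search with or without these marks return the same results.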