An Elasticsearch char filter for splitting multi-language text into different fields.
Some Elasticsearch analyzers are language-specific, such as the english analyzer. They tokenize documents into terms according to the grammar of that language.
However, these analyzers do not skip text written in other languages. For multi-language documents, the recommended solution is to create sub-fields, each using the analyzer for its own language. The foreign-language text that remains in each sub-field still hurts the accuracy and efficiency of search.
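The sub-field approach can be sketched as a multi-field mapping. This is only an illustration: the index name my-index and the field names content, en and zh are hypothetical, the ik_smart analyzer assumes the IK plugin is installed, and the typeless mapping syntax assumes Elasticsearch 7 or later (older versions need a mapping type level under "mappings").

```
# Hypothetical index and field names, for illustration only.
# "ik_smart" is provided by the IK plugin, not by Elasticsearch itself.
PUT my-index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "fields": {
          "en": { "type": "text", "analyzer": "english" },
          "zh": { "type": "text", "analyzer": "ik_smart" }
        }
      }
    }
  }
}
```

With such a mapping, the same text is indexed once per sub-field (content.en and content.zh), and each analyzer still sees the whole mixed-language text.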
For instance,
POST _analyze
{
  "analyzer": "english",
  "text": "We are going to meet at 中山路."
}
The generated terms are,
[ "we", "go", "meet", "中", "山", "路" ]
"中山路" is actually a road name, but it is split into individual characters.
If we switch to a Chinese analyzer such as IK, the Chinese characters are tokenized correctly, but the English words are kept:
POST _analyze
{
  "analyzer": "ik_smart",
  "text": "We are going to meet at 中山路."
}
The generated terms are,
["we", "going", "meet", "中山路"]
This time the Chinese characters are tokenized correctly, but the English words are kept as well.
Both cases lead to search problems, because the matching score covers all the fields. This char filter aims to filter text by language so that each field stores terms of only one language.
- Download the released zip file from: https://github.com/stormisover/es-language-char-filter/releases/download/0.1/language-char-filter-0.1.zip
- Unzip it to elasticsearch/plugins/language-char-filter
- Restart Elasticsearch
Define your char filter:
"char_filter": {
  "language_char_filter": {
    "type": "language_char_filter",
    "lang": "EN"
  }
}
The parameter lang specifies which language should be filtered. The valid values are:
- zh-CN
- EN
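
To show where the char filter definition fits, here is a minimal sketch of index settings that wire it into a custom analyzer, followed by an _analyze request to inspect the result. The index name my-index and the analyzer name lang_filtered are hypothetical, and the standard tokenizer with the lowercase filter is just one possible choice, not something this plugin requires.

```
# Hypothetical index and analyzer names, for illustration only.
# The char filter definition is the one shown above.
PUT my-index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "language_char_filter": {
          "type": "language_char_filter",
          "lang": "EN"
        }
      },
      "analyzer": {
        "lang_filtered": {
          "type": "custom",
          "char_filter": ["language_char_filter"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}

# Inspect which terms remain after the char filter has run.
POST my-index/_analyze
{
  "analyzer": "lang_filtered",
  "text": "We are going to meet at 中山路."
}
```

Running the _analyze request against a mixed-language sentence is a quick way to confirm how a given lang value behaves before assigning the analyzer to a field.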