As of org.apache.lucene:lucene-analysis-common:9.11.1, the static field DEFAULT_MAX_GRAM_SIZE of EdgeNGramTokenizer is 1, not 2 (the value of NGramTokenizer.DEFAULT_MAX_NGRAM_SIZE).
Logically, the maximum n-gram size only has to be >= minGramSize, so a default of 1 is not wrong, but it is not practical.
Many projects (for example, Elasticsearch and OpenSearch in their source code) use NGramTokenizer.DEFAULT_MAX_NGRAM_SIZE rather than EdgeNGramTokenizer's own default.
Would changing this default cause any problems for projects that depend on Lucene?
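For illustration, here is a minimal sketch of the difference (the class name, the printTokens helper, and the sample input "lucene" are mine; this assumes the Lucene 9.x analysis-common API):

```java
import java.io.StringReader;
import org.apache.lucene.analysis.ngram.EdgeNGramTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class EdgeNGramDefaultsDemo {
  public static void main(String[] args) throws Exception {
    // Current defaults: min = max = 1, so only the first character "l" is emitted.
    printTokens(new EdgeNGramTokenizer(
        EdgeNGramTokenizer.DEFAULT_MIN_GRAM_SIZE,
        EdgeNGramTokenizer.DEFAULT_MAX_GRAM_SIZE), "lucene");

    // Explicit larger max gram size: emits "l", "lu", "luc", ..., "lucene".
    printTokens(new EdgeNGramTokenizer(1, 8), "lucene");
  }

  private static void printTokens(EdgeNGramTokenizer tokenizer, String text) throws Exception {
    tokenizer.setReader(new StringReader(text));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      System.out.println(term.toString());
    }
    tokenizer.end();
    tokenizer.close();
  }
}
```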
I agree that these defaults are not practical. The only purpose of this tokenizer that I'm familiar with is speeding up prefix queries. This makes me wonder if we should have an even higher max gram size, e.g. 8 or 10.
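A sketch of that prefix-query use case, assuming EdgeNGramTokenFilter's (input, minGram, maxGram, preserveOriginal) constructor; the analyzer wiring and class name are illustrative only, and the same idea applies to EdgeNGramTokenizer itself:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter;

// Index-time analyzer that emits every prefix of each token up to 10
// characters, so a prefix query over short prefixes can be answered with a
// cheap exact term lookup instead of an expanding multi-term query.
public class PrefixIndexingAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new WhitespaceTokenizer();
    // preserveOriginal = true also keeps terms longer than 10 characters intact.
    TokenStream prefixes = new EdgeNGramTokenFilter(source, 1, 10, true);
    return new TokenStreamComponents(source, prefixes);
  }
}
```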
In real Elasticsearch configurations, it is common to use a max n-gram size of 8 or 10, as @jpountz said. Of course, people may have different preferences, but I don't think it is a good idea to have both the max and min default to 1.