As of org.apache.lucene:lucene-analysis-common:9.11.1, the static field DEFAULT_MAX_GRAM_SIZE of EdgeNGramTokenizer is 1, not 2 (the value of NGramTokenizer.DEFAULT_MAX_NGRAM_SIZE).
Logically, the maximum n-gram size only has to be >= minGramSize, so a default of 1 is not wrong, but it is not practical.
Many projects (for example, Elasticsearch and OpenSearch in their source code) use NGramTokenizer.DEFAULT_MAX_NGRAM_SIZE rather than EdgeNGramTokenizer's own default.
Would changing this default cause any problems for projects that depend on Lucene?
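For illustration, here is a minimal sketch of the difference (the class name, the printTokens helper, and the sample input "lucene" are mine; this assumes the Lucene 9.x analysis-common API):

```java
import java.io.StringReader;
import org.apache.lucene.analysis.ngram.EdgeNGramTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class EdgeNGramDefaultsDemo {
  public static void main(String[] args) throws Exception {
    // Current defaults: min = max = 1, so only the first character "l" is emitted.
    printTokens(new EdgeNGramTokenizer(
        EdgeNGramTokenizer.DEFAULT_MIN_GRAM_SIZE,
        EdgeNGramTokenizer.DEFAULT_MAX_GRAM_SIZE), "lucene");

    // Explicit larger max gram size: emits "l", "lu", "luc", ..., "lucene".
    printTokens(new EdgeNGramTokenizer(1, 8), "lucene");
  }

  private static void printTokens(EdgeNGramTokenizer tokenizer, String text) throws Exception {
    tokenizer.setReader(new StringReader(text));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      System.out.println(term.toString());
    }
    tokenizer.end();
    tokenizer.close();
  }
}
```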
I agree that these defaults are not practical. The only purpose of this tokenizer that I'm familiar with is speeding up prefix queries. This makes me wonder if we should have an even higher max gram size, e.g. 8 or 10.
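A sketch of that prefix-query use case, assuming EdgeNGramTokenFilter's (input, minGram, maxGram, preserveOriginal) constructor; the analyzer wiring and class name are illustrative only, and the same idea applies to EdgeNGramTokenizer itself:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter;

// Index-time analyzer that emits every prefix of each token up to 10
// characters, so a prefix query over short prefixes can be answered with a
// cheap exact term lookup instead of an expanding multi-term query.
public class PrefixIndexingAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new WhitespaceTokenizer();
    // preserveOriginal = true also keeps terms longer than 10 characters intact.
    TokenStream prefixes = new EdgeNGramTokenFilter(source, 1, 10, true);
    return new TokenStreamComponents(source, prefixes);
  }
}
```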
In real Elasticsearch configurations, it is common to use a max n-gram size of 8 or 10, as @jpountz said. Of course, people may have different preferences, but I don't think it is a good idea to have both the max and min default to 1.