Conditional stemming for 'persian' analyzer #113421

cbuescher · 2024-09-23T21:01:43Z

The 'persian' analyzer for Lucene 10 comes with PersianStemFilter as the last token filter by default. In order to maintain compatibility for old indices, we only use the new analyzer for newly created indices but configure a legacy analyzer with the old behaviour for older index versions.
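
A minimal sketch of the version gate described above, not the code in this PR. It assumes the decision reduces to comparing the index's creation version against a cutoff. The class name, the numeric cutoff `HYPOTHETICAL_LUCENE_10_INDEX_VERSION`, and the placeholder filter names are all made up for illustration; only `PersianStemFilter` comes from the description.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: models "stem for new indices, keep legacy behaviour for old ones".
final class PersianAnalyzerChoice {
    // Hypothetical id of the first index version built on Lucene 10.
    // The real constant is not given in this thread.
    static final int HYPOTHETICAL_LUCENE_10_INDEX_VERSION = 9_000_000;

    // New indices (created on or after the cutoff) get the stemming analyzer.
    static boolean usesStemming(int indexCreatedVersion) {
        return indexCreatedVersion >= HYPOTHETICAL_LUCENE_10_INDEX_VERSION;
    }

    // Returns the legacy filter chain, with PersianStemFilter appended as the
    // last filter only when the index is new enough.
    static List<String> tokenFilters(List<String> legacyFilters, int indexCreatedVersion) {
        List<String> filters = new ArrayList<>(legacyFilters);
        if (usesStemming(indexCreatedVersion)) {
            filters.add("PersianStemFilter");
        }
        return filters;
    }

    public static void main(String[] args) {
        // Placeholder names standing in for the existing (pre-Lucene 10) chain.
        List<String> legacy = List.of("legacyFilterA", "legacyFilterB");
        System.out.println(tokenFilters(legacy, 8_500_000)); // old index: chain unchanged
        System.out.println(tokenFilters(legacy, 9_000_000)); // new index: stemmer appended
    }
}
```

Keying the choice on the index's creation version rather than the node version means an existing index keeps producing the same tokens at index and search time after an upgrade.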

elasticsearchmachine · 2024-09-23T21:02:34Z

Pinging @elastic/es-search-relevance (Team:Search Relevance)

benwtrent · 2024-09-24T11:34:32Z

...alysis-common/src/yamlRestTest/resources/rest-api-spec/test/analysis-common/20_analyzers.yml

@@ -901,6 +901,16 @@
    - length: { tokens: 1 }
    - match:  { tokens.0.token: خورد }

+    - do:


Is this to test stemming?

If so, this test will need to be guarded with a skip version or something similar. ES 9 will have mixed-cluster tests with 8.last and 9.current, and 8.last won't have the stemming automatically, correct?

I would prefer a new test that is guarded, that way the original test isn't always skipped.

All that would have made sense to me a few months ago. Now we live in a world where we merge Lucene 10 to "main" at some point, which is not necessarily the point at which it becomes 9.0. So this is becoming a bit of a head-scratcher for me. I need to figure out whether we can skip based on IndexVersion (since that is what we condition the behavioural change on), or whether we need some new sort of capability voodoo for that, etc.

fwiw I'm afraid I might have to introduce a "cluster_feature" for this. Maybe it makes sense to have one for "Cluster runs with Lucene 10".

benwtrent · 2024-09-24T11:35:26Z

...ysis-common/src/yamlRestTest/resources/rest-api-spec/test/search.query/80_persian_search.yml

+# integration tests for persian analyzer changes from Lucene 9 to Lucene 10
+setup:


If you want to test with old data, then upgrade, then verify the query results don't change, you need a rolling upgrade test or one of the full restart tests.

Yes, I was afraid I'd need that full-blown infra and was somehow hoping I could leverage some yaml test. This at least shows that stemming works in the new version of the analyzer, i.e. both search terms match both documents, which means they are analyzed to the same root form.

pxsalehi · 2024-09-24T11:50:26Z

...sis-common/src/test/java/org/elasticsearch/analysis/common/PersianAnalyzerProviderTests.java

+            Settings.EMPTY
+        );
+        Analyzer analyzer = persianAnalyzerProvider.get();
+        assertAnalyzesTo(analyzer, " من کتاب های زیادی خوانده ام.", new String[] { "كتاب", "زياد", "خوانده" });


"كتاب", "زياد", "خوان"
would make the most sense to me. That is, compared to the one on L74, [1] is correct here, [2] is the same and correct in both, and what is given as the stem in [0] in both of them does not seem correct to me.

Okay, thanks. Token 0 is the same output in both cases though, and might be a general issue with the "persian" analyzer in both versions. The goal here is not to test the quality of the analyzer per se but the changes in output between Lucene 9 and Lucene 10. I was mostly interested in the input sentence and whether that makes sense.

Yes, it makes sense, and it is the same in both: "I have read a lot of books." The dot is at the wrong end; I'm assuming that is a right-to-left issue.

I'll probably remove the dots, already had trouble copy/pasting this since I'm not used to right-to-left languages unfortunately.

cbuescher · 2024-09-24T16:18:10Z

Reopened against the new Lucene 10 integration branch (lucene_snapshot) over here #113482

elasticsearchmachine added the needs:triage label Sep 23, 2024

cbuescher added the :Search Relevance/Analysis and v9.0.0 labels and removed the needs:triage label Sep 23, 2024

elasticsearchmachine added the Team:Search Relevance label Sep 23, 2024

benwtrent self-requested a review September 23, 2024 21:09

cbuescher force-pushed the persian-analyzer-l10 branch from c3a3e26 to 8486254 Compare September 23, 2024 21:39

cbuescher added >enhancement >upgrade labels Sep 23, 2024

cbuescher force-pushed the persian-analyzer-l10 branch from 8486254 to 89ead07 Compare September 24, 2024 10:26

benwtrent reviewed Sep 24, 2024

pxsalehi reviewed Sep 24, 2024

cbuescher added 2 commits September 24, 2024 15:55

Guard yaml stemming test with cluster feature

d012b50

spotless

7a5b8ab

brianseeders deleted the branch elastic:lucene_snapshot_10 September 24, 2024 15:39

brianseeders closed this Sep 24, 2024
