BaseTokenStreamTestCase.assertAnalyzesTo fails when Analyzer contains… #12750

lukas-vlcek · 2023-11-02T17:05:18Z

… PathHierarchy tokenizer

Description

This PR is expected to fail. It demonstrates an issue with the BaseTokenStreamTestCase.assertAnalyzesTo() method when it is used with PathHierarchyTokenizer.

Is there any reason why PathHierarchyTokenizer should not be used in a test like this? Other tokenizers are definitely tested this way, i.e. they are wrapped in an Analyzer and then assertAnalyzesTo() is called to check the tokens. What is special about PathHierarchyTokenizer that makes it not work?

I think the problem might not be in the tokenizer but in the test method itself, or in the way I call it (maybe I need to pass in more parameters/flags to get rid of the issue?). The testing method is complex, especially when it gets to the checkAnalysisConsistency() part.

I am looking for any useful tips. Thank you!

Signed-off-by: Lukáš Vlček <[email protected]>

mikemccand · 2023-11-21T11:11:49Z

This looks like the root cause?:

      java.lang.AssertionError: inconsistent endOffset 1 pos=0 posLen=1 token=/a/b expected:<2> but was:<4>

Indeed I think the issue is a problem with PathHierarchyTokenizer: it produces tokens all on top of one another (instead of in sequence at incrementing positions) yet the tokens claim different start/end offsets, and BaseTokenStreamTestCase detects that as a corrupt token graph. I think this tokenizer should be setting the PositionLengthAttribute as well, to indicate that each token reaches to a further position ... this should make BaseTokenStreamTestCase happy.

See this blog post for more details about how TokenStreams are actually graphs in Lucene.
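To make the failure concrete, here is a minimal plain-Java sketch of the kind of check described above. It is a simplified model, not the actual BaseTokenStreamTestCase code: it assumes the check remembers one end offset per end position (position + position length) and complains when a later token arriving at the same end position reports a different end offset. The `Tok` class, the `firstInconsistency` helper, and the input `/a/b/c` are all hypothetical, chosen only to mirror the assertion message quoted above.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified model of an end-offset consistency check; NOT Lucene's actual code. */
class OffsetConsistencySketch {

  /** Hypothetical token: term text, position increment, position length, offsets. */
  static final class Tok {
    final String term;
    final int posInc;
    final int posLen;
    final int startOffset;
    final int endOffset;

    Tok(String term, int posInc, int posLen, int startOffset, int endOffset) {
      this.term = term;
      this.posInc = posInc;
      this.posLen = posLen;
      this.startOffset = startOffset;
      this.endOffset = endOffset;
    }
  }

  /**
   * Returns a message for the first token whose end offset disagrees with an earlier
   * token that ends at the same position, or null if no such token exists.
   */
  static String firstInconsistency(List<Tok> tokens) {
    Map<Integer, Integer> endOffsetByEndPos = new HashMap<>();
    int pos = -1;
    for (int i = 0; i < tokens.size(); i++) {
      Tok t = tokens.get(i);
      pos += t.posInc;              // posInc == 0 keeps the token at the same position
      int endPos = pos + t.posLen;  // the position this token claims to reach
      Integer seen = endOffsetByEndPos.putIfAbsent(endPos, t.endOffset);
      if (seen != null && seen != t.endOffset) {
        return "inconsistent endOffset " + i + " pos=" + pos + " posLen=" + t.posLen
            + " token=" + t.term + " expected:<" + seen + "> but was:<" + t.endOffset + ">";
      }
    }
    return null;
  }

  public static void main(String[] args) {
    // Assumed output for the input "/a/b/c": every token stacked at position 0.
    List<Tok> stacked = List.of(
        new Tok("/a", 1, 1, 0, 2),
        new Tok("/a/b", 0, 1, 0, 4),
        new Tok("/a/b/c", 0, 1, 0, 6));
    System.out.println(firstInconsistency(stacked));
  }
}
```

In this model all three tokens start at position 0 with position length 1, so they all claim to end at position 1, yet they report end offsets 2, 4 and 6. The second token is therefore the first one to be flagged.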

msfroh · 2023-11-27T19:21:51Z

I was looking into this and the approach used for (Edge)NGramTokenizer back in 2013: a03e38d

The solution there is to always set the position increment and length to 1:

lucene/lucene/analysis/common/src/java/org/apache/lucene/analysis/ngram/NGramTokenizer.java

Lines 186 to 187 in 8ef6a0d

    posIncAtt.setPositionIncrement(1);
    posLenAtt.setPositionLength(1);

With that change, your test passes (but I had to change every other test): msfroh@0d05366

Given that it's not backward-compatible, I imagine it would have to be a change for 10.0? Whatever we do should probably be applied to ReversePathHierarchyTokenizer as well.
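As a rough illustration of what "always set the position increment and length to 1" means for path tokens, here is a plain-Java sketch. It does not use Lucene and is not the code from the linked commit: the `tokenize` helper and the `Tok` class are hypothetical, and keeping every start offset at 0 is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch of path-prefix tokens at incrementing positions; NOT the Lucene tokenizer. */
class PathPrefixPositionsSketch {

  /** Hypothetical token holding the absolute position instead of an increment. */
  static final class Tok {
    final String term;
    final int position;
    final int positionLength;
    final int startOffset;
    final int endOffset;

    Tok(String term, int position, int positionLength, int startOffset, int endOffset) {
      this.term = term;
      this.position = position;
      this.positionLength = positionLength;
      this.startOffset = startOffset;
      this.endOffset = endOffset;
    }

    @Override
    public String toString() {
      return term + " pos=" + position + " posLen=" + positionLength
          + " offsets=" + startOffset + "-" + endOffset;
    }
  }

  /** Emits one token per path prefix, each with position increment 1 and position length 1. */
  static List<Tok> tokenize(String path, char delimiter) {
    List<Tok> out = new ArrayList<>();
    int position = -1;
    for (int i = 1; i <= path.length(); i++) {
      boolean atEnd = i == path.length();
      if (atEnd || path.charAt(i) == delimiter) {
        position += 1; // increment of 1: each prefix gets its own position
        out.add(new Tok(path.substring(0, i), position, 1, 0, i));
      }
    }
    return out;
  }

  public static void main(String[] args) {
    for (Tok t : tokenize("/a/b/c", '/')) {
      System.out.println(t);
    }
  }
}
```

For `/a/b/c` this yields `/a`, `/a/b` and `/a/b/c` at positions 0, 1 and 2 with end offsets 2, 4 and 6. Each token now ends at a different position, so no two tokens that end at the same position disagree about the end offset.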

lukas-vlcek · 2023-12-04T15:32:41Z

I am going to close this PR.
I opened a new PR that has a fix for both ReversePathHierarchyTokenizer and PathHierarchyTokenizer: #12875

lukas-vlcek mentioned this pull request Nov 2, 2023

Deprecate CamelCase PathHierarchy tokenizer name opensearch-project/OpenSearch#10894

Merged

5 tasks

lukas-vlcek mentioned this pull request Dec 4, 2023

Fix position increment in (Reverse)PathHierarchyTokenizer #12875

Merged

lukas-vlcek closed this Dec 4, 2023

lukas-vlcek deleted the PathHierarchyAnalyzerTest branch December 7, 2023 17:05
