Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BaseTokenStreamTestCase.assertAnalyzesTo fails when Analyzer contains… #12750

Closed

Conversation

lukas-vlcek
Copy link
Contributor

… PathHierarchy tokenizer

Description

This PR is expected to fail. It demonstrates issue with BaseTokenStreamTestCase.assertAnalyzesTo() method in connection to PathHierarchyTokenizer.

Is there any reason why PathHierarchyTokenizer shall not be used in the test like this? There are definitely other tokenizers that are being tested like this, ie. they are wrapped in Analyzer and then assertAnalyzesTo() method is called to check the tokens. What is special about PathHierarchy tokenizer that it does not work?

I think the problem might not be in the tokenizer but in the test method itself or in the way I call it (maybe I need to pass in more parameters/flags to get rid of the issue?). The testing method is complex, especially when it gets to checkAnalysisConsistency() part.

I am looking for any useful tips. Thank you!

… PathHierarchy tokenizer

This PR is expected to fail. It demonstrates issue with BaseTokenStreamTestCase.assertAnalyzesTo
method in connection to PathHierarchy tokenizer.

Is there any reason why PathHierarchy tokenizer shall not be used in the test like this? There
are definitely other tokenizers that are being tested like this, ie. they are wrapped in Analyzer
and then assertAnalyzesTo() method is called to check the tokens. What is special about
PathHierarchy tokenizer that it does not work?

I think the problem might not be in the tokenizer but in the test method itself or in the way
I call it (maybe I need to pass in more parameters/flags to get rid of the issue?). The testing
method is complex, especially when it gets to checkAnalysisConsistency() part.

I am looking for any useful tips. Thank you!

Signed-off-by: Lukáš Vlček <[email protected]>
@mikemccand
Copy link
Member

This looks like the root cause?:

      java.lang.AssertionError: inconsistent endOffset 1 pos=0 posLen=1 token=/a/b expected:<2> but was:<4>

Indeed I think the issue is a problem with PathHierarchyTokenizer: it produces tokens all on top of one another (instead of in sequence at incrementing positions) yet the tokens claim different start/end offsets, and BaseTokenStreamTestCase detects that as a corrupt token graph. I think this tokenizer should be setting the PositionLengthAttribute as well, to indicate that each token reaches to a further position ... this should make BaseTokenStreamTestCase happy.

See this blog post for more details about how TokenStreams are actually graphs in Lucene.

@msfroh
Copy link
Contributor

msfroh commented Nov 27, 2023

I was looking into this and the approach used for (Edge)NGramTokenizer back in 2013: a03e38d

The solution there is to always set the position increment and length to 1:

posIncAtt.setPositionIncrement(1);
posLenAtt.setPositionLength(1);

With that change, your test passes (but I had to change every other test): msfroh@0d05366

Given that it's not backward-compatible, I imagine it would have to be a change for 10.0? Also, whatever we do should probably also be applied to ReversePathHierarchyTokenizer too.

@lukas-vlcek
Copy link
Contributor Author

lukas-vlcek commented Dec 4, 2023

I am going to close this PR.
I opened a new PR that has fix for ReversePathHierarchyTokenizer and PathHierarchyTokenizer: #12875

@lukas-vlcek lukas-vlcek closed this Dec 4, 2023
@lukas-vlcek lukas-vlcek deleted the PathHierarchyAnalyzerTest branch December 7, 2023 17:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants