
Possible bug in 'standard' lexer: the longest token is matched incorrectly #370

Closed
kodo-pp opened this issue Apr 25, 2019 · 11 comments

@kodo-pp

kodo-pp commented Apr 25, 2019

When I use the standard lexer (the default for the LALR parser), it sometimes determines the longest match incorrectly.

Consider the following code:

import lark

grammar = r'''
start: BINOP | ASSIGNMENT_OP

ASSIGNMENT_OP: "="
    | "+="

BINOP: "*"
    | "/"
    | "%" 
    | "+"
    | "-"
    | "long_operator"

%import common.WS
%ignore WS
'''

parser = lark.Lark(grammar, parser='lalr')
code = '+='
print(list(parser.lex(code)))

It should print [Token(ASSIGNMENT_OP, '+=')], because the longest match is += and the documentation says that in this case the longest match should be returned. However, the code prints [Token(BINOP, '+'), Token(ASSIGNMENT_OP, '=')], which is incorrect.

I suppose the lexer might be comparing not the lengths of the actually matching terminals (+ and += in this case) but the lengths of the longest possible options (long_operator and +=).

P.S. The lark version is 0.7.0, installed as the python-lark-parser package on Arch Linux.

@erezsh
Member

erezsh commented Apr 25, 2019

I suppose the lexer might be comparing not the lengths of actually matching terminals (+ and += in this case) but the lengths of the longest possible options (long_operator and +=)

Exactly. The only real fix would be to implement regexp analysis, which isn't worth the trouble.

@kodo-pp
Author

kodo-pp commented Apr 25, 2019

Then I have two questions:

  1. Why does the documentation say that the lexer behaves in the "correct" way when in fact it doesn't?

  2. Wouldn't the following be a fix? For each option, we compute the length of the longest matching token (in the case I described, long_operator would have a zero-length match while + would have a match of length 1) and then use the longest token.
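
A minimal sketch of that proposal, using only the stdlib `re` module (the terminal names and the `longest_actual_match` helper are illustrative, not lark internals):

```python
import re

# The two terminals from the grammar above. Each one contains '|',
# so each compiles to a regexp rather than a plain string.
terminals = {
    "ASSIGNMENT_OP": re.compile(r"\+=|="),
    "BINOP": re.compile(r"\*|/|%|\+|-|long_operator"),
}

def longest_actual_match(text, pos=0):
    """Pick the terminal whose *actual* match at `pos` is longest,
    rather than the one whose longest *theoretical* match is longest."""
    best = None  # (length, terminal name, matched text)
    for name, pattern in terminals.items():
        m = pattern.match(text, pos)
        if m and (best is None or len(m.group()) > best[0]):
            best = (len(m.group()), name, m.group())
    return best

print(longest_actual_match("+="))  # (2, 'ASSIGNMENT_OP', '+=')
```

Under this scheme "+=" is lexed as a single ASSIGNMENT_OP, because BINOP's actual match ("+", length 1) loses to ASSIGNMENT_OP's ("+=", length 2), even though BINOP's longest theoretical match ("long_operator") is longer.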

@whitten

whitten commented Apr 25, 2019

kodo-pp said: Then I have two questions:
1. Why does the documentation say that the lexer behaves in the "correct" way when in fact it doesn't?

As you wrote your grammar, "+=" and "=" are each one token long. Perhaps if you separated "+=" into "+" "=", the lexer would see it as two separate tokens?
LALR loves to compile things into a table, and the grammar rules are fully distinct from the lexer rules. I expect that forcing them into separate tokens will allow the parts of "+=" to be seen by the grammar rules.

2. Wouldn't the following be a fix? For each option, we compute the length of the longest matching token (in the case I described, long_operator would have a zero-length match while + would have a match of length 1) and then use the longest token.

You see something in the grammar I don't. "long_operator" just looks like a single lexer token expressed in 13 characters. Is it the name of a function/procedure, or a non-terminal I don't see?

@kodo-pp
Author

kodo-pp commented Apr 25, 2019

+= is not separated. The funny thing is that if I remove long_operator from the lexer rule, += starts to be lexed as a single token. I'll attach a Jupyter notebook demonstrating it a bit later. BTW, I'm not even running the parser itself -- I'm only using the lexer.

About long_operator: yes, you are completely right, it is a 13-character-long literal. I'm saying that instead of measuring the length of the actually matching token (+), the lexer uses the length of the longest theoretically possible token in the rule.

@kodo-pp
Author

kodo-pp commented Apr 25, 2019

I'm attaching the notebook I was talking about. It can be downloaded here

@erezsh
Member

erezsh commented Apr 26, 2019

@kodo-pp

  1. Where in the documentation does it say that? Maybe it needs fixing.

  2. What you are describing -- testing several regexps at run-time for the same token -- is exactly what Earley's dynamic lexer does. It would be too slow for LALR, although I might consider a pull request for it as an extra feature.

I should also mention that the solution to your problem is very simple: you can just split the terminal into two.
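
One way that split could look (a sketch, not erezsh's exact intent: long_operator is pulled out of BINOP into its own terminal, so BINOP's longest theoretical match is one character and "+=" wins as ASSIGNMENT_OP):

```lark
start: binop | ASSIGNMENT_OP

ASSIGNMENT_OP: "="
    | "+="

// With "long_operator" in its own terminal, BINOP's longest theoretical
// match is a single character, shorter than ASSIGNMENT_OP's "+=".
BINOP: "*" | "/" | "%" | "+" | "-"
LONG_OPERATOR: "long_operator"

binop: BINOP | LONG_OPERATOR

%import common.WS
%ignore WS
```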

@kodo-pp
Author

kodo-pp commented Apr 26, 2019

@erezsh

  1. The wiki page (section "Terminals", subsection "Terminals when using a lexer") and the corresponding readthedocs page

  2. OK, I will try to do something with it. But can I use the Earley lexer with the LALR parser, or can't they be combined like that? UPD: Sorry, I misread your message about the dynamic lexer -- I thought you were talking about the standard one.

@erezsh
Member

erezsh commented Apr 26, 2019

  1. Length of match (for regexps, the longest theoretical match is used)

Sounds to me like the docs are spot on. Can you explain why you think they are incorrect?

@kodo-pp
Author

kodo-pp commented Apr 26, 2019

@erezsh Maybe because there are no regexps in my code? As you can see, all the matches are string literals. The rule you quoted mentions the longest theoretical match only for regexps, not for string literals.

Despite this, I understand your point. I agree this is not a bug but rather a not-so-obvious caveat. Maybe it should be stated more clearly in the docs. So this issue may be closed if you want.

@erezsh
Member

erezsh commented Apr 26, 2019

Maybe because there are no regexps in my code?

Incorrect. a | b | c is a regexp. Whenever you define a terminal that is not a plain string, it's a regexp.

If you want to group different terminals together, that's what rules are for.
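
For instance, grouping could look like this (a sketch; the terminal names EQ and PLUS_EQ are illustrative): each operator stays a separate terminal with its own unambiguous match, and a rule provides the shared category the parser sees.

```lark
// A rule groups the terminals; the lexer never has to compare
// alternatives inside one big terminal regexp.
?assignment_op: EQ | PLUS_EQ

EQ: "="
PLUS_EQ: "+="
```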

If you think you can provide a better explanation, feel free to write one, and I might include it in the docs.

@erezsh erezsh closed this as completed Apr 26, 2019
@kodo-pp
Author

kodo-pp commented Apr 26, 2019

OK, now I see I was wrong. Thank you for the explanation.
