
Possible bug in 'standard' lexer: the longest token is matched incorrectly #370

Closed
kodo-pp opened this issue Apr 25, 2019 · 11 comments

@kodo-pp

kodo-pp commented Apr 25, 2019

When I use the standard lexer (the default for the LALR parser), it sometimes determines the longest match incorrectly.

Consider the following code:

import lark

grammar = r'''
start: BINOP | ASSIGNMENT_OP

ASSIGNMENT_OP: "="
    | "+="

BINOP: "*"
    | "/"
    | "%" 
    | "+"
    | "-"
    | "long_operator"

%import common.WS
%ignore WS
'''

parser = lark.Lark(grammar, parser='lalr')
code = '+='
print(list(parser.lex(code)))

It should print [Token(ASSIGNMENT_OP, '+=')], because the longest match is += and the documentation says that in this case the longest match should be returned. However, the code prints [Token(BINOP, '+'), Token(ASSIGNMENT_OP, '=')], which is incorrect.

I suppose the lexer might be comparing not the lengths of the actually matching terminals (+ and += in this case) but the lengths of the longest possible options (long_operator and +=).

P.S. The lark version is 0.7.0, installed as the python-lark-parser package on Arch Linux.

@erezsh
Member

erezsh commented Apr 25, 2019

I suppose the lexer might be comparing not the lengths of actually matching terminals (+ and += in this case) but the lengths of the longest possible options (long_operator and +=)

Exactly. The only real fix would be to implement regexp analysis, which isn't worth the trouble.

@kodo-pp
Author

kodo-pp commented Apr 25, 2019

Then I have two questions:

  1. Why does the documentation say that the lexer behaves in the "correct" way when in fact it doesn't?

  2. Wouldn't the following be a fix? For each option, we compute the length of the longest matching token (in the case I described, long_operator would have a zero-length match while + would have a match of length 1) and then use the longest token.
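
A minimal sketch of that proposal, using only the stdlib `re` module (the terminal names and the `longest_actual_match` helper are illustrative, not lark internals):

```python
import re

# The two terminals from the grammar above. Each one contains '|',
# so each compiles to a regexp rather than a plain string.
terminals = {
    "ASSIGNMENT_OP": re.compile(r"\+=|="),
    "BINOP": re.compile(r"\*|/|%|\+|-|long_operator"),
}

def longest_actual_match(text, pos=0):
    """Pick the terminal whose *actual* match at `pos` is longest,
    rather than the one whose longest *theoretical* match is longest."""
    best = None  # (length, terminal name, matched text)
    for name, pattern in terminals.items():
        m = pattern.match(text, pos)
        if m and (best is None or len(m.group()) > best[0]):
            best = (len(m.group()), name, m.group())
    return best

print(longest_actual_match("+="))  # (2, 'ASSIGNMENT_OP', '+=')
```

Under this scheme "+=" is lexed as a single ASSIGNMENT_OP, because BINOP's actual match ("+", length 1) loses to ASSIGNMENT_OP's ("+=", length 2), even though BINOP's longest theoretical match ("long_operator") is longer.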

@whitten

whitten commented Apr 25, 2019

kodo-pp said: Then I have two questions:
1. Why does the documentation say that the lexer behaves in the "correct" way when in fact it doesn't?

As you wrote your grammar, "+=" and "=" are each one token long. Perhaps if you separated "+=" into "+" "=", the lexer would see it as two separate tokens?
LALR loves to compile things into a table, and the grammar rules are fully distinct from the lexer rules. I expect that forcing them into separate tokens will allow the parts of "+=" to be seen by the grammar rules.

2. Wouldn't the following be a fix? For each option, we compute the length of the longest matching token (in the case I described, long_operator would have a zero-length match while + would have a match of length 1) and then use the longest token.

You see something in the grammar I don't. "long_operator" just looks like a single lexer token expressed in 13 characters. Is it the name of a function/procedure, or a non-terminal I don't see?

@kodo-pp
Author

kodo-pp commented Apr 25, 2019

+= is not separated. The funny thing is that if I remove long_operator from the lexer rule, += starts to be lexed as a single token. I'll attach a Jupyter notebook demonstrating it a bit later. BTW, I'm not even running the parser itself -- I'm only using the lexer.

About long_operator: yes, you are completely right, it is a 13-character-long literal. I'm saying that instead of measuring the length of the actually matching token (+), the lexer uses the length of the longest theoretically possible token in the rule.

@kodo-pp
Author

kodo-pp commented Apr 25, 2019

I'm attaching the notebook I was talking about. It can be downloaded here

@erezsh
Member

erezsh commented Apr 26, 2019

@kodo-pp

  1. Where in the documentation does it say that? Maybe it needs fixing.

  2. What you are describing -- testing several regexps at run-time for the same token -- is exactly what Earley's dynamic lexer does. It would be too slow for LALR, although I might consider a pull request for it as an extra feature.

I should also mention that the solution to your problem is very simple: you can just split the terminal into two.
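
One way that split could look (a sketch, not erezsh's exact intent: long_operator is pulled out of BINOP into its own terminal, so BINOP's longest theoretical match is one character and "+=" wins as ASSIGNMENT_OP):

```lark
start: binop | ASSIGNMENT_OP

ASSIGNMENT_OP: "="
    | "+="

// With "long_operator" in its own terminal, BINOP's longest theoretical
// match is a single character, shorter than ASSIGNMENT_OP's "+=".
BINOP: "*" | "/" | "%" | "+" | "-"
LONG_OPERATOR: "long_operator"

binop: BINOP | LONG_OPERATOR

%import common.WS
%ignore WS
```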

@kodo-pp
Author

kodo-pp commented Apr 26, 2019

@erezsh

  1. The wiki page (section "Terminals", subsection "Terminals when using a lexer") and the corresponding readthedocs page

  2. OK, I will try to do something with it. But can I use the Earley lexer with the LALR parser, or can't they be combined like that? UPD: Sorry, I misread your message about the dynamic lexer -- I thought you were talking about the standard one.

@erezsh
Member

erezsh commented Apr 26, 2019

  1. Length of match (for regexps, the longest theoretical match is used)

Sounds to me like the docs are spot on. Can you explain why you think they are incorrect?

@kodo-pp
Author

kodo-pp commented Apr 26, 2019

@erezsh Maybe because there are no regexps in my code? As you can see, all the matches are string literals. The rule you quoted mentions the longest theoretical match only for regexps, not for string literals.

Despite this, I understand your point. I agree this is not a bug but rather a not-so-obvious caveat. Maybe it should be stated more clearly in the docs. So this issue may be closed if you want.

@erezsh
Member

erezsh commented Apr 26, 2019

Maybe because there are no regexps in my code?

Incorrect. a | b | c is a regexp. Whenever you define a terminal that is not a plain string, it's a regexp.

If you want to group different terminals together, that's what rules are for.
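
For instance, grouping could look like this (a sketch; the terminal names EQ and PLUS_EQ are illustrative): each operator stays a separate terminal with its own unambiguous match, and a rule provides the shared category the parser sees.

```lark
// A rule groups the terminals; the lexer never has to compare
// alternatives inside one big terminal regexp.
?assignment_op: EQ | PLUS_EQ

EQ: "="
PLUS_EQ: "+="
```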

If you think you can provide a better explanation, feel free to write one, and I might include it in the docs.

@erezsh erezsh closed this as completed Apr 26, 2019
@kodo-pp
Author

kodo-pp commented Apr 26, 2019

OK, now I see I was wrong. Thank you for the explanation.
