Avoiding possible incompatible regexp features in future development #23021
Comments
Thank you for calling our attention to these developments. Since what you are in effect requesting is for Perl to take a certain development track going forward, at this point the best place to have this discussion is on the perl5-porters mailing list (https://www.nntp.perl.org/group/perl.perl5.porters/). That's because the initial stage of this discussion has to be seen by the widest range of people concerned with Perl's development. Once we get a consensus on what Perl's policy is with respect to keeping development in sync with PCRE2, we can use some mixture of our PPC process and this issue tracker to guide development.
On Sat, 22 Feb 2025 at 09:14, Zoltan Herczeg ***@***.***> wrote:
I am sorry if this is not the right place for such discussion. Please let
me know the right place for it.
In PCRE2 regular expression engine we have been adding some new regexp
features, and it would be good if we could avoid incompatible features in
the future, i.e. perl will not use the syntax of them for something else.
Feature flags could still be used, but it is better if we don't need to.
- This one is already released. I think there is a low chance of
reusing it.
Syntax: (*scan_substring:(CAPTURE_LIST)PATTERN) or
(*scs:(CAPTURE_LIST)PATTERN)
More about it: https://zherczeg.github.io/sljit/scan_substring.html
- The next one has a higher chance:
The (?PARNO) recursive subpattern syntax is extended with capture list:
(?PARNO:CAPTURE_LIST). The capture list is a comma separated list of
capturing brackets. The values of these captures are not restored after the
recursive matching is completed.
This is not released, so the syntax can be changed.
CC @NWilson <https://github.com/NWilson>
zherczeg created an issue (Perl/perl5#23021)
I am sorry if this is not the right place for such discussion. Please let
me know the right place for it.
This is more or less the right place for this. Long ago I reached out to
Philip Hazel to try to establish some kind of regex syntax oversight
process. Neither project has added that many, and I was too busy to pursue
it formally, so the process never "firmed up". It is very good of you to
reach out. Let's figure out a good way we can cooperate.
Please note that below, when I say "our" or "we" or "us" I mean
the Perl project, and when I say "you" or "your" I mean the PCRE project.
In PCRE2 regular expression engine we have been adding some new regexp
features, and it would be good if we could avoid incompatible features in
the future, i.e. perl will not use the syntax of them for something else.
Generally speaking we would do our best to avoid using something you add in
a way that is totally different from what you do. There are some minor
differences in how the two engines approach certain matters, so there may
be minor discrepancies, but on our side we would do our best to avoid such
problems.
However I do think it is good to involve us when you choose to add a new
construct. It may be that we have strong feelings about how things are
spelled out, and getting us involved early will prevent any differences of
opinion festering and causing bad blood between the projects.
Feature flags could still be used, but it is better if we don't need to.
Totally agree.
- This one is already released. I think there is a low chance of
reusing it.
Syntax: (*scan_substring:(CAPTURE_LIST)PATTERN) or
(*scs:(CAPTURE_LIST)PATTERN)
More about it: https://zherczeg.github.io/sljit/scan_substring.html
I have no objection to this really. Especially as it is already released. I
am moderately disappointed that we did not come up with a convention for
when to use uppercase and when to use lowercase in these "verb like"
constructs [that is (*IDENTIFIER) style meta-patterns and directives], but
that may have been us, and not you. Perhaps it would be a good idea if we
took a bit of time to think of some conventions so we don't end up with a
mess in the long run. I am not super keen on short-forms like 'scs' but I
can live with it.
So for instance when I originally added verbs they were all uppercase,
later on Karl added some that did not have to do with controlling match
behavior, and they were made to be lower-case (IIRC). IMO it would be nice
if we could have some bright-line guidance like that which made it more
obvious why a given construct was upper-case or lower-case, or used a
particular convention in how it expressed things.
I say this because one of the reasons Perl syntax caught on and became the
dominant syntax was that Larry cleaned up the earlier conventions so that
it was much easier to remember what happened when something was
back-slashed. E.g., in Perl syntax backslash-non-alphanumeric like \[ is a
literal and NOT a meta-character, and backslash-alphanumeric is a
meta-character and NOT a literal. Other regex engines have a weird mix of
both cases, which makes it hard to remember how to write a pattern (e.g., in
vim \< is a left-sided word-break). Given this precedent it would be good
if we had a convention which was easy to remember and understand.
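The backslash rule described above can be checked mechanically. The sketch below uses Python's `re` module as a stand-in, on the assumption that it follows the same convention for these cases. The helper name `escaped_punctuation_is_literal` is made up for illustration and is not part of Perl or PCRE2.

```python
import re
import string


def escaped_punctuation_is_literal() -> bool:
    """Return True if backslash + ASCII punctuation always matches that character literally."""
    return all(
        re.fullmatch("\\" + ch, ch) is not None
        for ch in string.punctuation
    )


# Backslash + non-alphanumeric: always a literal, never a meta-character.
punctuation_ok = escaped_punctuation_is_literal()

# Backslash + alphanumeric: a meta-sequence, not the literal letter.
digit_class_matches_digit = re.fullmatch(r"\d", "7") is not None
digit_class_matches_letter = re.fullmatch(r"\d", "d") is not None
```

With this rule a pattern author can escape any punctuation character defensively without having to remember whether the escape turns it into something special.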
- The next one has a higher chance:
The (?PARNO) recursive subpattern syntax is extended with capture list:
(?PARNO:CAPTURE_LIST). The capture list is a comma separated list of
capturing brackets. The values of these captures are not restored after the
recursive matching is completed.
This is not released, so the syntax can be changed.
I have no strong objection to this. The syntax sounds reasonable. The
intent seems reasonable. Whether or not it is doable in the current Perl
engine is another question. But I definitely think it would be a nice
feature and I see no reason we would not follow your precedent. I do
/kinda/ wonder if we are setting ourselves up for problems in terms of
establishing some conventions for this. In most of the verbs a colon suffix
indicates a mark name (something I thought would be used much more than
has proved to be the case!), and in this case we are adding a colon suffix
meaning capture lists. This kinda bothers me. As does the fact that in
your example above
(*scan_substring:(CAPTURE_LIST)PATTERN)
the capture list is in parens, but in the PARNO proposal case the parens
are not required. I am inclined to say one of the two is wrong. From a
language design and language learning perspective using a common form for
similar things makes the language easier to learn. So perhaps given
"scan_substring" is already released it would be better to make the PARNO
case also be parenthesized. On the other hand, it seems to me a better
approach would be if 'scan_substring' was of the form:
(*scan_substring:CAPTURE_LIST:pattern)
but maybe it is too late to change it.
There is something to be said for a rule like "when a capture list is
specified in a meta-pattern it MUST be parenthesized, and be comma
separated". So then the PARNO case would be
(?PARNO:(CAPTURE_LIST))
The mnemonic being that parens inside of a verb-like meta-pattern should
always contain a list of capture names or indexes.
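To make the proposed rule concrete, here is a minimal sketch of a recognizer for the parenthesized, comma-separated form `(?PARNO:(CAPTURE_LIST))`. It is purely illustrative: the function `parse_recursion`, the restriction of PARNO to an optionally signed number, and the restriction of list items to bare numbers or names are all assumptions made here, not the grammar of Perl or PCRE2.

```python
import re

# Hypothetical token grammar, assumed for illustration only:
#   (?PARNO)                  plain recursion
#   (?PARNO:(CAPTURE_LIST))   recursion with a parenthesized capture list
PARNO_WITH_CAPTURES = re.compile(
    r"\(\?(?P<parno>[+-]?\d+)"         # recursion target, e.g. 1, +1, -2
    r"(?::\((?P<captures>[^()]*)\))?"  # optional :(CAPTURE_LIST)
    r"\)"
)


def parse_recursion(token: str) -> tuple[str, list[str]]:
    """Split a recursion token into its target and its list of captures."""
    m = PARNO_WITH_CAPTURES.fullmatch(token)
    if m is None:
        raise ValueError(f"not a recursion token: {token!r}")
    raw = m.group("captures")
    if raw is None:
        return m.group("parno"), []
    captures = [item.strip() for item in raw.split(",")]
    # Assumption: each item is a capture index or a bare capture name.
    if any(re.fullmatch(r"\w+", item) is None for item in captures):
        raise ValueError(f"bad capture list in {token!r}")
    return m.group("parno"), captures
```

Under these assumptions `parse_recursion("(?1:(2,3))")` yields the target `"1"` and the captures `["2", "3"]`, while the unparenthesized `"(?1:2,3)"` is rejected, which is the bright-line behaviour the rule asks for.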
Anyway, thanks for reaching out to us, we really should formalize our
relationship so that we don't add things that mess up your plans, and vice
versa.
At a certain level it would be nice if we both used the same code, but I
doubt that will ever happen.
cheers,
Yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"
Thank you for the feedback! I remember we had discussions about the syntax several years ago, but I could not find where. It would be great to continue those plans. Perhaps setting up a low-traffic mailing list for it? It looks like I totally misunderstood the naming of (*id: constructs.
On Sun, 23 Feb 2025 at 17:47, Zoltan Herczeg ***@***.***> wrote:
Thank you for the feedback! I remember we had discussions about the syntax
several years ago, but I could not find where. It would be great to
continue those plans. Perhaps setting up a low-traffic mailing list for it?
It looks like I totally misunderstood the naming of (*id: constructs. I
thought capital letters are reserved for verbs exclusively, and lowercase
letters for generic constructs. Perl has some: (*script_run: or (*pla:.
In PCRE2, we have non-atomic versions, such as (*napla:.
The (*scan_substring:(CAPTURE_LIST)PATTERN) tried to be similar to
conditional blocks: (?(condition)yes-pattern|no-pattern), the ? is
replaced by *scan_substring:, which represents the "command", and the
condition is extended to a list. I suspect this feature is less interesting
for perl, since captures are available as variables, and code blocks can be
nested into patterns.
(?PARNO:(CAPTURE_LIST)) was one of the variants we were discussing to
use, so we will change the syntax. Honestly, any syntax is good for me as
long as it is not overly complex.
zherczeg left a comment (Perl/perl5#23021)
Thank you for the feedback! I remember we had discussions about the syntax
several years ago, but I could not find where. It would be great to
continue those plans. Perhaps setting up a low-traffic mailing list for it?
I think that would be a good idea.
It looks like I totally misunderstood the naming of (*id: constructs. I
thought capital letters are reserved for verbs exclusively, and lowercase
letters for generic constructs. Perl has some: (*script_run: or (*pla:.
In PCRE2, we have non-atomic versions, such as (*napla:.
It may be that I am the one in the wrong here. Perhaps we just need to
clarify where we are now so we don't shoot ourselves in the foot in the
future.
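For readers unfamiliar with the lowercase `(*name:` constructs mentioned above: in Perl, `(*pla:X)` is an alphabetic spelling of the lookahead `(?=X)`, and similarly for the other lookaround assertions. The sketch below maps the alphabetic forms to the symbolic ones with a naive text substitution so they can be tried in Python's `re`, which only understands the symbolic spelling. The function `to_symbolic` is hypothetical, and the substitution deliberately ignores escaping and character classes.

```python
import re

# Perl's alphabetic lookaround names and their symbolic equivalents.
ALPHA_TO_SYMBOLIC = {
    "(*pla:": "(?=",
    "(*positive_lookahead:": "(?=",
    "(*nla:": "(?!",
    "(*negative_lookahead:": "(?!",
    "(*plb:": "(?<=",
    "(*positive_lookbehind:": "(?<=",
    "(*nlb:": "(?<!",
    "(*negative_lookbehind:": "(?<!",
}


def to_symbolic(pattern: str) -> str:
    """Naively rewrite alphabetic assertions into their symbolic form.

    Illustration only: this does not handle escaped parentheses or
    occurrences inside character classes.
    """
    for alpha, symbolic in ALPHA_TO_SYMBOLIC.items():
        pattern = pattern.replace(alpha, symbolic)
    return pattern


translated = to_symbolic(r"foo(*pla:bar)")
match = re.search(translated, "foobar")
```

Here `translated` is `foo(?=bar)`, and the match covers only `foo`, because the lookahead checks for `bar` without consuming it.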
The (*scan_substring:(CAPTURE_LIST)PATTERN) tried to be similar to
conditional blocks: (?(condition)yes-pattern|no-pattern), the ? is
replaced by *scan_substring:, which represents the "command", and the
condition is extended to a list. I suspect this feature is less interesting
for perl, since captures are available as variables, and code blocks can be
nested into patterns.
Ah, that is also something we must consider, your use case is more generic
than ours, so we must be flexible to your needs.
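The conditional-block form that `(*scan_substring:(CAPTURE_LIST)PATTERN)` is said to be modelled on, `(?(condition)yes-pattern|no-pattern)`, can be tried in Python's `re`, which supports the group-number variant of it. This is only a sketch of the conditional itself, not of scan_substring. The pattern and the helper `is_balanced_number` are invented for illustration.

```python
import re

# Group 1 optionally captures "(". The conditional (?(1)\)) then demands a
# closing ")" only when group 1 took part in the match; the no-pattern is empty.
BALANCED_NUMBER = re.compile(r"(\()?\d+(?(1)\))")


def is_balanced_number(text: str) -> bool:
    """True for digits, optionally wrapped in one matching pair of parentheses."""
    return BALANCED_NUMBER.fullmatch(text) is not None
```

The parenthesized part directly after `(?` names what is being tested, which is the position that scan_substring reuses for its capture list.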
(?PARNO:(CAPTURE_LIST)) was one of the variants we were discussing to
use, so we will change the syntax. Honestly, any syntax is good for me as
long as it is not overly complex.
I agree more or less with the caveat that whatever we do should be easy to
remember and not contain contradictions.
I am not the most inspired in regard to language design, which is why I
cc'ed the people I did on this. They have long played a role at some level
in these discussions, and broader feedback I think can only help. How long
would you be comfortable waiting for us to make a decision? Are you bursting to
release this ASAP, or can we wait a few weeks for people to mull it over?
Yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"
We have just released the code so we have at least six months before the next one. Plenty of time to make any decisions.
I am sorry if this is not the right place for such discussion. Please let me know the right place for it.
In PCRE2 regular expression engine we have been adding some new regexp features, and it would be good if we could avoid incompatible features in the future, i.e. perl will not use the syntax of them for something else. Feature flags could still be used, but it is better if we don't need to.
- This one is already released. I think there is a low chance of reusing it.
Syntax: (*scan_substring:(CAPTURE_LIST)PATTERN) or (*scs:(CAPTURE_LIST)PATTERN)
More about it: https://zherczeg.github.io/sljit/scan_substring.html
- The next one has a higher chance:
The (?PARNO) recursive subpattern syntax is extended with capture list: (?PARNO:CAPTURE_LIST). The capture list is a comma separated list of capturing brackets. The values of these captures are not restored after the recursive matching is completed.
This is not released, so the syntax can be changed.
CC @NWilson