Regular expression whitespace character has different meanings in UTF-8 and single-byte modes #8541
From [email protected]
Created by [email protected]
non-breaking space (hex 0xA0) is considered as whitespace
Example 1:
-- cut here --
my ($s); $s = " \t\xA0"; print "<$s>\n";
Execute this script and you'll see: '<!! >' and if you'll uncomment
Perl Info
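A minimal sketch of the behaviour being reported. It is not the reporter's original script, which is truncated above. It assumes no `use 5.12`-or-later feature bundle, no `use locale` and no `/u` modifier is in effect, and it uses `utf8::upgrade` to change only the internal representation of the string. The helper name `is_space` is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the string contains a \s character, else 0.
sub is_space { return $_[0] =~ /\s/ ? 1 : 0 }

my $bytes = "\xA0";      # NBSP held in the single-byte representation
my $utf8  = "\xA0";
utf8::upgrade($utf8);    # same character, UTF-8 internal representation

printf "strings compare equal: %s\n", $bytes eq $utf8 ? "yes" : "no";
printf "single-byte matches \\s: %d\n", is_space($bytes);
printf "UTF-8 matches \\s:       %d\n", is_space($utf8);
```

Under those assumptions the two strings compare equal, yet only the upgraded one matches `\s`.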
|
From @demerphq
On 7/23/06, via RT shadrinovdd @ ystu. ru <perlbug-followup@perl.org> wrote:
I don't believe that this can be characterized as a bug. Unicode has -- |
The RT System itself - Status changed from 'new' to 'open' |
From @druud62
shadrinovdd @ ystu . ru wrote:
Sure is funny: perl -MEncode -wle ' perl -MEncode -wle ' -- "Gewoon is een tijger." |
From [email protected]
On Mon, 24 Jul 2006 14:26:58 +0200, "Dr.Ruud" wrote
Don't use Encode::_utf8_on. The documentation of Encode.pm says: =item _utf8_on(STRING) Instead, use utf8::upgrade. use charnames (); The argument to charnames::viacode should be a Unicode code point Regards, |
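A short sketch of the two calls recommended here, `utf8::upgrade` and `charnames::viacode`. The helper name `upgraded_copy` is an assumption for illustration and does not come from the thread.

```perl
use strict;
use warnings;
use charnames ();

# Hypothetical helper: return a copy of the string in the UTF-8 internal
# representation. utf8::upgrade changes the representation only, never the value.
sub upgraded_copy {
    my ($s) = @_;
    utf8::upgrade($s);
    return $s;
}

my $nbsp = upgraded_copy("\xA0");

printf "length %d, utf8 flag %s\n", length($nbsp), utf8::is_utf8($nbsp) ? "on" : "off";

# charnames::viacode takes a code point (a number), not an encoded byte sequence.
printf "U+%04X is %s\n", ord($nbsp), charnames::viacode(ord $nbsp);
```

Unlike `Encode::_utf8_on`, which only flips the internal flag, `utf8::upgrade` re-encodes the string so that it still holds the same characters.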
From [email protected]
Quoth demerphq@gmail.com (demerphq):
Nevertheless, it is hardly sensible for regexps to match differently Ben -- |
From @druud62
SADAHIRO Tomoyuki wrote:
OK, thanks. -- "Gewoon is een tijger." |
From @demerphq
On 7/24/06, Ben Morrow <benmorrow@tiscali.co.uk> wrote:
I don't think I agree. ASCII is a seven bit encoding. Not an eight bit <blockquote source='http://en.wikipedia.org/wiki/ASCII'> While ASCII basically uses 7-bit codes, Unicode and the UCS use </blockquote> The char classes in unicode and the char classes in traditional perl So it seems to me the only sensible approach is that which we have Hope that clears things up for you. Cheers, -- |
From [email protected]
Quoth demerphq@gmail.com (demerphq):
<snip ASCII is 7bit, so 8bit chars aren't well defined> Of course, you are right about this. However, Perl 5.8 treats 8bit
I couldn't agree more, and it is definitely a problem. However, do you #!/usr/bin/perl $_ = chr 0xA0; is extremely unintuitive, to say the least? Adding and removing a non-\s The nearest thing to a solution I can see is to have regexps An alternative would be to ensure that strings are always downgraded if Basically I'm arguing that since users are not supposed to concern Ben -- |
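The point about adding and removing a character can be sketched as follows. This is an illustration under the default matching rules (no feature bundle, no `/u`), not the script from the message above, which is truncated. The helper names `ws` and `add_and_remove` are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: report whether \s matches under Perl's default rules.
sub ws { return $_[0] =~ /\s/ ? "matches" : "does not match" }

# Hypothetical helper: append a character above 0xFF, then remove it again.
# The append upgrades the string to UTF-8 internally; chop does not downgrade it.
sub add_and_remove {
    my ($s) = @_;
    $s .= chr 0x100;
    chop $s;
    return $s;
}

my $before = chr 0xA0;
my $after  = add_and_remove($before);

print "before: \\s ", ws($before), "\n";
print "after:  \\s ", ws($after),  "\n";
print "same string value: ", ($before eq $after ? "yes" : "no"), "\n";
```

The string value is unchanged by the round trip, but its internal representation is not, and that alone changes the result of the match.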
From @druud62
Ben Morrow wrote:
Yes. 0xA0 is the NO-BREAK SPACE in ISO-8859-1, but it is not
Well, "use downgrade" could be the default, so "no downgrade" is -- "Gewoon is een tijger." |
From @rgarcia
On 24/07/06, Ben Morrow <benmorrow@tiscali.co.uk> wrote:
The module encoding::warnings is designed to produce warnings against $ perl test.pl (However, with blead, I get a segfault instead...) |
From @ysth
On Mon, Jul 24, 2006 at 07:15:44PM +0100, Ben Morrow wrote:
No, it doesn't. What gave you the idea it did? |
From @druud62
Yitzchak Scott-Thoennes wrote:
Maybe from this: <quote src="encoding"> By default, if strings operating under byte semantics and strings with The encoding pragma changes this to use the specified encoding instead. -- "Gewoon is een tijger." |
From @davidnicol
So are you saying we should add 0xA0 to the \s class so there is consistency, |
From @druud62
"David Nicol" wrote:
1. I vote for auto-downgrading as default. -- "Gewoon is een tijger." |
From @davidnicol
On 7/25/06, Dr.Ruud <rvtol+news@isolution.nl> wrote:
your request is to remove the non-breaking space from the UTF8 \s class. That seems reasonable, but treating non-breaking space as space no nonbreakingspaceinwhitespaceclass; Entering software architect mode: the regex parser would have to check that flag in the hints table at regex So a pragma-compliant locale would provide a list of its NBSPs that are Or in general do we want to leave NBSPs out of \s rather than including all If \s stopped matching NBSP what would that do? Would NBSP still match \b? I have more questions than answers. |
From @druud62
David Nicol:
Well, I would like to see the orthogonality recognized. A non-breaking
I consider that a Good Thing. -- "Gewoon is een tijger." |
From [email protected]
Quoth rvtol+news@isolution.nl ("Dr.Ruud"):
Well, the Unicode consortium disagrees with you: they consider nbsp to
Really it doesn't. \w is by definition the set of characters valid in
Well, they can be either, and in the case of *my* filenames they're I was never arguing that \s matching nbsp was inherently wrong (and, of Ben -- |
From [email protected]
On Tue Jul 25 10:03:11 2006, davidnicol@gmail.com wrote:
I think 0xA0 should be excluded from \s class in both encodings It is used only to glue words together. For example, compacting spaces This problem already exists in module HTML::TreeBuilder v3.13 line 1103 |
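One way to compact spaces while leaving U+00A0 alone, whatever the string's internal representation, is to exclude it from the character class explicitly. This is only a sketch of that idea. It is not the code in HTML::TreeBuilder, and the helper name `collapse_ws` is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse each run of whitespace to a single space,
# but never touch U+00A0. The class [^\S\xA0] reads "neither non-whitespace
# nor NBSP", so it excludes NBSP whether or not \s would match it.
sub collapse_ws {
    my ($text) = @_;
    $text =~ s/[^\S\xA0]+/ /g;
    return $text;
}

my $in  = "a \t\n b\xA0\xA0c";
my $out = collapse_ws($in);

printf "%d characters in, %d characters out\n", length($in), length($out);
```

Because NBSP is named in the class, the result does not depend on whether the input happens to be stored as single bytes or as UTF-8.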
From @demerphq
On 7/26/06, Ben Morrow <benmorrow@tiscali.co.uk> wrote:
Note that \s is explicitly documented to match different things in I've been too busy to investigate this further. Yves -- |
From @druud62
Ben Morrow wrote:
And I don't want to change that. No wait, that's not entirely true, I am
Use a zero-width one. <g>
Yes, but functionally. When such a filename doesn't fit on the remaining See also RFC 1738 about line break and hyphen, they should have made the
Yes, that is the most practical. The best candidate for non-breaking -- "Gewoon is een tijger." |
From [email protected]
On Wed, 26 Jul 2006 16:32:34 +0200, "Dr.Ruud" wrote
According to Word Boundaries by Unicode Standard Annex #29, NO-BREAK SPACE is treated as a letter (ALetter); the position between Therefore (if I understand correctly) The Unicode Consortium recommends Regards |
From @druud62
SADAHIRO Tomoyuki wrote:
Thanks, also for an hour well spent on the Unicode site again. Perl 5 is With Unicode 4.0.1 (March 2004), a category 'Cf' (Format) is added (the The U+00A0 is still in 'Zs' (Space Separator), and marked with 'CS' If \s gets \xA0, then I guess that split(//) will split on \xA0 as well. -- "Gewoon is een tijger." |
From [email protected]
Dr.Ruud <rvtol+news@isolution.nl> writes:
UTF-8 \s and \w come from Unicode standard tables. Personally I think treating it as \s or at least matching \b make
Not everywhere.
|
From [email protected]
This is the same bug that I reported last year: I have also submitted a patch for the HTML::TreeBuilder problem: |
[email protected] - Status changed from 'open' to 'stalled' |
From @cpansprout
This has been resolved in perl 5.14.0, as long as ‘use 5.14.0’ or ‘use (Or /foo/u or ‘use re '/u'’.) |
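A brief sketch of the `/u` modifier and `use re '/u'` mentioned above, compared with the default rules. It needs perl 5.14 or later, and it deliberately omits `use 5.14.0` so that the default case stays visible. The helper names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helpers; each returns 1 if \s matches the argument, else 0.
sub ws_default { return $_[0] =~ /\s/  ? 1 : 0 }               # default rules
sub ws_u       { return $_[0] =~ /\s/u ? 1 : 0 }               # explicit /u
sub ws_pragma  { use re '/u'; return $_[0] =~ /\s/ ? 1 : 0 }   # /u applied lexically

my $bytes = "\xA0";
my $utf8  = "\xA0";
utf8::upgrade($utf8);

for my $case ([ 'single-byte', $bytes ], [ 'UTF-8', $utf8 ]) {
    my ($label, $str) = @$case;
    printf "%-11s default=%d /u=%d use re '/u'=%d\n",
        $label, ws_default($str), ws_u($str), ws_pragma($str);
}
```

With `/u` in effect, by either route, the match no longer depends on the string's internal representation.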
@cpansprout - Status changed from 'stalled' to 'resolved' |
Migrated from rt.perl.org#39912 (status was 'resolved')
Searchable as RT39912$