Regular expression whitespace character has different meanings in UTF-8 and single-byte modes #8541

p5pRT · 2006-07-23T13:34:04Z

Migrated from rt.perl.org#39912 (status was 'resolved')

Searchable as RT39912$

p5pRT · 2006-07-23T13:34:05Z

From shadrinovdd@ystu.ru

Created by shadrinovdd@ystu.ru

non-breaking space (hex 0xA0) is considered as whitespace
in regexp (\s) if scalar has an UTF8 flag on.

Example 1:

-- cut here --
use Encode;
use strict;

my ($s);

$s = " \t\xA0";
#Encode::_utf8_on($s); # uncomment this line
#$s = Encode::decode('windows-1251', $s); # this works too
$s =~ s/\s/!/g;

print "<$s>\n";
-- cut here --

Run this script and you'll see '<!! >' (the last character is the
unreplaced 0xA0). If you uncomment the marked line, you'll see '<!!!>'.
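The same effect can be reproduced without the internal Encode::_utf8_on, which only happens to be safe here when the bytes are valid UTF-8. The following is a minimal sketch, assuming a perl where the default regex rules apply (no `unicode_strings` feature in scope); the helper name `count_whitespace` is made up for illustration. utf8::upgrade changes only the internal representation, so the two strings still compare equal.

```perl
use strict;
use warnings;

# Hypothetical helper: count how many characters of $s match \s.
sub count_whitespace {
    my ($s) = @_;
    my $count = () = $s =~ /\s/g;
    return $count;
}

my $bytes    = " \t\xA0";      # native 8-bit string, UTF8 flag off
my $upgraded = $bytes;
utf8::upgrade($upgraded);      # same characters, UTF8 flag on

printf "flag off: %d whitespace characters\n", count_whitespace($bytes);
printf "flag on:  %d whitespace characters\n", count_whitespace($upgraded);
printf "strings compare equal: %s\n", $bytes eq $upgraded ? "yes" : "no";
```

The counts differ even though `eq` reports the strings as identical, which is the inconsistency this report is about.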

Perl Info


Flags:
    category=core
    severity=low

Site configuration information for perl v5.8.8:

Configured by ALT Linux Team at Wed May 17 00:52:57 MSD 2006.

Summary of my perl5 (revision 5 version 8 subversion 8) configuration:
  Platform:
    osname=linux, osvers=2.4.26-std-smp-alt13, archname=i386-linux-thread-multi
    uname='linux mash.office.altlinux.ru 2.4.26-std-smp-alt13 #1 smp sun nov 13 23:35:06 msk 2005 i686 gnulinux '
    config_args='-de -rs -Darchname=i386-linux -Dd_dosuid -Ud_csh -Dlibswanted=dl m c crypt db ndbm gdbm -Duseshrplib -Dlibperl=libperl.so.5.8 -Dcc=gcc -Doptimize=-pipe -Wall -Os -march=i586 -mtune=pentium4 -D_GNU_SOURCE -momit-leaf-frame-pointer -Dcccdlflags=-fPIC -DPIC -Dccdlflags=-rdynamic -Wl,-O1 -Dlddlflags=-shared -Wl,-O1 -Dldflags=-Wl,-O1 -Dprefix=/usr -Dprivlib=/usr/lib/perl5 -Darchlib=/usr/lib/perl5/i386-linux -Dvendorprefix=/usr -Dvendorlib=/usr/lib/perl5/vendor_perl -Dvendorarch=/usr/lib/perl5/vendor_perl/i386-linux -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dvendorman1dir=/usr/share/man/man1 -Dvendorman3dir=/usr/share/man/man3 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/lib/perl5/site_perl/5.8.8 -Dsitearch=/usr/local/lib/perl5/site_perl/5.8.8/i386-linux -Dsiteman1dir=/usr/local/man/man1 -Dsite_man3dir=/usr/local/man/man3 -Dcf_by=ALT Linux Team -Dcf_email=Alexey Tourbin <[email protected]> -Dmyhostname=localhost -Dperladmin=root@localhost -Dmyuname=
 Linux 2.4.26-std-smp-alt13 i686 -Dnewmyuname=Linux 2.4.26-std-smp-alt13 i686 -Dinc_version_list=5.8.7/i386-linux 5.8.6/i386-linux 5.8.5/i386-linux 5.8.4/i386-linux 5.8.3/i386-linux 5.8.2/i386-linux 5.8.1/i386-linux 5.8.0/i386-linux 5.8.7 5.8.6 5.8.5 5.8.4 5.8.3 5.8.2 5.8.1 5.8.0 5.6.1 5.6.0 -Dpager=/usr/bin/less -isR -Di_shadow -Di_syslog -Dusethreads -Duseithreads -Duselargefiles -Di_db -Di_gdbm -Di_ndbm -Di_sdbm -Ui_odbm'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -I/usr/include/gdbm',
    optimize='-pipe -Wall -Os -march=i586 -mtune=pentium4 -D_GNU_SOURCE -momit-leaf-frame-pointer',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -fno-strict-aliasing -pipe -I/usr/local/include -I/usr/include/gdbm'
    ccversion='', gccversion='4.1.0 20060515 (ALT Linux, build 4.1.0-alt3)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='gcc', ldflags ='-Wl,-O1 -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-ldl -lm -lpthread -lc -lcrypt -ldb -lgdbm
    perllibs=-ldl -lm -lpthread -lc -lcrypt
    libc=/lib/libc-2.3.6.so, so=so, useshrplib=true, libperl=libperl.so.5.8
    gnulibc_version='2.3.6'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic -Wl,-O1'
    cccdlflags='-fPIC -DPIC', lddlflags='-shared -Wl,-O1 -L/usr/local/lib'

Locally applied patches:
    


@INC for perl v5.8.8:
    /etc/perl5
    /usr/lib/perl5/i386-linux
    /usr/lib/perl5
    /usr/local/lib/perl5/site_perl/5.8.8/i386-linux
    /usr/local/lib/perl5/site_perl/5.8.8
    /usr/local/lib/perl5/site_perl
    /usr/lib/perl5/vendor_perl/i386-linux
    /usr/lib/perl5/vendor_perl
    /usr/lib/perl5/vendor_perl
    .


Environment for perl v5.8.8:
    HOME=/root
    LANG=POSIX
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/root/bin:/sbin:/usr/sbin:/usr/local/sbin:/bin:/usr/bin:/usr/local/bin:/usr/X11R6/bin
    PERL_BADLANG (unset)
    SHELL=/bin/bash

p5pRT · 2006-07-24T07:06:54Z

From @demerphq

On 7/23/06, via RT shadrinovdd @ ystu. ru <perlbug-followup@perl.org> wrote:

# New Ticket Created by shadrinovdd@ystu.ru
# Please include the string: [perl #39912]
# in the subject line of all future correspondence about this issue.
# <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=39912 >

This is a bug report for perl from shadrinovdd@ystu.ru,
generated with the help of perlbug 1.35 running under perl v5.8.8.

-----------------------------------------------------------------
[Please enter your report here]

non-breaking space (hex 0xA0) is considered as whitespace
in regexp (\s) if scalar has an UTF8 flag on.

I don't believe that this can be characterized as a bug. Unicode has
its own specific meaning for "whitespace" that is a little different
from the definition that perl has always used for ASCII.

--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT · 2006-07-24T07:06:55Z

The RT System itself - Status changed from 'new' to 'open'

p5pRT · 2006-07-24T12:42:29Z

From @druud62

shadrinovdd @ ystu . ru wrote:

non-breaking space (hex 0xA0) is considered as whitespace
in regexp (\s) if scalar has an UTF8 flag on.

Sure is funny:

perl -MEncode -wle '
for (0..0xBF) {
$_ = chr;
Encode::_utf8_on $_;
/\A\s\z/ and printf "%.2x\n", ord
}
'
09
0a
0c
0d
20
Malformed UTF-8 character (unexpected continuation byte 0x89, with no
preceding start byte) in ord at -e line 5.
00
Malformed UTF-8 character (unexpected continuation byte 0x8a, with no
preceding start byte) in ord at -e line 5.
00
Malformed UTF-8 character (unexpected continuation byte 0x8b, with no
preceding start byte) in ord at -e line 5.
00
Malformed UTF-8 character (unexpected continuation byte 0x8d, with no
preceding start byte) in ord at -e line 5.
00
Malformed UTF-8 character (unexpected continuation byte 0x8e, with no
preceding start byte) in ord at -e line 5.
00

perl -MEncode -wle '
use charnames ();
for(0..0xBF) {
$c = chr;
Encode::_utf8_on $c;
$c =~ /\A\s\z/ and printf "%.2x %s\n", $_, charnames::viacode($_)
}
'
09 CHARACTER TABULATION
0a LINE FEED (LF)
0c FORM FEED (FF)
0d CARRIAGE RETURN (CR)
20 SPACE
89 CHARACTER TABULATION WITH JUSTIFICATION
8a LINE TABULATION SET
8b PARTIAL LINE FORWARD
8d REVERSE LINE FEED
8e SINGLE SHIFT TWO

--
Affijn, Ruud

"Gewoon is een tijger."

p5pRT · 2006-07-24T13:57:15Z

From [email protected]

On Mon, 24 Jul 2006 14:26:58 +0200, "Dr.Ruud" wrote

Sure is funny:

perl -MEncode -wle '
for (0..0xBF) {
$_ = chr;
Encode::_utf8_on $_;
/\A\s\z/ and printf "%.2x\n", ord
}
'
Malformed UTF-8 character (unexpected continuation byte 0x89, with no
preceding start byte) in ord at -e line 5.
00

Don't use Encode::_utf8_on. The document of Encode.pm says:

=item _utf8_on(STRING)
[INTERNAL] Turns on the UTF-8 flag in STRING. The data in STRING is
B<not> checked for being well-formed UTF-8. Do not use unless you
^^^^^^^^^^^^^^ ^^^^^^^^^^
B<know> that the STRING is well-formed UTF-8. ......

Instead, use utf8::upgrade.

use charnames ();
for(0..0xBF) {
$c = chr;
utf8::upgrade($c);
$c =~ /\A\s\z/ and printf "%.2x %s\n",
$_, charnames::viacode(unpack 'U', $c)
}
__END__
09 CHARACTER TABULATION
0a LINE FEED (LF)
0c FORM FEED (FF)
0d CARRIAGE RETURN (CR)
20 SPACE
85 NEXT LINE (NEL)
a0 NO-BREAK SPACE

The argument to charnames::viacode should be a Unicode code point
but not a native code point, and then unpack 'U' is preferred to ord.
(Don't worry, EBCDIC won't be on you.)

Regards,
SADAHIRO Tomoyuki

p5pRT · 2006-07-24T16:11:38Z

From [email protected]

Quoth demerphq@gmail.com (demerphq):

On 7/23/06, via RT shadrinovdd @ ystu. ru <perlbug-followup@perl.org> wrote:

# New Ticket Created by shadrinovdd@ystu.ru
# Please include the string: [perl #39912]
# in the subject line of all future correspondence about this issue.
# <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=39912 >

This is a bug report for perl from shadrinovdd@ystu.ru,
generated with the help of perlbug 1.35 running under perl v5.8.8.

-----------------------------------------------------------------
[Please enter your report here]

non-breaking space (hex 0xA0) is considered as whitespace
in regexp (\s) if scalar has an UTF8 flag on.

I don't believe that this can be characterized as a bug. Unicode has
its own specific meaning for "whitespace" that is a little different
from the definition that perl has always used for ASCII.

Nevertheless, it is hardly sensible for regexps to match differently
depending on whether the string has the (internal-only) utf8 flag on. A
utf8 string that just happens to only contain characters below 0x100
should behave the same as a non-utf8 string, I'd have thought.

Ben

--
The cosmos, at best, is like a rubbish heap scattered at random.
Heraclitus
benmorrow@tiscali.co.uk

p5pRT · 2006-07-24T16:42:55Z

From @druud62

SADAHIRO Tomoyuki wrote:

Dr.Ruud:

Malformed UTF-8 character (unexpected continuation byte 0x89, with no
preceding start byte) in ord at -e line 5.
00

Don't use Encode::_utf8_on. [...]
Instead, use utf8::upgrade.

OK, thanks.

--
Affijn, Ruud

"Gewoon is een tijger."

p5pRT · 2006-07-24T16:48:26Z

From @demerphq

On 7/24/06, Ben Morrow <benmorrow@tiscali.co.uk> wrote:

Quoth demerphq@gmail.com (demerphq):

On 7/23/06, via RT shadrinovdd @ ystu. ru <perlbug-followup@perl.org> wrote:

# New Ticket Created by shadrinovdd@ystu.ru
# Please include the string: [perl #39912]
# in the subject line of all future correspondence about this issue.
# <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=39912 >

This is a bug report for perl from shadrinovdd@ystu.ru,
generated with the help of perlbug 1.35 running under perl v5.8.8.

-----------------------------------------------------------------
[Please enter your report here]

non-breaking space (hex 0xA0) is considered as whitespace
in regexp (\s) if scalar has an UTF8 flag on.

I don't believe that this can be characterized as a bug. Unicode has
its own specific meaning for "whitespace" that is a little different
from the definition that perl has always used for ASCII.

Nevertheless, it is hardly sensible for regexps to match differently
depending on whether the string has the (internal-only) utf8 flag on. A
utf8 string that just happens to only contain characters below 0x100
should behave the same as a non-utf8 string, I'd have thought.

I don't think I agree. ASCII is a seven-bit encoding, not an eight-bit
encoding, so any character in the range 0x80-0xFF is not well
defined. (There are multiple "extended ASCII" character sets.) However, once
you convert it to Unicode code points from a given encoding, the
chars (code points) in that range are well defined, as Unicode, er,
defines them. :-) We can't assume that any given extended ASCII character
set is in use, so we have to stick to semantics that apply only to the
formal seven-bit encoding. Here is an interesting and relevant quote:

While ASCII basically uses 7-bit codes, Unicode and the UCS use
relatively abstract "code points": non-negative integer numbers that
map, using different encoding forms and schemes, to sequences of one
or more 8-bit bytes. To permit backward compatibility, Unicode and the
UCS assign the first 128 code points to the same characters as ASCII.
One can therefore think of ASCII as a 7-bit encoding scheme for a very
small subset of Unicode and of the UCS.


The char classes in Unicode and the char classes in traditional Perl
are (pretty much necessarily) different. Therefore we have a problem.
Either we change the old definition (assuming we ignore that there are
different extended ASCII encodings and somehow deal with that, and
ignore EBCDIC in all of its variants as well) and break old code
that depends on that definition, or we produce a regex engine that
does Unicode matching contrary to the prescribed methodology. The
latter is undesirable for a number of reasons, not least that
Perl's regex engine is pretty much a reference implementation. The
former is obviously undesirable because of back-compat.

So it seems to me the only sensible approach is the one we have
taken, which is that the special char classes match different things in
Unicode mode versus non-Unicode mode, and we expect people using Unicode
to know what the Unicode standard dictates in terms of behaviour.
After all, it's a spec, and if you use it you should be aware of what it
says.

Hope that clears things up for you.

Cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT · 2006-07-24T18:18:44Z

From [email protected]

Quoth demerphq@gmail.com (demerphq):

On 7/24/06, Ben Morrow <benmorrow@tiscali.co.uk> wrote:

Quoth demerphq@gmail.com (demerphq):

On 7/23/06, via RT shadrinovdd @ ystu. ru <perlbug-followup@perl.org> wrote:

# New Ticket Created by shadrinovdd@ystu.ru
# Please include the string: [perl #39912]
# in the subject line of all future correspondence about this issue.
# <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=39912 >

This is a bug report for perl from shadrinovdd@ystu.ru,
generated with the help of perlbug 1.35 running under perl v5.8.8.

-----------------------------------------------------------------
[Please enter your report here]

non-breaking space (hex 0xA0) is considered as whitespace
in regexp (\s) if scalar has an UTF8 flag on.

I don't believe that this can be characterized as a bug. Unicode has
its own specific meaning for "whitespace" that is a little different
from the definition that perl has always used for ASCII.

Nevertheless, it is hardly sensible for regexps to match differently
depending on whether the string has the (internal-only) utf8 flag on. A
utf8 string that just happens to only contain characters below 0x100
should behave the same as a non-utf8 string, I'd have thought.

Of course, you are right about this. However, Perl 5.8 treats 8bit
(not-utf8) strings as being in ISO8859-1, so they *are* well defined.

The char classes in unicode and the char classes in traditional perl
are (pretty much necessarily) different. Therefore we have a problem.
<snip>

So it seems to me the only sensible approach is the one we have
taken, which is that the special char classes match different things in
Unicode mode versus non-Unicode mode, and we expect people using Unicode
to know what the Unicode standard dictates in terms of behaviour.
After all, it's a spec, and if you use it you should be aware of what it
says.

I couldn't agree more, and it is definitely a problem. However, do you
not agree that this:

#!/usr/bin/perl

$_ = chr 0xA0;
print ord, /\s/ ? "yes" : "no";

$_ .= chr 0x100;
chop;
print ord, /\s/ ? "yes" : "no";

is extremely unintuitive, to say the least? Adding and removing a non-\s
character to a string makes it suddenly match \s? Especially when this
could happen in a module that has been upgraded to support Unicode
without the user being aware of it.

The nearest thing to a solution I can see is to have regexps
special-case characters 0x80-0x100 normally, and then have a pragma,
perhaps C<use locale 'unicode'> since locale already changes matching
behaviour for 8bit chars, that (lexically) switches to Unicode char
classes on those characters.

An alternative would be to ensure that strings are always downgraded if
they can be, but I suspect that would be unacceptably slow in some
situations.

Basically I'm arguing that since users are not supposed to concern
themselves with the state of the utf8 flag, user-visible behaviour
should not change on account of it.
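The "always downgrade if possible" alternative can be tried by hand today. This is only a sketch of the idea, not a proposal for how perl itself would do it: `downgrade_if_possible` is a made-up helper around the real utf8::downgrade, whose second argument asks it not to die when the string contains characters above 0xFF. It assumes the default regex rules (no `unicode_strings` feature in scope).

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of $s with the UTF8 flag cleared
# whenever every character fits in a single byte.
sub downgrade_if_possible {
    my ($s) = @_;
    utf8::downgrade($s, 1);    # true second argument: fail quietly
    return $s;
}

# Reproduce the append-and-chop case from above.
my $s = chr 0xA0;
$s .= chr 0x100;               # forces the string to be upgraded
chop $s;                       # wide character gone, UTF8 flag stays on

my $d = downgrade_if_possible($s);

printf "before: flag=%d, matches \\s: %s\n",
    utf8::is_utf8($s) ? 1 : 0, $s =~ /\s/ ? "yes" : "no";
printf "after:  flag=%d, matches \\s: %s\n",
    utf8::is_utf8($d) ? 1 : 0, $d =~ /\s/ ? "yes" : "no";
```

Doing this after every string operation is what I suspect would be too slow; doing it at module boundaries might be tolerable.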

Ben

--
Joy and Woe are woven fine,
A Clothing for the Soul divine William Blake
Under every grief and pine 'Auguries of Innocence'
Runs a joy with silken twine. benmorrow@tiscali.co.uk

p5pRT · 2006-07-24T19:16:54Z

From @druud62

Ben Morrow wrote:

#!/usr/bin/perl

$_ = chr 0xA0;
print ord, /\s/ ? "yes" : "no";

$_ .= chr 0x100;
chop;
print ord, /\s/ ? "yes" : "no";

is extremely unintuitive

Yes. 0xA0 is the NO-BREAK SPACE in ISO-8859-1, but it is not
acknowledged as whitespace (it isn't a separator).

An alternative would be to ensure that strings are always downgraded
if they can be, but I suspect that would be unacceptably slow in some
situations.

Well, "use downgrade" could be the default, so "no downgrade" is
available.

--
Affijn, Ruud

"Gewoon is een tijger."

p5pRT · 2006-07-24T19:55:03Z

From @rgarcia

On 24/07/06, Ben Morrow <benmorrow@tiscali.co.uk> wrote:

I couldn't agree more, and it is definitely a problem. However, do you
not agree that this:
#!/usr/bin/perl

$_ = chr 0xA0;
print ord, /\s/ ? "yes" : "no";

$_ .= chr 0x100;
chop;
print ord, /\s/ ? "yes" : "no";
is extremely unintuitive, to say the least? Adding and removing a non-\s
character to a string makes it suddenly match \s? Especially when this
could happen in a module that has been upgraded to support Unicode
without the user being aware of it.

The module encoding::warnings is designed to produce warnings against
this kind of situation. With it, you get :

$ perl test.pl
Bytes implicitly upgraded into wide characters as iso-8859-1 at test.pl line 3
160yes
160yes

(However, with blead, I get a segfault instead...)

p5pRT · 2006-07-25T01:33:24Z

From @ysth

On Mon, Jul 24, 2006 at 07:15:44PM +0100, Ben Morrow wrote:

Quoth demerphq@gmail.com (demerphq):

On 7/24/06, Ben Morrow <benmorrow@tiscali.co.uk> wrote:

Quoth demerphq@gmail.com (demerphq):

On 7/23/06, via RT shadrinovdd @ ystu. ru <perlbug-followup@perl.org> wrote:

# New Ticket Created by shadrinovdd@ystu.ru
# Please include the string: [perl #39912]
# in the subject line of all future correspondence about this issue.
# <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=39912 >

This is a bug report for perl from shadrinovdd@ystu.ru,
generated with the help of perlbug 1.35 running under perl v5.8.8.

-----------------------------------------------------------------
[Please enter your report here]

non-breaking space (hex 0xA0) is considered as whitespace
in regexp (\s) if scalar has an UTF8 flag on.

I don't believe that this can be characterized as a bug. Unicode has
its own specific meaning for "whitespace" that is a little different
from the definition that perl has always used for ASCII.

Nevertheless, it is hardly sensible for regexps to match differently
depending on whether the string has the (internal-only) utf8 flag on. A
utf8 string that just happens to only contain characters below 0x100
should behave the same as a non-utf8 string, I'd have thought.

<snip ASCII is 7bit, so 8bit chars aren't well defined>

Of course, you are right about this. However, Perl 5.8 treats 8bit
(not-utf8) strings as being in ISO8859-1, so they *are* well defined.

No, it doesn't. What gave you the idea it did?

p5pRT · 2006-07-25T11:40:57Z

From @druud62

Yitzchak Scott-Thoennes wrote:

Ben Morrow:

Perl 5.8 treats 8bit
(not-utf8) strings as being in ISO8859-1, so they *are* well defined.

No, it doesn't. What gave you the idea it did?

Maybe from this:

<quote src="encoding">
Implicit upgrading for byte strings

By default, if strings operating under byte semantics and strings with
Unicode character data are concatenated, the new string will be created
by decoding the byte strings as ISO 8859-1 (Latin-1).

The encoding pragma changes this to use the specified encoding instead.
</quote>

--
Affijn, Ruud

"Gewoon is een tijger."

p5pRT · 2006-07-25T17:03:11Z

From @davidnicol

non-breaking space (hex 0xA0) is considered as whitespace
in regexp (\s) if scalar has an UTF8 flag on.

So are you saying we should add 0xA0 to the \s class so there is consistency,
or document the current behavior as a known corner situation, or what?

p5pRT · 2006-07-25T18:27:41Z

From @druud62

"David Nicol" wrote:

shadrinovdd:

non-breaking space (hex 0xA0) is considered as whitespace
in regexp (\s) if scalar has an UTF8 flag on.

So are you saying we should add 0xA0 to the \s class so there is
consistency, or document the current behavior as a known corner
situation, or what?

1. I vote for auto-downgrading as default.
2. Non-breaking characters don't belong in any \s at all.

--
Affijn, Ruud

"Gewoon is een tijger."

p5pRT · 2006-07-25T19:07:20Z

From @davidnicol

On 7/25/06, Dr.Ruud <rvtol+news@isolution.nl> wrote:

2. Non-breaking characters don't belong in any \s at all.

your request is to remove the non-breaking space from the UTF8 \s class.

That seems reasonable, but treating non-breaking space as space
seems reasonable too. perhaps a pragma is needed here?

no nonbreakingspaceinwhitespaceclass;

Entering software architect mode:

the regex parser would have to check that flag in the hints table at regex
compilation time to choose which whitespace list to use, and locale tables,
if they are to respect this pragma, would have to provide two whitespace lists,
or maybe the pragma is only in effect in the default locale and all locales
would then have to provide alternate versions with NBSP reclassified as
non-white-space. Which would be bad because it would open the way to geometric
increase in the number of locales if more optimizations happen that way too.

So a pragma-compliant locale would provide a list of its NBSPs that are
in whitespace by default, but get removed when "no NBSPinS" has been
declared.

Or in general do we want to leave NBSPs out of \s rather than including all
of a locale's whitespace chars in \s, treating the report as a bug and
repairing it?

If \s stopped matching NBSP what would that do? Would NBSP still match \b?

I have more questions than answers.

p5pRT · 2006-07-26T01:18:54Z

From @druud62

David Nicol:

Dr.Ruud:

2. Non-breaking characters don't belong in any \s at all.

your request is to remove the non-breaking space from the UTF8 \s
class.

Well, I would like to see the orthogonality recognized. A non-breaking
space is 'more graphical' than a breaking space. It's not interpunction,
it's not a separator, it is (almost) a word character. So it belongs to
\w. Spaces inside filenames are non-breaking space.

I have more questions than answers.

I consider that a Good Thing.

--
Affijn, Ruud

"Gewoon is een tijger."

p5pRT · 2006-07-26T04:28:25Z

From [email protected]

Quoth rvtol+news@isolution.nl ("Dr.Ruud"):

David Nicol:

Dr.Ruud:

2. Non-breaking characters don't belong in any \s at all.

your request is to remove the non-breaking space from the UTF8 \s
class.

Well, I would like to see the orthogonality recognized. A non-breaking
space is 'more graphical' than a breaking space. It's not interpunction,
it's not a separator, it is (almost) a word character.

Well, the Unicode consortium disagrees with you: they consider nbsp to
be Zs, which is 'Separator, space'. And they get to set the rules :).

So it belongs to \w.

Really it doesn't. \w is by definition the set of characters valid in
Perl identifiers: allowing $ab cd where the space just happens to be nb
would be *seriously* confusing.

Spaces inside filenames are non-breaking space.

Well, they can be either, and in the case of *my* filenames they're
always plain ol' ascii 32...

I was never arguing that \s matching nbsp was inherently wrong (and, of
course, \p{Zs} should *always* match nbsp), simply that the matching
behaviour of a string shouldn't change based on a non-user-visible flag,
so \s should either match nbsp or not, for both utf8 and ascii strings.
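For comparison, here is a small sketch of the \p{Zs} case, assuming a reasonably modern perl where a Unicode property in the pattern selects Unicode rules regardless of the string's internal flag. The helper name `is_space_separator` is invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $s is exactly one character in
# general category Zs (Separator, space).
sub is_space_separator {
    my ($s) = @_;
    return $s =~ /\A\p{Zs}\z/ ? 1 : 0;
}

my $native   = "\xA0";         # UTF8 flag off
my $upgraded = $native;
utf8::upgrade($upgraded);      # UTF8 flag on

printf "native nbsp:   %d\n", is_space_separator($native);
printf "upgraded nbsp: %d\n", is_space_separator($upgraded);
printf "plain space:   %d\n", is_space_separator(" ");
printf "tab:           %d\n", is_space_separator("\t");
```

So code that needs a flag-independent answer can ask for the property explicitly instead of relying on \s.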

Ben

--
The cosmos, at best, is like a rubbish heap scattered at random.
Heraclitus
benmorrow@tiscali.co.uk

p5pRT · 2006-07-26T05:02:59Z

From [email protected]

On Tue Jul 25 10:03:11 2006, davidnicol@gmail.com wrote:

non-breaking space (hex 0xA0) is considered as whitespace
in regexp (\s) if scalar has an UTF8 flag on.

So are you saying we should add 0xA0 to the \s class so there is
consistency,
or document the current behavior as a known corner situation, or what?

I think 0xA0 should be excluded from the \s class in both encodings
(single-byte and UTF8). Its meaning is 'NO-BREAK', not separator.

It is used only to glue words together. Otherwise, for example, compacting
spaces in a text document would break formatting rules.

This problem already exists in module HTML::TreeBuilder v3.13, line 1103.

p5pRT · 2006-07-26T05:43:20Z

From @demerphq

On 7/26/06, Ben Morrow <benmorrow@tiscali.co.uk> wrote:

I was never arguing that \s matching nbsp was inherently wrong (and, of
course, \p{Zs} should *always* match nbsp), simply that the matching
behaviour of a string shouldn't change based on a non-user-visible flag,
so \s should either match nbsp or not, for both utf8 and ascii strings.

Note that \s is explicitly documented to match different things in
utf8 and in non-utf8. And that goes for other special chars like \w
and so on. See perlre. However, it doesn't say anything about 0xA0.

I've been too busy to investigate this further.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT · 2006-07-26T14:37:08Z

From @druud62

Ben Morrow wrote:

Dr.Ruud:

David Nicol:

Dr.Ruud:

2. Non-breaking characters don't belong in any \s at all.

your request is to remove the non-breaking space from the UTF8 \s
class.

Well, I would like to see the orthogonality recognized. A
non-breaking space is 'more graphical' than a breaking space. It's
not interpunction, it's not a separator, it is (almost) a word
character.

Well, the Unicode consortium disagrees with you: they consider nbsp
to be Zs, which is 'Separator, space'. And they get to set the rules
:).

And I don't want to change that. No wait, that's not entirely true, I am
waiting for them to add a few glyphs to the Arabic set, but such things
take years.

So it belongs to \w.

Really it doesn't. \w is by definition the set of characters valid in
Perl identifiers: allowing $ab cd where the space just happens to be
nb would be *seriously* confusing.

Use a zero-width one. <g>

Spaces inside filenames are non-breaking space.

Well, they can be either, and in the case of *my* filenames they're
always plain ol' ascii 32...

Yes, but functionally. When such a filename doesn't fit in the remaining
space on a line of text, I prefer it to go to the next line entirely.
Same with long URLs. If it then still doesn't fit, the choice of
where to break it will be hard: if you do it on a space then the space
becomes invisible.

See also RFC 1738 about line break and hyphen, they should have made the
unencoded hyphen (and optional following whitespace) to be ignored in at
least the URL%2Dpath.

\s should either match nbsp or not, for both utf8 and ascii
strings.

Yes, that is the most practical. The best candidate for non-breaking
space in ASCII is the underscore; ITYM ISO-8859-1 (or Latin-1).

--
Affijn, Ruud

"Gewoon is een tijger."

p5pRT · 2006-07-26T15:22:55Z

From [email protected]

On Wed, 26 Jul 2006 16:32:34 +0200, "Dr.Ruud" wrote

Well, I would like to see the orthogonality recognized. A
non-breaking space is 'more graphical' than a breaking space. It's
not interpunction, it's not a separator, it is (almost) a word
character.

Well, the Unicode consortium disagrees with you: they consider nbsp
to be Zs, which is 'Separator, space'. And they get to set the rules
:).

And I don't want to change that. No wait, that's not entirely true, I am
waiting for them to add a few glyphs to the Arabic set, but such things
take years.

According to Word Boundaries by Unicode Standard Annex #29,
(cf. http://www.unicode.org/reports/tr29/#Word_Boundaries )

NO-BREAK SPACE is treated as a letter (ALetter); the position between
an alphabet and a NBSP is not a word boundary.

Therefore (if I understand correctly) The Unicode Consortium recommends
\b should not match the position between an alphabet and a NBSP.
(cf. http://www.unicode.org/reports/tr18/#Default_Word_Boundaries )
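For reference, Perl's plain \b is defined by \w/\W transitions rather than by the UAX #29 rules, so it behaves differently from that recommendation. A minimal sketch (the helper names are made up for illustration) that checks both the native and the upgraded form of the string:

```perl
use strict;
use warnings;

# Hypothetical helper: does \b match between the letter "a" and a
# following NO-BREAK SPACE?
sub boundary_after_letter {
    my ($s) = @_;
    return $s =~ /a\b\x{A0}/ ? 1 : 0;
}

# Hypothetical helper: is $s a single \w character?
sub is_word_char {
    my ($s) = @_;
    return $s =~ /\A\w\z/ ? 1 : 0;
}

my $native   = "a\xA0b";       # UTF8 flag off
my $upgraded = $native;
utf8::upgrade($upgraded);      # UTF8 flag on

printf "native:   \\b after letter = %d\n", boundary_after_letter($native);
printf "upgraded: \\b after letter = %d\n", boundary_after_letter($upgraded);
printf "nbsp is \\w: %d\n", is_word_char("\xA0");
```

Since NBSP is not in \w in either mode, \b matches at that position whether or not \s matches NBSP.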

Regards
SADAHIRO Tomoyuki

p5pRT · 2006-07-26T16:56:19Z

From @druud62

SADAHIRO Tomoyuki wrote:

Dr.Ruud:

Well, I would like to see the orthogonality recognized. A
non-breaking space is 'more graphical' than a breaking space. It's
not interpunction, it's not a separator, it is (almost) a word
character.

Well, the Unicode consortium disagrees with you: they consider nbsp
to be Zs, which is 'Separator, space'. And they get to set the rules
:).

And I don't want to change that. No wait, that's not entirely true,
I am waiting for them to add a few glyphs to the Arabic set, but
such things take years.

According to Word Boundaries by Unicode Standard Annex #29,
(cf. http://www.unicode.org/reports/tr29/#Word_Boundaries )

NO-BREAK SPACE is treated as a letter (ALetter); the position between
an alphabet and a NBSP is not a word boundary.

Therefore (if I understand correctly) The Unicode Consortium
recommends \b should not match the position between an alphabet and a
NBSP. (cf.
http://www.unicode.org/reports/tr18/#Default_Word_Boundaries )

Thanks, also for an hour well spent on the Unicode site again. Perl 5 is
mentioned quite often.

With Unicode 4.0.1 (March 2004), a category 'Cf' (Format) is added (the
Zero-width spaces are in it).

The U+00A0 is still in 'Zs' (Space Separator), and marked with 'CS'
(Common number Separator). So maybe you don't need to put a soft-hyphen
after the non-break spaces to allow breaks in your big numbers. :)

If \s gets \xA0, then I guess that split(//) will split on \xA0 as well.
http://mail.python.org/pipermail/python-bugs-list/2006-January/031513.html

--
Affijn, Ruud

"Gewoon is een tijger."

p5pRT · 2006-07-28T09:04:36Z

From [email protected]

Dr.Ruud <rvtol+news@isolution.nl> writes:

David Nicol:

Dr.Ruud:

2. Non-breaking characters don't belong in any \s at all.

your request is to remove the non-breaking space from the UTF8 \s
class.

Well, I would like to see the orthogonality recognized. A non-breaking
space is 'more graphical' than a breaking space. It's not interpunction,
it's not a separator, it is (almost) a word character.

UTF-8 \s and \w come from Unicode standard tables.
In this case that standard copies iso-8859-1.

Personally I think treating it as \s, or at least matching \b, makes
it easier to spell-check such cases.

So it belongs to
\w. Spaces inside filenames are non-breaking space.

Not everywhere.

I have more questions than answers.

I consider that a Good Thing.

p5pRT · 2006-07-28T16:46:46Z

From [email protected]

This is the same bug that I reported last year:
http://rt.perl.org/rt3//Public/Bug/Display.html?id=36839

I have also submitted a patch for the HTML::TreeBuilder problem:
http://rt.cpan.org/Public/Bug/Display.html?id=17481

p5pRT · 2008-05-12T11:27:22Z

[email protected] - Status changed from 'open' to 'stalled'

p5pRT · 2011-05-01T06:12:29Z

From @cpansprout

This has been resolved in perl 5.14.0, as long as ‘use 5.14.0’ or ‘use
feature "unicode_strings"’ is turned on.

(Or /foo/u or ‘use re '/u'’.)
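A short sketch of how these options compare, assuming perl 5.14 or later (the /u and /a modifiers do not exist before that). The sub names are invented for the example, and the script deliberately avoids a file-level `use 5.14.0` so that the first sub still runs under the old default rules.

```perl
use strict;
use warnings;

# Old default rules: the result depends on the internal UTF8 flag.
sub default_rules { my ($s) = @_; return $s =~ /\s/  ? 1 : 0 }

# /u: Unicode rules regardless of the flag.
sub unicode_rules { my ($s) = @_; return $s =~ /\s/u ? 1 : 0 }

# /a: \s restricted to ASCII whitespace regardless of the flag.
sub ascii_rules   { my ($s) = @_; return $s =~ /\s/a ? 1 : 0 }

# The feature changes the default for patterns compiled in its scope.
sub feature_rules {
    my ($s) = @_;
    use feature 'unicode_strings';
    return $s =~ /\s/ ? 1 : 0;
}

my $native   = "\xA0";         # UTF8 flag off
my $upgraded = $native;
utf8::upgrade($upgraded);      # UTF8 flag on

for my $case ([native => $native], [upgraded => $upgraded]) {
    my ($label, $s) = @$case;
    printf "%-8s default=%d /u=%d /a=%d feature=%d\n", $label,
        default_rules($s), unicode_rules($s),
        ascii_rules($s), feature_rules($s);
}
```

Only the first column differs between the two rows; the other three give the same answer for both representations, each according to its own rule set.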

p5pRT · 2011-05-01T06:12:30Z

@cpansprout - Status changed from 'stalled' to 'resolved'

p5pRT closed this as completed May 1, 2011

p5pRT added the Severity Low label Oct 18, 2019

p5pRT added affects-5.8 distro-Linux type-core labels Oct 18, 2019
