Full Unicode based on UTF-16 proposal

Steven L. steves_list at hotmail.com
Sat Mar 17 17:08:15 PDT 2012

Erik Corry wrote:

>> I further objected because I think the /u flag would be better used as an
>> ASCII/Unicode mode switcher for \d\w\b. My proposal for this is based on
>> Python's re.UNICODE or (?u) flag, which does the same thing except that it
>> also covers \s (which is already Unicode-based in ES).
> I am rather skeptical about treating \d like this.  I think "any digit
> including rods and roman characters but not decimal points/commas"
> http://en.wikipedia.org/wiki/Numerals_in_Unicode#Counting-rod_numerals
> would be needed much less often than the digits 0-9, so I think
> hijacking \d for this case is poor use of name space.  The \d escape
> in perl does not cover other Unicode numerals, and even with the
> [:name:] syntax there appears to be no way to get the Unicode
> numerals: 
> http://search.cpan.org/~flora/perl-5.14.2/pod/perlrecharclass.pod#POSIX_Character_Classes
>  This suggests to me that it's not very useful.

I know from experience that it's common for Arabic speakers to want to match 
both 0-9 and Arabic-Indic digits. The same seems true for Hindi/Devanagari 
digits, and probably others. Even if it weren't often useful, IMO this change 
is necessary for congruity with Unicode-enabled \w and \b (I'll get to 
that), and would likely never be detrimental since /u would be opt-in and 
it's easy to explicitly use [0-9] when that's what you want.

For the record, I am proposing that /\d/u be equivalent to /\p{Nd}/, not 
/\p{N}/. I.e., it should not match any Unicode number, but rather any 
Unicode decimal digit (see 
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5Cp%7BNd%7D for the 
list). And as Norbert noted, that is in fact what Perl's \d matches.
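
To make the Nd-versus-N distinction concrete, here is a minimal sketch. It assumes an engine with Unicode property escapes (these were added to ECMAScript in ES2018, well after this message), and it spells out \p{Nd} explicitly rather than relying on /u changing what \d means. The sample strings and variable names are illustrative only.

```javascript
// Sample inputs (illustrative, not taken from the message above).
const asciiDigits = "2012";
const arabicIndicDigits = "\u0662\u0660\u0661\u0662"; // Arabic-Indic 2012
const devanagariDigits = "\u0968\u0966\u0967\u0968";  // Devanagari 2012
const romanNumeral = "\u2167"; // ROMAN NUMERAL EIGHT, category Nl, not Nd

const asciiOnly = /^[0-9]+$/;      // what \d matches in ES today
const decimalDigit = /^\p{Nd}+$/u; // what /\d/u is proposed to match
const anyNumber = /^\p{N}+$/u;     // the broader class, NOT what is proposed

console.log(asciiOnly.test(arabicIndicDigits));    // false
console.log(decimalDigit.test(arabicIndicDigits)); // true
console.log(decimalDigit.test(devanagariDigits));  // true
console.log(decimalDigit.test(romanNumeral));      // false
console.log(anyNumber.test(romanNumeral));         // true
```

The last two lines show why the choice of \p{Nd} over \p{N} matters: Roman numerals and similar characters stay out of \d.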

Comparison with other regex flavors:

* \w == [A-Za-z0-9_] -- ES-current, Java, PCRE, Ruby, Python (default).
* \w == [\p{L}\p{Nd}_] -- .NET, Perl, Python (with (?u)).

* \b matches between ASCII \w\W -- ES-current, PCRE, Ruby, Python (default).
* \b matches between Unicode \w\W -- Java, .NET, Perl, Python (with (?u)).

* \d == [0-9] -- ES-current, Java, PCRE, Ruby, Python (default).
* \d == \p{Nd} -- .NET, Perl, Python (with (?u)).

* \s == [\x09-\x0D\x20] -- Java, PCRE, Ruby, Python (default).
* \s == [\x09-\x0D\p{Z}] -- ES-current, .NET, Perl, Python (with (?u)).

Note that Java's \w and \b are inconsistent: \w is ASCII-only, but \b is 
based on Unicode word characters.
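
The "ES-current" entries in the list above can be checked directly. This is a small sketch using regexes without any flags; the test characters and variable names are my own choices.

```javascript
// No /u flag anywhere: this is plain ES behaviour as listed above.
const wordMatchesAccented = /\w/.test("\u00E9");     // e with acute accent
const digitMatchesArabicIndic = /\d/.test("\u0663"); // Arabic-Indic digit three
const spaceMatchesNbsp = /\s/.test("\u00A0");        // no-break space (Zs)
const spaceMatchesLineSep = /\s/.test("\u2028");     // line separator (Zl)

console.log(wordMatchesAccented);     // false: \w is ASCII-only
console.log(digitMatchesArabicIndic); // false: \d is ASCII-only
console.log(spaceMatchesNbsp);        // true:  \s is already Unicode-based
console.log(spaceMatchesLineSep);     // true
```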

Unicode-based \w and \b are incredibly useful, and it is very common for 
users to want them to be Unicode-based in some cases but not in 
others--thus, an opt-in flag offers the best of both worlds. In fact, I'd go 
so far as to say they are 
broken without Unicode support. Consider, e.g., /a\b/.test('naïve'), which 
currently returns true.
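
A sketch of that example, alongside one possible emulation of a Unicode-aware boundary. The emulation is hypothetical: it is built from lookaround over [\p{L}\p{Nd}_] and assumes an engine with lookbehind and Unicode property escapes (both ES2018), neither of which existed in ES at the time of writing. The names W, unicodeB and unicodeBoundary are illustrative.

```javascript
const word = "na\u00EFve"; // "naive" with a precomposed i-diaeresis (U+00EF)

// ASCII-based \b: U+00EF is not in [A-Za-z0-9_], so a boundary is seen
// between "a" and the following letter.
const asciiBoundary = /a\b/;

// Hypothetical Unicode-aware boundary: a position with a word character on
// exactly one side, where "word character" means [\p{L}\p{Nd}_].
const W = "[\\p{L}\\p{Nd}_]";
const unicodeB = `(?:(?<=${W})(?!${W})|(?<!${W})(?=${W}))`;
const unicodeBoundary = new RegExp("a" + unicodeB, "u");

console.log(asciiBoundary.test(word));   // true  (the surprising result)
console.log(unicodeBoundary.test(word)); // false (no boundary inside the word)
console.log(unicodeBoundary.test("na")); // true  (boundary at end of string)
```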

Unicode-based \d would not only help international users/apps, it is also 
important because otherwise Unicode-based \w and \b would have to use 
[\p{L}0-9_] rather than [\p{L}\p{Nd}_], which breaks portability with .NET, 
Perl, Python, and Java. If, conversely, Unicode-enabled \w and \b used 
[\p{L}\p{Nd}_] but \d used [0-9], then among other consequences (including 
user confusion), [^\W\d_] could not be used equivalently to \p{L}.
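
The [^\W\d_] point can be sketched with predicate functions standing in for the character classes. Everything here is a hypothetical stand-in (ES has no Unicode-enabled \w or \d to test against), and it assumes an engine with Unicode property escapes.

```javascript
// Hypothetical stand-ins for the classes discussed above.
const isWord = (ch) => /^[\p{L}\p{Nd}_]$/u.test(ch); // Unicode-enabled \w
const isUnicodeDigit = (ch) => /^\p{Nd}$/u.test(ch); // \d as \p{Nd}
const isAsciiDigit = (ch) => /^[0-9]$/.test(ch);     // \d as [0-9]
const isLetter = (ch) => /^\p{L}$/u.test(ch);        // \p{L}

// [^\W\d_] reads as: a word character that is neither a digit nor "_".
const lettersVia = (isDigit) => (ch) =>
  isWord(ch) && !isDigit(ch) && ch !== "_";

const consistent = lettersVia(isUnicodeDigit); // \w and \d both Unicode-based
const mixed = lettersVia(isAsciiDigit);        // Unicode \w, ASCII \d

// a, i-diaeresis, 7, underscore, Arabic-Indic digit three
const samples = ["a", "\u00EF", "7", "_", "\u0663"];
for (const ch of samples) {
  console.log(JSON.stringify(ch), isLetter(ch), consistent(ch), mixed(ch));
}
```

For the last sample, the mixed variant reports a "letter" where \p{L} does not, which is the mismatch described above.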

> And instead of changing the meaning of \w, which will be confusing, I
> think that [:alnum:] as in perl would work fine.

[:alnum:] in Perl, PCRE, Ruby, Tcl, POSIX/GNU BRE/ERE, etc. matches only 
[A-Za-z0-9]. Making it Unicode-based in ES would be confusing. It also works 
only within character classes. IMO, the POSIX-style [[:name:]] syntax is 
clumsy and confusing, not to mention backward incompatible. It would 
potentially also be confusing if ES supports only [:alnum:] without adding 
the rest of the (not-very-useful) POSIX regex class names.

> \b is a little tougher.  The Unicode rewrite would be
> (?:(?<![:alnum:])(?=[:alnum:])|(?<=[:alnum:])(?![:alnum:])) which is
> obviously too verbose.  But if we take \b for this then the ASCII
> version has to be written as
> (?:(?<!\w)(?=\w)|(?<=\w)(?!\w)) which is also more than a little
> annoying.  However, often you don't need that if you have negative
> lookbehind because you can write something like
> /(?<!\w)word(?!\w)/    // Negative look-behind for a \w and negative
> look-ahead for \w at the end.
> which isn't _too_ bad, even if it is much worse than
> /\bword\b/

I've already started to explain above why I think Unicode-based \b is 
important and useful. I'll just add the footnote that relying on lookbehind 
would in all likelihood perform less efficiently than \b (depending on 
implementation optimizations).
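
For reference, the two spellings can be compared for matching behavior (not speed) with a quick sketch like the following. It assumes an engine with lookbehind support (ES2018), writes the negative look-ahead as (?!\w), and uses sample inputs of my own; it makes no claim about relative performance.

```javascript
// ASCII \b versus the lookaround spelling of the same check.
const builtIn = /\bword\b/;
const expanded = /(?<!\w)word(?!\w)/;

const inputs = ["a word here", "swordfish", "word", "words", "key-word!"];
const results = inputs.map((s) => [s, builtIn.test(s), expanded.test(s)]);

for (const [s, viaB, viaLookaround] of results) {
  console.log(JSON.stringify(s), viaB, viaLookaround);
}
```

The two forms agree here because "word" itself begins and ends with \w characters; for a pattern that starts or ends with a non-word character they would differ.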

>> Indeed. My response was rushed and poorly formed. My apologies.
> Gratefully accepted with the hope that my next rushed and poorly
> formed response will also be forgiven!

Consider it done. ;-P

--Steven Levithan

More information about the es-discuss mailing list