Full Unicode based on UTF-16 proposal

Norbert Lindenberg ecmascript at norbertlindenberg.com
Sat Mar 17 12:15:25 PDT 2012


On Mar 17, 2012, at 10:20 , Erik Corry wrote:

> 2012/3/17 Steven L. <steves_list at hotmail.com>:
>> Erik Corry wrote:
>>> 
>>> However I think we probably do want the /u modifier on regexps to
>>> control the new backward-incompatible behaviour.  There may be some
>>> way to relax this for regexp literals in opted in Harmony code, but
>>> for new RegExp(...) and for other string literals I think there are
>>> rather too many inconsistencies with the old behaviour.
>> 
>> 
>> Disagree with adding /u for this purpose and disagree with breaking backward
>> compatibility to let `/./.exec(s)[0].length == 2`.
> 
> Care to enlighten us with any thinking behind this disagreeing?
> 
>> Instead, if this is
>> deemed an important enough issue, there are two ways to match any Unicode
>> grapheme that match existing regex library precedent:
>> 
>> From Perl and PCRE:
>> 
>> \X
> 
> This doesn't work inside [].  Were you envisioning the same restriction in JS?
> 
> Also it matches a grapheme cluster, which may be useful but is
> completely different from what the dot does.
> 
>> From Perl, PCRE, .NET, Java, XML Schema, and ICU (among others):
>> 
>> \P{M}\p{M}*
>> 
>> Obviously \X is prettier, but because it's fairly rare for people to care
>> about this, IMO the more widely compatible solution that uses Unicode
>> categories is Good Enough if Unicode category syntax is on the table for
>> ES6.
>> 
>> Norbert Lindenberg wrote:
>>> 
>>> \uxxxx[\uyyyy-\uzzzz] is interpreted as [\uxxxx\uyyyy-\uxxxx\uzzzz]
> 
> Norbert, this just happens automatically if unmatched surrogates are
> just treated as if they were normal code units.

I don't see how. In the actual matching process, the new design only looks at code points, not code units. Without this transformation, it would see surrogate code points in the pattern, but supplementary code points in the text to be matched. Enhancing the matching process to recognize surrogate code points and insert them into the continuation might work, but wouldn't be any prettier than this transformation.
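To make the mismatch concrete, here is a rough sketch. It assumes an engine that implements the `u` flag as it was later standardized in ES2015, which is not the design under discussion in this thread: that flag matches by code point and does not apply the transformation, so it shows what happens without it. The variable names are only illustrative.

```javascript
// Text: GOTHIC LETTER AHSA (U+10330), stored as the surrogate pair D800 DF30.
const ahsa = "\uD800\uDF30";

// Code-unit matching (no flag): a lead surrogate, then a class of trail surrogates.
const codeUnitPattern = /^\uD800[\uDF00-\uDF4F]$/;

// Code-point matching via the ES2015 `u` flag (an assumption, see above).
// The pattern is not transformed, so it still describes two separate surrogate
// code points, while the text is seen as one supplementary code point.
const untransformed = /^\uD800[\uDF00-\uDF4F]$/u;

// The transformation applied by hand: [\uxxxx\uyyyy-\uxxxx\uzzzz],
// read as a single range of supplementary code points.
const transformed = /^[\uD800\uDF00-\uD800\uDF4F]$/u;

const results = {
  codeUnit: codeUnitPattern.test(ahsa),     // true
  untransformed: untransformed.test(ahsa),  // false
  transformed: transformed.test(ahsa),      // true
};
console.log(results);
```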

>>> [\uwwww-\uxxxx][\uyyyy-\uzzzz] is interpreted as
>>> [\uwwww\uyyyy-\uxxxx\uzzzz]
> 
> Norbert, this will have different semantics to the current
> implementations unless the second range is the full trail surrogate
> range.

True. I think if we restrict the transformation to that specific case it'll still cover normal usage of this pattern.
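A small sketch of why the restriction matters, again assuming the `u` flag as later standardized in ES2015 as a stand-in for code-point matching. The `merged*` patterns are the transformation written out by hand, and all names are hypothetical.

```javascript
// Partial trail range: the code-unit pattern accepts only four characters
// (U+10000, U+10001, U+10400, U+10401), but the merged code point range
// U+10000..U+10401 accepts everything in between as well.
const partialCodeUnits = /^[\uD800-\uD801][\uDC00-\uDC01]$/;
const mergedPartial = /^[\uD800\uDC00-\uD801\uDC01]$/u;

const sample = String.fromCodePoint(0x10002); // D800 DC02
const partialDiffers =
  partialCodeUnits.test(sample) !== mergedPartial.test(sample);

// Full trail range: both readings describe the same set of characters.
const fullCodeUnits = /^[\uD800-\uDBFF][\uDC00-\uDFFF]$/;
const mergedFull = /^[\uD800\uDC00-\uDBFF\uDFFF]$/u;

const probes = [0x10000, 0x10002, 0x10330, 0x10FFFF].map((cp) =>
  String.fromCodePoint(cp)
);
const fullAgrees = probes.every(
  (s) => fullCodeUnits.test(s) && mergedFull.test(s)
);

console.log({ partialDiffers, fullAgrees });
```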

> I agree with Steven that these two cases should just be left alone,
> which means they will continue to work the way they have until now.
> 
>> Some people will want a way to match arbitrary Unicode code
>> points rather than graphemes anyway, so leaving \uhhhh alone lets that use
>> case continue working. This would still allow modifying the handling of
>> literal astral/supplementary characters in RegExps. If it can be handled
>> sensibly, I'm all for treating literal characters in RegExps as discrete
>> graphemes rather than splitting them into surrogate pairs.
> 
> You seem to be confusing graphemes and Unicode code points.  Here is
> the same text 3 times:
> 
> Four UTF-16 code units:
> 
> 0x0020 0xD800 0xDF30 0x0308
> 
> Three Unicode code points:
> 
> 0x20 0x10330 0x308
> 
> Two graphemes:
> 
> " " "¨"  <-- This is an attempt to show a Gothic Ahsa with an umlaut.
> My mail program probably screwed it up.

Mac Mail is usually Unicode-friendly, so let's try again:
" 𐌰̈"

> The proposal you are responding to is all about adding Unicode code
> point handling to regexps.  It is not about adding grapheme support,
> which is a rather different issue.

Correct - thanks for the explanation!
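
For anyone who wants to check the three views of the sample text quoted above, here is a sketch. It assumes a later engine with the ES2015 `u` flag and string iteration, plus Unicode property escapes (ES2018), none of which exist in the current proposal. The \P{M}\p{M}* pattern is only the approximation of a grapheme mentioned earlier in the thread, not full grapheme cluster segmentation.

```javascript
// SPACE, GOTHIC LETTER AHSA (as a surrogate pair), COMBINING DIAERESIS.
const text = "\u0020\uD800\uDF30\u0308";

// UTF-16 code units: what String.prototype.length counts.
const codeUnits = text.length;

// Unicode code points: string iteration walks by code point.
const codePoints = Array.from(text).length;

// Approximate graphemes: a non-mark followed by any number of marks.
const clusters = (text.match(/\P{M}\p{M}*/gu) || []).length;

// What the dot consumes with and without code-point matching.
const ahsa = "\uD800\uDF30";
const dotCodeUnit = /./.exec(ahsa)[0].length;   // lead surrogate only
const dotCodePoint = /./u.exec(ahsa)[0].length; // the whole pair

console.log({ codeUnits, codePoints, clusters, dotCodeUnit, dotCodePoint });
// → { codeUnits: 4, codePoints: 3, clusters: 2, dotCodeUnit: 1, dotCodePoint: 2 }
```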


