Full Unicode based on UTF-16 proposal

Norbert Lindenberg ecmascript at norbertlindenberg.com
Sat Mar 17 11:54:16 PDT 2012


Steven, sorry, I wasn't aware of your proposal for /u when I inserted the note on this flag into my proposal. My proposal was inspired by the use of /u in PHP, where it switches from byte mode to UTF-8 mode. We'll have to see whether it makes sense to combine the two under one flag or use two - fortunately, Unicode still has a few other characters.

Norbert


On Mar 17, 2012, at 11:22 , Steven L. wrote:

> Erik Corry wrote:
>>> Disagree with adding /u for this purpose and disagree with breaking backward
>>> compatibility to let `/./.exec(s)[0].length == 2`.
>> 
>> Care to enlighten us with any thinking behind this disagreeing?
> 
> Sorry for the rushed and overly ebullient message. I disagreed with /u for switching from code unit to code point mode because in the moment I didn't think a code point mode necessary or particularly beneficial. Upon further reflection, I rushed into this opinion and will be more closely examining the related issues.
> 
> I further objected because I think the /u flag would be better used as an ASCII/Unicode mode switcher for \d\w\b. My proposal for this is based on Python's re.UNICODE or (?u) flag, which does the same thing except that it also covers \s (which is already Unicode-based in ES). Therefore, I think that if a flag is added that only switches from code unit to code point mode, it should not be "u". Presumably, flag /u could simultaneously affect \d\w\b and switch to code point mode. I haven't yet thought enough about combining these two proposals to hold a strong opinion on the matter.
> 
>>> there are two ways to match any Unicode
>>> grapheme that match existing regex library precedent:
>>> 
>>> From Perl and PCRE:
>>> \X
>> 
>> This doesn't work inside [].  Were you envisioning the same restriction in JS?
>> 
>> Also it matches a grapheme cluster, which may be useful but is
>> completely different to what the dot does.
> 
> You are of course correct. And yes, I was envisioning the same restriction within character classes. But I'm not a strong proponent of \X, especially if support for Unicode categories is added.
> 
>> I agree with Steven that these two cases should just be left alone,
>> which means they will continue to work the way they have until now.
> 
> Glad to hear it.
> 
>> You seem to be confusing graphemes and unicode code points.
>> [...]
>> The proposal you are responding to is all about adding Unicode code
>> point handling to regexps.  It is not about adding grapheme support,
>> which is a rather different issue.
> 
> Indeed. My response was rushed and poorly formed. My apologies.
> 
> --Steven Levithan
> 
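
For readers following the thread, the two behaviours under discussion (code unit vs. code point matching for the dot, and ASCII-only \w and \d) can be sketched as below. This is only an illustration, not part of either proposal: it assumes the `u` flag as it was later standardized in ECMAScript 2015, which may differ from the semantics debated here, and the variable names are made up for the example.

```javascript
// Sketch only. The `u` flag used here is the one later standardized in
// ECMAScript 2015; it is an assumption, not the flag as proposed in this thread.

// U+1F600 is a supplementary character, stored as two UTF-16 code units.
const s = "\uD83D\uDE00";

// Code unit mode (no flag): the dot matches a single code unit,
// so the match is the lone high surrogate.
const codeUnitMatch = /./.exec(s)[0];

// Code point mode (`u` flag): the dot matches the whole code point,
// which is the case where `/./.exec(s)[0].length == 2` would hold.
const codePointMatch = /./u.exec(s)[0];

console.log(codeUnitMatch.length);  // 1
console.log(codePointMatch.length); // 2

// \w and \d are ASCII-based: they do not match letters or digits
// outside [A-Za-z0-9_] and [0-9].
const asciiWord = /^\w+$/.test("na\u00EFve"); // false, "ï" is not matched by \w
const asciiDigit = /^\d$/.test("\u0663");     // false, ARABIC-INDIC DIGIT THREE

// As standardized, the `u` flag alone does not make \w Unicode-aware.
const unicodeFlagWord = /^\w+$/u.test("na\u00EFve"); // false

console.log(asciiWord, asciiDigit, unicodeFlagWord); // false false false
```

Without the flag, existing patterns keep their code unit behaviour, which is the backward-compatibility point raised in the quoted messages.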


