Full Unicode based on UTF-16 proposal

Norbert Lindenberg ecmascript at norbertlindenberg.com
Fri Mar 16 17:04:46 PDT 2012


Thanks for your comments - a few replies below.

Norbert


On Mar 16, 2012, at 1:55 , Erik Corry wrote:

> However I think we probably do want the /u modifier on regexps to
> control the new backward-incompatible behaviour.  There may be some
> way to relax this for regexp literals in opted in Harmony code, but
> for new RegExp(...) and for other string literals I think there are
> rather too many inconsistencies with the old behaviour.

Before asking developers to add /u, we should really have some evidence that not doing so would cause actual compatibility issues for real applications. Do you know of any examples?
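
For illustration, here is a rough sketch of the kind of behaviour difference under discussion. It assumes an engine that implements a code-point-aware "u" flag along the lines of the proposal; the flag is not part of ES 5.1, and the variable names are made up for this example.

```javascript
// A supplementary character (U+1F600) occupies two UTF-16 code units.
const smiley = "\uD83D\uDE00";

// Code-unit-based matching (the ES 5.1 behaviour): "." consumes a single
// code unit, so a two-unit string cannot match ^.$
const matchesAsCodeUnits = /^.$/.test(smiley);

// Code-point-based matching, ASSUMING the engine supports a "u" flag.
// The RegExp constructor is used so the flag is only parsed at run time.
const matchesAsCodePoints = new RegExp("^.$", "u").test(smiley);

console.log(smiley.length, matchesAsCodeUnits, matchesAsCodePoints);
```

Code that relies on the first result, for instance by matching the two surrogates separately, is what could behave differently if the new semantics applied without an explicit flag.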

Good point about Harmony code, although it seems the explicit opt-in has since been replaced by code being part of a module.

> The algorithm given for codePointAt never returns NaN.  It should
> probably do that for indices that hit a trail surrogate that has a
> lead surrogate preceding it.

NaN is not a valid code point, so it shouldn't be returned. If we want to indicate access to a trailing surrogate code unit as an error, we should throw an exception.
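
The two options can be sketched as follows. This is only an illustration, not the spec text of the proposal: the function names, the choice of RangeError, and the handling of out-of-range indices are all assumptions made for this example.

```javascript
// Hypothetical helpers, named for this sketch only.
function isLeadSurrogate(cu) { return cu >= 0xD800 && cu <= 0xDBFF; }
function isTrailSurrogate(cu) { return cu >= 0xDC00 && cu <= 0xDFFF; }

// Lenient variant: always returns a valid code point value, never NaN.
// An index that hits an unpaired or trailing surrogate yields that
// surrogate's own code unit value.
function codePointAtLenient(str, index) {
  if (index < 0 || index >= str.length) {
    throw new RangeError("index out of range"); // assumption of this sketch
  }
  const first = str.charCodeAt(index);
  if (isLeadSurrogate(first) && index + 1 < str.length) {
    const second = str.charCodeAt(index + 1);
    if (isTrailSurrogate(second)) {
      // Combine the surrogate pair into a supplementary code point.
      return (first - 0xD800) * 0x400 + (second - 0xDC00) + 0x10000;
    }
  }
  return first;
}

// Strict variant: treats an index in the middle of a surrogate pair as an
// error and throws instead of returning a value.
function codePointAtStrict(str, index) {
  const result = codePointAtLenient(str, index);
  if (isTrailSurrogate(result) && index > 0 &&
      isLeadSurrogate(str.charCodeAt(index - 1))) {
    throw new RangeError("index points at a trailing surrogate");
  }
  return result;
}
```

With the string "a\uD83D\uDE00", index 1 gives 0x1F600 in both variants, while index 2 gives 0xDE00 in the lenient variant and throws in the strict one.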

> Perhaps it is outside the scope of this proposal, but it would also
> make a lot of sense to add some named character classes to RegExp.

It would make a lot of sense, but is outside the scope of this proposal. One step at a time :-)

> If we are making a /u modifier for RegExp it would also be nice to get
> rid of the incorrect case independent matching rules.  This is the
> section that says: "If ch's code unit value is greater than or equal
> to decimal 128 and cu's code unit value is less than decimal 128,
> then return ch."

We should also get rid of the exception for "ß" and other characters whose upper case equivalent has more than one code point ("If u does not consist of a single character, return ch." in the Canonicalize algorithm in ES 5.1).
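
To make the two rules concrete, here is a sketch of the ES 5.1 Canonicalize steps written as an ordinary function. The function and parameter names are chosen for this example; in the spec this is an internal procedure, not something scripts can call.

```javascript
// Sketch of the Canonicalize steps from ES 5.1 (15.10.2.8), where a
// "character" is a single UTF-16 code unit.
function canonicalize(ch, ignoreCase) {
  if (!ignoreCase) return ch;

  // Upper-case the character as String.prototype.toUpperCase would.
  const u = ch.toUpperCase();

  // Rule quoted above: if the upper case form is not a single character,
  // leave ch unchanged. This is what excludes "ß" (upper case "SS").
  if (u.length !== 1) return ch;

  // Rule quoted by Erik: a non-ASCII character never canonicalizes to an
  // ASCII one, e.g. U+017F (long s), whose upper case form is "S".
  if (ch.charCodeAt(0) >= 128 && u.charCodeAt(0) < 128) return ch;

  return u;
}

console.log(canonicalize("a", true));       // ordinary case folding
console.log(canonicalize("\u00DF", true));  // "ß" stays as it is
console.log(canonicalize("\u017F", true));  // long s stays as it is
```

Because "ß" and U+017F canonicalize to themselves, /\u00DF/i does not match "SS" and /s/i does not match U+017F under these rules.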


> 2012/3/16 Norbert Lindenberg <ecmascript at norbertlindenberg.com>:
>> Based on my prioritization of goals for support for full Unicode in ECMAScript [1], I've put together a proposal for supporting the full Unicode character set based on the existing representation of text in ECMAScript using UTF-16 code unit sequences:
>> http://norbertlindenberg.com/2012/03/ecmascript-supplementary-characters/index.html
>> 
>> The detailed proposed spec changes serve to get a good idea of the scope of the changes, but will need some polishing.
>> 
>> Comments?
>> 
>> Thanks,
>> Norbert
>> 
>> [1] https://mail.mozilla.org/pipermail/es-discuss/2012-February/020721.html
