Full Unicode based on UTF-16 proposal

Norbert Lindenberg ecmascript at norbertlindenberg.com
Sun Mar 25 23:11:47 PDT 2012


Perfectly valid concerns.

My thinking here is that normally applications want to deal with code points, but we force them to deal with UTF-16 and additional flags because we need them for compatibility. Within modules, where we know that compatibility is not an issue, I'd rather give applications by default what they need.
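
To illustrate the gap between the two views of a string, here is a minimal sketch. It assumes an engine with the ECMAScript 2015 additions `String.prototype.codePointAt` and code-point-based string iteration, neither of which existed when this message was written; the variable names are made up for the example.

```javascript
// U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP,
// so UTF-16 stores it as a surrogate pair of two code units.
const gClef = "\uD834\uDD1E";

// Code unit view: what length and charCodeAt expose.
const codeUnitCount = gClef.length;          // 2
const firstCodeUnit = gClef.charCodeAt(0);   // 0xD834, a lone high surrogate

// Code point view (assumes ES2015+): iterating a string yields
// whole code points, so Array.from produces one element here.
const codePoints = Array.from(gClef, (ch) => ch.codePointAt(0));
const codePointCount = codePoints.length;    // 1

console.log(codeUnitCount, codePointCount, codePoints[0].toString(16));
```

An application that counts or slices "characters" using the first view has to handle the surrogate pair itself; the second view is what most applications actually want.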

Looking back at Java, supporting supplementary characters was fairly painless for many applications despite UTF-16 because Java already had a rich API performing all kinds of operations on strings, so many applications had little need to look at individual characters in the first place. We went through the entire Java SE API and fixed all those operations to use code point semantics (look for "under the hood" at [1] for details). We were also able to switch regular expressions to code point semantics without any flags because regular expressions never worked on binary data and developers hadn't created funky workarounds to support supplementary characters yet. JavaScript today has more constraints, but for new development it would still be good to get as close as possible to that experience.
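
The difference for regular expressions can be sketched in JavaScript as follows. This assumes the `u` flag behaves as it was later standardized in ECMAScript 2015 (in this thread it is still only a proposal), and the surrogate-range pattern is just one plausible form of the workaround mentioned above, not one taken from any particular code base.

```javascript
// One supplementary character: U+1D11E, stored as two UTF-16 code units.
const supplementary = "\uD834\uDD1E";

// Code unit semantics: "." consumes a single code unit,
// so a pattern for "exactly one character" fails on the pair.
const matchesAsCodeUnits = /^.$/.test(supplementary);

// Code point semantics (assumes the ES2015 "u" flag):
// "." consumes the whole surrogate pair as one character.
const matchesAsCodePoints = /^.$/u.test(supplementary);

// A typical workaround without the flag: spell out
// "high surrogate followed by low surrogate" by hand.
const matchesWorkaround =
  /^[\uD800-\uDBFF][\uDC00-\uDFFF]$/.test(supplementary);

console.log(matchesAsCodeUnits, matchesAsCodePoints, matchesWorkaround);
```

Making code point matching the default within modules would give the second behavior without the flag, while patterns like the third are the existing code that a silent switch in non-module code could break.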

Norbert

[1] http://java.sun.com/developer/technicalArticles/Intl/Supplementary/


On Mar 24, 2012, at 23:56 , David Herman wrote:

> On Mar 24, 2012, at 4:32 PM, Norbert Lindenberg wrote:
> 
>> One concern: I think code point based matching should be the default for regex literals within modules (where we know the code is written for Harmony).
> 
> This idea makes me nervous. Partly because I think we should keep the set of semantic changes between non-module code and module code reasonably small, and partly because the idea of your proposal is to continue to treat strings as sequences of 16-bit code units, not Unicode code points, which means that quietly switching regexps to be closer to operating at the level of code points seems like it creates a kind of impedance mismatch. It feels more appropriate to me to require programmers to declare explicitly that they're dealing with a string at the level of code points, using the (quite concise) /u flag. That way they're saying "yes, I know this string is just a sequence of 16-bit code units, but it may contain non-BMP data, and I would like to match its contents with a regexp that deals with code points."
> 
> (Again, I'm still new to the finer points of Unicode, so I'm prepared to be shown I'm thinking about it wrong.)
> 
> Dave
> 


