Re: Question about the “full Unicode in strings” strawman

Allen Wirfs-Brock allen at wirfs-brock.com
Wed Jan 25 11:55:20 PST 2012


On Jan 25, 2012, at 10:59 AM, Gillam, Richard wrote:

>> The current 16-bit character strings are sometimes used to store non-Unicode binary data and can be used with non-Unicode character encodings with up to 16-bit chars.  21 bits is sufficient for Unicode but perhaps is not enough for other useful encodings. 32-bit seems like a plausible unit.
> 
> How would an eight-digit \u escape sequence work from an implementation standpoint?  I'm assuming most implementations right now use 16-bit unsigned values as the individual elements of a String.  If we allow arbitrary 32-bit values to be placed into a String, how would you make that work?  There seem to only be a few options:
> 
> a) Change the implementation to use 32-bit units.
> 
> b) Change the implementation to use either 16-bit or 32-bit units as needed, with some sort of internal flag that specifies the unit size for an individual string.
> 
> c) Encode the 32-bit values somehow as a sequence of 16-bit values.
> 
> If you want to allow full generality, it seems like you'd be stuck with option a or option b.  Is there really enough value in doing this?

This issue is addressed to some extent in the "possible implementation impacts" section of the proposal: http://wiki.ecmascript.org/doku.php?id=strawman:support_full_unicode_in_strings#possible_implementation_impacts

My assumption is that most implementations would choose b, although the others would all be valid implementation approaches.  Note that some implementations already use multiple alternative internal string representations in order to optimize various scenarios.
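
As a rough illustration of option b, here is a minimal sketch of a string-like value that chooses its unit width per string. It is not taken from the strawman or from any engine: FlexString, its "wide" flag, and its "units" field are hypothetical names, and typed arrays merely stand in for an engine's internal buffers.

```javascript
// Hypothetical sketch of option (b): each string picks its own unit width.
// FlexString, wide, and units are illustrative names only (an assumption,
// not part of the proposal or of any real implementation).
function FlexString(codePoints) {
  // Fall back to 32-bit storage only if some element does not fit in 16 bits.
  var wide = codePoints.some(function (cp) { return cp > 0xFFFF; });
  this.wide = wide; // the internal "unit size" flag
  this.units = wide ? new Uint32Array(codePoints) : new Uint16Array(codePoints);
  this.length = codePoints.length; // one element per character either way
}

// Element access looks the same regardless of the underlying width.
FlexString.prototype.charCodeAt = function (i) {
  return this.units[i];
};

var narrow = new FlexString([0x48, 0x69]);    // fits in 16-bit units
var mixed = new FlexString([0x48, 0x10FFFF]); // needs 32-bit units
```

Strings that contain only BMP characters keep the 16-bit layout in this sketch, so only strings that actually hold larger values pay for the wider units.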
> 
> If, on the other hand, the idea is just to make it easier to include non-BMP Unicode characters in strings, you can accomplish this by making a long \u sequence just be shorthand for the equivalent sequence in UTF-16:  \u10ffff would be exactly equivalent to \udbff\udfff.  You don't have to change the internal format of the string, the indexes of individual characters stay the same, etc.

The primary intent of the proposal was to extend ES Strings to support a uniform representation of all Unicode characters, including non-BMP ones.  That means that any Unicode character should occupy exactly one element position within a String value.  Interpreting \u{10ffff} as a UTF-16 encoding does not satisfy that objective.  In particular, under that approach "\u{10ffff}".length would be 2, while a uniform character representation should yield a length of 1.
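
The length difference can be made concrete with the surrogate-pair arithmetic. This is only a sketch: toSurrogatePair is a hypothetical helper name, and the Uint32Array is an assumed stand-in for a uniform one-element-per-character representation, not something the proposal specifies.

```javascript
// Hypothetical helper: split a supplementary code point (above 0xFFFF)
// into the two 16-bit values that a UTF-16 interpretation would store.
function toSurrogatePair(cp) {
  var offset = cp - 0x10000;           // 20-bit value
  var high = 0xD800 + (offset >> 10);  // top 10 bits
  var low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return [high, low];
}

var pair = toSurrogatePair(0x10FFFF);  // [0xDBFF, 0xDFFF]
var utf16String = String.fromCharCode(pair[0], pair[1]);
var utf16Length = utf16String.length;  // 2: two 16-bit elements

// Stand-in for a uniform representation: one element per character.
var uniformUnits = new Uint32Array([0x10FFFF]);
var uniformLength = uniformUnits.length; // 1
```

The pair computed here matches the \udbff\udfff sequence quoted above, which is why the UTF-16 interpretation reports two elements for a single character.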

When this proposal was originally floated, much of the debate seemed to be about whether such a uniform character representation was desirable or even useful.  See the thread starting at https://mail.mozilla.org/pipermail/es-discuss/2011-May/014252.html, also https://mail.mozilla.org/pipermail/es-discuss/2011-May/014316.html and

Allen
