Full Unicode strings strawman

Allen Wirfs-Brock allen at wirfs-brock.com
Tue May 17 11:24:28 PDT 2011


On May 17, 2011, at 10:43 AM, Boris Zbarsky wrote:

> On 5/17/11 1:40 PM, Brendan Eich wrote:
>> On May 17, 2011, at 10:37 AM, Boris Zbarsky wrote:
>> 
>>> On 5/17/11 1:27 PM, Brendan Eich wrote:
>>>> On May 17, 2011, at 10:22 AM, Boris Zbarsky wrote:
>>>> 
>>>>> Yes.  And right now that's how it works and actual JS authors typically don't have to worry about encoding issues.  I don't agree with Allen's claim that "in the long run JS in the browser is going to have to be able to deal with arbitrary encodings".  Having the _capability_ might be nice, but forcing all web authors to think about it seems like a non-starter.
>>>> 
>>>> Allen said "be able to", not "forcing". Big difference. I think we three at least are in agreement here.
>>> 
>>> I think we're in agreement on the sentiment, but perhaps not on where on the "able to" to "forcing" spectrum this strawman falls.
>> 
>> Where do you read "forcing"? Not in the words you cited.
> 
> In the substance of having strings in different encodings around at the same time.  If that doesn't force developers to worry about encodings, what does, exactly?

This already occurs in JS.  For example, the encodeURI function produces a string whose characters are the UTF-8 encoding of a UTF-16 string (including recognition of surrogate pairs).  So we have at least three string encodings that are explicitly dealt with in the ES spec:  UCS-2 without surrogate pair recognition (what string literals produce and string methods process), UTF-16 with surrogate pairs (produced by decodeURI and, in browser reality, also returned as "DOMStrings"), and UTF-8 (produced by encodeURI).  Any JS program that uses encodeURI/decodeURI or retrieves strings from the DOM should worry about such encoding differences, particularly if it combines strings produced from these different sources.
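
A minimal sketch of those three views, using only ES5 built-ins. The sample character (U+1D11E, written as a surrogate pair) and the variable names are illustrative assumptions, not something taken from the spec or the strawman:

```javascript
// U+1D11E MUSICAL SYMBOL G CLEF, written as a surrogate pair in an ES5 literal.
var clef = "\uD834\uDD1E";

// UCS-2 view: string methods count and index individual 16-bit units.
var unitCount = clef.length;                      // 2
var firstUnit = clef.charCodeAt(0).toString(16);  // "d834"

// UTF-16 to UTF-8 view: encodeURI pairs the surrogates and emits the
// UTF-8 bytes of the code point as percent escapes.
var escaped = encodeURI(clef);                    // "%F0%9D%84%9E"

// decodeURI turns those UTF-8 bytes back into a surrogate pair.
var roundTrips = decodeURI(escaped) === clef;     // true

// A lone surrogate is a legal string value but not valid UTF-16,
// so encodeURI rejects it with a URIError.
var loneSurrogateThrows = false;
try {
  encodeURI(clef.charAt(0));
} catch (e) {
  loneSurrogateThrows = e instanceof URIError;
}
```

The last step is where mixing sources can bite: slicing in the 16-bit view can leave half of a pair behind, and the UTF-16-aware functions then refuse the result.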

> 
>>>> I mean UTF-16 flowing through, but as you say that happens now -- but (I reply) only if JS doesn't mess with things in a UCS-2 way (indexing 16-bits at a time, ignoring surrogates). And JS code does generally assume 16 bits are enough.
>>>> 
>>>> With Allen's proposal we'll finally have some new APIs for JS developers to use.
>>> 
>>> That doesn't answer my questions....
>> 
>> Ok, full Unicode means non-BMP characters not being wrongly treated as two uint16 units and miscounted, separated or partly deleted by splicing and slicing, etc.
>> 
>> IOW, JS grows to treat strings as "full Unicode", not uint16 vectors. This is a big deal!
> 
> OK, but still allows sticking non-Unicode gunk into the strings, right?  So they're still vectors of "something".  Whatever that something is.

Conceptually unsigned 32-bit values. The actual internal representation is likely to be something else. Interpretation of those values is left to the functions (both built-in and application) that operate upon them.  Most built-in string methods do not apply any interpretation and will happily process strings as vectors of arbitrary uint32 values.  Some built-ins (encodeURI/decodeURI, toUpperCase/toLowerCase) explicitly deal with Unicode characters or various Unicode encodings, and these have to be explicitly defined to deal with non-Unicode character values or invalid encodings.  These functions are already defined for ES5 in this manner WRT the representation of strings as vectors of arbitrary uint16 values.
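
As a rough illustration of the ES5 situation described above (vectors of arbitrary uint16 values), the sketch below puts an unpaired surrogate in a string and runs it through uninterpreted methods, then through an application-level function that applies a UTF-16 interpretation. The helper toCodePoints is hypothetical, written only for this example; it is not part of ES5 or of the strawman:

```javascript
// An unpaired low surrogate is not a Unicode character, but it is a
// perfectly legal string element.
var gunk = "a" + String.fromCharCode(0xDC00) + "b";

// Uninterpreted methods just move 16-bit values around.
var middleUnit = gunk.slice(1, 2).charCodeAt(0);  // 0xDC00
var indexOfB = gunk.indexOf("b");                 // 2
var pieces = gunk.split("").length;               // 3

// Hypothetical application-level helper: reads the units as UTF-16,
// combines valid surrogate pairs into one code point, and passes any
// unpaired surrogate through unchanged.
function toCodePoints(str) {
  var out = [];
  for (var i = 0; i < str.length; i++) {
    var hi = str.charCodeAt(i);
    if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < str.length) {
      var lo = str.charCodeAt(i + 1);
      if (lo >= 0xDC00 && lo <= 0xDFFF) {
        out.push(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00));
        i++; // skip the low surrogate that was just consumed
        continue;
      }
    }
    out.push(hi);
  }
  return out;
}

var paired = toCodePoints("a\uD834\uDD1Eb");  // [0x61, 0x1D11E, 0x62]
var unpaired = toCodePoints(gunk);            // [0x61, 0xDC00, 0x62]
```

Passing the unpaired value through is one possible policy; a function could equally well throw, as encodeURI does. The point is only that each interpreting function has to pick one.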

Allen
