Full Unicode strings strawman
bzbarsky at MIT.EDU
Tue May 17 11:51:42 PDT 2011
On 5/17/11 2:24 PM, Allen Wirfs-Brock wrote:
>> In the substance of having strings in different encodings around at
>> the same time. If that doesn't force developers to worry about
>> encodings, what does, exactly?
> This already occurs in JS. For example, the encodeURI function produces
> a string whose characters are the UTF-8 encoding of a UTF-16 string
> (including recognition of surrogate pairs).
Last I checked, encodeURI output a pure ASCII string. Am I just missing
something? The ASCII string happens to be the %-escaping of the UTF-8
representation of the Unicode string you get by assuming that the
initial JS string is a UTF-16 representation of said Unicode string.
But at no point here is the author dealing with UTF-8.
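
To make the distinction concrete, here is a minimal sketch (the variable names are mine, not from the thread). It shows that encodeURI returns an ASCII-only string, even though the bytes being %-escaped are the UTF-8 form of the code point:

```javascript
// U+1D306 lies outside the BMP, so the JS string stores it as the
// surrogate pair D834 DF06 (two 16-bit units).
const input = "\uD834\uDF06";
const encoded = encodeURI(input);

// True if every unit of the result is in the ASCII range.
const isAscii = /^[\x00-\x7F]*$/.test(encoded);

console.log(input.length);   // 2
console.log(encoded);        // %F0%9D%8C%86
console.log(isAscii);        // true
```

The script only ever sees 16-bit units going in and ASCII coming out; the UTF-8 bytes F0 9D 8C 86 appear solely as escaped text.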
>> OK, but still allows sticking non-Unicode gunk into the strings,
>> right? So they're still vectors of "something". Whatever that
>> something is.
> Conceptually unsigned 32-bit values. The actual internal representation
> is likely to be something else.
I don't care about the internal representation; I'm interested in the
> Interpretation of those values is left to the functions (both built-in
> and application) that operate upon them.
OK. That includes user-written functions, of course, which currently
only have to deal with UTF-16 (and maybe UCS-2 if you want to be very
> Most built-in string methods do not apply any interpretation and will
> happily process strings as vectors of arbitrary uint32 values. Some
> built-ins (encodeURI/decodeURI, toUpperCase/toLowerCase) explicitly deal
> with Unicode characters or various Unicode encodings and these have to
> be explicitly defined to deal with non-Unicode character values or
> invalid encodings.
That seems fine. This is not where problems lie.
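
As a rough illustration of that split in today's uint16-based strings (a sketch only; `lone` and the other names are made up for the example): the generic methods process an unpaired surrogate without complaint, while encodeURI, which has to interpret the units as UTF-16, rejects it.

```javascript
// An unpaired high surrogate: a legal JS string, but not valid UTF-16.
const lone = "a\uD800b";

// Generic methods treat the string as a vector of 16-bit values.
const len = lone.length;            // 3
const unit = lone.charCodeAt(1);    // 0xD800
const idx = lone.indexOf("b");      // 2

// encodeURI must interpret the units as UTF-16, so it throws on the
// lone surrogate instead of producing output.
let errorName = null;
try {
  encodeURI(lone);
} catch (e) {
  errorName = e.name;               // "URIError"
}

console.log(len, unit.toString(16), idx, errorName);
```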
> These functions already are defined for ES5 in this
> manner WRT the representation of strings as vectors of arbitrary uint16