Full Unicode strings strawman

Wes Garland wes at page.ca
Tue May 17 12:29:39 PDT 2011


On 17 May 2011 14:39, Boris Zbarsky <bzbarsky at mit.edu> wrote:

> On 5/17/11 2:12 PM, Wes Garland wrote:
>
>> That said, you can encode these code points with utf-8; for example,
>> 0xdc08 becomes 0xed 0xb0 0x88.
>>
>
> By the same argument, you can encode them in UTF-16.  The byte sequence
> above is not valid UTF-8.  See "How do I convert an unpaired UTF-16
> surrogate to UTF-8?" at http://unicode.org/faq/utf_bom.html which says:
>

You are comparing apples and oranges. Which happen to look a lot alike. So
maybe apples and nectarines.

But the point remains: the FAQ entry you quote talks about encoding a lone
surrogate, i.e. a code unit, which is not a complete code point. You can
only convert complete code points from one encoding to another. Just like
you can't represent part of a UTF-8 code sub-sequence in any other encoding.
The fact that code point X is not representable in UTF-16 has no bearing on
its status as a code point, nor its convertibility to UTF-8.  The problem is
that UTF-16 cannot represent all possible code points.
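
To make the arithmetic behind the 0xed 0xb0 0x88 example concrete, here is
a minimal sketch of the three-byte UTF-8 bit packing. The function name
encodeThreeByte is hypothetical, and the sketch assumes no surrogate check
is applied, which is the very point under dispute; standard UTF-8 encoders
reject the range 0xd800-0xdfff.

```javascript
// Generalized UTF-8 bit packing for a code point in U+0800..U+FFFF.
// Assumption: no surrogate-range check is made here. A conforming UTF-8
// encoder would refuse 0xD800..0xDFFF instead of packing the bits.
function encodeThreeByte(codePoint) {
  if (codePoint < 0x0800 || codePoint > 0xffff) {
    throw new RangeError("only the three-byte range is handled in this sketch");
  }
  return [
    0xe0 | (codePoint >> 12),         // 1110xxxx: top 4 bits
    0x80 | ((codePoint >> 6) & 0x3f), // 10xxxxxx: middle 6 bits
    0x80 | (codePoint & 0x3f),        // 10xxxxxx: low 6 bits
  ];
}

const bytes = encodeThreeByte(0xdc08);
console.log(bytes.map((b) => b.toString(16)).join(" ")); // ed b0 88
```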


> See above.  You're allowing number arrays that may or may not be
> interpretable as Unicode strings, period.
>

No, I'm not.  Any sequence of Unicode code points is a valid Unicode string.
It does not matter whether any of those code points are reserved, nor does
it matter if it can be represented in all encodings.

From page 90 of the Unicode 6.0 specification, in the Conformance chapter:

> *D80 Unicode string:* A code unit sequence containing code units of a
> particular Unicode encoding form.
> • In the rawest form, Unicode strings may be implemented simply as arrays
> of the appropriate integral data type, consisting of a sequence of code
> units lined up one immediately after the other.
> • A single Unicode string must contain only code units from a single
> Unicode encoding form. It is not permissible to mix forms within a string.
>


> Not sure what "(D80)" is supposed to mean.
>

Sorry, "(D80)" means "per definition D80 of The Unicode Standard, Version
6.0".


>
> This hypothesis is worth testing before being blindly inflicted on the web.
>
>
I don't think anybody in this discussion is talking about blindly inflicting
anything on the web.  I *do* think this proposal is a good one, and
certainly a better way forward than insisting that every JS developer,
everywhere, understand and implement (over and over again) the details of
encoding Unicode as UTF-16. Allen's point about URI escaping is right on
target here.


>
>> If we redefine JS Strings to be arrays of Unicode code points, then the
>> JS application can use JS Strings as arrays of uint21 -- but round-tripping
>> the high-surrogate code points through a UTF-16 layer would not work.
>>
>
> OK, that seems like a breaking change.
>

Yes, I believe it would be, certainly if done naively, but I am hopeful
somebody can figure out how to overcome this.  Hopeful because I think that
fixing the JS Unicode problem is a really big deal. "What happens if the guy
types a non-BMP character?" is a question which should not have to be
answered over and over again in every code review.  And I still maintain
that 99.99% of JS developers never give it a first, let alone a second,
thought.
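
As an illustration of the non-BMP question, here is a short sketch of what
a UTF-16-based JS String does with U+1F600 (an arbitrary character chosen
for the example). It assumes an engine with the ES2015 code point APIs
(code point escapes, Array.from, codePointAt), which postdate this thread.

```javascript
// U+1F600 lies outside the BMP, so UTF-16 stores it as a surrogate pair.
const s = "\u{1F600}";

console.log(s.length);                      // 2 (code units, not characters)
console.log(s.charCodeAt(0).toString(16));  // d83d (high surrogate)
console.log(s.charCodeAt(1).toString(16));  // de00 (low surrogate)

// Code-point-aware APIs (ES2015 and later) see a single character.
console.log(Array.from(s).length);          // 1
console.log(s.codePointAt(0).toString(16)); // 1f600

// Naive truncation splits the pair and leaves a lone surrogate behind.
const truncated = s.slice(0, 1);
console.log(truncated.charCodeAt(0).toString(16)); // d83d
```

Any code that indexes, slices, or counts by code unit has to account for
this by hand, which is the review burden described above.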

> Maybe, and maybe not.  We (Mozilla) have had some proposals to actually use
> UTF-8 throughout, including in the JS engine; it's quite possible to
> implement an API that looks like a 16-bit array on top of UTF-8 as long as
> you allow invalid UTF-8 that's needed to represent surrogates and the like.
>

I take this to mean that, in the Moz proposals, the "invalid" UTF-8
sequences are actually valid UTF-8 Strings which encode code points in the
range 0xd800-0xdfff, and that these code points are translated directly
(and purposefully incorrectly) into UTF-16 code units when viewed as 16-bit
arrays.

If JS Strings were arrays of Unicode code points, this conversion would be a
non-issue; UTF-8 sequence 0xed 0xb0 0x88 becomes Unicode code point 0xdc08,
with no incorrect conversion taking place.  The only problem is if there is
an intermediate component somewhere that insists on using UTF-16; at that
point we just can't represent code point 0xdc08 at all.  But that code point
will never appear in text; it will only appear for users using the String to
store arbitrary data, and their need has already been met.
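
A rough sketch of that conversion and of where a UTF-16 layer gets stuck.
Both function names are hypothetical, and the decoder deliberately skips
the surrogate-range check that a conforming UTF-8 decoder would perform.

```javascript
// Inverse of the three-byte UTF-8 bit packing. Assumption: as in the text
// above, no surrogate-range check is performed.
function decodeThreeByte(b0, b1, b2) {
  return ((b0 & 0x0f) << 12) | ((b1 & 0x3f) << 6) | (b2 & 0x3f);
}

// Hypothetical intermediate UTF-16 layer: it refuses code points that
// UTF-16 cannot represent rather than passing them through as code units.
function toUtf16(codePoint) {
  if (codePoint >= 0xd800 && codePoint <= 0xdfff) {
    throw new RangeError("surrogate code point has no UTF-16 encoding");
  }
  if (codePoint < 0x10000) return [codePoint];
  const v = codePoint - 0x10000;
  return [0xd800 + (v >> 10), 0xdc00 + (v & 0x3ff)];
}

const cp = decodeThreeByte(0xed, 0xb0, 0x88);
console.log(cp.toString(16)); // dc08

try {
  toUtf16(cp);
} catch (e) {
  console.log(e.message); // surrogate code point has no UTF-16 encoding
}
```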

Wes

-- 
Wesley W. Garland
Director, Product Development
PageMail, Inc.
+1 613 542 2787 x 102