Full Unicode strings strawman

Shawn Steele Shawn.Steele at microsoft.com
Thu May 19 10:27:50 PDT 2011


> > - Java kept their strings encoded exactly as they were (a sequence of 16-bit code units) and provided extra APIs for the cases where you want to extract a code point.

> Bloaty.

? Defining UTF-16 instead of UCS-2 introduces zero bloat.  In fact, it pretty much works that way already; it's just not "official".  The only "bloat" would be helpers to handle 21-bit forms, if desired.  OTOH, replacing UCS-2 with 21-bit Unicode would require new functions to avoid breaking stuff that works today.  Indeed, I believe that handling 21-bit forms without breaking anything would require more bloat than UTF-16 helpers could possibly add.
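For concreteness, here is roughly what such a helper might look like over a UTF-16 string (codePointAt is a hypothetical name, not an existing ECMAScript API):

    // Sketch of a "21-bit helper" over a plain UTF-16 string.
    // Hypothetical name; not an existing ECMAScript API.
    function codePointAt(s, i) {
      var hi = s.charCodeAt(i);
      if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < s.length) {
        var lo = s.charCodeAt(i + 1);
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
          // Combine the surrogate pair into one supplementary code point.
          return (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000;
        }
      }
      return hi; // BMP code unit (or an unpaired surrogate)
    }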

> This is analogous but different in degree. Going from bytes to Unicode characters is different from going from uint16 to Unicode by ~15 to 5 bits.

There is no "Unicode".  It's UTF-32 or UTF-16.  "Unicode" is an abstract collection of code points that have to be encoded in some form to be useful.  Unless you also want to propose a UTF-21, which immediately gets really scary to me.

> unformatted 16-bit data

I think this is the crux of the problem?  A desire to allow binary data that isn't Unicode to impersonate a string?  I'd much rather let those hacks continue to use UTF-16 than try to legitimize a hacky data store by forcing awkward behavior on character strings.

> The significant new APIs in Allen's proposal (apart from transcoding helpers that might be useful already, where JS hackers wrongly assume BMP only), are for generating strings from full Unicode code points: String.fromCode, etc.

Those are all possible even with UTF-16.  In fact, they're probably less of a breaking change, because you don't need a new API to get a string from a code point: the existing API works fine; it'd just emit a pair.  Only with UTF-32 do some of these become necessary.
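To illustrate, a helper that takes a full code point is only a few lines on top of the existing String.fromCharCode (fromCodePoint is a hypothetical name here, not an existing API):

    // Sketch: build a string from any code point using the existing API,
    // emitting a surrogate pair for supplementary characters.
    function fromCodePoint(cp) {
      if (cp <= 0xFFFF) return String.fromCharCode(cp);
      cp -= 0x10000;
      return String.fromCharCode(0xD800 + (cp >> 10),   // high surrogate
                                 0xDC00 + (cp & 0x3FF)); // low surrogate
    }
    fromCodePoint(0x10000); // "\uD800\uDC00", i.e. U+10000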

> We have the "non-BMP-chars encoded as pairs" problem already. The proposal does not increase its incidence, it merely adds ways to make strings that don't have this problem.

No, we don't.  They're already perfectly valid UTF-16, which most software already knows how to handle.  We don't need to add another way of describing a supplementary character, which would mean that everything that handles strings would have to test not only for the pair U+D800 U+DC00 but also realize that a single U+10000 element is equivalent to it.  That approach would introduce a terrifying number of encoding and conversion bugs.  (I own encodings at Microsoft; that kind of thing is truly evil and causes endless amounts of support pain.)
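Today there is exactly one representation, so naive code already works; the sketch below shows what a second representation would silently break:

    var s = "𐀀";           // LINEAR B SYLLABLE B008 A, U+10000
    s === "\uD800\uDC00";   // true today: there is one representation
    // Under a 21-bit model the same character could also appear as a
    // single 0x10000 element, and comparisons like the one above would
    // silently stop matching.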
 
> The other problem, of mixing strings-with-pairs and strings-without, is real. But it is not obvious to me why it's worse than duplicating all the existing string APIs. Developers still have to choose. Without a new "ustring" type they can still mix. Do duplicated full-Unicode APIs really pay their way?

Nothing needs to change.  At its most basic level, "all" that needs to be done is to change from UCS-2 to UTF-16, which, in practice, is pretty much a no-op.  Whether or not additional helpers are desirable is orthogonal.  Some, I think, are low-hanging fruit and don't cause much duplication (to/from code point).  Others are harder (walking a string at code point boundaries, although graphemes are more interesting than surrogate pairs anyway).
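A minimal sketch of the walking case, layered on plain UTF-16 (forEachCodePoint is an illustrative name, not an existing API):

    // Visit each code point in a UTF-16 string, joining surrogate pairs.
    function forEachCodePoint(s, f) {
      for (var i = 0; i < s.length; i++) {
        var cp = s.charCodeAt(i);
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < s.length) {
          var lo = s.charCodeAt(i + 1);
          if (lo >= 0xDC00 && lo <= 0xDFFF) {
            cp = (cp - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000;
            i++; // consumed two code units
          }
        }
        f(cp);
      }
    }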

> > 2. Widening characters to 21 bits doesn't really help much.  As stated earlier in this thread, you still want to treat clumps of combining characters together with the character to which they combine, worry about various normalized forms, etc.

> Do you? I mean: do web developers?
They "blindly" use UTF-8 or UTF-16 in their HTML and expect it to "just work".  My editor is going to show me a pretty glyph, not a surrogate pair, when I type the string anyway.

> You can use non-BMP characters in HTML today, pump them through JS, back into the DOM, and render them beautifully on all the latest browsers
Exactly - pretty much everyone treats JavaScript as handling UTF-16 already, and it "just works".  The only annoying parts are specifying a code point in a string literal or a regular expression.  Those could be easily extended, even exactly as Allen suggested, even if we're officially UTF-16.
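For example, such an escape could simply desugar to the surrogate pair (the \u{...} syntax below is hypothetical, not current ECMAScript):

    // A hypothetical literal escape for a supplementary code point,
    // desugaring to the pair in a UTF-16 string model:
    //   "\u{10000}"  would mean  "\uD800\uDC00"
    // Matching the character already works today, spelled as the pair:
    /\uD800\uDC00/.test("𐀀");   // true in current engines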

> How do you support non-BMP characters without API bloat? 
There is no bloat.  

> The crucial win of Allen's proposal comes down the road, when someone in a certain locale *can* do s.indexOf(nonBMPChar) and win.
s.indexOf("\U+10000"), who cares that it ends up as UTF-16?  You can already do it, today, with s.indexOf("𐀀").  It happens that 𐀀 looks like d800 + dc00, but it still works.  Today.  This is no different than most other languages.
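Concretely (runnable today; offsets are in UTF-16 code units):

    var s = "abc𐀀def";
    s.indexOf("𐀀");             // 3 -- works today
    s.indexOf("\uD800\uDC00");   // 3 -- same character, spelled as the pair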

-Shawn

