Full Unicode strings strawman

Phillips, Addison addison at lab126.com
Tue May 17 12:00:45 PDT 2011

Note: The W3C Internationalization Core WG published a set of "requirements" in this area for consideration by ES some time ago. It lives here:


The section on 'locale related behavior' is being separately addressed.

I think that:

1. Changing references from UCS-2 to UTF-16 makes sense, although the spec, IIRC, already *says* UTF-16.
2. Allowing unpaired surrogates is a *requirement*. Yes, such a string is "ill-formed", but there are too many cases in which one might wish to have such "broken" strings for scripting purposes.
3. We should have escape syntax for supplementary characters (such as \U0010000 for U+10000). Looking up the surrogate pair for a given Unicode character is extremely inconvenient and is not self-documenting.
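
To illustrate the inconvenience in point 3, here is a minimal sketch of the arithmetic a script author has to do by hand today. The function name toSurrogatePair is hypothetical (it is not part of any proposal in this thread) and is shown only to make the lookup concrete:

```javascript
// Hypothetical helper, for illustration only: converts a supplementary
// code point (U+10000..U+10FFFF) into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("not a supplementary code point: " + codePoint);
  }
  var offset = codePoint - 0x10000;     // 20-bit value
  var high = 0xD800 + (offset >> 10);   // top 10 bits -> high surrogate
  var low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits -> low surrogate
  return String.fromCharCode(high, low);
}

var pair = toSurrogatePair(0x10000);
// pair is "\ud800\udc00" and pair.length is 2 (two code units)
```

An escape that names the code point directly would let the literal carry this information itself instead of hiding it behind two opaque \u escapes.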

As Shawn notes, basically, there are three ways that one might wish to access strings:

- as grapheme clusters (visual units of text)
- as Unicode scalar values (logical units of text, i.e. characters)
- as code units (encoding units of text)

The example I use in the Unicode conference internationalization tutorial is a box on a Web site with an ES-controlled message underneath it saying "You have 200 characters remaining."

I think it is instructive to look at how Java managed this transition. In some cases the "200" represents the number of storage units I have available (as in my backing database), in which case String.length is what I probably want. In some cases I want to know how many Unicode characters there are (Java solves this with the codePointCount(), codePointBefore(), and codePointAt() methods). These are relatively rare operations, but they have occasional utility. Or I may want grapheme clusters (Java attempts to solve this with BreakIterators, and I tend to favor doing the same thing in JavaScript: default grapheme clusters are better than nothing, but language-specific grapheme clusters are more useful).
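
As a rough sketch of the difference between the first two counts, the following shows code units versus code points for the same string. The helper name codePointCount is borrowed from the Java method mentioned above; this JavaScript version is hypothetical and only assumes that a well-formed surrogate pair counts once and a lone surrogate counts as one unit:

```javascript
// Hypothetical helper modeled loosely on Java's codePointCount():
// counts code points, treating an unpaired surrogate as one code point.
function codePointCount(str) {
  var count = 0;
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    // A high surrogate followed by a low surrogate is one code point.
    if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < str.length) {
      var next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // skip the low half of the pair
      }
    }
    count++;
  }
  return count;
}

var message = "a\ud800\udc00b";           // "a", U+10000, "b"
var codeUnits = message.length;           // 4: what a storage limit sees
var codePoints = codePointCount(message); // 3: what "characters" often means
```

Neither number is a grapheme cluster count: "A\u0308" is two code units and two code points but one visual unit, which is why a BreakIterator-style facility is a separate question.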

If we follow the above, providing only minimal additional methods for accessing codepoints when necessary, this also limits the impact of adding supplementary character support to the language. Regex probably works the way one supposes (both \U0010000 and \ud800\udc00 find the surrogate pair \ud800\udc00 and one can still find the low surrogate \udc00 if one wishes to). And existing scripts will continue to function without alteration. However, new scripts can be written that use supplementary characters.
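
A small sketch of the regex behavior described above, using only the existing \u escapes (the proposed longer escape is not assumed to exist). The variable names are illustrative:

```javascript
var sample = "x\ud800\udc00y";  // "x", U+10000 as a surrogate pair, "y"

// Searching for the pair finds it at the high surrogate's position.
var pairIndex = sample.search(/\ud800\udc00/);

// Searching for the low surrogate alone still works at the code unit level.
var lowIndex = sample.search(/\udc00/);

// Existing code that indexes by code unit sees the same values as before.
var highUnit = sample.charCodeAt(1);
```

Here pairIndex is 1, lowIndex is 2, and highUnit is 0xD800, which is the sense in which existing scripts keep functioning without alteration.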



Addison Phillips
Globalization Architect (Lab126)
Chair (W3C I18N WG)

Internationalization is not a feature.
It is an architecture.

> -----Original Message-----
> From: Shawn Steele [mailto:Shawn.Steele at microsoft.com]
> Sent: Tuesday, May 17, 2011 11:09 AM
> To: Brendan Eich; Boris Zbarsky
> Cc: es-discuss
> Subject: RE: Full Unicode strings strawman
> I would much prefer changing "UCS-2" to "UTF-16", thus formalizing that
> surrogate pairs are permitted.  That would be very unlikely to break any existing
> code and would still allow representation of everything reasonable in Unicode.
> That would enable Unicode, and allow extending string literals and regular
> expressions for convenience with the U+10FFFF style notation (which would be
> equivalent to the surrogate pair).  The character code manipulation functions
> could be similarly augmented without breaking anything (and maybe not
> needing different names?)
> You might want to qualify the UTF-16 as allowing, but strongly discouraging,
> lone surrogates for those people who didn't realize their binary data wasn't a
> string.
> The sole disadvantage would be that iterating through a string would require
> consideration of surrogates, same as today.  The same caution is also necessary
> to avoid splitting Ä (U+0041 U+0308) into its component A and U+0308 (combining diaeresis) parts.  I
> wouldn't be opposed to some sort of helper functions or classes that aided in
> walking strings, preferably with options to walk the graphemes (or whatever),
> not just the surrogate pairs.  FWIW: we have such a helper for surrogates
> in .Net and "nobody uses them".  The most common feedback is that it's not
> that helpful because it doesn't deal with the graphemes.
> - Shawn
> Shawn.Steele at Microsoft.com
> Senior Software Design Engineer
> Microsoft Windows
