Full Unicode strings strawman

Shawn Steele Shawn.Steele at microsoft.com
Tue May 17 11:09:29 PDT 2011


I would much prefer changing "UCS-2" to "UTF-16", thus formalizing that surrogate pairs are permitted.  That change would be very unlikely to break any existing code and would still allow representation of everything reasonable in Unicode.

That would enable full Unicode, and allow extending string literals and regular expressions, for convenience, with a U+10FFFF style notation (which would be equivalent to the corresponding surrogate pair).  The character code manipulation functions could be similarly augmented without breaking anything (and perhaps without needing different names?)
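
To make the "equivalent to the surrogate pair" point concrete, here is a minimal sketch of the standard UTF-16 arithmetic that maps a supplementary code point to its two code units. The function name toSurrogatePair is hypothetical and only for illustration; it is not part of any proposal in this thread.

```javascript
// Hypothetical helper (not an existing or proposed API): returns the
// two-code-unit string that UTF-16 uses for a supplementary code point.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("not a supplementary code point: " + codePoint);
  }
  var offset = codePoint - 0x10000;        // 20-bit value
  var high = 0xD800 + (offset >> 10);      // top 10 bits -> high surrogate
  var low = 0xDC00 + (offset & 0x3FF);     // bottom 10 bits -> low surrogate
  return String.fromCharCode(high, low);
}

// U+10FFFF would be the same string as "\uDBFF\uDFFF".
var s = toSurrogatePair(0x10FFFF);
```

Under this reading, a literal written in the U+10FFFF style notation would produce exactly the string above, so existing code that already handles surrogate pairs would see nothing new.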

You might want to qualify the UTF-16 as allowing, but strongly discouraging, lone surrogates for those people who didn't realize their binary data wasn't a string.
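
As a rough illustration of what "lone surrogate" means here, the sketch below scans a string for surrogate code units that are not part of a well-formed high/low pair. The name hasLoneSurrogate is made up for this example.

```javascript
// Hypothetical helper: true if the string contains a surrogate code unit
// that is not part of a well-formed high/low pair.
function hasLoneSurrogate(str) {
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // High surrogate: must be followed immediately by a low surrogate.
      var next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // skip the low half of a valid pair
        continue;
      }
      return true;
    }
    if (unit >= 0xDC00 && unit <= 0xDFFF) {
      return true; // low surrogate with no preceding high surrogate
    }
  }
  return false;
}
```

A string for which this returns true would still be permitted under the qualification above; it is just a hint that the data may not really be text.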

The sole disadvantage would be that iterating through a string would require consideration of surrogates, same as today.  The same caution is also necessary to avoid splitting Ä (U+0041 U+0308) into its component A and ̈ (combining diaeresis) parts.  I wouldn't be opposed to some sort of helper functions or classes that aided in walking strings, preferably with options to walk the graphemes (or whatever), not just the surrogate pairs.  FWIW: we have such a helper for surrogates in .NET and "nobody uses it".  The most common feedback is that it's not that helpful because it doesn't deal with the graphemes.
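
For illustration only, a surrogate-aware walk over a string might look like the sketch below. The function name codePoints is hypothetical, and passing lone surrogates through unchanged is an assumption of this example, not something specified in this thread.

```javascript
// Hypothetical helper: returns the code points of a string, combining each
// well-formed surrogate pair into one value. Lone surrogates are passed
// through as-is (an assumption made for this sketch).
function codePoints(str) {
  var result = [];
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < str.length) {
      var next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        result.push(0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00));
        i++; // consume the low surrogate as well
        continue;
      }
    }
    result.push(unit);
  }
  return result;
}

var pair = codePoints("A\uD83D\uDE00");   // two code points: A and U+1F600
var accented = codePoints("A\u0308");     // still two code points
```

Note that the second call still yields two entries for what a reader sees as a single character, which is the limitation described above: walking surrogate pairs does not help with graphemes.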

- Shawn

Shawn.Steele at Microsoft.com
Senior Software Design Engineer
Microsoft Windows


