Full Unicode strings strawman

Shawn Steele Shawn.Steele at microsoft.com
Mon May 16 14:23:14 PDT 2011

I'm having some (ok, a great deal of) confusion between the DOM encoding, the JavaScript encoding, and whatever.  I'd assumed that if I had a web page in some encoding, it was converted to UTF-16 (well, UCS-2), and that's what the JavaScript engine did its work on.  I confess to not having done much encoding stuff in JS in the last decade.

In UTF-8, individually encoded surrogates are illegal (and a security risk).  E.g., you shouldn't be able to encode D800/DC00 as two 3-byte sequences; they should be a single 4-byte sequence.  Having not played with the JS encoding/decoding in quite some time, I'm not sure what they do in that case, but hopefully it isn't illegal UTF-8.  (You also shouldn't be able to have half a surrogate pair in UTF-16, but many things are pretty lax about that.)
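
For readers who want to check the case described above, here is a minimal sketch of what the existing URI encoder does with surrogates. The helper name tryEncode is made up for illustration; encodeURIComponent is the standard function.

```javascript
// Hypothetical helper: report what encodeURIComponent does with a string.
function tryEncode(s) {
  try {
    return encodeURIComponent(s);
  } catch (e) {
    // A lone surrogate has no valid UTF-8 form, so the encoder throws URIError.
    return e instanceof URIError ? 'URIError' : 'other error';
  }
}

const pair = tryEncode('\uD800\uDC00'); // U+10000 written as a surrogate pair
const lone = tryEncode('\uD800');       // unpaired high surrogate

console.log(pair); // one 4-byte UTF-8 sequence: %F0%90%80%80
console.log(lone); // URIError
```

So a well-formed pair comes out as a single 4-byte sequence rather than two 3-byte ones, and half a pair is rejected instead of being written out as illegal UTF-8.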


From: Allen Wirfs-Brock [mailto:allen at wirfs-brock.com]
Sent: Monday, May 16, 2011 12:53 PM
To: Shawn Steele
Cc: es-discuss at mozilla.org
Subject: Re: Full Unicode strings strawman

On May 16, 2011, at 11:34 AM, Shawn Steele wrote:

Thanks for making a strawman
(see my very last sentence below as it may impact the interpretation of some of the rest of these responses)

Unicode Escape Sequences
Is it possible for \u+ to accept either 4, 5, or 6 digit sequences?  Typically when I encounter U+ notation the leading zero is omitted, and I see BMP characters quite often.  Obviously BMP could use the \u notation, however it seems like it'd be annoying to the occasional user to know that \u is used for some and \u+ for others.  Seems like it'd be easier for developers to remember that \u+ is "the new way" and \u is "the old way that doesn't always work".

The ES string literal notation doesn't really accommodate variable-length subtokens without explicit terminators.  What would be the rules for parsing "\u+12345678"?  How do we know if the programmer meant "\u1234"+"5678" or "\u0012"+"345678" or ...

There have been past proposals for a syntax like \u{xxxxxx} that could have 1 to 6 hex digits.  In the past proposal the assumption was that it would produce UTF-16 surrogate pairs, but in this context we could adopt it instead of \u+ to produce a single character.  The disadvantage is that it is a slightly long sequence for actual large code points.  On the other hand, perhaps it is more readable?  "\u+123456\u+123456" vs. "\u{123456}\u{123456}" ??
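
The parsing problem can be made concrete with a small sketch. Both functions below are hypothetical illustrations, not part of the strawman: candidateReadings lists every way to read the digits after "\u+" if 4, 5, or 6 digits were all allowed, and bracedReading shows that an explicit "}" terminator leaves exactly one reading.

```javascript
// Hypothetical: every reading of the digits after "\u+" when 4, 5, or 6
// hex digits are all acceptable lengths.
function candidateReadings(digits) {
  const readings = [];
  for (const n of [4, 5, 6]) {
    if (digits.length >= n) {
      readings.push({
        codePoint: parseInt(digits.slice(0, n), 16),
        rest: digits.slice(n), // what would be left as ordinary characters
      });
    }
  }
  return readings;
}

// Hypothetical: with braces the escape ends at "}", so there is one reading.
function bracedReading(text) {
  const m = /^\{([0-9a-fA-F]{1,6})\}/.exec(text);
  return m
    ? { codePoint: parseInt(m[1], 16), rest: text.slice(m[0].length) }
    : null;
}

const ambiguous = candidateReadings('12345678');
const unambiguous = bracedReading('{1234}5678');

console.log(ambiguous.length); // 3 competing readings
console.log(unambiguous);      // codePoint 0x1234, rest "5678"
```

Note that the 6-digit reading of "12345678" (0x123456) is even outside the Unicode range, which a real lexer would also have to decide how to handle.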

String Position
It's unclear to me if the string indices can be "changed" from UTF-16 to UTF-32 positions.  Although UTF-32 indices are clearly desirable, I think that many implementations currently allow the UTF-16 surrogate code units 0xD800 through 0xDFFF.  In other words, I can already have JavaScript strings with full Unicode range data in them.  Existing applications would then have indices that point to the UTF-16 index, not the UTF-32 index.  Changing the definition of the index to UTF-32 would break those applications, I think.

No, it wouldn't break anything, at least when applied to existing data.  Your existing code is explicitly doing UTF-16 processing.  Somebody had to do the processing to create the surrogate pairs in the string.  As long as you use that same agent, you are still going to get UTF-16 encoded strings.  Even though the underlying character values could hold single characters with code points > \uffff, the actual string won't unless somebody actually constructed the string to contain such values.  That presumably doesn't happen for existing code.
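
As an illustration of what "explicitly doing UTF-16 processing" looks like in existing code, here is a minimal sketch of the usual surrogate arithmetic. The function names are hypothetical; only charCodeAt and length are standard.

```javascript
// Combine a high/low surrogate pair into the code point it encodes.
function fromSurrogates(hi, lo) {
  return (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000;
}

// Split a supplementary code point (above 0xFFFF) into a surrogate pair.
function toSurrogates(cp) {
  const v = cp - 0x10000;
  return [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)];
}

const s = '\uD835\uDC9C'; // U+1D49C stored as two 16-bit units
const cp = fromSurrogates(s.charCodeAt(0), s.charCodeAt(1));
const units = toSurrogates(cp);

console.log(s.length);        // 2
console.log(cp.toString(16)); // 1d49c
```

Code like this only ever sees 16-bit values because whoever built the string produced pairs; that is the sense in which existing data keeps working.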

The place where existing code might break is if somebody explicitly constructs a string (using \u+ literals or String.fromCodepoint) that contains non-BMP characters and passes it to routines that only expect 16-bit characters.  For this reason, any existing host routines that convert external data resources to ES strings that contain surrogate pairs should probably continue to do so.  New routines should be provided that produce single characters instead of pairs for non-BMP code points.  However, the definition of such routines is outside the scope of the ES specification.

Finally, note that just as current strings can contain 16-bit character values that are not valid Unicode code points, the expanded full Unicode strings can also contain 21-bit character values that are not valid Unicode code points.
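
For the 16-bit case, the values in question are unpaired surrogates. A minimal sketch of how one might detect them in a current string follows; hasUnpairedSurrogate is a hypothetical name, not an existing API.

```javascript
// Hypothetical check: does the string contain a surrogate that is not part
// of a well-formed high/low pair?
function hasUnpairedSurrogate(s) {
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i);
    if (c >= 0xD800 && c <= 0xDBFF) {
      // High surrogate: it must be followed by a low surrogate.
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // skip the low half of a valid pair
        continue;
      }
      return true;
    }
    if (c >= 0xDC00 && c <= 0xDFFF) {
      return true; // low surrogate with no preceding high surrogate
    }
  }
  return false;
}

console.log(hasUnpairedSurrogate('a\uD800b'));     // true
console.log(hasUnpairedSurrogate('\uD800\uDC00')); // false
```

The language happily stores the first string today; nothing at the string level forces the contents to be valid Unicode.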

You also touch on that with charCodeAt/codepointAt, which resolves the problem with the output type, but doesn't address the problem with the indexing.  Similar to the way you differentiated charCode/codepoint, it may be necessary to differentiate charCode/codepoint indices.  IMO .fromCharCode doesn't have this problem since it used to fail, but now works, which wouldn't be breaking.  Unless we're concerned that now it can return a different UTF-16 length than before.

Again, nothing changes.  Code that expects to deal with multi-character encodings can still do so.  What "magically" changes is that code that acts as if Unicode code points are only 16 bits (i.e., code that doesn't correctly deal with surrogate pairs) will now work with full 21-bit characters.
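
To make the two kinds of index concrete, here is a rough sketch over today's UTF-16 strings. codePointAtIndex is a hypothetical helper that counts positions in whole code points, in contrast to charCodeAt, which counts 16-bit units.

```javascript
// Hypothetical: return the code point at code-point position n, treating a
// well-formed surrogate pair as one position.
function codePointAtIndex(s, n) {
  let seen = 0;
  for (let i = 0; i < s.length; i++) {
    let cp = s.charCodeAt(i);
    if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < s.length) {
      const lo = s.charCodeAt(i + 1);
      if (lo >= 0xDC00 && lo <= 0xDFFF) {
        cp = (cp - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000;
        i++; // the low surrogate belongs to this position
      }
    }
    if (seen === n) return cp;
    seen++;
  }
  return undefined; // n is past the end
}

const text = 'a\uD800\uDC00b';

console.log(text.charCodeAt(2).toString(16));          // dc00 (UTF-16 index 2)
console.log(codePointAtIndex(text, 2).toString(16));   // 62, i.e. "b"
```

With 16-bit indexing, position 2 lands in the middle of a pair; with code point indexing, it lands on the next character.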

I don't like the "21" in the name of decodeURI21.

Suggestions for better names are always welcome.

Also, the "trick" I think, is encoding to surrogate pairs (illegally, since UTF-8 doesn't allow that) vs decoding to UTF-16.  It seems like decoding can safely detect input supplementary characters and properly decode them, or is there something about encoding that doesn't make that state detectable?

I think I'm missing the distinction you are making between surrogate pairs and UTF-16.  I think I've been using the terms interchangeably.  I may be munging up the terminology.
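
The distinction in the question can be sketched with the existing decoder: the same character U+10000 can arrive as one 4-byte UTF-8 sequence (legal) or as two 3-byte sequences, one per surrogate (illegal UTF-8). The helper tryDecode is a hypothetical name; the assumption here is that the decoder rejects malformed input with a URIError, which the sketch turns into null.

```javascript
// Hypothetical helper: decode, or return null if the input is rejected.
function tryDecode(s) {
  try {
    return decodeURIComponent(s);
  } catch (e) {
    return null;
  }
}

// U+10000 as a single 4-byte UTF-8 sequence.
const fourByte = tryDecode('%F0%90%80%80');

// D800 and DC00 each encoded separately as 3 bytes; a strict UTF-8 decoder
// is expected to refuse this rather than reassemble the pair.
const sixByte = tryDecode('%ED%A0%80%ED%B0%80');

console.log(fourByte === '\uD800\uDC00', fourByte.length);
console.log(sixByte);
```

In a current engine the 4-byte form decodes to a surrogate pair (two 16-bit units); under the strawman a new decoding routine could instead yield a single character for the same input.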


From: es-discuss-bounces at mozilla.org On Behalf Of Allen Wirfs-Brock
Sent: Monday, May 16, 2011 11:12 AM
To: es-discuss at mozilla.org
Subject: Full Unicode strings strawman

I tried to post a pointer to this strawman on this list a few weeks ago, but apparently it didn't reach the list for some reason.

Feedback would be appreciated:


