easy handling of UTF16 surrogates & well-formed strings

Phillips, Addison addison at lab126.com
Wed Nov 14 09:05:53 PST 2012


You might want to check out Norbert's proposal [1].

Addison

Addison Phillips
Globalization Architect (Lab126)
Chair (W3C I18N WG)

Internationalization is not a feature.
It is an architecture.


[1] http://norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html

> -----Original Message-----
> From: Roger Andrews [mailto:roger.andrews at mail104.co.uk]
> Sent: Wednesday, November 14, 2012 6:07 AM
> To: es-discuss at mozilla.org
> Subject: easy handling of UTF16 surrogates & well-formed strings
> 
> This is rather long but the idea is to make handling UTF16 surrogates easier for
> the casual user without harming the ability of UTF16 experts to delve into
> details if surrogates are not well-paired (and hence the string is not well-
> formed).
> 
> Under the current definitions (ed. 6_10-26-12) surprising things happen.
> E.g. a string converted to an array of codepoints with 'codePointAt' then back to
> a string with 'fromCodePoint' is not equal to the original string if it contains
> well-formed surrogate pairs.
> 
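A minimal sketch of the surprise described above, assuming an engine whose
'codePointAt' and 'fromCodePoint' behave as in the draft (the versions that
later shipped in ES2015 do). The helper name 'roundTripByIndex' is invented
here for illustration.

```javascript
// Collect codePointAt() for every code unit index, then rebuild the string.
function roundTripByIndex(str) {
  var codePoints = [];
  for (var i = 0; i < str.length; i++) {
    codePoints.push(str.codePointAt(i));
  }
  return String.fromCodePoint.apply(null, codePoints);
}

var sample = "a\uD83D\uDE00b";   // U+1F600 encoded as a surrogate pair
var rebuilt = roundTripByIndex(sample);

console.log(rebuilt === sample);             // false
console.log(sample.length, rebuilt.length);  // 4 5 (trail surrogate appears twice)
```

The extra code unit comes from index 2, where 'codePointAt' reports the trail
surrogate that index 1 already accounted for.
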
> Here are some thoughts from a JavaScript enthusiast playing with Unicode
> outside the BMP.
> 
> 
> String.prototype.codePointAt
> ----------------------------
> 
> The current definition of codePointAt has results:
>    out-of-bounds                  -> Undefined
>    normal BMP char                -> the codepoint
>    lead surrogate of a good pair  -> the codepoint
>    trail surrogate of a good pair -> codeunit in [0xDC00:0xDFFF] !!ambiguous
>    bad trail surrogate            -> codeunit in [0xDC00:0xDFFF]
>    bad lead surrogate             -> codeunit in [0xD800:0xDBFF]
> 
> Note that a well-paired trail surrogate still results in a value even though the
> previous codeunit "subsumed" it.  So, if a caller is indexing down the string then
> it should take the well-paired trail surrogate value out of the sequence.
> 
> UTF16 experts can write code to check these possibilities; but for general
> usability let's have:
>    Undefined for the trail surrogate of a good pair, and
>    NaN for bad surrogate.
> 
> Then codePointAt would do the work for the casual user and experts can probe
> the string with charCodeAt (or codeUnitAt if it exists) if they really want to
> know the situation of bad surrogates.
> 
> [If codePointAt stays unchanged, users must write messy code patterns such as:
> 
>     // if the indexed position is part of a well-formed surrogate pair
>     // then result is either the entire code-point (for lead surrogates)
>     //                or undefined (for trail surrogates)
>     // result is NaN for bad surrogates
>     // (result is always undefined for out-of-bounds position)
> 
>     var cp = str.codePointAt( pos );
>     if (0xDC00 <= cp  &&  cp <= 0xDFFF) {
>         var cu = str.charCodeAt( pos-1 );
>         if (0xD800 <= cu  &&  cu <= 0xDBFF) {
>             cp =  undefined;      // trail surrogate of good pair
>         }
>     }
>     if (0xD800 <= cp  &&  cp <= 0xDFFF) {
>         cp = NaN;                 // bad surrogate
>     }
> 
> ]
> 
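The pattern above can be wrapped into a helper so that a caller indexing down
the string gets each code point exactly once. This is only a sketch of the
proposed results; 'proposedCodePointAt' and 'codePointsOf' are hypothetical
names, not part of any draft.

```javascript
// Hypothetical helper returning the proposed results:
//   undefined -> out of bounds, or trail surrogate of a good pair
//   NaN       -> unpaired (bad) surrogate
//   otherwise -> the code point
function proposedCodePointAt(str, pos) {
  var cp = str.codePointAt(pos);
  if (cp === undefined) return undefined;
  if (0xDC00 <= cp && cp <= 0xDFFF) {
    var prev = str.charCodeAt(pos - 1);        // NaN when pos is 0
    if (0xD800 <= prev && prev <= 0xDBFF) return undefined;
  }
  if (0xD800 <= cp && cp <= 0xDFFF) return NaN;
  return cp;
}

// Walk every index and keep only the defined results.
function codePointsOf(str) {
  var out = [];
  for (var i = 0; i < str.length; i++) {
    var cp = proposedCodePointAt(str, i);
    if (cp !== undefined) out.push(cp);
  }
  return out;
}

var text = "a\uD83D\uDE00b";
console.log(codePointsOf(text).length);                                  // 3
console.log(String.fromCodePoint.apply(null, codePointsOf(text)) === text);  // true
```
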
> 
> String.prototype.charCodeAt / String.prototype.codeUnitAt
> ---------------------------
> 
> The existing charCodeAt returns NaN (not Undefined) if the indexed position is
> out-of-bounds, unlike codePointAt.
> 
> For consistency, there could be a method 'codeUnitAt' which behaves like (and
> is named like) codePointAt; i.e. returns Undefined for out-of-bounds.
> 
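As a rough illustration of the difference, here is a standalone sketch of the
suggested 'codeUnitAt' (written as a plain function rather than a prototype
method, which is a choice made here, not in the proposal).

```javascript
// Like charCodeAt, but undefined instead of NaN for out-of-bounds positions.
function codeUnitAt(str, pos) {
  var cu = str.charCodeAt(pos);
  return isNaN(cu) ? undefined : cu;
}

console.log("abc".charCodeAt(5));    // NaN
console.log(codeUnitAt("abc", 5));   // undefined
console.log(codeUnitAt("abc", 1));   // 98
```
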
> 
> String.prototype.charAt / String.prototype.unicodeCharAt
> -----------------------
> 
> The existing charAt does not handle UTF16 surrogate pairs.
> 
> For consistency with the above, there could be a method 'unicodeCharAt'
> which returns the 1- or 2-char string corresponding to the 'codePointAt'
> value and empty-string for out-of-bounds or a well-paired trail surrogate.
> Note that an array of such strings could be joined to form the original string.
> 
> What to return for a bad surrogate?  Null?  Undefined?
> 
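A possible shape for 'unicodeCharAt', again as a standalone function. The
question of what to return for a bad surrogate is left open above; this sketch
arbitrarily picks null, purely as an assumption for illustration.

```javascript
function unicodeCharAt(str, pos) {
  var cu = str.charCodeAt(pos);
  if (isNaN(cu)) return "";                        // out of bounds
  if (0xD800 <= cu && cu <= 0xDBFF) {              // lead surrogate
    var next = str.charCodeAt(pos + 1);
    if (0xDC00 <= next && next <= 0xDFFF) {
      return str.slice(pos, pos + 2);              // well-formed pair
    }
    return null;                                   // bad lead (assumed result)
  }
  if (0xDC00 <= cu && cu <= 0xDFFF) {              // trail surrogate
    var prev = str.charCodeAt(pos - 1);
    if (0xD800 <= prev && prev <= 0xDBFF) return "";  // already covered by the lead
    return null;                                   // bad trail (assumed result)
  }
  return str.charAt(pos);                          // ordinary BMP char
}

var original = "a\uD83D\uDE00b";
var pieces = [];
for (var i = 0; i < original.length; i++) {
  pieces.push(unicodeCharAt(original, i));
}
console.log(pieces.join("") === original);         // true
```
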
> 
> String.fromCodePoint
> --------------------
> 
> The current definition of fromCodePoint does not convert a sequence produced
> by codePointAt back to the original string.
> 
> This is really because codePointAt returns the trail surrogate value right
> after the well-formed pair that was just converted to a single codepoint.
> 
> If codePointAt is changed to return Undefined for a good trail surrogate then
> fromCodePoint should simply ignore Undefined arguments.  Currently I think it
> throws RangeError (or maybe converts Undefined values to NUL chars?).
> 
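For what it is worth, 'fromCodePoint' as it later shipped in ES2015 throws
RangeError for an undefined argument. A lenient wrapper along the lines
suggested above might look like this; the name 'fromCodePointSkippingUndefined'
is made up for the sketch.

```javascript
// Drop undefined arguments, then defer to the built-in.
function fromCodePointSkippingUndefined() {
  var kept = [];
  for (var i = 0; i < arguments.length; i++) {
    if (arguments[i] !== undefined) kept.push(arguments[i]);
  }
  return String.fromCodePoint.apply(null, kept);
}

var threw = false;
try {
  String.fromCodePoint(undefined);
} catch (e) {
  threw = e instanceof RangeError;
}
console.log(threw);                                                    // true
console.log(fromCodePointSkippingUndefined(0x61, 0x1F600, undefined, 0x62)
            === "a\uD83D\uDE00b");                                     // true
```
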
> 
> String.fromCharCode / String.fromCodeUnit
> -------------------
> 
> The existing fromCharCode converts undefined,null,NaN,Infinity values into
> NUL chars (U+0000), and maps other naughty values into valid chars.
> 
> For consistency, there could be a function 'fromCodeUnit' which behaves like
> (and is named like) fromCodePoint; i.e. throws RangeError for naughty values.
> This function should also have arity = 0 like fromCodePoint.
> 
> If fromCodePoint is changed to ignore Undefined arguments then so should
> fromCodeUnit.
> 
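The following sketch first shows the existing 'fromCharCode' coercions, then a
strict 'fromCodeUnit' that throws RangeError instead. The exact validity rule
used here (an integer from 0 to 0xFFFF after Number conversion) is an
assumption; the optional "ignore Undefined" behaviour is not included.

```javascript
// Existing behaviour: odd values are coerced rather than rejected.
console.log(String.fromCharCode(undefined, null, NaN, Infinity)
            === "\u0000\u0000\u0000\u0000");          // true
console.log(String.fromCharCode(0x10041) === "A");    // true (taken modulo 0x10000)

// Strict variant: no declared parameters, so its length is 0.
function fromCodeUnit() {
  var out = "";
  for (var i = 0; i < arguments.length; i++) {
    var n = Number(arguments[i]);
    if (!(n >= 0 && n <= 0xFFFF && Math.floor(n) === n)) {
      throw new RangeError("Invalid code unit: " + arguments[i]);
    }
    out += String.fromCharCode(n);
  }
  return out;
}

console.log(fromCodeUnit(0x61, 0xD83D, 0xDE00) === "a\uD83D\uDE00");  // true
```
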
> 
> String.isWellFormed
> -------------------
> 
> To let a user easily detect whether a string is well-formed or ill-formed,
> how about a friendly predicate:
>    String.isWellFormed( str )
> 
> Without this, the following regexp should test a string for well-formedness (no
> warranty implied):
>    /^(?:[\uD800-\uDBFF][\uDC00-\uDFFF]|[\u0000-\uD7FF\uE000-\uFFFF])*$/
> 
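A small sketch of such a predicate built directly on the regexp above, written
as a standalone function 'isWellFormed' rather than a property of String.

```javascript
var RE_WELL_FORMED =
  /^(?:[\uD800-\uDBFF][\uDC00-\uDFFF]|[\u0000-\uD7FF\uE000-\uFFFF])*$/;

// True when every surrogate in the string belongs to a lead+trail pair.
function isWellFormed(str) {
  return RE_WELL_FORMED.test(String(str));
}

console.log(isWellFormed("a\uD83D\uDE00b"));   // true
console.log(isWellFormed("a\uD83Db"));         // false (lead without trail)
console.log(isWellFormed("\uDE00\uD83D"));     // false (wrong order)
```
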
> 
> String.prototype.repair
> -----------------------
> 
> Following on from isWellFormed, what is the user to do with an ill-formed
> string?  Here is one suggestion: a 'repair' method which replaces improper
> surrogates with something (like the Unicode replacement character U+FFFD).
> (Alternatively, the user may want to give up and throw an Error, see next.)
> 
> [Here is a possible implementation which UTF16 experts could shim in....
> 
>     var re_badsurrogate =
>         /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|([^\uD800-\uDBFF])[\uDC00-\uDFFF]|^[\uDC00-\uDFFF]/g;
> 
>     String.prototype.repair = function (replacer)
>     {
>         if (arguments.length == 0)  replacer = "\uFFFD";
> 
>         return this.replace( re_badsurrogate, "$1"+replacer );
>     };
> 
> ]
> 
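One edge case worth checking with the regexp above is a run of unpaired trail
surrogates: "\uDC00\uDC00" appears to come out as "\uDC00\uFFFD", because the
first trail is consumed as the "preceding character" of the second. The
variant below is a sketch of a different technique (match good pairs first and
keep them, replace any other surrogate); it is a standalone function, not the
shim proposed above.

```javascript
// Good pairs are matched first so they are never split; anything else
// in the surrogate range is unpaired.
var RE_SURROGATES = /[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDFFF]/g;

function repair(str, replacement) {
  if (replacement === undefined) replacement = "\uFFFD";
  return String(str).replace(RE_SURROGATES, function (match) {
    return match.length === 2 ? match : replacement;
  });
}

console.log(repair("a\uDC00\uDC00b") === "a\uFFFD\uFFFDb");   // true
console.log(repair("\uD83D\uDE00") === "\uD83D\uDE00");       // true
console.log(repair("x\uDE00", "?") === "x?");                 // true
```

Using a replacer function also means a replacement containing "$" is inserted
literally.
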
> 
> StringError (& URI functions)
> -----------
> 
> The existing encodeURI & encodeURIComponent throw URIError if given an
> ill-formed string.  (The URI decode functions behave similarly, both for
> ill-formed strings and for improper use of percent-coding.)
> 
> A new Error, called StringError, could be thrown by URI functions and user
> functions which reject an ill-formed string *because* it is ill-formed, (rather
> than trying to repair it).
> 
> To avoid changing the existing URI functions, versions using StringError could
> be moved from the global namespace to a "URI" namespace (à la "JSON"):
>   URI.encodeComponent, ...
> This seems quite neat, and declutters the global namespace too.
> 
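To make the idea concrete, here is a rough sketch of a 'StringError' type and
a 'URI.encodeComponent' wrapper. Both are hypothetical: neither exists in any
engine, and the well-formedness test simply reuses the regexp given earlier.

```javascript
var RE_WELL_FORMED_UTF16 =
  /^(?:[\uD800-\uDBFF][\uDC00-\uDFFF]|[\u0000-\uD7FF\uE000-\uFFFF])*$/;

// Hypothetical error type for strings rejected because they are ill-formed.
class StringError extends Error {
  constructor(message) {
    super(message);
    this.name = "StringError";
  }
}

// Hypothetical namespace object, in the style of JSON.
var URI = {
  encodeComponent: function (str) {
    str = String(str);
    if (!RE_WELL_FORMED_UTF16.test(str)) {
      throw new StringError("ill-formed UTF-16 string");
    }
    return encodeURIComponent(str);
  }
};

console.log(URI.encodeComponent("a b") === "a%20b");   // true

try {
  URI.encodeComponent("\uD800");
} catch (e) {
  console.log(e instanceof StringError);               // true
}
```

The existing global 'encodeURIComponent' is left untouched and still throws
URIError for the same input.
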
> 