easy handling of UTF16 surrogates & well-formed strings

Roger Andrews roger.andrews at mail104.co.uk
Wed Nov 14 06:06:41 PST 2012


This is rather long but the idea is to make handling UTF16 surrogates
easier for the casual user without harming the ability of UTF16 experts to
delve into details if surrogates are not well-paired (and hence the string
is not well-formed).

Under the current definitions (ed. 6_10-26-12) surprising things happen.
E.g. a string converted to an array of codepoints with 'codePointAt' then
back to a string with 'fromCodePoint' is not equal to the original string
if it contains well-formed surrogate pairs.
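
As a small illustration of that surprise (a sketch only, assuming an engine
that implements the two methods as drafted; the sample string is made up for
the example):

```javascript
// Assumes an engine that implements codePointAt and fromCodePoint as drafted.
var original = "a\uD83D\uDE00b";   // 'a', U+1F600 (one surrogate pair), 'b'

// Collect codePointAt for every index, as a casual user might.
var codePoints = [];
for (var i = 0; i < original.length; i++) {
    codePoints.push( original.codePointAt( i ) );
}
// codePoints is now [0x61, 0x1F600, 0xDE00, 0x62]:
// the trail surrogate at index 2 is reported a second time.

var roundTripped = String.fromCodePoint.apply( null, codePoints );
// roundTripped is "a\uD83D\uDE00\uDE00b", one codeunit longer than original.
```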

Here are some thoughts from a JavaScript enthusiast playing with Unicode
outside the BMP.


String.prototype.codePointAt
----------------------------

The current definition of codePointAt has results:
   out-of-bounds                  -> Undefined
   normal BMP char                -> the codepoint
   lead surrogate of a good pair  -> the codepoint
   trail surrogate of a good pair -> codeunit in [0xDC00:0xDFFF] !!ambiguous
   bad trail surrogate            -> codeunit in [0xDC00:0xDFFF]
   bad lead surrogate             -> codeunit in [0xD800:0xDBFF]

Note that a well-paired trail surrogate still results in a value even though
the previous codeunit "subsumed" it.  So a caller that indexes down the
string one codeunit at a time must itself discard the value returned for a
well-paired trail surrogate.

UTF16 experts can write code to check these possibilities; but for general
usability let's have:
   Undefined for the trail surrogate of a good pair, and
   NaN for bad surrogate.

Then codePointAt would do the work for the casual user and experts can probe
the string with charCodeAt (or codeUnitAt if it exists) if they really want
to know the situation of bad surrogates.

[If codePointAt is left unchanged, users are called upon to write messy code
patterns like this....

    // if the indexed position is part of a well-formed surrogate pair
    // then result is either the entire code-point (for lead surrogates)
    //                or undefined (for trail surrogates)
    // result is NaN for bad surrogates
    // (result is always undefined for out-of-bounds position)

    var cp = str.codePointAt( pos );
    if (0xDC00 <= cp  &&  cp <= 0xDFFF) {
        var cu = str.charCodeAt( pos-1 );
        if (0xD800 <= cu  &&  cu <= 0xDBFF) {
            cp =  undefined;      // trail surrogate of good pair
        }
    }
    if (0xD800 <= cp  &&  cp <= 0xDFFF) {
        cp = NaN;                 // bad surrogate
    }

]
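
The proposed results can also be sketched as a standalone helper built only
on charCodeAt, so it runs whether or not codePointAt exists.  The name
'proposedCodePointAt' and the sample string are hypothetical, for
illustration only:

```javascript
// Hypothetical helper: returns what the proposal above asks codePointAt
// to return, computed from code units alone.
function proposedCodePointAt(str, pos) {
    var cu = str.charCodeAt( pos );
    if (cu !== cu)  return undefined;              // NaN: out of bounds
    if (0xD800 <= cu  &&  cu <= 0xDBFF) {          // lead surrogate
        var next = str.charCodeAt( pos+1 );
        if (0xDC00 <= next  &&  next <= 0xDFFF) {
            // combine the good pair into one code point
            return (cu - 0xD800) * 0x400 + (next - 0xDC00) + 0x10000;
        }
        return NaN;                                // bad lead surrogate
    }
    if (0xDC00 <= cu  &&  cu <= 0xDFFF) {          // trail surrogate
        var prev = str.charCodeAt( pos-1 );
        return (0xD800 <= prev  &&  prev <= 0xDBFF) ? undefined : NaN;
    }
    return cu;                                     // normal BMP char
}

// 'a', a good pair (U+1F600), then a stray trail surrogate
var sample = "a\uD83D\uDE00\uDC00";
var results = [];
for (var i = 0; i <= sample.length; i++) {
    results.push( proposedCodePointAt( sample, i ) );
}
// results: [0x61, 0x1F600, undefined, NaN, undefined]
```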


String.prototype.charCodeAt / String.prototype.codeUnitAt
---------------------------

The existing charCodeAt returns NaN (not Undefined) if the indexed position
is out-of-bounds, unlike codePointAt, which returns Undefined.

For consistency, there could be a method 'codeUnitAt' which behaves like
(and is named like) codePointAt; i.e. returns Undefined for out-of-bounds.
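
A minimal sketch of such a method, written as a standalone function so as
not to touch String.prototype.  The signature 'codeUnitAt( str, pos )' is an
assumption for illustration only:

```javascript
// Hypothetical: like charCodeAt, but Undefined instead of NaN when
// the position is out of bounds.
function codeUnitAt(str, pos) {
    var cu = str.charCodeAt( pos );
    // charCodeAt yields NaN only for an out-of-bounds position,
    // and NaN is the only value that is not equal to itself.
    return (cu !== cu) ? undefined : cu;
}

codeUnitAt( "ab", 1 );    // 0x62
codeUnitAt( "ab", 2 );    // undefined
```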


String.prototype.charAt / String.prototype.unicodeCharAt
-----------------------

The existing charAt does not handle UTF16 surrogate pairs.

For consistency with the above, there could be a method 'unicodeCharAt'
which returns the 1- or 2-char string corresponding to the 'codePointAt'
value and empty-string for out-of-bounds or a well-paired trail surrogate.
Note that an array of such strings could be joined to form the original
string.

What to return for a bad surrogate?  Null?  Undefined?
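
One way this might look, as a standalone sketch.  The function name and
signature are hypothetical, and returning null for a bad surrogate is just a
placeholder choice so the example runs; the question above stays open:

```javascript
// Hypothetical: the 1- or 2-char string at pos, "" for out-of-bounds or
// a well-paired trail surrogate, null (placeholder) for a bad surrogate.
function unicodeCharAt(str, pos) {
    var cu = str.charCodeAt( pos );
    if (cu !== cu)  return "";                     // out of bounds
    if (0xD800 <= cu  &&  cu <= 0xDBFF) {          // lead surrogate
        var next = str.charCodeAt( pos+1 );
        if (0xDC00 <= next  &&  next <= 0xDFFF) {
            return str.charAt( pos ) + str.charAt( pos+1 );
        }
        return null;                               // bad lead (assumption)
    }
    if (0xDC00 <= cu  &&  cu <= 0xDFFF) {          // trail surrogate
        var prev = str.charCodeAt( pos-1 );
        if (0xD800 <= prev  &&  prev <= 0xDBFF)  return "";
        return null;                               // bad trail (assumption)
    }
    return str.charAt( pos );                      // normal BMP char
}

var s = "a\uD83D\uDE00b";
var parts = [];
for (var i = 0; i < s.length; i++) {
    parts.push( unicodeCharAt( s, i ) );
}
// parts: ["a", "\uD83D\uDE00", "", "b"], and parts.join("") equals s
```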


String.fromCodePoint
--------------------

The current definition of fromCodePoint does not convert a sequence produced
by codePointAt back to the original string.

This is really due to codePointAt returning a trail surrogate value after
a well-formed pair (which was just converted to a single codepoint).

If codePointAt is changed to return Undefined for a good trail surrogate
then fromCodePoint should simply ignore Undefined arguments.  Currently I
think it throws RangeError (or maybe converts Undefined values to NUL
chars?).
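
A sketch of the suggested behaviour as a wrapper around fromCodePoint,
assuming an engine that provides it.  The wrapper name
'fromCodePointLenient' is hypothetical:

```javascript
// Hypothetical wrapper: drop Undefined arguments, pass everything else
// through to String.fromCodePoint unchanged.
function fromCodePointLenient() {
    var kept = [];
    for (var i = 0; i < arguments.length; i++) {
        if (arguments[i] !== undefined)  kept.push( arguments[i] );
    }
    return String.fromCodePoint.apply( null, kept );
}

// The sequence the proposed codePointAt would give for "a\uD83D\uDE00b":
var rebuilt = fromCodePointLenient( 0x61, 0x1F600, undefined, 0x62 );
// rebuilt is "a\uD83D\uDE00b", equal to the original string
```

NaN (the proposed result for a bad surrogate) is deliberately not skipped
here, so it is still handed to fromCodePoint.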


String.fromCharCode / String.fromCodeUnit
-------------------

The existing fromCharCode converts undefined, null, NaN, Infinity values into
NUL chars (U+0000), and maps other naughty values into valid chars (by
reducing them modulo 0x10000).

For consistency, there could be a function 'fromCodeUnit' which behaves like
(and is named like) fromCodePoint; i.e. throws RangeError for naughty
values.  This function should also have arity = 0 like fromCodePoint.

If fromCodePoint is changed to ignore Undefined arguments
then so should fromCodeUnit.
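
A rough sketch of such a function.  The name comes from the suggestion
above, but the exact checks are assumptions: anything that is not an integer
in [0:0xFFFF] counts as naughty, and Undefined is skipped as in the
conditional proposal:

```javascript
// Hypothetical strict counterpart of String.fromCharCode.
function fromCodeUnit() {
    var units = [];
    for (var i = 0; i < arguments.length; i++) {
        var v = arguments[i];
        if (v === undefined)  continue;            // ignored (assumption)
        var n = Number( v );
        if (n !== n  ||  n < 0  ||  n > 0xFFFF  ||  Math.floor( n ) !== n) {
            throw new RangeError( "Invalid code unit: " + v );
        }
        units.push( n );
    }
    return String.fromCharCode.apply( null, units );
}

String.fromCharCode( 0x10041 );    // "A", silently wrapped to 0x0041
fromCodeUnit( 0x61, 0x62 );        // "ab"
// fromCodeUnit( 0x10041 ) throws RangeError instead of wrapping
```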


String.isWellFormed
-------------------

To let a user easily detect a well- or ill-formed string, how about a
friendly predicate:
   String.isWellFormed( str )

Without this, the following regexp should test a string for well-formedness
(no warranty implied):
   /^(?:[\uD800-\uDBFF][\uDC00-\uDFFF]|[\u0000-\uD7FF\uE000-\uFFFF])*$/
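
For example, the regexp can be wrapped in a plain function as a stand-in for
the proposed predicate (a sketch only; it is deliberately not attached to
String, and coercing the argument with String() is an assumption):

```javascript
// A string is well-formed if it consists only of good surrogate pairs
// and non-surrogate code units.
var re_wellformed =
    /^(?:[\uD800-\uDBFF][\uDC00-\uDFFF]|[\u0000-\uD7FF\uE000-\uFFFF])*$/;

function isWellFormed(str) {
    return re_wellformed.test( String( str ) );
}

isWellFormed( "a\uD83D\uDE00b" );    // true:  good pair
isWellFormed( "a\uD83Db" );          // false: lead without trail
isWellFormed( "\uDE00" );            // false: trail without lead
```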


String.prototype.repair
-----------------------

Following on from isWellFormed, what is the user to do with an ill-formed
string?  Here is one suggestion: a 'repair' method which replaces improper
surrogates with something (like the Unicode replacement character U+FFFD).
(Alternatively, the user may want to give up and throw an Error, see next.)

[Here is a possible implementation which UTF16 experts could shim in....

    // A good pair is matched first (and kept), so any surrogate that
    // matches on its own is unpaired, even next to another bad one.
    var re_surrogate = /[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDFFF]/g;

    String.prototype.repair = function (replacer)
    {
        if (arguments.length == 0)  replacer = "\uFFFD";

        return this.replace( re_surrogate, function (m) {
            return (m.length == 2) ? m : replacer;
        });
    };

]


StringError (& URI functions)
-----------

The existing encodeURI & encodeURIComponent throw URIError if given an
ill-formed string.  (The URI decode functions behave similarly, both for
ill-formed strings and for improper use of percent-coding.)

A new Error, called StringError, could be thrown by URI functions and user
functions which reject an ill-formed string *because* it is ill-formed
(rather than trying to repair it).

To avoid changing the existing URI functions, versions using StringError
could be moved from the global namespace to a "URI" namespace (à la "JSON"):
  URI.encodeComponent, ...
This seems quite neat, and declutters the global namespace too.
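
A sketch of how the two ideas might fit together.  Neither StringError nor
a URI object exists today; the constructor, the message text and the body of
URI.encodeComponent below are all assumptions for illustration:

```javascript
// Hypothetical error type for strings rejected as ill-formed.
function StringError(message) {
    this.name = "StringError";
    this.message = message || "";
}
StringError.prototype = Object.create( Error.prototype );
StringError.prototype.constructor = StringError;

var re_wellformed =
    /^(?:[\uD800-\uDBFF][\uDC00-\uDFFF]|[\u0000-\uD7FF\uE000-\uFFFF])*$/;

// Hypothetical namespace object, in the style of JSON.
var URI = {
    encodeComponent: function (str) {
        str = String( str );
        if (!re_wellformed.test( str )) {
            throw new StringError( "ill-formed string" );
        }
        return encodeURIComponent( str );   // existing function does the work
    }
};

URI.encodeComponent( "a b" );       // "a%20b"
// URI.encodeComponent( "\uD800" ) throws StringError;
// encodeURIComponent( "\uD800" ) throws URIError, as it does today.
```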
 


