easy handling of UTF16 surrogates & well-formed strings

Roger Andrews roger.andrews at mail104.co.uk
Wed Nov 14 13:49:38 PST 2012


Thanks for the ref to Norbert's proposal.
(I have been interested in i18n since writing an international telephony 
switch control system in 1987.)

Norbert's proposal has much interesting info about formats, locales, 
case-mapping & much else, but says little about the String.* functions or 
how the user can handle an ill-formed string (thinking from the perspective 
of a lowly software engineer working to achieve some task, rather than a 
top-down architect).

Head:  4.3.20 Surrogate pair
The proposal does confirm that an unpaired surrogate makes a UTF16 sequence 
ill-formed.
Head:  5.3 Text Interpretation
The proposal confirms that a valid surrogate pair is interpreted as a single 
codepoint, not a codepoint followed by an unpaired surrogate (as 
String.prototype.codePointAt does).

Towards the end of the page, in section Code Point Based String Accessors,
the proposal defines String.fromCodePoint and String.prototype.codePointAt 
in effectively the same manner as ES6 (ed. 6_10-26-12) - although the length 
property (arity) of fromCodePoint differs from ES6's.

This definition of codePointAt has the same usability issues as ES6's (ed. 
6_10-26-12);
i.e. it returns a value in [0xDC00:0xDFFF] both for the 2nd member of a 
surrogate pair and for an unpaired trail surrogate, so the caller cannot 
tell the two cases apart from the result alone.
It returns a value in [0xD800:0xDFFF] for any unpaired surrogate - maybe it 
would be friendlier to the casual user to return NaN (UTF16 experts can 
probe the location with charCodeAt / codeUnitAt if they care to).
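
To make the ambiguity concrete, here is a small sketch that probes every index of a string. The sample string is made up for illustration, and the sketch assumes an engine that implements codePointAt as drafted.

```javascript
// 'a', then a good surrogate pair (U+1F600), then a lone trail surrogate.
var sample = "a\uD83D\uDE00\uDC00";

var probed = [];
for (var i = 0; i < sample.length; i++) {
    probed.push(sample.codePointAt(i));
}

// probed is [0x61, 0x1F600, 0xDE00, 0xDC00]:
//   index 2 is the trail surrogate of a good pair,
//   index 3 is an unpaired trail surrogate,
// yet both results fall in [0xDC00:0xDFFF].
```

Telling index 2 from index 3 needs a second look at the preceding code unit, which is the extra work the casual user is unlikely to do.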

My original post tried to point to anomalies in:
   String.prototype.codePointAt   (of ES6)
   String.prototype.charCodeAt   (suggest String.prototype.codeUnitAt 
instead)
   String.prototype.charAt   (suggest String.prototype.unicodeCharAt too)
   String.fromCodePoint   (of ES6)
   String.fromCharCode   (suggest String.fromCodeUnit instead)
and floated:
   String.isWellFormed
   String.prototype.repair
   StringError   (& suggest URI functions mods)

Thanks again for the ref.


--------------------------------------------------
From: "Phillips, Addison" <addison at lab126.com>
Sent: Wednesday, November 14, 2012 5:05 PM
To: "Roger Andrews" <roger.andrews at mail104.co.uk>; <es-discuss at mozilla.org>
Subject: RE: easy handling of UTF16 surrogates & well-formed strings

> You might want to check out Norbert's proposal [1]
>
> Addison
>
> Addison Phillips
> Globalization Architect (Lab126)
> Chair (W3C I18N WG)
>
> Internationalization is not a feature.
> It is an architecture.
>
>
> [1] 
> http://norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html
>
>> -----Original Message-----
>> From: Roger Andrews [mailto:roger.andrews at mail104.co.uk]
>> Sent: Wednesday, November 14, 2012 6:07 AM
>> To: es-discuss at mozilla.org
>> Subject: easy handling of UTF16 surrogates & well-formed strings
>>
>> This is rather long but the idea is to make handling UTF16 surrogates
>> easier for the casual user without harming the ability of UTF16 experts to
>> delve into details if surrogates are not well-paired (and hence the string
>> is not well-formed).
>>
>> Under the current definitions (ed. 6_10-26-12) surprising things happen.
>> E.g. a string converted to an array of codepoints with 'codePointAt' then
>> back to a string with 'fromCodePoint' is not equal to the original string if
>> it contains well-formed surrogate pairs.
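
A minimal sketch of that round trip, assuming an engine that implements codePointAt and fromCodePoint as drafted. The sample string and variable names are made up for illustration.

```javascript
var original = "x\uD83D\uDE00y";   // 'x', U+1F600 as a surrogate pair, 'y'

// Naive conversion: one codePointAt call per code unit index.
var naive = [];
for (var i = 0; i < original.length; i++) {
    naive.push(original.codePointAt(i));
}
var naiveRebuilt = String.fromCodePoint.apply(null, naive);
// naive is [0x78, 0x1F600, 0xDE00, 0x79]; the stray 0xDE00 makes
// naiveRebuilt one code unit longer than original.

// Workaround: step over the trail surrogate of each good pair.
var stepped = [];
for (var j = 0; j < original.length; j++) {
    var cp = original.codePointAt(j);
    stepped.push(cp);
    if (cp > 0xFFFF) j++;          // a pair was consumed, skip its trail
}
var steppedRebuilt = String.fromCodePoint.apply(null, stepped);
// steppedRebuilt equals original
```

The second loop round-trips, but only because the caller knows to skip an index after a supplementary codepoint.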
>>
>> Here are some thoughts from a JavaScript enthusiast playing with Unicode
>> outside the BMP.
>>
>>
>> String.prototype.codePointAt
>> ----------------------------
>>
>> The current definition of codePointAt has results:
>>    out-of-bounds                  -> Undefined
>>    normal BMP char                -> the codepoint
>>    lead surrogate of a good pair  -> the codepoint
>>    trail surrogate of a good pair -> codeunit in [0xDC00:0xDFFF]  !!ambiguous
>>    bad trail surrogate            -> codeunit in [0xDC00:0xDFFF]
>>    bad lead surrogate             -> codeunit in [0xD800:0xDBFF]
>>
>> Note that a well-paired trail surrogate still results in a value even
>> though the previous codeunit "subsumed" it.  So, if a caller is indexing
>> down the string then it should take the well-paired trail surrogate value
>> out of the sequence.
>>
>> UTF16 experts can write code to check these possibilities; but for general
>> usability let's have:
>>    Undefined for the trail surrogate of a good pair, and
>>    NaN for a bad surrogate.
>>
>> Then codePointAt would do the work for the casual user and experts can
>> probe the string with charCodeAt (or codeUnitAt if it exists) if they
>> really want to know the situation of bad surrogates.
>>
>> [If codePointAt is left unchanged, users are called upon to write messy
>> code patterns like....
>>
>>     // if the indexed position is part of a well-formed surrogate pair
>>     // then result is either the entire code-point (for lead surrogates)
>>     //                or undefined (for trail surrogates)
>>     // result is NaN for bad surrogates
>>     // (result is always undefined for out-of-bounds position)
>>
>>     var cp = str.codePointAt( pos ), cu;
>>     if (0xDC00 <= cp  &&  cp <= 0xDFFF) {
>>         cu = str.charCodeAt( pos-1 );   // NaN when pos is 0
>>         if (0xD800 <= cu  &&  cu <= 0xDBFF) {
>>             cp = undefined;       // trail surrogate of good pair
>>         }
>>     }
>>     if (0xD800 <= cp  &&  cp <= 0xDFFF) {
>>         cp = NaN;                 // bad surrogate
>>     }
>>
>> ]
>>
>>
>> String.prototype.charCodeAt / String.prototype.codeUnitAt
>> ---------------------------
>>
>> The existing charCodeAt returns NaN (not Undefined) if the indexed position
>> is out-of-bounds, unlike codePointAt.
>>
>> For consistency, there could be a method 'codeUnitAt' which behaves like
>> (and is named like) codePointAt; i.e. returns Undefined for out-of-bounds.
>>
>>
>> String.prototype.charAt / String.prototype.unicodeCharAt
>> -----------------------
>>
>> The existing charAt does not handle UTF16 surrogate pairs.
>>
>> For consistency with the above, there could be a method 'unicodeCharAt'
>> which returns the 1- or 2-char string corresponding to the 'codePointAt'
>> value and empty-string for out-of-bounds or a well-paired trail surrogate.
>> Note that an array of such strings could be joined to form the original
>> string.
>>
>> What to return for a bad surrogate?  Null?  Undefined?
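
A rough sketch of how 'unicodeCharAt' might behave, written as a standalone function rather than a prototype method. The function name and signature are hypothetical, and returning null for a bad surrogate is only one of the options floated above, picked arbitrarily for the sketch.

```javascript
// Hypothetical helper: returns the 1- or 2-char string at pos,
// "" for out-of-bounds or the trail surrogate of a good pair,
// and null (an assumption) for an unpaired surrogate.
function unicodeCharAt(str, pos) {
    var cu = str.charCodeAt(pos);
    if (isNaN(cu)) return "";                       // out of bounds

    if (0xDC00 <= cu && cu <= 0xDFFF) {             // trail surrogate
        var prev = str.charCodeAt(pos - 1);         // NaN when pos is 0
        if (0xD800 <= prev && prev <= 0xDBFF) return "";   // good pair
        return null;                                // unpaired trail
    }
    if (0xD800 <= cu && cu <= 0xDBFF) {             // lead surrogate
        var next = str.charCodeAt(pos + 1);         // NaN at end of string
        if (0xDC00 <= next && next <= 0xDFFF) {
            return str.charAt(pos) + str.charAt(pos + 1);
        }
        return null;                                // unpaired lead
    }
    return str.charAt(pos);                         // ordinary BMP char
}

var text = "a\uD83D\uDE00b";
var pieces = [];
for (var i = 0; i < text.length; i++) {
    pieces.push(unicodeCharAt(text, i));
}
// pieces is ["a", "\uD83D\uDE00", "", "b"], so pieces.join("") equals text
```

For a well-formed string the pieces join back to the original, as the note above suggests.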
>>
>>
>> String.fromCodePoint
>> --------------------
>>
>> The current definition of fromCodePoint does not convert a sequence
>> produced by codePointAt back to the original string.
>>
>> This is really due to codePointAt returning a trail surrogate value after a
>> well-formed pair (which was just converted to a single codepoint).
>>
>> If codePointAt is changed to return Undefined for a good trail surrogate
>> then fromCodePoint should simply ignore Undefined arguments.  Currently I
>> think it throws RangeError (or maybe converts Undefined values to NUL
>> chars?).
>>
>>
>> String.fromCharCode / String.fromCodeUnit
>> -------------------
>>
>> The existing fromCharCode converts undefined,null,NaN,Infinity values into
>> NUL chars (U+0000), and maps other naughty values into valid chars.
>>
>> For consistency, there could be a function 'fromCodeUnit' which behaves
>> like (and is named like) fromCodePoint; i.e. throws RangeError for naughty
>> values.
>> This function should also have arity = 0 like fromCodePoint.
>>
>> If fromCodePoint is changed to ignore Undefined arguments then so should
>> fromCodeUnit.
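
A minimal sketch of what 'fromCodeUnit' could look like as a standalone function. The name is hypothetical, and the sketch covers only the RangeError behaviour; ignoring Undefined arguments would be a further change.

```javascript
// Hypothetical strict counterpart of String.fromCharCode: rejects any
// argument that is not an integer in [0x0000:0xFFFF] after conversion
// to a number.  No formal parameters are declared, so its length is 0.
function fromCodeUnit() {
    var out = "";
    for (var i = 0; i < arguments.length; i++) {
        var n = Number(arguments[i]);
        if (!(n >= 0 && n <= 0xFFFF && Math.floor(n) === n)) {
            throw new RangeError("Invalid code unit: " + arguments[i]);
        }
        out += String.fromCharCode(n);
    }
    return out;
}

var greeting = fromCodeUnit(0x48, 0x69);      // "Hi"
var wrapped = String.fromCharCode(0x10041);   // "A": silently reduced modulo 0x10000
// fromCodeUnit(0x10041), fromCodeUnit(NaN) and fromCodeUnit(undefined)
// all throw RangeError instead.
```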
>>
>>
>> String.isWellFormed
>> -------------------
>>
>> To enable a user easily to detect a well-/ill-formed string how about a
>> friendly predicate:
>>    String.isWellFormed( str )
>>
>> Without this, the following regexp should test a string for
>> well-formedness (no warranty implied):
>>    /^(?:[\uD800-\uDBFF][\uDC00-\uDFFF]|[\u0000-\uD7FF\uE000-\uFFFF])*$/
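
As a sketch, the regexp above can be wrapped in a small predicate. The function name isWellFormedString is made up for illustration; it stands in for the proposed String.isWellFormed and is not an existing API.

```javascript
// The regexp from the text: a string is well-formed if it consists only
// of good surrogate pairs and non-surrogate code units.
var re_wellformed =
    /^(?:[\uD800-\uDBFF][\uDC00-\uDFFF]|[\u0000-\uD7FF\uE000-\uFFFF])*$/;

// Hypothetical stand-in for String.isWellFormed( str ).
function isWellFormedString(str) {
    return re_wellformed.test(String(str));
}

var okPair    = isWellFormedString("a\uD83D\uDE00");   // true: good pair
var loneLead  = isWellFormedString("a\uD83D");         // false: unpaired lead
var swapped   = isWellFormedString("\uDE00\uD83D");    // false: trail before lead
```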
>>
>>
>> String.prototype.repair
>> -----------------------
>>
>> Following on from isWellFormed, what is the user to do with an ill-formed
>> string?  Here is one suggestion: a 'repair' method which replaces improper
>> surrogates with something (like the Unicode replacement character U+FFFD).
>> (Alternatively, the user may want to give up and throw an Error, see next.)
>>
>> [Here is a possible implementation which UTF16 experts could shim in....
>>
>>     var re_badsurrogate =
>>         /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|([^\uD800-\uDBFF])[\uDC00-\uDFFF]|^[\uDC00-\uDFFF]/g;
>>
>>     String.prototype.repair = function (replacer)
>>     {
>>         if (arguments.length == 0)  replacer = "\uFFFD";
>>
>>         return this.replace( re_badsurrogate, "$1"+replacer );
>>     };
>>
>> ]
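
When shimming something like this it may be worth testing runs of adjacent unpaired surrogates, since a regexp alternative that consumes the preceding character can leave a neighbouring surrogate untouched. As a cross-check, here is a sketch of a plain scanning version. The function name repairSurrogates is made up for illustration, and the replacer is treated as a literal string.

```javascript
// Hypothetical scanning repair: copies good pairs and ordinary code
// units, and substitutes replacer for every unpaired surrogate.
function repairSurrogates(str, replacer) {
    if (replacer === undefined) replacer = "\uFFFD";
    var out = "";
    for (var i = 0; i < str.length; i++) {
        var cu = str.charCodeAt(i);
        if (0xD800 <= cu && cu <= 0xDBFF) {             // lead surrogate
            var next = str.charCodeAt(i + 1);           // NaN at end of string
            if (0xDC00 <= next && next <= 0xDFFF) {
                out += str.charAt(i) + str.charAt(i + 1);   // good pair
                i++;                                    // skip its trail
            } else {
                out += replacer;                        // unpaired lead
            }
        } else if (0xDC00 <= cu && cu <= 0xDFFF) {
            out += replacer;                            // unpaired trail
        } else {
            out += str.charAt(i);                       // ordinary code unit
        }
    }
    return out;
}

var fixedRun  = repairSurrogates("a\uDC00\uDC00");   // "a\uFFFD\uFFFD"
var fixedLead = repairSurrogates("\uD800x");         // "\uFFFDx"
var untouched = repairSurrogates("\uD83D\uDE00");    // good pair is kept
```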
>>
>>
>> StringError (& URI functions)
>> -----------
>>
>> The existing encodeURI & encodeURIComponent throw URIError if given an
>> ill-formed string.  (The URI decode functions behave similarly, both for
>> ill-formed strings and for improper use of percent-coding.)
>>
>> A new Error, called StringError, could be thrown by URI functions and user
>> functions which reject an ill-formed string *because* it is ill-formed
>> (rather than trying to repair it).
>>
>> To avoid changing the existing URI functions, versions using StringError
>> could be moved from global namespace to a "URI" namespace (à la "JSON"):
>>   URI.encodeComponent, ...
>> This seems quite neat, and declutters the global namespace too.
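
A sketch of how that might fit together, under stated assumptions: StringError and the URISketch namespace object are hypothetical names used only for illustration, and the wrapper simply translates the URIError that the existing encodeURIComponent throws for an ill-formed string.

```javascript
// Hypothetical error type for "rejected because the string is ill-formed".
function StringError(message) {
    this.name = "StringError";
    this.message = message || "ill-formed string";
}
StringError.prototype = Object.create(Error.prototype);
StringError.prototype.constructor = StringError;

// Hypothetical namespace object, standing in for the proposed "URI".
var URISketch = {
    encodeComponent: function (str) {
        try {
            return encodeURIComponent(str);
        } catch (e) {
            // The existing function signals an unpaired surrogate with URIError.
            if (e instanceof URIError) {
                throw new StringError("cannot encode ill-formed string");
            }
            throw e;
        }
    }
};

var encoded = URISketch.encodeComponent("a b");   // "a%20b"

var failure = null;
try {
    URISketch.encodeComponent("\uD800");          // unpaired lead surrogate
} catch (e) {
    failure = e;                                  // a StringError
}
```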
>>
>>
>
> 

