Full Unicode based on UTF-16 proposal

Norbert Lindenberg ecmascript at norbertlindenberg.com
Sun Mar 25 23:31:57 PDT 2012


Let's see:

- Conversion to UTF-8: If the string isn't well-formed, you wouldn't refuse to convert it, so isValid doesn't really help. You still have to look at all code units, and convert unpaired surrogates to the UTF-8 sequence for U+FFFD.

- Conversion from UTF-8: For security reasons, you have to check for well-formedness before conversion, in particular to catch non-shortest forms [1].

- HTML form data: Same situation as conversion to UTF-8.

- Base64 encodes binary data, so UTF-16 well-formedness rules don't apply.
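
A minimal sketch of the first bullet, assuming a hand-written encoder rather than any built-in API. The name toUtf8Bytes is hypothetical and does not come from this thread. It visits every code unit, combines valid surrogate pairs into one supplementary code point, and substitutes U+FFFD for any unpaired surrogate instead of refusing the input:

```javascript
// Hypothetical helper (not from the thread): encode a JavaScript string,
// i.e. a sequence of UTF-16 code units, as an array of UTF-8 byte values.
function toUtf8Bytes(str) {
  const bytes = [];
  for (let i = 0; i < str.length; i++) {
    let cp = str.charCodeAt(i);
    if (cp >= 0xD800 && cp <= 0xDBFF) {
      // High surrogate: valid only if a low surrogate follows.
      const next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
      if (next >= 0xDC00 && next <= 0xDFFF) {
        cp = 0x10000 + ((cp - 0xD800) << 10) + (next - 0xDC00);
        i++; // consume the low surrogate as well
      } else {
        cp = 0xFFFD; // unpaired high surrogate
      }
    } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
      cp = 0xFFFD; // low surrogate with no preceding high surrogate
    }
    // Emit the shortest UTF-8 form for the code point.
    if (cp < 0x80) {
      bytes.push(cp);
    } else if (cp < 0x800) {
      bytes.push(0xC0 | (cp >> 6), 0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
      bytes.push(0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F));
    } else {
      bytes.push(0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F));
    }
  }
  return bytes;
}
```

Because the loop already inspects every code unit, a separate validity check beforehand would not save any work in this case.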

I don't think we'd add an API just to flag an issue; that's what documentation is for.

Norbert

[1] http://www.unicode.org/reports/tr36/#UTF-8_Exploit



On Mar 25, 2012, at 1:57 , Roger Andrews wrote:

> I use something like String.isValid functionality in a transcoder that
> converts Strings to/from UTF-8, HTML Formdata (MIME type
> application/x-www-form-urlencoded -- not the same as URI encoding!), and
> Base64.
> 
> Admittedly these currently use 'encodeURI' to do the work, or it just drops
> out naturally when considering UTF-8 sequences.
> 
> (I considered testing the regexp
> /^(?:[\u0000-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])*$/
> against the input string.)
> 
> Maybe the function is too obscure for general use, although its presence does flag up the surrogate-pair issue to developers.
> 
> --------------------------------------------------
> From: "Norbert Lindenberg" <ecmascript at norbertlindenberg.com>
>> 
>> It's easy to provide this function, but in which situations would it be
>> useful? In most cases that I can think of you're interested in far more
>> constrained definitions of validity:
>> - what are valid ECMAScript identifiers?
>> - what are valid BCP 47 language tags?
>> - what are the characters allowed in a certain protocol?
>> - what are the characters that my browser can render?
>> 
>> Thanks,
>> Norbert
>> 
>> 
>> On Mar 24, 2012, at 12:12 , David Herman wrote:
>> 
>>> On Mar 23, 2012, at 11:45 AM, Roger Andrews wrote:
>>> 
>>>> Concerning UTF-16 surrogate pairs, how about a function like:
>>>>    String.isValid( str )
>>>> to discover whether surrogates are used correctly in 'str'?
>>>> 
>>>> Something like Array.isArray().
>>> 
>>> No need for it to be a class method, since it only operates on strings.
>>> We could simply have String.prototype.isValid(). Note that it would work
>>> for primitive strings as well, thanks to JS's automatic promotion
>>> semantics.
>>> 
>>> Dave
>>> 
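
For reference, the check discussed in this thread can be sketched with the regular expression Roger quotes. The function name isWellFormedUtf16 is an assumption made for illustration; it is neither the proposed String.isValid nor an existing built-in:

```javascript
// Hypothetical helper (name assumed for illustration): returns true when every
// surrogate code unit in str belongs to a high+low surrogate pair.
// Without the "u" flag the regular expression matches individual UTF-16
// code units, which is what this check needs.
const WELL_FORMED_UTF16 =
  /^(?:[\u0000-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])*$/;

function isWellFormedUtf16(str) {
  return WELL_FORMED_UTF16.test(str);
}
```

For example, isWellFormedUtf16("a\uD800b") is false because the high surrogate is not followed by a low surrogate, while isWellFormedUtf16("\uD83D\uDE00") is true.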



More information about the es-discuss mailing list