ECMAScript collation question

Norbert Lindenberg ecmascript at norbertlindenberg.com
Fri Aug 31 09:56:39 PDT 2012


OK, so the Unicode conformance question hinges on "must be able to do" versus "must do".

The question for ECMAScript then is whether we should stick with "must do" (the current state of the specifications) or change to "must be able to do".

The changes for "must be able to do" would be:

- In the Language specification, remove the description of String.prototype.localeCompare, and require implementations to follow the Internationalization API specification at least for this method, or, better, to provide the complete Internationalization API. That way, localeCompare acquires support for the normalization property in options, and the -kk- key in the Unicode locale extensions.

- In the Internationalization API specification, make support for the normalization property and the -kk- key mandatory (it's currently optional), but drop the separate requirement that canonically equivalent strings compare as 0.

This would give applications control over the trade-off between performance and full canonical equivalence, and let implementations select the default per locale.

But trading off correctness for performance in this way doesn't seem quite right. Especially for search usage, it could mean that you're staring at a Vietnamese or Arabic word in a list and the search function says it's not there because you typed an indistinguishable but different string into the search box.
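
To make the search scenario concrete, here is a minimal sketch of two canonically equivalent spellings of the Vietnamese letter "ệ" that look identical on screen but differ as code unit sequences. It assumes an engine that provides String.prototype.normalize (added in a later edition of the language than the ones discussed in this thread) and Intl.Collator; the helper names canonicallyEquivalent and findInList are made up for illustration and are not part of any specification.

```javascript
// Two canonically equivalent spellings of Vietnamese "ệ".
const precomposed = "\u1EC7";       // U+1EC7, a single precomposed character
const decomposed = "e\u0323\u0302"; // "e" + combining dot below + combining circumflex

// A plain code unit comparison treats them as different strings.
const sameCodeUnits = precomposed === decomposed;

// Hypothetical helper: normalizing both sides to NFC exposes the equivalence.
function canonicallyEquivalent(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

// Locale-sensitive comparison. Whether this is guaranteed to be 0 is exactly
// the "must do" versus "must be able to do" question, so no result is assumed here.
const collatorResult = new Intl.Collator("vi").compare(precomposed, decomposed);

// Hypothetical search helper that normalizes the list items and the query,
// so a decomposed query still finds a precomposed entry.
function findInList(list, query) {
  const normalizedQuery = query.normalize("NFC");
  return list.filter((item) => item.normalize("NFC") === normalizedQuery);
}

console.log(sameCodeUnits);                                  // false
console.log(canonicallyEquivalent(precomposed, decomposed)); // true
console.log(findInList(["vi\u1EC7t"], "vie\u0323\u0302t").length); // 1
console.log(collatorResult);
```

Normalizing on every comparison, as this sketch does, is the performance cost being weighed above; a collator that guarantees canonical equivalence would make the helpers unnecessary.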

Thanks,
Norbert


On Aug 31, 2012, at 8:24 , Nebojša Ćirić wrote:

> This is what Markus had to say (he implemented most of the collation for ICU):
> 
> "http://www.unicode.org/reports/tr10/#Avoiding_Normalization
> 
> Step 1 of the algorithm: http://www.unicode.org/reports/tr10/#Step_1
> which has a note:
> 	• Conformant implementations may skip this step in certain circumstances: see Section 6.5, Avoiding Normalization for more information.
> See also http://www.unicode.org/reports/tr10/#Parametic_Tailoring
> -> attribute "normalization", see the description there
> (this whole table 14 will soon move to the LDML spec, leaving only a link in this place)"
> 
> So the question is:
> 
> 1. Do we change i18n API default for normalization to always be true, with some performance penalty?
> 2. Update ES 262 spec with info Markus passed (if possible)?
> 
> 
> 2012/8/30 Mark Davis ☕ <mark at macchiato.com>
> ICU is always able to compare them as being equal, just by setting the parameter.
> 
> Even if the parameter isn't set, it uses an FCD sort (see http://unicode.org/notes/tn5/) and canonical closure, which handles most cases of canonical equivalence. The default is turned on for languages where the normal+auxiliary exemplar sets contain characters that would show a difference even with an FCD+closure sort, and can be turned on always if desired (at some cost in performance; 30% sounds high though).
> 
> Mark
> 
> — Il meglio è l’inimico del bene —
> 
> 
> 
> On Thu, Aug 30, 2012 at 6:30 PM, Norbert Lindenberg <ecmascript at norbertlindenberg.com> wrote:
> In particular, a conformant implementation must be able to compare any two canonical-equivalent strings as being equal, for all Unicode characters supported by that implementation."
> 
> 
> 
> 
> -- 
> Nebojša Ćirić


