i18n collator options

Shawn Steele Shawn.Steele at microsoft.com
Thu Jan 20 16:14:27 PST 2011


For UTF-16 order, do you use the Turkish casing rules if the locale is Turkish?

-Shawn

From: mark.edward.davis at gmail.com [mailto:mark.edward.davis at gmail.com] On Behalf Of Mark Davis
Sent: Thursday, January 20, 2011 4:05 hours
To: Shawn Steele
Cc: es-discuss at mozilla.org; Peter Constable; Derek Murman
Subject: Re: i18n collator options

(BTW I haven't gotten added to es-discuss yet, so one of you might forward these 3 messages to there.)


Mark

— Il meglio è l’inimico del bene —

On Thu, Jan 20, 2011 at 14:48, Shawn Steele <Shawn.Steele at microsoft.com> wrote:
For case-insensitive UTF-16 order, how do you get the casing mappings?

We use the Unicode mappings.
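
As a rough illustration of the difference being asked about, the sketch below contrasts the default Unicode lowercase mapping with the Turkish-tailored one in JavaScript. The helper name compareOrdinalIgnoreCase is hypothetical, and lowercasing is used only as a stand-in for whatever case mapping an implementation would actually apply.

```javascript
// Default, locale-independent Unicode mapping: "I" lowercases to "i" (U+0069).
const defaultLower = "I".toLowerCase();

// Turkish tailoring: "I" lowercases to dotless "ı" (U+0131).
const turkishLower = "I".toLocaleLowerCase("tr");

// Hypothetical helper: case-insensitive comparison in UTF-16 code unit order,
// using only the default Unicode mappings (no locale tailoring).
function compareOrdinalIgnoreCase(a, b) {
  const x = a.toLowerCase();
  const y = b.toLowerCase();
  return x < y ? -1 : x > y ? 1 : 0;
}
```

With the default mappings, "TITLE" and "title" compare equal regardless of the locale in effect, which is usually what a non-linguistic comparison wants.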

Normalization should maybe be deferred to after v0.5.  It’s not a direct option for me (I could normalize first though), so it’d require thinking.

This is not a major case, so I agree on deferring.


I’d prefer describing options that described the behavior you’ll get as opposed to the strength, which kind of bundles stuff together.  I guess that runs into the IgnoreDiacritics/IgnoreWidth issue though.

Yes, we can't support all of the options completely orthogonally. In practice, we've never seen a need to distinguish the widths. So I'd suggest the intersection of the two:


•         Ordinal – code point based, non-linguistic comparison; mutually exclusive with any other option.

•         IgnoreCase – Ignore case - on/off/default

•         IgnoreDiacritics – Ignore diacritics/nonspacing characters - on/off/default

•         SortDigitsAsNumbers – Eg: 12 comes before 101 - on/off/default
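
To make the behavior of these flags concrete, here is a minimal sketch that maps them onto the options of Intl.Collator, the collator API that ECMA-402 later standardized. It is not the strawman API under discussion; the helper makeCollator and its parameter names are assumptions made for illustration. Because the sensitivity option bundles case and diacritics together, a flag left unset is treated as "off" whenever the other one is set.

```javascript
// Hypothetical helper: translate IgnoreCase / IgnoreDiacritics /
// SortDigitsAsNumbers into Intl.Collator options.
function makeCollator(locale, flags = {}) {
  const { ignoreCase, ignoreDiacritics, sortDigitsAsNumbers } = flags;
  const options = {};

  // Only override sensitivity when at least one of the two flags was given;
  // otherwise the locale default applies.
  if (ignoreCase !== undefined || ignoreDiacritics !== undefined) {
    if (ignoreCase && ignoreDiacritics) options.sensitivity = "base";
    else if (ignoreCase) options.sensitivity = "accent";
    else if (ignoreDiacritics) options.sensitivity = "case";
    else options.sensitivity = "variant";
  }

  // "default" is modeled as undefined: leave the option out entirely.
  if (sortDigitsAsNumbers !== undefined) options.numeric = sortDigitsAsNumbers;

  return new Intl.Collator(locale, options);
}

// Usage: numeric sorting puts "12" before "101".
const numeric = makeCollator("en", { sortDigitsAsNumbers: true });
const sorted = ["101", "12"].sort(numeric.compare);
```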



The following I don't think is a high priority; that is, the default for the language should be fine.

•         IgnoreKanaType – Treat Hiragana and Katakana the same - on/off/default





-Shawn


From: mark.edward.davis at gmail.com [mailto:mark.edward.davis at gmail.com] On Behalf Of Mark Davis
Sent: Thursday, January 20, 2011 12:42 hours
To: Shawn Steele
Cc: es-discuss at mozilla.org; Peter Constable; Derek Murman
Subject: Re: i18n collator options

On #1 (delaying on others):

In ICU, the following are very easy:

  1.  code point order and/or UTF-16 order. Options:

     *   case-sensitivity: off, on
     *   normalization: none, nfc, nfd, nfkc, nfkd

  2.  language-sensitive. Options:

     *   Strength: default, primary (ignore accents, case, compat variants), secondary (ignore case, variants), tertiary (ignore minor variants), identical
     *   Numeric: default, off, on (eg, xyz12 > xyz2)
     *   Case: default, force-upper-first, force-lower-first
     *   Punctuation: default, ignore, don't ignore
     *   Case level: default, on, off (to get "ignore accents but not case", use strength:primary + case-level:on)
     *   Hiragana level: default, on, off (with off, hiragana not distinguished from katakana)
     *   (there are other options, but they are less important)
Under language-sensitive, the default for each option may vary according to the language.
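
The distinction in item 1 between code point order and UTF-16 order only shows up when supplementary characters are involved. A minimal sketch in JavaScript, assuming hypothetical helper names (compareUtf16, compareCodePoints, compareOrdinal) and the normalization form names accepted by String.prototype.normalize:

```javascript
// UTF-16 code unit order: this is what < and > do on JavaScript strings.
function compareUtf16(a, b) {
  return a < b ? -1 : a > b ? 1 : 0;
}

// Code point order: compare whole code points, so characters stored as
// surrogate pairs sort after every character in the BMP.
function compareCodePoints(a, b) {
  const x = Array.from(a, (ch) => ch.codePointAt(0));
  const y = Array.from(b, (ch) => ch.codePointAt(0));
  const n = Math.min(x.length, y.length);
  for (let i = 0; i < n; i++) {
    if (x[i] !== y[i]) return x[i] < y[i] ? -1 : 1;
  }
  return Math.sign(x.length - y.length);
}

// Optional normalization before comparing: "none", "NFC", "NFD", "NFKC", "NFKD".
function compareOrdinal(a, b, { normalization = "none" } = {}) {
  if (normalization !== "none") {
    a = a.normalize(normalization);
    b = b.normalize(normalization);
  }
  return compareCodePoints(a, b);
}

const bmp = "\uFF5E";       // U+FF5E, a single code unit
const astral = "\u{1F600}"; // U+1F600, stored as the pair D83D DE00

const utf16Result = compareUtf16(bmp, astral);          // surrogate D83D < FF5E
const codePointResult = compareCodePoints(bmp, astral); // FF5E < 1F600
```

The two orders disagree on this pair: in UTF-16 order the supplementary character sorts first, in code point order it sorts last.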

You can try out the functionality at http://goo.gl/GQuI

Compared to Windows (if I read your options correctly), the only real issue is that some of the options are not orthogonal: in particular, you can't have the equivalent of IgnoreDiacritics=true and IgnoreWidth=false. So if someone were to ask for that combination, the best we could supply would be IgnoreDiacritics=true and IgnoreWidth=true.
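
One way to picture that fallback is a small resolution step that widens an unsupported request to the nearest supported one. The function resolveFlags and the flag property names are hypothetical; only the rule itself comes from the paragraph above.

```javascript
// Hypothetical helper: reconcile requested flags with what the engine can supply.
// Rule taken from the text: ignoring diacritics while still distinguishing
// width is not available, so such a request is widened to ignore width too.
function resolveFlags(requested) {
  const resolved = { ...requested };
  if (resolved.ignoreDiacritics === true && resolved.ignoreWidth === false) {
    resolved.ignoreWidth = true;
  }
  return resolved;
}

// Usage: the caller can inspect what was actually granted.
const granted = resolveFlags({ ignoreDiacritics: true, ignoreWidth: false });
```

Returning the resolved flags, rather than silently adjusting them, lets a caller detect that the request was not honored exactly.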


Mark

— Il meglio è l’inimico del bene —
On Thu, Jan 20, 2011 at 10:56, Shawn Steele <Shawn.Steele at microsoft.com> wrote:
The i18n group said we’d figure out collator options by email.  This is an email ☺

The strawman used the “strength” term for collation options, however there seemed to be a general feeling that descriptive flags would be more useful.  So here’s an attempt at some flags.  These “I” can do fairly easily on Windows, but I don’t know how they’d work in ICU.


•         Ordinal – (code point based non-linguistic comparison.  This sort of defeats the purpose of passing in a locale, however it is a very common scenario for some people.  Eg: I don’t want to compare passwords in a linguistic fashion.)  This should basically be mutually exclusive with any other option.

•         IgnoreCase – Ignore case for case sensitive scripts.  Hopefully don’t ignore anything else, but some frameworks may have trouble with that

•         IgnoreDiacritics – Ignore diacritics/nonspacing characters.

•         IgnoreKanaType – Treat Hiragana and Katakana the same.

•         IgnoreWidth – Treat CJK full and halfwidth characters the same.

•         SortDigitsAsNumbers – Eg: 12 comes before 101.

Are these all “doable” for other frameworks? (ICU?)

Assuming that we go with the “no options == default options for the locale” model, then I’m a bit confused how we set these.  Seems like there are a few possibilities:


A)     { ignoreCase: true } (etc) – the problem is that if you wanted explicit behavior (overriding the defaults), you’d have to specify everything?  Or specifying any one thing would get rid of all the default behavior (so you only had what you explicitly requested), which might make it hard to get some specific behavior along with default behavior?  Maybe a decent way would be:

(Assuming that setting any of the “flag” options causes all of the locale compare flag defaults to then be false)

a.       To change some locale to be case sensitive and ignore width, maintaining the other behavior

var c1 = new Collator( en );
var options = c1.options;
options.ignoreCase = false;
options.ignoreWidth = true;
var c2 = new Collator(options);

b.      To explicitly set case sensitive and ignore width, losing any other default behavior of the locale:

var c1 = new Collator( { localeInfo: en, ignoreCase: false, ignoreWidth: true } );

(We’d ignore whatever the defaults were for en, making the ignoreCase:false redundant.  If any other flags (kana, etc) had defaults for the locale, then they’d be lost (false)).
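
The two cases under (A) can be sketched with a toy Collator class. Everything here is hypothetical: the class, the flag names, and especially the per-locale defaults, which are invented purely so that the difference between (a) and (b) is visible.

```javascript
const FLAG_NAMES = [
  "ignoreCase", "ignoreDiacritics", "ignoreKanaType",
  "ignoreWidth", "sortDigitsAsNumbers",
];
const ALL_OFF = Object.fromEntries(FLAG_NAMES.map((name) => [name, false]));

// Invented defaults, for illustration only.
const LOCALE_DEFAULTS = {
  en: { ...ALL_OFF },
  ja: { ...ALL_OFF, ignoreKanaType: true },
};

class Collator {
  constructor(arg) {
    const { localeInfo, ...flags } =
      typeof arg === "string" ? { localeInfo: arg } : arg;
    // Assumed rule from the proposal: setting any flag drops all locale
    // defaults to false before the explicit flags are applied.
    const anyExplicit = Object.keys(flags).length > 0;
    const base = anyExplicit ? ALL_OFF : LOCALE_DEFAULTS[localeInfo];
    this._options = { localeInfo, ...base, ...flags };
  }

  // Return a copy so callers can edit it without changing this collator.
  get options() {
    return { ...this._options };
  }
}

// (a) Start from the locale's resolved options, then override two flags.
const c1 = new Collator("ja");
const options = c1.options;
options.ignoreCase = false;
options.ignoreWidth = true;
const c2 = new Collator(options); // keeps ignoreKanaType from the defaults

// (b) Set flags directly: the locale's other defaults are lost.
const c3 = new Collator({ localeInfo: "ja", ignoreCase: false, ignoreWidth: true });
```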



B)      { flags: IgnoreCase | IgnoreWidth } (etc, someone can fix the syntax for me ☺)  -- Specify only the flags that you want to use.  You wouldn’t have to specify everything?
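
For (B), one possible JavaScript spelling uses bit flags combined with the bitwise OR operator. The CompareFlags object, its numeric values, and the hasFlag helper are assumptions made for this sketch, not part of any proposal.

```javascript
// Hypothetical flag constants: one bit per option.
const CompareFlags = Object.freeze({
  None: 0,
  IgnoreCase: 1 << 0,
  IgnoreDiacritics: 1 << 1,
  IgnoreKanaType: 1 << 2,
  IgnoreWidth: 1 << 3,
  SortDigitsAsNumbers: 1 << 4,
});

// Specify only the flags that are wanted.
const request = { flags: CompareFlags.IgnoreCase | CompareFlags.IgnoreWidth };

// Test whether a given flag is present in a combined value.
function hasFlag(flags, flag) {
  return (flags & flag) !== 0;
}
```

A bit field cannot express "leave this option at the locale default", so this form fits best if unlisted flags are simply off.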

Thoughts?  Suggestions?

- Shawn

 
http://blogs.msdn.com/shawnste


