Comments on internationalization API

Mark Davis ☕ mark at macchiato.com
Fri Jul 22 12:58:08 PDT 2011


You make some good points (and many that I agree with), but the main issue
is that we are having to produce a model that all the browser vendors can
sign up to. That necessitates some compromises, including some areas where
we can't have a concrete specification because the implementors want the
freedom to implement the functionality in different ways.

If you want to engage more, there is a F2F next week. Cira can get you
details.

Mark
*— Il meglio è l’inimico del bene —*


On Thu, Jul 21, 2011 at 17:14, Norbert Lindenberg <
ecmascript at norbertlindenberg.com> wrote:

> Hi Mark,
>
> Thanks for your comments! Replies to some of them below. I also noticed
> some additional issues:
>
> 19. DateTimeFormat.prototype.getMonths needs a second parameter {boolean}
> standalone, default value false.
>
> 20. There needs to be a way to determine the actual language, region, and
> options of a Collator, NumberFormat, or DateTimeFormat. E.g., if I request
> ar-MA-u-ca-islamic, did I get exactly what I requested, or
> ar-MA-u-ca-islamicc, ar-MA-u-ca-gregory, ar-u-ca-gregory, or yet something
> else?
>
> Best regards,
> Norbert
>
>
> On Jul 20, 2011, at 9:46 , Mark Davis ☕ wrote:
>
> > I have comments on some of these.
> >
> > Mark
> > — Il meglio è l’inimico del bene —
> >
> >
> > On Tue, Jul 19, 2011 at 01:29, Norbert Lindenberg <
> ecmascript at norbertlindenberg.com> wrote:
> >> Hi all,
> >>
> >> I'm sorry for not having been able to contribute to the
> internationalization API earlier. I finally have reviewed the straw man [1],
> and am pleased to see that it contains a good subset of internationalization
> functionality to start with. Number and date formatting and collation are
> issues that most applications have to deal with. Collation especially, but
> also date formatting with support for multiple time zones and calendars are
> hard to implement as downloadable libraries.
> >>
> >> I have some comments on the details though:
> >>
> >> 1. In the background section, it might be useful to add that with
> Node.js server-side JavaScript is seeing a rebound, and applications don't
> really want to have to call out to a non-JavaScript server in order to
> handle basic internationalization.
> >>
> >> 2. In the goals section, I'd qualify the "reuse of objects" goal as a
> reuse of implementation data structures, or even better replace it with
> measurable performance goals. Reuse of objects that are visible to
> applications has security and privacy implications, especially when loading
> third party code (apps or ads) onto pages [2]. I'd recommend letting
> applications freely construct Collator, NumberFormat, and DateTimeFormat
> objects, but have these objects share implementation objects (such as ICU
> objects) as much as possible. If the API does return shared objects, the
> security issues need to be dealt with, e.g., by specifying that the shared
> objects are immutable.
> >
> > I think it is reasonable to rephrase this as "implementation data
> structures".
> >
> >> 3. I'm very uncomfortable with the LocaleInfo class. It seems to pretend
> to be the central source of all locale-related information, but can't live
> up to that claim because its design is limited to number and date formatting
> and collation. Developers will need to create other functionality such as
> text segmentation, spelling checking, message lookup, shoe size conversion,
> etc. LocaleInfo appears to perform some magic to derive regions, currencies,
> and possibly time zones, but doesn't specify it, and makes none of it
> available to other internationalization classes. It also does duty as a
> namespace, which looks odd in an EcmaScript standard that otherwise doesn't
> know namespaces.
> >
> > I don't think it is ideal; I share some of your qualms about it. However,
> it is what we were able to compromise on. Because the LocaleInfo class does
> do the resolution, and that information is available after creation, the
> information is available for other services. And people could (being ES)
> hang services off of their own LocaleInfo class.
>
> So is this the current recommendation?: A library that provides word break
> and line break functionality should be based on a class MyLocaleInfo, which
> provides WordBreak and LineBreak classes whose constructors clients should
> not call, and wordBreak and lineBreak functions that return objects of these
> classes. An application that uses multiple such libraries (providing
> different sets of internationalized functionality) has to create objects of
> all their LocaleInfo classes so that it can request objects of the classes
> that it actually needs.
>
> What value do these LocaleInfo classes add, compared to having constructors
> of the actually needed classes that can be called directly?
>
> Also, the LocaleInfo API, as currently documented, doesn't provide any
> information that a third party internationalization library could use. Some
> comments sound like there should be a property "options", but this property
> and the derivation of its values aren't actually documented.
>
> >> Other internationalization libraries have a core that anybody can build
> on to create internationalization functionality. In Java, for example, the
> Locale and Currency classes handle a variety of identifier mappings, while
> the ResourceBundle class handles loading of localized data with fallbacks
> [3]. In the Yahoo User Interface library, the Intl module does language
> negotiation and collaborates with the YUI loader in loading localized data
> [4]. I'd suggest separating similar functionality in LocaleInfo from the
> formatting and collation functionality and making it available to all. I
> suspect though that some of the current magic will turn out to be misguided
> when looked at in the clear light of a specification and will need to be
> discarded.
> >>
> >> 4. Language IDs in the library should be those of BCP 47, not of Unicode
> LDML. The two are similar, but there are subtle differences, as described in
> the LDML spec: LDML excludes some BCP 47 tags and subtags, adds a separator
> and the root locale, and changes the semantics of some tags [5]. Since BCP
> 47 is the dominant standard for language identification, internationalized
> applications have to support it. If an implementation of the
> internationalization API is based on LDML, it should handle the mapping
> from/to BCP 47 itself rather than burdening applications with it.
> >
> > Every LDML language ID is also a BCP 47 language tag. LDML eliminates
> some of the deadwood in BCP47 (the old irregular forms) but has the same
> expressive power and somewhat more. There are some codes that are not
> defined in BCP47 that turn out to be very important for implementations,
> like the Unknown region.
> >
> > I'm well familiar with both, being an author of each.
>
> I don't like to argue with the author of these specs, but their actual
> content doesn't seem to fully agree with what you say. If I read UTR 35
> section 3 correctly, "de_DE" is a valid LDML language ID; if I read RFC 5646
> section 2.1 correctly, it is not a BCP 47 language tag.
>
> But the real issue is that referencing LDML in this API while all other
> specs that application developers work with reference BCP 47 means that
> application developers have to read both specs, figure out the differences
> between them, decide which of these differences matter to their
> applications, and then implement the necessary compatibility mechanisms. I
> think that's a major and totally unnecessary hurdle to adoption of this API.
> These details are much better dealt with underneath the API.
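
A minimal sketch of the kind of mapping that could live underneath the API, assuming only the two surface differences mentioned above: the extra "_" separator and the root locale, which is assumed here to map to "und" as UTR 35 describes. The name toBcp47Tag is hypothetical and not part of the straw man. A real implementation would also have to handle the tags that LDML excludes or reinterprets.

```javascript
// Hypothetical helper, not part of the straw man.
// Converts an LDML-style language ID such as "de_DE" into a
// BCP 47-style tag such as "de-DE".
function toBcp47Tag(ldmlId) {
  if (typeof ldmlId !== "string" || ldmlId.length === 0) {
    throw new TypeError("language ID must be a non-empty string");
  }
  // LDML accepts both "_" and "-" as separators; BCP 47 only "-".
  const subtags = ldmlId.split(/[_-]/);
  // Assumption: the LDML root locale corresponds to "und".
  if (subtags[0].toLowerCase() === "root") {
    subtags[0] = "und";
  }
  return subtags.join("-");
}
```

With this sketch, toBcp47Tag("de_DE") returns "de-DE", and a tag that is already in BCP 47 form, such as "ar-MA-u-ca-islamic", comes back unchanged.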
>
> >> 5. The specification mentions that a few Unicode extensions in BCP 47
> (-u-ca-, -u-co-) can be used for specific purposes, but is silent on whether
> other extensions are encouraged/allowed/ignored/illegal. This should be
> clarified.
> >
> > Agreed. What it should add is one line saying that the handling of
> any other BCP47 extensions is implementation dependent.
>
> No, I think this needs to be decided and documented for each extension
> separately. For applications, there is a big difference between extensions
> that affect the presentation and those that change the meaning of data. For
> presentation, I want the API to make its best effort in selecting the right
> presentation for the user, but it's understandable and acceptable that
> different implementations will differ in their results. For extensions that
> affect the meaning of data, on the other hand, the behavior of the library
> must be totally predictable. I think -u-cu- falls into that group, and as
> discussed under item 9 below, this extension should be ignored or illegal.
> Similarly, some collator extensions affect which strings compare as equal,
> and therefore which subset will be selected from a set of strings -
> applications may want to have control over this. And I'd rather see -u-tz-
> ignored and replaced with a separate time zone specification - see item 12
> below.
>
> >> 6. Region IDs should be those of ISO 3166. The straw man references
> "LDML region subtags" instead; I haven't been able to find a definition of
> this term.
> >
> > No. ISO 3166 IDs are notoriously badly managed; they cavalierly reuse
> codes for different countries over time. That is one of the reasons why
> BCP47 had to put in place a registry and mechanism for dealing with the
> instabilities introduced by ISO. The LDML region subtags should be more
> properly phrased as "unicode_region_subtag". They are based on BCP47 but add
> (at the time of this writing) 2 codes.
>
> OK, how about saying "Region IDs are country codes following the rules
> established in RFC 5646, section 2.2.4, rules 2 and 4.C. This means that as
> of July 2011, they match ISO 3166-1 alpha-2 country codes, but will use UN
> M.49 numeric codes for new countries or areas instead if and where ISO
> 3166-1 reassigns formerly used codes to such countries or areas."?
>
> >> If "ZZ" is really necessary for the API, then it should be called out
> directly in the API spec. But what information does "ZZ" convey that
> EcmaScript's "undefined" doesn't?
> >
> > You can't write (de-undefined) as a valid language subtag / code.
>
> But in that case I can just write "de", no?
>
> >> 7. The priority list matching algorithm is not well specified. It
> doesn't seem to match the BCP 47 Lookup algorithm however [6], and I'd
> expect that algorithm to be available at least as a baseline (enhancements
> might be offered as well).
> >
> > That algorithm is not particularly good. It could be mentioned as one of
> the possible algorithms, however.
>
> I agree it's not particularly good, but it's relatively easy to understand
> and can be the starting point for better ones. In any case, clear
> specifications are required so that application developers know what they
> can expect.
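
For reference, a sketch of the RFC 4647 section 3.4 Lookup algorithm as a possible baseline. It assumes case-insensitive comparison and progressive truncation from the end of each range, with a trailing single-character subtag dropped along with the subtag that followed it. The function and parameter names are illustrative only; the straw man defines no such function.

```javascript
// Illustrative baseline only: RFC 4647 "Lookup" over a priority list.
// priorityList:  language ranges in order of preference, e.g. ["de-CH", "fr"]
// availableTags: tags the implementation has data for
// defaultTag:    returned when nothing matches
function lookup(priorityList, availableTags, defaultTag) {
  // Compare case-insensitively, but return the tag as it was supplied.
  const available = new Map(availableTags.map((t) => [t.toLowerCase(), t]));
  for (const range of priorityList) {
    if (range === "*") continue; // the wildcard range is skipped in Lookup
    let candidate = range.toLowerCase();
    while (candidate.length > 0) {
      if (available.has(candidate)) return available.get(candidate);
      const cut = candidate.lastIndexOf("-");
      if (cut < 0) break; // nothing left to truncate
      candidate = candidate.slice(0, cut);
      // A single-character subtag (such as "-u" or "-x") must not be
      // left at the end, so remove it in the same step.
      if (candidate.length >= 2 && candidate[candidate.length - 2] === "-") {
        candidate = candidate.slice(0, -2);
      }
    }
  }
  return defaultTag;
}
```

For the range "de-CH-u-co-phonebk" the candidates tried are "de-ch-u-co-phonebk", "de-ch-u-co", "de-ch", and "de", so lookup(["de-CH-u-co-phonebk"], ["de", "fr"], "en") returns "de".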
>
> >> 8. The specifications of NumberFormat and DateTimeFormat list several
> optional features: Support for scientific notation in NumberFormat; support
> for various styles and skeletons in DateTimeFormat. How can applications
> find out which of these optional features are supported by an actual
> implementation?
> >
> > I don't think there is a mechanism currently. It is a 'best effort'.
>
> There should be a well-defined mechanism, so that developers can find out
> where they can rely on the implementation of the API and where they have to
> roll their own implementation.
>
> >> 9. Currency formatting should require applications to explicitly specify
> the currency, using an ISO 4217 currency code, when constructing a currency
> number format. Currencies are really part of the value; they're not a
> presentation preference. Imagine a European e-commerce site calculating its
> prices in euro, but then displaying the values with the Korean won symbol
> just because the user configured his browser to send "Accept-Language:
> de-DE-u-cu-KRW" or "Accept-Language: de-KR"... [7].
> >
> > No argument there. However, applications also want to be able to access
> the default currency for a given country. We tossed around different ideas
> for doing that, and came up with the current mechanism.
>
> How about:
> /**
>  * Returns the ISO 4217 currency code of the default currency
>  * for the country or territory identified by the given region
>  * ID. Returns undefined if the region ID is not for a
>  * currently existing country or territory, or if the
>  * country or territory does not have a default currency.
>  * Throws an error if no argument is provided, if the first
>  * argument is not a string, or if the string is not a
>  * well-formed region ID.
>  */
> LocaleInfo.prototype.defaultCurrencyFor(regionID)
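
A rough sketch of how the behavior described in that comment could look, written as a standalone function rather than a LocaleInfo method. The lookup table is a tiny illustrative sample and not real locale data, and the well-formedness check assumes a region ID is either two letters or three digits. The choice of TypeError and RangeError is also an assumption, since the comment only says "throws an error".

```javascript
// Illustrative sample only; a real implementation would need complete,
// maintained region-to-currency data.
const SAMPLE_DEFAULT_CURRENCIES = { DE: "EUR", KR: "KRW", US: "USD", JP: "JPY" };

function defaultCurrencyFor(regionID) {
  // Covers both a missing argument and a non-string argument.
  if (typeof regionID !== "string") {
    throw new TypeError("region ID must be a string");
  }
  // Assumed well-formedness rule: two letters or three digits.
  if (!/^(?:[A-Za-z]{2}|[0-9]{3})$/.test(regionID)) {
    throw new RangeError("not a well-formed region ID: " + regionID);
  }
  const key = regionID.toUpperCase();
  // Unknown regions, or regions without a default currency, give undefined.
  return Object.prototype.hasOwnProperty.call(SAMPLE_DEFAULT_CURRENCIES, key)
    ? SAMPLE_DEFAULT_CURRENCIES[key]
    : undefined;
}
```

An application would then pass the result explicitly when constructing a currency format, so the currency never depends on the language tag the browser happens to send.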
>
> >> 10. Are the limits described for the NumberFormat parameters defaults or
> hard limits? It doesn't seem to make sense to impose hard limits such as
> "max 3 fraction digits, min 0".
> >
> > That should be clarified. These are defaults, not hard limits.
> >
> >
> >> 11. The description of the DateTimeFormat constructors refers to
> "LocaleInfo.prototype.numberFormat".
> >>
> >> 12. DateTimeFormat needs to provide a way for applications to specify
> the time zone, identified by a tz database identifier [8]. Browser-side code
> may need this capability to enforce a site-dependent time zone (e.g., a US
> financial site has to display quotes in New York City time), while
> server-side code may have to use the user's time zone. While it's possible
> to encode the time zone as part of a language ID (e.g., "en-AU-u-tz-auldh"
> to add Australia/Lord_Howe to Australian English), languages and time zones
> are really orthogonal concepts that should be kept separate, and the tz
> database identifiers are the most widely used identifiers for time zones.
> >
> > I firmly agree. However, the committee was split on how to do this, and
> decided to do that in a follow-up.
>
> This seems like a deficiency that would seriously limit adoption of the
> library, especially server-side. What are the issues that the committee
> couldn't agree on?
>
> >> 13. DateTimeFormat also needs to let applications specify whether and
> how to include a time zone display name in the output. In CLDR, that's
> typically tied to the time style - long and full have the time zone, while
> short and medium don't. In reality, applications need to indicate the time
> zone to users if (and only if) it's not obvious from the context, and that's
> orthogonal to whether they want seconds.
> >
> > Ditto.
> >
> >
> >> 14. There are a few additional DateTimeFormat skeletons that I think
> would be commonly used in applications:
> >> - MMMdEEE, MMMMdEEEE: month, day, weekday in either abbreviated or full
> width; intended for dates in the current year.
> >> - jmm: hour and minute, in 12-hour or 24-hour format as appropriate for
> the locale.
> >> - jjjmmm: hour and minute, and if necessary am/pm, but with the
> appropriate characters for hour and minute rather than a colon in languages
> where that's commonly used, such as Chinese/Japanese/Korean: 오후 11시 5분.
> Falls back to jmm in other languages.
> >> - z, zzzz: time zone names.
> >> Other notes:
> >> - yyyyMMMMd, "era only if necessary": should explain what that means,
> e.g., "era only for those calendars that need eras in order to uniquely
> identify all years after 1900".
> >> - It must be possible to combine skeletons for date, time, and time zone
> (at most one each).
> >
> > Agreed, but we were just able to agree on a core set. Others could be
> supplied, but the result would be a 'best-effort' according to the
> implementation.
>
> If supporting all desirable skeletons is too much, would it be possible to
> make the result of the underlying calendar and time zone calculations
> available so that third parties can implement the formats they (or their
> customers) need? Formatting is actually not that hard; a number of libraries
> and applications have implemented it, but usually they rely on
> Date.prototype.get[FullYear|Month|Date|Hours|Minutes|Seconds] and so are
> tied to the Gregorian calendar and the runtime's default time zone. The
> following function might help:
>
> /**
>  * Returns date and time components for the given Date
>  * object based on this format's time zone and calendar.
>  * @param {Date} date the date to be interpreted
>  * @return {Object} object with the following properties:
>  *    - era: integer, the era in this format's calendar;
>  *      can be used to index into the array returned by getEras
>  *    - needsEra: boolean, whether the calendar used by this
>  *      format has had more than one era since the Gregorian
>  *      1900-01-01 and therefore needs an era indicator to
>  *      disambiguate recent years
>  *    - year: integer, the year within the era
>  *    - month: integer, 0-based, the month within the year;
>  *      can be used to index into the array returned by getMonths
>  *    - date: integer, 1-based, the day within the month
>  *    - weekday: integer, 0-based, the day of the week; can be
>  *      used to index into the array returned by getWeekdays
>  *    - hours: integer, 0-based, the hour within the day
>  *    - minutes: integer, 0-based, the minutes within the hour
>  *    - seconds: integer, 0-based, the seconds within the minute
>  *    - milliseconds: integer, 0-based, the milliseconds within
>  *      the second
>  *    - inDST: boolean, whether the given time is within daylight
>  *      saving time
>  */
> DateTimeFormat.prototype.localTime(date)
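
To make the idea concrete, here is a partial sketch of such a function. It is not based on the straw man: it relies on Intl.DateTimeFormat.prototype.formatToParts from the later ECMA-402 standard, takes the time zone as an explicit tz database identifier, and covers only the Gregorian calendar and a subset of the fields listed above (no era, weekday, milliseconds, or inDST).

```javascript
// Partial sketch: returns Gregorian date and time fields for `date`
// as seen in the given tz database time zone (e.g. "Asia/Seoul").
function localTime(date, timeZone) {
  const format = new Intl.DateTimeFormat("en-US", {
    timeZone,
    hourCycle: "h23", // hours 0-23, no am/pm part
    year: "numeric",
    month: "numeric",
    day: "numeric",
    hour: "numeric",
    minute: "numeric",
    second: "numeric",
  });
  // Collect the numeric parts; separators arrive as "literal" parts.
  const fields = {};
  for (const part of format.formatToParts(date)) {
    if (part.type !== "literal") {
      fields[part.type] = Number(part.value);
    }
  }
  return {
    year: fields.year,
    month: fields.month - 1, // 0-based, as in the proposal above
    date: fields.day,
    hours: fields.hour,
    minutes: fields.minute,
    seconds: fields.second,
  };
}
```

The same instant yields different fields per zone: for new Date(Date.UTC(2011, 6, 22, 19, 58, 8)), "America/Los_Angeles" gives hours 12 on the 22nd, while "Asia/Seoul" gives hours 4 on the 23rd.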
>
> >> 15. It seems that the correct handling of missing dateStyle or timeStyle
> parameters would be to omit the date or time from the formatted output.
> >
> > I agree, I think we should fix that.
> >
> >
> >> 16. DateTimeFormat.prototype.getAmPm is described as "array of eras".
> Beyond that typo, is this function really useful, given that many locales
> don't have am/pm strings, and LDML has deprecated the corresponding element?
> >
> > am/pm is still used in LDML; there is just an alternate element that is
> preferred (dayPeriods). However, I think the result should be a map, e.g.
> > var am = x.getAmPm()["am"]
> >
> >
> >> 17. Error handling needs to be specified in detail. I assume this will
> be done once the functionality is settled, so I won't go into much detail
> now. However, contrary to the current statement "invalid language ids or
> non-string elements should be ignored" (in priority lists), I think the
> library should throw errors for erroneous input. Language tags should at
> least be String objects and well-formed according to BCP 47 [9]. Similarly,
> an exception should be thrown if some value other than a Date object is
> passed into DateTimeFormat.prototype.format. Note that exceptions in
> EcmaScript do not oblige the direct caller to use try/catch - they're like
> unchecked exceptions in Java.
> >
> > The group debated how to handle exceptions; there are pluses and minuses
> to using a 'best-effort' approach vs throwing an exception. The feeling I
> got was that people are generally less in favor of exceptions if there can
> be a graceful recovery.
> >
> >
> >> 18. I know there has been a proposal for and discussion of MessageFormat
> functionality - is there a record of why it got removed from the strawman?
> >
> > Again, there was not agreement, and so we postponed it.
>
> What were the issues?
>
> >> References:
> >>
> >> [1] http://wiki.ecmascript.org/doku.php?id=strawman:i18n_api, version
> 2011-07-01.
> >> [2] http://code.google.com/p/google-caja/wiki/GlobalObjectPoisoning
> >> [3]
> http://download.oracle.com/javase/6/docs/technotes/guides/intl/overview.html#locale
> >> [4] http://developer.yahoo.com/yui/3/intl/
> >> [5]
> http://unicode.org/reports/tr35/#Unicode_Language_and_Locale_Identifiers
> >> [6] http://tools.ietf.org/html/rfc4647#section-3.4
> >> [7] http://finance.yahoo.com/currency-converter/?amt=1&from=EUR&to=KRW
> >> [8] http://www.twinsun.com/tz/tz-link.htm
> >> [9] http://tools.ietf.org/html/rfc5646#section-2.2.9
> >>
> >> Best regards,
> >> Norbert
>
>