Comments on internationalization API

Norbert Lindenberg ecmascript at norbertlindenberg.com
Tue Jul 19 01:29:28 PDT 2011


Hi all,

I'm sorry for not having been able to contribute to the internationalization API earlier. I have finally reviewed the straw man [1], and am pleased to see that it contains a good subset of internationalization functionality to start with. Number and date formatting and collation are issues that most applications have to deal with. Collation especially, but also date formatting with support for multiple time zones and calendars, is hard to implement as a downloadable library.

I have some comments on the details though:

1. In the background section, it might be useful to add that, with Node.js, server-side JavaScript is seeing a rebound, and applications don't really want to have to call out to a non-JavaScript server in order to handle basic internationalization.

2. In the goals section, I'd qualify the "reuse of objects" goal as a reuse of implementation data structures, or even better replace it with measurable performance goals. Reuse of objects that are visible to applications has security and privacy implications, especially when loading third party code (apps or ads) onto pages [2]. I'd recommend letting applications freely construct Collator, NumberFormat, and DateTimeFormat objects, but have these objects share implementation objects (such as ICU objects) as much as possible. If the API does return shared objects, the security issues need to be dealt with, e.g., by specifying that the shared objects are immutable.
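A minimal sketch of this split, in plain JavaScript. All names here (implementationCache, loadCollationData, createCollator) are hypothetical and not part of the straw man, and the comparison is a placeholder rather than real collation. The point is only that each caller gets its own immutable object while the expensive data is shared behind the scenes.

```javascript
// Hypothetical sketch: implementation data is cached and shared internally,
// while every application-visible object is created fresh and frozen.
const implementationCache = new Map();

// Stand-in for loading expensive locale data (for example an ICU collator).
function loadCollationData(languageTag) {
  let data = implementationCache.get(languageTag);
  if (data === undefined) {
    data = Object.freeze({ languageTag: languageTag });
    implementationCache.set(languageTag, data);
  }
  return data;
}

function createCollator(languageTag) {
  const data = loadCollationData(languageTag); // shared, never handed out directly
  // Each caller receives its own frozen object, so third-party code
  // cannot modify a collator that another script relies on.
  return Object.freeze({
    compare: function (a, b) {
      // Placeholder: code unit order. A real implementation would consult `data`.
      return a < b ? -1 : a > b ? 1 : 0;
    },
    resolvedLanguage: function () {
      return data.languageTag;
    }
  });
}
```

Two calls to createCollator("de") return distinct frozen objects, but only one entry is created in the internal cache.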

3. I'm very uncomfortable with the LocaleInfo class. It seems to pretend to be the central source of all locale-related information, but can't live up to that claim because its design is limited to number and date formatting and collation. Developers will need to create other functionality such as text segmentation, spell checking, message lookup, shoe size conversion, etc. LocaleInfo appears to perform some magic to derive regions, currencies, and possibly time zones, but doesn't specify it, and makes none of it available to other internationalization classes. It also does duty as a namespace, which looks odd in an EcmaScript standard that otherwise doesn't know namespaces.

Other internationalization libraries have a core that anybody can build on to create internationalization functionality. In Java, for example, the Locale and Currency classes handle a variety of identifier mappings, while the ResourceBundle class handles loading of localized data with fallbacks [3]. In the Yahoo User Interface library, the Intl module does language negotiation and collaborates with the YUI loader in loading localized data [4]. I'd suggest separating similar functionality in LocaleInfo from the formatting and collation functionality and making it available to all. I suspect though that some of the current magic will turn out to be misguided when looked at in the clear light of a specification and will need to be discarded.

4. Language IDs in the library should be those of BCP 47, not of Unicode LDML. The two are similar, but there are subtle differences, as described in the LDML spec: LDML excludes some BCP 47 tags and subtags, adds a separator and the root locale, and changes the semantics of some tags [5]. Since BCP 47 is the dominant standard for language identification, internationalized applications have to support it. If an implementation of the internationalization API is based on LDML, it should handle the mapping from/to BCP 47 itself rather than burdening applications with it.
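As a rough illustration of the kind of mapping an LDML-based implementation would have to hide, the sketch below converts an LDML-style identifier to BCP 47 form. The function name ldmlToBcp47 is hypothetical. It covers only two of the differences mentioned above, the alternate "_" separator and the root locale (which corresponds to "und" in BCP 47). The excluded tags and changed semantics are not handled.

```javascript
// Hypothetical helper: handles only the separator and root-locale differences
// between LDML identifiers and BCP 47 language tags.
function ldmlToBcp47(ldmlId) {
  const subtags = ldmlId.replace(/_/g, "-").split("-");
  if (subtags[0].toLowerCase() === "root") {
    subtags[0] = "und"; // BCP 47 uses "und" (undetermined) where LDML uses "root"
  }
  return subtags.join("-");
}
```

For example, ldmlToBcp47("zh_Hant_TW") yields "zh-Hant-TW", and ldmlToBcp47("root") yields "und".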

5. The specification mentions that a few Unicode extensions in BCP 47 (-u-ca-, -u-co-) can be used for specific purposes, but is silent on whether other extensions are encouraged, allowed, ignored, or illegal. This should be clarified.

6. Region IDs should be those of ISO 3166. The straw man references "LDML region subtags" instead; I haven't been able to find a definition of this term. If "ZZ" is really necessary for the API, then it should be called out directly in the API spec. But what information does "ZZ" convey that EcmaScript's "undefined" doesn't?

7. The priority list matching algorithm is not well specified. However, it doesn't seem to match the BCP 47 Lookup algorithm [6], and I'd expect that algorithm to be available at least as a baseline (enhancements might be offered as well).
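For reference, the Lookup algorithm of RFC 4647, section 3.4, is small enough to sketch in a few lines. The function name lookup and its parameters are illustrative only and do not come from the straw man. Each range in the priority list is truncated from the end, subtag by subtag, until it matches an available tag; a single-character subtag left at the end is removed together with the subtag that followed it.

```javascript
// Sketch of RFC 4647 Lookup (section 3.4). Names are illustrative.
function lookup(priorityList, availableTags, defaultTag) {
  // Matching is case-insensitive, so index the available tags by lower case.
  const available = new Map();
  for (const tag of availableTags) {
    available.set(tag.toLowerCase(), tag);
  }
  for (const range of priorityList) {
    if (range === "*") continue; // the wildcard range is skipped in Lookup
    let candidate = range.toLowerCase();
    while (candidate.length > 0) {
      if (available.has(candidate)) {
        return available.get(candidate);
      }
      let cut = candidate.lastIndexOf("-");
      if (cut === -1) break; // nothing left to truncate for this range
      // If truncation would leave a single-character subtag (such as "x")
      // at the end, remove that subtag as well.
      if (cut >= 2 && candidate.charAt(cut - 2) === "-") {
        cut -= 2;
      }
      candidate = candidate.substring(0, cut);
    }
  }
  return defaultTag;
}
```

With available tags "de", "fr", and "en", the priority list ["de-CH-1996", "fr"] resolves to "de": the first range is truncated to "de-CH" and then to "de" before the second range is ever considered.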

8. The specifications of NumberFormat and DateTimeFormat list several optional features: Support for scientific notation in NumberFormat; support for various styles and skeletons in DateTimeFormat. How can applications find out which of these optional features are supported by an actual implementation?

9. Currency formatting should require applications to explicitly specify the currency, using an ISO 4217 currency code, when constructing a currency number format. Currencies are really part of the value; they're not a presentation preference. Imagine a European e-commerce site calculating its prices in euro, but then displaying the values with the Korean won symbol just because the user configured his browser to send "Accept-Language: de-DE-u-cu-KRW" or "Accept-Language: de-KR"... [7].
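A sketch of what such a requirement could look like follows. The function createCurrencyFormat and its output format are hypothetical placeholders, not a proposed API. The only point is that the currency comes from the application, is never derived from the language tag, and that construction fails without it.

```javascript
// Hypothetical sketch: the currency code is a required argument and is
// never inferred from the language tag.
function createCurrencyFormat(languageTag, currencyCode) {
  if (typeof currencyCode !== "string" || !/^[A-Z]{3}$/.test(currencyCode)) {
    throw new TypeError("An ISO 4217 currency code is required");
  }
  return {
    format: function (amount) {
      // Placeholder output. A real implementation would use languageTag to
      // pick symbol, digit grouping, and placement; this sketch ignores it.
      return currencyCode + " " + amount.toFixed(2);
    }
  };
}
```

Under these assumptions, createCurrencyFormat("de-KR", "EUR").format(12.5) produces "EUR 12.50" regardless of the region in the language tag, and createCurrencyFormat("de-KR") throws.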

10. Are the limits described for the NumberFormat parameters defaults or hard limits? It doesn't seem to make sense to impose hard limits such as "max 3 fraction digits, min 0".

11. The description of the DateTimeFormat constructors refers to "LocaleInfo.prototype.numberFormat".

12. DateTimeFormat needs to provide a way for applications to specify the time zone, identified by a tz database identifier [8]. Browser-side code may need this capability to enforce a site-dependent time zone (e.g., a US financial site has to display quotes in New York City time), while server-side code may have to use the user's time zone. While it's possible to encode the time zone as part of a language ID (e.g., "en-AU-u-tz-auldh" to add Australia/Lord_Howe to Australian English), languages and time zones are really orthogonal concepts that should be kept separate, and the tz database identifiers are the most widely used identifiers for time zones.

13. DateTimeFormat also needs to let applications specify whether and how to include a time zone display name in the output. In CLDR, that's typically tied to the time style - long and full have the time zone, while short and medium don't. In reality, applications need to indicate the time zone to users if (and only if) it's not obvious from the context, and that's orthogonal to whether they want seconds.

14. There are a few additional DateTimeFormat skeletons that I think would be commonly used in applications:
- MMMdEEE, MMMMdEEEE: month, day, weekday in either abbreviated or full width; intended for dates in the current year.
- jmm: hour and minute, in 12-hour or 24-hour format as appropriate for the locale.
- jjjmmm: hour and minute, and if necessary am/pm, but with the appropriate characters for hour and minute rather than a colon in languages where that's commonly used, such as Chinese/Japanese/Korean: 오후 11시 5분. Falls back to jmm in other languages.
- z, zzzz: time zone names.
Other notes:
- yyyyMMMMd, "era only if necessary": should explain what that means, e.g., "era only for those calendars that need eras in order to uniquely identify all years after 1900".
- It must be possible to combine skeletons for date, time, and time zone (at most one each).

15. It seems that the correct handling of missing dateStyle or timeStyle parameters would be to omit the date or time from the formatted output.

16. DateTimeFormat.prototype.getAmPm is described as "array of eras". Beyond that typo, is this function really useful, given that many locales don't have am/pm strings, and LDML has deprecated the corresponding element?

17. Error handling needs to be specified in detail. I assume this will be done once the functionality is settled, so I won't go into much detail now. However, contrary to the current statement "invalid language ids or non-string elements should be ignored" (in priority lists), I think the library should throw errors for erroneous input. Language tags should at least be String objects and well-formed according to BCP 47 [9]. Similarly, an exception should be thrown if some value other than a Date object is passed into DateTimeFormat.prototype.format. Note that exceptions in EcmaScript do not oblige the direct caller to use try/catch - they're like unchecked exceptions in Java.
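To make the suggestion concrete, here is a sketch of strict priority list validation. The function name validatePriorityList is hypothetical, and the regular expression is a deliberately rough approximation of well-formedness (a first subtag of 2 to 8 letters, or "x" or "i", followed by subtags of 1 to 8 alphanumerics); it is not the full RFC 5646 grammar.

```javascript
// Rough approximation of BCP 47 well-formedness, NOT the full RFC 5646 grammar.
const ROUGH_TAG_PATTERN = /^(?:[a-z]{2,8}|[xi])(?:-[a-z0-9]{1,8})*$/i;

// Hypothetical helper: throws on erroneous input instead of ignoring it.
function validatePriorityList(priorityList) {
  return priorityList.map(function (tag, index) {
    if (typeof tag !== "string") {
      throw new TypeError("Element " + index + " of the priority list is not a string");
    }
    if (!ROUGH_TAG_PATTERN.test(tag)) {
      throw new RangeError("Element " + index + " is not a well-formed language tag: " + tag);
    }
    return tag;
  });
}
```

A caller that passes correct input, such as validatePriorityList(["de-DE", "en"]), needs no try/catch around the call; an input such as ["de_DE"] or [42] fails immediately rather than being silently dropped.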

18. I know there has been a proposal for and discussion of MessageFormat functionality - is there a record of why it got removed from the strawman?


References:

[1] http://wiki.ecmascript.org/doku.php?id=strawman:i18n_api, version 2011-07-01.
[2] http://code.google.com/p/google-caja/wiki/GlobalObjectPoisoning
[3] http://download.oracle.com/javase/6/docs/technotes/guides/intl/overview.html#locale
[4] http://developer.yahoo.com/yui/3/intl/
[5] http://unicode.org/reports/tr35/#Unicode_Language_and_Locale_Identifiers
[6] http://tools.ietf.org/html/rfc4647#section-3.4
[7] http://finance.yahoo.com/currency-converter/?amt=1&from=EUR&to=KRW
[8] http://www.twinsun.com/tz/tz-link.htm
[9] http://tools.ietf.org/html/rfc5646#section-2.2.9

Best regards,
Norbert


