Comments on internationalization API

Norbert Lindenberg ecmascript at norbertlindenberg.com
Thu Jul 21 17:14:29 PDT 2011


Hi Mark,

Thanks for your comments! Replies to some of them below. I also noticed some additional issues:

19. DateTimeFormat.prototype.getMonths needs a second parameter {boolean} standalone, default value false.

20. There needs to be a way to determine the actual language, region, and options of a Collator, NumberFormat, or DateTimeFormat. E.g., if I request ar-MA-u-ca-islamic, did I get exactly what I requested, or ar-MA-u-ca-islamicc, ar-MA-u-ca-gregory, ar-u-ca-gregory, or yet something else?

Best regards,
Norbert


On Jul 20, 2011, at 9:46 , Mark Davis ☕ wrote:

> I have comments on some of these.
> 
> Mark
> — Il meglio è l’inimico del bene —
> 
> 
> On Tue, Jul 19, 2011 at 01:29, Norbert Lindenberg <ecmascript at norbertlindenberg.com> wrote:
>> Hi all,
>> 
>> I'm sorry for not having been able to contribute to the internationalization API earlier. I have finally reviewed the straw man [1], and am pleased to see that it contains a good subset of internationalization functionality to start with. Number and date formatting and collation are issues that most applications have to deal with. Collation especially, but also date formatting with support for multiple time zones and calendars, is hard to implement as a downloadable library.
>> 
>> I have some comments on the details though:
>> 
>> 1. In the background section, it might be useful to add that with Node.js server-side JavaScript is seeing a rebound, and applications don't really want to have to call out to a non-JavaScript server in order to handle basic internationalization.
>> 
>> 2. In the goals section, I'd qualify the "reuse of objects" goal as a reuse of implementation data structures, or even better replace it with measurable performance goals. Reuse of objects that are visible to applications has security and privacy implications, especially when loading third party code (apps or ads) onto pages [2]. I'd recommend letting applications freely construct Collator, NumberFormat, and DateTimeFormat objects, but have these objects share implementation objects (such as ICU objects) as much as possible. If the API does return shared objects, the security issues need to be dealt with, e.g., by specifying that the shared objects are immutable.
> 
> I think it is reasonable to rephrase this as "implementation data structures".
> 
>> 3. I'm very uncomfortable with the LocaleInfo class. It seems to pretend to be the central source of all locale-related information, but can't live up to that claim because its design is limited to number and date formatting and collation. Developers will need to create other functionality such as text segmentation, spelling checking, message lookup, shoe size conversion, etc. LocaleInfo appears to perform some magic to derive regions, currencies, and possibly time zones, but doesn't specify it, and makes none of it available to other internationalization classes. It also does duty as a namespace, which looks odd in an EcmaScript standard that otherwise doesn't know namespaces.
> 
> I don't think it is ideal; I share some of your qualms about it. However, it is what we were able to compromise on. Because the LocaleInfo class does do the resolution, and that information is available after creation, the information is available for other services. And people could (being ES) hang services off of their own LocaleInfo class.

So is this the current recommendation? A library that provides word break and line break functionality should be based on a class MyLocaleInfo, which provides WordBreak and LineBreak classes whose constructors clients should not call, and wordBreak and lineBreak functions that return objects of these classes. An application that uses multiple such libraries (providing different sets of internationalized functionality) has to create objects of all their LocaleInfo classes so that it can request objects of the classes that it actually needs.

What value do these LocaleInfo classes add, compared to having constructors of the actually needed classes that can be called directly?

Also, the LocaleInfo API, as currently documented, doesn't provide any information that a third party internationalization library could use. Some comments sound like there should be a property "options", but this property and the derivation of its values aren't actually documented.

>> Other internationalization libraries have a core that anybody can build on to create internationalization functionality. In Java, for example, the Locale and Currency classes handle a variety of identifier mappings, while the ResourceBundle class handles loading of localized data with fallbacks [3]. In the Yahoo User Interface library, the Intl module does language negotiation and collaborates with the YUI loader in loading localized data [4]. I'd suggest separating similar functionality in LocaleInfo from the formatting and collation functionality and making it available to all. I suspect though that some of the current magic will turn out to be misguided when looked at in the clear light of a specification and will need to be discarded.
>> 
>> 4. Language IDs in the library should be those of BCP 47, not of Unicode LDML. The two are similar, but there are subtle differences, as described in the LDML spec: LDML excludes some BCP 47 tags and subtags, adds a separator and the root locale, and changes the semantics of some tags [5]. Since BCP 47 is the dominant standard for language identification, internationalized applications have to support it. If an implementation of the internationalization API is based on LDML, it should handle the mapping from/to BCP 47 itself rather than burdening applications with it.
> 
> Every LDML language ID is also a BCP 47 language tag. LDML eliminates some of the deadwood in BCP47 (the old irregular forms) but has the same expressive power and somewhat more. There are some codes that are not defined in BCP47 that turn out to be very important for implementations, like the Unknown region.
> 
> I'm well familiar with both, being an author of each.

I don't like to argue with the author of these specs, but their actual content doesn't seem to fully agree with what you say. If I read UTR 35 section 3 correctly, "de_DE" is a valid LDML language ID; if I read RFC 5646 section 2.1 correctly, it is not a BCP 47 language tag.

But the real issue is that referencing LDML in this API while all other specs that application developers work with reference BCP 47 means that application developers have to read both specs, figure out the differences between them, decide which of these differences matter to their applications, and then implement the necessary compatibility mechanisms. I think that's a major and totally unnecessary hurdle to adoption of this API. These details are much better dealt with underneath the API.
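
To illustrate the kind of mapping that could live underneath the API, here is a minimal sketch. The function name ldmlToBcp47 is hypothetical and not part of the straw man, and the sketch covers only the two syntactic differences mentioned above (the "_" separator and the root locale, which corresponds to "und" in BCP 47); it does not attempt the excluded tags or the changed semantics.

```javascript
// Hypothetical helper, not part of any proposed API: converts an LDML-style
// locale ID into BCP 47 syntax. Handles only the separator and "root".
function ldmlToBcp47(localeID) {
  // LDML accepts "_" as well as "-" between subtags; BCP 47 only "-".
  var subtags = localeID.split(/[-_]/);
  // The LDML root locale has no BCP 47 spelling of its own; "und" is used.
  if (subtags[0].toLowerCase() === "root") {
    subtags[0] = "und";
  }
  return subtags.join("-");
}
```

With this, ldmlToBcp47("de_DE") yields "de-DE", while a tag that is already in BCP 47 syntax passes through unchanged.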

>> 5. The specification mentions that a few Unicode extensions in BCP 47 (-u-ca-, -u-co-) can be used for specific purposes, but is silent on whether other extensions are encouraged/allowed/ignored/illegal. This should be clarified.
> 
> Agreed. What it should add is one line saying that the implementation of any other BCP47 extensions is implementation dependent.

No, I think this needs to be decided and documented for each extension separately. For applications, there is a big difference between extensions that affect the presentation and those that change the meaning of data. For presentation, I want the API to make its best effort in selecting the right presentation for the user, but it's understandable and acceptable that different implementations will differ in their results. For extensions that affect the meaning of data, on the other hand, the behavior of the library must be totally predictable. I think -u-cu- falls into that group, and as discussed under item 9 below, this extension should be ignored or illegal. Similarly, some collator extensions affect which strings compare as equal, and therefore which subset will be selected from a set of strings - applications may want to have control over this. And I'd rather see -u-tz- ignored and replaced with a separate time zone specification - see item 12 below.
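
A per-extension policy could be sketched roughly as follows. Both function names are hypothetical, and the list of ignored keys passed in is just an example of the policy argued for here, not something the straw man defines. The parser is deliberately simple: it assumes a well-formed tag and skips attributes that precede the first key.

```javascript
// Hypothetical helper: separates the keywords of the Unicode ("-u-")
// extension from the rest of a language tag.
function splitUnicodeExtension(tag) {
  var subtags = tag.toLowerCase().split("-");
  var rest = [];      // subtags outside the -u- extension
  var keywords = {};  // key -> type, e.g. { ca: "islamic" }
  var i = 0;
  // Copy subtags up to the "u" singleton; do not look inside private use.
  while (i < subtags.length && subtags[i] !== "u" && subtags[i] !== "x") {
    rest.push(subtags[i]);
    i++;
  }
  if (i < subtags.length && subtags[i] === "u") {
    i++; // skip the singleton itself
    var key = null;
    // The extension ends at the next one-character subtag.
    while (i < subtags.length && subtags[i].length > 1) {
      if (subtags[i].length === 2) {
        key = subtags[i];          // two-character subtags are keys
        keywords[key] = "";
      } else if (key !== null) {
        // longer subtags belong to the type of the current key
        keywords[key] += (keywords[key] ? "-" : "") + subtags[i];
      }
      i++;
    }
  }
  // Anything that follows (other extensions, private use) stays with the tag.
  while (i < subtags.length) {
    rest.push(subtags[i]);
    i++;
  }
  return { rest: rest.join("-"), keywords: keywords };
}

// Hypothetical policy function: returns only the keywords that the
// library is willing to honor.
function honoredKeywords(tag, ignoredKeys) {
  var all = splitUnicodeExtension(tag).keywords;
  var honored = {};
  for (var key in all) {
    if (ignoredKeys.indexOf(key) === -1) {
      honored[key] = all[key];
    }
  }
  return honored;
}
```

For example, honoredKeywords("ar-MA-u-ca-islamic-cu-krw", ["cu", "tz"]) keeps the calendar request but drops the currency.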

>> 6. Region IDs should be those of ISO 3166. The straw man references "LDML region subtags" instead; I haven't been able to find a definition of this term.
> 
> No. ISO 3166 IDs are notoriously badly managed; they cavalierly reuse codes for different countries over time. That is one of the reasons why BCP47 had to put in place a registry and mechanism for dealing with the instabilities introduced by ISO. The LDML region subtags should be more properly phrased as "unicode_region_subtag". They are based on BCP47 but add (at the time of this writing) 2 codes.

OK, how about saying "Region IDs are country codes following the rules established in RFC 5646, section 2.2.4, rules 2 and 4.C. This means that as of July 2011, they match ISO 3166-1 alpha-2 country codes, but will use UN M.49 numeric codes for new countries or areas instead if and where ISO 3166-1 reassigns formerly used codes to such countries or areas."?

>> If "ZZ" is really necessary for the API, then it should be called out directly in the API spec. But what information does "ZZ" convey that EcmaScript's "undefined" doesn't?
> 
> You can't write (de-undefined) as a valid language subtag / code.

But in that case I can just write "de", no?

>> 7. The priority list matching algorithm is not well specified. It doesn't seem to match the BCP 47 Lookup algorithm however [6], and I'd expect that algorithm to be available at least as a baseline (enhancements might be offered as well).
> 
> That algorithm is not particularly good. It could be mentioned as one of the possible algorithms, however.

I agree it's not particularly good, but it's relatively easy to understand and can be the starting point for better ones. In any case, clear specifications are required so that application developers know what they can expect.
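
As a baseline, the Lookup scheme of RFC 4647 section 3.4 can be sketched in a few lines. The function name and parameters are hypothetical; the sketch assumes well-formed input and compares tags case-insensitively. Each language range is truncated from the end, one subtag at a time, and a single-character subtag left dangling at the end is removed together with the subtag that followed it.

```javascript
// Hypothetical sketch of RFC 4647 "Lookup": returns the first available
// tag matched by the priority list, or defaultTag if nothing matches.
function lookupLanguageTag(priorityList, availableTags, defaultTag) {
  var available = {};
  for (var i = 0; i < availableTags.length; i++) {
    available[availableTags[i].toLowerCase()] = availableTags[i];
  }
  for (var j = 0; j < priorityList.length; j++) {
    var range = priorityList[j].toLowerCase();
    if (range === "*") {
      continue; // the wildcard range is skipped in lookup
    }
    while (range.length > 0) {
      if (Object.prototype.hasOwnProperty.call(available, range)) {
        return available[range];
      }
      var cut = range.lastIndexOf("-");
      if (cut === -1) {
        break; // nothing left to truncate; try the next range
      }
      range = range.substring(0, cut);
      // Remove a singleton (such as "u" or "x") left at the end.
      if (range.length >= 2 && range.charAt(range.length - 2) === "-") {
        range = range.substring(0, range.length - 2);
      }
    }
  }
  return defaultTag;
}
```

A request for "ar-MA-u-ca-islamic" against the available tags "ar" and "en" is tried as "ar-ma-u-ca-islamic", "ar-ma-u-ca", "ar-ma", and finally "ar", which matches.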

>> 8. The specifications of NumberFormat and DateTimeFormat list several optional features: Support for scientific notation in NumberFormat; support for various styles and skeletons in DateTimeFormat. How can applications find out which of these optional features are supported by an actual implementation?
> 
> I don't think there is a mechanism currently. It is a 'best effort'.

There should be a well-defined mechanism, so that developers can find out where they can rely on the implementation of the API and where they have to roll their own implementation.

>> 9. Currency formatting should require applications to explicitly specify the currency, using an ISO 4217 currency code, when constructing a currency number format. Currencies are really part of the value; they're not a presentation preference. Imagine a European e-commerce site calculating its prices in euro, but then displaying the values with the Korean won symbol just because the user configured his browser to send "Accept-Language: de-DE-u-cu-KRW" or "Accept-Language: de-KR"... [7].
> 
> No argument there. However, applications also want to be able to access the default currency for a given country. We tossed around different ideas for doing that, and came up with the current mechanism.

How about:
/**
 * Returns the ISO 4217 currency code of the default currency for
 * the country or territory identified by the given region
 * ID. Returns undefined if the region ID is not for a
 * currently existing country or territory, or if the
 * country or territory does not have a default currency.
 * Throws an error if no argument is provided, if the first
 * argument is not a string, or if the string is not a
 * well-formed region ID.
 */
LocaleInfo.prototype.defaultCurrencyFor(regionID)
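
A minimal sketch of the behavior described in this comment, written as a standalone function because LocaleInfo itself is not available here. The data table is a tiny illustrative excerpt, and the choice of TypeError and RangeError is an assumption; the comment above only says that an error is thrown.

```javascript
// Illustrative excerpt only; a real implementation would need complete,
// maintained data for all regions.
var DEFAULT_CURRENCIES = {
  US: "USD",
  DE: "EUR",
  KR: "KRW",
  JP: "JPY",
  MA: "MAD"
};

// Standalone stand-in for the proposed LocaleInfo.prototype.defaultCurrencyFor.
function defaultCurrencyFor(regionID) {
  if (arguments.length === 0) {
    throw new TypeError("A region ID is required.");
  }
  if (typeof regionID !== "string") {
    throw new TypeError("The region ID must be a string.");
  }
  // Well-formed region IDs are two letters or three digits (RFC 5646).
  if (!/^(?:[A-Za-z]{2}|[0-9]{3})$/.test(regionID)) {
    throw new RangeError("Not a well-formed region ID: " + regionID);
  }
  var key = regionID.toUpperCase();
  // Regions without an entry (for example "AQ") yield undefined.
  return Object.prototype.hasOwnProperty.call(DEFAULT_CURRENCIES, key)
    ? DEFAULT_CURRENCIES[key]
    : undefined;
}
```

An application would then pass the result explicitly when it constructs a currency format, so the currency never depends on the user's language tag.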

>> 10. Are the limits described for the NumberFormat parameters defaults or hard limits? It doesn't seem to make sense to impose hard limits such as "max 3 fraction digits, min 0".
> 
> That should be clarified. These are defaults, not hard limits.
>  
> 
>> 11. The description of the DateTimeFormat constructors refers to "LocaleInfo.prototype.numberFormat".
>> 
>> 12. DateTimeFormat needs to provide a way for applications to specify the time zone, identified by a tz database identifier [8]. Browser-side code may need this capability to enforce a site-dependent time zone (e.g., a US financial site has to display quotes in New York City time), while server-side code may have to use the user's time zone. While it's possible to encode the time zone as part of a language ID (e.g., "en-AU-u-tz-auldh" to add Australia/Lord_Howe to Australian English), languages and time zones are really orthogonal concepts that should be kept separate, and the tz database identifiers are the most widely used identifiers for time zones.
> 
> I firmly agree. However, the committee was split on how to do this, and decided to do that in a follow-up.

This seems like a deficiency that would seriously limit adoption of the library, especially server-side. What are the issues that the committee couldn't agree on?

>> 13. DateTimeFormat also needs to let applications specify whether and how to include a time zone display name in the output. In CLDR, that's typically tied to the time style - long and full have the time zone, while short and medium don't. In reality, applications need to indicate the time zone to users if (and only if) it's not obvious from the context, and that's orthogonal to whether they want seconds.
> 
> Ditto.
>  
> 
>> 14. There are a few additional DateTimeFormat skeletons that I think would be commonly used in applications:
>> - MMMdEEE, MMMMdEEEE: month, day, weekday in either abbreviated or full width; intended for dates in the current year.
>> - jmm: hour and minute, in 12-hour or 24-hour format as appropriate for the locale.
>> - jjjmmm: hour and minute, and if necessary am/pm, but with the appropriate characters for hour and minute rather than a colon in languages where that's commonly used, such as Chinese/Japanese/Korean: 오후 11시 5분. Falls back to jmm in other languages.
>> - z, zzzz: time zone names.
>> Other notes:
>> - yyyyMMMMd, "era only if necessary": should explain what that means, e.g., "era only for those calendars that need eras in order to uniquely identify all years after 1900".
>> - It must be possible to combine skeletons for date, time, and time zone (at most one each).
> 
> Agreed, but we were just able to agree on a core set. Others could be supplied, but the result would be a 'best-effort' according to the implementation.

If supporting all desirable skeletons is too much, would it be possible to make the result of the underlying calendar and time zone calculations available so that third parties can implement the formats they (or their customers) need? Formatting is actually not that hard; a number of libraries and applications have implemented it, but usually they rely on Date.prototype.get[FullYear|Month|Date|Hours|Minutes|Seconds] and so are tied to the Gregorian calendar and the runtime's default time zone. The following function might help:

/**
 * Returns date and time components for the given Date
 * object based on this format's time zone and calendar.
 * @param {Date} date the date to be interpreted
 * @return {Object} object with the following properties:
 *    - era: integer, the era in this format's calendar;
 *      can be used to index into the array returned by getEras
 *    - needsEra: boolean, whether the calendar used by this
 *      format has had more than one era since the Gregorian
 *      1900-01-01 and therefore needs an era indicator to
 *      disambiguate recent years
 *    - year: integer, the year within the era
 *    - month: integer, 0-based, the month within the year;
 *      can be used to index into the array returned by getMonths
 *    - date: integer, 1-based, the day within the month
 *    - weekday: integer, 0-based, the day of the week; can be
 *      used to index into the array returned by getWeekdays
 *    - hours: integer, 0-based, the hour within the day
 *    - minutes: integer, 0-based, the minutes within the hour
 *    - seconds: integer, 0-based, the seconds within the minute
 *    - milliseconds: integer, 0-based, the milliseconds within
 *      the second
 *    - inDST: boolean, whether the given time is within daylight
 *      saving time
 */
DateTimeFormat.prototype.localTime(date)
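
To show the shape of the returned object, here is a sketch restricted to the simplest case: the Gregorian calendar and a fixed UTC offset. The function name and the utcOffsetMinutes parameter are hypothetical; a real implementation would take calendar and time zone from the format object and would need time zone rules to report inDST. The weekday numbering assumes that getWeekdays starts with Sunday.

```javascript
// Hypothetical stand-in for DateTimeFormat.prototype.localTime, limited to
// the Gregorian calendar and a fixed offset from UTC (in minutes).
function localTimeGregorian(date, utcOffsetMinutes) {
  if (!(date instanceof Date) || isNaN(date.getTime())) {
    throw new TypeError("A valid Date object is required.");
  }
  // Shift the instant by the offset, then read fields with the UTC getters,
  // which avoids any dependency on the runtime's default time zone.
  var shifted = new Date(date.getTime() + utcOffsetMinutes * 60000);
  var astronomicalYear = shifted.getUTCFullYear(); // 0 stands for 1 BC
  return {
    era: astronomicalYear > 0 ? 1 : 0, // index into an assumed [BC, AD] list
    needsEra: false,                   // only one Gregorian era since 1900
    year: astronomicalYear > 0 ? astronomicalYear : 1 - astronomicalYear,
    month: shifted.getUTCMonth(),
    date: shifted.getUTCDate(),
    weekday: shifted.getUTCDay(),      // 0 is Sunday (assumption, see above)
    hours: shifted.getUTCHours(),
    minutes: shifted.getUTCMinutes(),
    seconds: shifted.getUTCSeconds(),
    milliseconds: shifted.getUTCMilliseconds(),
    inDST: false                       // a fixed offset carries no DST rules
  };
}
```

A third-party formatter could build its output from these fields instead of calling Date.prototype.getHours and friends.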

>> 15. It seems that the correct handling of missing dateStyle or timeStyle parameters would be to omit the date or time from the formatted output.
> 
> I agree, I think we should fix that.
>  
> 
>> 16. DateTimeFormat.prototype.getAmPm is described as "array of eras". Beyond that typo, is this function really useful, given that many locales don't have am/pm strings, and LDML has deprecated the corresponding element?
> 
> am/pm is still used in LDML; there is just an alternate element that is preferred (dayPeriods). However, I think the result should be a map, eg
> var am = x.getAmPm()["am"]
> 
> 
>> 17. Error handling needs to be specified in detail. I assume this will be done once the functionality is settled, so I won't go into much detail now. However, contrary to the current statement "invalid language ids or non-string elements should be ignored" (in priority lists), I think the library should throw errors for erroneous input. Language tags should at least be String objects and well-formed according to BCP 47 [9]. Similarly, an exception should be thrown if some value other than a Date object is passed into DateTimeFormat.prototype.format. Note that exceptions in EcmaScript do not oblige the direct caller to use try/catch - they're like unchecked exceptions in Java.
> 
> The group debated how to handle exceptions; there are pluses and minuses to using a 'best-effort' approach vs throwing an exception. The feeling I got was that people are generally less in favor of exceptions if there can be a graceful recovery.
>  
> 
>> 18. I know there has been a proposal for and discussion of MessageFormat functionality - is there a record of why it got removed from the strawman?
> 
> Again, there was not agreement, and so we postponed it.

What were the issues?

>> References:
>> 
>> [1] http://wiki.ecmascript.org/doku.php?id=strawman:i18n_api, version 2011-07-01.
>> [2] http://code.google.com/p/google-caja/wiki/GlobalObjectPoisoning
>> [3] http://download.oracle.com/javase/6/docs/technotes/guides/intl/overview.html#locale
>> [4] http://developer.yahoo.com/yui/3/intl/
>> [5] http://unicode.org/reports/tr35/#Unicode_Language_and_Locale_Identifiers
>> [6] http://tools.ietf.org/html/rfc4647#section-3.4
>> [7] http://finance.yahoo.com/currency-converter/?amt=1&from=EUR&to=KRW
>> [8] http://www.twinsun.com/tz/tz-link.htm
>> [9] http://tools.ietf.org/html/rfc5646#section-2.2.9
>> 
>> Best regards,
>> Norbert


