Identifying ECMAScript identifiers
ecmascript at lindenbergsoftware.com
Mon Mar 11 18:45:09 PDT 2013
I added these functions to String because that seems the best place for them in the current arrangement. I'm aware of the proposal to modularize the standard library and can well imagine that these functions will find a better home in that new scheme.
The other character classification scheme I'm looking into is based on Unicode character properties. The reasons why I separated out this proposal are:
- Tools operating on ECMAScript source code need to be aware of the ECMAScript version they use, for syntax, semantics, keywords, and, well, the characters allowed in identifiers. Some tools let their clients specify an ECMAScript version (e.g., "es5" in JSLint and JSHint), others may assume a fixed version. The characters in turn are tied to both Unicode versions and ECMAScript versions - for example, SpiderMonkey currently supports Unicode 6.2 characters, but restricted to the BMP because it hasn't been upgraded to ES6 identifiers yet.
- For Unicode character properties, on the other hand, clients generally need only the properties as of the latest known version, and in the few exceptions that I know of (such as the 2003 version of IDNA) only specific Unicode versions are needed. Requiring that a general API for Unicode character properties provide access to Unicode version-specific information would create a huge burden on implementors, but benefit no one.
- It's difficult for tool developers to determine the correct set of characters to include as identifier characters. One particular difficulty is that the Unicode general category of a character can change in rare cases, so a character can move into or out of the categories that the ES3/ES5 specifications reference. For compatibility, characters shouldn't move out of the set of characters allowed for identifiers. (It turns out that browsers also get this wrong - all of them.) ES6 solves this problem by basing its identifier definition on Unicode Standard Annex 31, Unicode Identifier and Pattern Syntax, which defines special sets of characters Other_ID_Start and Other_ID_Continue and treats these characters as identifier characters even though their current general categories no longer qualify them as such.
- For general Unicode processing, I think it's important to have support in regular expressions, because that's what many applications use for text processing. For tools operating on ECMAScript source code that seems less important, based on the data I collected.
So, rather than having one grand unified character classification API with support for both Unicode versions and regular expressions I think it's better to provide tailored APIs for different purposes.
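To make the ES6-style definition concrete, here is a minimal sketch of an identifier check built on the Unicode ID_Start and ID_Continue properties, which already fold in Other_ID_Start and Other_ID_Continue. The function names are hypothetical and are not the ones from the strawman. The sketch assumes an engine that supports Unicode property escapes in regular expressions (ES2018 or later), and it always uses whatever Unicode version that engine ships, so it does not address the version-specific behavior discussed above. It also ignores \u escape sequences and reserved words.

```javascript
// Hypothetical helpers for illustration only; the names are assumptions.
// Requires Unicode property escapes (/u flag with \p{...}), i.e. ES2018+.

// One code point that may start an identifier: ID_Start, "$" or "_".
const ID_START = /^[\p{ID_Start}$_]$/u;
// One code point that may continue an identifier: ID_Continue, "$",
// ZERO WIDTH NON-JOINER (U+200C) or ZERO WIDTH JOINER (U+200D).
const ID_PART = /^[\p{ID_Continue}$\u200C\u200D]$/u;

function isIdentifierStartSketch(ch) {
  return ID_START.test(ch);
}

function isIdentifierPartSketch(ch) {
  return ID_PART.test(ch);
}

function isIdentifierNameSketch(text) {
  // Array.from iterates by code point, so supplementary characters
  // are tested as single units rather than as surrogate halves.
  const chars = Array.from(text);
  if (chars.length === 0) return false;
  return isIdentifierStartSketch(chars[0]) &&
    chars.slice(1).every(isIdentifierPartSketch);
}

console.log(isIdentifierNameSketch("foo1"));   // true
console.log(isIdentifierNameSketch("1foo"));   // false
console.log(isIdentifierNameSketch("\u2118")); // true (U+2118 is in Other_ID_Start)
```

A tool that has to honor a particular ECMAScript edition or Unicode version would need its own character tables instead of the engine's built-in properties, which is the gap the proposal is aimed at.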
On Mar 9, 2013, at 9:16 , Allen Wirfs-Brock wrote:
> Can you explain why you think these should be functions on String rather than part of a more general character classification facility that might be associated with some more specialized object? The latter approach would seem to have modularity advantages at both the implementation and usage level.
> On Mar 7, 2013, at 11:35 PM, Norbert Lindenberg wrote:
>> ECMAScript is used to implement a variety of tools that check code for conformance with the ECMAScript specification, minimize it, perform other transformations, or generate ECMAScript code. These tools have to be able to recognize ECMAScript identifiers, taking the identifier specification and the underlying Unicode specification into consideration - not quite easy given the ever-growing Unicode character set.
>> While looking at support for Unicode character properties in general, I realized that this use case is shaped differently from others, fundamental to ECMAScript, and amenable to a fairly simple solution, and so there's now a strawman:
>> I'd like to discuss this at next week's TC 39 meeting, but also invite earlier comments.