Question about allowed characters in identifier names

Mathias Bynens mathias at
Sat Aug 24 02:02:54 PDT 2013

On 27 Feb 2012, at 22:58, Allen Wirfs-Brock <allen at> wrote:

> On Feb 26, 2012, at 1:55 AM, Mathias Bynens wrote:
>> For example, U+2F800 CJK COMPATIBILITY IDEOGRAPH-2F800 is a supplementary Unicode character in the [Lo] category, which leads me to believe it should be allowed in identifier names. After all, the spec says:
>> UnicodeLetter = any character in the Unicode categories “Uppercase letter (Lu)”, “Lowercase letter (Ll)”, “Titlecase letter (Lt)”, “Modifier letter (Lm)”, “Other letter (Lo)”, or “Letter number (Nl)”.
>> However, since JavaScript uses UCS-2 internally, this symbol is represented by a surrogate pair, i.e. two code units: `\uD87E\uDC00`.
>> The spec, however, defines “character” as follows:
>> Throughout the rest of this document, the phrase “code unit” and the word “character” will be used to refer to a 16-bit unsigned value used to represent a single 16-bit unit of text. The phrase “Unicode character” will be used to refer to the abstract linguistic or typographical unit represented by a single Unicode scalar value (which may be longer than 16 bits and thus may be represented by more than one code unit). The phrase “code point” refers to such a Unicode scalar value. “Unicode character” only refers to entities represented by single Unicode scalar values: the components of a combining character sequence are still individual “Unicode characters,” even though a user might think of the whole sequence as a single character.
>> So, based on this definition of “character” (code unit), U+2F800 should not be allowed in an identifier name after all.
>> I’m not sure if my interpretation of the spec is correct, though. Could anyone confirm or deny this? Are supplementary (non-BMP) Unicode characters allowed in identifiers or not? For example, is this valid JavaScript or not?
> Yes, this interpretation is consistent with my understanding of the requirements as expressed in the ES5 spec.   ES5 logically only works with UCS-2 characters corresponding to the BMP.
> Some (probably most) implementations pass UTF-16 encodings of supplemental characters to the JavaScript compiler.  According to the spec, these are processed as two UCS-2 characters neither of which would be a member of any of the above character categories.  Their use in an identifier context should result in a syntax error.  Within a string literal, the two UCS-2 characters would generate two string elements.
> This is something that I think can be clarified for the ES6 specification, independent of the on-going discussion of the possibility of 21-bit string elements.  My preference for the future is to simply define the input alphabet of ECMAScript as all Unicode characters independent of actual encoding.

That sounds nice.
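For concreteness (my own illustration of the UCS-2 view described above, not something from Allen's message): U+2F800 surfaces in a string as two 16-bit code units, so both escape forms denote the same two-element string.

```javascript
// Under ES5's UCS-2 model, U+2F800 is stored as a surrogate pair:
// two 16-bit code units, even though it is one Unicode character.
var s = '\uD87E\uDC00';

console.log(s.length);                      // 2 — two code units
console.log(s.charCodeAt(0).toString(16));  // "d87e" (high surrogate)
console.log(s.charCodeAt(1).toString(16));  // "dc00" (low surrogate)

// In an engine that supports the ES6 `\u{…}` escape, the two
// notations compare equal, since both produce the same code units:
// '\uD87E\uDC00' === '\u{2F800}' // true
```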

> var \ud87e\udc00 would probably still be illegal because each \uXXXX defines a separate character, but: var \u{2f800} = 42; should be fine, as should the direct non-escaped occurrence of that character.

Wouldn’t this be confusing, though?

    global['\u{2F800}'] = 42; // would work (compatible with ES5 behavior)
    global['\uD87E\uDC00'] = 42; // would work, too, since `'\uD87E\uDC00' == '\u{2F800}'` (compatible with ES5 behavior)
    var \uD87E\uDC00 = 42; // would fail (compatible with ES5 behavior)
    var \u{2F800} = 42; // would work (as per your comment; incompatible with ES5 behavior)
    var 丽 = 42; // would work (as per your comment; incompatible with ES5 behavior)

Using astral symbols in identifiers would be backwards incompatible, even if the raw (unescaped) symbol is used. There’d be no way to use such an identifier in an ES5 environment. Is this a problem?
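One way to cope with that incompatibility (a hedged sketch of my own, not an established pattern from this thread) is to feature-detect at runtime whether the engine accepts an astral symbol in an identifier, using `eval` so that engines which reject the syntax merely throw instead of failing to parse the whole script:

```javascript
// Sketch: detect whether this engine allows a supplementary (astral)
// character in an identifier name. The identifier itself lives inside
// an eval'd string, so an ES5 engine throws a catchable SyntaxError
// rather than refusing to parse the surrounding code.
function supportsAstralIdentifiers() {
  try {
    eval('var \uD87E\uDC00 = 1;'); // U+2F800 as a raw surrogate pair
    return true;
  } catch (err) {
    return false; // SyntaxError in an ES5-style engine
  }
}
```

Code that must run in both kinds of environment could then fall back to property access (`global['\uD87E\uDC00']`), which works either way since string literals have always accepted surrogate pairs.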
