Question about allowed characters in identifier names

Mathias Bynens mathias at
Sun Feb 26 01:55:06 PST 2012

For example, U+2F800 CJK COMPATIBILITY IDEOGRAPH-2F800 is a supplementary
Unicode character in the [Lo] category, which leads me to believe it should
be allowed in identifier names. After all, the spec says:

> UnicodeLetter = any character in the Unicode categories “Uppercase letter
> (Lu)”, “Lowercase letter (Ll)”, “Titlecase letter (Lt)”, “Modifier letter
> (Lm)”, “Other letter (Lo)”, or “Letter number (Nl)”.

However, since JavaScript engines represent source text and strings as
16-bit code units (UCS-2 or UTF-16), this symbol is represented by a
surrogate pair, i.e. two code units: `\uD87E\uDC00`.
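
To make the arithmetic behind that surrogate pair concrete, here is a minimal sketch of the standard UTF-16 split for a supplementary code point. The helper names `toSurrogatePair` and `toEscape` are hypothetical, made up for this illustration; they are not part of the spec or of any library.

```javascript
// Split a supplementary code point (U+10000..U+10FFFF) into its two
// UTF-16 code units. Hypothetical helper, for illustration only.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('not a supplementary code point');
  }
  var offset = codePoint - 0x10000;     // 20-bit value
  var high = 0xD800 + (offset >> 10);   // top 10 bits
  var low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

// Format one code unit as a \uXXXX escape (hypothetical helper).
function toEscape(codeUnit) {
  return '\\u' + codeUnit.toString(16).toUpperCase();
}

var pair = toSurrogatePair(0x2F800);
console.log(pair.map(toEscape).join('')); // \uD87E\uDC00
```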

But the spec defines “character” as follows:

> Throughout the rest of this document, the phrase “code unit” and the word
> “character” will be used to refer to a 16-bit unsigned value used to
> represent a single 16-bit unit of text. The phrase “Unicode character” will
> be used to refer to the abstract linguistic or typographical unit
> represented by a single Unicode scalar value (which may be longer than 16
> bits and thus may be represented by more than one code unit). The phrase
> “code point” refers to such a Unicode scalar value. “Unicode character”
> only refers to entities represented by single Unicode scalar values: the
> components of a combining character sequence are still individual “Unicode
> characters,” even though a user might think of the whole sequence as a
> single character.

So, based on this definition of “character” (code unit), U+2F800 should not
be allowed in an identifier name after all.
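
The two readings can be contrasted with a small sketch. It assumes an engine that supports Unicode property escapes in regular expressions (`\p{…}` with the `u` flag), a feature that postdates the spec text quoted above, so it only illustrates the distinction and says nothing about what the spec requires. The variable names are mine.

```javascript
// Assumes support for Unicode property escapes (the `u` flag with \p{...}).
var symbol = '\uD87E\uDC00'; // U+2F800 written as its two code units

// The UnicodeLetter categories from the spec quote: Lu, Ll, Lt, Lm, Lo, Nl.
var unicodeLetter = /^[\p{L}\p{Nl}]$/u;

// Reading 1: the whole code point is one "Unicode character".
var asCodePoint = unicodeLetter.test(symbol);

// Reading 2: each 16-bit code unit is a "character" on its own.
// A lone surrogate belongs to category Cs, not to any letter category.
var highUnitIsLetter = unicodeLetter.test(symbol.charAt(0));
var lowUnitIsLetter = unicodeLetter.test(symbol.charAt(1));

console.log(symbol.length); // number of code units
console.log(asCodePoint, highUnitIsLetter, lowUnitIsLetter);
```

Under the code point reading the symbol counts as a letter; under the code unit reading neither half does, which is exactly the ambiguity I'm asking about.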

I’m not sure if my interpretation of the spec is correct, though. Could
anyone confirm or deny this? Are supplementary (non-BMP) Unicode characters
allowed in identifiers or not? For example, is this valid JavaScript or not?

    // Using U+2F800 as an identifier
    var \ud87e\udc00 = 42; \ud87e\udc00
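
One way to probe what a particular engine does with this is a small feature test. The helper name `isAcceptedAsIdentifier` is hypothetical, and the result only tells you how that engine behaves, not what the spec mandates.

```javascript
// Hypothetical helper: ask the engine to parse `name` as a variable name.
// Reports engine behaviour only; it does not settle the spec question.
function isAcceptedAsIdentifier(name) {
  try {
    new Function('var ' + name + ' = 42; return ' + name + ';');
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e; // anything else (e.g. eval being disabled) is a different problem
  }
}

console.log(isAcceptedAsIdentifier('foo'));  // true
console.log(isAcceptedAsIdentifier('1foo')); // false

// Results for the two forms below are engine-dependent.
console.log(isAcceptedAsIdentifier('\uD87E\uDC00'));   // the raw symbol
console.log(isAcceptedAsIdentifier('\\uD87E\\uDC00')); // the escape sequence form
```

Testing the raw symbol and the escaped form separately matters, since an engine may well treat them differently.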
