New ES6 spec draft

Gillam, Richard gillam at
Tue Jul 10 16:50:37 PDT 2012


A few comments on the i18n/Unicode-related stuff in the latest draft:

- p. 1, §2: It seems a little weird here to be specifying a particular version of the Unicode standard but not of ISO 10646.  Down in section 3, you _do_ nail down the version of 10646 and it's long, so I can see why you don't want all this verbiage in section 2 as well, but maybe you want more than you have?

- p. 14, §6: A few typos: "The phrase 'Unicode character referS to…"  The S is missing.  "Each source character being an abstract Unicode characterS…" The S is unnecessary.

- p. 14, §6: More substantively, do you really need to go into this level of detail as to what a "Unicode character" is?  I would think you could say something like "ECMAScript source text is a sequence of Unicode abstract code point values (or, in this spec, "Unicode characters").  The actual representation of those characters in bits (e.g., UTF-16 or UTF-32 or even a non-Unicode encoding) is implementation-dependent, but a conforming implementation must process source text as if it were an equivalent sequence of SourceCharacter values."  I think that for the purposes of this spec, how "Unicode code point" maps to a normal human's idea of "character" is irrelevant; you can define "character" to mean the same thing as Unicode means when it says "code point" and be done with it.  (This probably means you can either get rid of the next paragraph, or at least that that paragraph is entirely informative.)

- p. 14, §6, ¶3: "Within other contexts, such an escape sequence contextually contributes one Unicode character."  This reads kind of funny to me-- I didn't know what "contextually" meant.  Before, you had something like "Within a string literal, an escape sequence contributes one Unicode character to the value of the literal", and I suspect "contextually" was intended to mean that the escape sequence contributes one character in whatever way is appropriate for the context.  I wonder if it would be better to say something like "In all other contexts, an escape sequence is treated identically to the Unicode character with the specified Unicode scalar value" or something like that.  (That probably needs some wordsmithing, though, since \u000a would be equivalent to \n most of the time, not to a literal line-feed character, as you point out in the next paragraph.)

- p. 14, §6, ¶3: Should there be a pointer here to the actual definition of a Unicode escape sequence?

- p. 14, §6: I suppose it doesn't hurt to explain how to map an abstract Unicode value to UTF-16 here, but couldn't you just point to the definition of UTF-16 in the Unicode standard?
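For reference, the mapping in question is small. The following is a minimal sketch of the standard UTF-16 encoding arithmetic; the helper name `utf16Encode` is hypothetical and does not come from the draft.

```javascript
// Hypothetical helper (not from the draft): encode one Unicode code point
// as an array of UTF-16 code units.
function utf16Encode(cp) {
  if (!Number.isInteger(cp) || cp < 0 || cp > 0x10FFFF) {
    throw new RangeError("not a Unicode code point: " + cp);
  }
  if (cp <= 0xFFFF) {
    return [cp]; // BMP values (lone surrogates included) are one code unit
  }
  const offset = cp - 0x10000;             // 20-bit value
  const lead = 0xD800 + (offset >> 10);    // high 10 bits
  const trail = 0xDC00 + (offset & 0x3FF); // low 10 bits
  return [lead, trail];
}

const hex = (units) => units.map((u) => u.toString(16)).join(" ");
console.log(hex(utf16Encode(0x41)));    // 41
console.log(hex(utf16Encode(0x1F600))); // d83d de00
```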

- p. 19, §7.6: I tend to agree with your comment here-- since this was nailed to Unicode 3.0 before, it seems better to stick with that when we're talking about "portability" (although a note explaining why it's not Unicode 5.1 might be helpful).

- p. 19, §7.6: "Return the String value consisting of the sequence of code units…" Do you want to say "UTF-16 code units" here and in other places where this occurs, and maybe define that up front?  Each of the Unicode encoding forms (UTF-8 and UTF-32 as well as UTF-16) has its own definition of "code unit."
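To illustrate why the unqualified term is ambiguous, here is a small sketch that counts the code units of one supplementary-plane character in each encoding form. It assumes a current engine in which `TextEncoder` is available as a global (as in Node.js and browsers) and string iteration proceeds by code point; neither is part of the draft under review.

```javascript
// One code point outside the BMP.
const s = String.fromCodePoint(0x10000);

const utf8Units = new TextEncoder().encode(s).length; // 8-bit code units
const utf16Units = s.length;                          // 16-bit code units
const utf32Units = Array.from(s).length;              // one unit per code point

console.log(utf8Units, utf16Units, utf32Units); // 4 2 1
```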

- p. 24, §7.8.4: In earlier versions of ECMAScript, I could often specify a supplementary-plane character by using two Unicode escape sequences in a row, each representing a surrogate code unit value.  Can I still do that?  It seems like you'd have to support this for backward compatibility, but you're not really supposed to see bare surrogates in any context except for UTF-16 (I don't think they're strictly illegal, except in UTF-8, but the code point sequence <D800 DC00> isn't equivalent to U+10000, either).  I think you want some verbiage here clarifying how this is supposed to work.

- p. 25, "Early Errors": You say it's a Syntax Error if you specify a \u{} escape sequence with a value greater than 10FFFF.  Should \u{} escape sequences with values corresponding to the surrogate range also produce syntax errors?  If not, is \u{d800}\u{dc00} equivalent to \ud800\udc00 (which I presume is equivalent to \u{10000})?
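The two escape questions above can be made concrete with a short test. The sketch below records how a current ES2015+ engine behaves; it is offered only as an illustration of the cases being asked about and does not settle what the Rev9 text specified. The helper name `parsesOk` is hypothetical.

```javascript
// Three spellings of what may or may not be the same string value.
const pairOfOldEscapes = "\ud800\udc00";
const pairOfNewEscapes = "\u{d800}\u{dc00}";
const singleEscape = "\u{10000}";

console.log(pairOfOldEscapes === singleEscape); // true
console.log(pairOfNewEscapes === singleEscape); // true

// An out-of-range escape is rejected when the source is parsed, so the
// literal has to be parsed separately in order to observe the error.
function parsesOk(source) {
  try {
    new Function(source); // parses the body without running it
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e;
  }
}

console.log(parsesOk('return "\\u{10FFFF}";')); // true
console.log(parsesOk('return "\\u{110000}";')); // false
```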

- p. 210, § I like the idea of introducing fromCodeUnit(), making this function an alias of that one, and marking this function as obsolete.  But I'm also wondering if it would make more sense for this function to be called fromCodeUnits(), since you can specify a whole list of code units, and they all contribute to the string.

- p. 210, § Same thing: Maybe call this fromCodePoints()?  [Note also you have a copy-and-paste problem on the first line, where it still says "String.fromCharCode()".]
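As a quick illustration of why the plural names might fit, both functions accept a whole list of values. This sketch assumes an engine that already implements `String.fromCodePoint` as added in this draft.

```javascript
// Two UTF-16 code units versus one code point for the same character.
const fromUnits = String.fromCharCode(0xD83D, 0xDE00);
const fromPoint = String.fromCodePoint(0x1F600);
console.log(fromUnits === fromPoint); // true

// Every argument contributes to the result.
const mixed = String.fromCodePoint(0x48, 0x69, 0x1F600);
console.log(mixed.length); // 4 (two BMP characters plus one surrogate pair)

// fromCharCode keeps only the low 16 bits of each argument.
console.log(String.fromCharCode(0x1F600) === "\uF600"); // true
```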

- p. 212, § I like the idea of adding a new name for this function, but I'm thinking maybe codeUnitAt().  Or do what Java did (IIRC): Add a new function called char32At(), which behaves like this one, except that if the index you give it points to one half of a surrogate pair, you return the code point value represented by the surrogate pair.  (If you don't do some sort of char32At() function, you're probably going to need a function that takes a sequence of UTF-16 code unit values and returns a sequence of Unicode code point values.)

- p. 212, § Same comments as above.  I think you're either going to need to add a charCode32At() or a function that converts sequences of code units to sequences of code points.  You might also need to add some kind of function to facilitate iterating through a string by code point.
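A rough sketch of the two facilities suggested above, written against the existing charCodeAt(). The names `char32At` and `toCodePoints` are hypothetical, and the behavior shown (returning the full code point when the index lands on either half of a pair) is the proposal in these comments, not anything in the draft.

```javascript
// Hypothetical: code point at `index`, where either half of a
// well-formed surrogate pair yields the pair's code point.
function char32At(str, index) {
  const unit = str.charCodeAt(index);
  if (Number.isNaN(unit)) return undefined; // index out of range
  const isLead = unit >= 0xD800 && unit <= 0xDBFF;
  const isTrail = unit >= 0xDC00 && unit <= 0xDFFF;
  if (isLead && index + 1 < str.length) {
    const next = str.charCodeAt(index + 1);
    if (next >= 0xDC00 && next <= 0xDFFF) {
      return (unit - 0xD800) * 0x400 + (next - 0xDC00) + 0x10000;
    }
  }
  if (isTrail && index > 0) {
    const prev = str.charCodeAt(index - 1);
    if (prev >= 0xD800 && prev <= 0xDBFF) {
      return (prev - 0xD800) * 0x400 + (unit - 0xDC00) + 0x10000;
    }
  }
  return unit; // BMP character or unpaired surrogate
}

// Hypothetical: convert a sequence of code units to a sequence of code points.
function toCodePoints(str) {
  const points = [];
  for (let i = 0; i < str.length; ) {
    const cp = char32At(str, i);
    points.push(cp);
    i += cp > 0xFFFF ? 2 : 1; // skip the trail unit of a pair
  }
  return points;
}

const sample = "a\uD83D\uDE00b";
console.log(char32At(sample, 1) === char32At(sample, 2)); // true
console.log(toCodePoints(sample).length);                 // 3
```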

- pp. 212-213, § Should you say "code units" instead of "string elements"?

- p. 220, §§ and Maybe this is a question for Norbert: Are we allowing somewhere for versions of toLocaleUpperCase() and toLocaleLowerCase() that let you specify the locale as a parameter instead of just using the host environment's default locale?

- p. 223, § First, did something go haywire with the numbering here?  Second, this sort of addresses my comment above, but if you can't put this and charCodeAt() (or whatever we called it) together in the spec, can you include a pointer in charCodeAt()'s description to here?  Third, it looks like this only works right with surrogate pairs if you specify the position of the first surrogate in the pair.  I think you want it to work right if you specify the position of either element in the pair.  (I think you may have a typo in step 11 as well: shouldn't that be "…or second > 0xDFFF"?)
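To show the first-surrogate-only behavior and the range check on the second code unit, here is a sketch of the lookup, assuming the draft's steps follow the usual surrogate-pair decoding. The function name `codePointAtSketch` is hypothetical, and the correspondence between these lines and the draft's numbered steps is an assumption.

```javascript
// Hypothetical stand-in for the draft algorithm, for illustration only.
function codePointAtSketch(str, pos) {
  const size = str.length;
  if (pos < 0 || pos >= size) return undefined;
  const first = str.charCodeAt(pos);
  // Not a lead surrogate, or nothing follows it: return the code unit.
  if (first < 0xD800 || first > 0xDBFF || pos + 1 === size) return first;
  const second = str.charCodeAt(pos + 1);
  // The check in question: `second` must lie in DC00..DFFF.
  if (second < 0xDC00 || second > 0xDFFF) return first;
  return (first - 0xD800) * 1024 + (second - 0xDC00) + 0x10000;
}

const text = "x\uD83D\uDE00";
console.log(codePointAtSketch(text, 1).toString(16)); // 1f600
console.log(codePointAtSketch(text, 2).toString(16)); // de00
```

Asking for the position of the second surrogate returns only that code unit, which is the behavior the third point above suggests changing.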

- pp. 223-224, § How come you changed "characters" to "elements" in some spots and "code units" in others?  Is there a difference?  (I'm seeing this in some of the number-formatting stuff too.)

Thanks a lot…

--Rich Gillam

On Jul 8, 2012, at 6:22 PM, Allen Wirfs-Brock wrote:

Rev9 (July 8, 2012) of the ES6 Draft Specification is now available at

Changes in this version include:

Quasi literal added to specification
Initial work at defining tail call semantics (still need to define tail positions in 13.7)
Initial pass at replacing native/host object terminology with ordinary/exotic objects
Clause 6 and others updated to clarify processing of full Unicode source code. Revised usage of “code unit” and “code point”
Specification of Identifiers updated to use current Unicode specification devices
\u{nnnnnn} Unicode code point escapes added
UTF-16 encoding for non-BMP characters in string literals now fully specified
Added functions: String.fromCodePoint, String.raw (a quasi tag function), String.prototype.codePointAt
ECMAScript now requires use of Unicode 5.1.0, normative references updated
A syntactic grammar notation was added for indicating when alternative lexical goals are required
Fixed ES5 missing explicitly setting length in several array functions
Fixed bugs: 368, 388-399, 402-405, 410-413, 415-416, 418, 420-428, 430-439, 445-456, 458-461 (thanks very much for all the bug reports)

Please report bugs you find at<>.
