New ES6 spec draft

Allen Wirfs-Brock allen at wirfs-brock.com
Tue Jul 10 20:53:29 PDT 2012


On Jul 10, 2012, at 7:50 PM, Gillam, Richard wrote:

> Allen--
> 
> A few comments on the i18n/Unicode-related stuff in the latest draft:
> 
> - p. 1, §2: It seems a little weird here to be specifying a particular version of the Unicode standard but not of ISO 10646.  Down in section 3, you _do_ nail down the version of 10646 and it's long, so I can see why you don't want all this verbiage in section 2 as well, but maybe you want more than you have?

I'm not sure.  This was Norbert's recommendation.  I liked what we did in ES5, where we specified version 3.0 as the minimum version for which there was guaranteed interoperability but allowed use of more recent versions.  Input on what would make the most sense is appreciated.



> 
> - p. 14, §6: A few typos: "The phrase 'Unicode character referS to…"  The S is missing.  "Each source character being an abstract Unicode characterS…" The S is unnecessary.

OK.  It would be good to report this sort of bug at bugs.ecmascript.org just to make sure it doesn't get lost here.
> 
> - p. 14 §6: More substantively, do you really need to go into this level of detail as to what a "Unicode character" is?  I would think you could say something like "ECMAScript source text is a sequence of Unicode abstract code point values (or, in this spec, "Unicode characters").  The actual representation of those characters in bits (e.g., UTF-16 or UTF-32 or even a non-Unicode encoding) is implementation-dependent, but a conforming implementation must process source text as if it were an equivalent sequence of SourceCharacter values."  I think that for the purposes of this spec, how "Unicode code point" maps to a normal human's idea of "character" is irrelevant; you can define "character" to mean the same thing as Unicode means when it says "code point" and be done with it.  (This probably means you can either get rid of the next paragraph, or at least note that that paragraph is entirely informative.)

First, I have to say that there will probably be some controversy about this section in TC39.  Norbert's proposal was that we specify SourceCharacter as always being UTF-16 encoded, while I went in the direction of essentially defining it as abstract characters identified by code points.  No doubt there will be additional discussion about this.

There is enough confusion concerning ECMAScript source code (code point vs. code unit, UTF-16 or not, etc.) in previous editions that I wanted to be as clear as possible.  The key point is just that the ECMAScript specification is assigning a meaning to certain Unicode characters/character sequences, and that meaning is independent of any file-encoding processing that may take place within an implementation.

> 
> - p. 14, §6, ¶3: "Within other contexts, such an escape sequence contextually contributes one Unicode character."  This read kind of funny to me-- I didn't know what "contextually" meant.  Before, you had something like "Within a string literal, an escape sequence contributes one Unicode character to the value of the literal", and I suspect "contextually" was intended to mean that the escape sequence contributes one character in whatever way is appropriate for the context.  I wonder if it would be better to say something like "In all other contexts, an escape sequence is treated identically to the Unicode character with the specified Unicode scalar value" or something like that.  (That probably needs some wordsmithing, though, since \u000a would be equivalent to \n most of the time, not to a literal line-feed character, as you point out in the next paragraph.)

I generally agree.  Some of this section is suffering from too much in-place editing, and it would be a good idea to step back and think more deeply about what we really need to say rather than just wordsmithing.

> 
> - p. 14, §6, ¶3: Should there be a pointer here to the actual definition of a Unicode escape sequence?

Probably.
> 
> - p. 14, §6: I suppose it doesn't hurt to explain how to map an abstract Unicode value to UTF-16 here, but couldn't you just point to the definition of UTF-16 in the Unicode standard?

This is an actual specification algorithm that is "called" from other algorithms within the ES spec.  It needs to be expressed in a form that interfaces easily with those uses.
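For reference, here is a rough sketch of the computation that algorithm performs (the name utf16Encode is hypothetical; the spec expresses this as an abstract algorithm, not as an API):

    // Map one code point to one or two 16-bit code units (UTF-16).
    function utf16Encode(cp) {          // cp in 0x0 .. 0x10FFFF
      if (cp <= 0xFFFF) return [cp];    // BMP: a single code unit
      var v = cp - 0x10000;             // supplementary: surrogate pair
      return [0xD800 + (v >> 10),       // lead surrogate
              0xDC00 + (v & 0x3FF)];    // trail surrogate
    }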


> 
> - p. 19, §7.6: I tend to agree with your comment here-- since this was nailed to Unicode 3.0 before, it seems better to stick with that when we're talking about "portability" (although a note explaining why it's not Unicode 5.1 might be helpful).
> 
> - p. 19, §7.6: "Return the String value consisting of the sequence of code units…" Do you want to say "UTF-16 code units" here and in other places where this occurs, and maybe define that up front?  Each of the Unicode encoding forms (UTF-8 and UTF-32 as well as UTF-16) has its own definition of "code unit."

In theory, we define our use of "code unit" in the last paragraph of §6.  Norbert has also proposed that we add these terms to the "Terms and Definitions" section.
> 
> - p. 24, §7.8.4: In earlier versions of ECMAScript, I could often specify a supplementary-plane character by using two Unicode escape sequences in a row, each representing a surrogate code unit value.  Can I still do that?  It seems like you'd have to support this for backward compatibility, but you're not really supposed to see bare surrogates in any context except for UTF-16 (I don't think they're strictly illegal, except in UTF-8, but the code point sequence <D800 DC00> isn't equivalent to U+10000, either).  I think you want some verbiage here clarifying how this is supposed to work.

I believe that the algorithmic text of the spec. is clear in this regard.  But informative text could be added.

What actually happens depends upon the contextual details that your fourth item refers to.

In a string literal, each non-escape-sequence character contributes one code point to the literal.  String values are made up of 16-bit elements, with non-BMP code points being UTF-16 encoded as two elements.  A \u{ } escape also represents one code point that, depending upon its value, will contribute one or two elements to the string value.  A \uNNNN escape represents a BMP code point, so it is always represented as one string element.  If you have existing code like "\ud800\udc00", you will get the same two-element string value that you would get if you wrote "\u{10000}" or "\u{d800}\u{dc00}".  They are all alternative ways of expressing the same two-element string value.  The first form must be supported for backwards compatibility.
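To illustrate (assuming the draft's \u{ } semantics):

    // All three literals denote the same two-element string value:
    var a = "\ud800\udc00";       // two BMP escapes forming a surrogate pair
    var b = "\u{10000}";          // one code point escape, UTF-16 encoded
    var c = "\u{d800}\u{dc00}";   // code point escapes for each surrogate
    a === b && b === c;           // true
    a.length;                     // 2 -- two 16-bit string elements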

Outside of string literals (and friends, such as quasis; I'm actually glossing over a couple of other exceptions), we don't have to worry about UTF-16 encoding, because we are dealing with more abstract concepts, such as "identifiers", that we can handle at the level of Unicode characters.  We also don't have backwards compatibility issues, because in those contexts current implementations treat surrogate pairs as two distinct "characters", either of which results in a syntax error in any non-literal context.  E.g., current implementations reject identifiers containing supplementary characters that are (according to Unicode) legal identifier characters.

> 
> - p. 25, "Early Errors": You say it's a Syntax Error if you specify a \u{} escape sequence with a value greater than 10FFFF.  Should \u{} escape sequences with values corresponding to the surrogate range also produce syntax errors?  If not, is \u{d800}\u{dc00} equivalent to \ud800\udc00 (which I presume is equivalent to \u{10000})?

Exactly, they are all equivalent.  I want people to be able to stop using \uNNNN and just use \u{NNNN} if they need to express a BMP code point.  \uNNNN is there for backwards compat.

> 
> - p. 210, §15.5.3.2: I like the idea of introducing fromCodeUnit(), making this function an alias of that one, and marking this function as obsolete.  But I'm also wondering if it would make more sense for this function to be called fromCodeUnits(), since you can specify a whole list of code units, and they all contribute to the string.

The singular form follows the convention established by fromCharCode.  It could change if nobody objects.
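For what it's worth, a usage sketch of the distinction, using String.fromCodePoint from this draft alongside the existing fromCharCode:

    // fromCharCode takes 16-bit code unit values;
    // fromCodePoint (new in this draft) takes full code point values.
    String.fromCharCode(0xD800, 0xDC00) === String.fromCodePoint(0x10000); // true
    String.fromCodePoint(0x10000).length;  // 2 -- still two string elements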



> 
> - p. 210, §15.5.3.3: Same thing: Maybe call this fromCodePoints()?  [Note also you have a copy-and-paste problem on the first line, where it still says "String.fromCharCode()".]  
> 
> - p. 212, §15.5.4.4: I like the idea of adding a new name for this function, but I'm thinking maybe codeUnitAt().  Or do what Java did (IIRC): Add a new function called char32At(), which behaves like this one, except that if the index you give it points to one half of a surrogate pair, you return the code point value represented by the surrogate pair.  (If you don't do some sort of char32At() function, you're probably going to need a function that takes a sequence of UTF-16 code unit values and returns a sequence of Unicode code point values.)

I've also thought about unicodeCharAt() or perhaps uCharAt().
> 
> - p. 212, §15.5.4.5: Same comments as above.  I think you're either going to need to add a charCode32At() or a function that converts sequences of code units to sequences of code points.  You might also need to add some kind of function to facilitate iterating through a string by code point.

The intent is also to provide an iterator for strings that iterates at the Unicode character level (one or two UTF-16 string elements per step), so the need for an individual Unicode character accessor may not be so great.
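A sketch of what such iteration might look like in terms of the proposed codePointAt (the actual iterator API is not yet settled, so this uses a plain loop; eachCodePoint is a hypothetical helper):

    // Walk a string one Unicode character at a time,
    // advancing by 1 or 2 elements depending on the code point.
    function eachCodePoint(s, f) {
      for (var i = 0; i < s.length; ) {
        var cp = s.codePointAt(i);
        f(cp);
        i += cp > 0xFFFF ? 2 : 1;   // a surrogate pair occupies two elements
      }
    }
    eachCodePoint("a\u{10000}b", function (cp) {
      // called with 0x61, then 0x10000, then 0x62
    });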

> 
> - pp. 212-213, §15.5.4.6: Should you say "code units" instead of "string elements"?

The current plan is to talk about "string elements" in all contexts where there isn't some associated Unicode semantics involved.  Most of these methods happily deal with string elements as simple 16-bit integer values.




> 
> - p. 220, §§15.5.4.17 and 15.5.4.19: Maybe this is a question for Norbert: Are we allowing somewhere for versions of toLocaleUpperCase() and toLocaleLowerCase() that let you specify the locale as a parameter instead of just using the host environment's default locale?

This is covered by the I18N API spec, right?



> 
> - p. 223, §15.5.4.5: First, did something go haywire with the numbering here?  Second, this sort of addresses my comment above, but if you can't put this and charCodeAt() (or whatever we called it) together in the spec, can you include a pointer in charCodeAt()'s description to here?  Third, it looks like this only works right with surrogate pairs if you specify the position of the first surrogate in the pair.  I think you want it to work right if you specify the position of either element in the pair.  (I think you may have a typo in step 11 as well: shouldn't that be "…or second > 0xDFFF"?)

It's supposed to be 15.5.4.25.  Also, yes about the step 11 typo.  Norbert proposed this function, so we should get his thoughts on the addressing issue.  As I wrote this, I did think a bit about whether or not we need to provide some support for backward iteration over strings.
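For concreteness, here is how the function behaves at each position of a surrogate pair, as I read the current draft algorithm:

    var s = "\u{10000}";   // two elements: 0xD800, 0xDC00
    s.codePointAt(0);      // 0x10000 -- lead position yields the full code point
    s.codePointAt(1);      // 0xDC00  -- trail position yields just that element
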
> 
> - pp. 223-224, §15.5.5.2: How come you changed "characters" to "elements" in some spots and "code units" in others?  Is there a difference?  (I'm seeing this in some of the number-formatting stuff too.)

Faulty multi-step editing; they should all be "element".
> 
> Thanks a lot…




> 
> --Rich Gillam
>   Lab126
> 
> On Jul 8, 2012, at 6:22 PM, Allen Wirfs-Brock wrote:
> 
>> Rev9 (July 8, 2012) of the ES6 Draft Specification is now available at http://wiki.ecmascript.org/doku.php?id=harmony:specification_drafts 
>> 
>> Changes in this version include:
>> Quasi literal added to specification
>> Initial work at defining tail call semantics (still need to define tail positions in 13.7)
>> Initial pass at replacing native/host object terminology with ordinary/exotic objects
>> Clause 6 and others updated to clarify processing of full Unicode source code. Revised usage of “code unit” and “code point”
>> Specification of Identifiers updated to use current Unicode specification devices
>> \u{nnnnnn} Unicode code point escapes added
>> UTF-16 encoding for non-BMP characters in string literals now fully specified
>> Added functions: String.fromCodePoint, String.raw (a quasi tag function), String.prototype.codePointAt
>> ECMAScript now requires use of Unicode 5.1.0, normative references updated
>> A syntactic grammar notation was added for indicating when alternative lexical goals are required
>> Fixed ES5's omission of explicitly setting length in several array functions
>> Fixed bugs: 368, 388-399, 402-405, 410-413, 415-416, 418, 420-428, 430-439, 445-456, 458-461 (thanks very much for all the bug reports)
>> Please report bugs you find at bugs.ecmascript.org.
>> _______________________________________________
>> es-discuss mailing list
>> es-discuss at mozilla.org
>> https://mail.mozilla.org/listinfo/es-discuss
> 
