Last call for consensus on format-control char. issues

Brendan Eich brendan at mozilla.org
Wed Jun 17 10:30:17 PDT 2009


We've dealt with users wishing for ZWNJ in string literals too. See  
resolved bug https://bugzilla.mozilla.org/show_bug.cgi?id=274152 -- I  
do not see why we should regress this for such users.
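To make the string-literal case concrete, here is a minimal sketch. It assumes an engine that preserves format-control characters inside string literals rather than stripping them. `evalLiteral` is a hypothetical helper name, and the source text is built with an escape only so that the otherwise invisible ZWNJ shows up here.

```javascript
// Hypothetical helper: parse and evaluate a piece of source text as an expression.
function evalLiteral(sourceText) {
  return new Function('return ' + sourceText + ';')();
}

// The outer '\u200C' escape is resolved first, so the parser sees a raw
// U+200C (ZWNJ) character inside the double-quoted literal.
const withZwnj = evalLiteral('"foo\u200Cbar"');

// Assumption: the engine keeps the ZWNJ instead of stripping it.
console.log(withZwnj.length);
console.log(withZwnj.charCodeAt(3).toString(16));
console.log(withZwnj === 'foobar');
```

If an engine stripped the character, the literal would compare equal to 'foobar', which is the regression to avoid.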

/be

On Jun 17, 2009, at 10:11 AM, Allen Wirfs-Brock wrote:

> Waldemar's notes from the May F2F don't record any decision on the  
> issue of <ZWNJ> and <ZWJ> in identifiers.  However, my personal  
> notes say that I need to "keep in identifiers and fix grammar" which  
> is also my recollection of what we decided at the meeting.
>
> The simplest implementation of that decision is to add <ZWNJ> and
> <ZWJ> as alternatives for IdentifierPart. In addition, the text in
> section 7.1 that says that format control characters can occur in
> identifiers presumably needs to be narrowed to say only <ZWNJ> and
> <ZWJ>.
>
> At about the same time as the F2F, David-Sarah made a more
> comprehensive proposal (duplicated below) that, in addition to
> addressing <ZWNJ> and <ZWJ>, also significantly refines the rules for
> <BOM>: it excludes <BOM> from string literals and regular
> expressions and makes it a syntax error for a <BOM> to appear
> within an identifier.
>
> I'm not a Unicode expert, but my sense is that David-Sarah's  
> proposal is sound and probably consistent with the original goals of  
> cleaning up class Cf in the specification. However, his rules for  
> <BOM> also seem like they could significantly complicate the lexical  
> analysis phase of implementations.
>
> My sense from the F2F is that the consensus was more in the  
> direction of my simple solution above (<ZWNJ> and <ZWJ> in  
> identifiers, <BOM> is whitespace) rather than David-Sarah's more  
> comprehensive treatment of <BOM>.
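As a rough illustration of what the "simple solution" means for a running program, the sketch below probes an engine from script. It assumes the engine treats <ZWNJ> and <ZWJ> as IdentifierPart alternatives and <BOM> as WhiteSpace. `run` is a hypothetical helper, and the escapes in the outer strings become raw characters before the inner source is parsed.

```javascript
// Hypothetical helper: compile the given source text as a function body and call it.
function run(sourceText) {
  return new Function(sourceText)();
}

// <ZWNJ> after the first character of an identifier is significant:
// a<ZWNJ>b and ab are two different variables.
const distinct = run('var a\u200Cb = 1, ab = 2; return [a\u200Cb, ab];');

// <BOM> between tokens behaves as whitespace.
const bomAsSpace = run('return\uFEFF42;');

// <ZWNJ> is only an IdentifierPart, so it cannot start an identifier.
let leadingZwnjRejected = false;
try {
  run('var \u200Cab = 1;');
} catch (e) {
  leadingZwnjRejected = e.name === 'SyntaxError';
}

console.log(distinct, bomAsSpace, leadingZwnjRejected);
```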
>
> I need to have a final decision on this so I can update the draft  
> accordingly. Based upon my recollection of the F2F I'm going to go  
> with the "simple solution" unless there is apparent consensus  
> otherwise.
>
> Final thoughts?
>
> Allen
>
>
>> -----Original Message-----
>> From: es5-discuss-bounces at mozilla.org [mailto:es5-discuss-
>> bounces at mozilla.org] On Behalf Of David-Sarah Hopwood
>> Sent: Thursday, May 28, 2009 5:44 PM
>> To: es5-discuss at mozilla.org
>> Subject: Grammar for IdentifierName does not allow <ZWNJ> and <ZWJ>
>>
>> John Cowan wrote:
>>> David-Sarah Hopwood scripsit:
>>>
>>>> The omission of format-control characters from <IdentifierName>
>>>> appears to be just an oversight.
>>>
>>> -1
>>
>> Indeed, I had forgotten that we had already discussed this and come
>> to a different conclusion:
>>
>> <https://mail.mozilla.org/pipermail/es5-discuss/2009-April/002432.html>
>> <https://mail.mozilla.org/pipermail/es5-discuss/2009-April/002435.html>.
>>
>>> Allowing all of them causes the same kinds of problems as allowing
>>> BOM.  Most of them have little visible effect on the surrounding text
>>> (especially Latin-script text) even in fully conformant Unicode
>>> renderers, never mind renderers that muffle them.  The result is that
>>> "foobar" and "foo<Cf>bar" look the same but aren't.
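The look-alike problem described here can be shown with a small sketch. `findFormatControls` is a hypothetical helper, and it relies on Unicode property escapes (`\p{Cf}` with the `u` flag), which are an assumption about the host engine (ES2018 or later), not something available to ES5 implementations.

```javascript
// List the format-control (General Category Cf) code points found in a string.
function findFormatControls(text) {
  const found = [];
  for (const ch of text) {
    if (/\p{Cf}/u.test(ch)) {
      const hex = ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0');
      found.push('U+' + hex);
    }
  }
  return found;
}

const plain = 'foobar';
const hidden = 'foo\u200Cbar'; // typically renders like "foobar"

// The two strings are different values even though they usually look identical.
console.log(plain === hidden);
console.log(findFormatControls(plain));
console.log(findFormatControls(hidden));
```

A check of this kind could be used by a linter or editor to flag invisible characters. It is a usability aid and does not address the spoofing concerns discussed in the thread.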
>>>
>>> Per Unicode 5.1, the only ones that actually affect the
>>> natural-language meaning of identifiers are U+200C ZWNJ and U+200D
>>> ZWJ.  These are the only ones which should even be considered in ES5
>>> identifiers.  UAX #31 (which is included by reference in Unicode 5.1)
>>> specifies narrower conditions in which ZWNJ and ZWJ are essential;
>>> sticking to the conditions is non-trivial, but minimizes the chance of
>>> spoofing.
>>>
>>> Given the risks, I'm uncertain whether ZWNJ and ZWJ should be allowed
>>> or not.
>>
>> Forget trying to minimize identifier spoofing as a security risk.
>> That's not possible, if Unicode identifiers are to be allowed at all.
>> It is an inherent characteristic of Unicode that many distinct (even
>> when normalized) strings will look the same. It is not at all clear
>> that this is a genuine security risk for general programming -- as
>> opposed to situations that require adversarial code review, which full
>> ECMAScript is a long way from being able to support.
>>
>> What is useful to attempt to minimize is the chance of accidentally
>> typing identifiers that are distinct but look the same, or of seeing an
>> identifier and being unable to reliably reproduce it. This is a
>> usability issue, not a security issue.
>>
>> For usability, it may indeed be a good approach to allow <ZWNJ> and
>> <ZWJ> but disallow other format-control characters. I am not
>> sufficiently familiar with the scripts that require these characters to
>> be sure of that, but it seems reasonable based on their descriptions in
>> the Unicode standard.
>>
>> However, the complicated script-dependent rules described in UAX #31
>> for restricting the contexts in which <ZWNJ> and <ZWJ> can occur, seem
>> quite over-the-top given the impossibility of preventing spoofing.
>> Again, see
>> <https://mail.mozilla.org/pipermail/es5-discuss/2009-April/002435.html>.
>>
>> Combining the proposal from that post with the changes for <NEL>,
>> <ZWSP> and <BOM> (since both affect section 7.1), we end up with this.
>>
>> ====
>> Changes to section 7.2:
>> - revert the addition of <NEL>, <ZWSP>, and <BOM> to WhiteSpace and
>>  to the table.
>>
>>
>> Changes to section 7.8.4:
>>
>>  DoubleStringCharacter ::
>>    SourceCharacter but not double-quote " or backslash \ or LineTerminator or <BOM>
>>    \ EscapeSequence
>>    LineContinuation
>>
>>  SingleStringCharacter ::
>>    SourceCharacter but not single-quote ' or backslash \ or LineTerminator or <BOM>
>>    \ EscapeSequence
>>    LineContinuation
>>
>>  NonEscapeCharacter ::
>>    SourceCharacter but not EscapeCharacter or LineTerminator or <BOM>
>>
>>  * The CV of DoubleStringCharacter :: SourceCharacter but not
>>    double-quote " or backslash \ or LineTerminator or <BOM>
>>    is the SourceCharacter character itself.
>>
>>  * The CV of SingleStringCharacter :: SourceCharacter but not
>>    single-quote ' or backslash \ or LineTerminator or <BOM>
>>    is the SourceCharacter character itself.
>>
>>  * The CV of NonEscapeCharacter :: SourceCharacter but not
>>    EscapeCharacter or LineTerminator or <BOM> is the
>>    SourceCharacter character itself.
>>
>>
>> Replace section 7.1:
>>
>> 7.1 Unicode Format-Control Characters
>>
>> The Unicode format-control characters (i.e., the characters in
>> General Category "Cf" in the Unicode Character Database such as
>> LEFT-TO-RIGHT MARK or RIGHT-TO-LEFT MARK) are control codes used to
>> control the formatting of a range of text in the absence of
>> higher-level protocols for this, such as mark-up languages.
>>
>> <BOM> is a format-control character used primarily at the start of
>> a text to mark it as Unicode and to allow detection of the text's
>> encoding and byte order. <BOM> characters intended for this purpose
>> can sometimes also appear after the start of a text, for example as
>> a result of concatenating files.
>>
>> In ECMAScript source, <BOM> characters are ignored if they appear
>> immediately before or after a token, or within a span of consecutive
>> WhiteSpace characters (7.2). The lexical grammar does not explicitly
>> include such ignored <BOM> characters. It is a syntax error for a
>> <BOM> character to appear within a token (that is, if removing the
>> <BOM> would result in the preceding and following characters being
>> part of the same token).
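The <BOM> rule in this paragraph can be sketched as a pre-pass over the source text. This is a heavily simplified illustration, not a conformant lexer: it assumes that only runs of ASCII letters, digits, `$` and `_` form multi-character tokens, and it does not model string literals, regular expressions, comments or multi-character punctuators. `stripBoms` and `isWordChar` are hypothetical names.

```javascript
const BOM = '\uFEFF';

// Simplifying assumption: only these characters can sit inside a longer token.
function isWordChar(ch) {
  return ch !== undefined && /[A-Za-z0-9_$]/.test(ch);
}

// Drop <BOM> characters that fall between tokens; throw if one would split a token.
function stripBoms(source) {
  let out = '';
  for (let i = 0; i < source.length; i++) {
    const ch = source[i];
    if (ch !== BOM) {
      out += ch;
      continue;
    }
    // Nearest non-BOM neighbours on each side of this <BOM>.
    const prev = out[out.length - 1];
    let j = i + 1;
    while (j < source.length && source[j] === BOM) j++;
    const next = source[j];
    // If removing the <BOM> would join two token-internal characters,
    // the <BOM> is inside a token, which the proposal makes a syntax error.
    if (isWordChar(prev) && isWordChar(next)) {
      throw new SyntaxError('<BOM> inside a token at index ' + i);
    }
  }
  return out;
}

console.log(stripBoms('\uFEFFvar x\uFEFF = 1;'));
```

Even in this reduced form, deciding whether a <BOM> is "within a token" needs lookahead past the <BOM>, which hints at the lexer complication raised earlier in the thread.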
>>
>> Note that comments are not tokens, and so the above rule allows
>> <BOM> characters to appear within comments. It does not allow them
>> to appear within string literals or regular expression literals (the
>> escape sequence \uFEFF should be used instead).
>>
>> It is useful to allow other format-control characters in source text
>> to facilitate editing and display. Format-control characters other
>> than <BOM> may be used within comments, string literals, and
>> regular expression literals. Two specific format-control characters,
>> <ZWNJ> and <ZWJ>, may also be used in an identifier after the first
>> character.
>>
>>
>>  Code Unit Value    Name                                Formal name
>>  ------------------------------------------------------------------
>>  \u200C             Zero width non-joiner               <ZWNJ>
>>  \u200D             Zero width joiner                   <ZWJ>
>>  \uFEFF             Byte order mark (also called
>>                       zero-width non-breaking space)    <BOM>
>>
>>
>> Changes to section 7.6:
>>
>> [...] This standard specifies specific character additions: The
>> dollar sign ($) and the underscore (_) are permitted anywhere in
>> an identifier. <ZWNJ> and <ZWJ> are permitted after the first
>> character.
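The identifier rule above (dollar sign and underscore anywhere, <ZWNJ> and <ZWJ> only after the first character) can be approximated with a regular expression. This sketch is an assumption-laden stand-in: it uses the Unicode `ID_Start` and `ID_Continue` properties in place of the draft's own letter, digit and mark categories, ignores `\u` escapes and reserved words, and needs an engine with property escapes (ES2018 or later). The names are hypothetical.

```javascript
// First character: $, _ or ID_Start.
// Later characters: additionally ID_Continue, <ZWNJ> (U+200C) and <ZWJ> (U+200D).
const IDENTIFIER_SKETCH = /^[$_\p{ID_Start}][$_\u200C\u200D\p{ID_Continue}]*$/u;

function isIdentifierNameSketch(text) {
  return IDENTIFIER_SKETCH.test(text);
}

console.log(isIdentifierNameSketch('a\u200Cb'));  // ZWNJ after the first character
console.log(isIdentifierNameSketch('\u200Cab'));  // ZWNJ as the first character
console.log(isIdentifierNameSketch('a\uFEFFb'));  // BOM inside an identifier
```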
>>
>>
>> Changes to section 7.8.5:
>>
>> RegularExpressionNonTerminator ::
>>   SourceCharacter but not LineTerminator or <BOM>
>>
>>
>> Changes to Annex A:
>> - update all productions changed above.
>>
>>
>> Changes to Annex E:
>> - add to the entry for section 7.1:
>>    <BOM> characters are ignored between tokens and in comments,
>>    but are not allowed within tokens (including string and
>>    regular expression literals). <ZWNJ> and <ZWJ> are significant
>>    within identifiers rather than being stripped.
>>
>> - delete the entries for sections 7.2 and 15.10.2.12.
>>
>>  (Reverting the additions of <NEL>, <ZWSP>, and <BOM> to the
>>  WhiteSpace production also reverts this for the \s character
>>  class, without any explicit change to section 15.10.2.12.)
>>
>> --
>> David-Sarah Hopwood  ⚥  http://davidsarah.livejournal.com
>>
>> _______________________________________________
>> es5-discuss mailing list
>> es5-discuss at mozilla.org
>> https://mail.mozilla.org/listinfo/es5-discuss