Last call for consensus on format-control char. issues

Allen Wirfs-Brock Allen.Wirfs-Brock at microsoft.com
Wed Jun 17 10:54:52 PDT 2009


Sorry, I didn't intend to imply that <ZWNJ> and <ZWJ> would be excluded from string literals in my simple proposal.  Their occurrence in literals is already in the draft. Basically, all the language from the ES3 spec about stripping out Cf characters before lexing is long gone.  Except for the treatment of <ZWNJ> and <ZWJ> as identifier characters and <BOM> as WhiteSpace, all Cf characters are currently treated in the spec as generic Unicode characters and hence can occur in string literals, regexp literals and comments.
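
To make the current draft behaviour concrete, here is a rough probe, offered as a sketch rather than as spec text. It assumes an engine that implements the draft rules described above (any ES5-or-later engine should do), and `probe` is a made-up helper name, not anything from the spec.

```javascript
// Sketch only: reports what the engine running it does; assumes ES5-or-later.
// `probe` is a hypothetical helper, not part of any spec or library.
function probe(source) {
  try {
    return { ok: true, value: new Function(source)() };
  } catch (e) {
    return { ok: false, error: e.name };
  }
}

// The \u escapes are resolved by the outer literals, so the source text
// handed to the engine contains the raw characters themselves.
var zwnjInString = probe('return "foo\u200Cbar".length;');          // raw <ZWNJ> in a string literal
var zwnjInIdentifier = probe("var a\u200Cb = 1; return a\u200Cb;"); // raw <ZWNJ> in an identifier
var bomAsWhiteSpace = probe("return 1\uFEFF+\uFEFF2;");             // raw <BOM> between tokens
var lrmInIdentifier = probe("var a\u200Eb = 1;");                   // LEFT-TO-RIGHT MARK, another Cf character

console.log(zwnjInString, zwnjInIdentifier, bomAsWhiteSpace, lrmInIdentifier);
```

If the engine follows the draft, the first three probes succeed (the <ZWNJ> is kept in the string, so it counts toward the length) and the last one is rejected, since a Cf character other than <ZWNJ> or <ZWJ> is not an identifier character.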

Allen

>-----Original Message-----
>From: Brendan Eich [mailto:brendan at mozilla.org]
>Sent: Wednesday, June 17, 2009 10:30 AM
>To: Allen Wirfs-Brock
>Cc: David-Sarah Hopwood; es5-discuss at mozilla.org
>Subject: Re: Last call for consensus on format-control char. issues
>
>We've dealt with users wishing for ZWNJ in string literals too. See
>resolved bug https://bugzilla.mozilla.org/show_bug.cgi?id=274152 -- I
>do not see why we should regress this for such users.
>
>/be
>
>On Jun 17, 2009, at 10:11 AM, Allen Wirfs-Brock wrote:
>
>> Waldemar's notes from the May F2F don't record any decision on the
>> issue of <ZWNJ> and <ZWJ> in identifiers.  However, my personal
>> notes say that I need to "keep in identifiers and fix grammar" which
>> is also my recollection of what we decided at the meeting.
>>
>> The simplest implementation of that decision is to add
>> <ZWNJ> and <ZWJ> as alternatives for IdentifierPart. In addition,
>> the text in section 7.1 that says that format control characters can
>> occur in identifiers presumably needs to be narrowed to say only
>> <ZWNJ> and <ZWJ>.
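
A minimal sketch of what that IdentifierPart change amounts to, for illustration only. It assumes an engine with Unicode property escapes in regular expressions (ES2018 or later), approximates the draft's Unicode categories with `\p{...}` classes, and ignores the `\ UnicodeEscapeSequence` alternative; `isIdentifierPart` is a made-up name.

```javascript
// Sketch only: approximates IdentifierPart under the "simple solution".
// Assumes regex Unicode property escapes (ES2018+); not spec text.
// Letters (L, Nl), combining marks (Mn, Mc), digits (Nd), connector
// punctuation (Pc), "$" and "_", plus the two added alternatives.
var IDENTIFIER_PART = /^[$_\p{L}\p{Nl}\p{Mn}\p{Mc}\p{Nd}\p{Pc}\u200C\u200D]$/u;

// Hypothetical helper: true if the single character `ch` may continue an identifier.
function isIdentifierPart(ch) {
  return IDENTIFIER_PART.test(ch);
}

console.log(isIdentifierPart("\u200C")); // <ZWNJ>
console.log(isIdentifierPart("\u200D")); // <ZWJ>
console.log(isIdentifierPart("\u200E")); // LEFT-TO-RIGHT MARK, a different Cf character
console.log(isIdentifierPart("\uFEFF")); // <BOM>
```

Only the two named characters are admitted; every other Cf character, <BOM> included, falls outside the class.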
>>
>> At about the same time as the F2F David-Sarah made a more
>> comprehensive proposal (duplicated below) that in addition to
>> addressing <ZWNJ> and <ZWJ> also significantly refines the rules for
>> <BOM>, including excluding it from string literals and regular
>> expressions and making it a syntax error for a <BOM> to appear
>> within an identifier.
>>
>> I'm not a Unicode expert, but my sense is that David-Sarah's
>> proposal is sound and probably consistent with the original goals of
>> cleaning up class Cf in the specification. However, his rules for
>> <BOM> also seem like they could significantly complicate the lexical
>> analysis phase of implementations.
>>
>> My sense from the F2F is that the consensus was more in the
>> direction of my simple solution above (<ZWNJ> and <ZWJ> in
>> identifiers, <BOM> is whitespace) rather than David-Sarah's more
>> comprehensive treatment of <BOM>.
>>
>> I need to have a final decision on this so I can update the draft
>> accordingly. Based upon my recollection of the F2F I'm going to go
>> with the "simple solution" unless there is apparent consensus
>> otherwise.
>>
>> Final thoughts?
>>
>> Allen
>>
>>
>>> -----Original Message-----
>>> From: es5-discuss-bounces at mozilla.org
>>> [mailto:es5-discuss-bounces at mozilla.org] On Behalf Of David-Sarah Hopwood
>>> Sent: Thursday, May 28, 2009 5:44 PM
>>> To: es5-discuss at mozilla.org
>>> Subject: Grammar for IdentifierName does not allow <ZWNJ> and <ZWJ>
>>>
>>> John Cowan wrote:
>>>> David-Sarah Hopwood scripsit:
>>>>
>>>>> The omission of format-control characters from <IdentifierName>
>>>>> appears to be just an oversight.
>>>>
>>>> -1
>>>
>>> Indeed, I had forgotten that we had already discussed this and come
>>> to a different conclusion:
>>>
>>> <https://mail.mozilla.org/pipermail/es5-discuss/2009-April/002432.html>
>>> <https://mail.mozilla.org/pipermail/es5-discuss/2009-April/002435.html>.
>>>
>>>> Allowing all of them causes the same kinds of problems as allowing
>>>> BOM.  Most of them have little visible effect on the surrounding
>>>> text (especially Latin-script text) even in fully conformant Unicode
>>>> renderers, never mind renderers that muffle them.  The result is that
>>>> "foobar" and "foo<Cf>bar" look the same but aren't.
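
A small illustration of the "look the same but aren't" point, as a sketch. LEFT-TO-RIGHT MARK (U+200E) stands in for <Cf> here, and `codePoints` is a made-up helper.

```javascript
// Sketch only: U+200E (LEFT-TO-RIGHT MARK) is used as an example Cf character.
var plain = "foobar";
var withCf = "foo\u200Ebar";

// Hypothetical helper: list the code points of a string as U+XXXX labels.
function codePoints(text) {
  return Array.from(text, function (ch) {
    var hex = ch.codePointAt(0).toString(16).toUpperCase();
    return "U+" + ("0000" + hex).slice(-Math.max(4, hex.length));
  });
}

console.log(plain === withCf);           // the strings compare as different
console.log(plain.length, withCf.length); // and differ in length
console.log(codePoints(withCf).join(" "));
```

Most renderers draw the two strings identically; only the code point dump makes the extra character visible.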
>>>>
>>>> Per Unicode 5.1, the only ones that actually affect the
>>>> natural-language meaning of identifiers are U+200C ZWNJ and U+200D
>>>> ZWJ.  These are the only ones which should even be considered in ES5
>>>> identifiers.  UAX #31 (which is included by reference in Unicode 5.1)
>>>> specifies narrower conditions in which ZWNJ and ZWJ are essential;
>>>> sticking to the conditions is non-trivial, but minimizes the chance
>>>> of spoofing.
>>>>
>>>> Given the risks, I'm uncertain whether ZWNJ and ZWJ should be
>>>> allowed or not.
>>>
>>> Forget trying to minimize identifier spoofing as a security risk.
>>> That's not possible, if Unicode identifiers are to be allowed at all.
>>> It is an inherent characteristic of Unicode that many distinct (even
>>> when normalized) strings will look the same. It is not at all clear
>>> that this is a genuine security risk for general programming -- as
>>> opposed to situations that require adversarial code review, which
>>> full ECMAScript is a long way from being able to support.
>>>
>>> What is useful to attempt to minimize is the chance of accidentally
>>> typing identifiers that are distinct but look the same, or of seeing
>>> an identifier and being unable to reliably reproduce it. This is a
>>> usability issue, not a security issue.
>>>
>>> For usability, it may indeed be a good approach to allow <ZWNJ> and
>>> <ZWJ> but disallow other format-control characters. I am not
>>> sufficiently familiar with the scripts that require these characters
>>> to be sure of that, but it seems reasonable based on their
>>> descriptions in the Unicode standard.
>>>
>>> However, the complicated script-dependent rules described in UAX #31
>>> for restricting the contexts in which <ZWNJ> and <ZWJ> can occur seem
>>> quite over-the-top given the impossibility of preventing spoofing.
>>> Again, see
>>> <https://mail.mozilla.org/pipermail/es5-discuss/2009-April/002435.html>.
>>>
>>> Combining the proposal from that post with the changes for <NEL>,
>>> <ZWSP> and <BOM> (since both affect section 7.1), we end up with
>>> this.
>>>
>>> ====
>>> Changes to section 7.2:
>>> - revert the addition of <NEL>, <ZWSP>, and <BOM> to WhiteSpace and
>>>  to the table.
>>>
>>>
>>> Changes to section 7.8.4:
>>>
>>>  DoubleStringCharacter ::
>>>    SourceCharacter but not double-quote " or backslash \ or LineTerminator or <BOM>
>>>    \ EscapeSequence
>>>    LineContinuation
>>>
>>>  SingleStringCharacter ::
>>>    SourceCharacter but not single-quote ' or backslash \ or LineTerminator or <BOM>
>>>    \ EscapeSequence
>>>    LineContinuation
>>>
>>>  NonEscapeCharacter ::
>>>    SourceCharacter but not EscapeCharacter or LineTerminator or <BOM>
>>>
>>>  * The CV of DoubleStringCharacter :: SourceCharacter but not
>>>    double-quote " or backslash \ or LineTerminator or <BOM>
>>>    is the SourceCharacter character itself.
>>>
>>>  * The CV of SingleStringCharacter :: SourceCharacter but not
>>>    single-quote ' or backslash \ or LineTerminator or <BOM>
>>>    is the SourceCharacter character itself.
>>>
>>>  * The CV of NonEscapeCharacter :: SourceCharacter but not
>>>    EscapeCharacter or LineTerminator or <BOM> is the
>>>    SourceCharacter character itself.
>>>
>>>
>>> Replace section 7.1:
>>>
>>> 7.1 Unicode Format-Control Characters
>>>
>>> The Unicode format-control characters (i.e., the characters in
>>> General Category "Cf" in the Unicode Character Database such as
>>> LEFT-TO-RIGHT MARK or RIGHT-TO-LEFT MARK) are control codes used to
>>> control the formatting of a range of text in the absence of
>>> higher-level protocols for this, such as mark-up languages.
>>>
>>> <BOM> is a format-control character used primarily at the start of
>>> a text to mark it as Unicode and to allow detection of the text's
>>> encoding and byte order. <BOM> characters intended for this purpose
>>> can sometimes also appear after the start of a text, for example as
>>> a result of concatenating files.
>>>
>>> In ECMAScript source, <BOM> characters are ignored if they appear
>>> immediately before or after a token, or within a span of consecutive
>>> WhiteSpace characters (7.2). The lexical grammar does not explicitly
>>> include such ignored <BOM> characters. It is a syntax error for a
>>> <BOM> character to appear within a token (that is, if removing the
>>> <BOM> would result in the preceding and following characters being
>>> part of the same token).
>>>
>>> Note that comments are not tokens, and so the above rule allows
>>> <BOM> characters to appear within comments. It does not allow them
>>> to appear within string literals or regular expression literals (the
>>> escape sequence \uFEFF should be used instead).
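
One possible reading of the proposed <BOM> rule, sketched as a toy checker to show what it would ask of a lexer. This is an assumption-laden illustration, not the proposal's wording: the tokenizer only knows comments, whitespace, simple string literals, ASCII identifiers, decimal digits and single-character punctuators (no regular expression literals, no multi-character punctuators), and `checkBoms` and `LEXEME` are made-up names.

```javascript
// Sketch only: a toy model of "ignore <BOM> between tokens, reject it inside one".
var BOM = "\uFEFF";

// Group 1: non-tokens (comments, whitespace). Group 2: tokens (toy subset).
var LEXEME = /(\/\/[^\n]*|\/\*[\s\S]*?\*\/|\s+)|("(?:[^"\\\n]|\\.)*"|'(?:[^'\\\n]|\\.)*'|[A-Za-z_$][A-Za-z0-9_$]*|\d+|\S)/g;

// Hypothetical helper: returns the source with <BOM>s removed, or throws a
// SyntaxError if removing a <BOM> would join characters of the same token.
function checkBoms(source) {
  var stripped = "";
  var bomOffsets = []; // offsets in `stripped` where a <BOM> was removed
  for (var i = 0; i < source.length; i++) {
    if (source[i] === BOM) {
      bomOffsets.push(stripped.length);
    } else {
      stripped += source[i];
    }
  }

  var match;
  LEXEME.lastIndex = 0;
  while ((match = LEXEME.exec(stripped)) !== null) {
    if (match[2] === undefined) continue; // comment or whitespace: <BOM> is allowed
    var start = match.index;
    var end = start + match[0].length;
    for (var j = 0; j < bomOffsets.length; j++) {
      // Strictly inside the token: characters on both sides belong to it.
      if (bomOffsets[j] > start && bomOffsets[j] < end) {
        throw new SyntaxError("<BOM> inside token " + JSON.stringify(match[0]));
      }
    }
  }
  return stripped;
}

console.log(checkBoms("\uFEFFvar x = 1;"));    // leading <BOM> is ignored
console.log(checkBoms("/* a\uFEFFb */ x"));    // <BOM> inside a comment is ignored
try {
  checkBoms('"a\uFEFFb"');                     // <BOM> inside a string literal
} catch (e) {
  console.log(e.name);
}
```

Even in this toy form the check needs token boundaries before it can decide whether a given <BOM> is legal, so it cannot be done as a simple pre-lexing filter.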
>>>
>>> It is useful to allow other format-control characters in source text
>>> to facilitate editing and display. Format-control characters other
>>> than <BOM> may be used within comments, string literals, and
>>> regular expression literals. Two specific format-control characters,
>>> <ZWNJ> and <ZWJ>, may also be used in an identifier after the first
>>> character.
>>>
>>>
>>>  Code Unit Value    Name                                Formal name
>>>  ------------------------------------------------------------------
>>>  \u200C             Zero width non-joiner               <ZWNJ>
>>>  \u200D             Zero width joiner                   <ZWJ>
>>>  \uFEFF             Byte order mark (also called
>>>                       zero-width non-breaking space)    <BOM>
>>>
>>>
>>> Changes to section 7.6:
>>>
>>> [...] This standard specifies specific character additions: The
>>> dollar sign ($) and the underscore (_) are permitted anywhere in
>>> an identifier. <ZWNJ> and <ZWJ> are permitted after the first
>>> character.
>>>
>>>
>>> Changes to section 7.8.5:
>>>
>>> RegularExpressionNonTerminator ::
>>>   SourceCharacter but not LineTerminator or <BOM>
>>>
>>>
>>> Changes to Annex A:
>>> - update all productions changed above.
>>>
>>>
>>> Changes to Annex E:
>>> - add to the entry for section 7.1:
>>>    <BOM> characters are ignored between tokens and in comments,
>>>    but are not allowed within tokens (including string and
>>>    regular expression literals). <ZWNJ> and <ZWJ> are significant
>>>    within identifiers rather than being stripped.
>>>
>>> - delete the entries for sections 7.2 and 15.10.2.12.
>>>
>>>  (Reverting the additions of <NEL>, <ZWSP>, and <BOM> to the
>>>  WhiteSpace production also reverts this for the \s character
>>>  class, without any explicit change to section 15.10.2.12.)
>>>
>>> --
>>> David-Sarah Hopwood  ⚥  http://davidsarah.livejournal.com
>>>
>>> _______________________________________________
>>> es5-discuss mailing list
>>> es5-discuss at mozilla.org
>>> https://mail.mozilla.org/listinfo/es5-discuss
>


