Last call for consensus on format-control char. issues
brendan at mozilla.org
Wed Jun 17 10:30:17 PDT 2009
We've dealt with users wishing for ZWNJ in string literals too. See
resolved bug https://bugzilla.mozilla.org/show_bug.cgi?id=274152 -- I
do not see why we should regress this for such users.
On Jun 17, 2009, at 10:11 AM, Allen Wirfs-Brock wrote:
> Waldemar's notes from the May F2F don't record any decision on the
> issue of <ZWNJ> and <ZWJ> in identifiers. However, my personal
> notes say that I need to "keep in identifiers and fix grammar" which
> is also my recollection of what we decided at the meeting.
> The simplest implementation of that decision is to add
> <ZWNJ> and <ZWJ> as alternatives for IdentifierPart. In addition,
> the text in section 7.1 that says that format control characters can
> occur in identifiers presumably needs to be narrowed to say only
> <ZWNJ> and <ZWJ>.
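
The "simple solution" described above can be sketched as a character-class check. This is only an illustration under stated assumptions: the helper name isIdentifierName and the two regular expressions are hypothetical, \uXXXX escapes and reserved words are ignored, and the Unicode property escapes reflect whatever Unicode version the host engine ships rather than the version the draft cites.

```javascript
// Sketch only: approximates IdentifierName with <ZWNJ> and <ZWJ> added as
// IdentifierPart alternatives. All names here are hypothetical.

// IdentifierStart: $, _, or a UnicodeLetter (Lu, Ll, Lt, Lm, Lo, Nl).
const identifierStart = /^[$_\p{Lu}\p{Ll}\p{Lt}\p{Lm}\p{Lo}\p{Nl}]$/u;

// IdentifierPart: IdentifierStart plus combining marks (Mn, Mc), digits (Nd),
// connector punctuation (Pc), and the two format-control characters
// U+200C <ZWNJ> and U+200D <ZWJ>. No other Cf character is accepted.
const identifierPart =
  /^[$_\p{Lu}\p{Ll}\p{Lt}\p{Lm}\p{Lo}\p{Nl}\p{Mn}\p{Mc}\p{Nd}\p{Pc}\u200C\u200D]$/u;

function isIdentifierName(text) {
  const chars = Array.from(text); // iterate by code point
  if (chars.length === 0 || !identifierStart.test(chars[0])) return false;
  return chars.slice(1).every((ch) => identifierPart.test(ch));
}
```

With this sketch, isIdentifierName("foo\u200Cbar") is accepted, while a leading <ZWNJ>, an embedded <BOM> (U+FEFF), or any other Cf character such as U+200E is rejected.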
> At about the same time as the F2F David-Sarah made a more
> comprehensive proposal (duplicated below) that in addition to
> addressing <ZWNJ> and <ZWJ> also significantly refines the rules for
> <BOM>, including excluding them from string literals and regular
> expressions and making it a syntax error for a <BOM> to appear
> within an identifier.
> I'm not a Unicode expert, but my sense is that David-Sarah's
> proposal is sound and probably consistent with the original goals of
> cleaning up class Cf in the specification. However, his rules for
> <BOM> also seem like they could significantly complicate the lexical
> analysis phase of implementations.
> My sense from the F2F is that the consensus was more in the
> direction of my simple solution above (<ZWNJ> and <ZWJ> in
> identifiers, <BOM> is whitespace) rather than David-Sarah's more
> comprehensive treatment of <BOM>.
> I need to have a final decision on this so I can update the draft
> accordingly. Based upon my recollection of the F2F I'm going to go
> with the "simple solution" unless there is apparent consensus otherwise.
> Final thoughts?
>> -----Original Message-----
>> From: es5-discuss-bounces at mozilla.org [mailto:es5-discuss-
>> bounces at mozilla.org] On Behalf Of David-Sarah Hopwood
>> Sent: Thursday, May 28, 2009 5:44 PM
>> To: es5-discuss at mozilla.org
>> Subject: Grammar for IdentifierName does not allow <ZWNJ> and <ZWJ>
>> John Cowan wrote:
>>> David-Sarah Hopwood scripsit:
>>>> The omission of format-control characters from <IdentifierName>
>>>> seems to be just an oversight.
>> Indeed, I had forgotten that we had already discussed this and come
>> to a different conclusion:
>>> Allowing all of them causes the same kinds of problems as allowing
>>> BOM. Most of them have little visible effect on the surrounding
>>> text (especially Latin-script text) even in fully conformant Unicode
>>> renderers, never mind renderers that muffle them. The result is that
>>> "foobar" and "foo<Cf>bar" look the same but aren't.
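
The "look the same but aren't" point can be seen directly in a short sketch. U+200E LEFT-TO-RIGHT MARK stands in for <Cf> here as an arbitrary choice, and stripCf is a hypothetical helper, not something proposed in the thread.

```javascript
// Two strings that usually render identically but compare as different.
const plain = "foobar";
const withCf = "foo\u200Ebar"; // U+200E LEFT-TO-RIGHT MARK, category Cf

console.log(plain === withCf); // false
console.log(plain.length, withCf.length); // 6 7

// Hypothetical helper: remove every Cf character so the strings compare equal.
function stripCf(s) {
  return s.replace(/\p{Cf}/gu, "");
}

console.log(stripCf(withCf) === plain); // true
```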
>>> Per Unicode 5.1, the only ones that actually affect the natural-language
>>> meaning of identifiers are U+200C ZWNJ and U+200D ZWJ. These are the only
>>> ones which should even be considered in ES5 identifiers. UAX #31 (which
>>> is included by reference in Unicode 5.1) specifies narrower conditions
>>> in which ZWNJ and ZWJ are essential; sticking to the conditions is
>>> non-trivial, but minimizes the chance of spoofing.
>>> Given the risks, I'm uncertain whether ZWNJ and ZWJ should be allowed
>>> or not.
>> Forget trying to minimize identifier spoofing as a security risk. It is
>> not possible, if Unicode identifiers are to be allowed at all. It is an
>> inherent characteristic of Unicode that many distinct (even when
>> strings will look the same. It is not at all clear that this is a
>> security risk for general programming -- as opposed to situations that
>> require adversarial code review, which full ECMAScript is a long way
>> from being able to support.
>> What is useful to attempt to minimize is the chance of accidentally
>> typing identifiers that are distinct but look the same, or of seeing an
>> identifier and being unable to reliably reproduce it. This is a usability
>> issue, not a security issue.
>> For usability, it may indeed be a good approach to allow <ZWNJ> and <ZWJ>
>> but disallow other format-control characters. I am not sufficiently
>> familiar with the scripts that require these characters to be sure of
>> that, but it seems reasonable based on their descriptions in the
>> However, the complicated script-dependent rules described in UAX #31 for
>> restricting the contexts in which <ZWNJ> and <ZWJ> can occur, seem
>> over-the-top given the impossibility of preventing spoofing. Again,
>> Combining the proposal from that post with the changes for <NEL>,
>> <ZWSP> and <BOM> (since both affect section 7.1), we end up with
>> Changes to section 7.2:
>> - revert the addition of <NEL>, <ZWSP>, and <BOM> to WhiteSpace and
>> to the table.
>> Changes to section 7.8.4:
>> DoubleStringCharacter ::
>> SourceCharacter but not double-quote " or backslash \ or
>> LineTerminator or <BOM>
>> \ EscapeSequence
>> SingleStringCharacter ::
>> SourceCharacter but not single-quote ' or backslash \ or
>> LineTerminator or <BOM>
>> \ EscapeSequence
>> NonEscapeCharacter ::
>> SourceCharacter but not EscapeCharacter or LineTerminator or <BOM>
>> * The CV of DoubleStringCharacter :: SourceCharacter but not
>> double-quote " or backslash \ or LineTerminator or <BOM>
>> is the SourceCharacter character itself.
>> * The CV of SingleStringCharacter :: SourceCharacter but not
>> single-quote ' or backslash \ or LineTerminator or <BOM>
>> is the SourceCharacter character itself.
>> * The CV of NonEscapeCharacter :: SourceCharacter but not
>> EscapeCharacter or LineTerminator or <BOM> is the
>> SourceCharacter character itself.
>> Replace section 7.1:
>> 7.1 Unicode Format-Control Characters
>> The Unicode format-control characters (i.e., the characters in
>> General Category "Cf" in the Unicode Character Database such as
>> LEFT-TO-RIGHT MARK or RIGHT-TO-LEFT MARK) are control codes used to
>> control the formatting of a range of text in the absence of
>> higher-level protocols for this, such as mark-up languages.
>> <BOM> is a format-control character used primarily at the start of
>> a text to mark it as Unicode and to allow detection of the text's
>> encoding and byte order. <BOM> characters intended for this purpose
>> can sometimes also appear after the start of a text, for example as
>> a result of concatenating files.
>> In ECMAScript source, <BOM> characters are ignored if they appear
>> immediately before or after a token, or within a span of consecutive
>> WhiteSpace characters (7.2). The lexical grammar does not explicitly
>> include such ignored <BOM> characters. It is a syntax error for a
>> <BOM> character to appear within a token (that is, if removing the
>> <BOM> would result in the preceding and following characters being
>> part of the same token).
>> Note that comments are not tokens, and so the above rule allows
>> <BOM> characters to appear within comments. It does not allow them
>> to appear within string literals or regular expression literals (the
>> escape sequence \uFEFF should be used instead).
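
The proposed <BOM> rule (ignored next to tokens, in whitespace, and in comments; a syntax error inside a token) can be sketched with a toy scanner. Everything here is an assumption made for illustration: the function name checkBoms, the simplified token pattern (words, integers, double-quoted strings, single-character punctuators, line comments), and the strategy of stripping <BOM> first and then testing whether each removed <BOM> fell strictly inside a token. It is not a conforming ECMAScript lexer.

```javascript
const BOM = "\uFEFF";

// Toy lexical pattern. Groups 1 (line comment) and 2 (whitespace) are
// scanned but are not tokens; everything else counts as a token.
const pattern =
  /(\/\/[^\n]*)|(\s+)|("(?:[^"\\\n]|\\.)*")|([A-Za-z_$][A-Za-z0-9_$]*|[0-9]+)|(.)/g;

function checkBoms(source) {
  // Remove each <BOM>, remembering the offset it had in the stripped text.
  const bomOffsets = [];
  let stripped = "";
  for (const ch of source) {
    if (ch === BOM) bomOffsets.push(stripped.length);
    else stripped += ch;
  }

  // Tokenize the stripped text, recording [start, end) for real tokens only.
  const tokens = [];
  for (const m of stripped.matchAll(pattern)) {
    const isToken = m[1] === undefined && m[2] === undefined;
    if (isToken) {
      tokens.push({ start: m.index, end: m.index + m[0].length, text: m[0] });
    }
  }

  // A <BOM> strictly inside a token means that removing it joined the
  // preceding and following characters into one token: a syntax error.
  for (const offset of bomOffsets) {
    const hit = tokens.find((t) => t.start < offset && offset < t.end);
    if (hit) throw new SyntaxError(`<BOM> inside token ${hit.text}`);
  }
  return tokens.map((t) => t.text);
}
```

Under these assumptions, checkBoms("\uFEFFvar x = 1;") and checkBoms("// a\uFEFFb\nx") succeed, while checkBoms("va\uFEFFr x") and checkBoms('x = "a\uFEFFb"') throw a SyntaxError. The strict inequality is what lets a <BOM> sitting exactly on a token boundary pass.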
>> It is useful to allow other format-control characters in source text
>> to facilitate editing and display. Format-control characters other
>> than <BOM> may be used within comments, string literals, and
>> regular expression literals. Two specific format-control characters,
>> <ZWNJ> and <ZWJ>, may also be used in an identifier after the first
>> character.
>>
>> Code Unit Value   Name                             Formal name
>> \u200C            Zero width non-joiner            <ZWNJ>
>> \u200D            Zero width joiner                <ZWJ>
>> \uFEFF            Byte order mark (also called     <BOM>
>>                   zero-width non-breaking space)
>> Changes to section 7.6:
>> [...] This standard specifies specific character additions: The
>> dollar sign ($) and the underscore (_) are permitted anywhere in
>> an identifier. <ZWNJ> and <ZWJ> are permitted after the first
>> character.
>> Changes to section 7.8.5:
>> RegularExpressionNonTerminator ::
>> SourceCharacter but not LineTerminator or <BOM>
>> Changes to Annex A:
>> - update all productions changed above.
>> Changes to Annex E:
>> - add to the entry for section 7.1:
>> <BOM> characters are ignored between tokens and in comments,
>> but are not allowed within tokens (including string and
>> regular expression literals). <ZWNJ> and <ZWJ> are significant
>> within identifiers rather than being stripped.
>> - delete the entries for sections 7.2 and 15.10.2.12.
>> (Reverting the additions of <NEL>, <ZWSP>, and <BOM> to the
>> WhiteSpace production also reverts this for the \s character
>> class, without any explicit change to section 15.10.2.12.)
>> David-Sarah Hopwood ⚥ http://davidsarah.livejournal.com
>> es5-discuss mailing list
>> es5-discuss at mozilla.org