invalid escape sequences

Mike Samuel mikesamuel at gmail.com
Tue May 31 18:33:54 PDT 2011


During the last meeting, the semantics of "\z" came up.  Specifically,
what does \ followed by a character not in the set with a specified
escape expand to?

From 7.8.4 StringLiteral

    "
    EscapeSequence :: CharacterEscapeSequence
    "

leads to

    "
    CharacterEscapeSequence :: ...
        NonEscapeCharacter

    NonEscapeCharacter :: SourceCharacter but not one of
        EscapeCharacter or LineTerminator
    "

and the semantics of NonEscapeCharacter is given thus

    "
    The CV of CharacterEscapeSequence :: NonEscapeCharacter is the CV
    of the NonEscapeCharacter.
    "
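
Read literally, that rule gives "\z" the value "z".  A minimal check,
assuming an interpreter with eval (the variable name is only for
illustration):

```javascript
// Build the source text "\z" and let the interpreter parse it as a
// string literal.  eval is used so the escape is processed at run time.
var decoded = eval('"\\z"');

// Per the quoted rule, the CV of \z is the CV of z itself.
var matchesRule = decoded === "z";
```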

so are the following assertions true?

(1)

The only SourceCharacter sequences that do not match (
DoubleStringCharacter | SingleStringCharacter ) applied one or more
times are: a LineTerminator not preceded by an odd number of
backslashes; "u" preceded by an odd number of backslashes and not
followed by 4 valid hex digits; "x" preceded by an odd number of
backslashes and not followed by 2 valid hex digits; or a decimal
digit preceded by an odd number of backslashes.
I.e. /(?:^|[^\\])(?:\\\\)*([\r\n\u2028\u2029]|\\u(?![0-9A-Fa-f]{4})|\\x(?![0-9A-Fa-f]{2})|\\[0-9])/
matches exactly when a sequence of SourceCharacters fails to match
zero or more ( DoubleStringCharacter | SingleStringCharacter ).
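
As a rough sketch of how the test in (1) behaves, here is the regex
applied to a few literal bodies (the text between the quotes).  The
capture group is closed so that the pattern parses; the names
INVALID_STRICT and looksInvalidStrict are made up for this example.

```javascript
// The regex from assertion (1): a match means the body is NOT a valid
// run of DoubleStringCharacter | SingleStringCharacter.
var INVALID_STRICT = /(?:^|[^\\])(?:\\\\)*([\r\n\u2028\u2029]|\\u(?![0-9A-Fa-f]{4})|\\x(?![0-9A-Fa-f]{2})|\\[0-9])/;

// Hypothetical helper: true when the body would be rejected.
function looksInvalidStrict(body) {
  return INVALID_STRICT.test(body);
}

// Each entry is the raw body, so "\\u" is a backslash followed by u.
var strictSamples = {
  rejected: ["\\u", "\\x", "\\8", "a\nb"],
  accepted: ["\\u0041", "\\x41", "\\z", "\\\\u", "\\\n"]
};
```

Note that "\\\\u" (an escaped backslash followed by a plain u) and
"\\\n" (a line continuation) are accepted, since the backslash parity
is what decides whether the following character is being escaped.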

(2)

The B.1.2 additional octal syntax, quoted below, does change the
validity of the test above.
    "
    OctalEscapeSequence :: OctalDigit [lookahead not in DecimalDigit]
        ZeroToThree OctalDigit [lookahead not in DecimalDigit]
        FourToSeven OctalDigit
        ZeroToThree OctalDigit OctalDigit
    "

NonEscapeCharacter excludes DecimalDigit through EscapeCharacter,
but OctalEscapeSequence allows [0-7].  So under B.1.2,
/(?:^|[^\\])(?:\\\\)*([\r\n\u2028\u2029]|\\u(?![0-9A-Fa-f]{4})|\\x(?![0-9A-Fa-f]{2})|\\[89]|\\[0-3][0-7]?(?=[89])|\\[4-7](?=[89]))/
matches exactly when a sequence of SourceCharacters fails to match
zero or more ( DoubleStringCharacter | SingleStringCharacter ).
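
A sketch of the B.1.2 variant applied to the nine test inputs used
below.  It assumes the lookahead after \\x is meant to close right
after {2}, and that the octal alternatives are meant to fire only when
an 8 or 9 follows the short octal forms, which is my reading of the
lookahead restrictions in the grammar quoted above.  INVALID_B12 and
looksInvalidB12 are names invented for this example.

```javascript
// A match means the body is NOT a valid string literal body under B.1.2.
var INVALID_B12 = /(?:^|[^\\])(?:\\\\)*([\r\n\u2028\u2029]|\\u(?![0-9A-Fa-f]{4})|\\x(?![0-9A-Fa-f]{2})|\\[89]|\\[0-3][0-7]?(?=[89])|\\[4-7](?=[89]))/;

// Hypothetical helper: true when the body would be rejected.
function looksInvalidB12(body) {
  return INVALID_B12.test(body);
}

// Raw bodies: "\\28" is a backslash followed by the characters 2 and 8.
var b12Rejected = ["\r", "\\u", "\\x", "\\8", "\\28", "\\228"];
var b12Accepted = ["\\3778", "\\478", "\\778"];
```

The three accepted bodies each begin with a maximal-length octal escape
(ZeroToThree OctalDigit OctalDigit, or FourToSeven OctalDigit), which
carries no lookahead restriction, so the trailing 8 is just an
ordinary character.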



I did some empirical testing to see what is actually allowed by
running the code below in the squarefree shell in a variety of
browsers.

var notStringLiterals = [ "\r", "\\u", "\\x", "\\8", "\\28", "\\228",
"\\3778", "\\478", "\\778" ];
for (var i = 0; i < notStringLiterals.length; ++i) {
  var result;
  try {
    result = eval('"' + notStringLiterals[i] + '"');
  } catch (ex) {
    result = "ERROR";
  }
  print(JSON.stringify(notStringLiterals[i]) + " : " + JSON.stringify(result));
}

All are invalid absent B.1.2 if the assertions above are true.  With
B.1.2, "\3778", "\478", and "\778" are valid.

I'm having trouble running IE today, but on other browsers, in
alphabetical order:

Chrome
"\r" : "ERROR"
"\\u" : "u"
"\\x" : "x"
"\\8" : "8"
"\\28" : "\u00028"
"\\228" : "\u00128"
"\\3778" : "ÿ8"
"\\478" : "'8"
"\\778" : "?8"


FF3
"\u000d" : "ERROR"
"\\u" : "u"
"\\x" : "x"
"\\8" : "8"
"\\28" : "\u00028"
"\\228" : "\u00128"
"\\3778" : "ÿ8"
"\\478" : "'8"
"\\778" : "?8"


Safari
"\r" : "ERROR"
"\\u" : "u"
"\\x" : "x"
"\\8" : "8"
"\\28" : "\u00028"
"\\228" : "\u00128"
"\\3778" : "ÿ8"
"\\478" : "'8"
"\\778" : "?8"


So at least 3 different interpreter strains evaluate "\u" to "u", "\x"
to "x", and "\8" to "8", and don't care whether a decimal digit
follows an octal escape sequence.  All reject unescaped newlines in
string literals.
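
The observed behaviour can be summarised as a small decoder.  This is
only a sketch of what the three browsers appear to do for the inputs
tested, not anything the spec defines; decodeLenient is a hypothetical
name, and the line continuation branch follows ES5 rather than the
test results above.

```javascript
// Hypothetical helper: decodes a string literal body leniently.
// Returns null for an unescaped LineTerminator or a trailing backslash.
function decodeLenient(body) {
  var singleEscapes = { b: "\b", t: "\t", n: "\n", v: "\v", f: "\f", r: "\r" };
  var out = "";
  for (var i = 0; i < body.length; ++i) {
    var ch = body.charAt(i);
    if (/[\r\n\u2028\u2029]/.test(ch)) { return null; }  // unescaped newline
    if (ch !== "\\") { out += ch; continue; }
    var rest = body.substring(i + 1);
    var m;
    if ((m = /^u([0-9A-Fa-f]{4})/.exec(rest)) ||
        (m = /^x([0-9A-Fa-f]{2})/.exec(rest))) {
      // Well-formed \uXXXX or \xXX.
      out += String.fromCharCode(parseInt(m[1], 16));
    } else if ((m = /^(?:[0-3][0-7]{0,2}|[4-7][0-7]?)/.exec(rest))) {
      // Longest octal escape, regardless of what digit follows it.
      out += String.fromCharCode(parseInt(m[0], 8));
    } else if ((m = /^(?:\r\n|[\r\n\u2028\u2029])/.exec(rest))) {
      // ES5 LineContinuation contributes nothing (not covered by the tests).
    } else if (rest.length) {
      // Everything else, including a bare \u, \x, \8 or \9, decodes to
      // the named single escape or to the character itself.
      m = [rest.charAt(0)];
      out += singleEscapes.hasOwnProperty(m[0]) ? singleEscapes[m[0]] : m[0];
    } else {
      return null;  // lone backslash at the end of the body
    }
    i += m[0].length;  // skip the characters consumed after the backslash
  }
  return out;
}
```

For example, decodeLenient("\\3778") consumes \377 as one octal escape
and then appends the 8, matching the "ÿ8" reported by all three
browsers.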


I would like to be able to specify quasiliteral literal part decoding
in terms of the SV defined in 7.8.4.  If user code is going to have
decoded literal parts available when they validly decode, but at least
have access to the raw literal parts otherwise, then it would be good
for the decoded parts to be consistently available across
interpreters.  Would it be worthwhile having the SV and CV in 7.8.4
specify the decoding of some SourceCharacter sequences that cannot
actually reach the SV or CV via the StringLiteral production?

