Q: Lonely surrogates and unicode regexps
allen at wirfs-brock.com
Wed Jan 28 08:11:08 PST 2015
On Jan 28, 2015, at 2:36 AM, Marja Hölttä <marja at chromium.org> wrote:
> Hello es-discuss,
> TL;DR: /foo.bar/u.test("foo\uD83Dbar") == ?
> The ES6 unicode regexp spec is not very clear regarding what should happen if the regexp or the matched string contains lonely surrogates (a lead surrogate without a trail, or a trail without a lead). For example, for the . operator, the relevant parts of the spec speak about characters:
TL;DR: in a unicode regexp lonely surrogates are considered to be a single “character”.
As André has already covered, “character” has a very specific meaning within the context of the ES6 RegExp specification; see the second paragraph of http://people.mozilla.org/~jorendorff/es6-draft.html#sec-pattern-semantics . The specification uses the same set of algorithms to describe both BMP (i.e., 16-bit elements) and Unicode (i.e., 32-bit elements) patterns and matching semantics. “Character” is used in those algorithms to refer to a single element of the mode that is currently in effect.
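To make the two modes concrete, here is a small sketch (not taken from the spec) of how the same string divides into "characters" with and without the u flag. U+1F600 is an arbitrary supplementary code point chosen for illustration, and the results assume an engine that implements the u flag.

```javascript
// "\uD83D\uDE00" is the UTF-16 encoding of U+1F600: two code units, one code point.
const s = "\uD83D\uDE00";

// BMP mode: every 16-bit code unit is a "character", so "." consumes half of the pair.
const bmpOne = /^.$/.test(s);   // false: the string is two characters long
const bmpTwo = /^..$/.test(s);  // true

// Unicode mode: every code point is a "character".
const uniOne = /^.$/u.test(s);  // true
const uniTwo = /^..$/u.test(s); // false

console.log(bmpOne, bmpTwo, uniOne, uniTwo);
```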
I think the ambiguity you find is in step 2.1 of http://people.mozilla.org/~jorendorff/es6-draft.html#sec-pattern :
2. Return an internal closure that takes two arguments, a String str and an integer index, and performs the following:
1. If Unicode is true, let Input be a List consisting of the sequence of code points of str interpreted as a UTF-16 encoded Unicode string. Otherwise, let Input be a List consisting of the sequence of code units that are the elements of str. Input will be used throughout the algorithms in 21.2.2. Each element of Input is considered to be a character.
Apparently I don’t have an adequate definition of “interpreted as a UTF-16 encoded Unicode string”. If you submit a bug to bugs.ecmascript.org, I will provide one in the next spec revision. The intended semantics is that:
In ascending string index order:
  Each valid UTF-16 surrogate pair is interpreted as a single code point, namely the code point that the pair encodes in UTF-16.
  Each “lonely” surrogate is interpreted as a single code point whose value is the surrogate value.
  Every other 16-bit code unit is interpreted as a single code point.
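The three rules can be sketched as a decoding loop. This is only an illustration of the intended semantics, not spec text; toCodePoints is a hypothetical helper name.

```javascript
// Hypothetical helper: split a string into code points following the three rules above.
function toCodePoints(str) {
  const out = [];
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    const isLead = unit >= 0xD800 && unit <= 0xDBFF;
    if (isLead && i + 1 < str.length) {
      const next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        // Rule 1: a valid surrogate pair becomes one supplementary code point.
        out.push((unit - 0xD800) * 0x400 + (next - 0xDC00) + 0x10000);
        i++; // skip the trail surrogate that was just consumed
        continue;
      }
    }
    // Rules 2 and 3: a lonely surrogate or any other code unit is its own code point.
    out.push(unit);
  }
  return out;
}

const hex = (s) => toCodePoints(s).map((c) => c.toString(16));
console.log(hex("a\uD83D\uDE00b")); // [ '61', '1f600', '62' ]
console.log(hex("a\uD83Db\uDE00")); // [ '61', 'd83d', '62', 'de00' ]
```

In engines that provide the ES6 string iterator, Array.from(str, c => c.codePointAt(0)) should produce the same list, since the iterator also yields a lonely surrogate as a single element.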
> “Let A be the set of all *characters* except LineTerminator.”
> “Let ch be the *character* Input[e].”
> But is a lonely surrogate a character? According to the Unicode standard, it’s not. If it's not, what will ch be if the input string contains a lonely surrogate in the relevant position?
> Q1: Are lonely surrogates allowed in /u regexps?
> E.g., /foo\uD83D/u; (note lonely lead surrogate), should this be allowed? Will it match a lead surrogate inside a surrogate pair?
> Suggestion: we shouldn't allow lonely surrogates in /u regexps.
> If users actually want to match lonely surrogates (e.g., to check for them or remove them) then they can use non-/u regexps.
> The regexp syntax treats a lonely surrogate as a normal unicode escape, and the rules say e.g., "The production RegExpUnicodeEscapeSequence :: u Hex4Digits evaluates as follows: Return the character whose code is the SV of Hex4Digits." - it's also unclear what this means if no valid character has this code.
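Reading Q1 against the intended semantics given earlier in this message (a lonely surrogate is a single code point), a lonely \uD83D in a /u pattern would be an ordinary one-character atom: it matches a lonely U+D83D in the input, but not the lead half of a valid pair. A sketch of what that looks like, assuming an engine that implements the u flag this way:

```javascript
const re = /foo\uD83D/u;

// The input ends with a lonely lead surrogate: the pattern character matches it.
const matchesLonely = re.test("foo\uD83D");            // true

// D83D DC00 is one code point (U+1F400), which is not U+D83D.
const matchesInsidePair = re.test("foo\uD83D\uDC00");  // false

// Without the u flag, matching is per code unit, so the lead half matches.
const nonUnicode = /foo\uD83D/.test("foo\uD83D\uDC00"); // true

console.log(matchesLonely, matchesInsidePair, nonUnicode);
```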
> Q2: If the string contains a lonely surrogate, what should it match? Should it match .? Should it match [^a] ? (Or is it undefined behavior?)
> Test cases:
> /foo.bar/u.test("foo\uD83Dbar") == ?
> /foo.bar/u.test("foo\uDC00bar") == ?
> /foo[^a]bar/u.test("foo\uD83Dbar") == ?
> /foo[^a]bar/u.test("foo\uDC00bar") == ?
> /foo/u.test("bar\uD83Dbarfoo") == ?
> /foo/u.test("bar\uDC00barfoo") == ?
> /foo(.*)bar\1/u.test("foo\uD834bar\uD834\uDC00") == ? // Should the backreference be allowed to match the lead surrogate of a surrogate pair?
> /^(.+)\1$/u.test("\uDC00foobar\uD83D\uDC00foobar\uD83D") == ?? // Should we allow splitting the surrogate pair like this?
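Applying the "lonely surrogate is a single character" rule to these test cases gives the expected values below. They are derived by hand from the rules stated earlier in this message, not quoted from the spec, so the sketch compares them against whatever the running engine actually returns.

```javascript
// Each entry: [label, result in the running engine, value expected under the rules above].
const cases = [
  ["foo.bar, lonely lead",      /foo.bar/u.test("foo\uD83Dbar"), true],
  ["foo.bar, lonely trail",     /foo.bar/u.test("foo\uDC00bar"), true],
  ["foo[^a]bar, lonely lead",   /foo[^a]bar/u.test("foo\uD83Dbar"), true],
  ["foo[^a]bar, lonely trail",  /foo[^a]bar/u.test("foo\uDC00bar"), true],
  ["foo after lonely lead",     /foo/u.test("bar\uD83Dbarfoo"), true],
  ["foo after lonely trail",    /foo/u.test("bar\uDC00barfoo"), true],
  // \1 is the lonely U+D834; the input there holds the single code point D834 DC00.
  ["backreference into a pair", /foo(.*)bar\1/u.test("foo\uD834bar\uD834\uDC00"), false],
  // The input has 15 code points (the middle D83D DC00 is one), so it cannot be X followed by X.
  ["splitting a pair",          /^(.+)\1$/u.test("\uDC00foobar\uD83D\uDC00foobar\uD83D"), false],
];

for (const [label, actual, expected] of cases) {
  console.log(label, actual, actual === expected ? "as expected" : "MISMATCH");
}
```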
> Suggestion: a lonely surrogate should not be a character and it should not match . or [^a] etc. However, a lonely surrogate in the input string shouldn't prevent some other part of the string from matching.
> If a lonely surrogate is treated as a character, the matching rule for . gets complicated and difficult / slow to implement: . should not match individual surrogates inside a surrogate pair, but if it has to match a lonely surrogate, we'll end up needing lookahead and lookbehind logic to implement that behavior.
> For example, the current version of Mathias’s ES6 Unicode regular expression transpiler ( https://mothereff.in/regexpu ) converts /a.b/u into /a(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])b/ and afaics it’s not yet fully consistent wrt lonely surrogates, so, a consistent implementation is going to be more complex than this.
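The inconsistency can be probed by running the transpiled pattern quoted above next to a native /a.b/u. This is a sketch: the pattern is copied from the message as-is, and the explanation in the comments is a reading of that pattern, not a statement about the current regexpu output.

```javascript
// Pattern copied verbatim from the message above.
const transpiled = /a(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])b/;
const native = /a.b/u;

const inputs = [
  "a\uD83D\uDC00b", // valid pair between a and b
  "a\uD83Db",       // lonely lead
  "a\uDC00b",       // lonely trail
  "a\uDC00\uDC00b", // two lonely trails
];

// The last alternative consumes the code unit before the trail surrogate, so
// directly after the literal "a" it needs two code units instead of one.
const results = inputs.map((s) => [transpiled.test(s), native.test(s)]);
results.forEach(([t, n], i) => console.log("input", i, "transpiled:", t, "native:", n));
```

Under these assumptions the first two inputs agree, while the lonely-trail inputs differ: the transpiled pattern rejects "a\uDC00b" and accepts "a\uDC00\uDC00b", the opposite of the native /u result.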
> If we convert the string into UTF-32 before matching, then the "lonely surrogate is a character" behavior gets easier to implement, but we wouldn't want to be forced to do that. The intention behind the ES6 spec seems to be that strings can / should still be stored as UTF-16. Converting strings to UTF-32 before matching with /u regexps would require an additional pass over the string which we'd want to avoid, and converting only when strictly needed for the "lonely surrogate is a character" implementation adds complexity. E.g., with some regexps we don't need to scan the whole input string to find a match, and also most input strings, even for /u regexps, probably won't contain surrogates (to find that out we'd also need to scan the whole string, or some logic to fall back to UTF-32 matching when we see a surrogate).
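The trade-off described here can be pictured with two small helpers. This is purely illustrative user-level JavaScript, not engine code; hasSurrogate and codePointsFrom are hypothetical names.

```javascript
// Hypothetical eager check: needs a full pass over the string, which is the
// extra scan the message would like to avoid.
function hasSurrogate(str) {
  for (let i = 0; i < str.length; i++) {
    const u = str.charCodeAt(i);
    if (u >= 0xD800 && u <= 0xDFFF) return true;
  }
  return false;
}

// Hypothetical lazy decoder: yields code points one at a time from a code-unit
// index, so no converted copy of the whole string is built up front.
// The caller is assumed to pass an index that is not in the middle of a pair.
function* codePointsFrom(str, index) {
  while (index < str.length) {
    const cp = str.codePointAt(index); // a lonely surrogate comes back as its own value
    yield cp;
    index += cp > 0xFFFF ? 2 : 1;      // supplementary code points occupy two code units
  }
}

console.log(hasSurrogate("foobar"), hasSurrogate("foo\uD83Dbar"));
console.log(Array.from(codePointsFrom("a\uD83D\uDC00\uD83Db", 0), (c) => c.toString(16)));
```

A matcher built on the lazy form only decodes the part of the input it actually visits, at the cost of doing the pair check at every step.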