Q: Lonely surrogates and unicode regexps
mathias at qiwi.be
Wed Jan 28 02:45:38 PST 2015
> On 28 Jan 2015, at 11:36, Marja Hölttä <marja at chromium.org> wrote:
> TL;DR: /foo.bar/u.test("foo\uD83Dbar") == ?
> The ES6 unicode regexp spec is not very clear regarding what should happen if the regexp or the matched string contains lonely surrogates (a lead surrogate without a trail, or a trail without a lead). For example, for the . operator, the relevant parts of the spec speak about characters:
> “Let A be the set of all *characters* except LineTerminator.”
> “Let ch be the *character* Input[e].”
> But is a lonely surrogate a character? According to the Unicode standard, it’s not. If it's not, what will ch be if the input string contains a lonely surrogate in the relevant position?
> Q1: Are lonely surrogates allowed in /u regexps?
> E.g., should /foo\uD83D/u (note the lonely lead surrogate) be allowed? If so, will it match the lead surrogate inside a surrogate pair?
> Suggestion: we shouldn't allow lonely surrogates in /u regexps.
> If users actually want to match lonely surrogates (e.g., to check for them or remove them) then they can use non-/u regexps.
You’re proposing to define “characters” in terms of Unicode scalar values in the case `/u` is used. I could get behind that — it reinforces the idea that `/u` is like a strict mode for regular expressions.
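For the use case mentioned above (checking for lonely surrogates without relying on `/u`), here is a minimal sketch that scans code units directly. A string consists only of Unicode scalar values exactly when this returns an empty list. The helper name `findLonelySurrogates` is made up for illustration; it is not part of any spec or library.

```javascript
// Hypothetical helper (name is an assumption): returns the code unit
// indices of all lonely surrogates in `str`.
function findLonelySurrogates(str) {
  const indices = [];
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // Lead surrogate: well-formed only if a trail surrogate follows.
      const next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // skip the trail of a well-formed pair
      } else {
        indices.push(i); // lead without a trail
      }
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      indices.push(i); // trail without a lead
    }
  }
  return indices;
}

console.log(findLonelySurrogates("foo\uD83Dbar"));  // [ 3 ]
console.log(findLonelySurrogates("\uD83D\uDC00"));  // []
```

Scanning code units by hand avoids the lookbehind that a non-`/u` regular expression would need to tell a lonely trail surrogate from the second half of a pair.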
> The regexp syntax treats a lonely surrogate as a normal unicode escape, and the rules say e.g., "The production RegExpUnicodeEscapeSequence :: u Hex4Digits evaluates as follows: Return the character whose code is the SV of Hex4Digits." It's also unclear what this means if no valid character has this code.
> Q2: If the string contains a lonely surrogate, what should it match? Should it match .? Should it match [^a] ? (Or is it undefined behavior?)
> Test cases:
> /foo.bar/u.test("foo\uD83Dbar") == ?
> /foo.bar/u.test("foo\uDC00bar") == ?
> /foo[^a]bar/u.test("foo\uD83Dbar") == ?
> /foo[^a]bar/u.test("foo\uDC00bar") == ?
> /foo/u.test("bar\uD83Dbarfoo") == ?
> /foo/u.test("bar\uDC00barfoo") == ?
> /foo(.*)bar\1/u.test("foo\uD834bar\uD834\uDC00") == ? // Should the backreference be allowed to match the lead surrogate of a surrogate pair?
> /^(.+)\1$/u.test("\uDC00foobar\uD83D\uDC00foobar\uD83D") == ? // Should we allow splitting the surrogate pair like this?
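To see what a given engine currently does with the test cases above, a small harness along these lines could be used. This is only a sketch: `cases` and `runCases` are names made up here, and the patterns are built with the `RegExp` constructor (so the backreferences are written as `\\1` in the string sources). The output reflects whatever the engine implements, not what the spec should say.

```javascript
// Each entry is [pattern source, input string], copied from the test cases
// above. All patterns are compiled with the "u" flag.
const cases = [
  ["foo.bar", "foo\uD83Dbar"],
  ["foo.bar", "foo\uDC00bar"],
  ["foo[^a]bar", "foo\uD83Dbar"],
  ["foo[^a]bar", "foo\uDC00bar"],
  ["foo", "bar\uD83Dbarfoo"],
  ["foo", "bar\uDC00barfoo"],
  ["foo(.*)bar\\1", "foo\uD834bar\uD834\uDC00"],
  ["^(.+)\\1$", "\uDC00foobar\uD83D\uDC00foobar\uD83D"],
];

// Hypothetical helper: returns one boolean per test case.
function runCases(list) {
  return list.map(([source, input]) => new RegExp(source, "u").test(input));
}

runCases(cases).forEach((result, i) => {
  console.log("/" + cases[i][0] + "/u ->", result);
});
```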
> Suggestion: a lonely surrogate should not be a character and it should not match . or [^a] etc. However, a lonely surrogate in the input string shouldn't prevent some other part of the string from matching.
> If a lonely surrogate is treated as a character, the matching rule for . gets complicated and difficult / slow to implement: . should not match individual surrogates inside a surrogate pair, but if it has to match a lonely surrogate, we'll end up needing lookahead and lookbehind logic to implement that behavior.
> For example, the current version of Mathias’s ES6 Unicode regular expression transpiler ( https://mothereff.in/regexpu ) converts /a.b/u into /a(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])b/. As far as I can see, this is not yet fully consistent with respect to lonely surrogates, so a consistent implementation is going to be more complex than this.
This is indeed an incomplete solution. The lack of lookbehind support in ES makes this hard to transpile correctly. Ideas welcome!
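To make the gap concrete, here is a small sketch that runs the transpiled pattern quoted above (copied verbatim, without the `u` flag) against a few inputs. The variable name `transpiled` is just for illustration. The last alternative, `(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF]`, emulates a lookbehind by consuming the preceding code unit, which is where the inconsistency comes from.

```javascript
// The transpiled form of /a.b/u, as quoted above.
const transpiled = /a(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])b/;

// A well-formed surrogate pair is matched as a single "character".
console.log(transpiled.test("a\uD83D\uDC00b")); // true

// A lonely lead surrogate is matched via the negative lookahead.
console.log(transpiled.test("a\uD83Db"));       // true

// A lonely trail surrogate is NOT matched: the emulated lookbehind needs
// to consume a preceding code unit, but "a" has already been consumed.
console.log(transpiled.test("a\uDC00b"));       // false

// Two lonely trail surrogates match as if they were one "character".
console.log(transpiled.test("a\uDC00\uDC00b")); // true
```

So a lonely lead and a lonely trail in the same position are treated differently, and two lonely trails collapse into a single match for `.`.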