Q: Lonely surrogates and unicode regexps
marja at chromium.org
Wed Jan 28 02:36:05 PST 2015
TL;DR: /foo.bar/u.test("foo\uD83Dbar") == ?
The ES6 unicode regexp spec is not very clear regarding what should happen
if the regexp or the matched string contains lonely surrogates (a lead
surrogate without a trail, or a trail without a lead). For example, for the
. operator, the relevant parts of the spec speak about characters:
“Let A be the set of all *characters* except LineTerminator.”
“Let ch be the *character* Input[e].”
But is a lonely surrogate a character? According to the Unicode standard,
it’s not. If it's not, what will ch be if the input string contains a
lonely surrogate in the relevant position?
Q1: Are lonely surrogates allowed in /u regexps?
E.g., should /foo\uD83D/u (note the lonely lead surrogate) be allowed?
If so, will it match the lead surrogate inside a surrogate pair?
Suggestion: we shouldn't allow lonely surrogates in /u regexps.
If users actually want to match lonely surrogates (e.g., to check for them
or remove them) then they can use non-/u regexps.
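To illustrate the non-/u route, here is a minimal sketch of checking for and removing lonely surrogates with plain (non-/u) regexps. The helper names hasLoneSurrogate and stripLoneSurrogates are made up for this example, and the pattern is just one way to write it without lookbehind; it is not taken from the spec.

```javascript
// Hypothetical helpers, for illustration only.
// A lead surrogate (D800-DBFF) not followed by a trail, or a trail
// surrogate (DC00-DFFF) not preceded by a lead. No /u flag, so the
// pattern operates on individual UTF-16 code units.
const LONE_SURROGATE =
  /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF]/;

function hasLoneSurrogate(input) {
  return LONE_SURROGATE.test(input);
}

// Removal keeps well-formed pairs (first alternative, captured and put
// back) and drops every other surrogate code unit.
function stripLoneSurrogates(input) {
  return input.replace(
    /([\uD800-\uDBFF][\uDC00-\uDFFF])|[\uD800-\uDFFF]/g,
    (match, pair) => pair || ""
  );
}
```

For example, hasLoneSurrogate("foo\uD83Dbar") is true, while a string containing only the complete pair "\uD83D\uDE00" gives false.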
The regexp syntax treats a lonely surrogate as a normal unicode escape, and
the rules say e.g., "The production RegExpUnicodeEscapeSequence :: u
Hex4Digits evaluates as follows: Return the character whose code is the SV
of Hex4Digits." - it's also unclear what this means if no valid character
has this code.
Q2: If the string contains a lonely surrogate, what should it match? Should
it match .? Should it match [^a] ? (Or is it undefined behavior?)
/foo.bar/u.test("foo\uD83Dbar") == ?
/foo.bar/u.test("foo\uDC00bar") == ?
/foo[^a]bar/u.test("foo\uD83Dbar") == ?
/foo[^a]bar/u.test("foo\uDC00bar") == ?
/foo/u.test("bar\uD83Dbarfoo") == ?
/foo/u.test("bar\uDC00barfoo") == ?
/foo(.*)bar\1/u.test("foo\uD834bar\uD834\uDC00") == ?
// Should the backreference be allowed to match the lead surrogate of a
// surrogate pair?
/^(.+)\1$/u.test("\uDC00foobar\uD83D\uDC00foobar\uD83D") == ??
// Should we allow splitting the surrogate pair like this?
Suggestion: a lonely surrogate should not be a character and it should not
match . or [^a] etc. However, a lonely surrogate in the input string
shouldn't prevent some other part of the string from matching.
If a lonely surrogate is treated as a character, the matching rule for .
gets complicated and difficult / slow to implement: . should not match
individual surrogates inside a surrogate pair, but if it has to match a
lonely surrogate, we'll end up needing lookahead and lookbehind logic to
implement that behavior.
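A rough sketch of what that lookahead / lookbehind logic could look like when matching . directly on UTF-16 code units, under the "lonely surrogate is a character" interpretation. The function dotWidthAt and its helpers are hypothetical names for this example, not engine or spec code.

```javascript
// Hypothetical sketch: how many code units "." would consume at `index`
// if lonely surrogates count as characters. 0 means "no match here".
const isLead = (u) => u >= 0xD800 && u <= 0xDBFF;
const isTrail = (u) => u >= 0xDC00 && u <= 0xDFFF;
const isLineTerminator = (u) =>
  u === 0x0A || u === 0x0D || u === 0x2028 || u === 0x2029;

function dotWidthAt(input, index) {
  if (index >= input.length) return 0;
  const unit = input.charCodeAt(index);
  if (isLineTerminator(unit)) return 0;
  if (isLead(unit)) {
    // Lookahead: a lead followed by a trail is a full pair (2 units),
    // otherwise it is a lonely lead (1 unit).
    const hasTrail =
      index + 1 < input.length && isTrail(input.charCodeAt(index + 1));
    return hasTrail ? 2 : 1;
  }
  if (isTrail(unit)) {
    // Lookbehind: a trail right after a lead is the second half of a
    // pair, so "." must not match in the middle of it.
    const afterLead = index > 0 && isLead(input.charCodeAt(index - 1));
    return afterLead ? 0 : 1;
  }
  return 1;
}
```

Every surrogate code unit needs an extra neighbor check here, which is the cost the paragraph above is worried about.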
For example, the current version of Mathias’s ES6 Unicode regular
expression transpiler ( https://mothereff.in/regexpu ) converts /a.b/u into
and afaics it’s not yet fully consistent wrt lonely surrogates, so, a
consistent implementation is going to be more complex than this.
If we convert the string into UTF-32 before matching, then the "lonely
surrogate is a character" behavior gets easier to implement, but we
wouldn't want to be forced to do that. The intention behind the ES6 spec
seems to be that strings can / should still be stored as UTF-16. Converting
strings to UTF-32 before matching with /u regexps would require an
additional pass over the string, which we'd want to avoid, and converting
only when strictly needed for the "lonely surrogate is a character"
implementation adds complexity. E.g., with some regexps we don't need to
scan the whole input string to find a match, and most input strings,
even for /u regexps, probably won't contain surrogates (but to find that
out we'd need to scan the whole string, or add some logic to fall back to
UTF-32 matching when we see a surrogate).
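For concreteness, the extra pass in question would look roughly like the sketch below: decode the UTF-16 code units into an array of code points, keeping lonely surrogates as their own entries. toCodePoints is a hypothetical helper for illustration; a real engine would do this internally, if at all.

```javascript
// Hypothetical sketch of the additional pass: UTF-16 code units in,
// one array entry per code point out. Lonely surrogates are kept as-is.
function toCodePoints(input) {
  const out = [];
  for (let i = 0; i < input.length; i++) {
    const unit = input.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < input.length) {
      const next = input.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        // Well-formed pair: combine lead and trail into one code point.
        out.push(((unit - 0xD800) << 10) + (next - 0xDC00) + 0x10000);
        i++; // skip the trail we just consumed
        continue;
      }
    }
    // BMP code unit or lonely surrogate: one entry either way.
    out.push(unit);
  }
  return out;
}
```

The loop always touches the whole string, even when the regexp could have matched after looking at only the first few code units.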