Q: Lonely surrogates and unicode regexps
Erik Corry
erik.corry at gmail.com
Wed Jan 28 02:57:41 PST 2015
On Wed, Jan 28, 2015 at 11:45 AM, Mathias Bynens <mathias at qiwi.be> wrote:
>
> > On 28 Jan 2015, at 11:36, Marja Hölttä <marja at chromium.org> wrote:
> >
> > For example, the current version of Mathias’s ES6 Unicode regular
> expression transpiler ( https://mothereff.in/regexpu ) converts /a.b/u
> into
> /a(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])b/
> and afaics it’s not yet fully consistent wrt lonely surrogates, so, a
> consistent implementation is going to be more complex than this.
>
> This is indeed an incomplete solution. The lack of lookbehind support in
> ES makes this hard to transpile correctly. Ideas welcome!
>
I don't think your transpiler can work without lookbehind. If you could
guarantee that none of your transpiled regexp matches a substring that ends
in the middle of a pair, then I think you could get it right without
lookbehind, but consider:
"TxL-TxLT".test(/(...)-\1./);
Where L stands for a lead surrogate, and T stands for a trailing
surrogate. There's no way to stop the backreference from swallowing the
last L, and without lookbehind there is no way to stop the . from matching
the final T. A second issue is having a match that starts in the middle of
a pair. You could test for this after the matching if JS gave you the index
of the match in the string, but I don't think it does.
Ignoring the start-of-match-in-the-middle-of-a-pair issue, and the
backreferences case, I think you can do without the backreference.
Assuming the lonely-surrogates-are-a-character scenario, the period (.)
transpiles to (ignore spaces added for readability):
(?: \L(?!\T) | \L\T | \T | [^\L\T\N])
where \L means leading surrogates, \T means trailing surrogates, \N means
all newlines. Whatever comes before the . is not allowed to match a half
As an optimization, .x can transpile to (?: \L\T | . )x where the x stands
in for any literal characters.
For a JS engine implementor, like Marja, it is of course possible to add
1-character negative lookbehind (\b already has elements of this). Then
your in-engine transpiler turns . into
(?: \L(?!\T) | \L\T | (?<!\L)\T | [^\L\T\N])
Which is going to be truly horrible in terms of code size and performance.
It's not like the period operator is a rare thing in a regexp, and other
common things like [^a-z] and [^\d] will expand into similar horrors.
On the other hand, in the lonely-surrogates-match-nothing scenario, the .
transpiles to
(?: \l\t | [^\l\t\n] )
which is quite a lot nicer and faster. In this scenario, .x expands to (?:
\L\T | [^\T\L\N ) which still has no lookaheads and lookbehinds.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.mozilla.org/pipermail/es-discuss/attachments/20150128/681658dd/attachment.html>
More information about the es-discuss
mailing list