Q: Lonely surrogates and unicode regexps

Erik Corry erik.corry at gmail.com
Wed Jan 28 02:57:41 PST 2015

On Wed, Jan 28, 2015 at 11:45 AM, Mathias Bynens <mathias at qiwi.be> wrote:

> > On 28 Jan 2015, at 11:36, Marja Hölttä <marja at chromium.org> wrote:
> >
> > For example, the current version of Mathias’s ES6 Unicode regular
> expression transpiler ( https://mothereff.in/regexpu ) converts /a.b/u
> into
> /a(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])b/
> and afaics it’s not yet fully consistent wrt lonely surrogates, so, a
> consistent implementation is going to be more complex than this.
> This is indeed an incomplete solution. The lack of lookbehind support in
> ES makes this hard to transpile correctly. Ideas welcome!

I don't think your transpiler can work without lookbehind.  If you could
guarantee that none of your transpiled regexp matches a substring that ends
in the middle of a pair, then I think you could get it right without
lookbehind, but consider:


Where L stands for a lead surrogate, and T stands for a trailing
surrogate.  There's no way to stop the backreference from swallowing the
last L, and without lookbehind there is no way to stop the . from matching
the final T.  A second issue is having a match that starts in the middle of
a pair. You could test for this after the matching if JS gave you the index
of the match in the string, but I don't think it does.

Ignoring the start-of-match-in-the-middle-of-a-pair issue, and the
backreferences case, I think you can do without the backreference.
Assuming the lonely-surrogates-are-a-character scenario, the period (.)
transpiles to (ignore spaces added for readability):

(?:  \L(?!\T)  | \L\T  |  \T  |  [^\L\T\N])

where \L means leading surrogates, \T means trailing surrogates, \N means
all newlines.  Whatever comes before the . is not allowed to match a half

As an optimization, .x can transpile to (?: \L\T | . )x where the x stands
in for any literal characters.

For a JS engine implementor, like Marja, it is of course possible to add
1-character negative lookbehind (\b already has elements of this).  Then
your in-engine transpiler turns . into

(?:  \L(?!\T)  | \L\T  |  (?<!\L)\T  |  [^\L\T\N])

Which is going to be truly horrible in terms of code size and performance.
It's not like the period operator is a rare thing in a regexp, and other
common things like [^a-z] and [^\d] will expand into similar horrors.

On the other hand, in the lonely-surrogates-match-nothing scenario, the .
transpiles to

(?: \l\t  |  [^\l\t\n] )

which is quite a lot nicer and faster.  In this scenario, .x expands to (?:
\L\T | [^\T\L\N )  which still has no lookaheads and lookbehinds.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.mozilla.org/pipermail/es-discuss/attachments/20150128/681658dd/attachment.html>

More information about the es-discuss mailing list