Q: Lonely surrogates and unicode regexps

Mark Davis ☕️ mark at macchiato.com
Wed Jan 28 05:26:37 PST 2015

I think the cleanest mental model is one where UTF-16 or UTF-8 strings are
interpreted as if they had been transformed into UTF-32.

While that is generally feasible, it often carries a performance cost that
is not acceptable in practice. So you see various approaches that deviate
from that mental model in some way.
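
To make the two models concrete, here is a rough sketch in Python (not Java,
so it only illustrates the models and says nothing about how Java's engine is
implemented). Each character of the input stands for one UTF-16 code unit.
The helper names combine_surrogate_pairs and both_views are made up for this
illustration, and the two test strings are Ex1 and Ex2 from the quoted
message below. Python's re module matches per character, so running it on the
raw string gives a "code unit" view, and running it on the combined string
gives the "as if UTF-32" view.

```python
import re


def combine_surrogate_pairs(units: str) -> str:
    """Treat each character as one UTF-16 code unit and return the
    corresponding code point sequence. Valid high+low pairs are merged;
    lonely surrogates are passed through unchanged."""
    out = []
    i = 0
    while i < len(units):
        hi = ord(units[i])
        if 0xD800 <= hi <= 0xDBFF and i + 1 < len(units):
            lo = ord(units[i + 1])
            if 0xDC00 <= lo <= 0xDFFF:
                out.append(chr(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)))
                i += 2
                continue
        out.append(units[i])
        i += 1
    return "".join(out)


def both_views(pattern: str, units: str) -> tuple:
    """Return (matches on code units, matches on combined code points)."""
    on_units = re.search(pattern, units) is not None
    on_code_points = re.search(pattern, combine_surrogate_pairs(units)) is not None
    return on_units, on_code_points


# Ex1 and Ex2 from the quoted message, written as code units.
ex1 = "foo\ud834bar\ud834\udc00"
ex2 = "\udc00foobar\ud834\udc00foobar\ud834"

print(both_views(r"foo(.+)bar\1", ex1))  # (True, False)
print(both_views(r"^(.+)\1$", ex2))      # (True, False)
```

In this sketch both examples match in the code unit view and fail in the
code point view. The Java results reported below (Ex1 matches, Ex2 does not)
therefore line up with the code unit view for Ex1 and with the code point
view for Ex2, so neither model on its own reproduces both.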

Mark <https://google.com/+MarkDavis>

*— Il meglio è l’inimico del bene —*

On Wed, Jan 28, 2015 at 2:15 PM, Marja Hölttä <marja at chromium.org> wrote:

> For reference, here's how Java (tried with Oracle 1.8.0_31 and OpenJDK
> 1.7.0_65) Pattern.UNICODE_CHARACTER_CLASS works:
> foo\uD834bar and foo\uDC00bar match ^foo[^a]bar$ and ^foo.bar$, so,
> generally, lonely surrogates match /./.
> Backreferences are allowed to consume the leading surrogate of a valid
> surrogate pair:
> Ex1: foo\uD834bar\uD834\uDC00 matches foo(.+)bar\1
> But surprisingly:
> Ex2: \uDC00foobar\uD834\uDC00foobar\uD834 doesn't match ^(.+)\1$
> ... So Ex2 works as if the input string was converted to UTF-32 before
> matching, but Ex1 works as if it definitely was not. I don't know what the
> correct mental model is in which both Ex1 and Ex2 would make sense.