Q: Lonely surrogates and unicode regexps

Marja Hölttä marja at chromium.org
Wed Jan 28 05:15:29 PST 2015


For a reference, here's how Java (tried w/ Oracle 1.8.0_31 and openjdk
1.7.0_65) Pattern.UNICODE_CHARACTER_CLASS works:

foo\uD834bar and foo\uDC00bar match ^foo[^a]bar$ and ^foo.bar$, so,
generally, lonely surrogates match /./.

Backreferences are allowed to consume the leading surrogate of a valid
surrogate pair:

Ex1: foo\uD834bar\uD834\uDC00 matches foo(.+)bar\1

But surprisingly:

Ex2: \uDC00foobar\uD834\uDC00foobar\uD834 doesn't match ^(.+)\1$

... So Ex2 works as if the input string was converted to UTF-32 before
matching, but Ex1 works as if it was def not. Idk what's the correct mental
model where both Ex1 and Ex2 would make sense.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.mozilla.org/pipermail/es-discuss/attachments/20150128/d2eaa1f6/attachment-0001.html>


More information about the es-discuss mailing list