Q: Lonely surrogates and unicode regexps
André Bargull
andre.bargull at udo.edu
Wed Jan 28 06:11:40 PST 2015
On 1/28/2015 2:51 PM, André Bargull wrote:
>> For a reference, here's how Java (tried w/ Oracle 1.8.0_31 and openjdk
>> 1.7.0_65) Pattern.UNICODE_CHARACTER_CLASS works:
>>
>> foo\uD834bar and foo\uDC00bar match ^foo[^a]bar$ and ^foo.bar$, so,
>> generally, lonely surrogates match /./.
>>
>> Backreferences are allowed to consume the leading surrogate of a valid
>> surrogate pair:
>>
>> Ex1: foo\uD834bar\uD834\uDC00 matches foo(.+)bar\1
>>
>> But surprisingly:
>>
>> Ex2: \uDC00foobar\uD834\uDC00foobar\uD834 doesn't match ^(.+)\1$
>>
>> ... So Ex2 works as if the input string was converted to UTF-32 before
>> matching, but Ex1 works as if it was def not. Idk what's the correct mental
>> model where both Ex1 and Ex2 would make sense.
>
> java.util.regex.Pattern matches back references by comparing (Java) chars [1], but reads patterns
> as a sequence of code points [2]. That should help to explain the differences between ex1 and ex2.
>
> [1]
> http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l4890
> [2]
> http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l1671
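Err, the part about how patterns are read is not important here. What I should have written is that
the input string is (also) read as a sequence of code points [3]. So in ex2 `\uD834\uDC00` is read
as a single code point (and not split into \uD834 and \uDC00 during backtracking), which means group 1
can never end between the two surrogates, and that is the only split for which `^(.+)\1$` could match.
In ex1 the group is just the lone \uD834, and the char-wise back reference comparison [1] then matches
the leading surrogate of the following pair.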
[3]
http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/c46daef6edb5/src/share/classes/java/util/regex/Pattern.java#l3773
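
For anyone who wants to reproduce this, here is a minimal sketch of the examples from the thread. The
class and method names are made up for illustration. I am also assuming that "matches" in ex1 means a
find()-style search, since the back reference leaves the trailing \uDC00 unconsumed, and that current
JDKs still behave like the jdk8u sources linked above.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative, not from the JDK or the thread.
public class LonelySurrogates {
    // Compile with the flag mentioned in the quoted message.
    static Pattern compile(String regex) {
        return Pattern.compile(regex, Pattern.UNICODE_CHARACTER_CLASS);
    }

    // A lone surrogate in the input is consumed by '.' and by a negated class.
    static boolean loneSurrogateMatchesDot(String input) {
        return compile("^foo.bar$").matcher(input).matches()
            && compile("^foo[^a]bar$").matcher(input).matches();
    }

    // The input is read as code points: a well-formed surrogate pair counts as one.
    static boolean pairIsOneCodePoint() {
        return "\uD834\uDC00".codePointCount(0, 2) == 1;
    }

    // Ex1: group 1 is the lone high surrogate; the back reference compares chars,
    // so it can consume the leading surrogate of the pair at the end of the input.
    // Assumption: a find()-style search, because the low surrogate stays unconsumed.
    static boolean ex1() {
        return compile("foo(.+)bar\\1")
            .matcher("foo\uD834bar\uD834\uDC00")
            .find();
    }

    // Ex2: the only possible split is in the middle of the surrogate pair, and '.'
    // never stops there because the pair is read as a single code point.
    static boolean ex2() {
        return compile("^(.+)\\1$")
            .matcher("\uDC00foobar\uD834\uDC00foobar\uD834")
            .matches();
    }

    public static void main(String[] args) {
        System.out.println("lone high surrogate matches: " + loneSurrogateMatchesDot("foo\uD834bar"));
        System.out.println("lone low surrogate matches:  " + loneSurrogateMatchesDot("foo\uDC00bar"));
        System.out.println("pair is one code point:      " + pairIsOneCodePoint());
        System.out.println("ex1 found:                   " + ex1());
        System.out.println("ex2 matched:                 " + ex2());
    }
}
```

Going by the results reported in the quoted message, ex1 should be found and ex2 should not match.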