Q: Lonely surrogates and unicode regexps

Norbert Lindenberg ecmascript at lindenbergsoftware.com
Sat Jan 31 00:39:55 PST 2015


> On Jan 28, 2015, at 8:30, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:
> 
> 
> On Jan 28, 2015, at 4:54 AM, Wes Garland <wes at page.ca> wrote:
> 

>> Do we extend the regexp syntax to have a symbol which matches an unmatched surrogate?
> we already have it: \u{D83D}

Or to match any unpaired surrogate: /[\u{D800}-\u{DFFF}]/u
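
A minimal sketch of how that pattern behaves, for illustration only. The helper name hasLoneSurrogate is made up for this example and is not part of any API. With the u flag a well-formed surrogate pair is decoded into a single code point before matching, so only unpaired surrogates fall in the D800-DFFF range. Without the flag the same class also matches each half of a valid pair.

```javascript
// Pattern from the message above: matches any unpaired surrogate in //u mode.
const loneSurrogate = /[\u{D800}-\u{DFFF}]/u;

// Hypothetical helper name, used only for this illustration.
function hasLoneSurrogate(str) {
  return loneSurrogate.test(str);
}

// A well-formed pair is seen as the single code point U+1F600: no match.
console.log(hasLoneSurrogate("\uD83D\uDE00")); // false

// A high surrogate followed by a non-surrogate is unpaired: match.
console.log(hasLoneSurrogate("a\uD83Db")); // true

// A low surrogate with nothing in front of it is unpaired: match.
console.log(hasLoneSurrogate("\uDE00")); // true

// Without the u flag the class works on code units, so both halves
// of a valid pair match as well.
const anySurrogateUnit = /[\uD800-\uDFFF]/;
console.log(anySurrogateUnit.test("\uD83D\uDE00")); // true
```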

>>   How about reserved code points?  What happens when they become assigned?
> Other than the initial decoding of valid surrogate pairs into 32-bit code points, the ES6 //u RegExp spec. applies no semantics to any code points in the string that is being matched.

There are a few places where RegExp applies Unicode semantics:

– //ui uses Unicode case folding to compare case-insensitively. If the comparison involves code points that are unassigned in the Unicode version assumed by an ECMAScript implementation and in a later version get assigned to characters that are case-variants of each other, then the RegExp behavior can change. See section 21.2.2.8.2.

– RegExp knows a few character classes: \d, \D, \s, \S, \w, \W. \d, \D, \w, \W are defined by character lists that cannot change, but \s and therefore \S could change if Unicode assigns new characters with the category “Separator, space”. See section 21.2.2.12.
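
The two points above can be checked with a short sketch. The variable names are invented for this example, and the results shown assume an engine that implements the ES6 semantics described here. The characters tested are all long assigned, so these results should not depend on the engine's Unicode version.

```javascript
// 1. Case-insensitive matching: //ui uses Unicode case folding,
//    while //i without the u flag uses a toUpperCase-based comparison.

// U+212A KELVIN SIGN folds to "k" under Unicode case folding.
const kelvinWithU = /\u212A/ui.test("k");
const kelvinWithoutU = /\u212A/i.test("k");
console.log(kelvinWithU, kelvinWithoutU); // true false

// Supplementary characters: U+10400 and U+10428 are case variants (Deseret).
const deseretWithU = /\u{10400}/ui.test("\u{10428}");
// Without the u flag the pattern is matched code unit by code unit,
// and surrogate code units have no case mappings.
const deseretWithoutU = /\uD801\uDC00/i.test("\uD801\uDC28");
console.log(deseretWithU, deseretWithoutU); // true false

// 2. Character classes: \d and \w are fixed lists, \s follows Unicode.

// U+0663 ARABIC-INDIC DIGIT THREE is a Unicode digit but not in \d.
const arabicDigitIsDigit = /\d/u.test("\u0663");
// U+00E9 (e with acute) is a letter but not in \w.
const accentedIsWord = /\w/u.test("\u00E9");
// U+3000 IDEOGRAPHIC SPACE has category "Separator, space", so \s matches it.
const ideographicSpaceIsSpace = /\s/u.test("\u3000");
console.log(arabicDigitIsDigit, accentedIsWord, ideographicSpaceIsSpace);
// false false true
```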

But in general //u is defined based on code points and doesn’t care whether code points are assigned or reserved.
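
As a rough illustration of the general rule: the sketch below assumes that U+0378 is still unassigned in current Unicode versions, but the outcome does not depend on that, which is the point. The variable names are made up for this example.

```javascript
// U+0378 is assumed here to be an unassigned (reserved) code point.
// //u matches it by code point value, regardless of its assignment status.
const reservedMatches = /^\u{378}$/u.test("\u{378}");
console.log(reservedMatches); // true

// A private-use code point is matched the same way.
const privateUseMatches = /^[\u{E000}-\u{F8FF}]$/u.test("\u{E000}");
console.log(privateUseMatches); // true

// With the u flag, "." consumes one whole code point, so a supplementary
// character counts as a single match unit.
const dotWithU = /^.$/u.test("\u{1F600}");
// Without the flag, the same character is two code units.
const dotWithoutU = /^.$/.test("\u{1F600}");
console.log(dotWithU, dotWithoutU); // true false
```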

Norbert


