Q: Lonely surrogates and unicode regexps

Norbert Lindenberg ecmascript at lindenbergsoftware.com
Sat Jan 31 00:39:55 PST 2015

> On Jan 28, 2015, at 8:30, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:
> On Jan 28, 2015, at 4:54 AM, Wes Garland <wes at page.ca> wrote:

>> Do we extend the regexp syntax to have a symbol which matches an unmatched surrogate?
> we already have it: \u{D83D}

Or to match any unpaired surrogate: /[\u{D800}-\u{DFFF}]/u
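
A minimal sketch of how that class behaves, assuming an ES6-capable engine. The sample strings and the variable names (loneSurrogate, results) are made up for illustration:

```javascript
// With the u flag, a well-formed surrogate pair is decoded into a single
// code point, so only unpaired (lone) surrogates fall into this range.
const loneSurrogate = /[\u{D800}-\u{DFFF}]/u;

const paired = "\u{1F600}";   // one code point, stored as the pair D83D DE00
const loneHigh = "a\uD83Db";  // high surrogate not followed by a low surrogate
const loneLow = "a\uDE00b";   // low surrogate not preceded by a high surrogate

const results = {
  paired: loneSurrogate.test(paired),             // false
  loneHigh: loneSurrogate.test(loneHigh),         // true
  loneLow: loneSurrogate.test(loneLow),           // true
  // A single surrogate can also be named directly, as in \u{D83D}.
  exactHigh: /\u{D83D}/u.test(loneHigh),          // true
  exactHighInPair: /\u{D83D}/u.test(paired),      // false
  // Without the u flag matching works on code units, so each half of a
  // pair is matched.
  pairedWithoutU: /[\uD800-\uDFFF]/.test(paired), // true
};

console.log(results);
```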

>>   How about reserved code points?  What happens when they become assigned?
> Other than the initial decoding of valid surrogate pairs into 32-bit code points, the ES6 //u RegExp spec. applies no semantics to any code points in the string that is being matched.

There are a few places where RegExp applies Unicode semantics:

– //ui uses Unicode case folding for case-insensitive comparison. If a comparison involves code points that are unassigned in the Unicode version an ECMAScript implementation assumes, and a later version assigns them to characters that are case variants of each other, then the RegExp behavior can change. See section

– RegExp knows a few character class escapes: \d, \D, \s, \S, \w, \W. Of these, \d, \D, \w, and \W are defined by character lists that cannot change, but \s, and therefore \S, could change if Unicode assigns new characters with the category “Separator, space”. See section
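
To illustrate both points, here is a small sketch, again assuming an ES6-capable engine. The chosen characters (Kelvin sign, Deseret letters, and so on) and the names caseFolding and classChecks are my own picks for illustration, not something the spec singles out:

```javascript
// Case folding: the Kelvin sign (U+212A) and "k" are case variants under
// Unicode simple case folding, which is what the u + i combination uses.
const kelvin = "\u212A";
const caseFolding = {
  kelvinWithUI: /k/ui.test(kelvin),                    // true
  kelvinWithIOnly: /k/i.test(kelvin),                  // false (toUpperCase-based)
  // Deseret capital and small "long I" live outside the BMP.
  deseretWithUI: /\u{10400}/ui.test("\u{10428}"),      // true
  deseretWithUOnly: /\u{10400}/u.test("\u{10428}"),    // false
};

// Character classes: \d is a fixed list (0-9), while \s follows the
// Unicode "Separator, space" category in addition to a few fixed characters.
const classChecks = {
  arabicIndicDigit: /\d/u.test("\u0663"),   // false: not in the fixed \d list
  noBreakSpace: /\s/u.test("\u00A0"),       // true
  ideographicSpace: /\s/u.test("\u3000"),   // true: category "Separator, space"
};

console.log(caseFolding, classChecks);

// U+180E moved out of "Separator, space" in Unicode 6.3, so the result here
// depends on the Unicode version the engine assumes.
console.log(/\s/u.test("\u180E"));
```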

But in general //u is defined based on code points and doesn’t care whether code points are assigned or reserved.
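
As a rough illustration of that last point: U+40000 is, to my knowledge, unassigned at the time of writing (that is an assumption about current Unicode, not something ECMAScript cares about), and //u still treats it as an ordinary single code point. The names unassigned and codePointView are illustrative only:

```javascript
// Assumption: U+40000 (plane 4) has no character assigned to it.
const unassigned = "\u{40000}";

const codePointView = {
  lengthInCodeUnits: unassigned.length,                    // 2 (a surrogate pair)
  dotWithU: /^.$/u.test(unassigned),                       // true: one code point
  dotWithoutU: /^.$/.test(unassigned),                     // false: two code units
  rangeWithU: /^[\u{40000}-\u{4FFFF}]$/u.test(unassigned), // true
};

console.log(codePointView);
```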

