Q: Lonely surrogates and unicode regexps

Mark Davis ☕️ mark at macchiato.com
Wed Jan 28 02:50:07 PST 2015

On Wed, Jan 28, 2015 at 11:36 AM, Marja Hölttä <marja at chromium.org> wrote:

> The ES6 unicode regexp spec is not very clear regarding what should happen
> if the regexp or the matched string contains lonely surrogates (a lead
> surrogate without a trail, or a trail without a lead). For example, for the
> . operator, the relevant parts of the spec speak about characters:

Just a bit of terminology.

The term "character" is overloaded, so Unicode provides the unambiguous
term "code point". For example, U+0378 is not (currently) an encoded
character according to Unicode, but it would certainly be a terrible idea
to disregard it, or not match it. It is a reserved code point that may be
assigned as an encoded character in the future. So neither U+D83D (a lone
lead surrogate) nor U+0378 is a character, yet both are code points.
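
To make the distinction concrete, here is a minimal sketch in Java that
classifies a few values by Unicode general category. The class and method
names (CodePointKinds, describe) are made up for illustration, and the
result for U+0378 depends on the Unicode version bundled with the JDK; it
is unassigned at the time of writing.

```java
public class CodePointKinds {
    // Hypothetical helper: describe a value by its Unicode general category.
    public static String describe(int cp) {
        if (!Character.isValidCodePoint(cp)) {
            return "not a code point"; // outside U+0000..U+10FFFF
        }
        switch (Character.getType(cp)) {
            case Character.SURROGATE:
                return "surrogate code point";
            case Character.UNASSIGNED:
                return "unassigned (reserved) code point";
            default:
                return "assigned or private-use code point";
        }
    }

    public static void main(String[] args) {
        // 'A', a reserved code point, a lead surrogate, and an out-of-range value.
        for (int cp : new int[] {0x0041, 0x0378, 0xD83D, 0x110000}) {
            System.out.printf("U+%04X: %s%n", cp, describe(cp));
        }
    }
}
```

Both U+0378 and U+D83D are valid code points even though neither is an
encoded character, which is why a matcher defined over code points still
has to handle them.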

If an ES spec uses the term "character" instead of "code point", then at
some point in the text it needs to disambiguate what is meant.

As to how this should be handled in regular expressions, I'd suggest
looking at Java's approach.
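
As a rough illustration of that approach, the sketch below checks what a
single "." matches in java.util.regex. As I understand it, Java matches at
the level of code points: a well-formed surrogate pair counts as one code
point, and a lone surrogate is treated as a code point in its own right.
The class and method names (LoneSurrogateDot, dotMatchesWhole) are
hypothetical.

```java
import java.util.regex.Pattern;

public class LoneSurrogateDot {
    // Hypothetical helper: does a single "." match the entire input?
    public static boolean dotMatchesWhole(String s) {
        return Pattern.matches(".", s);
    }

    public static void main(String[] args) {
        String pair = "\uD83D\uDE00"; // U+1F600 encoded as a surrogate pair
        String loneLead = "\uD83D";   // lead surrogate with no trail
        String loneTrail = "\uDE00";  // trail surrogate with no lead
        String leadThenA = "\uD83Da"; // lone lead followed by 'a': two code points

        System.out.println("pair:      " + dotMatchesWhole(pair));
        System.out.println("loneLead:  " + dotMatchesWhole(loneLead));
        System.out.println("loneTrail: " + dotMatchesWhole(loneTrail));
        System.out.println("leadThenA: " + dotMatchesWhole(leadThenA));
        // The pair is two UTF-16 code units but one code point.
        System.out.println(pair.length() + " code units, "
                + pair.codePointCount(0, pair.length()) + " code point");
    }
}
```

The point of the sketch is only that the matcher never splits a valid pair
and never refuses a lone surrogate; whether ES6 should adopt the same rule
is the open question.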

Mark <https://google.com/+MarkDavis>

*— Il meglio è l’inimico del bene —*