Q: Lonely surrogates and unicode regexps

Erik Corry erik.corry at gmail.com
Sat Jan 31 08:24:54 PST 2015

I think it's problematic that this is being standardized without a single

On Wed, Jan 28, 2015 at 11:57 AM, André Bargull <andre.bargull at udo.edu> wrote:

> On Wed, Jan 28, 2015 at 11:36 AM, Marja Hölttä <marja at chromium.org> wrote:
> > The ES6 unicode regexp spec is not very clear regarding what should happen
> > if the regexp or the matched string contains lonely surrogates (a lead
> > surrogate without a trail, or a trail without a lead). For example, for the
> > . operator, the relevant parts of the spec speak about characters:
> >
> Just a bit of terminology.
> The term "character" is overloaded, so Unicode provides the unambiguous
> term "code point". For example, U+0378 is not (currently) an encoded
> character according to Unicode, but it would certainly be a terrible idea
> to disregard it, or not match it. It is a reserved code point that may be
> assigned as an encoded character in the future. So neither U+D83D nor U+0378
> is a character.
> If an ES spec uses the term "character" instead of "code point", then at
> some point in the text it needs to disambiguate what is meant.
> "character" is defined in 21.2.2 Pattern Semantics [1]:
> In the context of describing the behaviour of a BMP pattern “character”
> means a single 16-bit Unicode BMP code point. In the context of describing
> the behaviour of a Unicode pattern “character” means a UTF-16 encoded code
> point.
> [1]
> https://people.mozilla.org/~jorendorff/es6-draft.html#sec-pattern-semantics
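
To make the quoted definition concrete, here is a minimal sketch of how the
two meanings of "character" show up for the . operator. It assumes an engine
that implements the u flag the way ES2015-conformant engines ended up doing,
which may not be what the draft text discussed here pins down. dotMatches is
a hypothetical helper written for this illustration only; it is not from the
spec or from this thread.

```javascript
// Hypothetical helper: return everything `.` matches in `str`,
// with each match shown as its UTF-16 code units in hex.
function dotMatches(str, flags) {
  const re = new RegExp(".", "g" + flags);
  return (str.match(re) || []).map((m) =>
    Array.from({ length: m.length }, (_, i) =>
      m.charCodeAt(i).toString(16).toUpperCase().padStart(4, "0")
    ).join(" ")
  );
}

const pair = "\uD83D\uDE00"; // U+1F600 as a well-formed surrogate pair
const loneLead = "\uD83D";   // lead surrogate without a trail
const reserved = "\u0378";   // reserved code point, not an encoded character

// BMP pattern: "character" is one 16-bit code unit, so the pair is split.
const bmpPair = dotMatches(pair, "");       // ["D83D", "DE00"]

// Unicode pattern: "character" is one code point, so the pair stays whole.
const uniPair = dotMatches(pair, "u");      // ["D83D DE00"]

// A lonely surrogate is still matched, as a code point of its own.
const uniLone = dotMatches(loneLead, "u");  // ["D83D"]

// A reserved code point is matched as well, in either mode.
const uniReserved = dotMatches(reserved, "u"); // ["0378"]

// A lonely trail in the pattern finds the second half of a pair only
// when the pattern is a BMP pattern.
const trailInBmpPattern = /\uDE00/.test(pair);      // true
const trailInUnicodePattern = /\uDE00/u.test(pair); // false
```

The last two lines are the case the original question is about: under the u
flag the input is read as code points, so a lonely trail in the pattern does
not match the trail half of a well-formed pair.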