Q: Lonely surrogates and unicode regexps
Norbert Lindenberg
ecmascript at lindenbergsoftware.com
Sat Jan 31 00:01:32 PST 2015
> On Jan 28, 2015, at 8:11 , Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:
>
>
> On Jan 28, 2015, at 2:36 AM, Marja Hölttä <marja at chromium.org> wrote:
>
>> Hello es-discuss,
>>
>> TL;DR: /foo.bar/u.test("foo\uD83Dbar") == ?
>>
>> The ES6 unicode regexp spec is not very clear regarding what should happen if the regexp or the matched string contains lonely surrogates (a lead surrogate without a trail, or a trail without a lead). For example, for the . operator, the relevant parts of the spec speak about characters:
>
> TL;DR: in a unicode regexp lonely surrogates are considered to be a single “character”.
>
> As André has already covered, “character” has a very specific meaning within the context of the ES6 RegExp specification; see the second paragraph of http://people.mozilla.org/~jorendorff/es6-draft.html#sec-pattern-semantics . The specification uses the same set of algorithms to describe both BMP (i.e., 16-bit elements) and unicode (i.e., 32-bit elements) patterns and matching semantics. “Character” is used in those algorithms to refer to a single element in whichever mode is currently in operation.
>
> I think the ambiguity you find is in step 2.1 of http://people.mozilla.org/~jorendorff/es6-draft.html#sec-pattern :
>
> 2. Return an internal closure that takes two arguments, a String str and an integer index, and performs the following:
> 1. If Unicode is true, let Input be a List consisting of the sequence of code points of str interpreted as a UTF-16 encoded Unicode string. Otherwise, let Input be a List consisting of the sequence of code units that are the elements of str. Input will be used throughout the algorithms in 21.2.2. Each element of Input is considered to be a character.
>
> Apparently I don’t have an adequate definition of “interpreted as a UTF-16 encoded Unicode string”. If you submit a bug to bugs.ecmascript.org, I will provide one in the next spec revision. The intended semantics is that:
> In ascending string index order:
>    Each valid UTF-16 surrogate pair is interpreted as a single code point, namely the value the pair encodes.
>    Each “lonely” surrogate is interpreted as a single code point whose value is the surrogate value.
>    Every other 16-bit code unit is interpreted as a single code point.
That definition is in section 6.1.4:
http://people.mozilla.org/~jorendorff/es6-draft.html#sec-ecmascript-language-types-string-type
A cross-reference would be useful.
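
To make the three rules quoted above concrete, here is a minimal sketch of that interpretation in plain JavaScript. The helper name toCodePoints is hypothetical and does not come from the spec; it is only meant to illustrate the intended semantics, not to mirror how any engine implements them.

```javascript
// Hypothetical helper (name not taken from the spec): turn a string into a
// list of code points, walking the code units in ascending index order.
function toCodePoints(str) {
  const result = [];
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    const isLead = unit >= 0xD800 && unit <= 0xDBFF;
    if (isLead && i + 1 < str.length) {
      const next = str.charCodeAt(i + 1);
      const isTrail = next >= 0xDC00 && next <= 0xDFFF;
      if (isTrail) {
        // Valid surrogate pair: one code point, the value the pair encodes.
        result.push((unit - 0xD800) * 0x400 + (next - 0xDC00) + 0x10000);
        i++; // the trail surrogate has been consumed as part of the pair
        continue;
      }
    }
    // A lonely surrogate or any other code unit: one code point with the
    // same value as the code unit.
    result.push(unit);
  }
  return result;
}

// A lonely lead surrogate stays a single element of the list.
const lonely = toCodePoints("foo\uD83Dbar");

// A valid pair collapses into one element (U+1F600 here).
const paired = toCodePoints("foo\uD83D\uDE00bar");

// With the u flag, "." consumes one element of that list.
const lonelyMatches = /^foo.bar$/u.test("foo\uD83Dbar");
const pairedMatchesUnicode = /^foo.bar$/u.test("foo\uD83D\uDE00bar");
const pairedMatchesLegacy = /^foo.bar$/.test("foo\uD83D\uDE00bar");
```

Under this reading, lonely has seven elements with 0xD83D at index 3, and paired also has seven elements with 0x1F600 at index 3. The last three constants show why the mode matters: the lonely surrogate is matched by a single "." in a unicode pattern, and the surrogate pair is matched by a single "." only when the u flag is present.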
Norbert