Q: Lonely surrogates and unicode regexps

Allen Wirfs-Brock allen at wirfs-brock.com
Wed Jan 28 08:30:37 PST 2015


On Jan 28, 2015, at 4:54 AM, Wes Garland <wes at page.ca> wrote:

> Some interesting questions here.

These aren't discussion points.  These are all things that must have answers that are directly derivable from the ES6 spec.  If, after developing an adequate understanding of that part of the specification, you can’t find the answers to these questions, then there is probably something that needs to be clarified in the spec.
> 
> 1 - What is a character? Is it a Unicode Code Point?
defined in: paragraph 2 http://people.mozilla.org/~jorendorff/es6-draft.html#sec-pattern-semantics 
> 2 - Should we be able to match all possible JS Strings?
yes, there is nothing in the algorithms that restricts JS String values
> 3 - Should we be able to match all possible Unicode Strings?
yes, subject to what you mean by “Unicode Strings”, as within JS Strings supplementary code points must be UTF-16 encoded.
> 4 - What do we do if there is a character in a String we cannot match?
RegExp.exec returns null if a string cannot be matched by a pattern
> 5 - Do unmatchable characters match . ?
there is no concept in the specification of an “unmatchable” character
> 6 - Are subsections of unmatchable strings matchable if they contain only matchable characters?
there is no concept in the specification of an “unmatchable” character
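
A minimal sketch of what that means in practice, assuming an engine that implements the ES6 u flag (the variable names are only for illustration): a lone surrogate is just another code point to the matcher, so `.` matches it, and the rest of the string stays matchable around it.

```javascript
// An unpaired trail surrogate embedded in otherwise ordinary text.
const lone = "\uDE00";
const mixed = "ab" + lone + "cd";

// In u mode the lone surrogate is a single code point, so . matches it.
const dotMatchesLone = /^.$/u.test(lone);      // true

// The whole string is five code points: a, b, U+DE00, c, d.
const wholeString = /^.{5}$/u.test(mixed);     // true

// Substrings made of ordinary characters match as usual.
const sub = /[a-z]+/u.exec(mixed);             // match "ab" at index 0

// A failed match is simply null; nothing is flagged as "unmatchable".
const noMatch = /x/u.exec(mixed);              // null
```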
> 
> It is important to remember in these discussions that the Unicode specification allows strings which contain unmatched surrogates.
and ES6 //u patterns can match them
> Do we want regular expressions that can't match some Unicode strings?
No, the ES6 specification can match all possible strings
> Do we extend the regexp syntax to have a symbol which matches an unmatched surrogate?
we already have it: \u{D83D}
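
As a rough illustration (a sketch rather than spec text; the variable names are made up here), the escape matches an unpaired lead surrogate but not the first half of a well-formed pair, because under the u flag the pair is decoded into one code point before matching:

```javascript
const lone = "\uD83D";        // unpaired lead surrogate
const pair = "\uD83D\uDE00";  // well-formed pair encoding U+1F600

// The escape names the code point U+D83D, which is exactly what a lone
// lead surrogate is treated as in u mode.
const matchesLone = /\u{D83D}/u.test(lone);   // true

// In u mode the pair is a single code point U+1F600, so U+D83D is not found.
const matchesPair = /\u{D83D}/u.test(pair);   // false

// Without the u flag the string is matched code unit by code unit,
// so the lead half of the pair is visible to the pattern.
const matchesPairLegacy = /\uD83D/.test(pair); // true
```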
>   How about reserved code points?  What happens when they become assigned?
Other than the initial decoding of valid surrogate pairs into 32-bit code points, the ES6 //u RegExp spec. applies no semantics to any code points in the string that is being matched.
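
The decoding step can be sketched roughly as follows. This is an approximation written for illustration, not the spec algorithm itself, and `toCodePoints` is a hypothetical helper name: valid pairs are combined into one code point, and anything else, including unmatched surrogates, passes through unchanged.

```javascript
// Hypothetical helper: decode a JS String (UTF-16 code units) into an
// array of code points, leaving unmatched surrogates as they are.
function toCodePoints(s) {
  const out = [];
  for (let i = 0; i < s.length; i++) {
    const lead = s.charCodeAt(i);
    // A lead surrogate followed by a trail surrogate forms a valid pair.
    if (lead >= 0xD800 && lead <= 0xDBFF && i + 1 < s.length) {
      const trail = s.charCodeAt(i + 1);
      if (trail >= 0xDC00 && trail <= 0xDFFF) {
        out.push((lead - 0xD800) * 0x400 + (trail - 0xDC00) + 0x10000);
        i++; // skip the trail unit that was just consumed
        continue;
      }
    }
    // Everything else, including a lone surrogate, is its own code point.
    out.push(lead);
  }
  return out;
}

// "a", then a valid pair (U+1F600), then an unpaired trail surrogate.
const decoded = toCodePoints("a\uD83D\uDE00\uDE00");
// decoded is [0x61, 0x1F600, 0xDE00]
```

No step in the sketch asks whether a code point is assigned, reserved, or a surrogate once decoding is done.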

Allen
