Working with grapheme clusters

Mathias Bynens mathias at qiwi.be
Sun Oct 27 02:44:09 PDT 2013


On 26 Oct 2013, at 14:39, Bjoern Hoehrmann <derhoermi at gmx.net> wrote:

> * Norbert Lindenberg wrote:
>> On Oct 25, 2013, at 18:35 , Jason Orendorff <jason.orendorff at gmail.com> wrote:
>> 
>>> UTF-16 is designed so that you can search based on code units
>>> alone, without computing boundaries. RegExp searches fall in this
>>> category.
>> 
>> Not if the RegExp is case insensitive, or uses a character class, or ".", or a
>> quantifier - these all require looking at code points rather than UTF-16 code
>> units in order to support the full Unicode character set.
> 
> If you have a regular expression over an alphabet like "Unicode scalar
> values" it is easy to turn it into an equivalent regular expression over
> an alphabet like "UTF-16 code units".

FWIW, [Regenerate](http://mths.be/regenerate) is a JavaScript library that can be used for this. A few examples from <http://mathiasbynens.be/notes/javascript-unicode#regex>:

> Here’s how to create a regular expression that matches any Unicode scalar value:
> 
>     >> regenerate()
>          .addRange(0x0, 0x10FFFF)     // all Unicode code points
>          .removeRange(0xD800, 0xDBFF) // minus high surrogates
>          .removeRange(0xDC00, 0xDFFF) // minus low surrogates
>          .toRegExp()
>     /[\0-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]/
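
The second alternative in that pattern covers the supplementary planes because every code point from U+10000 to U+10FFFF is encoded in UTF-16 as one high surrogate followed by one low surrogate. A minimal sketch of that arithmetic, plus a quick check of the pattern itself (`toSurrogatePair`, `scalarValue`, and `countScalarValues` are names made up for illustration; they are not part of Regenerate):

```javascript
// Split a supplementary code point (U+10000..U+10FFFF) into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  var offset = codePoint - 0x10000;
  return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)];
}

// The pattern from the output above, anchored so it must consume the whole input.
var scalarValue = /^(?:[\0-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])$/;

// The same pattern with the global flag, to walk a string one scalar value at a time.
var scalarValues = /[\0-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]/g;

function countScalarValues(string) {
  var matches = string.match(scalarValues);
  return matches ? matches.length : 0;
}

var lowest = toSurrogatePair(0x10000);   // [0xD800, 0xDC00]
var highest = toSurrogatePair(0x10FFFF); // [0xDBFF, 0xDFFF]
var pooPair = toSurrogatePair(0x1F4A9);  // [0xD83D, 0xDCA9]

var matchesBmp = scalarValue.test('a');                     // true
var matchesAstral = scalarValue.test('\uD83D\uDCA9');       // true: one scalar value, two code units
var matchesLoneSurrogate = scalarValue.test('\uD83D');      // false: not a scalar value
var scalarCount = countScalarValues('foo\uD83D\uDCA9bar');  // 7, whereas .length is 8
```

The lowest and highest supplementary code points map to the corners of the two surrogate ranges, which is why a single `[\uD800-\uDBFF][\uDC00-\uDFFF]` alternative is enough here.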


Similarly, to polyfill `.` in a Unicode-enabled ES6 regex:

> When the `u` flag is set, `.` is equivalent to the following backwards-compatible regular expression pattern:
> 
>     >> regenerate()
>          .addRange(0x0, 0x10FFFF) // all Unicode code points
>          .remove(  // minus `LineTerminator`s (http://ecma-international.org/ecma-262/5.1/#sec-7.3):
>            0x000A, // Line Feed <LF>
>            0x000D, // Carriage Return <CR>
>            0x2028, // Line Separator <LS>
>            0x2029  // Paragraph Separator <PS>
>          )
>          .toString();
>     '[\0-\x09\x0B\x0C\x0E-\u2027\u202A-\uD7FF\uDC00-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF]'
>     
>     >> /foo(?:[\0-\x09\x0B\x0C\x0E-\u2027\u202A-\uD7FF\uDC00-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF])bar/.test('foo💩bar')
>     true
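
To make the difference concrete, here is a rough comparison of a plain `.` and the expanded pattern, both used without the `u` flag, which is the situation the backwards-compatible pattern is meant for. This is only a sketch; the variable names are made up for illustration:

```javascript
// A plain `.` without the `u` flag consumes exactly one UTF-16 code unit.
var plainDot = /foo.bar/;

// The expanded pattern from the output above, also without the `u` flag.
var polyfilledDot = /foo(?:[\0-\x09\x0B\x0C\x0E-\u2027\u202A-\uD7FF\uDC00-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF])bar/;

var astral = 'foo\uD83D\uDCA9bar'; // 'foo💩bar', written with escapes

var plainMatchesAstral = plainDot.test(astral);                    // false: `.` stops after the high surrogate
var polyfillMatchesAstral = polyfilledDot.test(astral);            // true: the pair is consumed as one unit
var polyfillMatchesNewline = polyfilledDot.test('foo\nbar');       // false: line terminators stay excluded
var polyfillMatchesLoneHigh = polyfilledDot.test('foo\uD83Dbar');  // true: lone surrogates still match, like `.`
```

The trailing `[\uD800-\uDBFF]` alternative is what keeps lone high surrogates matching; it has to come after the surrogate-pair alternative so that a complete pair is consumed whole first.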

