Working with grapheme clusters

Norbert Lindenberg ecmascript at lindenbergsoftware.com
Sat Oct 26 20:12:51 PDT 2013


On Oct 26, 2013, at 6:58 , Jason Orendorff <jason.orendorff at gmail.com> wrote:

> On Fri, Oct 25, 2013 at 11:42 PM, Norbert Lindenberg
> <ecmascript at lindenbergsoftware.com> wrote:
>> 
>> On Oct 25, 2013, at 18:35 , Jason Orendorff <jason.orendorff at gmail.com> wrote:
>> 
>>> UTF-16 is designed so that you can search based on code units
>>> alone, without computing boundaries. RegExp searches fall in this
>>> category.
>> 
>> Not if the RegExp is case insensitive, or uses a character class, or ".", or a quantifier. All of these require looking at code points rather than UTF-16 code units in order to support the full Unicode character set.

> I'd like to know what you have in mind regarding quantifiers though.

When I write /💩{2}/, I mean /💩💩/, but the current code-unit-based RegExp will interpret it as /💩\uDCA9/, which can't match any well-formed UTF-16 string.
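
A minimal sketch of the difference, assuming an engine that implements the "u" (Unicode) RegExp flag that was standardized later in ES2015. The variable names are illustrative only, and 💩 (U+1F4A9) is spelled out as its surrogate pair so the code units are visible:

```javascript
// U+1F4A9 is stored as two UTF-16 code units: \uD83D (lead) and \uDCA9 (trail).
const poo = "\uD83D\uDCA9";
const twoPoos = poo + poo;

// Code-unit-based matching: {2} binds only to the trail surrogate \uDCA9,
// so the pattern means \uD83D \uDCA9 \uDCA9.
const byCodeUnit = new RegExp(poo + "{2}");
const matchesTwo = byCodeUnit.test(twoPoos);               // false
const matchesIllFormed = byCodeUnit.test(poo + "\uDCA9");  // true

// Code-point-based matching (assumes "u" flag support): {2} binds to the
// whole code point U+1F4A9.
const byCodePoint = new RegExp(poo + "{2}", "u");
const matchesTwoU = byCodePoint.test(twoPoos);               // true
const matchesIllFormedU = byCodePoint.test(poo + "\uDCA9");  // false

// The same split shows up with ".": one code unit versus one code point.
const dotByCodeUnit = new RegExp("^.$").test(poo);        // false
const dotByCodePoint = new RegExp("^.$", "u").test(poo);  // true
```

Without the flag, the only string the quantified pattern accepts is one that ends in an unpaired trail surrogate, which is the ill-formed case described above.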

Norbert

