Working with grapheme clusters

Jason Orendorff jason.orendorff at gmail.com
Sat Oct 26 06:58:46 PDT 2013


On Fri, Oct 25, 2013 at 11:42 PM, Norbert Lindenberg
<ecmascript at lindenbergsoftware.com> wrote:
>
> On Oct 25, 2013, at 18:35 , Jason Orendorff <jason.orendorff at gmail.com> wrote:
>
>> UTF-16 is designed so that you can search based on code units
>> alone, without computing boundaries. RegExp searches fall in this
>> category.
>
> Not if the RegExp is case insensitive, or uses a character class, or ".", or a quantifier - these all require looking at code points rather than UTF-16 code units in order to support the full Unicode character set.

Can you explain this more? It seems to me that case-insensitive
searches and character classes don't require finding code point
boundaries in the string being searched. Matching /./ does, sometimes.
The common use is /.*/, and in that case you don't have to find every
boundary in the text being matched, only the one at the end or (again,
only in certain cases) the ones you cross if you have to backtrack.
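
To make the /./ and /.*/ cases concrete, here is a rough sketch. It is
not taken from any real engine; every function name in it is
hypothetical. It assumes matching starts at a code point boundary and
that "." excludes only the four ECMAScript line terminators.

```javascript
// Hypothetical helpers for illustration only; not any engine's actual code.
function isLead(u) { return u >= 0xD800 && u <= 0xDBFF; }
function isTrail(u) { return u >= 0xDC00 && u <= 0xDFFF; }
function isLineTerminator(u) {
  return u === 0x0A || u === 0x0D || u === 0x2028 || u === 0x2029;
}

// A single "." at index i: look at one code unit, and at the next one
// only when the first is a lead surrogate. Returns the width in code units.
function dotWidth(s, i) {
  if (i >= s.length || isLineTerminator(s.charCodeAt(i))) return 0;
  const u = s.charCodeAt(i);
  if (isLead(u) && i + 1 < s.length && isTrail(s.charCodeAt(i + 1))) return 2;
  return 1;
}

// Greedy ".*" from index i: a plain code unit scan. It stops at a line
// terminator or at the end of the string; no pairs are examined on the way.
function greedyDotStarEnd(s, i) {
  while (i < s.length && !isLineTerminator(s.charCodeAt(i))) i++;
  return i;
}

// One backtracking step: give back one code point from the range [start, end).
// This is the only place the sketch has to look for a surrogate pair.
function backtrackOne(s, start, end) {
  if (end - start >= 2 &&
      isTrail(s.charCodeAt(end - 1)) && isLead(s.charCodeAt(end - 2))) {
    return end - 2;
  }
  return end - 1;
}
```

The forward scan never decodes a pair because all four line terminators
are BMP characters outside the surrogate range.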

Of course all those things have code-point-oriented *semantics*, which
is great. But an implementation can deliver those semantics without
decoding every code point, and be faster for it.
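
As a minimal sketch of the literal-search case (findCodeUnits is a
made-up name, and a real engine would use a better algorithm than this
naive loop): lead surrogates, trail surrogates, and BMP characters
occupy disjoint code unit ranges, so a well-formed needle cannot match
starting in the middle of a surrogate pair, and comparing code units
gives the code point answer.

```javascript
// Hypothetical naive substring search over UTF-16 code units.
// It never decodes a code point or computes a boundary.
function findCodeUnits(haystack, needle, from = 0) {
  outer:
  for (let i = from; i + needle.length <= haystack.length; i++) {
    for (let j = 0; j < needle.length; j++) {
      if (haystack.charCodeAt(i + j) !== needle.charCodeAt(j)) continue outer;
    }
    return i; // index in code units of the first match
  }
  return -1;
}

// Two astral characters that share the same lead surrogate (0xD83D).
const haystack = "x\u{1F600}\u{1F601}y";
const needle = "\u{1F601}";
const found = findCodeUnits(haystack, needle);
```

Here the search rejects index 1 (same lead, different trail) and index 2
(a trail surrogate can never equal the needle's lead), so the first match
it reports is at a code point boundary.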

I'd like to know what you have in mind regarding quantifiers though.

-j

