Working with grapheme clusters

Jason Orendorff jason.orendorff at gmail.com
Fri Oct 25 18:35:55 PDT 2013


On Thu, Oct 24, 2013 at 7:38 AM, Anne van Kesteren <annevk at annevk.nl> wrote:
> On Thu, Oct 24, 2013 at 3:31 PM, Mathias Bynens <mathias at qiwi.be> wrote:
>> Imagine you’re writing a JavaScript library that escapes a given string as an HTML character reference, or as a CSS identifier, or anything else. In those cases, you don’t care about grapheme clusters, you care about code points, because those are the units you end up escaping individually.
>
> Is that really a common operation? I would expect formatting,
> searching, etc. to dominate. E.g. whenever you do substr/substring you
> would want that to be grapheme-cluster aware.

I think I disagree. Trying to take this apart:

If you're searching, you don't want to use the iterator anyway,
because finding character boundaries or grapheme boundaries is a waste
of time. UTF-16 is designed so that you can search based on code units
alone, without computing boundaries. RegExp searches fall in this
category.
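
A minimal sketch of what I mean by searching on code units alone. The
strings are made up for illustration; indexOf and slice are the
standard String methods.

```javascript
// U+1F4A9 is stored as the surrogate pair \uD83D\uDCA9 (two code units).
const haystack = "abc\uD83D\uDCA9def";
const needle = "\uD83D\uDCA9";

// indexOf compares code units; no boundaries are computed along the way.
const at = haystack.indexOf(needle);                               // 3
const found = haystack.slice(at, at + needle.length) === needle;   // true

// A well-formed needle cannot match starting in the middle of a pair,
// because lead surrogates (D800-DBFF) and trail surrogates (DC00-DFFF)
// occupy disjoint ranges.
const other = "\uD83D\uDCA8";           // U+1F4A8, same lead surrogate
const miss = haystack.indexOf(other);   // -1
```

This only shows the code point case. Whether a match also lands on a
grapheme boundary depends on the text being searched.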

IIUC, "formatting" mostly involves finding patterns to replace—it's a
special case of searching, right?

When you do substr/slice/substring, you should be using offsets that
are on grapheme boundaries, but obtaining offsets by using String
iteration and adding up the lengths will be very rare, I think.
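
To illustrate where such offsets usually come from, here is a sketch
with a made-up input line. The offset is the result of a search, not of
iterating and summing lengths, and it is assumed here that the delimiter
is not followed by a combining mark.

```javascript
// "e\u0301" is e + COMBINING ACUTE ACCENT: one grapheme, two code points.
// "\uD83D\uDCA9" is U+1F4A9: one code point, two code units.
const line = "name=Zoe\u0301 \uD83D\uDCA9";

// The offset comes straight from a code-unit search.
const eq = line.indexOf("=");          // 4

// Slicing at that offset never looks inside the cluster or the pair.
const key = line.slice(0, eq);         // "name"
const value = line.slice(eq + 1);      // "Zoe\u0301 \uD83D\uDCA9"
```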

So String iteration is kind of left looking around for a use case. I
can't think of any that compel me to prefer graphemes over characters
out of sheer practicality. Reversing strings, for example, is not
something I can bring myself to care about. Anyone?
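
For comparison, the escaping case quoted at the top is one where
iterating by code point is exactly what you want. A rough sketch,
assuming an iterator that yields one code point per step (written here
with for-of and codePointAt); escapeNonAscii is a hypothetical helper
name, not an existing API.

```javascript
// Hypothetical helper: escape every non-ASCII code point as a
// hexadecimal HTML character reference.
function escapeNonAscii(str) {
  let out = "";
  // Each step yields one code point, so a surrogate pair arrives whole.
  for (const ch of str) {
    const cp = ch.codePointAt(0);
    out += cp > 0x7F ? "&#x" + cp.toString(16).toUpperCase() + ";" : ch;
  }
  return out;
}

// One code point stored as two code units: a single reference.
const pair = escapeNonAscii("a\uD83D\uDCA9");   // "a&#x1F4A9;"

// One grapheme made of two code points: the combining mark is escaped
// on its own, which is the per-code-point behavior described above.
const cluster = escapeNonAscii("e\u0301");      // "e&#x301;"
```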

-j

