Working with grapheme clusters

Norbert Lindenberg ecmascript at
Sat Oct 26 20:05:00 PDT 2013

On Oct 26, 2013, at 5:39 , Bjoern Hoehrmann <derhoermi at> wrote:

> * Norbert Lindenberg wrote:
>> On Oct 25, 2013, at 18:35 , Jason Orendorff <jason.orendorff at> wrote:
>>> UTF-16 is designed so that you can search based on code units
>>> alone, without computing boundaries. RegExp searches fall in this
>>> category.
>> Not if the RegExp is case insensitive, or uses a character class, or ".", or a
>> quantifier - these all require looking at code points rather than UTF-16 code
>> units in order to support the full Unicode character set.
> If you have a regular expression over an alphabet like "Unicode scalar
> values" it is easy to turn it into an equivalent regular expression over
> an alphabet like "UTF-16 code units". I have written a Perl module that
> does it for UTF-8, <>;
> Russ Cox's is a popular
> implementation. In effect it is still as though the implementation used
> Unicode scalar values, but that would be true of any implementation. It
> is much harder to implement something like this for other encodings like
> UTF-7 and Punycode.
> It is useful to keep in mind that features like character classes are
> just syntactic sugar and can be decomposed into regular expression
> primitives, such as a choice listing each member of the character class
> as a literal. The `.` is just a large character class, and flags like
> //i just transform parts of an expression, so that /a/i becomes
> something more like /a|A/.
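The decomposition described above can be checked directly. A minimal sketch (the specific patterns are illustrative, not from the thread): the case-insensitive flag behaves, for this input set, like an explicit alternation.

```javascript
// Case-insensitive matching decomposed into plain alternation:
// /a/i accepts the same single-character inputs as /a|A/.
const caseInsensitive = /^a$/i;
const expanded = /^(?:a|A)$/;

for (const s of ["a", "A", "b"]) {
  // Both patterns agree on every input.
  console.log(caseInsensitive.test(s) === expanded.test(s)); // true
}
```

The same transformation applies mechanically to any literal in the pattern, which is why flags and character classes do not add expressive power over the underlying alphabet.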

OK, if Jason's comment was meant to say that RegExp searches specified in terms of code points can be implemented via an equivalent search over code units, then that's correct. I was assuming we were discussing API design, and requiring developers to write those equivalent UTF-16-based regular expressions themselves (as the current RegExp design does) is a recipe for breakage whenever supplementary characters are involved.


More information about the es-discuss mailing list