Working with grapheme clusters

Mathias Bynens mathias at qiwi.be
Thu Oct 24 07:24:29 PDT 2013


On 24 Oct 2013, at 16:02, Claude Pache <claude.pache at gmail.com> wrote:

> Therefore, I propose the following basic operations to operate on grapheme clusters:

Out of curiosity, is there any programming language that operates on grapheme clusters (rather than code points) by default? FWIW, code point iteration is what I’d expect in any language.

> 	text.graphemeAt(0) // get the first grapheme of the text
> 
> 	// shorten a text to its first hundred graphemes
> 	var shortenText = ''
> 	let numGraphemes = 0
> 	for (let grapheme of text) {
> 		numGraphemes += 1
> 		if (numGraphemes > 100) {
> 			shortenText += '…'
> 			break
> 		}
> 		shortenText += grapheme
> 	}

So, you would want to change the string iterator’s behavior too?

> As a side note, I ask whether the `String.prototype.symbolAt`/`String.prototype.at` as proposed in a recent thread, and the `String.prototype[@@iterator]` as currently specified, are really what people need, or if they would mistakenly use them with the intended meaning of `String.prototype.graphemeAt` and `String.prototype.graphemes` as discussed in the present message?

I don’t think this would be an issue. The new `String` methods and the iterator are well-defined and documented in terms of *code points*.
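
To illustrate what "in terms of code points" means in practice, here is a minimal sketch. It assumes an engine that implements the ES6 string iterator, `Array.from`, and `String.prototype.codePointAt`; the variable names are only for illustration.

```javascript
// One astral symbol: U+1F4A9, stored as a surrogate pair (two UTF-16 code units).
var astral = '\uD83D\uDCA9';
// One user-perceived character made of two code points:
// 'e' followed by U+0301 COMBINING ACUTE ACCENT.
var combined = 'e\u0301';

// `length` counts UTF-16 code units.
var astralUnits = astral.length; // 2

// `Array.from` uses the string iterator, which walks code points.
var astralPoints = Array.from(astral).length; // 1
var combinedPoints = Array.from(combined).length; // 2

// `codePointAt` reads the whole code point at a code unit index.
var firstCodePoint = astral.codePointAt(0); // 0x1F4A9
```

The surrogate pair is handled as a single unit, while the base letter and its combining mark stay separate. That is the documented behavior, not a grapheme-level one.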

IMHO combining marks are easy enough to match and special-case in your code if that’s what you need. You could use a regular expression to iterate over all grapheme clusters in the string:

    // Based on the example on http://mathiasbynens.be/notes/javascript-unicode#accounting-for-other-combining-marks
    var regexGraphemeCluster = /([\0-\u02FF\u0370-\u1DBF\u1E00-\u20CF\u2100-\uD7FF\uDC00-\uFE1F\uFE30-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF])([\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]*)/g;
    
    var zalgo = 'Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞';
    
    zalgo.match(regexGraphemeCluster);
    [
      "Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍",
      "A̴̵̜̰͔ͫ͗͢",
      "L̠ͨͧͩ͘",
      "G̴̻͈͍͔̹̑͗̎̅͛́",
      "Ǫ̵̹̻̝̳͂̌̌͘",
      "!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞"
    ]
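
The same regular expression could also cover the "shorten a text" use case quoted above. This is only a rough sketch: `shortenToClusters` is a hypothetical helper name, and the pattern only knows about the combining-mark blocks it lists. It does not implement the full Unicode segmentation rules, so Hangul jamo sequences and the like are not handled.

```javascript
// Base symbol (BMP character, surrogate pair, or lone high surrogate)
// followed by any number of combining marks from the listed blocks.
var regexGraphemeCluster = /([\0-\u02FF\u0370-\u1DBF\u1E00-\u20CF\u2100-\uD7FF\uDC00-\uFE1F\uFE30-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF])([\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]*)/g;

// Hypothetical helper: keep at most `maxClusters` matches of the pattern
// and append an ellipsis if anything was cut off.
function shortenToClusters(text, maxClusters) {
  // `match` with a global regex returns every match, or null if there is none.
  var clusters = text.match(regexGraphemeCluster) || [];
  if (clusters.length <= maxClusters) {
    return text; // short enough, return the input unchanged
  }
  return clusters.slice(0, maxClusters).join('') + '\u2026';
}

var shortened = shortenToClusters('e\u0301a\u0300bc', 2);
// 'e\u0301a\u0300\u2026' (both accented letters kept whole, then the ellipsis)
```

Slicing the array of matches instead of the string itself is what keeps a base symbol together with its combining marks.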

