Working with grapheme clusters

Claude Pache claude.pache at gmail.com
Thu Oct 24 08:16:53 PDT 2013


On 24 Oct 2013, at 16:24, Mathias Bynens <mathias at qiwi.be> wrote:

> 
>> 	text.graphemeAt(0) // get the first grapheme of the text
>> 
>> 	// shorten a text to its first hundred graphemes
>> 	var shortenText = ''
>> 	let numGraphemes = 0
>> 	for (let grapheme of text) {
>> 		numGraphemes += 1
>> 		if (numGraphemes > 100) {
>> 			shortenText += '…'
>> 			break
>> 		}
>> 		shortenText += grapheme
>> 	}
> 
> So, you would want to change the string iterator’s behavior too?

At least, I'd like to have the opportunity to iterate over what I need. I have no opinion on whether iterating through code points or through grapheme clusters should be the default iterator, or whether there should be no default at all, forcing the developer to consciously pick the one they really mean.
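
To make the "consciously pick one" option concrete, here is a minimal sketch of three explicit iterators. The helper names `codeUnits`, `codePoints` and `graphemes` are hypothetical and are not proposed in this thread. The grapheme variant assumes a runtime that provides `Intl.Segmenter`, a later ECMA-402 API that is not part of this discussion.

```javascript
// Hypothetical helpers, named for illustration only.

// Iterate over UTF-16 code units (what indexing and .length see).
function* codeUnits(text) {
  for (let i = 0; i < text.length; i++) {
    yield text[i];
  }
}

// Iterate over code points, as String.prototype[@@iterator] is specified to do.
function* codePoints(text) {
  yield* text;
}

// Iterate over grapheme clusters.
// Assumption: Intl.Segmenter is available in the runtime.
function* graphemes(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  for (const { segment } of segmenter.segment(text)) {
    yield segment;
  }
}

// "e" + U+0301 COMBINING ACUTE ACCENT, followed by U+1F600 (an astral symbol)
const sample = 'e\u0301\u{1F600}';

console.log([...codeUnits(sample)].length);  // 4
console.log([...codePoints(sample)].length); // 3
console.log([...graphemes(sample)].length);  // 2
```

Whichever of these ends up as the default, having all three available by name lets the caller state which unit they mean.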

> 
>> As a side note, I wonder whether `String.prototype.symbolAt`/`String.prototype.at`, as proposed in a recent thread, and `String.prototype[@@iterator]`, as currently specified, are really what people need, or whether people would mistakenly use them with the intended meaning of `String.prototype.graphemeAt` and `String.prototype.graphemes` as discussed in the present message.
> 
> I don’t think this would be an issue. The new `String` methods and the iterator are well-defined and documented in terms of *code points*.
> 
> IMHO combining marks are easy enough to match and special-case in your code if that’s what you need. You could use a regular expression to iterate over all grapheme clusters in the string:
> 
>    // Based on the example on http://mathiasbynens.be/notes/javascript-unicode#accounting-for-other-combining-marks
>    var regexGraphemeCluster = /([\0-\u02FF\u0370-\u1DBF\u1E00-\u20CF\u2100-\uD7FF\uDC00-\uFE1F\uFE30-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF])([\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]*)/g;

Note that the specification in [UAX29], section 3.1, for determining grapheme cluster boundaries relies on more than the notion of "combining marks". I fear that, for some exotic scripts (apparently, at least Hangul), finding a grapheme cluster is more complicated than finding a base character followed by a span of combining marks.
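
As a rough illustration of the Hangul case, the sketch below runs the regular expression quoted above against two inputs. The helper `countRegexClusters` is a hypothetical name introduced only for this example. The second input is a pair of conjoining jamo (a leading consonant followed by a vowel), which rule GB6 of [UAX29] keeps together as a single grapheme cluster even though neither character falls in the combining-mark ranges of the regex.

```javascript
// The regular expression from the quoted message.
const regexGraphemeCluster = /([\0-\u02FF\u0370-\u1DBF\u1E00-\u20CF\u2100-\uD7FF\uDC00-\uFE1F\uFE30-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF])([\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]*)/g;

// Hypothetical helper: count how many "clusters" the regex finds in a string.
function countRegexClusters(text) {
  const matches = text.match(regexGraphemeCluster);
  return matches === null ? 0 : matches.length;
}

// "e" + U+0301 COMBINING ACUTE ACCENT: a base character plus a combining mark.
const accented = 'e\u0301';

// U+1100 HANGUL CHOSEONG KIYEOK + U+1161 HANGUL JUNGSEONG A:
// one grapheme cluster per UAX29, but no combining mark is involved.
const jamo = '\u1100\u1161';

console.log(countRegexClusters(accented)); // 1
console.log(countRegexClusters(jamo));     // 2, where UAX29 expects 1
```

So the combining-mark approach handles the accented letter but splits the jamo sequence in two.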

—Claude


[UAX29]: http://www.unicode.org/reports/tr29/ "Unicode Standard Annex #29: Unicode text segmentation."

