Unicode normalization problem

Mathias Bynens mathiasb at opera.com
Thu Apr 2 07:47:07 UTC 2015


On Thu, Apr 2, 2015 at 1:39 AM, Andrea Giammarchi
<andrea.giammarchi at gmail.com> wrote:
> Jordan, the purpose of `Array.from` is to iterate over the string, and the point of iterating instead of splitting is to get code points automatically. That is, unless I've misunderstood Mathias's presentation (I might have).
>
> So there is a different problem here: there are code points that do not correspond to a visible symbol on their own ...

Those are called grapheme clusters or just “graphemes”, as Boris
mentioned. And here’s how to deal with them:
https://mathiasbynens.be/notes/javascript-unicode#other-grapheme-clusters

“Unicode Standard Annex #29 describes [an algorithm for determining
grapheme cluster
boundaries](http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries).
For a _completely_ accurate solution that works for all Unicode
scripts, implement this algorithm in JavaScript, and then count each
grapheme cluster as a single symbol.”
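
To make the idea concrete, here is a minimal sketch that only approximates
grapheme clusters: it glues combining marks (`\p{M}`) onto the preceding code
point. This is an assumption-laden simplification, not the UAX #29 algorithm
(it ignores Hangul jamo, ZWJ sequences, regional indicators, and so on), and
the function name `approximateGraphemes` is hypothetical. It also assumes an
engine that supports Unicode property escapes in regular expressions.

```javascript
// Simplified approximation only: attach combining marks (\p{M}) to the
// code point that precedes them. NOT a full UAX #29 implementation.
function approximateGraphemes(string) {
  const clusters = [];
  // for...of walks the string by code point, not by UTF-16 code unit.
  for (const codePoint of string) {
    const isCombiningMark = /^\p{M}$/u.test(codePoint);
    if (isCombiningMark && clusters.length > 0) {
      // Extend the previous cluster instead of starting a new one.
      clusters[clusters.length - 1] += codePoint;
    } else {
      clusters.push(codePoint);
    }
  }
  return clusters;
}

// 'mañana' written with a combining tilde, followed by U+1D306.
const sample = 'man\u0303ana\u{1D306}';
const clusters = approximateGraphemes(sample);
console.log(sample.length);             // 9 (UTF-16 code units)
console.log(Array.from(sample).length); // 8 (code points)
console.log(clusters.length);           // 7 (approximate graphemes)
```

For anything beyond simple combining marks, a real implementation of the
boundary rules in the annex is still needed.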

> or maybe, the real problem, is about broken `Array.from` polyfill?

`Array.from` just uses `String.prototype[Symbol.iterator]` internally,
and that is defined to deal with code points, not grapheme clusters.
Either choice would have confused some developers. IIRC, Perl 6 has
built-in capabilities to deal with grapheme clusters, but until ES
does, this use case must be addressed in user-land.
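
A short illustration of the difference, using only standard string and array
methods (the variable names are just for this example): splitting on the empty
string yields UTF-16 code units, `Array.from` yields code points, and neither
merges a base letter with its combining mark.

```javascript
// U+1D306 is a single code point stored as two UTF-16 code units.
const astral = '\u{1D306}';
const codeUnits = astral.split('');     // splits surrogate pairs apart
const codePoints = Array.from(astral);  // uses String.prototype[Symbol.iterator]

console.log(codeUnits.length);  // 2
console.log(codePoints.length); // 1

// 'n' followed by U+0303 COMBINING TILDE renders as one symbol,
// but it is still two code points.
const combined = 'n\u0303';
const combinedPoints = Array.from(combined);

console.log(combinedPoints.length); // 2
```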


More information about the es-discuss mailing list