Unicode normalization problem

Claude Pache claude.pache at gmail.com
Thu Apr 2 11:15:00 UTC 2015


> On 2 Apr 2015, at 01:22, Jordan Harband <ljharb at gmail.com> wrote:
> 
> Unfortunately we don't have a String#codepoints or something that would return the number of code points as opposed to the number of characters (that "length" returns) - something like that imo would greatly simplify explaining the differences to people.
> 
> For the time being, I've been explaining that some characters are actually made up of two, and the 💩 character (it's a fun example to use) is an example of two characters combining to make one "code point". It's not a quick or trivial thing to explain but people do seem to grasp it eventually.

And when they think they have understood, they are in fact still in great trouble, because they will confuse it with other, unrelated issues such as grapheme clusters and precomposed characters.
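
To make one of those unrelated issues concrete, here is a minimal sketch of the precomposed-character case, assuming an engine that implements the ES2015 String.prototype.normalize. The variable names are only illustrative. Neither string contains a code point above U+FFFF, so nothing here involves the encoding.

```javascript
// Two spellings of "é"; both use only code points below U+10000.
const precomposed = "\u00E9";  // U+00E9 LATIN SMALL LETTER E WITH ACUTE
const decomposed = "e\u0301";  // "e" followed by U+0301 COMBINING ACUTE ACCENT

// They render alike, but they are different strings of different lengths.
const lengths = [precomposed.length, decomposed.length];  // [1, 2]
const sameBeforeNormalizing = precomposed === decomposed; // false

// Normalizing to NFC composes the pair into the single code point.
const sameAfterNormalizing = decomposed.normalize("NFC") === precomposed; // true
```

Here the length differs because the number of code points differs, which is a separate matter from how any one code point is encoded.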

The issue here is specific to the UTF-16 encoding, in which some Unicode code points (those above U+FFFF) are encoded as a sequence of two 16-bit units, whereas ES strings are (by an accident of history) sequences of 16-bit units, not of Unicode code points. I think it is important to stress that this is an encoding issue, if only to have a chance of distinguishing it from the other issues mentioned above.
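
A rough sketch of that encoding step, assuming an ES2015 engine (for the "\u{...}" escape and string iteration). The function toUtf16CodeUnits is a hypothetical helper written for this message, not a built-in, and it does not validate its input.

```javascript
// Hypothetical helper: the UTF-16 code units for one Unicode code point.
function toUtf16CodeUnits(codePoint) {
  if (codePoint <= 0xFFFF) {
    return [codePoint];                 // fits in a single 16-bit unit
  }
  const offset = codePoint - 0x10000;   // 20 bits remain to be split
  return [
    0xD800 + (offset >> 10),            // high surrogate: top 10 bits
    0xDC00 + (offset & 0x3FF),          // low surrogate: bottom 10 bits
  ];
}

const poo = "\u{1F4A9}";                        // one code point, U+1F4A9
const pooUnits = toUtf16CodeUnits(0x1F4A9);     // [0xD83D, 0xDCA9]
const codeUnitCount = poo.length;               // 2: length counts 16-bit units
const codePointCount = Array.from(poo).length;  // 1: iteration walks code points
```

So "length" reports 2 for this one code point only because of how it is encoded.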

(So, taking your example, the 💩 character is internally represented as a sequence of two 16-bit units, not two “characters”. And, very confusingly, the String methods that contain “char” in their name, such as charAt and charCodeAt, have nothing to do with “characters”: they operate on 16-bit units.)
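
A short illustration of the “char” methods next to their code-point counterparts, assuming an engine with the ES2015 additions codePointAt and String.fromCodePoint. The variable names are mine.

```javascript
const text = "a\u{1F4A9}";  // "a" followed by U+1F4A9: three 16-bit units

// The "char" methods index and return 16-bit units.
const unitAt1 = text.charAt(1);          // "\uD83D", a lone high surrogate
const unitCodeAt1 = text.charCodeAt(1);  // 0xD83D

// codePointAt also indexes by 16-bit unit, but decodes a surrogate pair.
const codePointAt1 = text.codePointAt(1);  // 0x1F4A9

// fromCharCode keeps only the low 16 bits of its argument.
const truncated = String.fromCharCode(0x1F4A9);   // "\uF4A9"
const intact = String.fromCodePoint(0x1F4A9);     // "\u{1F4A9}"
```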

—Claude


