On `String.prototype.codePointAt` and `String.fromCodePoint`

Bjoern Hoehrmann derhoermi at gmx.net
Wed Sep 25 10:59:56 PDT 2013


* Anne van Kesteren wrote:
>I think I'm convinced that String.fromCodePoint()'s design is correct,
>especially since the rendering subsystem deals with code points too.
>String.prototype.codePointAt() however still feels wrong since you
>always need to iterate from the start to get the correct code *unit*
>offset anyway so why would you use it rather than the code *point*
>iterator that is planned for inclusion?

UTF-16 is a self-synchronizing code: in a properly formed string you need
to move at most one code unit (the unit `.length` counts) to get to a
proper `.codePointAt` index. You only need to start from the beginning if you care
about what is between the start and the given index position. If you
want to treat proper surrogate pairs as one unit for counting, then
`.codePointAt` lets you do

  var ix = 0, count = 0;
  while (ix < s.length) {
    ix += s.codePointAt(ix) > 0xFFFF;  // extra step over a surrogate pair
    ix += 1;
    count += 1;
  }

That perhaps also illustrates why making the method return a replacement
character for unpaired surrogates is a bad idea: you may violate

  count_unicode(s1 + s2) === count_unicode(s1) + count_unicode(s2)

if this concatenates two halves of a surrogate pair. The `.codePointAt`
method is for random indexing; iterators are for sequential access.
Random indexing into strings is rare except for a few special positions,
but it happens through user input for instance (give me the Unicode
scalar value of the first character of the current text selection).
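
To make the random-indexing case concrete, here is a minimal sketch. The
helper names `alignToCodePoint`, `countUnicode` and
`firstCodePointOfSelection` are hypothetical and not part of any proposal;
the sketch only assumes `String.prototype.codePointAt` behaves as currently
drafted, returning an unpaired surrogate as-is.

```javascript
// Hypothetical helper: step back at most one code unit so that the index
// points at the start of a code point. No scan from the beginning is needed.
function alignToCodePoint(s, ix) {
  var isHigh = function (u) { return u >= 0xD800 && u <= 0xDBFF; };
  var isLow = function (u) { return u >= 0xDC00 && u <= 0xDFFF; };
  if (ix > 0 && isLow(s.charCodeAt(ix)) && isHigh(s.charCodeAt(ix - 1))) {
    return ix - 1; // ix was in the middle of a proper surrogate pair
  }
  return ix;
}

// Hypothetical helper: count units, treating each proper surrogate pair as
// one unit and each unpaired surrogate as one unit of its own.
function countUnicode(s) {
  var count = 0;
  for (var ix = 0; ix < s.length; ix += 1) {
    if (s.codePointAt(ix) > 0xFFFF) {
      ix += 1; // skip the second half of the pair
    }
    count += 1;
  }
  return count;
}

// Hypothetical helper: code point at the start of a selection, given the
// selection start as a code unit offset (which is what user input tends to
// supply). For an unpaired surrogate the result is not a scalar value.
function firstCodePointOfSelection(s, selectionStart) {
  return s.codePointAt(alignToCodePoint(s, selectionStart));
}
```

With `s = "a\u{1F600}b"` (four code units), `alignToCodePoint(s, 2)` gives 1,
`countUnicode(s)` gives 3, and `firstCodePointOfSelection(s, 2)` gives
0x1F600, without looking at anything before index 1.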
-- 
Björn Höhrmann · mailto:bjoern at hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 

