How to count the number of symbols in a string?

Norbert Lindenberg ecmascript at norbertlindenberg.com
Tue Dec 4 14:03:32 PST 2012


On Dec 4, 2012, at 11:43 , David Bruant wrote:

> On 04/12/2012 20:25, Jason Orendorff wrote:
>> On Sat, Dec 1, 2012 at 2:09 AM, Mathias Bynens <mathias at qiwi.be> wrote:
>> 
>>> My guess would be that in 99% of all cases where `String.prototype.length` is used the intention is to count the code points, not the UCS-2/UTF-16 code units.
>>> 
>> I don't think this is right. My guess is that in most cases where it matters either way, the intention is to get a count that's consistent with .charAt(), .indexOf(), .slice(), RegExp match.index, and every other place where string indexes are used.
> I think Twitter has a bug as mentioned earlier in the thread and that's unrelated to consistency with the method you're mentioning.

One example isn't enough to support a "99% of all cases" claim. And I agree with Jason - many uses of String.length are related to some sort of iteration over the code units of the String, and then consistency with indices is critical. Showing the length of a string to the user is a rare (although important) case.
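
A minimal sketch of that distinction, assuming a string containing U+1D306 (a supplementary character stored as two UTF-16 code units). The helper name countCodePoints is hypothetical and only for illustration; it is not part of any proposal in this thread.

```javascript
// "a", U+1D306 (as the surrogate pair D834 DF06), "b"
const s = "a\uD834\uDF06b";

// length counts code units, so it agrees with the indices used by
// charAt, indexOf, and slice.
const codeUnits = s.length;               // 4
const indexOfB = s.indexOf("b");          // 3
const lastChar = s.charAt(s.length - 1);  // "b"

// Hypothetical helper: count code points by treating each well-formed
// surrogate pair as a single unit.
function countCodePoints(str) {
  let count = 0;
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    // A high surrogate followed by a low surrogate forms one code point.
    if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < str.length) {
      const next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // skip the low surrogate
      }
    }
    count++;
  }
  return count;
}

const codePoints = countCodePoints(s);    // 3
```

The code point count is what one would show to a user; the code unit count is what loops and index arithmetic need.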

> I however agree that if something is added to get the actual length, a whole set of methods needs to be added too.

Which proposal are you referring to and agreeing with?

>> That said, of course this is a sensible feature to add; but calling it ".realLength" wouldn't help anyone understand the rather fine distinction at issue.
> Maybe the solution lies in finding the right prefix to define .*length, .*charAt(), .*indexOf(), etc. Maybe "CP" for "code points" .CPlength? .cpLength/cpCharAt/cpIndexOf... ?

"cp" to indicate code point indices? I think using two parallel index systems would only create confusion. Most string processing, including indexOf, works fine with supplementary characters without doing anything special for them. We need to provide a foundation that lets developers easily support supplementary characters in functionality that needs to be aware of them, but in many applications few changes will be required.
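
As a rough illustration of why indexOf needs no special handling, here is a sketch with made-up sample data containing two supplementary characters (U+1D306 and U+1D307). The returned index is in code units, and it is exactly the index that slice expects, so the two stay consistent.

```javascript
// Sample data (an assumption for illustration): two key=value pairs
// whose values are supplementary characters.
const text = "x=\uD834\uDF06;y=\uD834\uDF07";
const needle = "\uD834\uDF07"; // U+1D307

// indexOf compares code unit sequences, so a well-formed surrogate
// pair is found just like any other substring.
const at = text.indexOf(needle);

// The index feeds straight back into slice without any conversion.
const found = text.slice(at, at + needle.length);

// split on an ASCII delimiter also leaves the surrogate pairs intact.
const parts = text.split(";");
```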

> While you're talking about regexps, I think there is an issue with current RegExps. Mathias will know better. Could a new flag solve the issue?

RegExp does require major changes to support supplementary characters. The proposal accepted for ES6 (although not integrated into the spec yet) is at
http://norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html#RegExp

Are you aware of issues not addressed there?
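
To make the RegExp problem concrete, here is a small sketch. It assumes an engine that implements the "u" flag as it was later standardized in ES2015; that is an assumption about the runtime, not a quotation from the proposal text linked above.

```javascript
const tetragram = "\uD834\uDF06"; // U+1D306, one code point, two code units

// Without the u flag, "." matches a single code unit, so a pattern
// anchored around one "." cannot match the whole surrogate pair.
const dotWithoutU = /^.$/.test(tetragram);

// With the u flag, "." matches a whole code point.
const dotWithU = /^.$/u.test(tetragram);

// The same applies to character classes: without u the class contains
// two separate code units; with u it contains one code point.
const classWithoutU = /^[\uD834\uDF06]$/.test(tetragram);
const classWithU = /^[\uD834\uDF06]$/u.test(tetragram);
```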

Norbert



