Code points vs Unicode scalar values
Anne van Kesteren
annevk at annevk.nl
Fri Sep 6 04:16:33 PDT 2013
On Thu, Sep 5, 2013 at 8:07 PM, Norbert Lindenberg
<ecmascript at lindenbergsoftware.com> wrote:
> ... they're not meant to interpret the code points beyond that, and some processing
> (such as test cases) may depend on them being preserved.
Since when are test cases a use case? And why can't a test case use
the more difficult route?
I think ideally a string in a language can only represent Unicode
scalar values; i.e. it maps perfectly back and forth to UTF-8 (or
indeed UTF-16, although people shouldn't use that).
In ECMAScript a string is a sequence of 16-bit code units with some
sort of UTF-16 layer on top in various scenarios.
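A quick sketch of what that means in practice (any non-BMP character will do as the example):

```javascript
// ECMAScript strings are sequences of 16-bit code units; a character
// outside the BMP occupies two units (a surrogate pair).
const s = "\u{1F600}";                      // one code point, U+1F600
console.log(s.length);                      // 2: length counts code units
console.log(s.charCodeAt(0).toString(16));  // "d83d" (high surrogate)
console.log(s.charCodeAt(1).toString(16));  // "de00" (low surrogate)

// Nothing stops a program from constructing a string containing a
// lone surrogate, i.e. a code point that is not a scalar value:
const broken = "\uD800";                    // a perfectly legal JS string
console.log(broken.length);                 // 1
```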
Now we're adding another layer on top of strings, but instead of
exposing ideal strings (Unicode scalar values) we go with some kludge
to serve edge cases (whose scenarios have not been fully explained
thus far) that are better served using the "low-level" 16-bit code
units.

You suggest this is a policy matter, but I do not think it is at all.
Unicode scalar values are the Unicode code points that can be
represented in any environment; this is not true for Unicode code
points in general (lone surrogates cannot be). This is not about
policy at all, but rather about what a string ought to be.
> Adding code point indexing to 16-bit code unit strings would add significant performance overhead.
Agreed. I don't think we need the *At method for now. Use the iterator.