Flexible String Representation - full Unicode for ES6?
rosuav at gmail.com
Fri Dec 21 15:45:05 PST 2012
Hi! I was directed here from the V8 discussion list, hope this is the
right place to raise this.
I've read http://norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html
and some of the related discussion (of which there is a considerable
amount!). The problem with UTF-16 encodings has been biting me in a
project where we allow untrusted users to configure our application by
providing a script from which we call functions. The script is
manipulating text, so it makes good sense to support full Unicode; and
compatibility with older ECMAScript engines/interpreters is not a
significant point. I'm fully aware that compatibility is a major
barrier to change in most situations, though, so I am inclined toward
some form of BRS ("big red switch") as proposed by Brendan Eich.
Some worthwhile reading:
If the language provides a string type that's UTF-16 and then has a
few functions that count code points (as described in the
norbertlindenberg page), the temptation will be strong for programmers
to ignore non-BMP characters, leaving code that is quietly buggy in
the face of surrogates. To truly support full Unicode, the language
has to expose to its programmers *only* Unicode, not some encoding
used to represent Unicode characters in memory. The easiest way to do
this is to store strings as UTF-32, allowing O(1) indexing etc, but
that's really wasteful.
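
To make the gap between code units and code points concrete, here is a
minimal sketch in Python (3.3 or later, where len() counts code points).
The helper name utf16_code_units is made up for this illustration; it
reports the length a UTF-16-based string type would expose.

```python
def utf16_code_units(s: str) -> int:
    """Number of 16-bit code units needed to encode s as UTF-16."""
    # utf-16-le emits no byte-order mark, so every 2 bytes is one code unit.
    return len(s.encode("utf-16-le")) // 2


bmp = "caf\u00e9"        # every character is inside the BMP
astral = "a\U0001F600b"  # one non-BMP character (U+1F600) between two ASCII letters

# For BMP-only text the two counts agree, so the bug stays hidden.
print(len(bmp), utf16_code_units(bmp))

# With a non-BMP character the UTF-16 length is one larger than the
# number of code points, because U+1F600 needs a surrogate pair.
print(len(astral), utf16_code_units(astral))
```

Code that is only ever tested with BMP text sees identical numbers from
both counts, which is why the discrepancy is so easy to overlook.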
There is an alternative. Python (as of version 3.3) has implemented a
new Flexible String Representation, aka PEP-393; the same has existed
in Pike for some time. A string is stored in memory with a fixed
number of bytes per character, based on the highest code point in that
string - if there are any non-BMP characters, 4 bytes; if any
U+0100-U+FFFF, 2 bytes; otherwise 1 byte. This depends on strings
being immutable (otherwise there'd be an annoying string-copy
operation when a too-large character gets put in), which is true of
ECMAScript. Effectively, all strings are stored in UCS-4/UTF-32, but
with the leading 0 bytes elided when they're not needed.
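
The width rule described above can be sketched as follows. This is only
an illustration of the idea, not the actual layout used by CPython or
Pike: the function names bytes_per_char and payload_bytes are
hypothetical, and per-object header overhead is ignored.

```python
def bytes_per_char(s: str) -> int:
    """Pick a fixed width from the highest code point in the string."""
    highest = max(map(ord, s), default=0)  # empty string counts as width 1
    if highest > 0xFFFF:   # any non-BMP character
        return 4
    if highest > 0xFF:     # any character in U+0100-U+FFFF
        return 2
    return 1               # everything fits in one byte


def payload_bytes(s: str) -> int:
    """Bytes of character data under the flexible scheme (headers ignored)."""
    return len(s) * bytes_per_char(s)


for sample in ("hello", "caf\u00e9", "\u20ac100", "a\U0001F600b"):
    # Columns: string, chosen width, flexible size, fixed UTF-32 size.
    print(repr(sample), bytes_per_char(sample),
          payload_bytes(sample), 4 * len(sample))
```

Only the last sample, which actually contains an astral character, costs
as much as plain UTF-32; the others use a quarter or half of that.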
Most scripts are going to have a large number of pure-ASCII strings in
them - variable names, identifiers, HTML tags, etc. These would
benefit from a switch to Pike-strings. And any strings that don't
actually have astral characters in them would suffer no penalty. Only
strings that are actually affected need pay the price. And we could
then trust that no surrogates ever get separated during transmission.