New full Unicode for ES6 idea
brendan at mozilla.com
Tue Feb 21 09:55:15 PST 2012
Phillips, Addison wrote:
> Because it has always been possible, it’s difficult to say how many
> scripts have transported byte-oriented data by “punning” the data into
> strings. Actually, I think this is more likely to be truly binary data
> rather than text in some non-Unicode character encoding, but anything
> is possible, I suppose. This could include using non-character values
> like “FFFE”, “FFFF” in addition to the surrogates. A BRS-running
> implementation would break a script that relied on String being a
> sequence of 16-bit unsigned integer values with no error checking.
Allen's view of the BRS-enabled semantics would have 16-bit "GIGO"
without exceptions -- you'd be storing 16-bit values, whatever their
source (including "\uXXXX" literals spelling invalid characters and
unmatched surrogates) in at-least-21-bit elements of strings, and
reading them back.
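
To make the "GIGO" round trip concrete, here is a minimal sketch of how strings behave today with 16-bit elements. The variable names are illustrative only, and the sample values are arbitrary ones chosen to include a lone surrogate and two noncharacters. As I read Allen's view, the same values would read back unchanged with the BRS enabled, just held in wider elements.

```javascript
// Arbitrary 16-bit values: a letter, a lone high surrogate,
// two noncharacters, and zero. (Illustrative data only.)
const units = [0x0041, 0xD800, 0xFFFE, 0xFFFF, 0x0000];

// Pack the values into a string; no validation takes place.
const packed = String.fromCharCode(...units);

// Read every element back as an unsigned 16-bit integer.
const unpacked = [];
for (let i = 0; i < packed.length; i++) {
  unpacked.push(packed.charCodeAt(i));
}

// A "\uXXXX" literal spelling an unmatched surrogate is stored as-is too.
const literal = "\uDC00";
```

Nothing throws here: `unpacked` holds the same five values as `units`, and `literal` is a one-element string whose only element is 0xDC00.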
My concern, and my reason for advocating early or late errors on
shenanigans, was that people who today write surrogate pairs literally,
and then take extra pains in JS or C++ (whatever the host language might
be) to process them as single code points and characters, would be
broken by the BRS-enabled behavior of separating the parts into distinct
code points.
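
For reference, the "extra pains" in question look roughly like the sketch below. `codePointFromPair` is a hypothetical helper written for this illustration, not code from any particular script; it assumes the string holds UTF-16 code units, as strings do today.

```javascript
// Hypothetical helper: return the code point that starts at index i,
// combining a high/low surrogate pair when one is present.
function codePointFromPair(str, i) {
  const hi = str.charCodeAt(i);
  const lo = str.charCodeAt(i + 1); // NaN when i + 1 is out of range
  const isPair =
    hi >= 0xD800 && hi <= 0xDBFF && lo >= 0xDC00 && lo <= 0xDFFF;
  // Standard UTF-16 decoding arithmetic for a surrogate pair.
  return isPair ? ((hi - 0xD800) << 10) + (lo - 0xDC00) + 0x10000 : hi;
}

// U+1D11E MUSICAL SYMBOL G CLEF, written literally as a surrogate pair.
const clef = "\uD834\uDD1E";
const cp = codePointFromPair(clef, 0); // 0x1D11E
```

Today `clef.length` is 2, and the helper reassembles the two elements into the single code point 0x1D11E. Code of this shape depends on both halves of the pair remaining visible as separate string elements.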
But that's pessimistic. It could happen, but OTOH anyone coding
surrogate pairs might want them to read back piece-wise when indexing.
In that case what Allen proposes, storing each formerly 16-bit code
unit, however expressed, in the wider 21-or-more-bits unit, and reading
back likewise, would "just work".
Sorry if this is all obvious. Mainly I want to throw in my lot with
Allen's exception-free literal/constructor approach. The encoding APIs
should throw on invalid Unicode but literals and strings as immutable
16-bit storage buffers should work as today.
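
A small sketch of the split I mean, using `encodeURIComponent` as a stand-in for "the encoding APIs" (that choice is mine for illustration; no specific API is named above). It already behaves this way today: the literal is accepted without complaint, and the error only appears when the string has to be encoded as UTF-8.

```javascript
// An unmatched high surrogate; the literal itself is accepted.
const lone = "\uD800";

let threw = false;
try {
  // UTF-8 has no encoding for a lone surrogate, so this throws.
  encodeURIComponent(lone);
} catch (e) {
  threw = e instanceof URIError;
}

// A well-formed pair (U+1D11E) encodes without error.
const encoded = encodeURIComponent("\uD834\uDD1E"); // "%F0%9D%84%9E"
```

Here `threw` ends up `true`, while `lone.length` is still 1 and `lone.charCodeAt(0)` still reads back 0xD800.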