New full Unicode for ES6 idea
barraclough at apple.com
Sun Feb 19 18:54:24 PST 2012
On Feb 19, 2012, at 3:13 PM, Allen Wirfs-Brock wrote:
>> My implementor's bias is showing, because I expect many engines would use UTF-16 internally and have non-O(1) indexing for strings with the contains-non-BMP-and-BRS-set-to-full-Unicode flag bit.
> A fine implementation, but not observable. Another implementation approach that would preserve O(1) indexing would be to simply have two or three different internal string representations with 1, 2, or 4 byte internal characters. (You can automatically pick the needed character size when the string is created, because strings are immutable and created with their value.) A not-quite-O(1) approach would segment strings into substring spans using such a representation. Representation choice probably depends a lot on what you think are the most common use cases. If it is string processing in JS, then a fast representation is probably what you want to choose. If it is just passing text that is already UTF-8 or UTF-16 encoded from inputs to outputs, then a representation that minimizes transcoding would probably be a higher priority.
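As a rough sketch of the creation-time representation choice described above (pickWidth is a hypothetical helper, and it assumes the string's code points have already been decoded):

    // Scan once at creation and pick the narrowest storage that fits.
    function pickWidth(codePoints) {
      var width = 1;                             // bytes per stored character
      for (var i = 0; i < codePoints.length; i++) {
        if (codePoints[i] > 0xFFFF) return 4;    // supplementary plane: 32-bit
        if (codePoints[i] > 0xFF) width = 2;     // outside Latin-1: at least 16-bit
      }
      return width;
    }

    pickWidth([0x41, 0x42]);      // 1  ("AB")
    pickWidth([0x41, 0x3B1]);     // 2  ("A" plus GREEK SMALL LETTER ALPHA)
    pickWidth([0x41, 0x1D306]);   // 4  ("A" plus a non-BMP character)

Because strings are immutable, the choice never needs to be revisited after creation.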
One way in which the proposal under discussion seems to differ from the previous strawman is in the behavior arising from concatenation of strings ending/beginning with high and low surrogate elements.
How do we want to handle unpaired UTF-16 surrogates in a full-unicode string? I can see three options:
1) Prohibit values that do not map to valid unicode characters from appearing in strings (either throw an exception, or replace with the unicode replacement character).
2) Allow invalid unicode characters in strings, and preserve them over concatenation – ("\uD800" + "\uDC00").length == 2.
3) Allow invalid unicode characters in strings, but allow surrogate pairs to fuse over concatenation – ("\uD800" + "\uDC00").length == 1.
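To make option 3 concrete, the fusion only matters at the join point: concatenation would have to check whether the left operand ends with a high (lead) surrogate and the right operand begins with a low (trail) surrogate. A hedged sketch of that boundary check, assuming ES5-style 16-bit string elements for the inputs (fusesAtBoundary and its helpers are purely illustrative):

    function isHighSurrogate(u) { return u >= 0xD800 && u <= 0xDBFF; }
    function isLowSurrogate(u)  { return u >= 0xDC00 && u <= 0xDFFF; }

    // Would s1 + s2 fuse a surrogate pair across the boundary under option 3?
    function fusesAtBoundary(s1, s2) {
      if (s1.length === 0 || s2.length === 0) return false;
      return isHighSurrogate(s1.charCodeAt(s1.length - 1)) &&
             isLowSurrogate(s2.charCodeAt(0));
    }

    fusesAtBoundary("\uD800", "\uDC00");  // true  -> result length would be 1
    fusesAtBoundary("ab", "cd");          // false -> result length is simply 4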
It seems desirable for full-unicode strings to logically be a sequence of unicode characters, stored and processed in an encoding-agnostic manner. Option 3 would seem to violate that, exposing the underlying UTF-16 implementation. It also loses a distributive property of .length over concatenation that I believe holds in ES5: currently, for all strings s1 & s2:
s1.length + s2.length == (s1 + s2).length
However if we allow concatenation to fuse surrogate pairs into a single character (e.g. s1 = "\uD800", s2 = "\uDC00") this will no longer be true.
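Concretely, in current ES5 semantics the identity holds even for these two strings, and option 3 is exactly the case that would break it:

    var s1 = "\uD800", s2 = "\uDC00";
    s1.length + s2.length === (s1 + s2).length;  // true today: 1 + 1 === 2

    // Under option 3 the pair would fuse into a single character, so
    // (s1 + s2).length would be 1 and the identity would fail.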
I guess I wonder if it's worth considering either option 1) or 2) – either prohibiting invalid unicode characters in strings, or considering something closer to the previous strawman, where string storage is defined to be 32-bit (with a BRS that, instead of changing iteration, would change string creation, introducing an implicit UTF-16 to UTF-32 conversion).
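For what it's worth, the implicit conversion that the previous strawman implies is just a UTF-16 decode at string creation. A hedged sketch (utf16ToCodePoints is a hypothetical helper; this version passes unpaired surrogates through unchanged rather than rejecting them, which is itself one of the policy choices above):

    // Decode a sequence of 16-bit code units into code points.
    function utf16ToCodePoints(units) {
      var out = [];
      for (var i = 0; i < units.length; i++) {
        var u = units[i];
        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < units.length) {
          var next = units[i + 1];
          if (next >= 0xDC00 && next <= 0xDFFF) {
            // Combine the surrogate pair into one supplementary code point.
            out.push(0x10000 + ((u - 0xD800) << 10) + (next - 0xDC00));
            i++;
            continue;
          }
        }
        out.push(u);  // BMP code unit, or an unpaired surrogate left as-is
      }
      return out;
    }

    utf16ToCodePoints([0xD834, 0xDD1E]);  // [0x1D11E] (MUSICAL SYMBOL G CLEF)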