New full Unicode for ES6 idea
Andrew Oakley
andrew at ado.is-a-geek.net
Mon Feb 20 06:56:37 PST 2012
Most content actually only tries to access characters of a string like this:
for (var i = 0; i < str.length; i++) {
    str[i];
}
A naive implementation using UTF-8 encoded strings would be O(n^2) on
such a loop, but if the result of the previous lookup is cached it is
possible to achieve reasonably fast O(n) behaviour. It feels like some
kind of iterator would be more efficient, but I don't think iterators
would "feel right" in ECMAScript.
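To make the caching idea concrete, here is a minimal sketch rather than a proposed implementation. The class name Utf8String, its fields, and the steps counter are all hypothetical, and it assumes an environment that provides TextEncoder. It remembers the byte offset of the last code point looked up, so a forward loop only ever skips one UTF-8 sequence per iteration:

```javascript
// Hypothetical sketch: a string stored as UTF-8 bytes, indexed by code point.
class Utf8String {
  constructor(text) {
    this.bytes = new TextEncoder().encode(text); // UTF-8 encoded bytes
    this.lastIndex = 0;  // code point index of the cached position
    this.lastOffset = 0; // byte offset of the cached position
    this.steps = 0;      // sequences skipped so far (for demonstration only)
  }

  // Length in bytes of the UTF-8 sequence that starts with this lead byte.
  static seqLength(lead) {
    if (lead < 0x80) return 1;
    if (lead < 0xe0) return 2;
    if (lead < 0xf0) return 3;
    return 4;
  }

  codePointAt(index) {
    // Only a backwards seek forces a rescan from the start.
    if (index < this.lastIndex) {
      this.lastIndex = 0;
      this.lastOffset = 0;
    }
    // Walk forward from the cached position, one sequence at a time.
    while (this.lastIndex < index && this.lastOffset < this.bytes.length) {
      this.lastOffset += Utf8String.seqLength(this.bytes[this.lastOffset]);
      this.lastIndex++;
      this.steps++;
    }
    if (this.lastOffset >= this.bytes.length) return undefined;

    // Decode the sequence at the cached offset.
    const b = this.bytes;
    const o = this.lastOffset;
    const n = Utf8String.seqLength(b[o]);
    if (n === 1) return b[o];
    let cp = b[o] & (0xff >> (n + 1)); // payload bits of the lead byte
    for (let k = 1; k < n; k++) cp = (cp << 6) | (b[o + k] & 0x3f);
    return cp;
  }
}
```

With this cache, a loop calling codePointAt(0), codePointAt(1), ... over the whole string does total work proportional to the string's length. Random or backwards access still degrades to a rescan in this sketch.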
You can encode unmatched surrogates in UTF-8 (although they may have to
be removed before the string is passed to the browser DOM code), so it
may be possible to simply always encode strings in UTF-8. That would
allow much simpler sharing of strings between code that wants UTF-8
support and code that is using the old model, at the expense of more
complex behaviour where UTF-16 surrogates are referenced.
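As an illustration of encoding unmatched surrogates, the sketch below converts a string of UTF-16 code units to UTF-8-style bytes. The function name encodeLenient is hypothetical. It assumes a lone surrogate is simply written as the three-byte sequence its value would produce, which strict UTF-8 does not permit, so such output would need cleaning before being handed to code that expects well-formed UTF-8:

```javascript
// Hypothetical sketch: encode UTF-16 code units as UTF-8-style bytes,
// letting unmatched surrogates through as 3-byte sequences.
function encodeLenient(str) {
  const out = [];
  for (let i = 0; i < str.length; i++) {
    let cp = str.charCodeAt(i);
    // Combine a valid high/low surrogate pair into one code point.
    if (cp >= 0xd800 && cp <= 0xdbff && i + 1 < str.length) {
      const lo = str.charCodeAt(i + 1);
      if (lo >= 0xdc00 && lo <= 0xdfff) {
        cp = 0x10000 + ((cp - 0xd800) << 10) + (lo - 0xdc00);
        i++; // the low surrogate has been consumed
      }
    }
    if (cp < 0x80) {
      out.push(cp);
    } else if (cp < 0x800) {
      out.push(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
    } else if (cp < 0x10000) {
      // Unmatched surrogates (0xD800..0xDFFF) end up in this branch.
      out.push(0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
    } else {
      out.push(0xf0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3f),
               0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
    }
  }
  return Uint8Array.from(out);
}
```

A matched pair becomes an ordinary four-byte sequence, while a lone surrogate survives as three bytes, so code using the old model could still read back the same 16-bit values.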
Issues only arise in code that tries to treat a string as an array of
16-bit integers, and I don't think we should be particularly bothered by
performance of code which misuses strings in this fashion (but clearly
this should still work without opt-in to new string handling).
I think this is a nicer and more flexible model than string
representations being dependent on which heap they came from - all
issues related to encoding can be contained in the String object
implementation.
While this is being discussed, for any new string handling I think we
should make any invalid strings (according to the rules in Unicode)
cause some kind of exception on creation.
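A rough sketch of what rejecting invalid strings on creation might look like follows. The function name createCheckedString and the choice of RangeError are assumptions for illustration only, and the check covers just unmatched surrogates, not every rule in Unicode:

```javascript
// Hypothetical sketch: refuse to create a string containing
// an unmatched surrogate code unit.
function createCheckedString(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // A high surrogate must be followed by a low surrogate.
      const next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
      if (next < 0xdc00 || next > 0xdfff) {
        throw new RangeError("unmatched high surrogate at index " + i);
      }
      i++; // skip the low surrogate that completes the pair
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      // A low surrogate with no preceding high surrogate.
      throw new RangeError("unmatched low surrogate at index " + i);
    }
  }
  return str;
}
```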
--
Andrew Oakley