New full Unicode for ES6 idea

Andrew Oakley andrew at ado.is-a-geek.net
Mon Feb 20 06:56:37 PST 2012


Most content actually only tries to access characters of a string like this:

for (var i = 0; i < str.length; i++) {
	str[i];
}

A naive implementation using UTF-8 encoded strings would make this loop
O(n^2), but if the result of the previous lookup is cached it is
possible to achieve reasonably fast O(n) behaviour.  It feels like some
kind of iterator would be more efficient, but I don't think iterators
would "feel right" in ECMAScript.
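
As a rough illustration of the caching idea, here is a minimal sketch of
a string stored as UTF-8 bytes that remembers the byte offset of the
last code point it looked up.  The names Utf8String, sequenceLength and
codePointAt are hypothetical, the input is assumed to be well-formed
UTF-8, and a real engine would do this inside the String implementation
rather than in script.

```javascript
// Hypothetical sketch: UTF-8 backed string with a one-entry lookup cache.
function Utf8String(bytes) {
  this.bytes = bytes;     // Uint8Array, assumed to hold well-formed UTF-8
  this.cachedIndex = 0;   // code point index of the last lookup
  this.cachedOffset = 0;  // byte offset where that code point starts
}

// Number of bytes in the sequence that starts with the given lead byte.
function sequenceLength(lead) {
  if (lead < 0x80) return 1;
  if (lead < 0xE0) return 2;
  if (lead < 0xF0) return 3;
  return 4;
}

// Return the code point at the given code point index, or undefined
// if the index is past the end of the string.
Utf8String.prototype.codePointAt = function (index) {
  var i = this.cachedIndex;
  var offset = this.cachedOffset;
  if (index < i) {        // moving backwards: restart from the beginning
    i = 0;
    offset = 0;
  }
  // Walk forward from the cached position; one step for a sequential loop.
  while (i < index && offset < this.bytes.length) {
    offset += sequenceLength(this.bytes[offset]);
    i++;
  }
  if (offset >= this.bytes.length) return undefined;
  this.cachedIndex = i;
  this.cachedOffset = offset;

  var b = this.bytes;
  var len = sequenceLength(b[offset]);
  if (len === 1) return b[offset];
  // Keep the payload bits of the lead byte, then append 6 bits per
  // continuation byte.
  var cp = b[offset] & (0xFF >> (len + 1));
  for (var k = 1; k < len; k++) {
    cp = (cp << 6) | (b[offset + k] & 0x3F);
  }
  return cp;
};
```

With this, a loop that asks for index 0, 1, 2, ... advances the cached
offset by one sequence per call, so the whole loop is O(n).  Random or
backwards access falls back to a scan from the start.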

You can encode unmatched surrogates in UTF-8 (although they may have to
be removed before the string is passed to the browser DOM code), so it
may be possible to simply always encode strings in UTF-8.  That would
allow much simpler sharing of strings between code that wants UTF-8
support and code that is using the old model, at the expense of more
complex behaviour where UTF-16 surrogates are referenced.
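
To make the surrogate point concrete, here is a sketch of an encoder
that turns an old-model (UTF-16) string into UTF-8 style bytes.  Matched
surrogate pairs become one 4-byte sequence, while an unmatched surrogate
is written out as a 3-byte sequence.  The function name
encodeGeneralizedUtf8 is hypothetical, and note that the 3-byte
surrogate form is not well-formed UTF-8 as the Unicode standard defines
it, which is why such bytes might need to be removed before leaving the
engine.

```javascript
// Hypothetical sketch: encode a UTF-16 string as UTF-8 style bytes,
// letting unmatched surrogates through as 3-byte sequences.
function encodeGeneralizedUtf8(str) {
  var out = [];
  for (var i = 0; i < str.length; i++) {
    var cp = str.charCodeAt(i);
    // Combine a matched high/low surrogate pair into one code point.
    if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < str.length) {
      var low = str.charCodeAt(i + 1);
      if (low >= 0xDC00 && low <= 0xDFFF) {
        cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
        i++; // the low surrogate has been consumed
      }
    }
    if (cp < 0x80) {
      out.push(cp);
    } else if (cp < 0x800) {
      out.push(0xC0 | (cp >> 6), 0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
      // Unmatched surrogates (0xD800..0xDFFF) also end up here.
      out.push(0xE0 | (cp >> 12),
               0x80 | ((cp >> 6) & 0x3F),
               0x80 | (cp & 0x3F));
    } else {
      out.push(0xF0 | (cp >> 18),
               0x80 | ((cp >> 12) & 0x3F),
               0x80 | ((cp >> 6) & 0x3F),
               0x80 | (cp & 0x3F));
    }
  }
  return out;
}
```

Code using the old model could still be handed the individual
surrogates by splitting a 4-byte sequence back into two 16-bit values on
access, which is the "more complex behaviour" mentioned above.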

Issues only arise in code that tries to treat a string as an array of
16-bit integers, and I don't think we should be particularly bothered by
performance of code which misuses strings in this fashion (but clearly
this should still work without opt-in to new string handling).

I think this is a nicer and more flexible model than string
representations being dependent on which heap they came from - all
issues related to encoding can be contained in the String object
implementation.

While this is being discussed, for any new string handling I think we
should make any invalid strings (according to the rules in Unicode)
cause some kind of exception on creation.
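
A sketch of what such a check could look like, written against the
current 16-bit model: it rejects unmatched surrogates, which is only one
of the ways a string can be invalid.  The function name
assertWellFormed and the choice of RangeError are my assumptions, not
something that has been proposed.

```javascript
// Hypothetical sketch: throw if a string contains an unmatched surrogate.
function assertWellFormed(str) {
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // A high surrogate must be followed by a low surrogate.
      var next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
      if (next < 0xDC00 || next > 0xDFFF) {
        throw new RangeError("unmatched high surrogate at index " + i);
      }
      i++; // skip the low surrogate that completes the pair
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      // A low surrogate with no preceding high surrogate.
      throw new RangeError("unmatched low surrogate at index " + i);
    }
  }
  return str;
}
```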

-- 
Andrew Oakley

