New full Unicode for ES6 idea
Brendan Eich
brendan at mozilla.com
Sun Feb 19 08:05:21 PST 2012
Jussi Kalliokoski wrote:
> I'm not sure what to think about this, being a big fan of the UTF-8
> simplicity. :)
UTF-8 is great, but it's a transfer format, perfect for C and other such
systems languages (especially ones that use a byte-wide char from the
old days). It is not appropriate for JS, which gives users a "One True
String" (sorry for caps) primitive type that has higher-level "just
Unicode" semantics. Alas, JS's "just Unicode" dates from '96.
There are lots of transfer formats and character set encodings.
Implementations could use many, depending on what chars a given string
uses. E.g. ASCII + UTF-16, UTF-8 only as you suggest, other
combinations. But this would all be under the hood, and at some cost to
the engine as well as some potential (space, mostly) savings.
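The under-the-hood representation choice could be sketched roughly like
this (purely illustrative; pickStorage and the tag names are invented
here, not any engine's actual API):

```javascript
// Illustrative sketch: an engine scanning a string's code points to
// pick a compact internal storage, invisible to the JS programmer.
function pickStorage(str) {
  let max = 0;
  for (const ch of str) {            // iterates by code point
    max = Math.max(max, ch.codePointAt(0));
  }
  if (max < 0x80) return "ascii";    // 1 byte per character suffices
  if (max < 0x10000) return "bmp";   // 2 bytes, no surrogates needed
  return "full";                     // astral code points present
}

pickStorage("hello");      // "ascii"
pickStorage("caf\u00E9");  // "bmp"
pickStorage("\u{1F4A9}");  // "full"
```

The scan costs the engine time on string creation; the payoff is the
(mostly space) savings mentioned above.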
> But anyhow, I like the idea of opt-in, actually so much that I started
> thinking, why not make JS be encoding-agnostic?
That is precisely the idea. Setting the BRS (the "Big Red Switch"
opt-in) to "full Unicode" gives the appearance of 21 bits per character
via indexing and length accounting. You'd have to spell non-BMP literal
escapes via "\u{...}", no big deal.
> What I mean here is that maybe we could have multi-charset Strings in JS?
Now you're saying something else. Having one agnostic higher-level "just
Unicode" string type is one thing. That's JS's design goal, always has
been. It does not imply adding multiple observable CSEs or UTFs that
break the "just Unicode" abstraction.
If you can put a JS string in memory for low-level systems languages
such as C to view, of course there are abstraction breaks. Engine APIs
may or may not allow such views for optimizations. This is an issue, for
sure, when embedding (e.g. V8 in Node). It's not a language design
issue, though, and I'm focused on observables in the language because
that is where JS currently fails by livin' in the '90s.
/be