New full Unicode for ES6 idea

Allen Wirfs-Brock allen at
Mon Feb 20 13:00:57 PST 2012

On Feb 20, 2012, at 12:32 PM, Brendan Eich wrote:

> Allen Wirfs-Brock wrote:
>> ...
>> You are essentially saying that a compiler targeting ES for a language X that includes a string data type that does not conform to your rules (for example, by allowing occurrences of surrogate code points within string data)
> First, as a point of order: yes, JS strings as full Unicode does not want stray surrogate pair-halves. Does anyone disagree?

Well, I'm disagreeing.  Do you know of any other language that has imposed these sorts of semantic restrictions on runtime string data?  

> Second, binary data / typed arrays stand ready for any such not-full-Unicode use-cases.

But they lack the same level of utility function support, not the least of which is RegExp.

>> could not use ES strings as the target representation of its string data type.  It also could not use the built-in ES string functions in the implementation of language X's built-in functions.
> Not if this hypothetical source language being compiled to JS wants other than full Unicode, no.
> Why is this a problem, even hypothetically? Such a use-case has binary data and typed arrays standing ready, and if it really could use String.prototype.* methods I would be greatly surprised.

My sense is that there is a fairly large variety of string data types that could use the existing ES5 string type as a target type, and for which many of the String.prototype.* methods would function just fine. The reason is that most of the ES5 methods don't impose this sort of semantic restriction on string elements.
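
As a rough illustration of that point (the variable names are only for this sketch), a string holding an unpaired surrogate passes through the ES5 string methods and a plain RegExp like any other sequence of 16-bit elements:

```javascript
// A string holding one unpaired high surrogate (not well-formed UTF-16).
const lone = "a\uD800b";

// ES5 string methods treat each element as an opaque 16-bit code unit.
const len = lone.length;              // counts code units
const idx = lone.indexOf("\uD800");   // finds the lone surrogate by position
const unit = lone.charCodeAt(1);      // reads it back unchanged
const tail = lone.slice(1);           // slicing keeps it intact
const matched = /\uD800/.test(lone);  // a plain RegExp matches it too

console.log(len, idx, unit.toString(16), tail.length, matched);
```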

>> It could not leverage any optimizations that a ES engine may apply to strings and string functions.
> Emscripten already compiles LLVM source languages (C, C++, and Objective-C at least) to JS and does a very good job (getting better day by day). The utility of string functions today (including uint16 indexing and length) is immaterial. Typed arrays are quite important, though.

There are a lot of reasons why ES strings are not a good backing representation for C/C++ strings (to the extent that there even is a C string data type). But there are also lots of high-level languages that do not have those sorts of mapping issues.

If typed arrays are going to be the new "string" type (maybe better stated as array of chars) for people doing systems programming in JS, then we should probably start thinking about a broader set of utility functions/methods that support them.
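
To make the gap concrete, here is a minimal sketch of one such utility, a substring-style search over 16-bit elements. The name indexOfUnits is hypothetical and not part of any proposal; it only shows the kind of thing that has to be written by hand today:

```javascript
// Hypothetical helper: position of the first occurrence of `needle`
// inside `haystack`, or -1. Both are Uint16Array values.
function indexOfUnits(haystack, needle) {
  outer:
  for (let i = 0; i + needle.length <= haystack.length; i++) {
    for (let j = 0; j < needle.length; j++) {
      if (haystack[i + j] !== needle[j]) continue outer; // mismatch, try next start
    }
    return i; // every element of needle matched at offset i
  }
  return -1;
}

// Element values are unrestricted, so a lone surrogate is fine here.
const data = new Uint16Array([0x61, 0xD800, 0x62, 0x63]);
const found = indexOfUnits(data, new Uint16Array([0xD800, 0x62]));
const missing = indexOfUnits(data, new Uint16Array([0x7A]));
console.log(found, missing);
```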

>> Also, values of X's string type can not be directly passed in foreign calls to ES functions. Etc.
> Emscripten does have a runtime that maps browser functionality exposed to JS to the guest language. It does not AFAIK need to encode surrogate pairs in JS strings by hand, let alone make pair-halves.

But with the BRS flipped it would have to censor C "strings" passed to JS to ensure that no unmatched surrogates are present.

Probably not such a big deal because it isn't using JS strings as its representation, but as hypothesized above that wouldn't necessarily be the case for other languages.
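
For what it's worth, the censoring step described above would amount to something like the following check at every call boundary. This is only a sketch over today's 16-bit string elements, and hasLoneSurrogate is a made-up name:

```javascript
// Hypothetical check: true if `str` contains a surrogate code unit
// that is not part of a well-formed high/low pair.
function hasLoneSurrogate(str) {
  for (let i = 0; i < str.length; i++) {
    const c = str.charCodeAt(i);
    if (c >= 0xD800 && c <= 0xDBFF) {
      // High surrogate: the next unit must be a low surrogate.
      const next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // skip the low half of a valid pair
        continue;
      }
      return true;
    }
    if (c >= 0xDC00 && c <= 0xDFFF) {
      return true; // low surrogate with no preceding high surrogate
    }
  }
  return false;
}

console.log(hasLoneSurrogate("abc"), hasLoneSurrogate("a\uD800b"));
```

A runtime for language X would have to run a pass like this (or repair the data) on each string it hands to ES code, which is the extra cost in question.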

