Full Unicode strings strawman
allen at wirfs-brock.com
Mon May 16 16:39:39 PDT 2011
On May 16, 2011, at 3:33 PM, Mike Samuel wrote:
>> 2011/5/16 Allen Wirfs-Brock <allen at wirfs-brock.com>:
> There is existing code out there that uses particular implementations
> for strings.
> Should the cost of migrating existing implementations be taken into
> account when considering this strawman?
If you mean existing ES implementations, then yes, this proposal will impact all of them. So does everything else we add to the ES specification. My proposal has a section that discusses why I think the actual implementation impact will generally not be as great as you might first imagine.
>> values. At the representation level each position could even be represented
>> by a 32-bit cell and it doesn't matter. To be a valid DOMString, element
>> values must be in the range 0-0xffff.
>> I think you are unnecessarily mixing up the string semantics defined by the
>> language, encodings that might be used in implementing the semantics, and
>> application level processing of those strings.
>> To simplify things, just think of an ES string as if it were an array, each
>> element of which could contain an arbitrary integer value. If we have such
>> an array like [0xd800, 0xdc00] at the language semantics level this is a two
>> element array containing two well-specified values. At the language
>> implementation level there are all sorts of representations that might be
>> used, maybe the implementation Huffman encodes the elements... How the
>> application processes that array is completely up to the application. It
>> may treat the array simply as two integer values. It may treat each
>> element as a 21-bit value encoding a Unicode codepoint and logically
>> consider the array to be a Unicode string of length 2. It may consider each
>> element to be a 16-bit value and that sequences of values are interpreted as
>> UTF-16 string encodings. In that case, it could consider it to represent a
>> string of logical length 1.
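To make the three readings of [0xd800, 0xdc00] concrete, here is a minimal sketch. The helper name utf16LogicalLength is hypothetical and is not part of the strawman; it only illustrates the application-level interpretation described in the quoted text.

```javascript
// Hypothetical helper (not from the strawman): counts logical characters
// when the elements are interpreted as UTF-16 code units.
function utf16LogicalLength(cells) {
  let count = 0;
  for (let i = 0; i < cells.length; i++) {
    const isHigh = cells[i] >= 0xd800 && cells[i] <= 0xdbff;
    const next = cells[i + 1];
    const nextIsLow = next >= 0xdc00 && next <= 0xdfff;
    if (isHigh && nextIsLow) {
      i++; // a surrogate pair counts as one logical character
    }
    count++;
  }
  return count;
}

const cells = [0xd800, 0xdc00];

// Read as plain integers, or as one code point per element: two elements.
const lengthAsElements = cells.length;

// Read as UTF-16 code units: one surrogate pair, so one logical character.
const lengthAsUtf16 = utf16LogicalLength(cells);
```

The array itself is the same in every case; only the application's interpretation of it changes.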
> I think we agree about the implementation/interface split.
> If DOMString specifies the semantics of a result from
> I'm not sure I understand the bit about how the semantics of DOMString
> could affect ES programs.
> Is it the case that
> document.createTextNode('\u+010000').length === 2
This is where it gets potentially sticky. This code passes something that is not a valid UTF-16 string encoding to a DOM routine that is declared to take a DOMString argument. This is a new situation that we need to negotiate with the WebIDL ES binding. There are a couple of possible ways to approach this binding. One is to say that it is illegal to pass such a string as a DOMString. However, that isn't very user friendly, as it either precludes using such literals as DOM arguments or forces them to be written as document.createTextNode(UTF16Encode('\u+10000')). It would be better to do the encoding automatically as part of the DOM call marshaling.
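As a rough sketch of what a UTF16Encode step might do, whether called explicitly or performed by the marshaling layer: the version below is an assumption for illustration only. Since the '\u+' literal syntax does not exist in current ES, an array of code points stands in for a full-Unicode string, and an array of 16-bit code units stands in for the DOMString. It assumes every input value is a valid code point no greater than 0x10ffff.

```javascript
// Illustrative stand-in for UTF16Encode. Input: an array of code points
// (assumed to be <= 0x10ffff). Output: an array of UTF-16 code units.
function UTF16Encode(codePoints) {
  const units = [];
  for (const cp of codePoints) {
    if (cp > 0xffff) {
      // Supplementary character: split into a surrogate pair.
      const offset = cp - 0x10000;
      units.push(0xd800 + (offset >> 10), 0xdc00 + (offset & 0x3ff));
    } else {
      // Already fits in one 16-bit code unit; pass it through unchanged.
      units.push(cp);
    }
  }
  return units;
}

// The single code point 0x10000 becomes the two code units 0xd800, 0xdc00.
const encoded = UTF16Encode([0x41, 0x10000]);
```

With automatic marshaling, this conversion would happen at the DOM call boundary rather than in user code.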
Note that in either case, a check has to be made to determine whether the string contains characters whose character codes are > 0xffff. My argument is that, perhaps surprisingly, this should be a very cheap test. The reason is that ES strings are immutable and that any reasonable implementation is likely to use an optimized internal representation for strings that contain only 16-bit character codes. Thus, it is likely that string values containing only 16-bit character codes will be immediately identifiable as such, without requiring any call-site inspection of the individual characters.
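One way to picture why the test can be cheap: because a string never changes after creation, an implementation can record once whether all of its character codes fit in 16 bits. The sketch below is hypothetical. The names StringValue, fitsIn16Bits, and needsUtf16Encoding are invented for illustration, and a real engine would more likely choose between distinct internal representations than store a flag on a wrapper object.

```javascript
// Hypothetical model of an engine-internal immutable string value.
class StringValue {
  constructor(codes) {
    // Copy and freeze so the contents can never change after creation.
    this.codes = Object.freeze(codes.slice());
    // Computed once at creation; valid forever because of immutability.
    this.fitsIn16Bits = this.codes.every((c) => c <= 0xffff);
    Object.freeze(this);
  }
}

// The check at a DOM call site reads a stored flag and does not scan
// the individual characters.
function needsUtf16Encoding(str) {
  return !str.fitsIn16Bits;
}

const plain = new StringValue([0x68, 0x69]);
const supplementary = new StringValue([0x10000]);
```

Here needsUtf16Encoding(plain) is false and needsUtf16Encoding(supplementary) is true, and neither call inspects the characters themselves.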
> '\u+010000' === 1
Yes, assuming that there is a .length missing above.
> or are you saying that when DOMStrings are exposed to ES code, ES gets
> to define the semantics of the "length" and "0" properties.
No, for compatibility with the existing web, DOMStrings need to manifest as UTF-16 encodings.