Re: Question about the “full Unicode in strings” strawman
allen at wirfs-brock.com
Wed Jan 25 09:46:27 PST 2012
On Jan 24, 2012, at 11:45 PM, Norbert Lindenberg wrote:
> I don't see the standard allowing character encodings other than UTF-16 in strings. Section 8.4 says "When a String contains actual textual data, each element is considered to be a single UTF-16 code unit." This aligns with other normative references to UTF-16 in sections 2, 6, and 15.1.3. Section 8.4 does seem to allow the use of strings for non-textual data, but character encodings are by definition for characters, i.e., textual data.
8.4 definitely allows for non-textual data: "String type is ... sequences of ... 16-bit unsigned integer values...", "The String type is generally used to represent textual data...", "All operations on Strings ... treat them as sequences of undifferentiated 16-bit unsigned integers..."
Arbitrary 16-bit values can be placed in a String using either String.fromCharCode (15.5.3.2) or the \uxxxx notation in string literals. Neither of these enforces a requirement that individual String elements are valid Unicode code units.
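A minimal sketch of the point above, assuming any ES5-conforming engine. The variable names and the sample values are illustrative only; they are not taken from the spec:

```javascript
// A lone high surrogate built with String.fromCharCode, and a lone
// low surrogate written with a \u escape. Neither is well-formed
// UTF-16 on its own, yet both are ordinary one-element Strings.
var loneHigh = String.fromCharCode(0xD800);
var loneLow = "\uDC00";

// Arbitrary 16-bit values (illustrative payload) packed into a String
// and read back unchanged with charCodeAt.
var payload = [0x0000, 0xFFFF, 0xD800, 0x1234];
var packed = String.fromCharCode.apply(null, payload);

var unpacked = [];
for (var i = 0; i < packed.length; i++) {
  unpacked.push(packed.charCodeAt(i));
}

console.log(loneHigh.length, loneLow.length); // 1 1
console.log(unpacked.join(",") === payload.join(",")); // true
```

Nothing here is validated as Unicode text; the String simply carries the four 16-bit elements it was given.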
The standard always encodes strings expressed as string literals (except for literals containing \u escapes) using Unicode. However, such literals are restricted to containing characters in the BMP, so all such characters are encoded as single 16-bit String elements.
The functions in 15.1.3 do UTF-8 encoding/decoding, but only if the actual string arguments contain well-formed UTF-16 data. They explicitly throw when encountering other data. This is a characteristic of these specific functions, not of strings in general.
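To illustrate the difference between these functions and strings in general, here is a small sketch using encodeURIComponent as one representative of the 15.1.3 functions. The helper throwsURIError is a hypothetical name introduced only for this example:

```javascript
// Hypothetical helper: reports whether calling fn throws a URIError.
function throwsURIError(fn) {
  try {
    fn();
    return false;
  } catch (e) {
    return e instanceof URIError;
  }
}

// A well-formed surrogate pair (U+10000) is UTF-8 encoded as expected.
var wellFormed = "\uD800\uDC00";
var encoded = encodeURIComponent(wellFormed);

// A lone high surrogate is rejected by the URI function...
var lone = "\uD800";
var rejected = throwsURIError(function () {
  encodeURIComponent(lone);
});

// ...but general String operations accept it without complaint.
var combined = lone + "abc";

console.log(encoded); // %F0%90%80%80
console.log(rejected); // true
console.log(combined.length, combined.indexOf("abc")); // 4 1
```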
> Using a Unicode escape for non-textual data seems like abuse to me - Unicode is a character encoding standard. For Unicode, anything beyond six hex digits is excessive.
I see no intent in the spec. that \u or String.fromCharCode were to be restricted to valid Unicode character encodings.
Any character encoding is simply a semantic interpretation of binary values. There is no particular reason that "text" encoded using non-Unicode encodings (say, for example, EBCDIC) can't be represented using ES String values, and most of the String methods would work fine with such textual data. You would probably want to do exactly that if you were writing code that had to deal with character set conversions.
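As a rough sketch of that idea, the following stores EBCDIC code values (one per String element) and uses ordinary String methods on them. The function ebcdicUpperToUnicode is a hypothetical helper that handles only the uppercase letters A-Z, whose EBCDIC values fall in the ranges 0xC1-0xC9, 0xD1-0xD9 and 0xE2-0xE9; a real converter would need a full table for a specific code page:

```javascript
// "HELLO" as EBCDIC code values, one value per String element.
var ebcdicHello = String.fromCharCode(0xC8, 0xC5, 0xD3, 0xD3, 0xD6);
var ebcdicL = String.fromCharCode(0xD3);

// Ordinary String methods work on the EBCDIC data as-is.
var firstL = ebcdicHello.indexOf(ebcdicL);
var lastL = ebcdicHello.lastIndexOf(ebcdicL);
var tail = ebcdicHello.slice(2);

// Hypothetical helper: convert EBCDIC uppercase letters to Unicode.
function ebcdicUpperToUnicode(s) {
  var out = "";
  for (var i = 0; i < s.length; i++) {
    var c = s.charCodeAt(i);
    var u;
    if (c >= 0xC1 && c <= 0xC9) {
      u = 0x41 + (c - 0xC1); // A-I
    } else if (c >= 0xD1 && c <= 0xD9) {
      u = 0x4A + (c - 0xD1); // J-R
    } else if (c >= 0xE2 && c <= 0xE9) {
      u = 0x53 + (c - 0xE2); // S-Z
    } else {
      throw new RangeError("unsupported EBCDIC value: " + c);
    }
    out += String.fromCharCode(u);
  }
  return out;
}

console.log(firstL, lastL); // 2 3
console.log(ebcdicUpperToUnicode(ebcdicHello)); // HELLO
console.log(ebcdicUpperToUnicode(tail)); // LLO
```

The String methods never need to know that the elements are EBCDIC; only the conversion step applies that interpretation.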