Re: Question about the “full Unicode in strings” strawman
gillam at lab126.com
Wed Jan 25 10:59:21 PST 2012
> The current 16-bit character strings are sometimes used to store non-Unicode binary data and can be used with non-Unicode character encodings with up to 16-bit chars. 21 bits is sufficient for Unicode but perhaps is not enough for other useful encodings. 32-bit seems like a plausible unit.
How would an eight-digit \u escape sequence work from an implementation standpoint? I'm assuming most implementations right now use 16-bit unsigned values as the individual elements of a String. If we allow arbitrary 32-bit values to be placed into a String, how would you make that work? There seem to only be a few options:
a) Change the implementation to use 32-bit units.
b) Change the implementation to use either 16-bit or 32-bit units as needed, with some sort of internal flag that specifies the unit size for each individual string.
c) Encode the 32-bit values somehow as a sequence of 16-bit values.
If you want to allow full generality, it seems like you'd be stuck with option a or option b. Is there really enough value in doing this?
If, on the other hand, the idea is just to make it easier to include non-BMP Unicode characters in strings, you can accomplish this by making a long \u sequence just be shorthand for the equivalent sequence in UTF-16: \u10ffff would be exactly equivalent to \udbff\udfff. You don't have to change the internal format of the string, the indexes of individual characters stay the same, etc.
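To make the shorthand concrete, here's a sketch of the desugaring: the standard UTF-16 arithmetic that maps a supplementary code point to its surrogate pair. The function name is just for illustration; the math is the fixed mapping defined by Unicode.

```javascript
// Map a supplementary code point (U+10000..U+10FFFF) to its
// UTF-16 surrogate pair. Illustrative helper, not a proposed API.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("expected a supplementary code point");
  }
  var offset = codePoint - 0x10000;     // 20 significant bits remain
  var lead  = 0xD800 + (offset >> 10);  // high 10 bits -> lead surrogate
  var trail = 0xDC00 + (offset & 0x3FF);// low 10 bits -> trail surrogate
  return [lead, trail];
}

// \u10ffff would desugar to \udbff\udfff:
var pair = toSurrogatePair(0x10FFFF);
// pair[0] === 0xDBFF, pair[1] === 0xDFFF
```

Under this reading, a long \u escape is pure compile-time sugar: the string's stored 16-bit units, its length, and all indexes are identical to what you'd get by writing the surrogate pair by hand.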
More information about the es-discuss mailing list