Re: Question about the “full Unicode in strings” strawman

Gillam, Richard gillam at lab126.com
Wed Jan 25 11:27:22 PST 2012


Mark--

Of course.  Sorry.  That should have been "\U10ffff is equivalent to \udbff\udfff", with a capital U, or "\u{10ffff} is equivalent to \udbff\udfff".

--Rich

On Jan 25, 2012, at 11:11 AM, Mark Davis ☕ wrote:

You can't use \u10FFFF as syntax, because that could be \u10FF followed by literal FF. A better syntax is \u{...}, with 1 to 6 digits, values from 0 .. 10FFFF.
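
To make the ambiguity concrete, here is a minimal sketch. The first half relies only on the existing rule that \u takes exactly four hex digits. The second half is a hypothetical validator for the braced form described above; the name parseBracedEscape and its return convention are assumptions made for illustration, not part of any proposal text.

```javascript
// Existing behaviour: \u consumes exactly four hex digits, so the
// trailing "FF" is ordinary text rather than part of the escape.
const ambiguous = "\u10FFFF";
const units = ambiguous.length;          // three 16-bit code units
const first = ambiguous.charCodeAt(0);   // 0x10FF
const rest = ambiguous.slice(1);         // the two literal characters "FF"

// Hypothetical helper (not from the thread): validate the source text of
// a braced escape with 1 to 6 hex digits and a value in 0 .. 10FFFF.
// Returns the code point, or null if the text is not a valid escape.
function parseBracedEscape(source) {
  const match = /^\\u\{([0-9A-Fa-f]{1,6})\}$/.exec(source);
  if (match === null) {
    return null;
  }
  const value = parseInt(match[1], 16);
  return value <= 0x10FFFF ? value : null;
}

console.log(units, first.toString(16), rest);
console.log(parseBracedEscape("\\u{10FFFF}"));
console.log(parseBracedEscape("\\u{110000}"));
```

Because the braces delimit the digits, the escape can be any length from one to six digits without swallowing the characters that follow it.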

Mark
— Il meglio è l’inimico del bene —




On Wed, Jan 25, 2012 at 10:59, Gillam, Richard <gillam at lab126.com> wrote:
> The current 16-bit character strings are sometimes used to store non-Unicode binary data and can be used with non-Unicode character encodings with up to 16-bit chars.  21 bits is sufficient for Unicode but perhaps is not enough for other useful encodings. 32-bit seems like a plausible unit.

How would an eight-digit \u escape sequence work from an implementation standpoint?  I'm assuming most implementations right now use 16-bit unsigned values as the individual elements of a String.  If we allow arbitrary 32-bit values to be placed into a String, how would you make that work?  There seem to be only a few options:

a) Change the implementation to use 32-bit units.

b) Change the implementation to use either 16-bit or 32-bit units as needed, with some sort of internal flag that specifies the unit size for an individual string.

c) Encode the 32-bit values somehow as a sequence of 16-bit values.

If you want to allow full generality, it seems like you'd be stuck with option a or option b.  Is there really enough value in doing this?

If, on the other hand, the idea is just to make it easier to include non-BMP Unicode characters in strings, you can accomplish this by making a long \u sequence just be shorthand for the equivalent sequence in UTF-16:  \u10ffff would be exactly equivalent to \udbff\udfff.  You don't have to change the internal format of the string, the indexes of individual characters stay the same, etc.
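
As a rough sketch of that shorthand, the conversion from a code point to its UTF-16 form is just the standard surrogate-pair arithmetic. The function name toUtf16 is hypothetical, chosen only for this illustration; String.fromCharCode is the existing built-in.

```javascript
// Hypothetical helper (name is an assumption): return the UTF-16 code
// unit sequence that a long escape for `codePoint` would stand for.
function toUtf16(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError("code point out of range: " + codePoint);
  }
  if (codePoint <= 0xFFFF) {
    // BMP values fit in a single 16-bit unit and are passed through as-is.
    return String.fromCharCode(codePoint);
  }
  // Supplementary values: subtract 0x10000, then split the remaining
  // 20 bits into a high (lead) and a low (trail) surrogate.
  const offset = codePoint - 0x10000;
  const high = 0xD800 + (offset >> 10);
  const low = 0xDC00 + (offset & 0x3FF);
  return String.fromCharCode(high, low);
}

const max = toUtf16(0x10FFFF);
console.log(max === "\udbff\udfff"); // true
console.log(max.length);             // 2
```

The result is an ordinary string of two 16-bit units, so length, charCodeAt, and indexing behave exactly as they do for a string written with the two \u escapes by hand.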

--Rich Gillam
 Lab126

_______________________________________________
es-discuss mailing list
es-discuss at mozilla.org
https://mail.mozilla.org/listinfo/es-discuss

