Full Unicode strings strawman

Wes Garland wes at page.ca
Tue May 17 20:01:54 PDT 2011


Mark;

Are you Dr. *Mark E. Davis* (born September 13, 1952 (age 58)), co-founder
of the Unicode <http://en.wikipedia.org/wiki/Unicode> project and the
president of the Unicode
Consortium <http://en.wikipedia.org/wiki/Unicode_Consortium> since its
incorporation in 1991?

(If so, uh, thanks for giving me alternatives to Shift-JIS, GB-2312, Big-5,
et al. Those cost me a lot of hair in the late '90s.)

On 17 May 2011 21:55, Mark Davis ☕ <mark at macchiato.com> wrote:

>> In the past, I have read it thus, pseudo BNF:

>
>> UnicodeString => CodeUnitSequence // D80
>> CodeUnitSequence => CodeUnit | CodeUnitSequence CodeUnit // D78
>> CodeUnit => <anything in the current encoding form> // D77
>>
>
> So far, so good. In particular, d800 is a code unit for UTF-16, since it is
> a code unit that can occur in some code unit sequence in UTF-16.
>

*head smack* - code unit, not code point.
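
To make the code unit / code point distinction concrete, here is a minimal sketch in Python (not ECMAScript; the example character and variable names are chosen purely for illustration). It takes one supplementary-plane code point and lists the UTF-16 code units that encode it.

```python
# Illustrative sketch: one code point above U+FFFF versus its UTF-16 code units.
s = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, a single supplementary code point

# Python's str is indexed by code point, so this yields one value.
code_points = [ord(ch) for ch in s]

# Encode as UTF-16 (big-endian, no BOM) and split the bytes into 16-bit units.
data = s.encode("utf-16-be")
code_units = [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

print([hex(cp) for cp in code_points])  # ['0x1d11e']
print([hex(cu) for cu in code_units])   # ['0xd834', '0xdd1e']
```

Here 0xD834 is a perfectly ordinary UTF-16 code unit, even though as a code point value it falls in the high-surrogate range.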


>
>
>> This means that your original assertion -- that Unicode strings cannot
>> contain the high surrogate code points, regardless of meaning -- is in fact
>> correct.
>>
>
> That is incorrect.
>

Aie, Karumba!

If we have

   - a sequence of code points
   - taking on values between 0 and 0x1FFFFF
   - including high surrogates and other reserved values
   - independent of encoding

...what exactly are we talking about?  Can it be represented in UTF-16
without round-trip loss when normalization is not performed, for the code
points 0 through 0xFFFF?
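
One way to poke at that question is the sketch below. It uses Python, whose str type happens to behave like a sequence of code points that may include lone surrogates. That is an assumption about Python's behaviour only, not a statement about what the strawman proposes, and the helper names are hypothetical. The "surrogatepass" error handler lets surrogate code points through the UTF-16 codec.

```python
# Illustrative sketch: round-tripping code point sequences through UTF-16.
# Helper names are hypothetical; "surrogatepass" is a standard Python
# codec error handler that lets surrogate code points through.

def to_utf16_units(s):
    """Encode a code point sequence as a list of 16-bit UTF-16 code units."""
    data = s.encode("utf-16-be", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

def from_utf16_units(units):
    """Decode a list of 16-bit code units back into a code point sequence."""
    data = b"".join(u.to_bytes(2, "big") for u in units)
    return data.decode("utf-16-be", "surrogatepass")

# A lone high surrogate between two BMP characters survives the round trip.
lone = "a\ud800b"
lone_back = from_utf16_units(to_utf16_units(lone))

# Two separate surrogate code points, high then low, do not: the decoder
# reads the code units D834 DD1E as the single code point U+1D11E.
pair = "\ud834\udd1e"
pair_back = from_utf16_units(to_utf16_units(pair))

print(lone_back == lone)              # True
print(len(pair), len(pair_back))      # 2 1
```

Under these assumptions, a lone surrogate code point is representable, but a sequence that contains a high surrogate code point immediately followed by a low surrogate code point is indistinguishable, once in UTF-16, from the supplementary character that pair encodes.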

Incidentally, I think this discussion underscores nicely why I think we
should work hard to figure out a way to hide UTF-16 encoding details from
user-end programmers.

Wes

-- 
Wesley W. Garland
Director, Product Development
PageMail, Inc.
+1 613 542 2787 x 102