Full Unicode strings strawman
wes at page.ca
Tue May 17 20:01:54 PDT 2011
Are you Dr. *Mark E. Davis* (born September 13, 1952 (age 58)), co-founder
of the Unicode <http://en.wikipedia.org/wiki/Unicode> project and the
president of the Unicode Consortium since its
incorporation in 1991?
(If so, uh, thanks for giving me alternatives to Shift-JIS, GB-2312, Big-5,
et al. Those cost me a lot of hair in the late 90s.)
On 17 May 2011 21:55, Mark Davis ☕ <mark at macchiato.com> wrote:
>> In the past, I have read it thus, in pseudo-BNF:
>> UnicodeString => CodeUnitSequence // D80
>> CodeUnitSequence => CodeUnit | CodeUnitSequence CodeUnit // D78
>> CodeUnit => <anything in the current encoding form> // D77
> So far, so good. In particular, d800 is a code unit for UTF-16, since it is
> a code unit that can occur in some code unit sequence in UTF-16.
*head smack* - code unit, not code point.
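
To make the code unit / code point distinction concrete, here is a minimal sketch in Python (not ECMAScript). The helper name utf16_code_units is hypothetical, and it assumes the little-endian UTF-16 encoding form with the "surrogatepass" error handler so that surrogates are not rejected. It shows 0xd800 occurring as a code unit in the UTF-16 encoding of a single supplementary code point:

```python
def utf16_code_units(text):
    """Return the list of UTF-16 code units that encode `text`.

    Hypothetical helper for illustration only.
    """
    # "surrogatepass" lets surrogate code points through instead of raising.
    raw = text.encode("utf-16-le", "surrogatepass")
    # Each UTF-16 code unit is 16 bits, i.e. two bytes in little-endian order.
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


# U+10000 is one code point, but it takes two code units in UTF-16.
print([hex(unit) for unit in utf16_code_units("\U00010000")])
# prints ['0xd800', '0xdc00']
```

So d800 appears in a perfectly ordinary UTF-16 code unit sequence, which is all that D77 asks of a code unit.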
>> This means that your original assertion -- that Unicode strings cannot
>> contain the high surrogate code points, regardless of meaning -- is in fact
> That is incorrect.
If we have
- a sequence of code points
- taking on values between 0 and 0x10FFFF
- including high surrogates and other reserved values
- independent of encoding
...what exactly are we talking about? Can it be represented in UTF-16
without round-trip loss when normalization is not performed, for the code
points 0 through 0xFFFF?
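
One way to probe the round-trip question is the rough sketch below, in Python rather than ECMAScript. The function name utf16_round_trip is hypothetical, and the sketch assumes the "surrogatepass" error handler stands in for an encoder that lets surrogate code points through unchanged. It encodes a sequence of code points as UTF-16 code units and decodes the result again:

```python
def utf16_round_trip(code_points):
    """Encode code points as UTF-16 code units, then decode them again.

    Hypothetical helper for illustration only.
    """
    text = "".join(chr(cp) for cp in code_points)
    # Assumption: surrogate code points are written out as single code units.
    raw = text.encode("utf-16-le", "surrogatepass")
    decoded = raw.decode("utf-16-le", "surrogatepass")
    return [ord(ch) for ch in decoded]


lone = [0xD800]          # an isolated high surrogate
pair = [0xD800, 0xDC00]  # a high surrogate followed by a low surrogate

print(utf16_round_trip(lone) == lone)  # prints True
print(utf16_round_trip(pair) == pair)  # prints False
```

Under these assumptions an isolated surrogate survives, but a high surrogate code point followed by a low surrogate code point comes back as the single code point 0x10000, so the round trip is not lossless for every sequence drawn from 0 through 0xFFFF.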
Incidentally, I think this discussion underscores nicely why I think we
should work hard to figure out a way to hide UTF-16 encoding details from
Wesley W. Garland
Director, Product Development
+1 613 542 2787 x 102