Full Unicode strings strawman

Mark Davis ☕ mark at macchiato.com
Wed May 18 11:17:26 PDT 2011


On Tue, May 17, 2011 at 20:01, Wes Garland <wes at page.ca> wrote:

> Mark;
>
> Are you Dr. *Mark E. Davis* (born September 13, 1952 (age 58)), co-founder
> of the Unicode <http://en.wikipedia.org/wiki/Unicode> project and the
> president of the Unicode Consortium <http://en.wikipedia.org/wiki/Unicode_Consortium> since its incorporation in 1991?


Guilty as charged.


>
>
> (If so, uh, thanks for giving me alternatives to Shift-JIS, GB-2312, Big-5,
> et al..those gave me lots of hair loss in the late 90s)
>

You're welcome. We did it to save ourselves from the hair-pulling we had in
the *80s* over those charsets ;-)


>
> On 17 May 2011 21:55, Mark Davis ☕ <mark at macchiato.com> wrote:
>
> In the past, I have read it thus, pseudo BNF:
>
>
>>> UnicodeString => CodeUnitSequence // D80
>>> CodeUnitSequence => CodeUnit | CodeUnitSequence CodeUnit // D78
>>> CodeUnit => <anything in the current encoding form> // D77
>>>
>>
>> So far, so good. In particular, d800 is a code unit for UTF-16, since it
>> is a code unit that can occur in some code unit sequence in UTF-16.
>>
>
> *head smack* - code unit, not code point.
>
>
>>
>>
>>> This means that your original assertion -- that Unicode strings cannot
>>> contain the high surrogate code points, regardless of meaning -- is in fact
>>> correct.
>>>
>>
>>  That is incorrect.
>>
>
> Aie, Karumba!
>
> If we have
>
>    - a sequence of code points
>    - taking on values between 0 and 0x1FFFFF
>
0x10FFFF, not 0x1FFFFF.

>
>    - including high surrogates and other reserved values
>    - independent of encoding
>
> ..what exactly are we talking about?  Can it be represented in UTF-16
> without round-trip loss when normalization is not performed, for the code
> points 0 through 0xFFFF?
>

Surrogate code points (U+D800..U+DFFF) can't be represented in any *UTF*
string. They can be represented in *Unicode strings* (ones that are not
valid UTF strings), with the one restriction that in UTF-16 they have to be
isolated. In practice, we just don't find that isolated surrogates in
Unicode 16-bit strings cause a problem, so I think that issue has derailed
the more important issues involved in this discussion, which are in the API.
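
To make the distinction concrete, here is a minimal Java sketch. The class and
method names (LoneSurrogateDemo, isWellFormedUtf16) are made up for
illustration; only the java.lang calls are real. It shows a 16-bit string that
holds an isolated surrogate and is therefore not a valid UTF-16 string, and
why the surrogate has to be isolated to survive as a code point.

```java
class LoneSurrogateDemo {
    // Hypothetical helper: true when every surrogate code unit in s
    // is part of a proper high+low pair, i.e. s is well-formed UTF-16.
    static boolean isWellFormedUtf16(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // skip the low half of the pair
                } else {
                    return false; // isolated high surrogate
                }
            } else if (Character.isLowSurrogate(c)) {
                return false; // isolated low surrogate
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String isolated = "a\uD800b";    // a "Unicode 16-bit string", not valid UTF-16
        String paired = "\uD800\uDC00";  // the same high surrogate followed by a low one

        // The isolated surrogate is readable as code point U+D800.
        System.out.println(Integer.toHexString(isolated.codePointAt(1)));
        System.out.println(isWellFormedUtf16(isolated));

        // Once it is followed by a low surrogate it is no longer isolated:
        // the two code units are read as the single code point U+10000.
        System.out.println(Integer.toHexString(paired.codePointAt(0)));
        System.out.println(isWellFormedUtf16(paired));
    }
}
```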


> Incidentally, I think this discussion underscores nicely why I think we
> should work hard to figure out a way to hide UTF-16 encoding details from
> user-end programmers.
>


The biggest issue is the indexing. In Java, for example, iterating through a
string has some ugly syntax:

int cp;
for (int i = 0; i < string.length(); i += Character.charCount(cp)) {
    cp = string.codePointAt(i);
    doSomethingWith(cp);
}

But it doesn't have to be that way; they could have supported, with a little
bit of syntactic sugar, something like:

for (int cp : aString) {
  doSomethingWith(cp);
}
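
The loop above is wished-for syntax, not valid Java. For comparison, Java 8
(released after this message was written) added a codePoints() method on
strings that comes fairly close. A small sketch, with the class and method
names (CodePointLoop, toCodePoints) invented for the example:

```java
import java.util.Arrays;

class CodePointLoop {
    // Collects the code points of s without any manual index arithmetic.
    static int[] toCodePoints(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        // 'a', U+1F600 (stored as the surrogate pair D83D DE00), 'b'
        String s = "a\uD83D\uDE00b";

        // Each supplementary character is delivered as one int.
        s.codePoints().forEach(cp -> System.out.println(Integer.toHexString(cp)));

        System.out.println(Arrays.toString(toCodePoints(s)));
    }
}
```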

If done well, the complexity doesn't have to be visible to the user. In many
cases, as Shawn pointed out, code points are not really the right unit
anyway. What the user may actually need are word boundaries, or grapheme
cluster boundaries, etc. If, for example, you truncate a string on just code
point boundaries, you'll get the wrong answer sometimes.
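
A rough illustration of the truncation problem, using java.text.BreakIterator
as a stand-in for grapheme cluster segmentation. Its "character" boundaries
are an approximation and may not match the current Unicode segmentation rules
in every Java version. The class and method names are hypothetical.

```java
import java.text.BreakIterator;

class GraphemeTruncate {
    // Keeps at most max code points; may split a base letter from its accent.
    static String truncateCodePoints(String s, int max) {
        int n = Math.min(max, s.codePointCount(0, s.length()));
        return s.substring(0, s.offsetByCodePoints(0, n));
    }

    // Keeps at most max user-perceived characters, as BreakIterator sees them.
    static String truncateGraphemes(String s, int max) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int end = 0;
        for (int i = 0; i < max; i++) {
            int next = it.next();
            if (next == BreakIterator.DONE) {
                break; // fewer than max characters in the string
            }
            end = next;
        }
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        String s = "e\u0301x"; // 'e' + COMBINING ACUTE ACCENT, then 'x'
        System.out.println(truncateCodePoints(s, 1).length()); // accent is cut off
        System.out.println(truncateGraphemes(s, 1).length());  // accent is kept
    }
}
```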

It is of course simpler, if you are either designing a programming language
from scratch *or* are able to take the compatibility hit, to have the API
for strings always index by code points. That is, from the outside, a string
always looks like it is a simple sequence of code points. There are a couple
of ways to do that efficiently, even where the internal storage is not in
32-bit chunks.
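
The message does not say which techniques are meant, so the following is only
one possible sketch and an assumption on my part: keep the 16-bit storage,
remember whether the string contains any surrogate pairs, and index directly
when it does not. The class name CodePointString is invented for the example.

```java
final class CodePointString {
    private final String utf16;   // internal storage stays 16-bit
    private final int length;     // length measured in code points
    private final boolean noPairs; // true when no surrogate pairs occur

    CodePointString(String s) {
        this.utf16 = s;
        this.length = s.codePointCount(0, s.length());
        // Equal counts mean every code point occupies exactly one code unit.
        this.noPairs = (this.length == s.length());
    }

    int length() {
        return length;
    }

    // Index is in code points, never in code units.
    int codePointAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index: " + index);
        }
        // Fast path: constant-time lookup. Slow path: linear scan, which a
        // table of sampled offsets could speed up (not shown here).
        int offset = noPairs ? index : utf16.offsetByCodePoints(0, index);
        return utf16.codePointAt(offset);
    }
}
```

From the outside, new CodePointString("a\uD83D\uDE00b") has length 3, and
index 1 yields U+1F600 even though it occupies two code units internally.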


> Wes
>
> --
> Wesley W. Garland
> Director, Product Development
> PageMail, Inc.
> +1 613 542 2787 x 102
>

More information about the es-discuss mailing list