Full Unicode strings strawman
waldemar at google.com
Wed May 18 15:46:42 PDT 2011
On 05/16/11 11:11, Allen Wirfs-Brock wrote:
> I tried to post a pointer to this strawman on this list a few weeks ago, but apparently it didn't reach the list for some reason.
> Feedback would be appreciated:
Two different languages made different decisions on how to approach extending their character sets as Unicode evolved:
- Java kept its strings encoded exactly as they were (a sequence of 16-bit code units) and provided extra APIs for the cases where you want to extract a code point.
- Perl widened the concept of characters in strings from bytes to full Unicode characters. Thus a UTF-8 string can be represented in either of two ways: each byte is one Perl character, or each Unicode character is one Perl character. Conversion functions are provided to move between the two.
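To make the Java bullet concrete, here is a minimal sketch of what "extra APIs" means in practice. The class name CodePointDemo and the sample string are just illustrative choices; the methods called are the standard java.lang.String ones. The string stays a sequence of 16-bit code units, and the code point view is something you ask for explicitly:

```java
final class CodePointDemo {
    // U+1D11E MUSICAL SYMBOL G CLEF is outside the BMP; in UTF-16 it is
    // the surrogate pair D834 DD1E. The sample string is illustrative only.
    static final String SAMPLE = "a\uD834\uDD1Eb";

    public static void main(String[] args) {
        // length() and charAt() still work in 16-bit code units.
        System.out.println(SAMPLE.length());                          // 4
        System.out.println(Integer.toHexString(SAMPLE.charAt(1)));    // d834

        // The added APIs give the code point view on request.
        System.out.println(SAMPLE.codePointCount(0, SAMPLE.length())); // 3
        System.out.println(Integer.toHexString(SAMPLE.codePointAt(1))); // 1d11e
    }
}
```

Nothing about existing code that indexes by code unit changes here; only callers that care about non-BMP characters need the new methods.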
My experience is that Java's approach worked, while Perl's has led to an endless shop of horrors. The problem is that different APIs expect different kinds of strings, so years after a lot of code was written I'm still finding places in it where conversions should have been added but weren't (or vice versa).
1. I would not be in favor of any approach that widens the concept of a string character or introduces two different representations for a non-BMP character. It will suffer from the same problems as Perl, except that they will be harder to find because use of non-BMP characters is relatively rare.
2. Widening characters to 21 bits doesn't really help much. As stated earlier in this thread, you still want to treat clumps of combining characters together with the character to which they combine, worry about various normalized forms, etc. All of these require the machinery to deal with clumps of code units as though they were single characters/graphemes/etc., and once you have that machinery you can reuse it to support non-BMP characters while keeping string code units 16 bits.
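As a rough illustration of point 2, the sketch below splits a string into clumps of a base character plus any combining marks that follow it. The names ClumpSplitter, isCombining and clumps are hypothetical, and the "combining" test (Unicode general categories Mn, Mc and Me) is a deliberate simplification of real grapheme segmentation. The point is only that the loop which keeps combining marks with their base character handles a surrogate pair with the same stepping logic, while the string remains 16-bit code units:

```java
import java.util.ArrayList;
import java.util.List;

final class ClumpSplitter {
    // Simplifying assumption: a code point "combines" if its general
    // category is Mn, Mc or Me. Real grapheme rules are more involved.
    static boolean isCombining(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    // Splits s into clumps: one base code point followed by its combining marks.
    static List<String> clumps(String s) {
        List<String> result = new ArrayList<>();
        int start = 0;
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);        // reads a whole surrogate pair if present
            if (i > start && !isCombining(cp)) {
                result.add(s.substring(start, i));
                start = i;
            }
            i += Character.charCount(cp);     // advance by 1 or 2 code units
        }
        if (start < s.length()) {
            result.add(s.substring(start));
        }
        return result;
    }

    public static void main(String[] args) {
        // "e" + COMBINING ACUTE ACCENT, then U+1D11E as a surrogate pair, then "x".
        String s = "e\u0301\uD834\uDD1Ex";
        System.out.println(s.length());        // 5 code units
        System.out.println(clumps(s).size());  // 3 clumps
    }
}
```

Each clump is an ordinary substring of the original 16-bit string, so nothing downstream has to know about a second string representation.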