Fwd: Full Unicode strings strawman
Mark Davis ☕
mark at macchiato.com
Thu May 19 08:04:02 PDT 2011
Markus isn't on es-discuss, so forwarding....
---------- Forwarded message ----------
From: Markus Scherer <markus.icu at gmail.com>
Date: Wed, May 18, 2011 at 22:18
Subject: Re: Full Unicode strings strawman
To: Allen Wirfs-Brock <allen at wirfs-brock.com>
Cc: Shawn Steele <Shawn.Steele at microsoft.com>, Mark Davis ☕ <mark at macchiato.com>,
"es-discuss at mozilla.org" <es-discuss at mozilla.org>
On Mon, May 16, 2011 at 5:07 PM, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:
> I agree that application writers will continue for the foreseeable future
> to have to know whether or not they are dealing with UTF-16 encoded data and/or
> communicating with other subsystems that expect such data. However, core
> language support for UTF-32 is a prerequisite for ever moving beyond
> UTF-16 APIs and libraries and getting back to uniform sized character
This seems to be based on a misunderstanding. Fixed-width encodings are nice
but not required. The majority of Unicode-aware code uses either UTF-8 or
UTF-16, and supports the full Unicode code point range without too much
trouble. Even with UTF-32 you get "user characters" that require sequences
of two or more code points (e.g., base character + diacritic, Han character
+ variation selector) and there is not always a composite character for such
Windows NT uses 16-bit Unicode, started BMP-only and has supported the full
Unicode range since Windows 2000.
MacOS X uses 16-bit Unicode (coming from NeXT) and supports the full Unicode
range. (Ever since MacOS X 10.0 I believe.) Lower-level MacOS APIs use UTF-8
char* and support the full Unicode range.
ICU uses 16-bit Unicode, started BMP-only and has supported the full range
in most services since the year 2000.
Java uses 16-bit Unicode, started BMP-only and has supported the full range
since Java 5.
KDE uses 16-bit Unicode, started BMP-only and has supported the full range
Gnome uses UTF-8 and supports the full range.
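
To make the point about 16-bit strings concrete, here is a minimal sketch of how code working on UTF-16 code units can still cover the full code point range. The helper name codePoints is hypothetical and is not part of any spec or library. It relies only on charCodeAt and the standard surrogate-pair arithmetic. The last lines illustrate the separate point that a single "user character" can take several code points in any encoding form, UTF-32 included.

```javascript
// Sketch: walk a UTF-16 string by code point using only charCodeAt.
// codePoints is a hypothetical helper name, not a built-in.
function codePoints(str) {
  var result = [];
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    // A lead surrogate (0xD800..0xDBFF) followed by a trail surrogate
    // (0xDC00..0xDFFF) encodes one supplementary code point.
    if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < str.length) {
      var next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        result.push(((unit - 0xD800) << 10) + (next - 0xDC00) + 0x10000);
        i++; // skip the trail surrogate already consumed
        continue;
      }
    }
    // BMP code unit, or an unpaired surrogate passed through unchanged.
    result.push(unit);
  }
  return result;
}

// U+1D11E MUSICAL SYMBOL G CLEF: two code units, one code point.
var clef = "\uD834\uDD1E";
console.log(clef.length);               // 2
console.log(codePoints(clef).length);   // 1

// "q" + U+0303 COMBINING TILDE: one user character, two code points,
// regardless of whether the string is stored as UTF-8, UTF-16 or UTF-32.
var qTilde = "q\u0303";
console.log(codePoints(qTilde).length); // 2
```

The pairing check is the only extra work a UTF-16 string model needs for code point access. Grouping code points into user characters is a separate problem that a fixed-width encoding does not remove.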
implementations input and render the full range, and updating its spec and
implementations to upgrade compatibly like everyone else seems like the best
processing, and interfaces with the UTF-16 DOM and UTF-16 client OSes, a
UTF-32 string model might be more trouble than it's worth (and possibly a
before the committee became practically defunct for a while.