Fwd: Full Unicode strings strawman

Mark Davis ☕ mark at macchiato.com
Thu May 19 08:04:02 PDT 2011


Markus isn't on es-discuss, so forwarding....

---------- Forwarded message ----------
From: Markus Scherer <markus.icu at gmail.com>
Date: Wed, May 18, 2011 at 22:18
Subject: Re: Full Unicode strings strawman
To: Allen Wirfs-Brock <allen at wirfs-brock.com>
Cc: Shawn Steele <Shawn.Steele at microsoft.com>, Mark Davis ☕ <
mark at macchiato.com>, "es-discuss at mozilla.org" <es-discuss at mozilla.org>


On Mon, May 16, 2011 at 5:07 PM, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:

> I agree that application writers will, for the foreseeable future, continue
> to have to know whether or not they are dealing with UTF-16 encoded data and/or
> communicating with other subsystems that expect such data.  However, core
> language support for UTF-32 is a prerequisite for ever moving beyond
> UTF-16 APIs and libraries and getting back to uniform-sized character
> processing.
>

This seems to be based on a misunderstanding. Fixed-width encodings are nice
but not required. The majority of Unicode-aware code uses either UTF-8 or
UTF-16, and supports the full Unicode code point range without too much
trouble. Even with UTF-32 you get "user characters" that require sequences
of two or more code points (e.g., base character + diacritic, Han character
+ variation selector) and there is not always a composite character for such
a sequence.
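
To make the point above concrete, here is a minimal sketch of walking a UTF-16
string by code point using only charCodeAt. The helper names codePointAt and
toCodePoints are hypothetical, chosen for illustration; they are not part of
the language or of any proposal in this thread. The second example shows that
a fixed-width view still yields two code points for one user character.

```javascript
// Hypothetical helper: decode the code point that starts at UTF-16
// code unit index i. Unpaired surrogates are returned as-is.
function codePointAt(s, i) {
  var lead = s.charCodeAt(i);
  if (lead >= 0xD800 && lead <= 0xDBFF && i + 1 < s.length) {
    var trail = s.charCodeAt(i + 1);
    if (trail >= 0xDC00 && trail <= 0xDFFF) {
      // Combine lead and trail surrogates into a supplementary code point.
      return ((lead - 0xD800) << 10) + (trail - 0xDC00) + 0x10000;
    }
  }
  return lead;
}

// Hypothetical helper: list all code points in a UTF-16 string.
function toCodePoints(s) {
  var out = [];
  for (var i = 0; i < s.length; i++) {
    var cp = codePointAt(s, i);
    out.push(cp);
    if (cp > 0xFFFF) i++; // skip the trail surrogate just consumed
  }
  return out;
}

var gClef = "\uD834\uDD1E"; // U+1D11E, two UTF-16 code units
var eAcute = "e\u0301";     // base letter + combining acute accent

var gClefPoints = toCodePoints(gClef);   // one code point: 0x1D11E
var eAcutePoints = toCodePoints(eAcute); // two code points: 0x65, 0x301
```

Even with every code point stored at fixed width, "e" followed by U+0301 would
remain a sequence of two elements, so code that cares about user characters
has to handle sequences either way.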

Windows NT uses 16-bit Unicode, started BMP-only and has supported the full
Unicode range since Windows 2000.
Mac OS X uses 16-bit Unicode (coming from NeXT) and supports the full Unicode
range. (Ever since Mac OS X 10.0, I believe.) Lower-level Mac OS APIs use UTF-8
char* and support the full Unicode range.
ICU uses 16-bit Unicode, started BMP-only and has supported the full range
in most services since the year 2000.
Java uses 16-bit Unicode, started BMP-only and has supported the full range
since Java 5.
KDE uses 16-bit Unicode, started BMP-only and has supported the full range
for years.
GNOME uses UTF-8 and supports the full range.

JavaScript uses 16-bit Unicode and is still BMP-only, although most
implementations input and render the full range. Updating its spec and
implementations to upgrade compatibly, like everyone else did, seems like
the best option.
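
As a rough illustration of what a compatible upgrade could look like, the
sketch below keeps strings as sequences of 16-bit code units and adds a
code-point-aware helper on top. The name fromCodePoint is hypothetical here
and is only an assumption about how such a helper might be shaped; it is not
taken from the strawman or from the 2003 proposal.

```javascript
// Hypothetical helper: build a UTF-16 string from one Unicode code point.
// Strings stay sequences of 16-bit code units; supplementary code points
// are stored as a lead/trail surrogate pair.
function fromCodePoint(cp) {
  if (cp < 0 || cp > 0x10FFFF || Math.floor(cp) !== cp) {
    throw new RangeError("Invalid code point: " + cp);
  }
  if (cp <= 0xFFFF) {
    return String.fromCharCode(cp);
  }
  var offset = cp - 0x10000;
  return String.fromCharCode(
    0xD800 + (offset >> 10),   // lead surrogate
    0xDC00 + (offset & 0x3FF)  // trail surrogate
  );
}

var clef = fromCodePoint(0x1D11E);
// Existing code that indexes by code unit keeps working unchanged:
var unitCount = clef.length;                // 2 code units
var sameAsEscapes = clef === "\uD834\uDD1E"; // true
```

Under this model, existing scripts that count or index 16-bit units see no
change in behavior, while new code can opt in to code point semantics.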

In a programming language like JavaScript that is heavy on string
processing, and interfaces with the UTF-16 DOM and UTF-16 client OSes, a
UTF-32 string model might be more trouble than it's worth (and possibly a
performance hit).

FYI: I proposed full-Unicode support in JavaScript in 2003, a few months
before the committee became practically defunct for a while.
https://sites.google.com/site/markusicu/unicode/es/unicode-2003
https://sites.google.com/site/markusicu/unicode/es/i18n-2003

Best regards,
markus
(Google/ICU/Unicode)