Re: Question about the “full Unicode in strings” strawman

Tab Atkins Jr. jackalmage at gmail.com
Wed Jan 25 08:36:13 PST 2012


On Tue, Jan 24, 2012 at 5:14 PM, Allen Wirfs-Brock
<allen at wirfs-brock.com> wrote:
> The current 16-bit character strings are sometimes used to store non-Unicode
> binary data and can be used with non-Unicode character encodings with up to
> 16-bit chars.  21 bits is sufficient for Unicode but perhaps is not enough
> for other useful encodings.  32-bit seems like a plausible unit.

People only used strings to store binary data because they had no
native binary data types.  Now, with typed arrays, they do.
Continuing to optimize strings for this use case seems unnecessary.
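
For instance, a minimal sketch of the typed-array alternative
(assuming an environment that implements the Typed Arrays spec;
the variable names are just illustrative):

    // Raw bytes belong in an ArrayBuffer, not in a 16-bit string.
    var buf = new ArrayBuffer(4);      // 4 bytes of raw storage
    var bytes = new Uint8Array(buf);   // byte-level view of the buffer
    bytes[0] = 0xFF;                   // plain numeric reads/writes,
    bytes[1] = 0x00;                   // no string semantics involved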


> The real controversy that developed over this proposal regarded whether or
> not every individual Unicode character needs to be uniformly representable
> as a single element of a String.  This proposal took the position that they
> should.  Other voices felt that such uniformity was unnecessary and seemed
> content to expose UTF-8 or UTF-16.  The argument was that applications may
> have to look at multi-character logical units anyway, so dealing with UTF
> encodings isn't much of an added burden.

Anyone who argues that authors should have to deal with characters
spread across more than one string element has never tried to deal
with having a non-BMP name on the web.  UTF-16 is particularly
horrible in this regard: "most" names authors will see (unless
they're explicitly serving a CJK audience) fall in the BMP and are
thus a single element.  UTF-8 at least has the "advantage" that
authors are somewhat more likely to encounter problems if they assume
one character = one element.
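
For example, here is a sketch of the failure mode with today's
UTF-16 strings (the character below is just a stand-in for any
non-BMP code point):

    // U+1D49C MATHEMATICAL SCRIPT CAPITAL A lies outside the BMP,
    // so it is stored as a surrogate pair: two string elements.
    var name = "\uD835\uDC9C";
    name.length;          // 2, even though it is one character
    name.charAt(0);       // "\uD835", an unpaired surrogate half
    name.charCodeAt(0);   // 0xD835, not the code point 0x1D49C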

Making strings more complicated is, unfortunately, user-hostile
against people with names outside of ASCII or the BMP.

~TJ

