Re: Question about the “full Unicode in strings” strawman
Tab Atkins Jr.
jackalmage at gmail.com
Wed Jan 25 08:36:13 PST 2012
On Tue, Jan 24, 2012 at 5:14 PM, Allen Wirfs-Brock
<allen at wirfs-brock.com> wrote:
> The current 16-bit character strings are sometimes used to store non-Unicode
> binary data and can be used with non-Unicode character encodings of up to
> 16-bit chars. 21 bits is sufficient for Unicode but perhaps is not enough
> for other useful encodings. 32-bit seems like a plausible unit.
People only used strings to store binary data because they didn't have
native binary data types. Now they do. Continuing to optimize
strings for this use-case seems unnecessary.
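(To illustrate the point: a rough sketch of binary data held in a Typed
Array rather than packed into a 16-bit string. The specific byte values
here are just an example.)

```javascript
// Binary data in a Typed Array instead of a JS string.
// Each element is a true 8-bit value with no string/encoding semantics.
const bytes = new Uint8Array([0xDE, 0xAD, 0xBE, 0xEF]);

console.log(bytes.length);              // 4
console.log(bytes[0].toString(16));     // "de"

// Mutation works on raw numeric values, not code units:
bytes[1] = 0xFF;
console.log(bytes[1]);                  // 255
```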
> The real controversy that developed over this proposal regarded whether or
> not every individual Unicode character needs to be uniformly representable
> as a single element of a String. This proposal took the position that it
> should. Other voices felt that such uniformity was unnecessary and seemed
> content to expose UTF-8 or UTF-16. The argument was that applications may
> have to look at multi-character logical units anyway, so dealing with UTF
> encodings isn't much of an added burden.
Anyone who argues that authors should have to deal with characters
spread across more than one element of a string has never tried to
deal with having a non-BMP name on the web. UTF-16 is particularly
horrible in this regard, as "most" names authors will see (if they're
not serving a CJK audience explicitly) are in the BMP and thus are a
single element. UTF-8 at least has the "advantage" that authors are
somewhat more likely to encounter problems if they assume 1 character
= 1 element.
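(A concrete sketch of the failure mode: a non-BMP character in today's
UTF-16-based JS strings takes two code units, so code assuming one
character per element silently breaks. U+1D11E MUSICAL SYMBOL G CLEF is
used here purely as an example of a non-BMP character.)

```javascript
// U+1D11E is outside the BMP, so in a UTF-16 string it is stored
// as a surrogate pair: two 16-bit elements for one character.
const clef = "\uD834\uDD1E";

console.log(clef.length);                 // 2, not 1
console.log(clef.charCodeAt(0));          // 55348 (0xD834, high surrogate)
console.log(clef.charCodeAt(1));          // 56606 (0xDD1E, low surrogate)

// Naive "1 character = 1 element" code breaks:
console.log(clef.charAt(0) === clef);     // false — half a character
```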
Making strings more complicated is, unfortunately, user-hostile
toward people with names outside of ASCII or the BMP.
More information about the es-discuss mailing list