Full Unicode strings strawman

Wes Garland wes at page.ca
Tue May 17 12:51:59 PDT 2011

On 17 May 2011 15:00, Phillips, Addison <addison at lab126.com> wrote:

> 2. Allowing unpaired surrogates is a *requirement*. Yes, such a string is
> "ill-formed", but there are too many cases in which one might wish to have
> such "broken" strings for scripting purposes.
> 3. We should have escape syntax for supplementary characters (such as
> \U0010000). Looking up the surrogate pair for a given Unicode character is
> extremely inconvenient and is not self-documenting.
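
To illustrate the inconvenience described in point 3, here is a minimal sketch of the arithmetic needed today to turn a supplementary code point into its UTF-16 surrogate pair. The helper names `toSurrogatePair` and `toEscape` are hypothetical and chosen only for this illustration; they are not part of any proposal in this thread.

```javascript
// Hypothetical helper: split a supplementary code point (U+10000..U+10FFFF)
// into its UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("not a supplementary code point");
  }
  const offset = codePoint - 0x10000;     // 20-bit value
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

// Hypothetical helper: render the pair as the \uXXXX\uXXXX escape text
// that has to be typed by hand in the absence of a longer escape syntax.
function toEscape(codePoint) {
  return toSurrogatePair(codePoint)
    .map((unit) => "\\u" + unit.toString(16).toUpperCase().padStart(4, "0"))
    .join("");
}

console.log(toEscape(0x10000)); // \uD800\uDC00
```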

> As Shawn notes, basically, there are three ways that one might wish to
> access strings:
> - as code units (encoding units of text)

I don't understand why (other than that it is there by an accident of history)
it is desirable to expose a particular low-level detail of one possible
encoding of Unicode characters to end-user programmers.

Your point about database storage only holds if the database happens to
store Unicode strings encoded in UTF-16. It could just as easily use UTF-8,
UTF-7, or UTF-32. For that matter, the database input routine could filter
all characters not in ISO-Latin-1 and store only the lower half of
non-surrogate-pair UTF-16 code units.
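
As a rough sketch of the point above, the same three characters occupy a different number of units depending on the encoding, so the UTF-16 count is only one of several a storage layer might care about. The function names `measure` and `keepLatin1` are hypothetical, and the sketch assumes an environment that provides `TextEncoder` (current browsers and Node.js).

```javascript
// Hypothetical helper: report how large a string is under three encodings.
function measure(s) {
  const codePoints = Array.from(s).length; // Array.from iterates by code point
  return {
    utf8Bytes: new TextEncoder().encode(s).length,
    utf16Units: s.length,                  // what String.prototype.length reports
    utf32Units: codePoints,                // one 32-bit unit per code point
  };
}

// Hypothetical input filter: keep only characters in the ISO-Latin-1 range,
// each of which fits in the low byte of a single UTF-16 code unit.
function keepLatin1(s) {
  return Array.from(s)
    .filter((ch) => ch.codePointAt(0) <= 0xFF)
    .join("");
}

const sample = "a\u00E9\uD834\uDD1E"; // 'a', e-acute, U+1D11E (musical G clef)
console.log(measure(sample));            // { utf8Bytes: 7, utf16Units: 4, utf32Units: 3 }
console.log(keepLatin1(sample).length);  // 2
```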


Wesley W. Garland
Director, Product Development
PageMail, Inc.
+1 613 542 2787 x 102