Full Unicode strings strawman
addison at lab126.com
Tue May 17 13:00:49 PDT 2011
Okay, that example was poorly chosen. However, when a given string representation uses a particular code unit, you often need programmatic access to it, e.g. for loops and the like that iterate over the text.
It may be an accident of history, but that doesn't mean that scripters don't need access to it.
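To make the loop case concrete, here is a minimal sketch, assuming a UTF-16-based string type as in current ECMAScript. The helper name codeUnits and the sample string are made up for illustration; charCodeAt and length are the existing String members.

```javascript
// Collect the UTF-16 code units of a string, one entry per index.
// `codeUnits` is a hypothetical helper name, not an existing API.
function codeUnits(str) {
  const units = [];
  for (let i = 0; i < str.length; i++) {
    // charCodeAt returns the 16-bit code unit at index i,
    // even when it is one half of a surrogate pair.
    units.push(str.charCodeAt(i));
  }
  return units;
}

// "a" followed by U+10000, written as its surrogate pair.
const sample = "a\uD800\uDC00";
const units = codeUnits(sample); // [0x61, 0xD800, 0xDC00]
```

The loop sees three code units for two characters, which is exactly the level of access being argued for here.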
On May 17, 2011, at 12:52 PM, "Wes Garland" <wes at page.ca> wrote:
On 17 May 2011 15:00, Phillips, Addison <addison at lab126.com> wrote:
2. Allowing unpaired surrogates is a *requirement*. Yes, such a string is "ill-formed", but there are too many cases in which one might wish to have such "broken" strings for scripting purposes.
3. We should have escape syntax for supplementary characters (such as \U0010000). Looking up the surrogate pair for a given Unicode character is extremely inconvenient and is not self-documenting.
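As an illustration of the lookup that point 3 calls inconvenient, the surrogate pair for a supplementary code point can be computed with the standard UTF-16 arithmetic. This is only a sketch; the function name toSurrogatePair is hypothetical.

```javascript
// Compute the UTF-16 surrogate pair for a supplementary code point
// (U+10000 through U+10FFFF). `toSurrogatePair` is a hypothetical name.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("not a supplementary code point");
  }
  const offset = codePoint - 0x10000;      // 20-bit value
  const high = 0xD800 + (offset >> 10);    // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits
  return [high, low];
}

toSurrogatePair(0x10000); // [0xD800, 0xDC00]
toSurrogatePair(0x1F600); // [0xD83D, 0xDE00]
```

Writing "\uD83D\uDE00" by hand requires doing this computation first, and the result gives a reader no hint that U+1F600 was meant.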
As Shawn notes, basically, there are three ways that one might wish to access strings:
- as code units (encoding units of text)
I don't understand why (except that it is there by an accident of history) it is desirable to expose a particular low-level detail about one possible encoding for Unicode characters to end-user programmers.
Your point about database storage only holds if the database happens to store Unicode strings encoded in UTF-16. It could just as easily use UTF-8, UTF-7, or UTF-32. For that matter, the database input routine could filter all characters not in ISO-Latin-1 and store only the lower half of non-surrogate-pair UTF-16 code units.
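A rough sketch of why the storage size depends on the encoding the database picks: the same three characters take a different number of bytes in each form. The helper name encodedSizes is hypothetical, and the sketch assumes a runtime that provides the standard TextEncoder global and a well-formed input string.

```javascript
// Byte counts for one string under three Unicode encoding forms.
// `encodedSizes` is a hypothetical helper; input is assumed well-formed.
function encodedSizes(str) {
  return {
    utf8: new TextEncoder().encode(str).length, // bytes after UTF-8 encoding
    utf16: str.length * 2,                      // 2 bytes per UTF-16 code unit
    utf32: [...str].length * 4,                 // 4 bytes per code point
  };
}

// "a", U+00E9, and U+10000 (one supplementary character).
const sizes = encodedSizes("a\u00E9\u{10000}");
// sizes is { utf8: 7, utf16: 8, utf32: 12 }
```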
Wesley W. Garland
Director, Product Development
+1 613 542 2787 x 102
More information about the es-discuss