Full Unicode based on UTF-16 proposal

Mark Davis ☕ mark at macchiato.com
Tue Mar 27 07:39:55 PDT 2012


That, as Norbert explained, is not the intention of the standard. Take a
look at the discussion of "Unicode 16-bit string" in chapter 3. The
committee recognized that fragments may be formed when working with UTF-16,
and that destructive changes may do more harm than good.

x = a.substring(0, 5) + b + a.substring(5, a.length());
y = x.substring(0, 5) + x.substring(5 + b.length(), x.length());

After these operations, you want y to equal a, even if index 5 falls between
the two halves of a surrogate pair such as D800 DC00.
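A minimal Java sketch of that round trip. The class name, the helper names
insertAt and removeAt, and the sample strings are assumptions made for
illustration; index 5 is chosen so that it lands between D800 and DC00.

```java
class SpliceRoundTrip {
    // Insert b into a at a code-unit index, with no surrogate checking.
    static String insertAt(String a, int index, String b) {
        return a.substring(0, index) + b + a.substring(index);
    }

    // Remove count code units starting at index, again with no checking.
    static String removeAt(String x, int index, int count) {
        return x.substring(0, index) + x.substring(index + count);
    }

    public static void main(String[] args) {
        // "abcd" + U+10000 (D800 DC00) + "e": index 5 splits the pair.
        String a = "abcd\uD800\uDC00e";
        String b = "-";
        String x = insertAt(a, 5, b);           // x briefly holds two unpaired surrogates
        String y = removeAt(x, 5, b.length());  // the two halves are adjacent again
        System.out.println(y.equals(a));
    }
}
```

Because substring copies code units without inspecting them, the temporary
fragments in x are harmless: nothing was rewritten while the pair was split,
so y compares equal to a.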

Or take:
String output = "";
for (int i = 0; i < s.length(); ++i) {
  char ch = s.charAt(i);  // one UTF-16 code unit, possibly half of a pair
  if (ch == '&') {
    ch = '@';
  }
  output += ch;
}

After this operation is done, you want "a&\u{10000}b" to become "a@\u{10000}b",
not "a@\u{FFFD}\u{FFFD}b".

It is also an unnecessary burden on lower-level software to always check
for unpaired surrogates.
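To make the contrast concrete, here is a Java sketch with two hypothetical
helpers (the names are not from any proposal). replaceAmp copies code units
as they are; replaceAmpSanitizing stands in for a string accessor that
eagerly turns any surrogate code unit it hands out into U+FFFD.

```java
class AmpersandReplace {
    // Copies UTF-16 code units untouched, replacing only '&'.
    static String replaceAmp(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ++i) {
            char ch = s.charAt(i);
            out.append(ch == '&' ? '@' : ch);
        }
        return out.toString();
    }

    // Hypothetical "destructive" variant: every surrogate code unit seen on
    // its own is replaced by U+FFFD before the caller gets to copy it.
    static String replaceAmpSanitizing(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ++i) {
            char ch = s.charAt(i);
            if (Character.isSurrogate(ch)) {
                ch = '\uFFFD';
            }
            out.append(ch == '&' ? '@' : ch);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String s = "a&\uD800\uDC00b"; // "a&" + U+10000 + "b"
        System.out.println(replaceAmp(s).equals("a@\uD800\uDC00b"));
        System.out.println(replaceAmpSanitizing(s).equals("a@\uFFFD\uFFFDb"));
    }
}
```

The first helper never needs to know about surrogates at all, which is the
point: a loop that only cares about '&' still passes U+10000 through intact.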

Of course, when you convert to UTF-16 (or UTF-8 or UTF-32) for storage or
output, then you do need to either replace unpaired surrogates with U+FFFD
or take some other action.
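One way to do that check only at the output boundary, sketched in Java. The
class and the helpers toWellFormed and toUtf8 are illustrative assumptions;
replacing with U+FFFD is just one of the possible actions, and throwing an
exception at the same spot would be another.

```java
import java.nio.charset.StandardCharsets;

class OutputBoundary {
    // Replace each unpaired surrogate with U+FFFD; well-formed pairs pass through.
    static String toWellFormed(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);      // a lone surrogate comes back as itself
            i += Character.charCount(cp);
            if (cp >= 0xD800 && cp <= 0xDFFF) {
                cp = 0xFFFD;
            }
            out.appendCodePoint(cp);
        }
        return out.toString();
    }

    // Encode for storage or output only after the string is well formed.
    static byte[] toUtf8(String s) {
        return toWellFormed(s).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bytes = toUtf8("a\uD800b");
        System.out.println(bytes.length);   // 1 + 3 + 1 bytes: 'a', EF BF BD, 'b'
    }
}
```

In-memory operations stay cheap and non-destructive; only the code that
actually emits an encoding form pays for validation.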

------------------------------
Mark <https://plus.google.com/114199149796022210033>
*— Il meglio è l’inimico del bene —*



On Mon, Mar 26, 2012 at 23:11, Glenn Adams <glenn at skynav.com> wrote:

>
> On Mon, Mar 26, 2012 at 10:37 PM, Norbert Lindenberg <
> ecmascript at norbertlindenberg.com> wrote:
>
>> The conformance clause doesn't say anything about the interpretation of
>> (UTF-16) code units as code points. To check conformance with C1, you have
>> to look at how the resulting code points are actually further interpreted.
>>
>
> True, but if the proposed language
>
> "A code unit that is in the range 0xD800 to 0xDFFF, but is not part of a
> surrogate pair, is interpreted as a code point with the same value."
>
> is adopted, then will not this have an effect of creating unpaired
> surrogates as code points? If so, then by my estimation, this *will* increase
> the likelihood of their being interpreted as abstract characters... e.g.,
> if the unpaired code unit is interpreted as an unpaired surrogate code
> point, and some process/function performs *any* predicate or transform on
> that code point, then that amounts to interpreting it as an abstract
> character.
>
> I would rather see such unpaired code unit either (1) be mapped to
> U+00FFFD, or (2) an exception raised when performing an operation that
> requires conversion of the UTF-16 code unit sequence.
>
>
>> My proposal interprets the resulting code points in the following ways:
>>
>> 1) In regular expressions, they can be used in both patterns and input
>> strings to be matched. They may be compared against other code points, or
>> against character classes, some of which will hopefully soon be defined by
>> Unicode properties. In the case of comparing against other code points,
>> they can't match any code points assigned to abstract characters. In the
>> case of Unicode properties, they'll typically fall into the large bucket of
>> have-nots, along with other unassigned code points or, for example, U+FFFD,
>> unless you ask for their general category.
>>
>> 2) When parsing identifiers, they will not have the ID_Start or
>> ID_Continue properties, so they'll be excluded, just like other unassigned
>> code points or U+FFFD.
>>
>> 3) In case conversion, they won't have upper case or lower case
>> equivalents defined, and remain as is, as would happen for unassigned code
>> points or U+FFFD.
>>
>> I don't think any of these amounts to interpretation as abstract
>> characters. I mention U+FFFD because the alternative interpretation of
>> unpaired surrogates would be to replace them with U+FFFD, but that doesn't
>> seem to improve anything.
>>
>> Norbert
>>
>>
>>
>> On Mar 26, 2012, at 15:10 , Glenn Adams wrote:
>>
>> > On Mon, Mar 26, 2012 at 2:02 PM, Gavin Barraclough
>> > <barraclough at apple.com> wrote:
>> > I really like the direction you're going in, but have one minor concern
>> > relating to regular expressions.
>> >
>> > In your proposal, you currently state:
>> >        "A code unit that is in the range 0xD800 to 0xDFFF, but is not
>> >        part of a surrogate pair, is interpreted as a code point with the
>> >        same value."
>> >
>> > Just as a reminder, this would be in explicit violation of the Unicode
>> > conformance clause C1 unless it can be guaranteed that such a code point
>> > will not be interpreted as an abstract character:
>> >
>> > C1    A process shall not interpret a high-surrogate code point or a
>> > low-surrogate code point as an abstract character.
>> >
>> > [1] http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf
>> >
>> > Given that such guarantee is likely impractical, this presents a
>> > problem for the above proposed language.
>>
>>
>