Full Unicode based on UTF-16 proposal

Glenn Adams glenn at skynav.com
Tue Mar 27 08:56:46 PDT 2012


This raises the question of what the point of C1 is.

On Tue, Mar 27, 2012 at 9:13 AM, Mark Davis ☕ <mark at macchiato.com> wrote:

> That would not be practical, nor predictable. And note that the 700K
> reserved code points are also not to be interpreted as characters; by your
> logic all of them would need to be converted to FFFD.
>
> And in practice, an unpaired surrogate is best treated just like a
> reserved (unassigned) code point. For example, a lowercase operation should
> convert characters with lowercase correspondents to those correspondents,
> and leave *everything* else alone: control characters, format characters,
> reserved code points, surrogates, etc.
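
A minimal Java sketch of the lowercase behavior described above, assuming Java's Character.toLowerCase(int) as the per-code-point mapping. The class and method names (LowercaseSketch, lowerByCodePoint) are made up for illustration and are not part of any proposal in this thread.

```java
final class LowercaseSketch {
    // Lowercase a string code point by code point. Anything without a
    // lowercase mapping (controls, unassigned code points, unpaired
    // surrogates) is copied through unchanged.
    static String lowerByCodePoint(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            // For an unpaired surrogate, codePointAt returns the code unit's own value.
            int cp = s.codePointAt(i);
            out.appendCodePoint(Character.toLowerCase(cp));
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // 'A', an unpaired high surrogate U+D800, then 'B'
        String in = "A\uD800B";
        String out = lowerByCodePoint(in);
        // The letters are lowercased; the unpaired surrogate is left alone.
        System.out.println(out.equals("a\uD800b"));
    }
}
```

The point of the sketch is only that "leave everything else alone" needs no special case for surrogates: they simply have no mapping.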
>
> ------------------------------
> Mark <https://plus.google.com/114199149796022210033>
> *— Il meglio è l’inimico del bene —*
>
>
>
> On Tue, Mar 27, 2012 at 08:02, Glenn Adams <glenn at skynav.com> wrote:
>
>>
>>
>> On Tue, Mar 27, 2012 at 8:39 AM, Mark Davis ☕ <mark at macchiato.com> wrote:
>>
>>> That, as Norbert explained, is not the intention of the standard. Take a
>>> look at the discussion of "Unicode 16-bit string" in chapter 3. The
>>> committee recognized that fragments may be formed when working with UTF-16,
>>> and that destructive changes may do more harm than good.
>>>
>>> x = a.substring(0, 5) + b + a.substring(5, a.length());
>>> y = x.substring(0, 5) + x.substring(6, x.length());
>>>
>>> After this operation is done, you want y == a, even if 5 is between D800
>>> and DC00.
>>>
>>
>> Assuming that b.length() == 1 in this example, my interpretation of this
>> is that '=', '+', and 'substring' are operations whose domain and co-domain
>> are (currently defined) ES Strings, namely sequences of UTF-16 code units.
>> Since none of these operations entail interpreting the semantics of a code
>> point (i.e., interpreting abstract characters), then there is no violation
>> of C1 here.
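
The substring round trip can be checked directly. This is a small Java sketch, assuming b.length() == 1 as above; the helper names insertAt and removeAt are hypothetical.

```java
final class SubstringRoundTrip {
    // Insert b at code-unit index idx, as in the first line of the quoted example.
    static String insertAt(String a, int idx, String b) {
        return a.substring(0, idx) + b + a.substring(idx);
    }

    // Remove the single code unit at index idx, as in the second line.
    static String removeAt(String x, int idx) {
        return x.substring(0, idx) + x.substring(idx + 1);
    }

    public static void main(String[] args) {
        // Code units: a b c d D800 DC00 e f, so index 5 falls between D800 and DC00.
        String a = "abcd\uD800\uDC00ef";
        String x = insertAt(a, 5, "X");   // x temporarily holds two unpaired surrogates
        String y = removeAt(x, 5);
        System.out.println(y.equals(a));  // prints true
    }
}
```

The intermediate string x is ill-formed UTF-16, but since nothing here looks at code points, the original pair is restored intact.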
>>
>>> Or take:
>>> output = "";
>>> for (int i = 0; i < s.length(); ++i) {
>>>   ch = s.charAt(i);
>>>   if (ch.equals('&')) {
>>>     ch = '@';
>>>   }
>>>   output += ch;
>>> }
>>>
>>> After this operation is done, you want "a&\u{10000}b" to become "a@\u{10000}b",
>>> not "a&\u{FFFD}\u{FFFD}b".
>>> It is also an unnecessary burden on lower-level software to always check
>>> this stuff.
>>>
>>
>> Again, in this example, I assume that the string literal "a&\u{10000}b"
>> maps to the UTF-16 code unit sequence:
>>
>> 0061 0026 D800 DC00 0062
>>
>> Given that 'charAt(i)' is defined on (and is indexing) code units and not
>> code points, and since the 'equals' operator is also defined on code units,
>> this example also does not require interpreting the semantics of code
>> points (i.e., interpreting abstract characters).
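
For reference, here is a compilable Java rendering of the loop, written as a sketch with a hypothetical method name (replaceAmpersand). It compares and copies code units only, so the pair D800 DC00 passes through untouched.

```java
final class CodeUnitReplace {
    // Replace '&' with '@' one code unit at a time.
    // No code unit is ever interpreted as a code point or a character.
    static String replaceAmpersand(String s) {
        StringBuilder output = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ++i) {
            char ch = s.charAt(i);      // a UTF-16 code unit, not a code point
            if (ch == '&') {
                ch = '@';
            }
            output.append(ch);
        }
        return output.toString();
    }

    public static void main(String[] args) {
        // Code units: 0061 0026 D800 DC00 0062
        String in = "a&\uD800\uDC00b";
        String out = replaceAmpersand(in);
        System.out.println(out.equals("a@\uD800\uDC00b"));  // prints true
    }
}
```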
>>
>> However, in Norbert's questions above about isUUppercase(int) and
>> toUpperCase(int), it is clear that the domain of these operations is code
>> points, not code units, and further, that they require interpretation as
>> abstract characters in order to determine the semantics of the
>> corresponding characters.
>>
>> My conclusion is that the determination of whether C1 is violated or not
>> depends upon the domain, codomain, and operation being considered.
>>
>>
>>> Of course, when you convert to UTF-16 (or UTF-8 or 32) for storage or
>>> output, then you do need to either convert to FFFD or take some other
>>> action.
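
A Java sketch of the two choices at the storage/output boundary, assuming UTF-8 as the target encoding. The class and method names are illustrative only. One path replaces each unpaired surrogate with U+FFFD before encoding; the other uses a CharsetEncoder, whose default error action is to report malformed input, so it throws instead.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

final class StorageBoundary {
    // Replace each unpaired surrogate code unit with U+FFFD.
    static String replaceUnpaired(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            // codePointAt only returns a value in D800..DFFF for an unpaired surrogate.
            boolean unpaired = cp >= 0xD800 && cp <= 0xDFFF;
            out.appendCodePoint(unpaired ? 0xFFFD : cp);
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    // "Convert to FFFD": always succeeds, but is lossy for ill-formed input.
    static byte[] toUtf8Lossy(String s) {
        return replaceUnpaired(s).getBytes(StandardCharsets.UTF_8);
    }

    // "Some other action": refuse ill-formed input by throwing.
    static byte[] toUtf8Strict(String s) throws CharacterCodingException {
        ByteBuffer buf = StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(s));
        byte[] bytes = new byte[buf.remaining()];
        buf.get(bytes);
        return bytes;
    }
}
```

Either way the check happens once, at the boundary, rather than in every lower-level string operation.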
>>>
>>> ------------------------------
>>> Mark <https://plus.google.com/114199149796022210033>
>>> *— Il meglio è l’inimico del bene —*
>>>
>>>
>>>
>>> On Mon, Mar 26, 2012 at 23:11, Glenn Adams <glenn at skynav.com> wrote:
>>>
>>>>
>>>> On Mon, Mar 26, 2012 at 10:37 PM, Norbert Lindenberg <
>>>> ecmascript at norbertlindenberg.com> wrote:
>>>>
>>>>> The conformance clause doesn't say anything about the interpretation
>>>>> of (UTF-16) code units as code points. To check conformance with C1, you
>>>>> have to look at how the resulting code points are actually further
>>>>> interpreted.
>>>>>
>>>>
>>>> True, but if the proposed language
>>>>
>>>> "A code unit that is in the range 0xD800 to 0xDFFF, but is not part of
>>>> a surrogate pair, is interpreted as a code point with the same value."
>>>>
>>>> is adopted, then won't this have the effect of creating unpaired
>>>> surrogates as code points? If so, then by my estimation, this *will* increase
>>>> the likelihood of their being interpreted as abstract characters... e.g.,
>>>> if the unpaired code unit is interpreted as an unpaired surrogate code
>>>> point, and some process/function performs *any* predicate or transform
>>>> on that code point, then that amounts to interpreting it as an abstract
>>>> character.
>>>>
>>>> I would rather see such an unpaired code unit either (1) mapped to
>>>> U+FFFD, or (2) cause an exception to be raised when performing an
>>>> operation that requires conversion of the UTF-16 code unit sequence.
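
To make the alternatives concrete, here is a Java sketch of a code-unit-to-code-point conversion with three policies: the proposed pass-through, option (1) replacement, and option (2) an exception. The Policy enum and the decode method are hypothetical names invented for this illustration; nothing like them appears in the proposal.

```java
import java.util.ArrayList;
import java.util.List;

final class DecodePolicies {
    enum Policy { PASS_THROUGH, REPLACE, THROW }

    // Convert a UTF-16 code unit sequence to a list of code points,
    // handling unpaired surrogates according to the given policy.
    static List<Integer> decode(String s, Policy policy) {
        List<Integer> codePoints = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            int step = Character.charCount(cp);
            // A value in D800..DFFF here means the code unit was not part of a pair.
            if (cp >= 0xD800 && cp <= 0xDFFF) {
                if (policy == Policy.THROW) {
                    throw new IllegalArgumentException("unpaired surrogate at index " + i);
                }
                if (policy == Policy.REPLACE) {
                    cp = 0xFFFD;
                }
                // PASS_THROUGH: keep the code point with the same value as the code unit.
            }
            codePoints.add(cp);
            i += step;
        }
        return codePoints;
    }
}
```

With PASS_THROUGH the caller receives a surrogate code point and must take care not to interpret it; with the other two policies that situation cannot arise.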
>>>>
>>>>
>>>>> My proposal interprets the resulting code points in the following ways:
>>>>>
>>>>> 1) In regular expressions, they can be used in both patterns and input
>>>>> strings to be matched. They may be compared against other code points, or
>>>>> against character classes, some of which will hopefully soon be defined by
>>>>> Unicode properties. In the case of comparing against other code points,
>>>>> they can't match any code points assigned to abstract characters. In the
>>>>> case of Unicode properties, they'll typically fall into the large bucket of
>>>>> have-nots, along with other unassigned code points or, for example, U+FFFD,
>>>>> unless you ask for their general category.
>>>>>
>>>>> 2) When parsing identifiers, they will not have the ID_Start or
>>>>> ID_Continue properties, so they'll be excluded, just like other unassigned
>>>>> code points or U+FFFD.
>>>>>
>>>>> 3) In case conversion, they won't have upper case or lower case
>>>>> equivalents defined, and remain as is, as would happen for unassigned code
>>>>> points or U+FFFD.
>>>>>
>>>>> I don't think any of these amounts to interpretation as abstract
>>>>> characters. I mention U+FFFD because the alternative interpretation of
>>>>> unpaired surrogates would be to replace them with U+FFFD, but that doesn't
>>>>> seem to improve anything.
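
As a rough analogue of points 2 and 3, Java's Character class can be queried for a surrogate code point. This is only a sketch: Java's isUnicodeIdentifierStart and isUnicodeIdentifierPart are used here as stand-ins for ID_Start and ID_Continue, and the wrapper names are made up.

```java
final class SurrogateProperties {
    // Stand-in for the ID_Start check in point 2.
    static boolean canStartIdentifier(int cp) {
        return Character.isUnicodeIdentifierStart(cp);
    }

    // Stand-in for the ID_Continue check in point 2.
    static boolean canContinueIdentifier(int cp) {
        return Character.isUnicodeIdentifierPart(cp);
    }

    // Point 3: no upper or lower case equivalent, so conversion is a no-op.
    static boolean caseConversionIsNoOp(int cp) {
        return Character.toUpperCase(cp) == cp && Character.toLowerCase(cp) == cp;
    }

    // The one property that does single surrogates out: the general category.
    static boolean hasSurrogateCategory(int cp) {
        return Character.getType(cp) == Character.SURROGATE;
    }

    public static void main(String[] args) {
        int[] samples = {0xD800, 0xDFFF, 0xFFFD};
        for (int cp : samples) {
            System.out.printf("U+%04X start=%b continue=%b caseNoOp=%b surrogate=%b%n",
                    cp, canStartIdentifier(cp), canContinueIdentifier(cp),
                    caseConversionIsNoOp(cp), hasSurrogateCategory(cp));
        }
    }
}
```

For these predicates a surrogate code point and U+FFFD behave the same; only the general category query tells them apart.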
>>>>>
>>>>> Norbert
>>>>>
>>>>>
>>>>>
>>>>> On Mar 26, 2012, at 15:10 , Glenn Adams wrote:
>>>>>
>>>>> > On Mon, Mar 26, 2012 at 2:02 PM, Gavin Barraclough
>>>>> > <barraclough at apple.com> wrote:
>>>>> > I really like the direction you're going in, but have one minor
>>>>> > concern relating to regular expressions.
>>>>> >
>>>>> > In your proposal, you currently state:
>>>>> >        "A code unit that is in the range 0xD800 to 0xDFFF, but is
>>>>> >        not part of a surrogate pair, is interpreted as a code point
>>>>> >        with the same value."
>>>>> >
>>>>> > Just as a reminder, this would be in explicit violation of the
>>>>> > Unicode conformance clause C1 unless it can be guaranteed that such
>>>>> > a code point will not be interpreted as an abstract character:
>>>>> >
>>>>> > C1    A process shall not interpret a high-surrogate code point or a
>>>>> >       low-surrogate code point as an abstract character.
>>>>> >
>>>>> > [1] http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf
>>>>> >
>>>>> > Given that such guarantee is likely impractical, this presents a
>>>>> > problem for the above proposed language.
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>