Full Unicode based on UTF-16 proposal

Glenn Adams glenn at skynav.com
Mon Mar 26 23:11:18 PDT 2012


On Mon, Mar 26, 2012 at 10:37 PM, Norbert Lindenberg <
ecmascript at norbertlindenberg.com> wrote:

> The conformance clause doesn't say anything about the interpretation of
> (UTF-16) code units as code points. To check conformance with C1, you have
> to look at how the resulting code points are actually further interpreted.
>

True, but if the proposed language

"A code unit that is in the range 0xD800 to 0xDFFF, but is not part of a
surrogate pair, is interpreted as a code point with the same value."

is adopted, won't this have the effect of creating unpaired surrogates as
code points? If so, then by my estimation this *will* increase the
likelihood of their being interpreted as abstract characters. For example,
if an unpaired code unit is interpreted as an unpaired surrogate code point,
and some process or function performs *any* predicate or transform on that
code point, then that amounts to interpreting it as an abstract character.

I would rather see such an unpaired code unit either (1) be mapped to
U+FFFD, or (2) cause an exception to be raised when performing an operation
that requires conversion of the UTF-16 code unit sequence.
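
To make the three options concrete, here is a rough sketch of a UTF-16
decoder that takes the handling of unpaired surrogates as a parameter. It is
only an illustration: the names decodeUtf16 and LoneSurrogatePolicy are
hypothetical and appear nowhere in the proposal. "passthrough" corresponds to
the proposed language quoted above, "replace" to option (1), and "throw" to
option (2).

```typescript
// Hypothetical illustration only; none of these names come from the proposal.
type LoneSurrogatePolicy = "passthrough" | "replace" | "throw";

const isHighSurrogate = (u: number): boolean => u >= 0xd800 && u <= 0xdbff;
const isLowSurrogate = (u: number): boolean => u >= 0xdc00 && u <= 0xdfff;

// Convert a sequence of UTF-16 code units into a sequence of code points.
function decodeUtf16(units: number[], policy: LoneSurrogatePolicy): number[] {
  const codePoints: number[] = [];
  for (let i = 0; i < units.length; i++) {
    const u = units[i];
    if (isHighSurrogate(u) && i + 1 < units.length && isLowSurrogate(units[i + 1])) {
      // Well-formed surrogate pair: combine into one supplementary code point.
      codePoints.push(0x10000 + ((u - 0xd800) << 10) + (units[i + 1] - 0xdc00));
      i++; // skip the low surrogate that was just consumed
    } else if (isHighSurrogate(u) || isLowSurrogate(u)) {
      // Unpaired surrogate: this is where the three policies differ.
      if (policy === "throw") {
        throw new RangeError(`unpaired surrogate at code unit index ${i}`);
      }
      codePoints.push(policy === "replace" ? 0xfffd : u);
    } else {
      // Ordinary BMP code unit: the code point has the same value.
      codePoints.push(u);
    }
  }
  return codePoints;
}
```

With "passthrough", the sequence 0x0041 0xD800 0x0042 yields a surrogate code
point 0xD800 that every later predicate or transform has to cope with; with
"replace" it yields U+FFFD instead; with "throw" the conversion fails outright.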


> My proposal interprets the resulting code points in the following ways:
>
> 1) In regular expressions, they can be used in both patterns and input
> strings to be matched. They may be compared against other code points, or
> against character classes, some of which will hopefully soon be defined by
> Unicode properties. In the case of comparing against other code points,
> they can't match any code points assigned to abstract characters. In the
> case of Unicode properties, they'll typically fall into the large bucket of
> have-nots, along with other unassigned code points or, for example, U+FFFD,
> unless you ask for their general category.
>
> 2) When parsing identifiers, they will not have the ID_Start or
> ID_Continue properties, so they'll be excluded, just like other unassigned
> code points or U+FFFD.
>
> 3) In case conversion, they won't have upper case or lower case
> equivalents defined, and remain as is, as would happen for unassigned code
> points or U+FFFD.
>
> I don't think any of these amount to interpretation as abstract
> characters. I mention U+FFFD because the alternative interpretation of
> unpaired surrogates would be to replace them with U+FFFD, but that doesn't
> seem to improve anything.
>
> Norbert
>
>
>
> On Mar 26, 2012, at 15:10 , Glenn Adams wrote:
>
> > On Mon, Mar 26, 2012 at 2:02 PM, Gavin Barraclough <barraclough at apple.com> wrote:
> > I really like the direction you're going in, but have one minor concern
> > relating to regular expressions.
> >
> > In your proposal, you currently state:
> >        "A code unit that is in the range 0xD800 to 0xDFFF, but is not
> > part of a surrogate pair, is interpreted as a code point with the same
> > value."
> >
> > Just as a reminder, this would be in explicit violation of the Unicode
> > conformance clause C1 unless it can be guaranteed that such a code point
> > will not be interpreted as an abstract character:
> >
> > C1    A process shall not interpret a high-surrogate code point or a
> > low-surrogate code point as an abstract character.
> >
> > [1] http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf
> >
> > Given that such guarantee is likely impractical, this presents a problem
> > for the above proposed language.
>
>