Full Unicode based on UTF-16 proposal

Glenn Adams glenn at skynav.com
Mon Mar 26 23:11:18 PDT 2012

On Mon, Mar 26, 2012 at 10:37 PM, Norbert Lindenberg <
ecmascript at norbertlindenberg.com> wrote:

> The conformance clause doesn't say anything about the interpretation of
> (UTF-16) code units as code points. To check conformance with C1, you have
> to look at how the resulting code points are actually further interpreted.

True, but if the proposed language

"A code unit that is in the range 0xD800 to 0xDFFF, but is not part of a
surrogate pair, is interpreted as a code point with the same value."

is adopted, won't this have the effect of creating unpaired
surrogates as code points? If so, then by my estimation, this *will* increase
the likelihood of their being interpreted as abstract characters, e.g.,
if the unpaired code unit is interpreted as an unpaired surrogate code
point, and some process/function performs *any* predicate or transform on
that code point, then that amounts to interpreting it as an abstract
character.
I would rather see such an unpaired code unit either (1) be mapped to
U+FFFD, or (2) cause an exception to be raised when performing an operation
that requires conversion of the UTF-16 code unit sequence.
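
To make the three options concrete, here is a rough sketch of a UTF-16 decoder with a switchable policy for unpaired surrogates. The function name decodeUtf16 and the policy names "preserve", "replace", and "throw" are made up for illustration and appear in no proposal: "preserve" follows the proposed language quoted above, while "replace" and "throw" correspond to alternatives (1) and (2).

```javascript
// Illustrative sketch only: decodeUtf16 and its policy names are hypothetical.
function decodeUtf16(units, policy = "preserve") {
  const codePoints = [];
  for (let i = 0; i < units.length; i++) {
    const unit = units[i];
    const isHigh = unit >= 0xD800 && unit <= 0xDBFF;
    const isLow = unit >= 0xDC00 && unit <= 0xDFFF;

    if (isHigh && i + 1 < units.length) {
      const next = units[i + 1];
      if (next >= 0xDC00 && next <= 0xDFFF) {
        // Well-formed surrogate pair: combine into one supplementary code point.
        codePoints.push(0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00));
        i++; // skip the low surrogate that was just consumed
        continue;
      }
    }

    if (isHigh || isLow) {
      // Unpaired surrogate: behavior depends on the chosen policy.
      if (policy === "replace") {
        codePoints.push(0xFFFD); // alternative (1): U+FFFD REPLACEMENT CHARACTER
      } else if (policy === "throw") {
        // alternative (2): refuse to convert an ill-formed sequence
        throw new RangeError("unpaired surrogate at index " + i);
      } else {
        codePoints.push(unit); // proposed language: code point with the same value
      }
      continue;
    }

    codePoints.push(unit); // ordinary BMP code unit
  }
  return codePoints;
}

// "A", U+1F600 as a surrogate pair, a lone high surrogate, "B"
const sample = [0x0041, 0xD83D, 0xDE00, 0xD800, 0x0042];
console.log(decodeUtf16(sample, "preserve").map((cp) => cp.toString(16)));
console.log(decodeUtf16(sample, "replace").map((cp) => cp.toString(16)));
```

Under "preserve", the lone 0xD800 comes out as a surrogate code point that later predicates and transforms will see, which is the case I am concerned about. Under "replace" or "throw", no surrogate code point ever reaches them.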

> My proposal interprets the resulting code points in the following ways:
> 1) In regular expressions, they can be used in both patterns and input
> strings to be matched. They may be compared against other code points, or
> against character classes, some of which will hopefully soon be defined by
> Unicode properties. In the case of comparing against other code points,
> they can't match any code points assigned to abstract characters. In the
> case of Unicode properties, they'll typically fall into the large bucket of
> have-nots, along with other unassigned code points or, for example, U+FFFD,
> unless you ask for their general category.
> 2) When parsing identifiers, they will not have the ID_Start or
> ID_Continue properties, so they'll be excluded, just like other unassigned
> code points or U+FFFD.
> 3) In case conversion, they won't have upper case or lower case
> equivalents defined, and remain as is, as would happen for unassigned code
> points or U+FFFD.
> I don't think any of these amounts to interpretation as abstract
> characters. I mention U+FFFD because the alternative interpretation of
> unpaired surrogates would be to replace them with U+FFFD, but that doesn't
> seem to improve anything.
>
> Norbert
>
> On Mar 26, 2012, at 15:10, Glenn Adams wrote:
> > On Mon, Mar 26, 2012 at 2:02 PM, Gavin Barraclough <
> > barraclough at apple.com> wrote:
> > I really like the direction you're going in, but have one minor concern
> > relating to regular expressions.
> >
> > In your proposal, you currently state:
> >        "A code unit that is in the range 0xD800 to 0xDFFF, but is not
> >        part of a surrogate pair, is interpreted as a code point with the
> >        same value."
> >
> > Just as a reminder, this would be in explicit violation of the Unicode
> > conformance clause C1 unless it can be guaranteed that such a code point
> > will not be interpreted as an abstract character:
> >
> > C1    A process shall not interpret a high-surrogate code point or a
> > low-surrogate code point as an abstract character.
> >
> > [1] http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf
> >
> > Given that such guarantee is likely impractical, this presents a problem
> > for the above proposed language.