Code points vs Unicode scalar values

Erik Corry erik.corry at gmail.com
Fri Sep 20 03:28:57 PDT 2013


On Wed, Sep 11, 2013 at 12:40 PM, Anne van Kesteren <annevk at annevk.nl> wrote:

> On Tue, Sep 10, 2013 at 8:14 AM, Mathias Bynens <mathias at qiwi.be> wrote:
> > FWIW, here’s a real-world example of a case where this behavior is
> annoying/unexpected to developers: http://cirw.in/blog/node-unicode
>
> That seems like a serious bug in V8 though. A utf-8 encoder should
> never ever generate CESU-8 byte sequences.
>

Just to be clear, V8 does not generate CESU-8 if you give it well-formed
UTF-16.

If you give it broken UTF-16 with unpaired surrogates, you can either lose
data or emit CESU-8.  In the first case, you overwrite each unpaired
surrogate with some sort of error character code.  In the second case, you
generate three-byte UTF-8 sequences that are not strictly legal.  The
second option preserves the data if you round-trip it into V8 again (or
feed it to other apps that are liberal in what they accept), so that's what
V8 currently does.
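
To make the two options concrete, here is a minimal sketch in Python.  It is
not V8's code; the function names and the `lossy` flag are hypothetical, and
U+FFFD is only an assumed choice of error character.  Well-formed surrogate
pairs always become one four-byte sequence.  An unpaired surrogate is either
replaced (first option) or pushed through the ordinary three-byte layout
(second option).  Strictly speaking, CESU-8 proper also encodes well-formed
pairs as two three-byte sequences, which this sketch does not do.

```python
def encode_code_point(cp):
    """Apply the generic UTF-8 bit layout, with no surrogate check."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def encode_utf16_units(units, lossy):
    """Encode a list of UTF-16 code units (ints) to bytes.

    lossy=True:  unpaired surrogates become U+FFFD (assumed error character).
    lossy=False: unpaired surrogates are kept as three-byte sequences,
                 which are not strictly legal UTF-8.
    """
    out = bytearray()
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Well-formed pair: combine into one supplementary code point.
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            cp = u
            i += 1
            if lossy and 0xD800 <= cp <= 0xDFFF:
                cp = 0xFFFD  # the data is lost here
        out += encode_code_point(cp)
    return bytes(out)
```

With this sketch, the lone code unit 0xD800 encodes to ED A0 80 when
`lossy` is false and to EF BF BD when it is true.  Python's own
"surrogatepass" error handler produces the same ED A0 80 bytes for a lone
surrogate, which is a convenient cross-check.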

-- 
Erik Corry


