Should Decode accept U+FFFE or U+FFFF (and other Unicode non-characters)?

Jeff Walden jwalden+es at MIT.EDU
Thu Jul 14 22:38:01 PDT 2011

Reraising this issue...

To briefly repeat: Decode, called by decodeURI{,Component}, says to reject %ab%cd%ef sequences whose octets "[do] not contain a valid UTF-8 encoding of a Unicode code point".  Browsers appear to interpret this requirement as: reject overlong UTF-8 sequences, and otherwise reject only unpaired or mispaired surrogate "code points".  Is this exactly what ES5 requires?  And if it is, should it be?  Firefox has additionally treated otherwise-valid-looking encodings of U+FFFE and U+FFFF as specifying the replacement character U+FFFD.  The rationale for rejecting U+FFF{E,F} also seems to apply to the non-character range [U+FDD0, U+FDEF] and to U+xyFF{E,F} (the last two code points of every plane).  Table 21 seems to say only malformed encodings and bad surrogates should be rejected, but "valid encoding of a code point" is arguably unclear.
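For concreteness, here is a rough sketch of how one might probe an engine for the cases above. The tryDecode helper and the cases object are hypothetical names introduced only for this illustration; the escape sequences are simply the UTF-8 octets of the code points named in the comments. What the non-character cases produce is exactly the open question, so no particular result is claimed for them here.

```javascript
// Hypothetical helper (not part of any spec or engine): returns the decoded
// string, or the name of the exception if decoding throws.
function tryDecode(escaped) {
  try {
    return decodeURIComponent(escaped);
  } catch (e) {
    return e.name;
  }
}

// Render a result as "U+XXXX ..." code points, or pass an error name through.
function describe(result) {
  if (result === "URIError") return result;
  return Array.from(result, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  ).join(" ");
}

const cases = {
  overlongNul: tryDecode("%C0%80"),      // overlong two-octet form of U+0000
  loneSurrogate: tryDecode("%ED%A0%80"), // UTF-8-style octets for U+D800
  uFFFE: tryDecode("%EF%BF%BE"),         // U+FFFE
  uFFFF: tryDecode("%EF%BF%BF"),         // U+FFFF
  uFDD0: tryDecode("%EF%B7%90"),         // U+FDD0, start of [U+FDD0, U+FDEF]
};

for (const [name, result] of Object.entries(cases)) {
  console.log(name + ": " + describe(result));
}
```

The first two cases should report URIError under the interpretation described above; the last three show whether a given engine passes non-characters through, replaces them, or throws.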

At least one person interested in Firefox's decoding implementation argues that not rejecting or replacing U+FFF{E,F} is a "potential security vulnerability" because those code points (particularly U+FFFE) might confuse code into interpreting a sequence of code points with the wrong endianness.  I find the argument unpersuasive and the potential harm too speculative (particularly as no other browser replaces or rejects U+FFF{E,F}).  But the point's been raised, and it's at least somewhat plausible, so I'd like to see it conclusively addressed.
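The argument doesn't spell out a mechanism, but my assumption is that it rests on U+FFFE being the byte-swapped form of the byte order mark U+FEFF. A minimal sketch of that relationship, using nothing beyond standard typed arrays (the variable names are just for illustration):

```javascript
// The two octets of a big-endian UTF-16 byte order mark (U+FEFF).
const bytes = new Uint8Array([0xfe, 0xff]);
const view = new DataView(bytes.buffer);

// Read the same two octets as one 16-bit code unit in each byte order.
const asBigEndian = view.getUint16(0, false);   // littleEndian = false
const asLittleEndian = view.getUint16(0, true); // littleEndian = true

console.log(asBigEndian.toString(16));    // feff
console.log(asLittleEndian.toString(16)); // fffe
```

So code that sniffs byte order from a leading code unit could, in principle, take a stray U+FFFE as evidence that the data is byte-swapped. Whether any real consumer of decodeURI{,Component} output behaves this way is another matter.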

A last note: two test262 tests directly exercise the Decode algorithm and expect that these two characters decode to U+FFF{E,F}.  (I think at a glance they might also allow throwing, though it's not clear to me that's intentional.)

