Should Decode accept U+FFFE or U+FFFF (and other Unicode non-characters)?

Allen Wirfs-Brock allen at wirfs-brock.com
Fri Jul 15 09:08:46 PDT 2011


On Jul 14, 2011, at 10:38 PM, Jeff Walden wrote:

> Reraising this issue...
> 
> To briefly repeat: Decode, called by decodeURI{,Component}, says to reject %ab%cd%ef sequences whose octets "[do] not contain a valid UTF-8 encoding of a Unicode code point".  It appears browsers interpret this requirement as: reject overlong UTF-8 sequences, and otherwise reject only unpaired or mispaired surrogate "code points".  Is this exactly what ES5 requires?  And if it is, should it be?  Firefox has also treated otherwise-valid-looking encodings of U+FFFE and U+FFFF as specifying that the replacement character U+FFFD be used.  And the rationale for rejecting U+FFF{E,F} also seems to apply to the non-character range [U+FDD0, U+FDEF] and U+xyFF{E,F}.  Table 21 seems to say only malformed encodings and bad surrogates should be rejected, but "valid encoding of a code point" is arguably unclear.
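For concreteness, the behaviors under discussion can be probed directly. This is only a sketch of what current engines tend to do, not what the spec mandates; the non-character case is exactly where implementations have differed:

```javascript
// %EF%BF%BE is a structurally valid UTF-8 encoding of the non-character U+FFFE.
// Most engines decode it as-is; the Firefox behavior described above yielded U+FFFD.
var nonChar = decodeURIComponent("%EF%BF%BE");

// %C0%AF is an overlong encoding of "/": rejected everywhere.
var overlongThrew = false;
try { decodeURIComponent("%C0%AF"); } catch (e) { overlongThrew = e instanceof URIError; }

// %ED%A0%80 encodes the lone surrogate U+D800: also rejected.
var surrogateThrew = false;
try { decodeURIComponent("%ED%A0%80"); } catch (e) { surrogateThrew = e instanceof URIError; }
```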

I haven't swapped back in my technical understanding of the subtleties of UTF-8 encoding yet today, so I'm not yet prepared to try to provide a technical response.  But I think I can speak to the intent of the spec (or at least the ES5 version):

1) These are legacy functions that have been in browser JS implementations at least since ES3 days.  We didn't want to change them in any incompatible way.
2) As with RegExp and other similar issues, browser reality (well, legacy browser reality, maybe not newcomers) is more important than what the spec actually says.  If browsers all do something different from the spec, then the spec should be updated accordingly. However, for ES5 we didn't do any deep analysis of this browser reality, so we might have missed something.
3) The intent is pretty clearly stated in the last-paragraph note that includes Table 21 (BTW, since the table is in a note, it isn't normative).  It essentially says to throw an exception when decoding anything that RFC 3629 says is not a valid UTF-8 encoding.
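A sketch of the well-formedness check RFC 3629 implies, under that reading of the note (the function name and structure are mine, not the spec's): it rejects stray continuation bytes, overlong forms, surrogate code points, and values above U+10FFFF, while structurally valid encodings of non-characters such as U+FFFE pass.

```javascript
// Hypothetical checker: does an array of octets form well-formed UTF-8 per RFC 3629?
function isWellFormedUtf8(octets) {
  var i = 0;
  while (i < octets.length) {
    var b = octets[i], len, cp, min;
    if (b < 0x80)                 { len = 1; cp = b;        min = 0x0;     }
    else if ((b & 0xE0) === 0xC0) { len = 2; cp = b & 0x1F; min = 0x80;    }
    else if ((b & 0xF0) === 0xE0) { len = 3; cp = b & 0x0F; min = 0x800;   }
    else if ((b & 0xF8) === 0xF0) { len = 4; cp = b & 0x07; min = 0x10000; }
    else return false;                              // stray continuation or invalid lead byte
    if (i + len > octets.length) return false;      // truncated sequence
    for (var j = 1; j < len; j++) {
      var c = octets[i + j];
      if ((c & 0xC0) !== 0x80) return false;        // not a continuation byte
      cp = (cp << 6) | (c & 0x3F);
    }
    if (cp < min) return false;                     // overlong encoding
    if (cp >= 0xD800 && cp <= 0xDFFF) return false; // surrogate code point
    if (cp > 0x10FFFF) return false;                // beyond the Unicode range
    i += len;
  }
  return true;
}
```

Note that nothing here excludes non-characters: `[0xEF, 0xBF, 0xBE]` (U+FFFE) passes, while overlong and surrogate sequences fail.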


I would prioritize #3 after #1 & #2.  If there is consistent behavior in all major browsers predating ES5, then that is the behavior that should be followed (and the spec updated if necessary). If there is disagreement among those legacy browsers, then I would simply follow the ES5 spec unless it does something that is contrary to RFC 3629.  If it does, then we need to think about whether we have a spec bug.
> 
> At least one person interested in Firefox's decoding implementation argues that not rejecting or replacing U+FFF{E,F} is a "potential security vulnerability" because those code points (particularly U+FFFE) might confuse code into interpreting a sequence of code points with the wrong endianness.  I find the argument unpersuasive and the potential harm too speculative (particularly as no other browser replaces or rejects U+FFF{E,F}).  But the point's been raised, and it's at least somewhat plausible, so I'd like to see it conclusively addressed.


It's just a transformation from one JS string to another. It can't do anything that hand-written JS code couldn't do.  How would this be any more of a problem than simply providing the code points that the bogus sequence would be incorrectly interpreted as?  That said, #3 above does say that the intent is to reject anything that is not valid UTF-8. 
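To illustrate that point with a hypothetical example: any code unit Decode could produce is already constructible by plain JS, so decoding U+FFFE grants an attacker no capability they didn't already have.

```javascript
// Hand-written JS can mint the very same "dangerous" code unit directly.
var byHand = String.fromCharCode(0xFFFE);        // "\uFFFE"

var viaDecode;
try {
  viaDecode = decodeURIComponent("%EF%BF%BE");   // same string in engines that accept it
} catch (e) {
  viaDecode = null;                              // engines that reject it under a stricter reading
}
// Wherever Decode succeeds, the two strings are indistinguishable.
```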


Allen


More information about the es-discuss mailing list