Should Decode accept U+FFFE or U+FFFF (and other Unicode non-characters)?
jwalden+es at MIT.EDU
Thu Oct 8 10:52:12 PDT 2009
I was looking at how SpiderMonkey decodes URI-encoded strings, specifically to update it to reject overlong UTF-8 sequences per ES5 (a breaking change from ES3, but one that should generally be agreed to have been necessary, not least because existing implementations were inconsistently loose and strict about it). After SpiderMonkey made that change I noticed some non-standard extra behavior: U+FFFE and U+FFFF decode to the replacement character, U+FFFD. ES5 doesn't say to do this -- the decode table categorizes only [0xD800, 0xDFFF] as invalid (when not part of a surrogate pair) and resulting in a URIError. (The prose in Decode says "If Octets does not contain a valid UTF-8 encoding of a Unicode code point", which, if you squinted, might be read as saying that the "UTF-8 encoding" of U+FFFE isn't valid and therefore a URIError must be thrown.)
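For concreteness, here is a quick sketch (not from the original post) of the two cases ES5's decode table does specify, runnable in any reasonably spec-conforming engine:

```javascript
// Overlong UTF-8: %C0%80 is a two-byte encoding of U+0000, which ES5's
// Decode rejects (ES3-era implementations varied here).
try {
  decodeURIComponent('%C0%80');
  console.log('overlong sequence accepted');
} catch (e) {
  console.log(e.name); // URIError in a conforming engine
}

// Unpaired surrogate: %ED%A0%80 encodes U+D800, which the decode table
// categorizes as invalid outside a surrogate pair.
try {
  decodeURIComponent('%ED%A0%80');
  console.log('lone surrogate accepted');
} catch (e) {
  console.log(e.name); // URIError in a conforming engine
}
```

Neither case involves U+FFFE or U+FFFF, which is the gap in question.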
U+FFFF is a Unicode noncharacter, permanently reserved and never assigned to an actual character, and U+FFFE conceivably could confuse Unicode decoders into decoding with the wrong endianness under the right circumstances -- theoretically, at least. Might it make sense to throw a URIError upon encountering them (and perhaps also the noncharacters [U+FDD0, U+FDEF], and maybe even the code points congruent to FFFE or FFFF mod 0x10000 as well)?
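As an illustration of the current (ES5-conforming) behavior being asked about -- again, a hypothetical snippet, not part of the original message -- a strict reading of the decode table means noncharacters pass through unmolested:

```javascript
// %EF%BF%BE is the UTF-8 encoding of U+FFFE. The ES5 decode table does not
// mark it invalid, so a conforming engine neither throws a URIError nor
// substitutes U+FFFD; it decodes to U+FFFE as-is.
var s = decodeURIComponent('%EF%BF%BE');
console.log(s.charCodeAt(0).toString(16)); // "fffe" in a conforming engine
```

The proposal above would change this to a URIError.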