Behavior of Decode with overlong utf-8

James Graham jgraham at
Wed Feb 18 03:41:04 PST 2009

Unless I am misreading the specification (quite likely), the Decode 
function does not have any logic to protect against decoding overlong, 
but otherwise valid, UTF-8 sequences. Arguably this fails in step 29 
since RFC 3629[1] states:

"It is important to note that the rows of the table are mutually 
exclusive, i.e., there is only one valid way to encode a given 
character. [...] Implementations of the decoding algorithm above MUST 
protect against decoding invalid sequences"

but it is not clear how to handle a failure here. Existing 
implementations seem to disagree on this point. My limited testing showed:

Spidermonkey: inserts a U+FFFD replacement character
Futhark: leaves the original percent-encoded characters
Squirrelfish: throws URIError
V8: Decodes the overlong sequence
IE/JScript: Decodes the overlong sequence
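
For anyone wanting to reproduce the comparison above, a minimal probe along 
the following lines should be enough. The function name probeOverlong and the 
test input "%C0%AF" (the two-byte overlong form of "/", U+002F) are my own 
choices for illustration, not anything taken from the specification, and the 
result will of course depend on the engine it is run in.

```javascript
// Classify how the host's decodeURIComponent treats a percent-encoded input.
// probeOverlong is a hypothetical helper name, not part of any specification.
function probeOverlong(input) {
  try {
    var out = decodeURIComponent(input);
    if (out === input) {
      return "left percent-encoded";
    }
    if (out.indexOf("\uFFFD") !== -1) {
      return "replacement character";
    }
    return "decoded to " + JSON.stringify(out);
  } catch (e) {
    if (e instanceof URIError) {
      return "URIError";
    }
    throw e; // anything else is unexpected, so let it propagate
  }
}

// "%C0%AF" is an overlong two-byte encoding of "/" (assumed test input).
var overlongResult = probeOverlong("%C0%AF");
```

Printing overlongResult in each engine's shell gives one of the four outcomes 
listed above.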

Since the usual behavior for invalid percent-encoded sequences is to 
throw URIError, I suggest making that happen in this case too.
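
As a rough sketch of what such a check could look like (this is not proposed 
specification text), the decoder can compare the decoded code point against 
the smallest value that legitimately needs that many octets, per the table in 
RFC 3629. The helper name decodeStrict and its array-of-octets input are 
assumptions made for the example. It handles only the overlong case and does 
not reject surrogates or values above U+10FFFF.

```javascript
// Smallest code point that needs 1, 2, 3 or 4 octets (RFC 3629 table).
var MIN_FOR_LENGTH = [0, 0x0, 0x80, 0x800, 0x10000];
// Lead-octet marker bits and payload bits for sequence lengths 1..4.
var LEAD_MARK_MASK = [0, 0x80, 0xE0, 0xF0, 0xF8];
var LEAD_MARK      = [0, 0x00, 0xC0, 0xE0, 0xF0];
var LEAD_PAYLOAD   = [0, 0x7F, 0x1F, 0x0F, 0x07];

// Hypothetical helper: `octets` is an array holding one UTF-8 sequence.
// Returns the code point, or throws URIError if the sequence is malformed
// or overlong.
function decodeStrict(octets) {
  var n = octets.length;
  if (n < 1 || n > 4) {
    throw new URIError("bad sequence length");
  }
  if ((octets[0] & LEAD_MARK_MASK[n]) !== LEAD_MARK[n]) {
    throw new URIError("lead octet does not match sequence length");
  }
  var cp = octets[0] & LEAD_PAYLOAD[n];
  for (var i = 1; i < n; i++) {
    if ((octets[i] & 0xC0) !== 0x80) {
      throw new URIError("bad continuation octet");
    }
    cp = cp * 64 + (octets[i] & 0x3F); // shift in six payload bits
  }
  // The overlong check: the value must actually need n octets.
  if (cp < MIN_FOR_LENGTH[n]) {
    throw new URIError("overlong UTF-8 sequence");
  }
  return cp;
}
```

With this check, the octets C0 AF are rejected because they decode to U+002F, 
which fits in a single octet, while C3 A9 (U+00E9) is accepted.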


