[caplet] Am I paranoid enough?

David-Sarah Hopwood david.hopwood at industrial-designers.co.uk
Tue Feb 17 03:13:10 PST 2009

Mike Samuel wrote:
> 2009/2/16 David-Sarah Hopwood <david.hopwood at industrial-designers.co.uk>
>> Suppose that S is a Unicode string in which each character matches
>> ValidChar below, not containing the subsequences "<!", "</" or "]]>", and
>> not containing ("&" followed by a character not matching AmpFollower).
>> S encodes a syntactically correct ES3 or ES3.1 source text chosen by
>> an attacker.
>> ValidChar :: one of
>> '\u0009' '\u000A' '\u000D' // TAB, LF, CR
>> [\u0020-\u007E]
>> [\u00A0-\u00AC]
>> [\u00AE-\u05FF]
>> [\u0604-\u06DC]
>> [\u06DE-\u070E]
>> [\u0710-\u17B3]
>> [\u17B6-\u200A]
>> [\u2010-\u2027]
>> [\u202F-\u205F]
>> [\u2070-\uD7FF]
> So no surrogates?

Correct. They're not characters (or even "noncharacters").

>> [\uE000-\uFDCF]
>> [\uFDF0-\uFEFE]
>> [\uFF00-\uFFEF]
> Why include FFEF?

It's unassigned, and there's no particular reason to exclude it.
(\uFFF0-\uFFF8 are also unassigned, but that's an area reserved
for "special" characters.)

>> AmpFollower :: one of
>> '=' '(' '+' '-' '!' '~' '"' '/' [0-9]
>> '\u0027' '\u005C' '\u0020' '\u0009' '\u000A' \u000D'
>> // single quote, backslash, space, TAB, LF, CR
>> (ValidChar excludes format control characters, and some other
>> characters known to be mishandled by browsers. AmpFollower is
>> intended to exclude characters that can start an entity reference.)
>> S is inserted between "<script>" and "</script>" in a place where a
>> <script> tag is allowed in an otherwise valid HTML document, or
>> between "<script><![CDATA[" and "]]></script>" in a place where a
>> <script> tag is allowed in an otherwise valid XHTML document.
>> The HTML or XHTML document starts with a correct <!DOCTYPE or
>> <?xml declaration respectively, and is encoded as well-formed
>> UTF-8.
>> Are these restrictions sufficient to ensure that the embedded
>> script is interpreted as it would have been if referenced from
>> an external file, foiling any attempts of browsers to collude
>> with the attacker in misparsing it?
> You may still be subject to encoding attacks.  I'm sure there are
> valid scripts that look like UTF-7, so if the script appears in the
> first 1024B, you might need to make sure it's preceded by a <meta>
> element specifying an encoding, and/or use the XML prologue form that
> specifies an encoding.

Right; I covered that in a follow-up. Is including a UTF-8 BOM at the
start sufficient for all browsers (that is, are there any browsers
in which a <meta> tag or other content sniffing can override an
explicit initial UTF-8 BOM, in either HTML or XHTML)?

HTML5 section seems to indicate that "if the transport layer
specifies an encoding" (i.e. presumably the charset specified in
a Content-Type header), then that should override a BOM. That's
irritating, because it means that you have to assume that the server
gets the Content-Type right, *as well as* including a BOM for the
browsers in which Content-Type doesn't override sniffing
(Internet Explorer, at least), and for the case where the document
is read from a local file.

David-Sarah Hopwood ⚥

More information about the Es-discuss mailing list