quasi-literal strawman

Mike Samuel mikesamuel at gmail.com
Thu Dec 17 11:11:04 PST 2009


2009/12/17 Andy Chu <andy at chubot.org>:
>> That a lot of formatting can be done in-library is a great point.
>>
>> Hopefully, by providing a desugaring that can easily be back-ported to
>> older code by things like rewriting minifiers, and implementing most
>> of the feature in library code that will run on older versions of JS,
>> we can allow people to use and experiment with
>> formatting/interpolation schemes even in code that needs to run on
>> legacy interpreters.
>
> Right, I like the idea of being able to run in ES3.1/5 implementations
> (at the cost of speed).
>
>> The quasi-literal proposal specifies only a desugaring, and the safe
>> interpolation scheme that I want to do is not part of this proposal
>> and would be done in a library.  I hope to convince W3C that this
>> library is something worth standardizing on and that innerHTML,
>> document.write, cssText, and other language encoding entry points into
>> the DOM internals should be aware of it.
>
> So then my question is why it needs to specify a desugaring.  Why is a
> quasi-literal not a string?

I still don't understand the question.

Why `foo$bar` and not "foo$bar"?  Well, the latter doesn't do anything
useful with the expression (bar).


> What is typeof html`foo`?  Is it "function"?

html hasn't been defined.
In http://google-caja.googlecode.com/svn/changes/mikesamuel/string-interpolation-29-Jan-2008/trunk/src/js/com/google/caja/interp/index.html
the result of the interpolation is an Interpolation object instance.
So its typeof would be "object".
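
As a minimal sketch of that point (not the code from the linked doc):
the Interpolation constructor, its field names, and the assumption
that html`foo$bar` desugars to something like html(['foo', ''], bar)
are all hypothetical and only for illustration.  The strawman's actual
desugaring may differ.

```javascript
// Hypothetical result type: it records the pieces and does no
// escaping yet.
function Interpolation(literalPortions, values) {
  this.literalPortions = literalPortions;
  this.values = values;
}

// Hypothetical quasi function.  Assumes the literal portions arrive as
// an array, followed by the substitution values.
function html(literalPortions) {
  var values = Array.prototype.slice.call(arguments, 1);
  return new Interpolation(literalPortions, values);
}

var bar = '<script>';
var result = html(['foo', ''], bar);  // stands in for html`foo$bar`

console.log(typeof html);    // "function"
console.log(typeof result);  // "object"
```

The function html itself is a function; only the value it returns is
an object.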


> Can you elaborate on the relationship to the DOM?  I didn't see it in
> the doc.  So you're saying that innerHTML can be set to a
> quasi-literal now, in addition to a string?  I don't see the situation
> where you can't just expand the quasi-literal to a string and then set
> innerHTML.

Please see the arguments in
http://google-caja.googlecode.com/svn/changes/mikesamuel/string-interpolation-29-Jan-2008/trunk/src/js/com/google/caja/interp/index.html
as to why.  The short answer is that you do not know at the time the
interpolation happens what context it will be used in -- you don't
know whether it is going to be used as HTML, CSS, or SQL.

a_STYLE_element.innerHTML = ...;
would behave very differently from
a_DIV_element.innerHTML = ...;
which would behave still differently from
a_TEXTAREA_element.innerHTML = ...;
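
A rough sketch of why the escaping has to wait until the context is
known.  Everything here is an assumption made for illustration:
renderFor, the ESCAPERS table, and the two deliberately simplified
escapers are hypothetical, and a real implementation would pick an
escaper per substitution by looking at the literal portions, not one
per element.

```javascript
// Simplified HTML text escaper; illustrative, not production-grade.
function escapeHtmlText(s) {
  return String(s).replace(/&/g, '&amp;').replace(/</g, '&lt;')
      .replace(/>/g, '&gt;').replace(/"/g, '&quot;')
      .replace(/'/g, '&#39;');
}

// Simplified CSS escaper: hex-escape everything but ASCII
// alphanumerics.  The trailing space terminates each escape.
function escapeCssString(s) {
  return String(s).replace(/[^a-zA-Z0-9]/gu, function (ch) {
    return '\\' + ch.codePointAt(0).toString(16) + ' ';
  });
}

// The interpolation only records its pieces; nothing is escaped yet.
function Interpolation(literalPortions, values) {
  this.literalPortions = literalPortions;
  this.values = values;
}

// Hypothetical mapping from the element being written to an escaper.
var ESCAPERS = {
  DIV: escapeHtmlText,
  TEXTAREA: escapeHtmlText,
  STYLE: escapeCssString
};

// Produce the string that would be assigned to innerHTML of an element
// with the given tag name.
function renderFor(tagName, interp) {
  var escaper = ESCAPERS[tagName];
  if (!escaper) { throw new Error('no escaper for ' + tagName); }
  var out = [interp.literalPortions[0]];
  for (var i = 0; i < interp.values.length; i++) {
    out.push(escaper(interp.values[i]), interp.literalPortions[i + 1]);
  }
  return out.join('');
}

var interp = new Interpolation(['[', ']'], ['</style><b>']);
var asDiv = renderFor('DIV', interp);
var asStyle = renderFor('STYLE', interp);
console.log(asDiv);
console.log(asStyle);
```

The same interpolation yields different strings for the DIV and the
STYLE element, which is not possible if it has already been flattened
to a string.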

> I think some more example applications in the doc would help.  Right
> now I don't see much difference between quasi-literals and a template
> language as a library, but I may be missing something.

The details of HTML escaping are largely orthogonal to this doc.  That
string interpolation doc I linked to above could probably use more
examples.  It does address competing use cases for template languages
and string interpolation a bit though.


>> This scheme could be built on top of quasis with a minor syntactic change.
>>
>> jsont`{$name:html}: <a
>> href="{$url|html-attr-value}">{$anchor|html}</a>{default=html}`
>>
>> function jsont(var_args) {
>>  var literalPortions = Array.prototype.slice.call(arguments, 0);
>>  var escapingModes = [];
>
> Interesting, this API is not that unlike JSON Template's API.  I'm not
> sure I see a big difference in functionality or safety either way.
>
> I would argue with this statement from your doc: "First, full blown
> templating languages, with a few exceptions, do next to nothing to
> solve escaping problems."
>
> This is probably true of PHP and JSP, but more modern template
> languages have "formatters/filters" built in.  Django,
> google-ctemplate, and JSON Template have this.  When combined with an
> option for a default filter, this "solves" escaping AFAICT.  Do
> quasi-literals do it better?  You are making an early/late binding
> argument, but I don't see when this becomes necessary.

PHP and JSP were the gold standard when I built it, and Django and
others have addressed that to some degree.
Do you know of any statistics on how much PHP code is running versus
Django code?

PHP has perfectly good escaping functions.  People just don't use them
consistently.  Django et al are a bit nicer in that they make it more
convenient than PHP does to do the right thing.  But I'm still
skeptical of manually chosen escaping conventions because I think
they're error-prone and impose a large maintenance burden -- if I
have "<b>{foo}</b>" and foo changes from being a plain text string to
a bit of markup generated elsewhere in the code, the system will now
overescape.  Overescaping is much more quickly visible during testing
than underescaping, so these manual schemes for specifying escaping
conventions tend to grow holes over time.  Developers will remove
escaping because they were trying to fix a bug that was actually the
result of something else.  And manual escaping fundamentally does the
wrong thing with heterogeneous inputs: what is the appropriate
convention in (foo
= someCondition ? wellFormedHtmlFromTrustedSource :
plainTextFromUntrustedSource, "<b>{foo}</b>")?  Manual escaping
syntactic sugar does not solve this problem, and late binding does.
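
A minimal sketch of the heterogeneous case, assuming a hypothetical
marker type for trusted markup.  SafeHtml, htmlFor, and bold are names
made up for this example, and the escaper is simplified.

```javascript
// Hypothetical wrapper marking markup that came from a trusted source.
function SafeHtml(markup) {
  this.markup = String(markup);
}

// Simplified HTML text escaper; illustrative, not production-grade.
function escapeHtmlText(s) {
  return String(s).replace(/&/g, '&amp;').replace(/</g, '&lt;')
      .replace(/>/g, '&gt;').replace(/"/g, '&quot;')
      .replace(/'/g, '&#39;');
}

// Late binding: the decision is made per value when the output is
// produced, not written into the template by hand.
function htmlFor(value) {
  return value instanceof SafeHtml ? value.markup : escapeHtmlText(value);
}

// Stands in for "<b>{foo}</b>".
function bold(foo) {
  return '<b>' + htmlFor(foo) + '</b>';
}

var wellFormedHtmlFromTrustedSource = new SafeHtml('<i>ok</i>');
var plainTextFromUntrustedSource = '<script>alert(1)</script>';

console.log(bold(wellFormedHtmlFromTrustedSource));
// <b><i>ok</i></b>
console.log(bold(plainTextFromUntrustedSource));
// <b>&lt;script&gt;alert(1)&lt;/script&gt;</b>
```

Neither branch of the conditional is overescaped or underescaped, and
the template itself names no escaping convention.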

See the arguments around:

    Any API that requires the developer to know as much as the implementor
    does not solve any problems. Since choosing the appropriate escaping
    function requires the developer to be a language expert, simply providing
    libraries of escaping functions will not address injection as a class of
    vulnerabilities.


> If it is because variables come from the calling scope rather than the
> scope receiving the quasi-literal, then let me propose just using
> something like locals() in Python.
>
> def foo():
>  a = 1
>  s = expandTemplate("{a}", locals())
>
> Now expandTemplate receives the argument {"a": 1} and can return the string "1".

What is locals()?  Why does expandTemplate need access to all locals
to do its job instead of just the specified ones?
Does this suffer from the "formatting string from untrusted source"
problem that Python suffers from, and the "substitution value from
untrusted source" problem?



>>> I like the idea of "enabling DSLs", but I feel like this proposal is a
>>> DSL itself, rather than enabling them, since it has a fairly
>>> particular syntax, and you have defined the parse tree very
>>> specifically.
>>
>> I'm not sure I follow.  Are you referring to the `...` syntax with
>> embedded $foo and ${expression} chunks?
>
> Yes, I don't see why this should be hard-coded in the language.  It's
> a third set of escaping rules to learn (strings and regexes being the
> first 2, and actually regexes have a fourth set -- inside character
> classes [^$] and outside).
>
> I also think the syntax is complicated ( \${}` are special, as opposed
> to strings where ""\ are special, and regexes where / is special).  I
> wouldn't be at all surprised if it needs to grow based on some new use
> cases.
>
> For substitution, let me plug the JSON Template scheme: "{foo}" is a
> substitution.  If the string contains {}, then choose [] as the
> metacharacters: Template("[foo]", meta="[]").  So the default
> meta="{}".
>
> That's it.  IMHO this is the simplest possible scheme that covers all
> applications.  Any character you pick will be suboptimal for some DSL
> -- in particular quasi-literals themselves.  How do you write a
> quasi-literal for quasi-literals?  My guess is it will look pretty
> nasty.
>
> I don't see why the metadata needs to be inside the quasi-literal, as
> opposed to just being another argument to a function that takes a
> quasi-literal.
>
>>> Another Python analogy is that they chose not to embed regexes in the
>>> language, unlike JavaScript/Perl/Ruby.  Instead there is a very
>>> minimal syntactic accommodation -- raw strings which don't have
>>> backslash escaping.  The Go language takes this same approach with
>>> backticks I believe (e.g. `\s+` and not "\\s+").
>>
>> This proposal gets you raw strings easily :)
>>
>> new RegExp(r`\s+foo\s+`, 'i')
>>
>> function r(string) {
>>  if (arguments.length != 1) { throw new Error(); }
>>  return function () { return string; };  // Trivially inlinable
>> }
>
> I view the /\s+/ syntax for regexes as superfluous and overly
> specific, so if this mechanism can somehow generalize that and retire
> the old syntax, that's a plus.
>
>>> I do think JavaScript really needs better string interpolation than
>>> "foo " + var + " bar", which is unfortunately a common idiom.  I think
>>> that perhaps all that would be necessary is to have a .format() method
>>> on strings, like Python.  Python switched from the operator % to a
>>> simple method.
>>
>> Yep.  Except that Python is planning on supporting the % operator for
>> some time to come, right?  One other nice side-effect of providing a
>> generic platform for DSLs and doing formatting/interpolation in a
>> library is that applications can have as many of these schemes side by
>> side as they like, and when one becomes obsolete, you only have to
>> deprecate library code instead of language syntax or core object
>> methods.
>
> I totally agree with that, but simply using a library gets even more
> of those benefits.  So far, the 2 things I see in quasi-literals that
> you can't do with a library are:
>
> 1) The syntax -- however as mentioned I don't find the syntax to be a
> benefit.  Is html`foo$bar` better than html("foo{bar}") ?  I
> personally like how ES5 introduced no new syntax.
>
> 2) The locals() thing.  This would be a much smaller addition to the language.
>
> Am I missing something? (could be)
>
>> One oft ignored criterion for judging string formatting schemes is how
>> resistant they are to quoting confusion.
>> The Python 3 string formatting is really bad in this respect.  Its
>> security considerations section does a good job of pointing out that
>> formatting strings from untrusted sources are a problem and should not
>> be used, but does not mention the other side of the problem --
>> substitution values from untrusted sources.
>
> 100% agree.  But, introducing another syntax leads to its own kind of
> "quoting confusion".  As mentioned, how does a quasi-literal for a
> quasi-literal look, or a regex for a quasi-literal, or a quasi-literal
> for a regex?
>
> How hard is it to write a program to extract all quasi-literals from JS
> source, and analyze them statically?  New syntax makes this kind of
> thing more complicated.  As it is, JS is not too hard to parse.
>
>> Reference 2 in the proposal argues that requiring developers to
>> specify an escaping scheme, or do it manually, is shifting a large and
>> unnecessary burden onto them, and when they make errors, those errors
>> often result in vulnerabilities.
>
> So this is "auto-escaping", right?  In my example, this is saying that
> you automatically detect based on the literal portions whether you
> need to use the "html" escape or "html-attr-value" escape.  Most
> template languages don't do this.  But there is no reason that they
> can't.  I don't think the JS language is the barrier to doing this
> now.
>
> thanks,
> Andy
>

