JavaScript parser API

Claus Reinke claus.reinke at talk21.com
Tue Jul 5 10:55:04 PDT 2011


Hi Dave, and other interested parties,

it would be helpful to summarize the issues on a page linked from
the AST API strawman - given the positive discussions on this list, I
thought the idea was implicitly accepted last year, modulo details,
so I was surprised not to see a refined strawman promoted. Would
it help if interested parties started discussing/refining the details,
or are the different users so far apart that agreement is out of reach?

My own impressions, from adding much of a SpiderMonkey-based
AST to my parser:

    - it does not support generic traversals, so it definitely needs a
        pre-implemented traversal, sorting out each type of Node
        (Array-based ASTs, like the es-lab version, make this slightly
        easier - Array elements are ordered, unlike Object properties);
        see the traversal sketch after this list;

        at that stage, simple applications (such as tag generation)
        might be better off working with hooks into the parser rather
        than hooks into an AST traversal; also, there is the risk that
        one pre-implemented traversal might not cover all use cases,
        in which case the boilerplate tax would have to be paid again;

    - it is slightly easier to manipulate than an Array-based AST, but
        lack of pattern matching fall-through (alternative patterns for
        destructuring) still hurts, and the selectors are lengthy, which
        hampers visualization and construction; (this assumes that
        fp-style AST processing is preferred over oo-style processing)

    - it is biased towards evaluation, which is a hindrance for other
        uses (such as faithful unparsing, for program transformations);

        this can be seen clearly in Literals, which are evaluated (why
        not evaluate Object, Array, Function Literals as well? eval should
        be part of AST processing, not of AST construction), but it also
        shows in other constructs (comments are not stored at all,
        and if commas/semicolons are not stored, how does one
        know where they were located? programmers tend to be
        picky about their personal or project-wide style guides);

    - there are some minor oddities, from spelling differences
        from the spec (Label(l)ed) to structuring decisions (why separate
        UpdateExpression and LogicalExpression, when everything
        else is in UnaryExpression and BinaryExpression?);

        btw, why alternate/consequent instead of then/else? and
        shouldn't it really be consequent->then and alternate->else,
        rather than the other way round (as the optional null for
        consequent suggests)?
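
To illustrate the first point, here is a minimal sketch of the kind of
generic traversal one ends up pre-implementing (the function and
parameter names are mine, not part of the API; the string-valued
"type" check also filters out non-Node objects such as source
locations):

    function traverse(node, callback) {
      // only recurse into things that look like Parser-API Nodes
      if (!node || typeof node.type !== "string") return;
      callback(node);
      for (var key in node) {
        var child = node[key];
        if (Array.isArray(child)) {
          for (var i = 0; i < child.length; i++) traverse(child[i], callback);
        } else if (child && typeof child === "object") {
          traverse(child, callback);
        }
      }
    }

    // e.g. tag generation: collect the names of function declarations
    // traverse(ast, function (node) {
    //   if (node.type === "FunctionDeclaration") tags.push(;
    // });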

My main issue is unparsing support for program transformations
(though IDEs will similarly need more info, for comment extraction,
syntax highlighting, and syntax-based operations).

What I did for now was to add a field to each Node, in which I
store an unprocessed Array of the sub-ASTs, including tokens.
Essentially, the extended AST Nodes provide both abstract info
for analysis and evaluation and a structured view of the token
stream belonging to each Node, for lower-level needs.
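
As a concrete sketch (the "srcElems" field name is my own, not
part of the SpiderMonkey API), such an extended Node might look
like this:

    var left  = { type: "Identifier", name: "a" };
    var right = { type: "Identifier", name: "b" };
    var node  = {
      // abstract view, for analysis and evaluation:
      type: "BinaryExpression",
      operator: "+",
      left: left,
      right: right,
      // structured view of the token stream belonging to this Node:
      srcElems: [ left, { token: "+" }, right ]
    };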

Whitespace/comments are stored separately, indexed by the
start position of the following token (this is going to work better
for comment-before-token than for comment-after-token, but it
is a start, for unparsing or comment-extraction tools).
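
A sketch of that separate trivia store, under the same assumptions
(all names are mine):

    var trivia = {};  // start offset of following token -> preceding trivia
    function recordTrivia(text, followingTokenStart) {
      (trivia[followingTokenStart] = trivia[followingTokenStart] || [])
        .push(text);
    }
    // recordTrivia("// comment\n", 120);
    //   -> replayed when unparsing reaches the token at offset 120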

This allows for a generic traversal of the Array-based unprocessed
AST fragments, for unparsing, but I still have to rearrange things
so that I can actually store the information I need (can't add info
to null as an AST value) and distinguish meta-info ("computed"
and "prefix" properties) from sub-ASTs.

Overall, the impression is that this AST was designed by someone
resigned to having to write Node-type-specific traversal
code for each purpose, with a limited number of purposes planned
(such as evaluation). This could be a burden for other uses of such
ASTs (boilerplate tax).

I hope these notes help - I'd really like to see a standard JS
parser API implemented across engines. For language
experimentation, we'd still need separate tweakable parsers,
but access to the efficient engine parsers for current JS would
give tool development a boost.

> But there are also tough questions about what the parser
> should do with engine-specific language extensions.

Actually, that starts before the AST: I'd like to see feature-based
language versioning, instead of the current monolithic version
numbering - take generators as an example feature:

Perhaps JS1.7 ("javascript;version=1.7") happens to be the first
JS version to support "yield", and is backwards compatible with
JS1.5, which might happen to match ES3; and JS1.8.5, which
happens to match ES5, might be backwards compatible with
JS1.7. But it is unlikely that the JSx which happens to match ES6
will be backwards compatible with JS1.7 (while ES5-breaking
changes will be limited, replacing experimental JS1.x features
with standardized variants is another matter).

Whereas, if I were able to specify "use yield", and be similarly
selective about other language features, then any of JS1.7,
JS1.8.5, and ES6 engines might be able to do the job, depending
on what other language features my code depends on. Also,
other engines might want to implement some features (like
"yield") selectively, without aiming to support all of JS1.7, and
long before being able to support all of ES6.
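
Such a feature pragma could look like ES5's "use strict" directive
(the "use yield" directive itself is hypothetical, of course):

    // hypothetical feature directive, by analogy with "use strict";
    // under JS1.7 rules, a function containing yield is a generator:
    "use yield";
    function fib() {
      var a = 0, b = 1, tmp;
      while (true) {
        yield a;
        tmp = a + b; a = b; b = tmp;
      }
    }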

> I agree about the issue of multiple parsers. The reason I
> was able to do the SpiderMonkey library fairly easily was
> that I simply reflect exactly the parser that exists. But to
> have a standards-compliant parser, we'd probably have
> to write a separate parser. That's definitely a tall order.

It should not be, provided one distinguishes between
standards-compliant and production use. If the ES grammar
is LR(1), it should really be specified in a parser tool format,
both for verification and to generate standards-compliant
tools to compare against. Depending on how efficient a
JavaScript Bison implementation (such as Jison) is, this might
even lead to usable parser performance.

There may be problems in finding a tool that generates all
the information needed for a useful AST (source locations,
comments, scope info, ...), but we do not need to solve every
issue immediately to make progress, right? And if the ES
committee were to ask ES parser generator implementors
whether their tools could be extended to serve an AST spec,
response might be favourable.

It would be nice if the spec parser were generated in JavaScript,
but any tool-usable standard grammar would be useful - once
the grammar can be processed by a freely available tool, it can
be translated to similar formats, some of which have JavaScript
implementations (e.g. Jison, ANTLR).
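
Jison, for instance, accepts grammars as plain JavaScript objects
as well as Bison-style files; a toy sketch, just to show the format
(not a fragment of the actual ES grammar):

    var Parser = require("jison").Parser;
    var parser = new Parser({
      lex: { rules: [ ["\\s+",   "/* skip whitespace */"],
                      ["[0-9]+", "return 'NUMBER';"],
                      ["\\+",    "return '+';"] ] },
      operators: [ ["left", "+"] ],
      bnf: { e: [ ["e + e",  "$$ = $1 + $3;"],
                  ["NUMBER", "$$ = Number(yytext);"] ] }
    });
    // parser.parse("1 + 2");  // -> 3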

Having played a little with the ANTLRWorks environment, it
looks promising, is easy to install (just a .jar), has user-contributed
ES grammars, and can spot some ambiguities easily (though
I don't think its check is complete, and the ES grammar is too
complex to make naïve parse-tree visualization helpful). If other
tools have better ES grammar development support, I'd like to
hear about them.

Without a standard spec-conformant tool-readable grammar,
such tools remain of limited use. With a tool-readable grammar,
adding AST generation might turn out to be an afternoon's work
(followed by years of testing/debugging;-).

Claus