Code compilation alternative to Function()

Gustavs Tēbergs fixplzsecrets at gmail.com
Sun Jan 12 12:30:21 PST 2014


On 12 January 2014 07:48, Sean Silva <silvas at purdue.edu> wrote:
> Can you provide some measurements where concatenating a string and parsing it is a bottleneck?

I would not call it a bottleneck, because in most use cases the
concatenation only happens once at startup; that's why I used the word
"latency". If you want to deliver data or programs to a browser client
in a custom format, generating Javascript can add a delay of 100 -
500ms before your application can start operating.

I could put together a demo that takes this long in practice and
emulates an approach that authors might want to use in the future.

A simple, concrete case: take PEG.js and its included Javascript parser
and parse some Javascript library (pretending the library is written in
a different language), or take its CSS parser and a CSS framework.

I am proposing this because it occurred to me that adding this API
would remove much of the cost of something being foreign to the
Javascript/JSON/HTML system, in the same way we all want to remove the
cost of web applications being foreign to the operating system.

> LLVM-dev here. Other than this being an IR, I don't see any resemblance to LLVM at all.

> This is nothing like LLVM at all (not SSA).

It is in the same vicinity. I meant to point out that insertion should
not be done with ASTs or stack opcodes like I've seen in other places.

Note that "op" returns a token that represents the result of the
operation that can be used in subsequent operations. "get" and "set"
would load and store these value tokens from/to variables and
properties. I hope you see what I mean when I say this part is like
LLVM syntax (quite SSA).
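
To make this concrete, here is a rough sketch of how such calls might
chain together (the constructor and method names are purely
illustrative, not part of the proposal):

    // Hypothetical builder-style usage; all names are illustrative only.
    var b = new CodeBuilder();
    var x = b.get("x");             // load parameter "x", yielding a value token
    var y = b.get("y");
    var sum = b.op("+", x, y);      // "op" returns a token for the result
    b.set("result", sum);           // "set" stores a token into a variable
    var fn = b.done("x, y");        // finish: verify and compile, like Function()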

> I'm not convinced that decompressing megabytes of JavaScript and parsing it (all in native code) is going to be any slower than having to JIT a JavaScript code generation routine, build up some horribly cache-unfriendly "sea of GC-heap-allocated linked nodes in JS" AST structure, then convert that "sea of linked nodes in JS" to the JIT's internal program representation. Decompress+parse is already a highly optimized code path as it is on the critical path of virtually every page load.

That would depend on what the input to the generation routine is. You
are right that the SpiderMonkey AST is not more compact than the
textual representation.

With other AST types, one node can correspond to a complex Javascript
construct. For example, the PEG expression "a / b" executes as
something like:

    // Generated code for the choice "a / b"; "ok" is set by each rule call.
    var result, tmp = rule_a();
    if (ok) result = tmp;
    if (!ok) {
      tmp = rule_b();
      if (ok) result = tmp;
    }
    return result;
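
For comparison, the rule definition that produces this is tiny; shipped
as data (in a format invented here purely for illustration) it might be
no more than:

    // Illustrative compact encoding of the same rule, delivered as data
    // instead of expanded Javascript source.
    var rules = { start: ["choice", ["rule", "a"], ["rule", "b"]] };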

The output from a parser generator can quickly become large. This is
problematic if I want to deliver the parser definition to the client
and generate the parser clientside. I estimate that concatenating and
evaluating 100KB of code takes about 100ms in Chrome; double that on a
previous-generation CPU, and double it again in Firefox. How long would
that take on an ARM device?
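
A rough way to reproduce that kind of measurement (the generated rules
below are synthetic stand-ins, and the numbers will vary widely by
engine and hardware):

    // Rough timing sketch: build roughly 100KB of generated-looking source,
    // then evaluate it with Function(). The rule bodies are placeholders.
    var parts = [];
    for (var i = 0; i < 1500; i++) {
      parts.push("function rule_" + i + "(s) { return s.charCodeAt(" +
                 (i % 10) + ") === " + i + "; }");
    }
    var t0 = Date.now();
    var source = parts.join("\n");                            // the concatenation step
    var entry = new Function(source + "\nreturn rule_0;")();  // the evaluation step
    console.log("concat + eval:", Date.now() - t0, "ms");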

If we're talking about megabytes of Javascript, I assume we're talking
about Asm.js. Then the input would be something specialized for
representing Asm.js programs, probably an array of opcodes. As I
mentioned, a common criticism of Asm.js is that it needs multiple
characters to represent fundamental operations like "(x+y|0)" and
"[x>>2]". Our custom opcode format could represent these with a single
byte - the native parser may be fast, but can it parse five times as
many characters spanning multiple AST nodes more efficiently than our
opcode decoder?
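
As a toy illustration of what such a decoder could look like (the
opcode values and the builder operation names are invented for this
example):

    // Toy decoder: one byte per operation, emitting into a hypothetical
    // builder "b". Opcode numbering and operation names are invented.
    var OP_ADD_INT = 0x01;      // stands for "(x + y | 0)"
    var OP_LOAD_HEAP32 = 0x02;  // stands for "HEAP32[x >> 2]"
    function decode(bytes, b) {
      var stack = [];           // value tokens produced so far
      for (var i = 0; i < bytes.length; i++) {
        switch (bytes[i]) {
          case OP_ADD_INT:
            var y = stack.pop(), x = stack.pop();
            stack.push(b.op("addInt", x, y));
            break;
          case OP_LOAD_HEAP32:
            stack.push(b.op("loadHeap32", stack.pop()));
            break;
        }
      }
      return stack.pop();       // token for the final result
    }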

Also, if I'm not mistaken, both SM and V8 build GC-allocated AST trees
while parsing, only to later copy the AST to a flat IR.

https://github.com/ricardoquesada/Spidermonkey/blob/master/js/src/frontend/Parser.cpp
https://github.com/v8/v8/blob/master/src/parser.cc

That's why the example API I suggested aims to provide an abstract
write-only interface to a flat SSA register IR. Our Asm.js decoder
could skip tree node allocation altogether (ideally, maybe).

> Unless the program representation used internally by the JS engines is exactly the same as the standardized representation (it isn't, and I doubt JS engines would let that implementation detail be standardized), then they still need to do a format conversion, but this time walking a linked sea of GC-heap allocated JS objects instead of doing a linear scan of the program text.

Indeed, I am unsure about the viability of the API fitting all
engines - whether those that have an SSA IR could write it directly
through these API calls, with only a verification pass when ".done()"
is called, or whether the input should be buffered and copied to the
real IR later.

However, the intention of my proposed API is exactly that the program
can be written in one pass with no intermediate linked parts.
The intermediate representation does not need to be standardized,
because the implementation of the API calls would convert the
parameters into whatever suits the engine's IR - though the IR does
need to be broadly similar in shape.

If anyone knows of engines that do not fit this flat IR model, I'd
like to hear about it.
Then again, because this API is easy to polyfill (depends only on
Function()), any implementation could start with a good-enough version
and make gradual performance improvements.
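
For instance, a minimal polyfill sketch along these lines, using the
same illustrative method names as the builder example above (a real
version would need operator whitelisting and proper validation):

    // Minimal polyfill sketch built on Function(); names are illustrative.
    function CodeBuilder() { this.lines = []; this.count = 0; }
    CodeBuilder.prototype.get = function (name) { return name; };
    CodeBuilder.prototype.op = function (operator, a, b) {
      var t = "t" + this.count++;
      this.lines.push("var " + t + " = (" + a + " " + operator + " " + b + ");");
      return t;                          // the value token is just a temp name
    };
    CodeBuilder.prototype.set = function (name, token) {
      this.lines.push("var " + name + " = " + token + ";");
    };
    CodeBuilder.prototype.done = function (params) {
      return new Function(params || "", this.lines.join("\n"));
    };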

