ES RegExp parser

Jason Orendorff jason.orendorff at gmail.com
Mon Mar 20 15:36:16 UTC 2017


The second approach, hands down.

1. With the first approach, you're setting up a situation where it's very
easy to write buggy analysis code: if you forget to check `re.quantifier`
anywhere, your code will run, but you have a bug. Much easier to only have
to check `re.type`.

2. If you have a regexp `re` and you want to programmatically build a
regexp that matches one or more repetitions of it, it's much easier to
write `{type: '+', expression: re}` than to have to examine `re.quantifier`
and (if it's already present) figure out how to modify it.

3. With the second approach, you don't have to represent a `(?:)` group in
the AST at all (rather like how Esprima drops redundant parentheses). With
the first, I think you have to, because it's possible for a single regexp
"node" to have multiple quantifiers: `/(?:\d{4,6})?/`
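
A minimal sketch of these three points, assuming the `Repetition` format
quoted below. The helper names (`char`, `repeat`, `oneOrMore`, `minLength`)
and the `{type: 'Range', from, to}` quantifier shape are made up for
illustration; they are not regexp-tree's API.

```javascript
// Hypothetical helpers; node shapes follow the `Repetition` format quoted below.
// The 'Range' quantifier shape ({type: 'Range', from, to}) is an assumption.
const char = (value) => ({ type: 'Char', value });
const repeat = (expression, quantifier) => ({
  type: 'Repetition',
  expression,
  quantifier: { greedy: true, ...quantifier },
});

// Point 2: wrapping never needs to inspect an existing quantifier on `re`.
const oneOrMore = (re) => repeat(re, { type: '+' });

// Point 3: /(?:\d{4,6})?/ as two nested Repetition nodes, with no group node.
// (`\d` is treated as a plain Char here for brevity.)
const digits = repeat(char('\\d'), { type: 'Range', from: 4, to: 6 });
const optionalDigits = repeat(digits, { type: '?' });

// Point 1: analysis dispatches on `type` only; a forgotten case throws
// instead of silently ignoring a quantifier.
function minLength(node) {
  switch (node.type) {
    case 'Char':
      return 1;
    case 'Repetition': {
      const q = node.quantifier;
      // '+' needs at least one match, 'Range' at least `from`; '*' and '?' allow zero.
      const min = q.type === '+' ? 1 : q.type === 'Range' ? q.from : 0;
      return min * minLength(node.expression);
    }
    default:
      throw new Error(`Unhandled node type: ${node.type}`);
  }
}

console.log(minLength(oneOrMore(char('a')))); // 1
console.log(minLength(digits));               // 4
console.log(minLength(optionalDigits));       // 0
```

Because every quantifier is its own node, `oneOrMore(digits)` works the same
way as `oneOrMore(char('a'))`; nothing has to merge two quantifiers on one node.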

To me this is not even a question.

-j



On Sun, Mar 19, 2017 at 4:52 PM, Dmitry Soshnikov <
dmitry.soshnikov at gmail.com> wrote:

> I started working on an ECMAScript regular expressions parser[1] with an
> AST format similar to Mozilla's Parser API. This might later be extended to
> support more powerful constructs, like lookbehind assertions, multiline
> regexes, named groups, comments, etc.
>
> And while this is mostly an FYI post (someone will probably find it useful
> for regex analysis in source transformation tools or source code
> editors), I'd appreciate any feedback on the specification of the AST nodes
> (currently made up entirely by me).
>
> E.g. the quantifier production in the ES spec's RegExp grammar says nothing
> (and of course shouldn't) about which AST node a quantifier produces.
>
> This leaves open questions like "should a quantifier be a part of the
> parsed expression, or should it, vice versa, be the main node itself and
> have the expression as a sub-node?"
>
> In other words, which format is more appropriate (taking into account AST
> traversal tools, in order to implement an NFA/DFA engine for it later):
>
> ```
> /a+?/
> ```
>
> Char is main, quantifier is a sub-node:
>
> ```
> {
>   type: 'Char',
>   value: 'a',
>   quantifier: {
>     type: '+',
>     greedy: false,
>   }
> }
> ```
>
> The quantifier is main (creating `Repetition` AST node), char is the
> `expression` sub-node:
>
>
> ```
> {
>   type: 'Repetition',
>   expression: {
>     type: 'Char',
>     value: 'a'
>   },
>   quantifier: {
>     type: '+',
>     greedy: false,
>   }
> }
> ```
>
> Currently I have chosen the second approach (with the `Repetition` node) as
> more practical when building an AST traversal -- it can have a generic
> `onRepetition` handler that calls `onChar` internally for its `expression`,
> instead of making `onChar` (or any other node's handler) check for and
> handle its own `quantifier` and do the repeat.
>
> Anyway, if you have any thoughts or feedback on the AST node format, please
> feel free to contact me.
>
> Dmitry
>
>  [1] https://www.npmjs.com/package/regexp-tree
>