Streaming regexp matching

Jordan Harband ljharb at gmail.com
Mon Jul 30 21:44:48 UTC 2018


(It'd have to be a subclass for `RegExp.prototype.whatever.call` to do the
right thing for your alternative regex)

On Mon, Jul 30, 2018 at 12:36 PM, Isiah Meadows <isiahmeadows at gmail.com>
wrote:

> Not yet, but if I were, it wouldn't be a subclass (it's not
> necessary). But the key trouble implementing this is that I'd have to
> reimplement the regexp matching logic entirely from scratch if I want
> remotely reasonable runtime complexity. As for why:
>
> - If you ignore backreferences (which IIUC makes JS regexps go from
> regular to context-sensitive recognizers), it's *possible* to
> implement partial matching using regexp rewriting, but only because JS
> doesn't have any other arcane enough features to make it infeasible.
> - There is no possible way to extract duplicate matches without
> rewriting the regexp entirely and using `regexp.exec`, and the
> complexity of the logic to do this generally I suspect is NP-complete,
> maybe PSPACE-complete, and in either case, certainly infeasible when
> backreferences enter the picture.
> - It's technically possible to refactor a string for streaming, but I
> then lose the ability to discern a match from a non-match. I also have
> the same issue as per above WRT duplicate matches and partial matching
> with complexity issues.
> - If you were to combine the three, rewriting the regexp generally to
> support streaming matches, including duplicates, I suspect would
> likely be PSPACE-complete because it seems *very* similar to the first
> problem listed here [1].
>
> \* Backreferences bring the grammatical complexity of JS regexps from
> I'm pretty sure regular to context-sensitive.
>
> Or in other words, either you control the raw matching logic yourself,
> or the polyfill runtime complexity could get absurd for this proposal.
> And implementing a pushdown state machine-based regexp engine + parser
> in JS isn't exactly something I'm willing to prototype for a strawman.
>
> [1]: https://en.wikipedia.org/wiki/PSPACE-complete#Regular_
> expressions_and_automata
>
> -----
>
> Isiah Meadows
> contact at isiahmeadows.com
> www.isiahmeadows.com
>
>
> On Mon, Jul 30, 2018 at 2:44 PM, Jordan Harband <ljharb at gmail.com> wrote:
> > Have you tried to implement this as a RegExp subclass, overriding all the
> > necessary extension points?
> >
> > On Mon, Jul 30, 2018 at 11:39 AM, Isiah Meadows <isiahmeadows at gmail.com>
> > wrote:
> >>
> >> There's two things I've found that need suspendable matching:
> >>
> >> 1. Partially matching against a string, which is useful with
> >> interactive form validation and such.
> >> 2. Pattern matching and replacement over a character stream, which is
> >> useful for things like matching against files without loading the
> >> entire thing into memory or easier filtering of requests.
> >>
> >> Also, it'd be nice if there was a facility to get *all* matches,
> >> including duplicate group matches. This is often useful for simple
> >> parsing, where if such support existed, you could just use a Kleene
> >> star instead of the standard `exec` loops (which admittedly get old).
> >>
> >> And finally, we could avoid setting regexp globals here. That would
> >> speed up the matcher quite a bit.
> >>
> >> So, here's my proposal:
> >>
> >> - `regexp.matcher() -> matcher` - Create a streaming regexp matcher.
> >> - `matcher.consume(codePoint, charSize?) -> result | undefined` -
> >> Consume a Unicode code point or `-1` if no more characters exist, and
> >> return a match result, `undefined` if no match occurred. `charSize` is
> >> the number of bytes represented by `codePoint` (default: 1-2 if `/u`
> >> is set, 1 otherwise), so it can work with other encodings flexibly.
> >> - `matcher.nextPossibleStart -> number` - The next possible start the
> >> matcher could have, for more effective buffering and stream
> >> management. This is implementation-defined, but it *must* be be `-1`
> >> after the matcher completes, and it *must* be within [0, N) otherwise,
> >> where N is the next returned match.
> >> - `result.group -> string | number | undefined` - Return the group
> >> index/name of the current match, or `undefined` if it's just issuing a
> >> match of the global regexp.
> >> - `result.start -> number` - Return the matched value's start index.
> >> - `result.end -> number` - Return the matched value's end index.
> >> - This does *not* modify any globals or regexp instance members. It
> >> only reads `regexp.lastIndex` on creation. (It doesn't operate on
> >> strings, so it shouldn't return any it doesn't already have.)
> >>
> >> Most RegExp methods could similarly be built using this as a base: if
> >> they work on strings, they can iterate their code points.
> >>
> >> As for the various concerns:
> >>
> >> - Partial matching is just iterating a string's character codes and
> >> seeing if the matcher ever returned non-`undefined`.
> >> - Streaming pattern matching is pretty obvious from just reading the
> API.
> >> - Getting all matches is just iterating the string and returning an
> >> object with all the groups + strings it matched.
> >>
> >> So WDYT?
> >>
> >> /cc Mathias Bynens, since I know you're involved in this kind of
> >> text-heavy stuff.
> >>
> >> -----
> >>
> >> Isiah Meadows
> >> contact at isiahmeadows.com
> >> www.isiahmeadows.com
> >> _______________________________________________
> >> es-discuss mailing list
> >> es-discuss at mozilla.org
> >> https://mail.mozilla.org/listinfo/es-discuss
> >
> >
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.mozilla.org/pipermail/es-discuss/attachments/20180730/4ff593f3/attachment.html>


More information about the es-discuss mailing list