Streaming regexp matching
Isiah Meadows
isiahmeadows at gmail.com
Mon Jul 30 23:59:56 UTC 2018
BTW, I created a rough sketch of a strawman here:
https://github.com/isiahmeadows/streaming-regexp-proposal
-----
Isiah Meadows
contact at isiahmeadows.com
www.isiahmeadows.com
On Mon, Jul 30, 2018 at 5:48 PM, Isiah Meadows <isiahmeadows at gmail.com> wrote:
> I was thinking in terms of what hooks need overridden (none for the
> base proposal), not where the method itself lived.
> -----
>
> Isiah Meadows
> contact at isiahmeadows.com
> www.isiahmeadows.com
>
>
> On Mon, Jul 30, 2018 at 5:44 PM, Jordan Harband <ljharb at gmail.com> wrote:
>> (It'd have to be a subclass for `RegExp.prototype.whatever.call` to do the
>> right thing for your alternative regex)
>>
>> On Mon, Jul 30, 2018 at 12:36 PM, Isiah Meadows <isiahmeadows at gmail.com>
>> wrote:
>>>
>>> Not yet, but if I were, it wouldn't be a subclass (it's not
>>> necessary). But the key trouble implementing this is that I'd have to
>>> reimplement the regexp matching logic entirely from scratch if I want
>>> remotely reasonable runtime complexity. As for why:
>>>
>>> - If you ignore backreferences (which IIUC makes JS regexps go from
>>> regular to context-sensitive recognizers), it's *possible* to
>>> implement partial matching using regexp rewriting, but only because JS
>>> doesn't have any other arcane enough features to make it infeasible.
>>> - There is no possible way to extract duplicate matches without
>>> rewriting the regexp entirely and using `regexp.exec`, and the
>>> complexity of the logic to do this generally I suspect is NP-complete,
>>> maybe PSPACE-complete, and in either case, certainly infeasible when
>>> backreferences enter the picture.
>>> - It's technically possible to refactor a string for streaming, but I
>>> then lose the ability to discern a match from a non-match. I also have
>>> the same issue as per above WRT duplicate matches and partial matching
>>> with complexity issues.
>>> - If you were to combine the three, rewriting the regexp generally to
>>> support streaming matches, including duplicates, I suspect would
>>> likely be PSPACE-complete because it seems *very* similar to the first
>>> problem listed here [1].
>>>
>>> \* Backreferences bring the grammatical complexity of JS regexps from
>>> I'm pretty sure regular to context-sensitive.
>>>
>>> Or in other words, either you control the raw matching logic yourself,
>>> or the polyfill runtime complexity could get absurd for this proposal.
>>> And implementing a pushdown state machine-based regexp engine + parser
>>> in JS isn't exactly something I'm willing to prototype for a strawman.
>>>
>>> [1]:
>>> https://en.wikipedia.org/wiki/PSPACE-complete#Regular_expressions_and_automata
>>>
>>> -----
>>>
>>> Isiah Meadows
>>> contact at isiahmeadows.com
>>> www.isiahmeadows.com
>>>
>>>
>>> On Mon, Jul 30, 2018 at 2:44 PM, Jordan Harband <ljharb at gmail.com> wrote:
>>> > Have you tried to implement this as a RegExp subclass, overriding all
>>> > the
>>> > necessary extension points?
>>> >
>>> > On Mon, Jul 30, 2018 at 11:39 AM, Isiah Meadows <isiahmeadows at gmail.com>
>>> > wrote:
>>> >>
>>> >> There's two things I've found that need suspendable matching:
>>> >>
>>> >> 1. Partially matching against a string, which is useful with
>>> >> interactive form validation and such.
>>> >> 2. Pattern matching and replacement over a character stream, which is
>>> >> useful for things like matching against files without loading the
>>> >> entire thing into memory or easier filtering of requests.
>>> >>
>>> >> Also, it'd be nice if there was a facility to get *all* matches,
>>> >> including duplicate group matches. This is often useful for simple
>>> >> parsing, where if such support existed, you could just use a Kleene
>>> >> star instead of the standard `exec` loops (which admittedly get old).
>>> >>
>>> >> And finally, we could avoid setting regexp globals here. That would
>>> >> speed up the matcher quite a bit.
>>> >>
>>> >> So, here's my proposal:
>>> >>
>>> >> - `regexp.matcher() -> matcher` - Create a streaming regexp matcher.
>>> >> - `matcher.consume(codePoint, charSize?) -> result | undefined` -
>>> >> Consume a Unicode code point or `-1` if no more characters exist, and
>>> >> return a match result, `undefined` if no match occurred. `charSize` is
>>> >> the number of bytes represented by `codePoint` (default: 1-2 if `/u`
>>> >> is set, 1 otherwise), so it can work with other encodings flexibly.
>>> >> - `matcher.nextPossibleStart -> number` - The next possible start the
>>> >> matcher could have, for more effective buffering and stream
>>> >> management. This is implementation-defined, but it *must* be be `-1`
>>> >> after the matcher completes, and it *must* be within [0, N) otherwise,
>>> >> where N is the next returned match.
>>> >> - `result.group -> string | number | undefined` - Return the group
>>> >> index/name of the current match, or `undefined` if it's just issuing a
>>> >> match of the global regexp.
>>> >> - `result.start -> number` - Return the matched value's start index.
>>> >> - `result.end -> number` - Return the matched value's end index.
>>> >> - This does *not* modify any globals or regexp instance members. It
>>> >> only reads `regexp.lastIndex` on creation. (It doesn't operate on
>>> >> strings, so it shouldn't return any it doesn't already have.)
>>> >>
>>> >> Most RegExp methods could similarly be built using this as a base: if
>>> >> they work on strings, they can iterate their code points.
>>> >>
>>> >> As for the various concerns:
>>> >>
>>> >> - Partial matching is just iterating a string's character codes and
>>> >> seeing if the matcher ever returned non-`undefined`.
>>> >> - Streaming pattern matching is pretty obvious from just reading the
>>> >> API.
>>> >> - Getting all matches is just iterating the string and returning an
>>> >> object with all the groups + strings it matched.
>>> >>
>>> >> So WDYT?
>>> >>
>>> >> /cc Mathias Bynens, since I know you're involved in this kind of
>>> >> text-heavy stuff.
>>> >>
>>> >> -----
>>> >>
>>> >> Isiah Meadows
>>> >> contact at isiahmeadows.com
>>> >> www.isiahmeadows.com
>>> >> _______________________________________________
>>> >> es-discuss mailing list
>>> >> es-discuss at mozilla.org
>>> >> https://mail.mozilla.org/listinfo/es-discuss
>>> >
>>> >
>>
>>
More information about the es-discuss
mailing list