Streaming regexp matching
michael.lee.theriot at gmail.com
Mon Jul 30 19:34:10 UTC 2018
I'll say I have wanted a way to stream input to a RegExp at least a few
times, to the point that I fiddled around with implementing a finite state machine.
My use cases:
1. short circuiting on nth capture
2. parsing streams
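
To make both use cases concrete, here is a minimal sketch of the hand-rolled approach, assuming a fixed pattern equivalent to `/ab*c/g`. The names `createAbcMatcher` and `firstSpans` are hypothetical, made up for illustration; this is not an existing API and not necessarily how the state machine mentioned above was written.

```javascript
// Hypothetical hand-written state machine equivalent to /ab*c/g.
// It keeps its state between chunks, so a match may span chunk boundaries.
function createAbcMatcher() {
  let state = 0;  // 0: looking for "a"; 1: inside "ab*", waiting for "b" or "c"
  let offset = 0; // absolute offset of the next character across all chunks
  let start = -1; // start offset of the candidate match while in state 1

  return {
    // Feed one chunk; returns the [start, end) spans completed inside it.
    write(chunk) {
      const spans = [];
      for (let i = 0; i < chunk.length; i++, offset++) {
        const ch = chunk[i];
        if (state === 1 && ch === "c") {
          spans.push({ start, end: offset + 1 });
          state = 0;
        } else if (state === 1 && ch === "b") {
          // Still inside "b*": stay in state 1.
        } else if (ch === "a") {
          state = 1; // an "a" always starts a fresh candidate
          start = offset;
        } else {
          state = 0;
        }
      }
      return spans;
    },
  };
}

// Short circuiting: stop pulling chunks once `limit` matches were seen.
function firstSpans(chunks, limit) {
  const matcher = createAbcMatcher();
  const found = [];
  for (const chunk of chunks) {
    found.push(...matcher.write(chunk));
    if (found.length >= limit) break; // later chunks are never read
  }
  return found.slice(0, limit);
}

// "xabbcabc" split at arbitrary points: both matches straddle chunk boundaries.
const spans = firstSpans(["xa", "bb", "cab", "c"], 5);
// spans is [{ start: 1, end: 5 }, { start: 5, end: 8 }]
```

Because `firstSpans` accepts any iterable, passing a generator means the chunks after the last needed match are never produced at all.
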
On Mon, Jul 30, 2018 at 2:39 PM, Isiah Meadows <isiahmeadows at gmail.com> wrote:
> There are two things I've found that need suspendable matching:
> 1. Partially matching against a string, which is useful with
> interactive form validation and such.
> 2. Pattern matching and replacement over a character stream, which is
> useful for things like matching against files without loading the
> entire thing into memory or easier filtering of requests.
> Also, it'd be nice if there was a facility to get *all* matches,
> including duplicate group matches. This is often useful for simple
> parsing, where if such support existed, you could just use a Kleene
> star instead of the standard `exec` loops (which admittedly get old).
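
For context, a small sketch of the status quo being described, using only standard `RegExp.prototype.exec`. The sample pattern and the helper name `collectWords` are illustrative assumptions, not part of the proposal.

```javascript
// A capture group under a Kleene star only remembers its last iteration.
const starMatch = /^(?:(\w+)\s*)*$/.exec("alpha beta gamma");
const lastWordOnly = starMatch[1]; // "gamma"; "alpha" and "beta" are lost

// The usual workaround: a global regexp driven by an `exec` loop.
function collectWords(text) {
  const wordPattern = /(\w+)/g; // fresh instance, so `lastIndex` starts at 0
  const words = [];
  let match;
  while ((match = wordPattern.exec(text)) !== null) {
    words.push(match[1]);
  }
  return words;
}

const allWords = collectWords("alpha beta gamma"); // ["alpha", "beta", "gamma"]
```
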
> And finally, we could avoid setting regexp globals here. That would
> speed up the matcher quite a bit.
> So, here's my proposal:
> - `regexp.matcher() -> matcher` - Create a streaming regexp matcher.
> - `matcher.consume(codePoint, charSize?) -> result | undefined` -
> Consume a Unicode code point or `-1` if no more characters exist, and
> return a match result, or `undefined` if no match occurred. `charSize` is
> the number of bytes represented by `codePoint` (default: 1-2 if `/u`
> is set, 1 otherwise), so it can work with other encodings flexibly.
> - `matcher.nextPossibleStart -> number` - The next possible start the
> matcher could have, for more effective buffering and stream
> management. This is implementation-defined, but it *must* be `-1`
> after the matcher completes, and it *must* be within [0, N) otherwise,
> where N is the next returned match.
> - `result.group -> string | number | undefined` - Return the group
> index/name of the current match, or `undefined` if it's just issuing a
> match of the global regexp.
> - `result.start -> number` - Return the matched value's start index.
> - `result.end -> number` - Return the matched value's end index.
> - This does *not* modify any globals or regexp instance members. It
> only reads `regexp.lastIndex` on creation. (It doesn't operate on
> strings, so it shouldn't return any it doesn't already have.)
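
To show what calling code might look like, here is a rough sketch that hand-implements the proposed interface shape for one fixed pattern, roughly `/[0-9]+/u`. Nothing here is a real RegExp API: `createDigitMatcher` and `digitRuns` are hypothetical names, the `charSize` default and the `nextPossibleStart` behaviour are one reading of the description above, and offsets are counted in UTF-16 code units as an assumption.

```javascript
// Hypothetical stand-in for `/[0-9]+/u.matcher()`; not an existing API.
function createDigitMatcher() {
  let index = 0;    // offset of the next code point, in units of `charSize`
  let start = -1;   // start of the digit run in progress, or -1 if none
  let done = false; // set once `-1` (end of input) has been consumed

  return {
    // Earliest offset at which a pending or future match could begin.
    get nextPossibleStart() {
      if (done) return -1;
      return start !== -1 ? start : index;
    },
    // Assumed default: 2 units for astral code points, 1 otherwise.
    consume(codePoint, charSize = codePoint > 0xffff ? 2 : 1) {
      if (done) return undefined;
      const isDigit = codePoint >= 0x30 && codePoint <= 0x39;
      let result;
      if (!isDigit && start !== -1) {
        // A non-digit (or end of input) closes the run in progress.
        result = { group: undefined, start, end: index };
        start = -1;
      }
      if (codePoint === -1) {
        done = true;
      } else {
        if (isDigit && start === -1) start = index;
        index += charSize;
      }
      return result;
    },
  };
}

// Drive the matcher over a string, one code point at a time.
function digitRuns(text) {
  const matcher = createDigitMatcher();
  const found = [];
  for (const ch of text) {
    const result = matcher.consume(ch.codePointAt(0));
    if (result !== undefined) found.push(text.slice(result.start, result.end));
  }
  const last = matcher.consume(-1); // signal end of input
  if (last !== undefined) found.push(text.slice(last.start, last.end));
  return found;
}
```

A stream consumer could discard everything it has buffered before `nextPossibleStart`, since no later match can reach back past that offset in this sketch.
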
> Most RegExp methods could similarly be built using this as a base: if
> they work on strings, they can iterate their code points.
> As for the various concerns:
> - Partial matching is just iterating a string's code points and
> seeing if the matcher ever returned non-`undefined`.
> - Streaming pattern matching is pretty obvious from just reading the API.
> - Getting all matches is just iterating the string and returning an
> object with all the groups + strings it matched.
> So WDYT?
> /cc Mathias Bynens, since I know you're involved in this kind of
> text-heavy stuff.
> Isiah Meadows
> contact at isiahmeadows.com
> es-discuss mailing list
> es-discuss at mozilla.org