Skip to content

Latest commit

 

History

History
46 lines (38 loc) · 2.2 KB

README.md

File metadata and controls

46 lines (38 loc) · 2.2 KB

regrams

regrams is a go package that implements an engine for text search by regular expression queries. Given a collection of documents that are indexed by trigrams and a regular expression, regrams can generate a boolean query for the trigram index that retrieves good candidates that might match the regular expression.

For example, regrams.MakeQuery("a(bc)+d") returns (abc) (bcb|bcd), which means that any document that matches the regular expression a(bc)+d must contain the trigram abc as well as one of bcb or bcd.

regrams is inspired by Google's codesearch repo and Russ Cox's corresponding blog post explaining the implementation. Instead of using the structure of the parsed regular expression to generate queries, however, regrams creates an NFA from the query with nodes weighted by the size of the reachable trigram sets and repeatedly extracts disjunctions from the NFA by identifying and removing minimum-weight graph cuts. I wrote a post on regular expression search via graph cuts that gives a lot more detail about the technique.

The easiest way to try out this package is to run the command-line wrapper included in this repo:

$ go run cmd/regrams/main.go abcd
(abc) (bcd)
$ go run cmd/regrams/main.go 'Time: [0-9]+ ms'
( ms) (: 0|: 1|: 2|: 3|: 4|: 5|: 6|: 7|: 8|: 9) (Tim) (e: ) (ime) (me:)
$ go run cmd/regrams/main.go '(abc*)+de'
(aba|abc|abd) (bab|bca|bcc|bcd|bde)
$ go run cmd/regrams/main.go '(?i)abc'
(ABC|ABc|AbC|Abc|aBC|aBc|abC|abc)

Running go run cmd/regrams/main.go --help will give you more options.

Alternatively, you can try out examples in a browser at the Go Playground.

To use regrams as a package, import github.com/aaw/regrams, then call regrams.MakeQuery in your code to generate a trigram query from a string representing a regular expression. MakeQuery returns a slice-of-slices of strings, which should be interpreted as a boolean query in conjunctive normal form: the slice [["abc"], ["xyz","xyw"], ["xxx", "yyy", "zzz"]] represents the query abc AND (xyz OR xyw) AND (xxx OR yyy OR zzz).