Forrest Smith's Fuzzy Search, in Racket

Last updated: home

It's old news, but you might be interested to know that Forrest Smith reverse engineered Sublime Text's fuzzy search and released the source.

I ported the implementation to Racket in the fuzzy-search package and added some other goodies.

$ raco pkg install fuzzy-search

Use is as simple as finding a needle in a haystack.

> (require fuzzy-search)
> (fuzzy-search
   "needle"
   "haystack haystack n haystack e e haystack d haystack l e haystack")
#t
206
'#hasheq((0 . 18) (1 . 29) (2 . 31) (3 . 42) (4 . 53) (5 . 55))

The term “needle” is not entirely correct since the string acts as a search pattern, not as an exact substring in the haystack. I only use the term to illustrate the relationship between the arguments.

The return values answer three questions:

Fuzzy searches normally consider edit distance between strings. This one ranks your search by match characteristics, such as adjacency to other characters, or distance from the beginning of the haystack string. By default, I use Forrest's scoring algorithm, which tries to mimic Sublime Text's biases. You can provide your own scoring algorithm if the results don't work for you.

My Changes

While the implementation started as a port of the JavaScript implementation of Forrest's fuzzy match, I've made a few adjustments.

The require subform is a fun one. It comes from fuzzy-search/require. With it, you can write (require (fuzzy "my server module")), which will use fuzzy-search on possible relative paths starting from the directory of the module using fuzzy. I originally set it to start searching from the parent directory of the module, but thought better of it. The online docs may still mention the parent directory as you read this, so just be warned those docs will change about a day after the time of this writing.

Obviously, fuzzy can import the wrong thing. It's just handy when you are shuffling files around and can't be bothered to rewrite relative paths on every use of mv.

Performance

I did not use unsafe bindings, so the performance of the algorithm reflects conventional Racket use. The freshly-ported version of the algorithm will make your CPU burst into flames for long haystacks and needles.

Sublime Text author Jon Skinner suggested (or Forrest/myself heard) that a recursive search should start for the last N characters of the haystack for every pattern character match in position N-1. This means that matches create more work for the CPU. Ironically, several exact substring matches near the beginning of the haystack create the most work. Ha!

I'm not a benchmarking guru, but I can slap together a quick experiment to show you the implications.

The Setup

Let's say that haystacks consist of pseudorandom strings separated by a single space. Each string consists of mixed-case alphanumeric characters and all punctuation from my QWERTY keyboard. For worst-case timings on an iterative part of the algorithm, each haystack must end with the only occurrence of a needle's last character. This guarentees a full traversal of the haystack for needles of a given length, but does not guarentee a full traversal of any needle.

The fuzzy search's runtime still varies quite a bit with pseudorandom strings, since I couldn't know how many recursive searches would start for a given search. I didn't worry about maximizing that, because I already know there will be exponential growth based on needles and haystack length. I'm not bothering with a full asymptotic analysis because I already graduated from my CS program and am free. Freeee!

I just wrote some procedures to compute the mean of Racket's time-apply results. This helps filter out some of the variability, but the algorithm also has a couple of knobs for controlling the recursion depth for child searches and the maximum number of matches to consider before dismissing the pattern as too vague.

I'm not using normalized time because increases in runtime are obvious under the conditions I show, and I learned about the benchmark package after drafting this article. I also don't bother with parallelism since I expect a fuzzy search would be used in interactive (hence, concurrent) applications. I just run my simulations on one 2.5 GHz core.

I will use Racket 7.6 CS to report mean timings from 100 searches of 3, 4, 5, 6, 7, and 8-character long needles. I allow up to 10 recursive searches, and up to 256 character matches between the needle and the haystack.

As mentioned, I will make every haystack end with the last character of a prescribed search pattern to force a full traversal of the haystack.

“Everything is fast for small N” -- Jeff Atwood

For small haystacks (like filesystem paths), speed isn't a concern. In this example, haystacks are just 256 characters long.

[sage@localhost fuzzy-search]$ racket benchmark.rkt 100 3 8 1 256
Mean times of 100 trial(s) using 1 256-character
word(s) and 3 to 8-character needles:

needle-length: 3        cpu: 0.0        real: 0.0       gc: 0.0
needle-length: 4        cpu: 0.0        real: 0.0       gc: 0.0
needle-length: 5        cpu: 0.0        real: 0.0       gc: 0.0
needle-length: 6        cpu: 0.0        real: 0.0       gc: 0.0
needle-length: 7        cpu: 0.0        real: 0.0       gc: 0.0
needle-length: 8        cpu: 0.0        real: 0.0       gc: 0.0

This shows that the fuzzy require sub-form probably won't bottleneck anything. Nor will other fuzzy searches that check a set of sufficiently small strings.

Make it Sweat

Let's increase the size of the data set. Here I use 3000 “word” haystacks where each word is 8 characters long. This sloppily simulates searching the full body of a short English-like essay.

[sage@localhost fuzzy-search]$ racket benchmark.rkt 100 3 8 3000 8
Mean times of 100 trial(s) using 3000 8-character
word(s) and 3 to 8-character needles:

needle-length: 3        cpu: 8.7        real: 8.7       gc: 0.0
needle-length: 4        cpu: 29.44      real: 29.49     gc: 0.0
needle-length: 5        cpu: 123.76     real: 123.87    gc: 0.0
needle-length: 6        cpu: 378.03     real: 378.5     gc: 0.0
needle-length: 7        cpu: 730.77     real: 732.01    gc: 0.0
needle-length: 8        cpu: 1583.37    real: 1585.63   gc: 0.0

Every character you add to a search pattern at least doubles the mean time! So if you need to look up some file or record you don't remember from a set of many candidates, then run raco pkg install fuzzy-search. If you need to run an analysis over a series of books, then, uh, weigh your options first.