Forrest Smith's Fuzzy Search, in Racket
It's old news, but you might be interested to know that and released the source.
I ported the implementation to Racket in the package and added some other goodies.
$ raco pkg install fuzzy-search
Use is as simple as finding a needle in a haystack.
> (require fuzzy-search)
> (fuzzy-search
"needle"
"haystack haystack n haystack e e haystack d haystack l e haystack")
#t
206
'#hasheq((0 . 18) (1 . 29) (2 . 31) (3 . 42) (4 . 53) (5 . 55))
The term “needle” is not entirely correct since the string acts as a search pattern, not as an exact substring in the haystack. I only use the term to illustrate the relationship between the arguments.
The return values answer three questions:
- Is there at least one (character) match?
- What's the overall score for the query? If you hold the needle constant, you can check against several strings to gauge which one has the best match.
- Where do the characters from the needle appear in the haystack?
The zero-based numbers in the
(0 . 18)
pair means that the first character match in the pattern appeared at position 18 in the haystack.
Fuzzy searches normally consider edit distance between strings. This one ranks your search by match characteristics, such as adjacency to other characters, or distance from the beginning of the haystack string. By default, I use Forrest's scoring algorithm, which tries to mimic Sublime Text's biases. You can provide your own scoring algorithm if the results don't work for you.
My Changes
While the implementation started as a port of the JavaScript implementation of Forrest's fuzzy match, I've made a few adjustments.
- Allow custom scoring procedures.
- Allow case-sensitivity.
- Add scoring combinators to the public API.
- Add match positions in search output.
- Add the
fuzzy
require subform. - Remove mutable objects and imperative code.
- Package for distribution on the Racket catalog.
The require
subform is a fun one. It comes from
fuzzy-search/require
. With it, you can write
(require (fuzzy "my server module"))
, which will use
fuzzy-search
on possible relative paths starting from the
directory of the module using fuzzy
. I originally set it to start
searching from the parent directory of the module, but thought better
of it. The online docs may still mention the parent directory as you
read this, so just be warned those docs will change about a day after
the time of this writing.
Obviously, fuzzy
can import the wrong thing. It's just handy when
you are shuffling files around and can't be bothered to rewrite
relative paths on every use of mv
.
Performance
I did not use unsafe bindings, so the performance of the algorithm reflects conventional Racket use. The freshly-ported version of the algorithm will make your CPU burst into flames for long haystacks and needles.
Sublime Text author Jon Skinner suggested (or Forrest/myself heard)
that a recursive search should start for the last N
characters of
the haystack for every pattern character match in position
N-1
. This means that matches create more work for the
CPU. Ironically, several exact substring matches near the beginning of
the haystack create the most work. Ha!
I'm not a benchmarking guru, but I can slap together a quick experiment to show you the implications.
The Setup
Let's say that haystacks consist of pseudorandom strings separated by a single space. Each string consists of mixed-case alphanumeric characters and all punctuation from my QWERTY keyboard. For worst-case timings on an iterative part of the algorithm, each haystack must end with the only occurrence of a needle's last character. This guarentees a full traversal of the haystack for needles of a given length, but does not guarentee a full traversal of any needle.
The fuzzy search's runtime still varies quite a bit with pseudorandom strings, since I couldn't know how many recursive searches would start for a given search. I didn't worry about maximizing that, because I already know there will be exponential growth based on needles and haystack length. I'm not bothering with a full asymptotic analysis because I already graduated from my CS program and am free. Freeee!
I just wrote some procedures to compute the mean of Racket's
time-apply
results. This helps filter out some of the
variability, but the algorithm also has a couple of knobs for
controlling the recursion depth for child searches and the maximum
number of matches to consider before dismissing the pattern as too
vague.
I'm not using normalized time because increases in runtime are obvious under the conditions I show, and I learned about the after drafting this article. I also don't bother with parallelism since I expect a fuzzy search would be used in interactive (hence, concurrent) applications. I just run my simulations on one 2.5 GHz core.
I will use Racket 7.6 CS to report mean timings from 100 searches of 3, 4, 5, 6, 7, and 8-character long needles. I allow up to 10 recursive searches, and up to 256 character matches between the needle and the haystack.
As mentioned, I will make every haystack end with the last character of a prescribed search pattern to force a full traversal of the haystack.
“Everything is fast for small N” -- Jeff Atwood
For small haystacks (like filesystem paths), speed isn't a concern. In this example, haystacks are just 256 characters long.
[sage@localhost fuzzy-search]$ racket benchmark.rkt 100 3 8 1 256
Mean times of 100 trial(s) using 1 256-character
word(s) and 3 to 8-character needles:
needle-length: 3 cpu: 0.0 real: 0.0 gc: 0.0
needle-length: 4 cpu: 0.0 real: 0.0 gc: 0.0
needle-length: 5 cpu: 0.0 real: 0.0 gc: 0.0
needle-length: 6 cpu: 0.0 real: 0.0 gc: 0.0
needle-length: 7 cpu: 0.0 real: 0.0 gc: 0.0
needle-length: 8 cpu: 0.0 real: 0.0 gc: 0.0
This shows that the fuzzy
require sub-form probably won't
bottleneck anything. Nor will other fuzzy searches that check a set of
sufficiently small strings.
Make it Sweat
Let's increase the size of the data set. Here I use 3000 “word” haystacks where each word is 8 characters long. This sloppily simulates searching the full body of a short English-like essay.
[sage@localhost fuzzy-search]$ racket benchmark.rkt 100 3 8 3000 8
Mean times of 100 trial(s) using 3000 8-character
word(s) and 3 to 8-character needles:
needle-length: 3 cpu: 8.7 real: 8.7 gc: 0.0
needle-length: 4 cpu: 29.44 real: 29.49 gc: 0.0
needle-length: 5 cpu: 123.76 real: 123.87 gc: 0.0
needle-length: 6 cpu: 378.03 real: 378.5 gc: 0.0
needle-length: 7 cpu: 730.77 real: 732.01 gc: 0.0
needle-length: 8 cpu: 1583.37 real: 1585.63 gc: 0.0
Every character you add to a search pattern at least doubles the mean
time! So if you need to look up some file or record you don't remember
from a set of many candidates, then run raco pkg install
fuzzy-search
. If you need to run an analysis over a series of books,
then, uh, weigh your options first.