It's old news, but you might be interested to know that Forrest Smith reverse engineered Sublime Text's fuzzy search and released the source.
I ported the implementation to Racket in the
fuzzy-search package and added some other goodies.
$ raco pkg install fuzzy-search
Use is as simple as finding a needle in a haystack.
> (require fuzzy-search) > (fuzzy-search "needle" "haystack haystack n haystack e e haystack d haystack l e haystack") #t 206 '#hasheq((0 . 18) (1 . 29) (2 . 31) (3 . 42) (4 . 53) (5 . 55))
The term “needle” is not entirely correct since the string acts as a search pattern, not as an exact substring in the haystack. I only use the term to illustrate the relationship between the arguments.
The return values answer three questions:
- Is there at least one (character) match?
- What's the overall score for the query? If you hold the needle constant, you can check against several strings to gauge which one has the best match.
- Where do the characters from the needle appear in the haystack?
The zero-based numbers in the
(0 . 18)pair means that the first character match in the pattern appeared at position 18 in the haystack.
Fuzzy searches normally consider edit distance between strings. This one ranks your search by match characteristics, such as adjacency to other characters, or distance from the beginning of the haystack string. By default, I use Forrest's scoring algorithm, which tries to mimic Sublime Text's biases. You can provide your own scoring algorithm if the results don't work for you.
- Allow custom scoring procedures.
- Allow case-sensitivity.
- Add scoring combinators to the public API.
- Add match positions in search output.
- Add the
- Remove mutable objects and imperative code.
- Package for distribution on the Racket catalog.
require subform is a fun one. It comes from
fuzzy-search/require. With it, you can write
(require (fuzzy "my server module")), which will use
fuzzy-search on possible relative paths starting from the directory of the module using
fuzzy. I originally set it to start searching from the parent directory of the module, but thought better of it. The online docs may still mention the parent directory as you read this, so just be warned those docs will change about a day after the time of this writing.
fuzzy can import the wrong thing. It's just handy when you are shuffling files around and can't be bothered to rewrite relative paths on every use of
I did not use unsafe bindings, so the performance of the algorithm reflects conventional Racket use. The freshly-ported version of the algorithm will make your CPU burst into flames for long haystacks and needles.
Sublime Text author Jon Skinner suggested (or Forrest/myself heard) that a recursive search should start for the last
N characters of the haystack for every pattern character match in position
N-1. This means that matches create more work for the CPU. Ironically, several exact substring matches near the beginning of the haystack create the most work. Ha!
I'm not a benchmarking guru, but I can slap together a quick experiment to show you the implications.
Let's say that haystacks consist of pseudorandom strings separated by a single space. Each string consists of mixed-case alphanumeric characters and all punctuation from my QWERTY keyboard. For worst-case timings on an iterative part of the algorithm, each haystack must end with the only occurrence of a needle's last character. This guarentees a full traversal of the haystack for needles of a given length, but does not guarentee a full traversal of any needle.
The fuzzy search's runtime still varies quite a bit with pseudorandom strings, since I couldn't know how many recursive searches would start for a given search. I didn't worry about maximizing that, because I already know there will be exponential growth based on needles and haystack length. I'm not bothering with a full asymptotic analysis because I already graduated from my CS program and am free. Freeee!
I just wrote some procedures to compute the mean of Racket's
time-apply results. This helps filter out some of the variability, but the algorithm also has a couple of knobs for controlling the recursion depth for child searches and the maximum number of matches to consider before dismissing the pattern as too vague.
I'm not using normalized time because increases in runtime are obvious under the conditions I show, and I learned about the benchmark package after drafting this article. I also don't bother with parallelism since I expect a fuzzy search would be used in interactive (hence, concurrent) applications. I just run my simulations on one 2.5 GHz core.
I will use Racket 7.6 CS to report mean timings from 100 searches of 3, 4, 5, 6, 7, and 8-character long needles. I allow up to 10 recursive searches, and up to 256 character matches between the needle and the haystack.
As mentioned, I will make every haystack end with the last character of a prescribed search pattern to force a full traversal of the haystack.
“Everything is fast for small N” -- Jeff AtwoodFor small haystacks (like filesystem paths), speed isn't a concern. In this example, haystacks are just 256 characters long.
[sage@localhost fuzzy-search]$ racket benchmark.rkt 100 3 8 1 256 Mean times of 100 trial(s) using 1 256-character word(s) and 3 to 8-character needles: needle-length: 3 cpu: 0.0 real: 0.0 gc: 0.0 needle-length: 4 cpu: 0.0 real: 0.0 gc: 0.0 needle-length: 5 cpu: 0.0 real: 0.0 gc: 0.0 needle-length: 6 cpu: 0.0 real: 0.0 gc: 0.0 needle-length: 7 cpu: 0.0 real: 0.0 gc: 0.0 needle-length: 8 cpu: 0.0 real: 0.0 gc: 0.0
This shows that the
fuzzy require sub-form probably won't bottleneck anything. Nor will other fuzzy searches that check a set of sufficiently small strings.
Make it Sweat
Let's increase the size of the data set. Here I use 3000 “word” haystacks where each word is 8 characters long. This sloppily simulates searching the full body of a short English-like essay.
[sage@localhost fuzzy-search]$ racket benchmark.rkt 100 3 8 3000 8 Mean times of 100 trial(s) using 3000 8-character word(s) and 3 to 8-character needles: needle-length: 3 cpu: 8.7 real: 8.7 gc: 0.0 needle-length: 4 cpu: 29.44 real: 29.49 gc: 0.0 needle-length: 5 cpu: 123.76 real: 123.87 gc: 0.0 needle-length: 6 cpu: 378.03 real: 378.5 gc: 0.0 needle-length: 7 cpu: 730.77 real: 732.01 gc: 0.0 needle-length: 8 cpu: 1583.37 real: 1585.63 gc: 0.0
Every character you add to a search pattern at least doubles the mean time! So if you need to look up some file or record you don't remember from a set of many candidates, then run
raco pkg install
fuzzy-search. If you need to run an analysis over a series of books, then, uh, weigh your options first.