One year ago, searching for polinomials on Khan Academy's search page would give you no results. If you typed the same thing into Google you'd be efficiently and politely corrected.
I didn't have any illusions of making a solution as good as Google's, but I figured I could improve things for a significant number of users anyways .
The first thing I did was see if there was some library or tool that we could use to do some simple spellchecking. Khan Academy runs on Google App Engine's Python platform, so I needed a pure-Python library (installing CPython extensions is not allowed).
Whoosh was a good candidate, but it wasn't as memory efficient as I wanted . Integrating Whoosh with Google App Engine also looked error-prone. After looking around some more and not finding anything super useable, I decided to try something crazy.
I decided to build my own pure-Python spell checker.
I was expecting my mentor and others to balk at the idea (I was an intern during this time). But all I received were encouraging nods, so off I went.
To get things up and going quickly, I chose to follow a simple brute force algorithm like the one Peter Norvig describes in his awesome blog post. Soon I had something running that worked well and was super fast. There was a problem though.
Storing our English words in a Python dict consumed about 18 MB of space . Since my hope was that this code could run on all our frontend instances and work for all languages , our infrastructure team and I weren't super excited by this.
To reduce memory I first tried using Python's array module to build my own immutable hash table. This did indeed bring our memory usage down but made spellchecking take several seconds per query.
I found that the loss of speed came from doing way more things in Python code instead of CPython's super-fast dict implementation. This ruled out improving the performance through clever data structures. So I had to give up on my plans for an awesome succinct trie implementation and instead go hunting through the standard library to find the best native solution available.
Thus I arrived at the binary search implementation in the bisect module of Python's standard library.
The idea is simple. Store a hash of each word in a sorted array and then do binary search on that array. The hashes are small and can be tightly packed in less than 2 MB. Binary search is fast and allows the spell checking algorithm to service any query.
Best of all, it works.