
SparkTG Engineering

How we store 400M phone number data with fast lookups

If you live in India and do not wish to receive unsolicited telemarketing calls, you can register with the National Customer Preference Register (NCPR)[1]. The government body maintains a database of all the numbers registered with it; right now, there are about 400M registered numbers. All registered telemarketers get access to the updated data so that they can place calls according to the preferences of the callee.

The data is provided as a bunch of zip files (40 at the moment), each containing 10M rows of CSV data. This writeup deals with how this 2.4GB of compressed data can be fit into 2GB of memory in a searchable format, based on some simple observations.

The Data

Below is an excerpt from one of the CSV dumps (some digits are garbled for privacy reasons):

"Service Area Code","Phone Numbers","Preferences","Opstype","Phone Type"

Note on storing in a SQL engine

On a 4GB Linode machine, loading the data into a PostgreSQL table (using COPY) takes 10 minutes:
real    10m0.159s
user    2m42.243s
sys     0m26.363s
and adding a primary key takes about 1.5 to 2 hours:
real    118m21.637s
user    0m0.043s
sys     0m0.020s
and the result takes 32GB of disk space:
 tablename | indexname     | num_rows    | table_size | index_size | unique | number_of_scans | tuples_read | tuples_fetched
 registry  | registry_pkey | 4.02161e+08 | 20 GB      | 12 GB      | Y      | 1               | 1           | 1

Observations on CSV data

After analysing the data, it can be noticed that the fields take only a small number of distinct values: the service area code fits in 5 bits, the phone type in 2 bits, the preferences in 7 bits, and the opstype is a single binary flag.

This means that a single row can be represented in two bytes.

Byte 1 is organised as (1-bit existence flag, 5-bit service area code, 2-bit phone type)
Byte 2 is organised as (7-bit preferences, 1-bit opstype)

        | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
Byte 1  | E |  sac (5-bit)      |  pt   |
Byte 2  |        pref (7-bit)       | o |

This means that the data can be represented in roughly 2 * 400 MB = 800 MB. The existence flag will be discussed in the next section.
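The two-byte layout above can be sketched with a little bit twiddling. This is an illustrative packing/unpacking pair, not the project's actual code; the field widths follow the diagram, and the concrete field values used are hypothetical.

```python
def pack_row(exists, sac, phone_type, preferences, opstype):
    """Pack one registry row into two bytes, per the bit diagram."""
    assert 0 <= sac < 32 and 0 <= phone_type < 4
    assert 0 <= preferences < 128 and opstype in (0, 1)
    byte1 = (int(exists) << 7) | (sac << 2) | phone_type
    byte2 = (preferences << 1) | opstype
    return bytes([byte1, byte2])


def unpack_row(two_bytes):
    """Inverse of pack_row: recover the fields from two bytes."""
    b1, b2 = two_bytes
    return {
        "exists": bool(b1 >> 7),
        "sac": (b1 >> 2) & 0x1F,
        "phone_type": b1 & 0x03,
        "preferences": (b2 >> 1) & 0x7F,
        "opstype": b2 & 0x01,
    }
```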

Make it searchable

The entries need to be looked up frequently by phone number, and we haven't mapped the data to the phone numbers yet. We need to add bytes storing the phone number. Unfortunately, a 10-digit number won't fit in 32 bits, and spending 5 * 400 MB on the numbers alone is not a happy situation; worse, the result is not readily searchable. What if the data is arranged in a sequence so that the memory locations at indices (2 * number) and (2 * number + 1) give the required two bytes? An empty row can then be represented by clearing the existence flag in the first byte. This means that we require 20 GB of memory (2 bytes * 10B possible numbers). Can we squeeze it further? The array looks sparse (400M of 10B slots, about 4% occupied).
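The direct-indexing idea can be sketched as follows, using a toy range of numbers instead of the full 10-digit space. The existence-flag convention matches the byte layout above; everything else is illustrative.

```python
EXISTS = 0x80  # existence flag: the top bit of the first byte


def lookup(arr, number):
    """Return the two row bytes for `number`, or None if not registered."""
    b1 = arr[2 * number]
    if not (b1 & EXISTS):
        return None
    return arr[2 * number], arr[2 * number + 1]


# toy registry covering numbers 0..9
registry = bytearray(2 * 10)
registry[2 * 7] = EXISTS | 0x05   # mark number 7 as present
registry[2 * 7 + 1] = 0x42
```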

The workaround we settled on is two types of format. We will come to that later.

More observations

Another observation is that the array is dense for most of the mobile number series[2]. So, if a 10-digit number is split into two parts, a 4-digit prefix (we will call it the head) and a 6-digit offset (the tail), then all possible values for a fixed head, laid out sequentially, fit in 2MB (2 bytes for each of the 10^6 tails). Searching is easy because it is a direct array access, with the offset computed from the value of the tail.
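The head/tail split and the dense-box access can be sketched like this; the function names are ours, not the project's.

```python
def split_number(number):
    """Split a 10-digit number into a (4-digit head, 6-digit tail) pair."""
    return divmod(number, 1_000_000)


def dense_lookup(box, tail):
    """Direct array access into a dense box: 2 bytes per tail."""
    return box[2 * tail], box[2 * tail + 1]
```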

The sparse series are stored as 5-byte records, with 3 bytes representing the tail and 2 bytes representing the data. The tails are arranged in ascending order so that searching is easy (binary search!).
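A sparse-box lookup along those lines might look like the sketch below: binary search over sorted 5-byte records. The big-endian tail encoding is an assumption on our part; the writeup doesn't specify the byte order.

```python
def sparse_lookup(box, tail):
    """Binary search sorted 5-byte records (3-byte tail + 2 data bytes)."""
    lo, hi = 0, len(box) // 5
    while lo < hi:
        mid = (lo + hi) // 2
        rec = box[5 * mid : 5 * mid + 5]
        rec_tail = int.from_bytes(rec[:3], "big")
        if rec_tail == tail:
            return rec[3], rec[4]     # the two data bytes
        if rec_tail < tail:
            lo = mid + 1
        else:
            hi = mid
    return None                       # tail not present in this box
```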

For persistent storage, numbers with the same prefix are stored in per-head files, with the first byte indicating the type of box. This takes a total of 1.8GB at the moment, which can be kept in memory and served via a webserver.
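Reading such a file reduces to peeling off the type byte and handing the payload to the right lookup routine. This is a minimal sketch of that convention, assuming type 0 means dense and type 1 means sparse as described later in the writeup.

```python
def load_box(data):
    """Split raw per-head file contents into (box_type, payload).

    box_type 0 -> dense box (2 bytes per tail),
    box_type 1 -> sparse box (sorted 5-byte records).
    """
    return data[0], data[1:]
```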


A quick Python script showed that transforming the CSV data into our format takes a lot of time. Profiling showed that most of the time is spent on the 2M iterations while dumping the data for a fixed head. We tried to optimize this using xrange, but 5 hours was too long to process the entire data set, especially when PostgreSQL was taking only 2 hours. We wanted something fast, something close to the metal. The same program was ported to Rust, and it processed the entire data set in 20-30 minutes:

real    21m4.284s
user    20m58.427s
sys     1m37.607s

Lookup Timings

To measure the speed of our solution, we generated random phone numbers in the same series (fixed head). The results are plotted below. Heads "9818" and "9000" were chosen to get the timings for searching a dense box (we call it type 0) and a sparse box (type 1) respectively. For the SQL solution, the density of the head doesn't matter. Note that in our solution no disk access is required once the data is loaded and fits in memory, the latter being facilitated by the data storage format, although the timings do include disk read and write times for a fair comparison.

All tests were done on 4GB Linode machines - SSD, 4GB RAM, 4 virtual CPU cores, CPU Model: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz

API and open sourcing

At SparkTG, we respect the callee's preferences. Although our customers communicate mostly with their registered customer base, we still make sure that they don't end up making an unsolicited call. We have open sourced the project[3] and provided an API[4] to look up the NCPR status of a number, in the hope that telemarketers who find it difficult to manage the registry database get an easy way out.

- BilalSparkTG


  1. National Customer Preference Register (NCPR)
  2. Mobile number series
  3. Github repository
  4. NCPR status lookup

If you are looking for an opportunity or need a cloud telephony solution, call us!
