We have a lot of important, commonly used data that's modified infrequently but accessed at a very high rate. One example is our spam domain blacklist. Since we don't want to show Pinners spammy Pins, our app/API server needs to check a Pin's domain against this blacklist when rendering the Pin. This is just one example, but there are hundreds of thousands of Pin requests every second, which generates enormous demand for access to this list.
Previously, we stored this kind of list in a Redis sorted set, which made it easy to keep the list in time-sorted order. We also have a local in-memory and file-based cache that's kept in sync by polling the Redis host for updates. Things went well in the beginning, but as the number of servers and the size of the list grew, we began to see a network saturation problem. In the five minutes after the list was updated, all the servers tried to download the latest copy of the data from a single Redis master, saturating the network on that master and causing a lot of Redis connection errors.
We identified a few potential solutions:
As each of the above solutions has its own shortcomings, we asked ourselves: how would we design a solution if we were building from the ground up?
Formalizing the requirements of the problem:
We engineered a solution by combining the solutions to the smaller problems:
This is conceptually similar to the design of ZooKeeper resiliency, but if we stored all of the data in one ZooKeeper node, an update would still cause a huge spike in network traffic on the ZooKeeper nodes. Since ZooKeeper is distributed, the load would be spread across multiple ZooKeeper nodes. Even so, we didn't want to burden ZooKeeper unnecessarily, as it's a critical piece of our infrastructure.
We finally arrived at a solution where we use ZooKeeper as the notifier and S3 for storage. Since S3 provides very high availability and throughput, it seemed to be a good fit for absorbing the sudden load spikes in our use case. We call this solution managed list, aka config v2.
Config v2 takes full advantage of the work we have already done, except that the source of truth is in S3. Further, we added logic to avoid concurrent updates and to deal with S3's eventual consistency. We store a version number (that's actually a timestamp) in a ZooKeeper node, and the same value is used as a suffix of the S3 path to identify the current data.
If a managed list's data needs to be modified, a developer has the option to change it via an admin web UI or a console app. The following steps are executed by the Updater app on save:
As soon as the Zk node's value is updated, ZooKeeper notifies all its watchers. In this case, that triggers the Daemon processes on all servers to download the data from S3.
Amazon's S3 gives great availability and durability guarantees even under heavy load, but it's eventually consistent. What we needed was read-after-write consistency. Fortunately, it does give read-after-create consistency in some regions*. So instead of updating the same S3 file, we create a new file for every write. This introduces a new problem: synchronizing the new S3 filename across all the nodes. We solved it by using ZooKeeper to keep the filename in sync across all the nodes.
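The flow described above can be sketched with in-memory stand-ins for S3 and ZooKeeper. This is a minimal illustration, not our production code: `FakeS3`, `FakeZkNode`, `s3_key`, `publish` and `Daemon` are hypothetical names, the `name.version` key format is an assumption, and so is the ordering (write the new object first, then update the ZooKeeper node).

```python
import time


class FakeS3:
    """In-memory stand-in for S3: a dict of key -> bytes (hypothetical)."""

    def __init__(self):
        self.objects = {}

    def put(self, key, data):
        self.objects[key] = data

    def get(self, key):
        # Returns None when the key is not visible, mimicking "file not found".
        return self.objects.get(key)


class FakeZkNode:
    """In-memory stand-in for one ZooKeeper node with watchers (hypothetical)."""

    def __init__(self):
        self.value = None
        self.watchers = []

    def watch(self, callback):
        self.watchers.append(callback)

    def set(self, value):
        self.value = value
        for callback in self.watchers:
            callback(value)


def s3_key(name, version):
    # Assumed key format: the version stored in ZooKeeper is the path suffix.
    return "%s.%d" % (name, version)


def publish(s3, zk_node, name, data, version=None):
    """Updater side: create a brand-new object, then bump the version."""
    if version is None:
        version = int(time.time() * 1000)  # timestamp used as version number
    s3.put(s3_key(name, version), data)    # new file per write, no overwrite
    zk_node.set(version)                   # watchers fire only after the write
    return version


class Daemon:
    """Server side: re-download the list whenever the watched version changes."""

    def __init__(self, s3, zk_node, name):
        self.s3 = s3
        self.name = name
        self.version = None
        self.data = None
        zk_node.watch(self.on_version)

    def on_version(self, version):
        data = self.s3.get(s3_key(self.name, version))
        if data is None:
            return  # not visible yet: keep the old copy until a later refresh
        self.version, self.data = version, data
```

Because every write lands under a new key, a reader never depends on an overwritten object becoming visible; it only needs the newly created key, whose name it learns from the ZooKeeper notification.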
When a new feature or service is ready for launch, we gradually ramp up traffic in the new code path and check to make sure everything is good before going all in. This resulted in the need to build a switch that can allow a developer to decide how much traffic should be sent to the new feature. Also, this traffic ramp-up tool (aka Decider) should be flexible enough so that developers can add new experiments and change the values of existing experiments without requiring a re-deploy to the entire fleet. In addition, any changes should converge quickly and reliably across the fleet.
Every experiment is a ZooKeeper node and has a value [0-100] that can be controlled from the web UI. When the value is changed from the web UI, it's updated in the corresponding node, and ZooKeeper takes care of updating all the watchers. While this solution worked, it was plagued with the same scaling issues we previously experienced, since the entire fleet was connecting directly to ZooKeeper.
Our Decider framework consisted of two components: a web-based admin UI to control the experiments and a library (both in Python and Java) that can be plugged in where branch control is needed.
Once we realized the gains of managed list, we built managed hashmap and migrated the values of all Zk nodes containing the experiments. Essentially, the underlying managed hashmap file content is a JSON dump of the hash table that contains experiment names as the keys and an integer [0-100] as the value.
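    import random

    def decide_experiment(experiment_name):
        # experiments: assumed name for the dict loaded from the managed hashmap (name -> 0-100)
        return random.randrange(0, 100, 1) < experiments.get(experiment_name, 0)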
How this is used in code:
    if decide_experiment("my_rocking_experiment"):
        pass  # new code
    else:
        pass  # existing code
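To make the link between the JSON file and the ramp-up decision concrete, here is a small self-contained sketch. The `Decider` class, its `load` method and the injectable `rng` are hypothetical names used for illustration; the only parts taken from the description above are the JSON map of experiment name to an integer in [0-100] and the random draw against that value.

```python
import json
import random


class Decider:
    """Hypothetical wrapper around a managed hashmap of experiments."""

    def __init__(self, rng=None):
        self.experiments = {}
        # An injectable random source makes the behaviour testable.
        self.rng = rng or random.Random()

    def load(self, json_text):
        # Swap in the whole map at once so readers never see a partial update.
        self.experiments = json.loads(json_text)

    def decide_experiment(self, experiment_name):
        # Unknown experiments default to 0, i.e. always off.
        value = self.experiments.get(experiment_name, 0)
        # randrange(0, 100) yields 0..99, so a value of 100 is always on
        # and a value of 0 is always off.
        return self.rng.randrange(0, 100) < value
```

With this shape, ramping an experiment from 10 to 50 only requires publishing a new JSON file; no code is redeployed.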
We use the terms dark read and dark write when we duplicate a production read or write request and send it to a new service. We call it dark because the response from the new service doesn't impact the original code path, whether it's a success or a failure. If asynchronous behavior is needed, we wrap the new code path in gevent.spawn().
Here's a code snippet for dark read:
    try:
        if decider.decide_experiment("dark_read_for_new_service"):
            new_service.foo()
    except Exception as e:
        log.info("new_service.foo exception: %s" % e)
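The same pattern can be factored into a reusable helper. The sketch below is an illustration under assumptions: `dark_call` and `read_pin` are hypothetical names, and the services are passed in as plain callables so the example runs without any external dependency.

```python
import logging

log = logging.getLogger("dark")


def dark_call(enabled, func, *args, **kwargs):
    """Run func only when enabled; never let its result or failure escape."""
    if not enabled:
        return
    try:
        func(*args, **kwargs)
    except Exception as e:
        log.info("%s exception: %s", getattr(func, "__name__", func), e)


def read_pin(pin_id, old_service, new_service, dark_enabled):
    # Production path: its result is the only one returned to the caller.
    result = old_service(pin_id)
    # Dark read: duplicate the request, ignore the response and any error.
    dark_call(dark_enabled, new_service, pin_id)
    return result
```

For the asynchronous variant mentioned above, the `dark_call(...)` line could be handed to `gevent.spawn(dark_call, dark_enabled, new_service, pin_id)` so the production path does not wait on the new service.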
*In the rare event that S3 returns file not found due to eventual consistency, the affected nodes will still catch up, because the daemon is designed to refresh all the content every 30 minutes. So far, we haven't seen any instances where the nodes got out of sync for more than a few minutes.
If you're interested in working on engineering challenges like this, join our team!
Pavan Chitumalla and Jiacheng Hong are software engineers on the Infrastructure team.