When I joined Pinterest, my first three weeks were spent in Base Camp, where the newest engineering hires work on real production issues across the entire software stack. In Base Camp, we learn how Pinterest is built by building it, and its not uncommon to be pushing code and making meaningful contributions within just a few days. At Pinterest, newly hired engineers have the flexibility to choose which team theyll join, and working on different parts of the code as part of the Base Camp experience can help with this decision. Base Campers typically work on a variety of tasks, but my project was a deep dive into a MySQL performance optimization project.
We work with MySQL running entirely inside Amazon Web Services (AWS). Despite using fairly high-powered instance types with RAID-0 SSDs and a fairly simple workload (many point selects by PK or simple ranges) that peaks around 2,000 QPS, we had been unable to realize anywhere near the expected IO performance levels. Exceeding roughly 800 write IOPS would lead to unacceptable increases in latency and replication lag, and this replication lag or insufficient read performance on the slaves would slow down ETL and batch jobs, which would then have downstream impact on any other team relying on those jobs. The only options available were either to go to an even larger instance size, thus doubling our cost and obliterating our efficiency, or find ways to make the existing systems perform better.
I took over the project from my colleague, Rob Wultsch, who had already made the significant discovery that Linux kernel version appears to be quite important when running on SSD inside AWS. The default 3.2 kernel that ships with Ubuntu 12.04 just doesnt cut it, nor does the 3.8 kernel that AWS recommends as a minimum version (although its still more than twice as fast as 3.2). Running sysbench on an i2.2xlarge (2 disk RAID-0 of SSDs) instance with kernel 3.2 could barely hit 100MB/sec of 16K random writes. Upgrading the kernel to 3.8 got us to 350MB/sec with the same test, but this was still much lower than expected. Seeing this kind of improvement from such a simple change opened up many new questions and hypotheses about other inefficiencies and poor configuration options: Could we get better performance from an even newer kernel? Should we change other settings at the OS level? Are there optimizations to be found in my.cnf? How can we make MySQL go faster?
In pursuit of answers, I set up almost 60 different sysbench fileIO test configurations with different kernels, filesystems, mount options and RAID block sizes. Once the best fit configuration was chosen from these experiments, I ran another 20 or so sysbench OLTP runs with other system permutations. The basic test methodology was identical across all trials: run the test for an hour collecting metrics at one second intervals, then drop the first 600 seconds to account for cache warm-up time and process the remainder. After the optimal configuration had been identified, we rebuilt our largest and most critical servers and rolled out the changes into production.
Lets take a look at the impact of these changes on some basic sysbench OLTP tests via the p99 response times and throughput metrics at 16 and 32 threads for several different configurations.
Here is what each of the numbers represent:
When we enable all of the optimizations, we find we can achieve roughly 500 percent more read and write throughput at both 16 and 32 threads while simultaneously reducing p99 latency by over 500ms in both directions. On the read side, we go from approximately 4100 4600 QPS to just over 22000 25000, depending on concurrency. On the write side, we go from approximately 1000 QPS to 5100 6000 QPS. These are massive gains in headroom and performance achieved with just a few simple changes.
Of course, all the synthetic benchmarks in the world dont mean much if they dont translate into real-world results. The graph below shows latency on our primary clusters from both the client and server perspective from several days before the upgrades were pushed until several days after all the masters were upgraded. The process took just a week to complete.
The red line represents client-perceived latency, and the green represents server-measured latency. From the client perspective, p99 latency dropped from a highly-variable 15-35ms with outliers over 100ms down to a pretty flat 15ms with outliers at 80ms or less. Server-measured latency also declined from a wavy 5-15ms to a basically-flat 5ms, with a daily 18ms spike due to system maintenance. Furthermore, since the beginning of the year, our peak throughput on these clusters has increased about 50 percent, so not only are we handling considerably more load (still well under our estimated capacity), were doing it with much better and more predictable throughput. And, in what can only be termed good news for everyone who enjoys sleeping through the night, the number of pageable incidents related specifically to system performance or general server overload dropped from over 300 in March down to less than 10 in April and May combined.
For more information, including the before-and-after details of what our MySQL and OS configurations look like, check out the slides from All Your IOPS Are Belong To Us, my talk from the 2015 Percona Live MySQL Conference/Expo, and stay tuned for more insights on how we get the most out of MySQL, Redis and other storage technologies.
Ernie Souhrada is a database engineer on the SRE team, part of the Cloud Engineering team.