Covering elections is a staple of American journalism. I've covered elections as a reporter, and I've helped display election data in drastically different ways at three news organizations.
So first, a little primer on elections data. Generally speaking, on election night, the data for vote totals is tabulated by county boards of election and then sent to a state-level board. Next, the data is harvested by vendors such as Ipsos and the Associated Press. Until recently, the only nationwide election data vendor for news organizations was the AP. While other data vendors exist, they usually focus on more niche markets, such as campaigns and political parties.
The AP has a physical person in every U.S. county to report back what the current vote totals are for different races. It's incredibly costly, but it means you can dive deep into trends in the data. The AP has a system that lets you FTP in and download the data in XML or CSV format, which your publication can then display.
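The FTP-then-parse workflow can be sketched in a few lines of Python. Everything specific here is made up for illustration: the host, credentials, file path and column names are hypothetical, and the AP's actual feed layout differs.

```python
import csv
import ftplib
import io

def download_results(host, user, password, remote_path):
    """Fetch a results file over FTP and return its text."""
    buf = io.BytesIO()
    with ftplib.FTP(host) as ftp:
        ftp.login(user, password)
        ftp.retrbinary("RETR " + remote_path, buf.write)
    return buf.getvalue().decode("utf-8")

def parse_vote_totals(csv_text):
    """Turn a CSV of per-candidate rows into {candidate: votes}."""
    totals = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["candidate"]] = int(row["votes"])
    return totals

# Live usage (hypothetical host and path):
#   text = download_results("electionsftp.example.com", "user", "pass",
#                           "US_topofticket.csv")
#   totals = parse_vote_totals(text)
```

Once the data is a plain dict, rendering it as HTML tables or handing it to a template system is straightforward.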
The AP doesn't always get state, county or local-level election data in this same manner. Thankfully, most states (and some counties) have online data portals, RSS feeds or APIs that can be downloaded, scraped or accessed to get the data you're looking for. In some places, though, a real person has to sit in an election board's offices and get the election data back to the news organization somehow, typically by calling or emailing.
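When a state portal only publishes an HTML results table, scraping it can be done with just the standard library. This is a minimal sketch under invented assumptions: the URL and table layout are hypothetical, and real portals vary wildly (HTML tables, RSS feeds, JSON APIs).

```python
from html.parser import HTMLParser
from urllib.request import urlopen  # only needed for a live fetch

class ResultsTableParser(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row:
            self.rows.append(self._row)

    def handle_data(self, data):
        if self._in_cell and data.strip():
            self._row.append(data.strip())

def scrape_rows(html):
    """Return the results table as a list of [cell, cell, ...] rows."""
    parser = ResultsTableParser()
    parser.feed(html)
    return parser.rows

# Live usage (hypothetical URL):
#   html = urlopen("https://results.example.gov/county").read().decode()
#   rows = scrape_rows(html)
```

In practice you would add error handling and re-check the parser every time the board redesigns its page, which happens more often than you'd like.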
While displaying data online may get a lot of attention these days, remember that many news organizations still print something every day. So news organizations have also needed to solve the problem of importing AP election data into their print editions, too — generally through decades-old pagination systems.
Now let's talk about the differences between the three places I've wrangled election data for.
In 2010, I was a newbie developer at the now-renamed The New York Times Regional Media Group. I started a few weeks before the 2010 midterm elections. My new coworkers had already built a system to FTP into the AP, import the data into a MySQL database and then display it on our 14 news websites using iframes hitting tables built in PHP.
I helped by load-testing, or seeing how much traffic the project could take, while we were running importation tests of the AP's test data runs. By my estimations using Siege, I thought we were in the clear, with 2,500 hits a minute not crippling anything. If election night traffic had indeed been 2,500 hits a minute, we might have been in the clear. We were not.
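We used Siege, a command-line tool, but the idea behind a basic load test can be sketched in Python: fire many concurrent requests and count successes and failures. The fetch function is injectable so you can point it at a real URL (say, via `urllib.request.urlopen`) or at a stub; the numbers below are illustrative, not what we actually ran.

```python
import concurrent.futures

def load_test(fetch, n_requests, concurrency):
    """Run fetch() n_requests times across `concurrency` workers.

    Returns (successes, failures). fetch() should raise on error.
    """
    ok = fail = 0
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(fetch) for _ in range(n_requests)]
        for f in concurrent.futures.as_completed(futures):
            try:
                f.result()
                ok += 1
            except Exception:
                fail += 1
    return ok, fail

# Against a live site you might run something like:
#   from urllib.request import urlopen
#   ok, fail = load_test(
#       lambda: urlopen("https://example.com/results").read(),
#       n_requests=2500, concurrency=50)
```

The lesson I learned the hard way: test at the traffic level you fear, not the level you expect, and test while the import job is writing to the database at the same time.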
If memory serves, we had one EC2 medium instance running to import and display the data and a medium MySQL instance running for the database. I didn't know about caching and thought it was just something that was turned on automatically. It wasn't.
On election night, two of our newspapers received election data first, and things ran smoothly with them, as they were in smaller markets. Then the Florida papers started getting heavy traffic. Our EC2 instances became bottlenecked, stuck at 99 percent CPU usage, unable to read the AP data, let alone write updates to the database.
This brought all 14 of the newspaper websites to a crawl because these iframes were getting loaded before almost anything else on the page. In the end, homepage editors took the iframes off the pages, a coworker wrote some SQL to hand-optimize the election tables and, by then, traffic to the sites had subsided to reasonable levels.
It was the scariest night of my professional life. Thankfully, most of the newspapers were happy, as they hadn't ever even attempted to display live election data on their websites, so this was still an improvement for them. And I learned to set up caching — in later cases, Varnish — when attempting to hit a live database in any way.
Next, I was at the Boston Globe during the 2012 general primaries. As then-hopeful Mitt Romney was the former governor of Massachusetts, the Boston Globe was a major source for news and coverage of the GOP primary battle. And the New Hampshire primaries were that paper's bread and butter.
But the team I worked on had a fun logistical problem: We needed to display the data on two websites, Boston.com and the newly launched BostonGlobe.com. Each ran in a different content management system, each had different styles and each wanted the data displayed a little differently.
The first problem we had to solve was how to pull in the data. The Boston Globe's CMS was Methode, which stored everything — stories, photos, etc. — as pieces of content in XML. As the AP already provided data in an XML format, we would just need to import it, change some of the tags to better suit the Methode ingestion system and then I would write the code necessary to display the data.
Thankfully, the Boston Globe's systems staff quickly figured out how to go in, download the XML data and put it into a spot in the CMS that I could access. We had created mockups and styles for displaying the data responsively — still a new concept at the time — and now had to pull in the data, via some incredibly ugly Java I wrote.
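The tag-remapping step is conceptually simple: walk the AP XML and rename tags to whatever the ingestion system expects. Here's a sketch in Python rather than the Java we actually used; the tag names are invented, and Methode's real schema is considerably more involved.

```python
import xml.etree.ElementTree as ET

def rename_tags(xml_text, mapping):
    """Return xml_text with every tag renamed per `mapping`.

    Tags absent from the mapping are left alone.
    """
    root = ET.fromstring(xml_text)
    for el in root.iter():
        el.tag = mapping.get(el.tag, el.tag)
    return ET.tostring(root, encoding="unicode")

# Hypothetical AP-style input and CMS-style mapping:
#   rename_tags("<Race><Candidate>Smith</Candidate></Race>",
#               {"Race": "race", "Candidate": "name"})
```

The real work is in knowing the target schema cold; the transformation itself is the easy part.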
We didn't have time to do something similar with the Boston.com CMS, which was at the time, I believe, going on 12 years old and somewhat fragile. So we decided to build separate styles and templates in BostonGlobe.com that we could iframe into Boston.com. Not the best way to do things, but it's how we did it.
And then, as the primaries started happening more and more frequently, I had to make each primary its own chunk of code, violating the DRY principle repeatedly, while trying to get everything deployed to production in time for the producers to slot the items on the various homepages.
Another coworker had an old Python script that created basic HTML tables for county/town election totals and pushed them into Boston.com, for a more in-depth look. Lots of moving parts: different content management systems, different styles, a lot of work for the small number of people working on it.
Now I'm at the Chicago Tribune. In 2012, my coworkers built a system that pulled AP election data into a Django site with Varnish in front for caching. For local races, they pulled data that Chicago Tribune staffers entered into Google spreadsheets, based on information gleaned from various county board of election sites, which was then turned into flat files as well. And the AP data was pulled into our pagination system for the print product through tables the AP sent, just as it had been done in previous elections.
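One common way to treat a Google spreadsheet as a data source is to fetch its CSV export and parse it. This is a sketch under assumptions: the export URL pattern below works for sheets shared publicly, and the spreadsheet key and column names are hypothetical, not the Tribune's actual setup.

```python
import csv
import io
from urllib.request import urlopen  # only needed for a live fetch

def sheet_csv_url(key, gid=0):
    """Build the CSV export URL for a publicly shared Google spreadsheet."""
    return ("https://docs.google.com/spreadsheets/d/%s/export?format=csv&gid=%d"
            % (key, gid))

def rows_from_csv(csv_text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(csv_text)))

# Live usage (hypothetical key and columns):
#   text = urlopen(sheet_csv_url("SPREADSHEET_KEY")).read().decode("utf-8")
#   for row in rows_from_csv(text):
#       print(row["race"], row["candidate"], row["votes"])
```

From there, the rows can be baked into flat files on whatever schedule the newsroom needs.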
Fast forward to a month ago. The Chicago Tribune no longer subscribes to the Associated Press, but Reuters has entered the election data game. Instead of having to FTP in and download XML files, we hit an API and receive JSON. It's pretty nifty and much more conducive to building web-facing applications.
We wrote a Python wrapper to hit the Reuters API and reformat the data for our purposes, and then we again built flat pages based on that data, using Django Medusa. And for local elections and referenda that Reuters wasn't covering, we again had Tribune staffers entering data into Google spreadsheets.
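The core of a wrapper like that is reshaping: flattening a nested JSON payload into the flat rows a template needs. Here's a toy version; the field names (`name`, `candidates`, `votes`) are invented for illustration and are not Reuters' actual schema.

```python
def simplify_race(race):
    """Flatten one race's JSON into per-candidate rows, leader first.

    Assumes a hypothetical payload shape:
      {"name": ..., "candidates": [{"name": ..., "votes": ...}, ...]}
    """
    total = sum(c["votes"] for c in race["candidates"]) or 1  # avoid div-by-zero
    rows = []
    for c in sorted(race["candidates"], key=lambda c: c["votes"], reverse=True):
        rows.append({
            "race": race["name"],
            "candidate": c["name"],
            "votes": c["votes"],
            "pct": round(100.0 * c["votes"] / total, 1),
        })
    return rows
```

Keeping the reshaping in one well-tested function means the templates stay dumb, which pays off when you're updating results every few minutes on election night.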
We still had to write a custom system that takes the Reuters and Google spreadsheet data and sends it to our pagination system. This required figuring out how the data needed to look — basically a mix of XML-ish template tags and tables — and then FTPing it to an area where our pagination system could ingest the files, give them proper templating and allow page designers to put them on pages.
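A toy version of that print hand-off: render results into XML-ish template tags, then FTP the file to where the pagination system can pick it up. The tag names, host and path here are all invented for illustration; the real markup is dictated by the pagination system.

```python
import ftplib
import io

def render_print_feed(race_name, rows):
    """Render rows of {candidate, votes} into pagination-style markup."""
    lines = ['<race name="%s">' % race_name]
    for row in rows:
        lines.append('  <result candidate="%s" votes="%d"/>'
                     % (row["candidate"], row["votes"]))
    lines.append("</race>")
    return "\n".join(lines)

def upload_feed(text, host, user, password, remote_path):
    """Push the rendered feed to the pagination system's FTP drop."""
    with ftplib.FTP(host) as ftp:
        ftp.login(user, password)
        ftp.storbinary("STOR " + remote_path, io.BytesIO(text.encode("utf-8")))

# Live usage (hypothetical host and path):
#   feed = render_print_feed("Governor", rows)
#   upload_feed(feed, "pagination.example.com", "user", "pass", "gov.xml")
```

Separating rendering from uploading lets you eyeball the generated file before anything touches the print system, which is worth doing on deadline.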
Elections are big events traffic-wise, and static sites take large traffic well. With the Boston Globe and Chicago Tribune solutions of using basically static sites (XML and sites baked out to S3), there was little freaking out at 9 p.m. when you're getting thousands of pageviews a second. If you're making lots of calls to your database while it's also reading and writing, you're going to have a bad time. Static sites are wicked great.
Testing is important, but knowing what to test is more important. At The New York Times Regional Media Group, I thought I knew what I was doing, but I was testing for unrealistically low traffic and didn't think about what would happen while the system was also trying to write election data to the database. I now know I could have asked folks on the NICAR listserv for help, or tweeted questions, or really just asked anyone with a few years of experience, "Hey, will this work?"
Election nights are stressful, so be cheerful and smiley. We at team Trib Apps try to be cheerful and kind whenever working with anyone, but with this many moving parts, it never hurts to think "smile while saying words" when conversing with other folks. We're all working hard on these nights, and I'm a big fan of not adding any extra stress to people's lives. That's also part of what our technology is supposed to do: make things easier for folks in the newsroom.
Have a point person from the tech side to coordinate with the newsroom. When local election data started coming in, I stood in the area where folks were entering it into Google spreadsheets, just so someone was around to answer questions on the spot, while David Eads, who was the lead developer on the elections project, made sure the technical side was running smoothly. We had only one minor hiccup, which was quickly fixed, and we were able to identify it because we were all near one another and able to communicate effectively. Even though we work with machines, this job is mostly about communication between humans.
Know that you're going to be covering an election again, and make your code reusable. When we were writing our code for the primary, we knew a general election was coming up in November. We also knew that other Tribune newspapers would be itching to show election results, so we needed to get the fundamentals right the first time.
We would love to hear about your experiences with election data. Please feel free to add a comment and tell us your story.