A news and social media case study.
On 25th April 2015 at 11:56am local time (06:11 GMT), a devastating earthquake measuring 7.8 on the Richter scale hit Nepal. The epicentre was close to rural Gorkha and, as such, many people now refer to the event as the Gorkha Earthquake. The affected area was large and included the Nepalese capital, Kathmandu, home to about one million people. The map below, courtesy of USAid, shows the location of the epicentre of the main earthquake and the subsequent aftershocks, as well as the intensity as felt in different locations.
USAid Official Map of the region affected by the Nepalese Earthquake.
Whenever natural disasters of this scale occur, there is a large international humanitarian response, with aid agencies and volunteers arriving from around the world to help the afflicted. This work is difficult to co-ordinate (though the UN do fulfil this role): infrastructure can be damaged and information sparse. Co-ordination proved particularly difficult in the case of Nepal, as aid agencies also had to deal with landslides, aftershocks and, of course, some of the largest mountains in the world. The monsoon season started in early June, adding heavy rain to the list of difficulties.
The Department for International Development (DfiD) in the UK and Nepal were interested in seeing what signals, if any, there are in social media that could help them to understand the situation and needs in the most affected districts in Nepal. The analysis described below was performed with Ripjar's toolset after the event, to ascertain whether similar techniques should be considered for future crises.
The first step in this analysis was to collate a large dataset with which these questions might be answered. Searching across social and mainstream news media for terms relating to the distribution of aid in Nepal, for the period 25th April to 17th June, yielded a little over 1.5 million items. The following table breaks down the data by source:
Breakdown by source of the Nepal Earthquake dataset
The vast majority of the data comes from Twitter, with a large number of news articles from about 50,000 digital news sources.
In the analysis of this dataset below, two main threads of interest are pursued: the first is how real-world events can be detected within such a dataset, and the second is whether the dataset contains any information that could have been useful in discovering where people needed help in the aftermath of the earthquake.
News and social media are both very good sources of data for detecting real-world events. Looking at how the data volumes vary with time, we see a couple of significant peaks:
Data Volume by Time for the Nepalese Earthquake and Subsequent 1.5 Month Period
The first peak is the main coverage of the earthquake: a monumental spike in the volume of data, peaking at over 120k items in a day. The second peak was caused mainly by news of a US Marine helicopter going missing, although there was also a significant amount of data about Cristiano Ronaldo donating $7M to help Nepal and a smaller amount about a significant aftershock.
Looking within the data for the term “aftershock”, we get an insight into the severity of the aftershocks that followed the earthquake:
Data Volumes by Time for Items containing the term “aftershock”
In particular, the peak around the 12th May maps to one of the most significant aftershocks on the Chinese border, thought to be responsible for 200 deaths.
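Volume-by-time charts like the two above come down to bucketing items by day and, optionally, keeping only the items that contain a given term. The following is a minimal sketch of that idea, not a description of how Ripjar's toolset works: the `(timestamp, text)` item layout, the function names and the rule of flagging days above a multiple of the median are all assumptions made for illustration, and the sample items are invented.

```python
from collections import Counter
from datetime import datetime
from statistics import median


def daily_volumes(items, term=None):
    """Count items per calendar day, optionally keeping only those containing `term`.

    `items` is an iterable of (datetime, text) pairs; this layout is hypothetical.
    """
    counts = Counter()
    for timestamp, text in items:
        if term is None or term.lower() in text.lower():
            counts[timestamp.date()] += 1
    return counts


def peak_days(counts, factor=3.0):
    """Return the days whose volume exceeds `factor` times the median daily volume.

    The median-multiple rule is an assumed, deliberately simple heuristic.
    """
    if not counts:
        return []
    baseline = median(counts.values())
    return sorted(day for day, n in counts.items() if n > factor * baseline)


# Invented sample items, purely to show the call pattern.
sample = [
    (datetime(2015, 5, 10, 9, 0), "Aid convoy reaches the valley"),
    (datetime(2015, 5, 11, 9, 0), "Relief flights resume"),
    (datetime(2015, 5, 12, 7, 30), "Strong aftershock felt in Kathmandu"),
    (datetime(2015, 5, 12, 7, 35), "Another aftershock, people running outside"),
    (datetime(2015, 5, 12, 8, 0), "Aftershock damages more buildings"),
    (datetime(2015, 5, 12, 8, 5), "Helicopters diverted after the tremor"),
    (datetime(2015, 5, 13, 9, 0), "Tents needed in Dhading"),
]

all_counts = daily_volumes(sample)
aftershock_counts = daily_volumes(sample, term="aftershock")

print(peak_days(all_counts))      # days that stand out from the baseline
print(dict(aftershock_counts))    # per-day counts for the term only
```

On real data a more robust detector would be needed, since the baseline itself drifts as interest in the story fades.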
This earthquake gives a unique insight into the speed of various media to react, as seismic sensors were able to pinpoint the exact time of the earthquake. Comparing the major Western media outlets with Twitter yielded the following diagram (with times given in GMT).
Diagram showing when various media first published information regarding the Earthquake
The news broke on Twitter some 19 minutes before the main news providers: the first tweet that explicitly mentioned the earthquake (in this dataset, which looks primarily at aid rather than at the earthquake directly) came in 10 minutes after the event occurred. The BBC appeared to be the first big news provider to publish an article to their website, 29 minutes after the shock. Interestingly, bots on Twitter that were studying US Geological Survey seismic data responded more slowly than their human counterparts. The first tweet from one of these bots relating to the earthquake came some 19 minutes after the shock. Perhaps this is a feature of the Twitter accounts themselves, but I fully expected them to respond almost immediately after the event (do get in touch if you know why!). The first tweet came from an eyewitness in Bhaktapur (about 10km east of Kathmandu):
The first tweet referencing the Nepalese earthquake in the dataset
A stark reminder about how frightening this must have been.
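The response times quoted above can be laid out as a small worked calculation. This sketch uses only the figures given in this article (the shock at 06:11 GMT and delays of 10, 19 and 29 minutes); the dictionary labels and the function name are hypothetical.

```python
from datetime import datetime, timedelta

# Time of the main shock in GMT, as given in this article.
EVENT_TIME = datetime(2015, 4, 25, 6, 11)

# Delays quoted in this article, in minutes after the shock.
delays_minutes = {
    "first tweet (eyewitness)": 10,
    "first tweet from a seismic-data bot": 19,
    "first major news article (BBC)": 29,
}


def publication_times(event_time, delays):
    """Turn per-source delays in minutes into absolute publication times."""
    return {source: event_time + timedelta(minutes=m) for source, m in delays.items()}


times = publication_times(EVENT_TIME, delays_minutes)

# List the sources in the order in which they reacted.
for source, when in sorted(times.items(), key=lambda kv: kv[1]):
    print(f"{when:%H:%M} GMT  {source}")

# Twitter's head start over the first major news article.
head_start = times["first major news article (BBC)"] - times["first tweet (eyewitness)"]
print(f"Twitter head start: {head_start.total_seconds() / 60:.0f} minutes")
```

The 19-minute head start is simply the difference between the 29-minute and 10-minute delays. It happens to equal the delay of the first bot tweet, which therefore appeared about 10 minutes before the BBC article.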
It is an interesting fact that Twitter broke the story before major news outlets did, but there is a broader, more significant point here: mainstream media only publish events that are deemed interesting enough for their audience, and only after a verification process. Some smaller stories never make it to international media and appear only in local news or on social media. A study by Alexandra Olteanu et al. estimates that only 22-25% of social media and news events had the same cause. To keep abreast of the many smaller events in a domain, one needs to study all these sources to get the full picture. Further, to see developments as they happen, it is necessary to study data outside of the main international media. In a similar analysis, Michelle Odlum and Sunmoo Yoon report seeing signs of the Ebola outbreak in Nigeria 3-7 days prior to the first official announcement.
As discussed in the introduction, DfiD are interested in any information that would help them to deliver aid more effectively. How best to wade through the 1.5M items to find useful information for them?
The first move was to look only at data from the three-week period following the earthquake, cutting the dataset down to just under a million items. The reason for this is to find the needs of people immediately following the earthquake. DfiD are also interested in longer-term information that could assist in the rebuilding process. Here, for brevity, only the immediate needs following the earthquake are considered. Zooming in on this period reveals the usual saw-tooth graph that aligns with North American office hours:
Data volumes by time in the three weeks following the earthquake
This is still far too much data to inspect by hand. To cut the data down further, a conjecture was made that eyewitness observations are more useful than stories relayed second hand. To find them, it seemed prudent to use location data to remove second-hand information. Ripjar's toolset presents three locations we could use to filter: the locations mentioned in the text, the account location and the GPS location attached to the item.
The locations mentioned in the text would not give us the eyewitness data that we are after; they would merely find people discussing places in Nepal. Looking at items where the account location is inside Nepal left about 38,000 items, still too many to look through in a reasonable time. Filtering by GPS location cuts the data down severely as, typically, only 0.5% of Twitter data carries GPS co-ordinates (and this proportion is even lower outside of the Western world). This left just 480 tweets, an appropriate amount for hand analysis. This is where they were located:
Map showing the locations of tweets submitted with a GPS co-ordinate
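The filtering steps described above (a three-week window, then account location, then GPS location) can be sketched as a simple cascade. This is only an illustration under stated assumptions: the `Item` record and its field names are hypothetical, the account-location test is a naive substring match, the bounding box for Nepal is approximate, and the sample items are invented. It does not describe how Ripjar's toolset resolves locations.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Tuple

# Rough bounding box for Nepal in degrees (approximate, assumed for illustration).
NEPAL_BOX = {"lat_min": 26.3, "lat_max": 30.5, "lon_min": 80.0, "lon_max": 88.3}

EARTHQUAKE = datetime(2015, 4, 25, 6, 11)  # GMT, as given in this article
WINDOW = timedelta(weeks=3)


@dataclass
class Item:
    """Hypothetical item record; the field names are assumptions."""
    posted_at: datetime
    text: str
    account_location: Optional[str] = None
    gps: Optional[Tuple[float, float]] = None  # (latitude, longitude)


def in_window(item):
    """True if the item was posted within three weeks after the earthquake."""
    return EARTHQUAKE <= item.posted_at <= EARTHQUAKE + WINDOW


def account_in_nepal(item):
    """Naive check: the free-text account location mentions Nepal."""
    return item.account_location is not None and "nepal" in item.account_location.lower()


def gps_in_nepal(item):
    """True if the item carries GPS co-ordinates inside the rough bounding box."""
    if item.gps is None:
        return False
    lat, lon = item.gps
    return (NEPAL_BOX["lat_min"] <= lat <= NEPAL_BOX["lat_max"]
            and NEPAL_BOX["lon_min"] <= lon <= NEPAL_BOX["lon_max"])


def filter_cascade(items):
    """Return the time-windowed items and the two location-filtered subsets."""
    recent = [i for i in items if in_window(i)]
    by_account = [i for i in recent if account_in_nepal(i)]
    by_gps = [i for i in recent if gps_in_nepal(i)]
    return recent, by_account, by_gps


# Invented sample items, purely to show the call pattern.
items = [
    Item(datetime(2015, 4, 26, 3, 0), "Need tents near Gorkha", "Kathmandu, Nepal", (27.72, 85.32)),
    Item(datetime(2015, 4, 27, 12, 0), "Thinking of everyone in Nepal", "London, UK"),
    Item(datetime(2015, 4, 28, 8, 0), "Road to Dhading blocked", "Nepal"),
    Item(datetime(2015, 6, 1, 8, 0), "Monsoon approaching", "Pokhara, Nepal", (28.21, 83.99)),
]

recent, by_account, by_gps = filter_cascade(items)
print(len(recent), len(by_account), len(by_gps))
```

A rectangular box inevitably includes slices of neighbouring India and China, so a real analysis would test points against the country's actual border.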
The majority of the data points are focussed on the population centres of Kathmandu and Pokhara, with a smaller cluster close to the epicentre of the earthquake. The following tweets (translated into English by Google Translate) give a flavour of some of the data, all asking about the delivery of aid to the regions between Kathmandu and Pokhara.
Tweets from Nepal, discussing the need for aid in the Gorkha and Dhading districts
The regions of Gorkha and Dhading, between Kathmandu and Pokhara, appeared not to receive aid until some time after the earthquake. The majority of the aid came through Kathmandu airport and, as such, took a long time to propagate out to the rural communities. It is unclear whether this was because of logistics, landslides, broken infrastructure or simply a lack of co-ordination (if you were involved I would be very interested to hear your thoughts). Eventually Dhading, mentioned in the tweets above, was reached with the assistance of the Indian Air Force on the 5th May, 10 days after the earthquake:
Indian Air Force helicopter helping out in Dhading, Nepal (Image courtesy of NYC Medics)
In this article, an analysis of the news and social media surrounding the Nepalese earthquake has been presented, focussed on two main results:
This analysis was done after the earthquake had occurred and as such benefits from the wisdom of hindsight. The real question is: could a similar approach benefit future crises? Well, simply to stay abreast of the staggering quantity of news and social media data produced, it seems prudent to use an intelligent computer system to help sift through it all. Better, more timely information would result in improved decision making.
With respect to DfiD’s initial request, regarding information about needs on the ground, we’ve seen that social media does contain such data. One gets the feeling that open networks, like Twitter, will play an increasingly important role in the future of these kinds of crises, particularly if the next one happens somewhere with a higher percentage of internet users (in Nepal this was estimated to be 27% in 2013). In such a scenario, social media would be a fantastic source of data. This has to be a two-way street, however: to encourage more people to broadcast what help they need and where, aid agencies and governments should report what they are doing, to provide information to people waiting for aid (some agencies were indeed doing this).
To get the most from this kind of approach, systems need to be embedded prior to crises occurring. The strengths, and indeed weaknesses, of these kinds of data sources need to be understood ahead of time and rolled into the existing information pipelines to produce the best outcomes in the next crisis.
About the Author
Simon Smith is a data scientist at Ripjar, a data analytics startup focusing on extracting and presenting insights from social media, news and other publicly available data. Prior to this, Simon worked in a variety of research positions in cryptography, machine learning and visual analytics. Simon has a particular interest in natural language processing: enabling computers to understand human language.