In this post, we explore LDA (Latent Dirichlet Allocation), an unsupervised topic modeling method, in the context of Twitter timelines. Given a Twitter account, is it possible to find out what subjects its followers are tweeting about?
Knowing how an account's followers evolve or segment can give a marketing department actionable insights into the near-real-time concerns of existing or potential customers. Carrying out topic analysis of politicians' followers can produce a view complementary to opinion polls.
The goal of this post is to explore my own followers, 698 at the time of writing, and find out what they are tweeting about through topic modeling of their timelines.
We use Python 2.7 and the following packages and methods:
Obtaining Twitter data is not straightforward and takes time, since Twitter limits the number of requests one can make in a given amount of time. This forces us to include waiting periods when querying the API. To avoid having to query Twitter every time we want to access the original data, we save all the tweets in a MongoDB database. Since MongoDB does not need a predefined static schema, we can extend the data structure on the fly if needed.
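The waiting periods can be sketched as a small retry wrapper. This is only an illustration, not the script used for this post: `call_with_wait`, the `is_rate_limit_error` predicate and the 15 minute default pause are assumptions introduced here. With Twython, the predicate could check for `TwythonRateLimitError`.

```python
import time


def call_with_wait(request, is_rate_limit_error, wait_seconds=15 * 60,
                   max_attempts=5, sleep=time.sleep):
    """Call `request()` and pause between attempts while it is rate limited.

    `is_rate_limit_error` is a hypothetical predicate: it receives the raised
    exception and returns True when that exception signals a rate limit.
    """
    for attempt in range(max_attempts):
        try:
            return request()
        except Exception as error:
            # Re-raise anything that is not a rate limit, and the last failure.
            if not is_rate_limit_error(error) or attempt == max_attempts - 1:
                raise
            sleep(wait_seconds)
```

Passing `sleep` as a parameter keeps the function easy to try out without actually waiting.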
We only need OAuth2 authentication, which is simpler to implement than OAuth1. Once you have obtained your OAuth2 keys and ACCESS_TOKEN, Twython makes it easy to retrieve a list of followers or the tweets from a timeline. There is a one-to-one correspondence between Twython methods and Twitter API calls. For each follower we store the 200 most recent tweets (the maximum), the language of the account, the number of tweets obtained and the screen name of the account.
The Python code to extract the data from Twitter is available here
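A minimal sketch of the per-follower record might look like this. It assumes `client` is an authenticated Twython instance (or any object exposing `get_user_timeline`), and that each returned tweet is a dict with a `text` field and a nested `user` object, as in the Twitter REST API. The function name and the record's field names are hypothetical, not those of the original script.

```python
def build_follower_record(client, user_id, max_tweets=200):
    """Fetch a follower's recent tweets and keep only the fields we store."""
    timeline = client.get_user_timeline(user_id=user_id, count=max_tweets)
    if not timeline:
        # Assumption: followers without tweets get an empty record.
        return {"user_id": user_id, "screen_name": None, "lang": "und",
                "n_tweets": 0, "tweets": []}
    user = timeline[0]["user"]  # every tweet embeds its author's profile
    return {
        "user_id": user_id,
        "screen_name": user["screen_name"],
        "lang": user.get("lang", "und"),
        "n_tweets": len(timeline),
        "tweets": [tweet["text"] for tweet in timeline],
    }
```

A record of this shape is a plain dict, so it can be saved to a MongoDB collection as is, for example with pymongo's `insert_one`.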
We create one document per follower by aggregating his / her tweets. Each follower then has a unique document which serves as the basis for our topic analysis.
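As an illustration, the aggregation can be as simple as joining the tweets. The sketch assumes each follower record is a dict with `screen_name` and `tweets` keys; those field names are hypothetical.

```python
def tweets_to_document(record):
    """Join all stored tweets of one follower into a single text document."""
    return " ".join(record["tweets"])


def build_documents(records):
    """Map each follower's screen name to his or her aggregated document."""
    return {record["screen_name"]: tweets_to_document(record)
            for record in records if record["tweets"]}
```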
Having documents in several languages adds noise to the topic extraction, so we want to filter out timelines that are not entirely in English. Filtering users by the lang parameter of their Twitter account is not fully reliable: some users tweet in several languages although their account language is declared as en, while others have not defined the language of their account (lang = und, for undefined).
To filter out non-English timelines, we first drop accounts whose lang attribute is set to anything other than en or und, and then use the langid library to further refine our selection.
Other methods using NLTK stopwords (see also this post) or character trigrams could also be considered to detect the language at the tweet level. However, since our documents are collections of tweets and contain a large enough number of words, the langid method is the simplest.
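The two-step filter could be sketched as follows. `langid.classify` returns a `(language, score)` pair; the function name, the record layout and the injectable `classify` argument are assumptions made for this illustration.

```python
def is_english_timeline(record, document, classify=None):
    """Keep a follower only if both the account and the text look English."""
    # Step 1: the declared account language must be English or undefined.
    if record.get("lang", "und") not in ("en", "und"):
        return False
    # Step 2: classify the aggregated document itself.
    if classify is None:
        import langid  # third-party package, imported lazily
        classify = langid.classify
    language, _score = classify(document)
    return language == "en"
```

Because the classifier is passed in as an argument, another language detector can be swapped in without changing the filter.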
And we end up with 472 documents.
In the meantime, I learned that I had followers tweeting in Japanese, French, Russian, Czech, Spanish, Portuguese, Italian, Dutch and Greek!
We then clean up and prepare the documents for LDA
Then we build a dictionary that assigns a unique id to each distinct word found in the documents. We end up with a dictionary of 24402 tokens. Finally, we build the corpus, i.e. one vector per document holding the number of occurrences of each word in that document.
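To make the two data structures concrete, here is a pure-Python sketch of a dictionary and a bag-of-words corpus. It is not the code used for this post: the function names are invented for illustration. Libraries such as gensim provide the same idea through `corpora.Dictionary` and its `doc2bow` method.

```python
from collections import Counter


def build_dictionary(tokenized_docs):
    """Assign a unique integer id to every distinct token in the collection."""
    token2id = {}
    for tokens in tokenized_docs:
        for token in tokens:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def doc_to_bow(tokens, token2id):
    """Turn one tokenized document into sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


docs = [["topic", "model", "topic"], ["model", "twitter"]]
token2id = build_dictionary(docs)
corpus = [doc_to_bow(tokens, token2id) for tokens in docs]
# token2id == {"topic": 0, "model": 1, "twitter": 2}
# corpus   == [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```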
This cleanup process was iterative. Looking at the LDA results allowed us to detect frequent words that did not add any meaning to the documents and include them in the stopword list for another cleanup run.
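One way to support this kind of iteration is to keep the stopword list as a parameter of the tokenizer. The sketch below is an assumption about what such a cleanup could look like, not the original script: the starting stopword list, the regular expressions and the minimum token length are all hypothetical choices.

```python
import re

# Hypothetical starting list; extend it after inspecting the LDA output.
STOPWORDS = {"the", "a", "and", "to", "of", "in", "rt", "via"}

URL_OR_MENTION = re.compile(r"https?://\S+|@\w+")
TOKEN = re.compile(r"[a-z]+")


def tokenize(document, stopwords=STOPWORDS, min_length=3):
    """Lowercase, drop URLs and @mentions, then remove short words and stopwords."""
    text = URL_OR_MENTION.sub(" ", document.lower())
    return [token for token in TOKEN.findall(text)
            if len(token) >= min_length and token not in stopwords]
```

After a first LDA run, uninformative frequent words can be added for the next pass, for example `tokenize(document, STOPWORDS | {"today", "new"})`.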
The python code used to process the raw documents is available here
Now we are ready to find what the followers of alexip are tweeting about.
Finding the right parameters for LDA is an art. 3 main parameters need to be optimized:
Since we are dealing with tweets, I assumed that each follower would tweet about a limited number of topics and therefore set alpha to a low value, 0.001 (the default value is 1.0/num_topics). I left beta at its default setting.
We tried several values for K, the number of topics. Too few topics result in heterogeneous sets of words, while too many diffuse the information, with the same words shared across many topics. For the record, there are several different ways to estimate the optimum number of topics in a corpus. See also the Hierarchical Dirichlet Process (HDP), an extension of LDA where the number of topics is inferred from the data and does not have to be specified beforehand.
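To see why a low alpha matches that assumption, it helps to recall the standard LDA generative model. The symbols below are the usual textbook ones and are not defined elsewhere in this post: θ_d is the topic mixture of document d, φ_k the word distribution of topic k, z the topic assigned to a word position and w the observed word.

```latex
\begin{aligned}
\theta_d &\sim \mathrm{Dirichlet}(\alpha) && \text{topic mixture of document } d \\
\phi_k &\sim \mathrm{Dirichlet}(\beta) && \text{word distribution of topic } k \\
z_{d,n} &\sim \mathrm{Categorical}(\theta_d) && \text{topic of the } n\text{-th word in } d \\
w_{d,n} &\sim \mathrm{Categorical}(\phi_{z_{d,n}}) && \text{the observed word}
\end{aligned}
```

With alpha well below 1, the Dirichlet prior concentrates its mass on sparse mixtures θ_d, so each follower's document tends to be explained by only a few topics.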
The default view of the top 10 topics is not the most user-friendly one. For each topic, it lists the top 10 words and their associated probabilities. It's difficult to interpret the lists of words and define the associated topics. For instance:
Fortunately, we can use LDAvis, a fantastic library to explore and interpret LDA topic results. LDAvis maps topic similarity by calculating a semantic distance between topics (via the Jensen-Shannon divergence).
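For reference, the Jensen-Shannon divergence between two topics, each seen as a probability distribution over the same vocabulary, can be computed as in the sketch below. This is a plain-Python illustration of the formula using natural logarithms, not LDAvis's own implementation.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q), skipping zero-probability terms."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Symmetric divergence: average KL of p and q to their midpoint m."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Identical topics have a divergence of 0, and two topics sharing no words reach the maximum, ln 2 with natural logarithms.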
LDAvis is an extremely powerful tool to tune up your LDA model and visualize the cohesion of your topics.
The results of the topic modeling of my followers are available in this LDAvis notebook.
Too few topics (K=10) result in non-cohesive topics that are all very similar to one another. With 50 topics the topic spread is much better. But the best result was obtained with 40 topics, alpha = 0.001 and 100 passes. We also trained the model with 100 topics and 100 passes but did not notice any improvement in the coherence of the topics. All 3 models are presented in the LDAvis notebook.
LDA is not a magic wand! The model is difficult to train and the results need a solid dose of interpretation. This is especially true in the context of tweets, which make for rather noisy documents. In an unsupervised context, estimating the performance of the model requires either manual assessment or some quality metric. Our manual assessment of the relevance of the topics found by the LDA model is not exhaustive. However, it gives a strong indication that the method we followed is sound and does what it's supposed to do.
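As a rough sketch, the best configuration reported above could be trained as follows. It assumes the gensim library, whose `LdaModel` takes `num_topics`, `alpha` and `passes` parameters; `LDA_PARAMS` and `train_lda` are names made up for this example. `corpus` is the bag-of-words corpus and `id2word` the dictionary described earlier.

```python
# Settings taken from the text: 40 topics, alpha = 0.001, 100 passes.
LDA_PARAMS = {"num_topics": 40, "alpha": 0.001, "passes": 100}


def train_lda(corpus, id2word, **overrides):
    """Train an LDA model; assumes the gensim package is installed."""
    from gensim.models import LdaModel  # third-party, imported lazily
    params = dict(LDA_PARAMS, **overrides)  # e.g. num_topics=50 to compare
    return LdaModel(corpus=corpus, id2word=id2word, **params)
```

Calling `train_lda(corpus, dictionary, num_topics=10)` or `num_topics=50` would give the other models to compare, and the top words of each topic can then be listed with the model's `show_topics()` method.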
Further ways to improve the LDA results would include
In the second part of this study, we carry out a segmentation of the followers with Latent Semantic Analysis and K-Means and compare the results of LSA and LDA. Read all about it in Segmentation of Twitter Timelines via Topic Modeling.
Some videos / Notebooks on LDAvis
The code and notebook behind this post are:
Final Note: The first version of the Python scripts was written for Python 3.4. However, after running into problems with LDAvis and the dictionary produced in 3.4, I reverted to 2.7. I also found out that LDA models saved in Python 3 were not compatible with LDAvis for Python 2. Switching the whole stack back to Python 2.7 was the path of least resistance.
Next: In the following article we compare LSA and LDA and show how these 2 methods can be combined for better results.
If you have any questions or comments, please post them below.