Birds of a feather flock together, or so the saying goes. While some individualists may bristle at the notion of being predictable, mining patterns on reading interests, or app usage, or video watches, or blog follows almost always show us that people who interact with similar items continue to read/use/watch similar items in the future.
Recommendation systems (e.g., collaborative filtering, matrix factorization, factorization machines) tend to exploit the obvious insight that users typically prefer recommendations that come from folks who behave like them. Flavors of recommendation systems have been successfully used in several companies: Netflix, Facebook, YouTube, and Amazon to name a few. However, there are a number of research challenges that need to still be addressed. First, how do recommendation systems work when the lifetime of a content item is limited, as in a breaking news article? Second, how do we make personalized recommendations for each and every user? In Yahoo terms, for the problem of news recommendations this boils down to being able to model hundreds of millions of users against millions of items. Third, how do we develop a methodology that can be applied across several recommendation problems, be it news items, videos, apps, blogs, users, stock tickers, your favorite product, etc.?
Inspired by the above challenges, the Personalization Science team at Yahoo Labs has developed a scalable recommendation system based on the concept of Factorization Machines. To further address the problem of scalability we have implemented a distributed version of the Factorization Machine which leverages the in-house parameter server implementation. Our novel distributed Factorization Machine (DFM) solution has been leveraged in a live production system and easily scales to handle live traffic.
Figure 1: Example of FM cast as a Matrix Factorization problem
Factorization Machines is a user response prediction system that models the relevance between different entities. To illustrate the concept, suppose we have collected a set of users’ interactions on items as shown in Figure 1. Each user-item interaction is then encoded into several possible matrix components. For example, user index, item index, user demographic features, item features (such as entities in the item), contextual features (such as time of day, etc.), along with a target rating indicating the user’s response to the item. FM then builds a vector of latent factors for each entity in each component by minimizing a loss function. Casting the raw vectors into a lower dimensional space allows us to directly compute similarities between the latent factors of each component. The aggregated similarities with some bias adjustments allows us to generate personalized recommendations.
There are many applications to which FM can be applied, such as contextual recommendation, topic modeling, and so on. For each use-case, the generalization capability of FM means we only have to design the input data accordingly. However, to apply FM on Yahoo’s products, we need to contend not just with big data, but also with a large model. For example, for news item recommendation on the Yahoo homepage, there are hundreds of millions of users, millions of news articles, and tens of billions of interaction records between users and articles. So we have to partition data to multiple machines. In addition, if we use a vector with 10 dimensions to represent users and articles, the model size is also on the scale of hundreds of GBs, such that it is more efficient to distribute the model storage as well.
Distributed Factorization Machines
In order to scale FM to Yahoo use-cases, we leveraged the Hadoop Map/Reduce (MR) framework and Yahoo’s home-grown Parameter Server (PS) technology. The innovative PS has empowered Yahoo’s large-scale machine learning algorithms (including logistic regression, word embedding) in production. We utilized MR to implement the data parallelism and let the PS handle the model parallelism. Specifically, in each mapper, we load data from HDFS and partition the data based on an appropriate key. Each reducer, then runs a mini-batch Stochastic Gradient Descent (SGD) optimization method on the data it received. In each model iteration, a pull operation is executed to get the model parameters from the parameter server and a push operation is used to update parameters on servers. Shattering the data to multiple machines and partitioning the models across parameter servers enabled us to scale to not only large data, but also allows us to work with large models.
To test out the performance of our DFM, we used our implementation to build profiles of user interests on news articles. The problem is one of extracting relevant item features, such as phrases in news articles, and assigning them to users based on user-item interaction records. We can then compute the similarity between the users and the items and use the score to recommend relevant items to a user. In this case, we reformatted the input instances to FM by replacing item indices as correspondent feature indices as shown in Figure 2.
Figure 2: Example of FM for User Profile Building
We evaluated our approach using A/B tests on live traffic on the Yahoo home page news feed. In our experimental setup, a random sample of users was partitioned into a control and a treatment group. The users in the control group had their profiles built via the default algorithm in production and the users in the treatment group had their profiles built via DFM. Holding all other features constant, the only difference between the control and treatment group was the computation of the relevance score between the corresponding profile and the content item. The base DFM model was trained over three months of user-item interaction data, after which we did a daily update of the base model using the interaction data for the previous week. It is worthwhile to note that the DFM model was trained in a matter of hours, although it handled on average around 0.5B users, 88 million items, and 6.4B interactions. The results of the experiment are shown in Figure 3. where DFM overall performs better than the baseline on a metric optimized for dwell time on an item. An added benefit was that we were able to expand the profile of our users. As an example, we were able to expand a user’s interest in Chris Smalling to include Manchester United F.C., Premier League, Louis van Gaal, David de Gea, Wayne Rooney, etc.
Figure 3: Online A/B Tests of User Profile Building
DFM is a scalable and elegant solution that generalizes to most recommendation problems. The Personalization Science team has successfully applied DFM on several Yahoo products, such as the Yahoo homepage, Aviate, and Yahoo Recommends. We are always on the lookout for challenging product use-cases, so if you are working on the problem of recommending news articles, apps, videos, blogs, photos, or even birds, we are all ears!