
A Failed Dating Experiment - Responsiveness Sort

How many dates start with a "hey"? Actually quite a few, but it depends on many things. Your gender and orientation for one, and your overall level of attractiveness for another. There are users for whom a single syllable is enough to ignite a conversation, but they are few and far between. For everyone else, it's rough out there. Sending messages on dating apps can feel like talking into an abyss.

Well that's no good. We have our share of "hey"ers on OkCupid, but we also have many users who put a lot of thought into their messages. We'd like to do all we can to make sure they get a reply. How do we deal with the abyss? How do we get you that response? In this post, we'll discuss an attempt to remedy one of the issues - a skewed balance of attention. As the title suggests, this attempt is not a successful one, but we learned a lot along the way.

Illuminating the Abyss

Let's take a look at message reply rates broken down by a few gender and orientation combinations. In the below chart, we have the attractiveness of the sender as a percentile on the x-axis. On the y-axis we have the message response rate for that percentile. If someone gets replies to half their messages (outgoing first contacts), we call it a 50% success rate.

Alright, so you probably already have a couple of questions about this chart. For instance, how are we judging attractiveness?

I know, I know, beauty is subjective, but here at OkCupid, we love crunching through data, and would rather compute your beauty than leave it in the abstract. Every swipe provides us with a vote, some crowdsourced truth on an objective measure of whether someone's profile is appealing to a lot of people or not. Our proxy for beauty is then simply the ratio of right swipes (likes) to total swipes (likes + passes). This is a very nuanced feature we are trying to measure, and people swipe for different reasons, but by and large this is a good enough approximation for what we are after. Take the swipe ratio, order users by it, and bin them.
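To make that concrete, here is a minimal sketch of the ratio-and-bin computation using pandas and a toy table; the column names and numbers are illustrative, not our actual schema:

```python
import pandas as pd

# Toy per-user aggregates: likes/passes received drive the attractiveness
# proxy; first contacts sent and replies received drive the success rate.
users = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "likes_received": [120, 40, 300, 15, 80],
    "passes_received": [380, 460, 200, 485, 420],
    "first_contacts_sent": [30, 12, 50, 8, 20],
    "replies_received": [9, 2, 28, 1, 5],
})

# Proxy for beauty: right swipes over total swipes received.
users["swipe_ratio"] = users["likes_received"] / (
    users["likes_received"] + users["passes_received"]
)

# Order users by the ratio and bin them into percentiles (0-99).
users["attractiveness_pct"] = (
    (users["swipe_ratio"].rank(pct=True) * 100).astype(int).clip(upper=99)
)

# Reply rate per user, averaged per percentile bin, gives the y-axis of the
# chart above (with real data there would be many users in every bin).
users["reply_rate"] = users["replies_received"] / users["first_contacts_sent"]
reply_rate_by_pct = users.groupby("attractiveness_pct")["reply_rate"].mean()
print(reply_rate_by_pct)
```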

Now take a look at the curve for straight men. It is significantly lower than the other three, especially for men who fall in the lower attractiveness percentiles. If you happen to be one of those guys, only about 15% of your messages, roughly 1 in 7, get a response on average. Even if you are in the 99th percentile of attractiveness, over half of your recipients won't get back to you. Let's take a closer look at where all that attention is going.

Below we have a heatmap of sender vs recipient attractiveness for straight men. The x-axis here represents the score for the women getting the messages, while the y-axis corresponds to the score of the men sending them. Each row here is normalized to sum to 1.0. Cooler colors here indicate a lower percentage of outgoing messages went to the recipients in that bin, whereas hotter colors indicate more attention.
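For reference, here is a minimal sketch of how such a row-normalized heatmap can be assembled, using numpy and a synthetic message log in place of real data:

```python
import numpy as np

# Hypothetical message log: each row is (sender_percentile, recipient_percentile).
rng = np.random.default_rng(0)
messages = rng.integers(0, 100, size=(10_000, 2))

# 100x100 count matrix: rows = sender percentile, columns = recipient percentile.
counts, _, _ = np.histogram2d(
    messages[:, 0], messages[:, 1], bins=100, range=[[0, 100], [0, 100]]
)

# Normalize each row to sum to 1.0 so every sender band is comparable,
# regardless of how many messages that band sends in total.
row_sums = counts.sum(axis=1, keepdims=True)
heatmap = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

# Each populated row now describes how that sender band splits its attention.
assert np.allclose(heatmap.sum(axis=1), 1.0)
```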

Well, that's not very interesting, is it? The only thing of note here is the peak in the lower right, which is attractive people talking to other attractive people. It's not as bad as this surface makes it out to be, though. That large flat blue area means attention is pretty evenly divided among everyone else. In fact, if people ignored attractiveness altogether, the surface would be completely boring and flat. Take a look at the cross section for men in the 90th percentile.

If attention were evenly divided between all bins (if only love were blind) we would see the green line. Each of the 100 bins would have received 1 percent of messages. Instead we see that rate increase from less than 0.5% to just under 5%. Straight men who are more attractive than 90% of the population tend to send messages to women who are more attractive than 90% of the population. No surprises here.

Here is the same plot for men in the 10th percentile:

Humility goes a long way, they say. These guys distribute their messages much more evenly among recipients. They only send ~0.5% more messages to the top bins than they would in the uniform case, and they appear to do the same for the bottom few percentiles as well.

Now how about the average across all straight men:

Your time is valuable, we know it and you know it. People will only spend so much time answering messages. When we have an imbalance of attention like this it tends to lead to a bad user experience for the sender and the recipient. People in upper percentiles aren't going to want, or be able to, respond to every message, and everyone competing for the attention of the same handful of people will leave the senders with little to show for their (hopefully) thoughtful correspondence. It is no wonder then, that responsiveness goes down when attractiveness goes up.

Here we plot responsiveness as the percentage of incoming messages replied to, as a function of attractiveness for straight women.

This makes sense: people who receive fewer messages reply to a higher fraction of them.

So you see the problem here, the people most likely to respond to messages are the least likely to get them. With all this in mind, we set out on an experiment.

A Model for Success

We made an attempt to increase the rate of success for our senders as follows: Given 2 users, A and B, calculate how likely B is to respond to a first contact made by A. Use these probabilities as sorting criteria, and promote users who are more likely to respond. We had to be subtle with this change due to the effect shown by the above graph. If we sort by probability of response alone, the most attractive users will be sent to the bottom of the pile. This probably won't make our searcher very happy.

We have many sorts in our matching model. We sort by match percentage, by last login time, by attractiveness and so on. All of these sorts are combined geometrically with different sort weights applied to them. It is simple to tweak these coefficients if we wanted to, say, make attractiveness more important, and surface our most-liked users. In our case we wanted to introduce a new signal to sort by, responsiveness, and add a modest sort weight to it. Then we could study the behavior between our test and control groups and see if this had a positive impact.
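As a rough sketch of what a geometrically combined sort score could look like (the signal names, scalings, and weights below are made up for illustration, not our production values):

```python
import numpy as np

def sort_score(signals: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-candidate signals geometrically: prod(signal ** weight).

    Written as exp(sum(weight * log(signal))): a zero weight removes a signal
    entirely, and a small weight makes it a gentle nudge rather than a hard sort.
    """
    return float(np.exp(sum(w * np.log(max(signals[name], 1e-9))
                            for name, w in weights.items())))

# Hypothetical signals for one candidate, all scaled into (0, 1].
signals = {
    "match_pct": 0.92,
    "recency": 0.70,          # derived from last login time
    "attractiveness": 0.55,
    "p_response": 0.30,       # the new responsiveness signal
}

# Made-up weights: a modest weight on responsiveness so it nudges the ordering
# without sending unresponsive-but-attractive users to the bottom of the pile.
weights = {"match_pct": 1.0, "recency": 0.5, "attractiveness": 0.8, "p_response": 0.2}

print(sort_score(signals, weights))
```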

The analysis will be presented in a second, but first, how do we calculate the probability of response between user pairs?

We were just experimenting here, so we compromised on a feature set that would be quick to implement yet accurate enough to be meaningful. As a first approximation, consider the probability that user B responds at all, regardless of who user A is. A user's overall responsiveness is a strong predictor of their responsiveness to any given sender. Another feature is how often they swipe. Users who are more active on OkCupid, and vote often, are more likely to come across users who messaged them. Since we are predicting a binary event, whether they respond or not, we train a logistic regression model to get our probability.

Now looking at the probability of response as a function of global responsiveness wouldn't be that interesting, but let's look at it as a function of swipe rate to see if the data needs any transformation.

The noise at higher swipe counts is a result of data sampling. We sample data for our analysis and to generate these plots, and the number of users who have swiped any given number of times drops off as that count grows, so the data gets noisier as the sample size shrinks. The overall pattern is what matters here. As you can see, the relationship is not quite linear; the logarithm would perform better for our model. Below we plot the response rate as a function of the log of swipes.

This looks more linear and indeed generated lower errors when training a model.
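To illustrate that comparison, here is a small sketch on synthetic data; the functional form and the numbers are invented, and only the technique (fit on raw swipe counts vs. their logarithm and compare errors) mirrors what we describe:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(1)

# Synthetic stand-in: response rate rises quickly at low swipe counts and
# then flattens out, which is roughly the shape of the plot above.
swipes = rng.integers(1, 5000, size=5000)
response_rate = 0.1 + 0.08 * np.log(swipes) + rng.normal(0, 0.05, size=swipes.size)

for name, feature in [("raw swipes", swipes), ("log swipes", np.log(swipes))]:
    X = feature.reshape(-1, 1).astype(float)
    model = LinearRegression().fit(X, response_rate)
    err = mean_absolute_error(response_rate, model.predict(X))
    print(f"{name}: MAE = {err:.4f}")  # the log feature fits with lower error
```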

Another simple-to-compute metric is the attractiveness difference between sender and recipient. People are more likely to respond to someone higher up in attractiveness percentile than themselves, and less likely to respond when the difference runs the other way. Here we show the response rate as a function of this difference.

A value of -99 on the x axis would indicate that someone in the 0th percentile messaged someone in the 99th, and more than likely did not get a response, as you can see by the low rate of response for such cases. Meanwhile a value of 99 indicates that someone in the 99th percentile messaged someone in the 0th, a scenario where the rate of success is significantly higher, though still not a sure bet.

There are other features and transformations that lowered the error rate, but not by much. These three provide the most accuracy for the work needed, so we stop here and train on the recipient's overall responsiveness, the log of their swipe count, and the attractiveness difference between sender and recipient.

The logistic regression model predicts the response rate with an accuracy of about 87%. Accuracy here refers to 1 minus the mean absolute difference when we bin our predictions and compare the expected rate to the corresponding truth. This is not great, but since we are talking about a sort, as long as the relative values between users are ordered properly, the model will do. Here's a quick glance at what it looks like when we use these parameters to calculate our probability and compare it to the truth:
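For a concrete picture, here is a hedged sketch of that training-and-comparison loop using scikit-learn and synthetic data; the library choice, feature distributions, and coefficients are stand-ins, and only the shape of the procedure (three features, logistic regression, binned comparison of predicted vs. observed reply rates) follows the description above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 50_000

# Synthetic stand-ins for the three features: recipient's overall
# responsiveness, log of their swipe count, and the sender-minus-recipient
# attractiveness percentile difference.
responsiveness = rng.uniform(0, 1, n)
log_swipes = np.log(rng.integers(1, 5000, n))
attr_diff = rng.integers(-99, 100, n)
X = np.column_stack([responsiveness, log_swipes, attr_diff])

# Invented "true" relationship, just so the labels have learnable structure.
logit = -3.0 + 2.5 * responsiveness + 0.3 * log_swipes + 0.02 * attr_diff
replied = rng.uniform(0, 1, n) < 1 / (1 + np.exp(-logit))

model = LogisticRegression(max_iter=1000).fit(X, replied)
p = model.predict_proba(X)[:, 1]

# Binned calibration check: bucket the predicted probabilities, then compare
# each bucket's mean prediction to the observed reply rate in that bucket.
bins = np.clip((p * 10).astype(int), 0, 9)
abs_diffs = [abs(p[bins == b].mean() - replied[bins == b].mean())
             for b in range(10) if np.any(bins == b)]
print("binned accuracy ~", 1 - np.mean(abs_diffs))
```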

Experiment and Results

Great, so we have everything we need to construct our sort value. With that in mind, we set up a geo-experiment. We randomly divide the US into test and control groups based on regions. Why a geo-experiment and not just a random partitioning of the user set? We want to avoid network effects. If the change alters the way test group users interact with control group users, the behavioral differences become more difficult to detect. We would like to keep things as independent as possible, and since most users interact only with those within a certain radius, a geo-experiment limits any network effects we might see.
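For illustration, a minimal sketch of that kind of region-level bucketing (the region identifiers and the 50/50 split are ours, not a description of our actual assignment):

```python
import random

# Hypothetical region identifiers; in practice these would be real
# geographic units covering the US.
regions = ["region_%03d" % i for i in range(200)]

# Randomly split whole regions, not individual users, into test and control,
# so most sender/recipient pairs fall inside the same bucket.
rng = random.Random(42)
shuffled = regions[:]
rng.shuffle(shuffled)
half = len(shuffled) // 2
assignment = {r: ("test" if i < half else "control") for i, r in enumerate(shuffled)}

def bucket_for_user(user_region: str) -> str:
    return assignment[user_region]

print(bucket_for_user("region_007"))
```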

We chose to run this experiment on straight men only to start with, and only for searches performed within DoubleTake. For the unfamiliar, DoubleTake is our card swiping platform. It is the default view you see when first opening the app, and it is where most site activity occurs.

So for straight men in the test group, we promote responsive women higher up in their DoubleTake queues. The hope was that by surfacing these users more often, we would make them more likely to get a message. More responsive users getting more messages would result in more two-way communication, and hopefully, more dates. The first thing to check, then, would be: did we shift attention to more responsive women? Kind of: votes and likes skewed more heavily toward responsive women in the test group than in the control group. One thing that did not go up significantly was messages sent to these women. While votes went up, and swipe rates were consistent with previous patterns, overall messaging behavior was unaffected.

We were wary of how this experiment would affect users in different attractiveness bands. We expected that a sort like this might help users in lower percentiles, but that users in higher percentiles would be left with a worse experience. Response rate is already high when the sender is attractive. The sort does little to change this, but it does lower the overall attractiveness of the returned match results. We already know that men in high percentiles skew their attention towards the top band. Lowering the overall percentile of match results would then result in fewer interactions overall for these men. We thought there might exist some point along the attractiveness scale where the sort became ineffective, or even detrimental to the searcher. Our aim was to find that threshold and, in a future iteration, apply it as a cutoff. This way, we would only apply the sort to the users who benefit from it.

With this in mind, we bin our test and control groups of male senders into 5 bands and take a look at their send rates and response rates.

Because the test and control regions of a geo-experiment can have different population densities, our test group shows slightly lower overall message send rates than control, both before and during the experiment. What we aim to do is compare the pre-experiment numbers to those during the experiment, and then see how those normalized results compare across the test and control groups.

Before this experiment, our control group users sent roughly 6.5 messages per week, while our test group sent 5. These numbers jumped to 7.6 and 5.8 respectively during the test. The increase from week to week across both groups was likely due to seasonality or some other external event, so the relative increase needs to be compared across both. That looks like the following:

If the relative increase were the same across the test and control groups, we would see the baseline. As expected, more attractive users in the test group sent fewer messages than their control counterparts. The average message send rate across all bands decreased by about 1.3 percent.
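As a back-of-the-envelope illustration of that normalization, using the rounded weekly averages above (the ~1.3% figure is an average over attractiveness bands, so this overall estimate will not match it exactly):

```python
control_before, control_during = 6.5, 7.6
test_before, test_during = 5.0, 5.8

# Each group is compared to its own pre-experiment baseline, since the
# geo split means the raw send rates start at different levels.
control_lift = control_during / control_before - 1   # ~16.9%
test_lift = test_during / test_before - 1             # ~16.0%

print(f"control lift: {control_lift:.1%}, test lift: {test_lift:.1%}")
# The experiment effect is the gap between the two lifts, not the raw
# difference in messages sent.
print(f"relative gap: {(1 + test_lift) / (1 + control_lift) - 1:.1%}")
```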

Now for the users who were sending messages, how did the average response rate differ across groups?

Response rates hardly changed at all in the control group, and were not much better in the test group. The relative difference is marginal.

While response rates did improve across the board, they did not do so in a significant way: only a 0.5% increase over the baseline. The improvement was so small that it may well have been noise, and it shrank further for more attractive users. It also came at the expense of user experience, so it is difficult to justify putting it in production. Take a look at how the like-rate (outgoing likes over outgoing likes + passes) changed across groups:

Control activity was consistent for both time periods but test like-rate was down significantly:

Users liked who they saw in DoubleTake almost 10% less in the test group. This could be worthwhile if we saw a meaningful increase in responses, but absent that, it's hard to justify this change. The attention distribution heatmap was unchanged across groups as well. They look almost identical to the one at the top of this post. What this means is that we were unable to shift the focus of users. No matter who we showed them, they ended up messaging the same people they would have absent the experiment. They just had to swipe more to get there... and they did... mostly left.

At OkCupid, we don't make changes to features and algorithms without doing our due diligence. We are constantly running experiments, collecting data, and vetting our ideas. The metrics that drive our decisions vary by experiment, but they usually come down to one question: are users going on more dates? If the answer for the test group is yes, we move forward; in this case, it was not. We may iterate on this feature in the future and try to make it more effective, but for now, it does more harm than good.

By the way, if this kind of thing seems interesting to you, know that we're hiring! We would love to have you on board doing analysis like this!
