Selecting users into A/B buckets is the core construct of any web-based experimentation platform. Yet building and designing an A/B selection algorithm is surprisingly difficult, and the process of bucketing users is far more complex than just flipping a coin or spinning a roulette wheel.
Ensuring a persistent experience across multiple devices and various user states is hard. Depending on your requirements, what may start out as a simple selection algorithm can quickly devolve into a complex multi-key user identity system with real-time operational requirements.
In this post, we'll dive into the various selection schemes that we've used here at Simon as well as in other experimentation platforms that we've built in the past.
At a minimum, a binning algorithm must bucket deterministically based on a user's identifier. Given an experiment Elmo and user Urkel, such a function bucket(Elmo, Urkel) must always evaluate to the same bin to ensure a consistent experience and a valid test. If Elmo is an experiment that tests changes to search result rankings, then if Urkel is placed into variant B, he must remain in variant B when browsing subsequent search result pages; otherwise his experience will be inconsistent and the test will be invalid.
If you only need to experiment across logged-in users with known emails, then you can perform your A/B selection in an entirely stateless fashion using a basic hashing scheme. Here's some pseudocode:
    def hash_bin(experiment_name, user_email)
      hash = md5(experiment_name + "-" + user_email)
      bin = int(hash) mod 100
      bucket = bin < 50 ? "A" : "B"
      return bucket
The first line combines the experiment name and the user's email into a token and hashes it, which deterministically drives the randomization process. The second line then converts the hash to an integer (ignoring overflow issues that need to be dealt with) and reduces it to a bin number between 0 and 99. Finally, in line three, the bucket is chosen assuming two variants with a 50/50 split.
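For readers who want something runnable, here is a minimal Python sketch of the same idea. It assumes Python's standard `hashlib` module for MD5; the email address in the usage note is a made-up placeholder, and this is one possible rendering of the pseudocode rather than the exact production implementation.

```python
import hashlib


def hash_bin(experiment_name: str, user_email: str) -> str:
    """Deterministically assign a user to bucket "A" or "B" for an experiment."""
    # Build a token from the experiment name and the user's email.
    token = f"{experiment_name}-{user_email}".encode("utf-8")
    # MD5 is used here only to scatter users evenly, not for security.
    digest = hashlib.md5(token).hexdigest()
    # Python integers are arbitrary precision, so the 128-bit digest
    # can be converted without any overflow handling.
    bin_number = int(digest, 16) % 100
    # Two variants with a 50/50 split.
    return "A" if bin_number < 50 else "B"
```

Calling `hash_bin("Elmo", "urkel@example.com")` repeatedly returns the same bucket every time, with no stored state. In a language with fixed-width integers you would need to truncate the digest before taking the modulus.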
The above scheme works equally well when applied to any set of users who are identified by a single consistent set of identifiers - e.g. emails, device ids, or cookies.
If, however, your experimental system requires binning across multiple forms of identity - say, both logged out users represented by cookies and logged in users represented by user ids - then things get much more complicated.
Before diving into technical details on how to deal with multiple forms of identity, let's dig into a simple example that shows the limitations of the problems we're solving.
Day 1: Urkel surfs over to your homepage on his laptop, logs into the site, and is binned into variant B.
Day 2: Urkel surfs over to your homepage from his iPad but doesn't log in and is binned into variant A; he spends another 5 minutes browsing the site and then eventually logs in.
In this case, we're faced with two options on how to treat Urkel in his day 2 experience:
Option 1: Switch Urkel back to bin B on login. This will maintain consistency across his logged in experience, yet will result in a potentially awkward user experience when he abruptly switches buckets from one screen to the next.
Option 2: Keep Urkel in bin A on login. This will ensure a consistent user experience across day 2, but the resulting statistics for the test may be skewed since Urkel saw both variants A and B in his logged in state.
Of course, both result in different forms of inconsistencies; neither is ideal.
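To get a feel for how often this situation arises under the stateless scheme, the sketch below hashes a logged-out cookie and a logged-in user id separately for a set of simulated users and counts how often the two buckets disagree. The identifier formats (`cookie-0`, `user-id-0`, and so on) are invented for illustration, and the hash function is the same MD5-based sketch assumed earlier.

```python
import hashlib


def hash_bin(experiment_name: str, identifier: str) -> str:
    """Stateless 50/50 bucketing keyed on a single identifier."""
    token = f"{experiment_name}-{identifier}".encode("utf-8")
    bin_number = int(hashlib.md5(token).hexdigest(), 16) % 100
    return "A" if bin_number < 50 else "B"


def mismatch_rate(experiment_name: str, num_users: int) -> float:
    """Fraction of simulated users whose cookie and user id hash to different buckets."""
    mismatches = 0
    for i in range(num_users):
        # Hypothetical identifiers: one logged-out cookie and one logged-in id per user.
        cookie_bucket = hash_bin(experiment_name, f"cookie-{i}")
        login_bucket = hash_bin(experiment_name, f"user-id-{i}")
        if cookie_bucket != login_bucket:
            mismatches += 1
    return mismatches / num_users
```

Because the two hashes are effectively independent, a 50/50 split should leave roughly half of these users in Urkel's position, which is why the choice between the two options matters in practice.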
If Urkel then revisits the site on his iPad on day 3, Option 1 would maintain his inclusion in variant B for both logged in as well as logged out contexts.
To implement this, we need to record state. When Urkel logs in on day 2, we need to explicitly record that we're going to override the simple hash function that assigned him into bucket A during his logged-out iPad browsing. When Urkel returns on day 3, we now need to check the database first to see if he's a known subject with a pre-existing bin.
The pseudocode should look something like this:
    def stateful_bin(experiment_name, user)
      # Lookup against current identifier scope (email or cookie)
      bin = lookup_bin(experiment_name, user.identifier)
      if bin is empty
        bin = hash_bin(experiment_name, user.identifier)
      # Save bin across all available user identifiers
      assign_bin(experiment_name, user, bin)
      return bin
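The stateful lookup can be sketched in Python as follows. Everything beyond the pseudocode's own names is an assumption made for illustration: the `User` dataclass, its `all_identifiers` field, and the in-memory `BinStore` class (a stand-in for a real persistent store) are hypothetical, and the sketch writes the bin back on every call so that a login propagates the existing bin to the device cookie, as in Option 1.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


def hash_bin(experiment_name: str, identifier: str) -> str:
    """Stateless 50/50 bucketing keyed on a single identifier."""
    token = f"{experiment_name}-{identifier}".encode("utf-8")
    bin_number = int(hashlib.md5(token).hexdigest(), 16) % 100
    return "A" if bin_number < 50 else "B"


@dataclass
class User:
    identifier: str  # identifier for the current scope (email or cookie)
    all_identifiers: List[str] = field(default_factory=list)  # every identifier known right now


class BinStore:
    """In-memory stand-in for a persistent store such as a database table."""

    def __init__(self) -> None:
        self._bins: Dict[Tuple[str, str], str] = {}

    def lookup_bin(self, experiment_name: str, identifier: str) -> Optional[str]:
        return self._bins.get((experiment_name, identifier))

    def assign_bin(self, experiment_name: str, user: User, bucket: str) -> None:
        # Save the bin under every identifier currently attached to the user.
        for identifier in user.all_identifiers:
            self._bins[(experiment_name, identifier)] = bucket


def stateful_bin(store: BinStore, experiment_name: str, user: User) -> str:
    # Look up a previously recorded bin for the current identifier.
    bucket = store.lookup_bin(experiment_name, user.identifier)
    if bucket is None:
        # Unknown subject: fall back to the stateless hash.
        bucket = hash_bin(experiment_name, user.identifier)
    # Record the bin under all available identifiers, overriding earlier cookie bins.
    store.assign_bin(experiment_name, user, bucket)
    return bucket
```

In Urkel's case, the day 2 login would call `stateful_bin` with his email as the current identifier and both the email and the iPad cookie in `all_identifiers`, so the cookie inherits his day 1 bin and his logged-out day 3 visit stays consistent.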
Unfortunately, the cost of implementing this can be quite high. First, the function has interleaved reads and writes, so some sort of persistent, random-access, and ideally transactional database may be needed - MySQL, Redis, etc. Second, if you're supporting both logged-in and logged-out users, the size of this database can grow quite large, as the number of logged-out users in many contexts can be an order of magnitude larger than the number of registered or logged-in users. And finally, whereas a deterministic hashing scheme requires only a handful of machine instructions and a fraction of a microsecond of compute time, a database lookup with a network connection can take milliseconds and can have a material impact on overall system performance.
Fundamental here is a problem of identity management. Users can be represented via multiple logged out states, various customer identifiers, or different marketing contexts from email to phone number to mailing address.
The state-based method outlined above is akin to lazily building an identity management repository as users are added into your test. One could imagine another approach in which identity management is dealt with independently and buckets are assigned to a user as a whole instead of their specific identifiers one at a time.
If the problem is approached as one of identity management first, then you can preemptively enforce even better consistency. For example, in the case of Urkel's day 1 and day 2 experience above, if his logged-out identities were known beforehand, then he could have been pre-allocated a consistent bucket across each of his devices. For an experiment such as a 50% promotional offer, this can be critically important.
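As a rough illustration of the identity-first approach, the sketch below resolves whatever identifier is at hand to a canonical customer id before hashing, so every device a user is known on lands in the same bucket. The `identity_map` dictionary and the `customer-1` id are hypothetical stand-ins for a real identity management system, not a description of any particular platform.

```python
import hashlib
from typing import Dict


def hash_bin(experiment_name: str, identifier: str) -> str:
    """Stateless 50/50 bucketing keyed on a single identifier."""
    token = f"{experiment_name}-{identifier}".encode("utf-8")
    bin_number = int(hashlib.md5(token).hexdigest(), 16) % 100
    return "A" if bin_number < 50 else "B"


def identity_bin(experiment_name: str, identifier: str, identity_map: Dict[str, str]) -> str:
    """Bucket on the canonical customer id when the identifier is already known."""
    # Fall back to the raw identifier when no identity has been resolved yet.
    canonical_id = identity_map.get(identifier, identifier)
    return hash_bin(experiment_name, canonical_id)
```

With a map such as `{"urkel@example.com": "customer-1", "ipad-cookie": "customer-1"}`, both the email and the iPad cookie hash through `customer-1` and receive the same bucket, without any per-experiment state. The cost moves into building and maintaining the identity map itself.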
We've collectively built and architected three separately designed A/B testing platforms over the course of our previous startup, Etsy, and at Simon. And across each of these contexts, requirements and infrastructure were quite different.
At Etsy, we got a very long way with simple hash-based bucketing that keyed off of either logged out cookies or logged in user ids, but not both. This was effective in managing a wide array of tests conducted across many tens of millions of unique devices per month.
At Simon, our testing problems are more diverse and multi-channeled, and we've built significant infrastructure around multi-key customer identification to support a wider array of use cases.
These two solutions sit at opposite ends of the complexity curve. Neither solution is perfect, and results certainly vary from company to company and from experiment to experiment. The goal here isn't to eliminate mistakes, but rather to set up your experimental context to minimize them.
Stay tuned for future blog posts as we dive into more detail around this and related challenges surrounding user identity management.