The Saassy Guide to... 5 key take-aways from Experimentation & Testing @ Reforge
Over the last few months, I dug deep into experimentation & testing during a fantastic Reforge course led by Hila Qu. Through a series of live case studies from companies like Netflix, SurveyMonkey, Uber, and Coursera, presented by a who's who of Growth, five key take-aways emerged:
1: Humans don’t do what they say they’ll do
Humans say one thing, but they may end up doing the complete opposite. This was a recurring take-away across several sessions.
Jen Dante provided an example of our erratic nature from her time at Netflix. Users were required to provide credit card information when registering for a free trial. But research kept yielding consistent feedback: people hated doing that at this stage. If the free trial persuaded them, they would gladly provide credit card details when converting to paid. So Netflix ran an experiment without the upfront credit card requirement. In Jen's words, "it was the single worst performing AB test in the history of Netflix", with roughly a 20% drop in paid subscribers.
The free users just didn’t end up doing what they said they would.
2: Minimize effort and complexity at every step
The economics of experimentation are against you.
Experiments have a low success rate - about 70% of them fail. The success rate is even lower at well-optimized companies like Google, where only 1 in 10 experiments succeeds (Kohavi, Tang, and Xu: Trustworthy Online Controlled Experiments).
At the same time, experiments are costly to run. There are three key costs to consider:
Cost to build: the engineering and design resources required to build an experiment
Opportunity cost: a user segment should be exposed to only one experiment, so by prioritizing one experiment, you have to delay/sacrifice another
Loss of revenue: if your experiment has a negative impact, you will lose revenue from the variation group
You don’t want to spend a lot of resources on an experiment that has a 70-90% chance of failing. You have to minimize effort at every stage. Before you design an experiment, you can leverage cheaper ways of gathering insights, such as surveys, data analysis or painted door testing (exposing a fake version of a feature to gauge demand before building it), to ensure you are on the right track with your hypothesis. Yee Chen used the example of Reddit, where a painted door test showed a 30% lift in sign-ups if an automatic Google sign-up option was added. This gave the team the confidence to commit their limited resources to the real test.
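To make these economics concrete, here is a back-of-the-envelope expected-value sketch in Python. Every cost and upside figure is a made-up assumption for illustration; only the failure rate echoes the 70-90% range above.

```python
# Rough expected value of running one experiment.
# Every number here is an illustrative assumption, not a figure from the course.

P_SUCCESS = 0.2                   # if 70-90% of experiments fail, assume ~20% succeed
BUILD_COST = 15_000               # cost to build: engineering + design time
OPPORTUNITY_COST = 5_000          # value of the experiment you delayed to run this one
REVENUE_LOSS_IF_NEGATIVE = 8_000  # revenue lost in the variation group if the test hurts
UPSIDE_IF_POSITIVE = 120_000      # value captured if the winning variant ships

expected_value = (
    P_SUCCESS * UPSIDE_IF_POSITIVE
    - BUILD_COST
    - OPPORTUNITY_COST
    - (1 - P_SUCCESS) * REVENUE_LOSS_IF_NEGATIVE
)
print(f"Expected value: {expected_value:,.0f}")  # -2,400 with these inputs
```

Even a healthy upside barely covers the costs once the failure rate is factored in; halving the build cost (say, by reusing an off-the-shelf tool instead of building something new) flips the bet positive.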
Once you are ready to design the experiment, minimize the build complexity. For example, for web experiments at Spendesk, we frequently leverage Typeform instead of creating new web blocks.
3: Even with the right hypothesis, the experiment may still fail
Experiments are hard because many different factors, beyond your hypothesis, influence the outcome.
Netflix tested a hypothesis that helping people set up Netflix on a TV would lead to a higher conversion rate. The experiment turned out negative, but a closer look at the data showed the hypothesis was still correct. The issue was that the experiment guided users to set up their TV at the wrong time - before they got to their aha moment of viewing content.
The experiment didn’t fail because the hypothesis was wrong; it failed because the timing was. The hypothesis was inserted into the wrong moment of the user experience. That is what makes experimentation so hard - many factors influence the outcome.
4: Don’t get fooled by averages
Averages may hide important differences between segments, and those differences may not be visible until you start digging.
A few years ago, SurveyMonkey evolved their pricing, and Elena Verna ran monetization experiments to test users’ willingness to pay. But the conversion rate “tanked” across all pricing options - from lower prices to slightly higher and even high ones. The results didn’t make sense and contradicted the previous user research. Until the team started segmenting. It turned out that 90% of visits to the pricing page came from free users who had already seen the page before; the new pricing confused them. The results made sense once SurveyMonkey zoomed in on the remaining 10% - free users who visited the page for the first time.
Segmentation for testing happens at two stages. Before the experiment, you have to exclude users that are not relevant (e.g., excluding mobile users when testing on desktop). After the test, you want to identify segments that can explain your results better than the average.
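Here is a minimal sketch of both stages in Python with pandas, using a hypothetical exposure log; the column names (device, first_visit, variant, converted) and values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical experiment log: one row per user exposed to a pricing-page test.
df = pd.DataFrame({
    "device":      ["desktop", "desktop", "mobile", "desktop", "desktop", "mobile"],
    "first_visit": [True, False, True, False, True, False],
    "variant":     ["control", "treatment", "control", "treatment", "treatment", "control"],
    "converted":   [1, 0, 0, 0, 1, 0],
})

# 1) Before the experiment: exclude users who are not relevant,
#    e.g. mobile visitors when the test only changes the desktop page.
eligible = df[df["device"] == "desktop"]

# 2) After the experiment: don't stop at the blended average ...
overall = eligible.groupby("variant")["converted"].mean()

# ... break results down by a segment that might explain them,
#     e.g. first-time visitors vs. returning free users.
by_segment = (
    eligible.groupby(["first_visit", "variant"])["converted"]
    .agg(["mean", "count"])
)

print(overall)
print(by_segment)
```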
5: It is not just about the one metric
Probably the most important aspect of experiments is not the numbers or statistics, but understanding the “why” behind them. Often, the outcomes of experiments don’t make sense at first - like for Netflix or SurveyMonkey. In those cases, growth teams rely on other research, data or tracking to analyze the results, and to try to understand the rationale behind the numbers.
In addition, you may move your primary metric in the right direction, but it may have surprising or unintended consequences somewhere else.
Tom Willerer saw a surprising consequence when testing Coursera’s monetization model. He found that the subscription model had an additional impact on learners’ behavior: it increased course completion. Willerer speculated that learners felt a greater sense of ownership ("I'm paying, I should use this.") or hoped to save money ("Maybe I can get this more cheaply if I do six months' worth of content in two months.").
At Spendesk, we are trying our hardest to avoid any unintended consequences. We are currently running the classic “book a demo” test, where we update the design and reduce the number of fields we are asking prospects to fill in. We have seen an increase in conversion, but we are also verifying with our sales team that the quality of leads hasn’t dropped.
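A guardrail check for this kind of test could look like the sketch below. The numbers and the 5% tolerance are made up; they are not figures from our actual “book a demo” experiment.

```python
# Ship the new form only if the primary metric improves AND the guardrail holds.
# All values below are hypothetical.

def lift(new: float, old: float) -> float:
    """Relative change of a rate versus the control."""
    return (new - old) / old

# Primary metric: demo-form completion rate.
conversion = {"control": 0.12, "treatment": 0.15}

# Guardrail metric: share of leads the sales team marks as qualified.
qualified_rate = {"control": 0.40, "treatment": 0.38}

primary_lift = lift(conversion["treatment"], conversion["control"])
guardrail_lift = lift(qualified_rate["treatment"], qualified_rate["control"])

GUARDRAIL_TOLERANCE = -0.05  # accept at most a 5% relative drop in lead quality

if primary_lift > 0 and guardrail_lift >= GUARDRAIL_TOLERANCE:
    print(f"Ship it: conversion +{primary_lift:.0%}, lead quality {guardrail_lift:+.0%}")
else:
    print("Hold: the primary win doesn't justify the guardrail hit")
```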