Experimentation is the backbone of every product company.
Spotify’s AI Playlist Generator, Meta’s Personalized Threads features, and Google’s new search update — these features aren’t just launched after someone in the product team wakes up one day and has a great idea. Rather, these companies only launch products after extensive testing.
Features are released and upgraded through constant experimentation, and the goal of these companies is to retain customer attention and keep users on the platform.
Teams of data scientists work on these experiments using a method called A/B testing. As a data scientist, I perform and analyze A/B tests almost daily, and I have been thoroughly questioned about A/B testing in every interview I have attended.
In this article, I will show you how to perform A/B tests in Python. By the end of this tutorial, you will understand what A/B tests are, when to use them, and the statistical concepts required to launch and analyze them.
What are A/B Tests?
An A/B test allows you to compare two versions of the same thing. For instance, if you had a website and wanted to check if people made more purchases with a red checkout button rather than a blue one, you could perform an A/B test.
Essentially, you could show half your users the blue button while the other half sees the red button. Then, after running this experiment for a month, you can launch your website with whichever button variation got more clicks.
Sounds simple, right?
However, there are some nuances that must be considered when you run an A/B test.
For example:
If the red button got 100 clicks and the blue button got 99 clicks, what if the difference between them is random? What if it isn’t the color of the button driving the additional click, but rather an external factor like user behavior or time of day?

How would you decide on which user sees the red button and who sees the blue one?
How many clicks are needed before you make a decision on which button is better? 10 clicks per group? One hundred? Or perhaps a thousand?
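On the question of who sees which variant, one common approach (an illustrative assumption here, not something this example prescribes) is to assign each user deterministically based on a hash of their user ID, so the split is roughly 50/50 and the same user always sees the same version:

import hashlib

def assign_variant(user_id: str) -> str:
    # Hash the user ID so each user is consistently assigned to the same group
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return "red_button" if bucket < 50 else "blue_button"

print(assign_variant("user_12345"))  # hypothetical user ID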
If an A/B test isn’t set up properly, your results will not be accurate, and you might end up making a decision that costs you (or the company) hundreds of thousands of dollars in sales.
In this tutorial, we will explore some best practices you must follow when implementing an A/B test.
I will provide you with an A/B testing framework — a step-by-step guide to creating a successful A/B test, along with sample Python code to implement each step.
You can refer to this guide and repurpose the code if you need to create your own A/B test.
You can also use the frameworks provided in this tutorial to prepare for A/B test related questions in data science and data analyst interviews.
How to Run an A/B Test in Python?
Let’s take the example of an e-commerce website.
The owner of this website, Jean, wants to change the color of her landing page from white to pink. She thinks this will increase the number of clicks on her landing page.
To decide whether to change the color of her landing page, Jean decides to run an A/B test, which includes the following steps:
1. Create a hypothesis
A hypothesis is a clear statement that defines what you are testing and what outcome you expect to observe.
In Jean’s example, the hypothesis would be:
“Changing the landing page color from white to pink will have no impact on clicks.”
This is called a null hypothesis (H0), which assumes that there will be no significant difference between the control group (white page) and treatment (pink page).
After running the A/B test, we can either:
- Reject H0 — there is a significant difference between control and treatment.
- Or fail to reject H0 — we couldn’t detect a significant difference between control and treatment.
In this example, if we reject the null hypothesis (H0), it means that there is a significant difference when the landing page color changes from white to pink.
If this difference is positive (i.e. increased clicks), then we can proceed to change the landing page color.
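In code, this decision comes down to comparing a p-value (which we will compute in step 4) against a chosen significance level. Here is a minimal sketch with a made-up p-value:

significance_level = 0.05  # commonly used threshold (alpha)
p_value = 0.03  # hypothetical value from the significance test in step 4

if p_value < significance_level:
    print("Reject H0: the pink page performs significantly differently from the white page.")
else:
    print("Fail to reject H0: no significant difference detected.")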
2. Defining Success Metrics
After formulating a hypothesis, you must define a success metric for your experiment.
This metric will decide if your null hypothesis should be rejected.
In the example of Jean’s landing page, the primary success metric can be one of the following:
- Click-Through-Rate (CTR)
- Clicks per User
- Clicks per Website Visit
To keep things simple, we will choose “Click-Through-Rate (CTR)” as our primary success metric.
This way, if the pink landing page (treatment) has a significantly higher CTR than the white page (control), we will reject the null hypothesis.
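As a reference point, here is a minimal sketch of how CTR could be computed from raw counts, assuming CTR is defined as landing page clicks divided by landing page visits (the numbers below are made up for illustration):

# Made-up counts for illustration
white_page = {"visits": 5000, "clicks": 250}
pink_page = {"visits": 5000, "clicks": 300}

white_ctr = white_page["clicks"] / white_page["visits"]
pink_ctr = pink_page["clicks"] / pink_page["visits"]
print(f"White page CTR: {white_ctr:.2%}, Pink page CTR: {pink_ctr:.2%}")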
3. Calculate Sample Size and Duration
After defining our hypothesis and success metric, we need to determine the sample size and how long the experiment will run.
Let’s say Jean’s website gets 100,000 monthly visitors.
Is it sufficient for her to run the experiment on 10% of this population? 50%? Or should she run the A/B test on her entire user base?
This is where concepts like statistical power and MDE (Minimum Detectable Effect) come in.
In simple terms, MDE is the smallest change we care about detecting.
For instance, if Jean sees a 0.1% increase in CTR with the new landing page, is this difference meaningful to her business?
Or does she need to see at least a 5% improvement to justify the development cost?
The MDE helps you determine your sample size. If Jean wants to reliably detect a small change in CTR (say, a relative lift of only a few percent), she might need a sample on the order of a million users.
Since she only gets 100K monthly visitors, that means Jean would have to run the A/B test on 100% of her website traffic for roughly 10 months.
In the real world, it isn’t practical to implement an experiment with such a small MDE since business decisions need to be made quickly.
Therefore, when running any experiment, a tradeoff must be made between statistical rigor and speed.
To simplify:
- Lower MDEs = Larger sample size
- Higher MDEs = Smaller sample size
The longer you run an experiment (and the more users you include), the smaller the differences you can reliably detect between your control and treatment groups.
However, is a negligible difference worth running a single experiment for a whole year?
To learn more about experiment sizing and finding the right tradeoff between MDEs and sample sizes, you can read this comprehensive tutorial.
Here is some sample Python code to compute the sample size and duration of an A/B test at different MDE thresholds:
Step 1: Defining Sample Size and Duration Functions
First, let’s create functions that take in your baseline conversion (in this case, Jean’s website’s current CTR), and return the required sample size and duration at various MDE thresholds:
(Note: if you’re not familiar with concepts like significance levels, MDE, and statistical power, refer to this tutorial.)
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import pandas as pd

def calculate_sample_size(baseline_conversion, mde, power=0.8, significance_level=0.05):
    # Expected conversion rate if the treatment produces the minimum detectable (relative) effect
    expected_conversion = baseline_conversion * (1 + mde)
    z_alpha = stats.norm.ppf(1 - significance_level / 2)
    z_beta = stats.norm.ppf(power)
    sd1 = np.sqrt(baseline_conversion * (1 - baseline_conversion))
    sd2 = np.sqrt(expected_conversion * (1 - expected_conversion))
    numerator = (z_alpha * np.sqrt(2 * sd1**2) + z_beta * np.sqrt(sd1**2 + sd2**2))**2
    denominator = (expected_conversion - baseline_conversion)**2
    sample_size_per_variant = np.ceil(numerator / denominator)
    return int(sample_size_per_variant)

def calculate_experiment_duration(sample_size_per_variant, daily_visitors, traffic_allocation=0.5):
    # Visitors allocated to the experiment are split evenly between the two variants
    visitors_per_variant_per_day = daily_visitors * traffic_allocation / 2
    days_required = np.ceil(sample_size_per_variant / visitors_per_variant_per_day)
    return int(days_required)
The first function calculates how many users you need for the experiment.
The second function then takes the output of the first function and uses it to calculate the experiment duration, given the number of daily users available (in this case, the daily traffic to Jean’s website).
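As a quick sanity check, here is how you might call the two functions directly, using Jean's baseline CTR of 5% (as used in the next step) and a hypothetical 5% MDE at 50% traffic allocation:

# Hypothetical inputs: 5% baseline CTR, 5% relative MDE, ~3,333 daily visitors
n_per_variant = calculate_sample_size(baseline_conversion=0.05, mde=0.05)
days = calculate_experiment_duration(n_per_variant, daily_visitors=100000 / 30,
                                     traffic_allocation=0.5)
print(f"Sample size per variant: {n_per_variant:,}")
print(f"Estimated duration at 50% traffic: {days} days")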
Step 2: Calculating Sample Sizes For a Range of MDEs
Now, we can create a data frame that gives us a range of sample sizes for different MDEs:
# Example MDE/sample size tradeoff for Jean's website
daily_visitors = 100000 / 30  # Convert monthly to daily visitors
baseline_conversion = 0.05  # Jean's current landing page CTR (baseline conversion rate of 5%)

# Create a table of sample sizes for different MDEs
mde_values = [0.01, 0.02, 0.03, 0.05, 0.10, 0.15]  # 1% to 15% relative change
traffic_allocations = [0.1, 0.5, 1.0]  # 10%, 50%, and 100% of website traffic

results = []
for mde in mde_values:
    sample_size = calculate_sample_size(baseline_conversion, mde)
    for allocation in traffic_allocations:
        duration = calculate_experiment_duration(sample_size, daily_visitors, allocation)
        results.append({
            'MDE': f"{mde*100:.1f}%",
            'Traffic Allocation': f"{allocation*100:.0f}%",
            'Sample Size per Variant': f"{sample_size:,}",
            'Duration (days)': duration
        })

# Create a DataFrame and display the results
df_results = pd.DataFrame(results)
print("Sample Size and Duration for Different MDEs:")
print(df_results)
If you’d like to repurpose the above code, you just need to change the following parameters (a short example follows this list):
- Daily users
- Baseline conversion — Change this to the metric you’re observing, such as “app open rate” or “user cancellation rate”
- MDE values — In this example we’ve listed a range of MDEs from 1% to 15%. This will differ based on your specific business scenario. For example, if you’re running an A/B test for a large tech company with millions of monthly users, you’re probably looking at MDEs in the range of 0.01% to 0.05%.
- Traffic allocation — This will vary depending on the amount of users you are willing to experiment on.
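For instance, here is how the same two functions might be repurposed for a hypothetical mobile app measuring app open rate; every number below is made up purely for illustration:

# Hypothetical scenario: 20,000 daily users and a 30% baseline app open rate
app_daily_users = 20000
baseline_open_rate = 0.30

for mde in [0.01, 0.02, 0.05]:  # 1%, 2%, and 5% relative changes
    n = calculate_sample_size(baseline_open_rate, mde)
    days = calculate_experiment_duration(n, app_daily_users, traffic_allocation=1.0)
    print(f"MDE {mde*100:.0f}%: {n:,} users per variant, ~{days} days at 100% traffic")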
Step 3: Visualizing the Relationship Between Sample Size and MDEs
To make your results more interpretable, you can create a graph to help you visualize the tradeoff between MDE and sample size:
# Visualize the relationship between MDE and sample size
plt.figure(figsize=(10, 6))
mde_range = np.arange(0.01, 0.2, 0.01)
sample_sizes = [calculate_sample_size(baseline_conversion, mde) for mde in mde_range]
plt.plot(mde_range * 100, sample_sizes)
plt.xlabel('Minimum Detectable Effect (%)')
plt.ylabel('Required Sample Size per Variant')
plt.title('Required Sample Size vs. MDE')
plt.grid(True)
plt.yscale('log')
plt.tight_layout()
plt.savefig('sample_size_vs_mde.png')
plt.show()
Charts like this are useful when presenting your results to business stakeholders. This can help business teams easily make a decision as to which MDE/sample size tradeoff is acceptable when running an experiment.
4. Analyze A/B Test Results
After deciding on a sample size and experiment duration, you can finally run the A/B test and collect the results required to make a business decision.
When analyzing the results of an A/B test, we need to ask the following questions:
- Is there a difference in performance between Variant A and Variant B? In our example, this becomes: “Is there a difference in click-through rate between the pink and white landing pages?”
- Is this difference statistically significant?
Here are the elements you must measure when analyzing the results of an A/B test:
- Statistical significance — Is the observed difference between your control and treatment group statistically significant?
- Confidence interval — The range where the true effect likely lies. If the confidence interval contains 0, it indicates that there is no statistically significant difference between control and treatment.
- Effect size — What is the magnitude of the difference between control and treatment?
Here is a block of Python code that can be used to perform the above calculations:
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import pandas as pd

def analyze_ab_test_results(control_visitors, control_conversions,
                            treatment_visitors, treatment_conversions,
                            significance_level=0.05):
    # Calculate conversion rates
    control_rate = control_conversions / control_visitors
    treatment_rate = treatment_conversions / treatment_visitors

    # Calculate absolute and relative differences (effect size)
    absolute_diff = treatment_rate - control_rate
    relative_diff = absolute_diff / control_rate

    # Calculate standard errors
    control_se = np.sqrt(control_rate * (1 - control_rate) / control_visitors)
    treatment_se = np.sqrt(treatment_rate * (1 - treatment_rate) / treatment_visitors)

    # Calculate z-score
    pooled_se = np.sqrt(control_se**2 + treatment_se**2)
    z_score = absolute_diff / pooled_se

    # Calculate p-value (two-tailed test)
    p_value = 2 * (1 - stats.norm.cdf(abs(z_score)))

    # Calculate confidence interval
    z_critical = stats.norm.ppf(1 - significance_level / 2)
    margin_of_error = z_critical * pooled_se
    ci_lower = absolute_diff - margin_of_error
    ci_upper = absolute_diff + margin_of_error

    # Determine if the result is statistically significant
    is_significant = p_value < significance_level

    # Return the results as a summary table
    summary = pd.DataFrame({
        'Metric': ['Control Rate', 'Treatment Rate', 'Absolute Difference',
                   'Relative Difference', 'P-Value', 'Confidence Interval',
                   'Statistically Significant'],
        'Value': [f"{control_rate:.4f}", f"{treatment_rate:.4f}",
                  f"{absolute_diff:.4f}", f"{relative_diff:.2%}",
                  f"{p_value:.4f}", f"[{ci_lower:.4f}, {ci_upper:.4f}]",
                  is_significant]
    })
    return summary
You just need to enter the number of users and conversions into the above function, and you will get a summary table that looks like this:

Image by Author: example summary table of A/B test results
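For reference, here is a hypothetical call to the function above. The visitor and conversion counts are made up for illustration (they correspond to a 5.0% control CTR and a 5.5% treatment CTR, i.e. a roughly 10% relative lift):

# Hypothetical example inputs - replace with your own experiment data
summary = analyze_ab_test_results(
    control_visitors=50000, control_conversions=2500,      # white page (control)
    treatment_visitors=50000, treatment_conversions=2750,  # pink page (treatment)
)
print(summary)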
Going back to Jean’s landing page, the summary table makes it clear that the pink landing page delivers a statistically significant 10% relative improvement in CTR.
She can then make the business decision to change her landing page color from white to pink.
Takeaways
If you’ve come this far in the article, congratulations!
You now have a solid grasp of what A/B testing is, how to run A/B tests, and the statistical concepts behind this practice.
You can also repurpose the code provided in this article to run and analyze the results of other A/B tests.
Additionally, if you found some of the concepts in this article confusing, don’t fret!
A/B testing isn’t always straightforward, and if you are a beginner to statistics, it can be difficult to determine sample sizes, run tests, and interpret results.
As a next step, I suggest taking Udacity’s A/B Testing course if you’d like a more comprehensive tutorial on the subject. This course is taught by data scientists at Google and is entirely free.
Then, to put your skills into practice, you can find an A/B test data set on Kaggle and analyze it to generate a business recommendation.
Natassha Selvaraj is a self-taught data scientist with a passion for writing. She writes on everything data science-related and is a true master of all data topics. You can connect with her on LinkedIn or check out her YouTube channel.