Expediting Decision-Making in Mobile Gaming: A/B Testing Meets Machine Learning


Image by the author with DALL-E 3

Reward structures, game mechanics, and user engagement strategies must evolve continually, driven by data, to retain players and keep them coming back. But how can a company decide which changes work best, especially in a fast-paced market where decisions must be made quickly? The answer lies in A/B testing, a tried-and-true method for measuring the impact of changes in a controlled environment.

In today’s highly competitive mobile gaming market, the difference between a successful game and one that falters can come down to how effectively a company rewards and engages its players. Mobile game development companies thrive on understanding user behavior, fine-tuning in-game mechanics, and iterating on features quickly. One of the most crucial tools that can help companies refine their player experience is A/B testing. It allows developers to test different versions of their game, user interface, or rewards system by splitting users into groups and observing which version yields better results.

However, traditional A/B testing can be slow, and waiting for statistical significance often takes too long, resulting in missed opportunities. Imagine running a test for a new rewards mechanism in a game — while the traditional A/B testing method takes time to reach a statistically significant result, your competitors are already launching their next big feature. Companies in the mobile gaming sector need faster results to stay competitive. This is where advanced probabilistic methods and machine learning models can revolutionize A/B testing, allowing companies to make data-driven decisions in real time, without waiting for the long tail of user data.

In this blog post, we’ll explore not just A/B testing but also sophisticated Bayesian A/B testing, multi-armed bandits, and machine learning models that can supercharge your testing process and give you an edge in the mobile gaming industry.

A/B testing, at its core, is a randomized experiment where two variants (let’s say version A and version B) of a particular feature are compared. The purpose is to determine which version performs better regarding a specific metric — whether it’s conversions, engagement, or retention. For mobile gaming companies, A/B testing is critical in evaluating player engagement, optimizing game rewards, and increasing revenue.

For example, you may want to test a new reward system or a different tutorial flow for new users. Group A gets the old system, while Group B experiences the new one. After a set period, you compare key performance indicators (KPIs) like user retention or in-app purchases.

Given the number of features mobile gaming companies experiment with — rewards, difficulty levels, in-app purchases — it’s common to run multiple A/B tests simultaneously. However, there’s a catch: The results of one test can potentially influence the others. Managing these overlapping tests while maintaining the integrity of each is challenging but essential for avoiding skewed or biased outcomes.

In addition, the longer you wait for results, the more competitive ground you can lose. This is why timely decision-making is vital. Delayed action on A/B test results can result in missed opportunities, lost players, or falling behind competitors who can act faster.

Traditional A/B testing requires a certain sample size to achieve statistical significance, often delaying decisions. What if you could confidently predict the outcome before reaching that critical mass of users? That’s where probabilistic methods like Bayesian A/B testing and Multi-Armed Bandits come into play. These approaches allow you to infer results with fewer data, reducing the time it takes to make decisions and deploy changes.

  1. Bayesian A/B Testing: Unlike traditional (frequentist) A/B testing, which requires a pre-determined sample size to calculate the probability of an outcome, Bayesian A/B testing provides a more flexible, iterative approach. It uses prior knowledge combined with new data to continuously update the probability that one variation is better than the other.
    In a mobile gaming context, Bayesian A/B testing can provide quicker insights about whether a new reward mechanic is likely to outperform the existing one, even with fewer users tested. By calculating the probability of one variant being superior, you can confidently roll out changes before reaching statistical significance in the classical sense.
  2. Multi-Armed Bandits: Imagine a scenario where you’re pulling levers on multiple slot machines (arms), each with a different probability of winning. Multi-armed bandit algorithms optimize decision-making by dynamically adjusting the focus on more promising variants (e.g., A or B) as more data comes in. This approach minimizes regret by balancing exploration (testing new variants) with exploitation (choosing the best-performing one).
  3. Epsilon-Greedy Algorithm: The epsilon-greedy algorithm is a simple multi-armed bandit strategy that chooses the best-performing version with probability (1 − ε) and a random version with probability ε. This ensures that the algorithm explores different versions while exploiting the best-performing one (a minimal sketch of this and the next two strategies follows this list).
  4. Upper Confidence Bound (UCB) Algorithm: The UCB algorithm is another multi-armed bandit strategy that selects the version with the highest upper confidence bound. This approach balances exploration and exploitation by considering both the estimated performance and the uncertainty of each version.
  5. Thompson Sampling: A specific strategy for solving the Multi-Armed Bandit problem, Thompson Sampling uses Bayesian inference to select the arm (variant) that’s most likely to provide the best result based on the data collected so far.
  6. Reinforcement Learning: A more sophisticated approach that allows the system to “learn” optimal actions based on past rewards. In the context of A/B testing, reinforcement learning can continuously adjust and adapt to evolving data in real-time, providing actionable insights faster than traditional methods.
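
To make these strategies concrete, here is a minimal sketch of the three selection rules above, assuming we track how often each arm has been pulled (counts) and how many conversions it produced (successes); the numbers are illustrative, not from this post's dataset.

import numpy as np

# Hypothetical per-arm statistics (illustrative values)
counts = np.array([120, 80])     # times each arm (A, B) was shown
successes = np.array([30, 28])   # conversions observed per arm
rates = successes / counts       # empirical conversion rates

def epsilon_greedy(rates, eps=0.1):
    # Explore a random arm with probability eps; otherwise exploit the best one
    if np.random.rand() < eps:
        return np.random.randint(len(rates))
    return int(np.argmax(rates))

def ucb(rates, counts, t):
    # Pick the arm with the highest optimistic (upper confidence bound) estimate
    bonus = np.sqrt(2 * np.log(t) / counts)
    return int(np.argmax(rates + bonus))

def thompson(successes, counts):
    # Draw a plausible rate per arm from its Beta posterior, then pick the best draw
    samples = np.random.beta(successes + 1, counts - successes + 1)
    return int(np.argmax(samples))

print(epsilon_greedy(rates), ucb(rates, counts, t=counts.sum()), thompson(successes, counts))

All three return the index of the arm to serve next; the difference lies purely in how aggressively they trade exploration for exploitation.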

Machine learning models can further enhance A/B testing by automating the process of identifying the best variants and segmenting users based on complex patterns in the data. Traditional A/B testing treats all users as equal, but machine learning allows for personalized testing strategies where each user’s response to a change can be weighted differently, based on their profile.

  1. Logistic Regression: A simple yet effective model for binary outcomes (such as conversions). Logistic regression is often the first step in modeling A/B test results, providing insights into how different user attributes (age, location, etc.) affect the outcome.
  2. Support Vector Machines (SVM): SVM can classify results more robustly when the data is not linearly separable, making it suitable for complex A/B test data.
  3. Random Forest: A popular ensemble learning method, Random Forest can model complex, non-linear relationships and provide feature importance scores, which can help identify which variables (e.g., age, location, game type) are most influential in determining user behavior.
  4. Neural Networks: These deep learning models can capture very complex relationships between user features and outcomes, making them ideal for high-dimensional data typical of mobile gaming companies.
  5. Reinforcement Learning: Unlike traditional supervised learning models, reinforcement learning allows for continuous learning and adaptation as the test progresses. This is ideal for dynamic environments where user behavior can change over time.

Each of these models offers unique advantages, but they also require careful tuning and enough data to be effective. Machine learning can help prioritize which A/B tests to run, predict outcomes, or even optimize the testing process in real-time.

Image by the author

Now, let’s dive into the code that brings these concepts to life. Below, we will walk through different sections of the code, explaining how it simulates and analyzes data using A/B testing, Bayesian methods, and machine learning algorithms.

Generating Synthetic Data

The first part of any project involving machine learning and A/B testing starts with importing the necessary libraries. Here’s a breakdown of what each library is used for:

  • Numpy and Pandas: Numpy is used for numerical operations, while Pandas is crucial for data manipulation.
  • Sklearn: This suite is used for machine learning models, preprocessing, evaluation metrics, and data splitting.
  • Scipy: Essential for statistical tests, especially chi-squared tests and Mann-Whitney U tests.
  • Datetime and Timedelta: These help manage and manipulate date-time operations.
  • Matplotlib and Seaborn: Both libraries are visualization tools, allowing us to plot results and confusion matrices.

Each library plays a vital role in building a solid foundation for A/B testing and machine learning experiments.

Next part is to generate the synthetic data that mimics user interactions in a mobile game. This data simulates various factors such as created_date, age_group, install_date, channel (organic vs. paid), game_type, location, last_activity_date, cohort_age, and group that could influence whether a user converts (e.g., generates in-app revenue, installs the app, plays the games, etc.).

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from scipy import stats
from scipy.stats import chi2_contingency, beta
from datetime import datetime, timedelta
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
import seaborn as sns
import matplotlib.pyplot as plt
import random

def generate_synthetic_data(n=20000):
    np.random.seed(42)
    user_ids = np.arange(n)
    created_dates = [datetime.now() - timedelta(days=np.random.randint(0, 1000)) for _ in range(n)]
    age_groups = np.random.choice(['18-24', '25-34', '35-44', '45-54', '55+'], size=n)
    install_dates = [created_date + timedelta(days=np.random.randint(0, 30)) for created_date in created_dates]
    channels = np.random.choice(['Organic', 'Paid', 'Referral'], size=n)
    game_types = np.random.choice(['Puzzle', 'Strategy', 'Arcade'], size=n)
    locations = np.random.choice(['US', 'UK', 'CA', 'AU', 'IN'], size=n)
    last_activity_dates = [install_date + timedelta(days=np.random.randint(0, 300)) for install_date in install_dates]
    cohort_ages = [(datetime.now() - created_date).days for created_date in created_dates]
    groups = np.random.choice(['A', 'B'], size=n)

    age_group_numeric = (age_groups == '25-34').astype(float)
    channel_numeric = (channels == 'Paid').astype(float)
    game_type_numeric = (game_types == 'Puzzle').astype(float)
    location_numeric = (locations == 'US').astype(float)

    base_log_odds_A = -2.0
    base_log_odds_B = 2.0

    coef_age = 1.5
    coef_channel = 1.8
    coef_game = 1.2
    coef_location = 1.0
    coef_interaction = 2.0

    log_odds_A = (base_log_odds_A +
                  coef_age * age_group_numeric +
                  coef_channel * channel_numeric +
                  coef_game * game_type_numeric +
                  coef_location * location_numeric +
                  coef_interaction * age_group_numeric * channel_numeric +
                  np.random.normal(0, 0.3, n))

    log_odds_B = (base_log_odds_B +
                  coef_age * age_group_numeric +
                  coef_channel * channel_numeric +
                  coef_game * game_type_numeric +
                  coef_location * location_numeric +
                  coef_interaction * age_group_numeric * channel_numeric +
                  np.random.normal(0, 0.3, n))

    probs_A = 1 / (1 + np.exp(-log_odds_A))
    probs_B = 1 / (1 + np.exp(-log_odds_B))

    conversions = np.where(
        groups == 'A',
        np.random.binomial(1, probs_A),
        np.random.binomial(1, probs_B)
    )

    return pd.DataFrame({
        'user_id': user_ids,
        'created_date': created_dates,
        'age_group': age_groups,
        'install_date': install_dates,
        'channel': channels,
        'game_type': game_types,
        'location': locations,
        'last_activity_date': last_activity_dates,
        'cohort_age': cohort_ages,
        'group': groups,
        'converted': conversions
    })

The generate_synthetic_data(n=20000) function is designed to generate synthetic user data for an A/B test simulation, where users are grouped into two cohorts (Group A and Group B). It models user behavior, including whether users convert (perform a specific action, such as a purchase) based on various characteristics.

Apart from the normal user characteristics, you will observe that I am creating log_odds for each variant.

The conversion likelihood is modeled using logistic regression, where each user in Group A or Group B has different log-odds based on several factors:

Base log-odds:

  • Group A: base_log_odds_A = -2.0
  • Group B: base_log_odds_B = 2.0

Coefficients for features:

  • coef_age: Impact of being in the '25-34' age group.
  • coef_channel: Impact of being acquired via Paid channels.
  • coef_game: Impact of playing Puzzle games.
  • coef_location: Impact of being located in the US.
  • coef_interaction: Interaction between age group and acquisition channel.

The log-odds are calculated separately for Group A and Group B using these features and coefficients. From the log-odds, the conversion probability is obtained via the logistic function. For each user, whether they convert is then determined by sampling from a binomial distribution using their calculated probability. This completes the generation of our user data.
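
To make the arithmetic concrete, here is a quick check (my illustration, ignoring the Gaussian noise term) of how the baseline log-odds and coefficients translate into conversion probabilities:

import numpy as np

def logistic(z):
    return 1 / (1 + np.exp(-z))

# A user matching none of the boosted segments:
print(logistic(-2.0))  # Group A baseline: ~0.12
print(logistic(2.0))   # Group B baseline: ~0.88

# A 25-34, Paid-channel user (age, channel, and interaction effects apply):
print(logistic(-2.0 + 1.5 + 1.8 + 2.0))  # Group A: ~0.96
print(logistic(2.0 + 1.5 + 1.8 + 2.0))   # Group B: ~0.999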

Preprocessing Data

Once the data is generated, it needs to be preprocessed for machine learning models. Here, categorical variables are encoded as numeric values.

def preprocess_data(data):
    data = data.copy()
    categorical_columns = ['age_group', 'channel', 'game_type', 'location', 'group']
    for col in categorical_columns:
        data[col] = data[col].astype('category').cat.codes
    return data

The preprocess_data function converts categorical columns (e.g., age group, channel) into numeric codes, making them suitable for machine learning models.
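
A quick usage sketch shows the effect. Note that cat.codes is a simple label encoding, which assigns arbitrary integer codes per column; one-hot encoding would avoid implying an order between categories, but label encoding keeps this example compact.

sample = generate_synthetic_data(n=5)
print(sample[['age_group', 'channel', 'group']])
print(preprocess_data(sample)[['age_group', 'channel', 'group']])
# e.g., 'A'/'B' become 0/1; age buckets become 0-4 in lexicographic order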

Evaluating the Model

The next step involves evaluating the performance of machine learning models using various metrics like accuracy, precision, recall, and F1 score.

def evaluate_model(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    chi2, p_value, _, _ = chi2_contingency(cm)
    return cm, accuracy, precision, recall, f1, p_value

def plot_confusion_matrix(cm, model_name):
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title(f'Confusion Matrix - {model_name}')
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()

The evaluate_model function computes standard evaluation metrics, such as:

  • Confusion Matrix: A breakdown of predicted vs. actual values.
  • Accuracy: The proportion of correctly classified instances.
  • Precision: The ratio of true positive predictions to all positive predictions.
  • Recall: The ratio of true positives to all actual positives.
  • F1 Score: The harmonic mean of precision and recall, balancing the two metrics.
  • Chi-Squared Test: This statistical test checks the independence of predicted and actual results, giving us a p-value to measure statistical significance.

The plot_confusion_matrix function uses Seaborn to plot a confusion matrix, providing a visual representation of how well a model predicts outcomes. The matrix allows for easy identification of true positives, true negatives, false positives, and false negatives.

Visualizing the results of the model is a key component of machine learning and A/B testing analysis. It helps stakeholders better understand the results.

Bayesian A/B Testing

Bayesian A/B testing uses prior knowledge and new data to compute the probability that one variant is better than the other.

def bayesian_ab_test(data_a, data_b, n_simulations=100000):
    a_success = np.sum(data_a)
    a_trials = len(data_a)
    b_success = np.sum(data_b)
    b_trials = len(data_b)

    a_posterior = beta(a_success + 1, a_trials - a_success + 1)
    b_posterior = beta(b_success + 1, b_trials - b_success + 1)

    a_samples = a_posterior.rvs(n_simulations)
    b_samples = b_posterior.rvs(n_simulations)

    prob_b_better = np.mean(b_samples > a_samples)
    expected_loss = np.mean(np.maximum(a_samples - b_samples, 0))

    return prob_b_better, expected_loss

def plot_posterior_distributions(data_a, data_b):
    a_success = np.sum(data_a)
    a_trials = len(data_a)
    b_success = np.sum(data_b)
    b_trials = len(data_b)

    a_posterior = beta(a_success + 1, a_trials - a_success + 1)
    b_posterior = beta(b_success + 1, b_trials - b_success + 1)

    x = np.linspace(0, 1, 1000)
    plt.figure(figsize=(10, 6))
    plt.plot(x, a_posterior.pdf(x), label='A')
    plt.plot(x, b_posterior.pdf(x), label='B')
    plt.xlabel('Conversion Rate')
    plt.ylabel('Density')
    plt.title('Posterior Distributions')
    plt.legend()
    plt.show()

The bayesian_ab_test function computes the probability that variant B is better than A using Bayesian inference, starting from a uniform Beta(1, 1) prior (hence the +1 terms in the posterior parameters). It also calculates the expected loss if we were to choose A over B.

The plot_posterior_distributions function serves to provide a visual comparison of the posterior distributions of the success probabilities (conversion rates) for two groups, A and B, using Bayesian inference. This comparison helps assess which group has a higher likelihood of success based on the observed data.
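
This is also where the speed advantage claimed earlier becomes tangible: instead of waiting for a fixed sample size, one can re-run the test as data accumulates and stop once the evidence is strong enough. A minimal sketch, assuming data_a and data_b hold the per-group converted arrays (extracted later in this post) and using an illustrative 95% decision threshold:

decision_threshold = 0.95  # assumed threshold for P(B > A) before shipping B

for n_users in [500, 1000, 2000, 5000, 10000]:
    prob_b, loss = bayesian_ab_test(data_a[:n_users], data_b[:n_users])
    print(f"{n_users} users/group: P(B > A) = {prob_b:.4f}, expected loss = {loss:.4f}")
    if prob_b > decision_threshold:
        print(f"Decision reached after {n_users} users per group")
        break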

Multi-Armed Bandit

Multi-Armed Bandit algorithms dynamically allocate users to different variants, favoring those that appear more promising as more data is collected.

def multi_armed_bandit(data, n_rounds=1000, alpha=0.05):
    arms = ['A', 'B']
    counts = {arm: 0 for arm in arms}
    rewards = {arm: 0 for arm in arms}

    for _ in range(n_rounds):
        arm = random.choice(arms)
        counts[arm] += 1
        if arm == 'A':
            reward = np.random.binomial(1, data[data['group'] == 'A']['converted'].mean())
        else:
            reward = np.random.binomial(1, data[data['group'] == 'B']['converted'].mean())
        rewards[arm] += reward

    # Calculate statistical significance (p-value)
    _, p_value = stats.ttest_ind(
        data[data['group'] == 'A']['converted'],
        data[data['group'] == 'B']['converted']
    )

    significant = p_value < alpha
    return rewards, p_value, significant

The multi_armed_bandit function mimics a Multi-Armed Bandit loop, but note that it selects an arm uniformly at random in each round, so it does not yet shift allocation toward the better performer; it is a simplified baseline, as acknowledged later in this post.
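
For comparison, here is a minimal sketch of what an adaptive allocation would look like, swapping the uniform random choice for Thompson Sampling (this is my illustration, not the post's implementation):

def adaptive_bandit(data, n_rounds=1000):
    # True conversion rates, used only to simulate rewards (unknown to the algorithm)
    true_rates = {
        'A': data[data['group'] == 'A']['converted'].mean(),
        'B': data[data['group'] == 'B']['converted'].mean(),
    }
    counts = {'A': 0, 'B': 0}
    successes = {'A': 0, 'B': 0}

    for _ in range(n_rounds):
        # Thompson Sampling: sample each arm's rate from its Beta posterior
        sampled = {
            arm: np.random.beta(successes[arm] + 1, counts[arm] - successes[arm] + 1)
            for arm in ('A', 'B')
        }
        arm = max(sampled, key=sampled.get)  # serve the arm with the best draw
        reward = np.random.binomial(1, true_rates[arm])
        counts[arm] += 1
        successes[arm] += reward

    return counts, successes

Over the rounds, the counts drift toward the better arm, which is exactly the regret-minimizing behavior described earlier.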

Reinforcement Learning for A/B Testing

Reinforcement Learning (RL) can be applied to continuously optimize decisions by learning from the environment’s feedback. In the context of A/B testing, RL helps to dynamically adjust the selection of variants (A, B, etc.) based on observed user behavior and outcomes.

def reinforcement_learning(data, alpha=0.05, n_iterations=1000):
    np.random.seed(42)
    total_reward = 0
    rewards_A = []
    rewards_B = []

    # Calculate mean conversion rates for groups A and B
    mean_conversion_A = data[data['group'] == 'A']['converted'].mean()
    mean_conversion_B = data[data['group'] == 'B']['converted'].mean()

    for _ in range(n_iterations):
        action = np.random.choice([0, 1])  # 0 for group A, 1 for group B
        if action == 0:
            reward = np.random.binomial(1, mean_conversion_A)  # Simulate reward for group A
            rewards_A.append(reward)
        else:
            reward = np.random.binomial(1, mean_conversion_B)  # Simulate reward for group B
            rewards_B.append(reward)
        total_reward += reward

    # Use Mann-Whitney U test for significance testing
    _, p_value = stats.mannwhitneyu(rewards_A, rewards_B, alternative='two-sided')

    significant = p_value < alpha
    avg_reward = total_reward / n_iterations

    # Return dummy values for cm, accuracy, precision, recall, and f1
    cm = None
    accuracy = None
    precision = None
    recall = None
    f1 = None

    return cm, accuracy, precision, recall, f1, avg_reward, p_value, significant

Reinforcement learning (RL) is a paradigm where agents learn to make decisions by receiving feedback from the environment. In this section, the code implements a simplified, RL-flavored simulation that explores the performance of groups A and B over multiple iterations; note that actions are chosen uniformly at random, so no policy is actually updated (a sketch of a genuine learning update follows the breakdown below).

1.) Mean Conversion Calculation:

  • The code starts by calculating the mean conversion rates for groups A and B (mean_conversion_A and mean_conversion_B). These means serve as the success probabilities for the simulated reward generation.

2.) Simulated Decision-Making:

  • The loop runs for n_iterations (default 1,000 iterations) to simulate decision-making.
  • np.random.choice([0, 1]) randomly selects an action, where 0 represents choosing group A and 1 represents choosing group B.
  • Based on the chosen action, the code simulates a reward using a binomial distribution (np.random.binomial). The probability of success is determined by the group’s mean conversion rate (mean_conversion_A or mean_conversion_B).

3.) Rewards Collection:

  • Rewards for each action are collected in rewards_A or rewards_B lists, and the total reward is updated.

4.) Statistical Testing:

  • After collecting rewards, the code applies the Mann-Whitney U test to determine if there is a statistically significant difference between the rewards from groups A and B. This non-parametric test is suitable for comparing two independent samples.
  • The result (p_value) is compared against a significance level (alpha), and significant is set to True if the p-value is below the threshold.

5.) Returning Results:

  • The function returns a set of dummy values for evaluation metrics (e.g., None for confusion matrix and accuracy) along with the average reward, p-value, and a boolean indicating whether the difference was statistically significant.
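
Since the function above samples actions uniformly, no learning actually takes place. For contrast, here is a minimal sketch of an agent that does learn, using epsilon-greedy action selection with an incremental value estimate (again my illustration, not the post's implementation):

def learning_agent(data, n_iterations=1000, epsilon=0.1):
    means = [data[data['group'] == g]['converted'].mean() for g in ('A', 'B')]
    q_values = [0.0, 0.0]  # estimated value of each action
    pulls = [0, 0]

    for _ in range(n_iterations):
        # Epsilon-greedy: mostly exploit the current best estimate, sometimes explore
        if np.random.rand() < epsilon:
            action = np.random.randint(2)
        else:
            action = int(np.argmax(q_values))
        reward = np.random.binomial(1, means[action])
        pulls[action] += 1
        # Incremental mean update: Q <- Q + (reward - Q) / n
        q_values[action] += (reward - q_values[action]) / pulls[action]

    return q_values, pulls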

Model Training and Evaluation

This section provides a comprehensive comparison of multiple machine learning models and a simplified reinforcement learning approach. Each model offers unique advantages:

  • Logistic Regression is simple and interpretable.
  • SVM is effective for non-linear decision boundaries.
  • Random Forest is robust and handles feature interactions well.
  • Neural Network can capture complex patterns in the data.
  • Reinforcement Learning simulates decision-making over time.

By comparing these models, one can select the most suitable approach for their specific A/B testing scenario.

# Generate and preprocess data
data = generate_synthetic_data()
preprocessed_data = preprocess_data(data)

# Prepare features and target
feature_columns = ['age_group', 'channel', 'game_type', 'location', 'cohort_age', 'group']
X = preprocessed_data[feature_columns]
y = preprocessed_data['converted']

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Define models
models = {
    "Logistic Regression": LogisticRegression(random_state=42, max_iter=1000),
    "SVM": SVC(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Neural Network": MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=1000, random_state=42),
    "Reinforcement Learning": reinforcement_learning
}

# Train and evaluate models
for model_name, model in models.items():
    if model_name == "Reinforcement Learning":
        cm, accuracy, precision, recall, f1, avg_reward, p_value, significant = model(data)
        # Print Reinforcement Learning Results
        print(f"\n{model_name} Results:")
        print(f"Average Reward: {avg_reward:.4f}")
        print(f"P-value: {p_value:.4e}")
        print(f"Statistically Significant: {significant}")
    else:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        cm, accuracy, precision, recall, f1, p_value = evaluate_model(y_test, y_pred)

        print(f"\n{model_name} Results:")
        print(f"Accuracy: {accuracy:.4f}")
        print(f"Precision: {precision:.4f}")
        print(f"Recall: {recall:.4f}")
        print(f"F1 Score: {f1:.4f}")
        print(f"P-value: {p_value:.4e}")
        print(f"Statistically Significant: {p_value < 0.05}")
        print("Confusion Matrix:")
        print(cm)

        # Plot only for models with a confusion matrix (the RL branch returns None)
        plot_confusion_matrix(cm, model_name)

The code basically does the following:

Data Preparation

  • The synthetic data is generated using the generate_synthetic_data function and then preprocessed with preprocess_data.
  • The relevant features (age_group, channel, game_type, location, cohort_age, and group) are selected, and the target variable (converted) is extracted.

Feature Scaling

  • StandardScaler is used to scale the features to ensure uniformity in their ranges, which is essential for some models like Support Vector Machines (SVM) and neural networks.

Train-Test Split

  • The data is split into training (80%) and test sets (20%) using train_test_split. This split is critical for evaluating model performance on unseen data.

Model Definitions

  • Four traditional machine learning models (Logistic Regression, SVM, Random Forest, Neural Network) and one custom implementation (Reinforcement Learning) are defined in a dictionary.

Training and Evaluation

  • For each model (except reinforcement learning), the code trains the model on the training data (model.fit) and evaluates it using the test data (model.predict).
  • For the reinforcement learning model, the function is called directly with the dataset to perform the simulation and statistical testing.

Evaluation

  • Various evaluation metrics (accuracy, precision, recall, F1 score) and a confusion matrix are printed for each traditional model.
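
One thing the loop does not surface is the feature importance mentioned when introducing Random Forest. Assuming the training loop above has run, a short sketch to inspect it:

rf = models["Random Forest"]  # already fitted in place by the loop above
importances = pd.Series(rf.feature_importances_, index=feature_columns)
print(importances.sort_values(ascending=False))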

The Output (ML models)

The KPIs of all the ML models, along with their statistical significance, are printed.

Logistic Regression Output (Image by the author)
SVM Output (Image by the author)
Random Forest Output (Image by the author)
Neural Network Output (Image by the author)
Reinforcement Learning Output (Image by the author)

Applying Bayesian A/B, Multi-Armed Bandit and Plotting Results

As mentioned above, Bayesian and Multi-Armed Bandit methods provide alternative approaches to traditional statistical tests, such as t-tests or chi-square tests, offering more flexible and actionable insights in certain contexts.

# Bayesian A/B Test
data_a = data[data['group'] == 'A']['converted'].values
data_b = data[data['group'] == 'B']['converted'].values

prob_b_better, expected_loss = bayesian_ab_test(data_a, data_b)

print("\nBayesian A/B Test Results:")
print(f"Probability that B is better than A: {prob_b_better:.4f}")
print(f"Expected loss of choosing A over B: {expected_loss:.4f}")

plot_posterior_distributions(data_a, data_b)

# Multi-Armed Bandit
rewards, p_value, significant = multi_armed_bandit(data)

print("\nMulti-Armed Bandit Results:")
print(f"Rewards for A: {rewards['A']}")
print(f"Rewards for B: {rewards['B']}")
print(f"P-value: {p_value:.4f}")
print(f"Statistically Significant: {significant}")

# Plot Multi-Armed Bandit results
plt.figure(figsize=(10, 6))
plt.bar(rewards.keys(), rewards.values())
plt.title("Multi-Armed Bandit Rewards")
plt.xlabel("Arm")
plt.ylabel("Total Reward")
plt.show()

# Print conversion rates for groups A and B
conversion_rate_A = data[data['group'] == 'A']['converted'].mean()
conversion_rate_B = data[data['group'] == 'B']['converted'].mean()
print(f"\nConversion Rate A: {conversion_rate_A:.4f}")
print(f"Conversion Rate B: {conversion_rate_B:.4f}")

Let us understand what the code is achieving here:

  • The lines data_a = data[data['group'] == 'A']['converted'].values and data_b = data[data['group'] == 'B']['converted'].values extract the converted values for each group (A and B), which represent whether a user has converted (e.g., made a purchase or clicked a link). This is the binary outcome used to measure the success of each group in the A/B test.
  • The function bayesian_ab_test is called with data_a and data_b as inputs. As implemented earlier, it uses Bayesian inference to estimate the probability distribution of conversion rates for groups A and B. It returns two values: prob_b_better, the probability that group B's conversion rate is higher than group A's, and expected_loss, the expected cost (or loss) of choosing the less effective option based on the posterior distribution.
  • The plot_posterior_distributions(data_a, data_b) function plots the posterior distributions of the conversion rates for groups A and B. Posterior distributions in Bayesian analysis represent the probability of various possible values of a parameter (e.g., conversion rate) after considering the observed data.
  • The multi_armed_bandit function is called with the entire dataset. It uses a simplified version of the Epsilon-Greedy algorithm to decide how to allocate "arms" (i.e., variants A and B) during the experiment. It is called simplified because instead of choosing the arm with the highest estimated reward with probability (1 − ε) and a random arm with probability ε, we simply choose an arm at random in each round. This function returns rewards, a dictionary containing the accumulated rewards (conversions) for both groups; p_value, resulting from a statistical test comparing the conversion rates; and significant, a boolean indicating whether the observed difference is statistically significant.
  • With conversion_rate_A = data[data['group'] == 'A']['converted'].mean() and conversion_rate_B = data[data['group'] == 'B']['converted'].mean(), we calculate and print the conversion rates for groups A and B by taking the mean of the converted column for each group.

The Output (Bayesian and Multi-Armed Bandit)

Bayesian A/B Output (Image by the author)
Multi-Armed Bandit Output (Image by the author)
Image by the author

In summary, all the models applied in this analysis have provided statistically significant results, indicating that the differences observed between groups A and B are not due to random chance. The Bayesian A/B test reveals a high probability that one variant outperforms the other while also assessing the risk of choosing the less effective option. The multi-armed bandit approach dynamically optimizes the allocation of users to each variant, achieving significant rewards while reducing the opportunity cost.

In addition to the statistical tests, machine learning models like Logistic Regression, SVM, Random Forest, and Neural Networks have been trained on the dataset to predict conversions. These models delivered high accuracies in their predictions, showcasing their effectiveness in identifying the factors driving user behavior. Together, the statistical tests and machine learning models offer robust and reliable insights for decision-making, enabling us to identify the best-performing variant with confidence.

The mobile gaming industry moves fast, and so must your decision-making. Traditional A/B testing, while reliable, is often too slow to provide actionable insights in real-time. By incorporating probabilistic methods like Bayesian A/B testing, Multi-Armed Bandits, and machine learning models, mobile gaming companies can optimize their experiments, making data-driven decisions more quickly and efficiently.

These modern approaches not only speed up the process but also open the door to more personalized, targeted experimentation. In a world where user preferences are constantly shifting, the ability to adapt and make rapid decisions could be the key to long-term success. By implementing the code and techniques described here, mobile gaming companies can stay ahead of the competition, continually improving their games in a data-driven, scalable manner.

Now, it is your turn to leverage the power of A/B testing and machine learning to accelerate decision-making in your mobile games.

  1. ab_testing_with_ml_and_bayesian_methods.py
