Social Proof Analyses and A/B Testing (Measurement Problems) | by AKAY AYDIN | Nov, 2024

General Assessment:

Product rating systems should use a combination of methods to accurately measure user satisfaction. While average rating methods are simple, more dynamic and weighted systems that consider time and user interactions yield much more accurate results. For example, changes in ratings over time should be accounted for. Similarly, assigning more value to experienced users’ ratings enhances the reliability of social proof.

In conclusion, complex rating systems allow companies to improve their products while offering consumers a more trustworthy and realistic shopping experience.

Sorting products, people, or other objects is a crucial need in various business processes. Instead of focusing on a single factor, considering all relevant factors together provides more accurate results. Therefore, when using multiple factors, their effects should be standardized and weighted as needed, incorporating business knowledge into the process.

For instance, ranking candidates for a job might involve considering their graduation scores, language proficiency, and interview performance. In this case, a standardized model with appropriate weights for each factor can be utilized.

2.1. Sorting by Rating

Sorting a product based on user ratings is a quick and simple method. However, this approach focuses solely on the average star rating, neglecting other factors like the number of reviews or purchase counts that contribute to social proof.
For example, a product rated 5 stars but reviewed by only two people might lack credibility. On the other hand, another product with an average rating of 4.7 from 500 reviews demonstrates stronger social proof.

2.2. Sorting by Comment Count or Purchase Count

Sorting a product based on the number of comments or purchases can reflect social proof. However, this method alone is insufficient.
For instance, a sponsored product on an e-commerce platform may have been purchased thousands of times due to a promotion. If the product is of low quality, ranking it solely based on purchase count can mislead consumers.

Purchase data must be complemented with metrics like comment count and ratings. Additionally, whether the purchases stem from genuine demand or promotional campaigns should be analyzed.

2.3. Sorting by Rating, Comment, and Purchase

A more balanced ranking can be achieved by combining factors like ratings, comment count, and purchase count.

In this method, each factor is scaled to a standard range (e.g., 1 to 5) using techniques like MinMaxScaler. Appropriate weights are then assigned to each factor to calculate a composite score.

For instance, consider the following scores for products on an e-commerce platform:

Product A: Rating: 4.8, Comments: 100, Purchases: 500
Product B: Rating: 4.6, Comments: 300, Purchases: 1000

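The scores for these products can be calculated using a formula like the following, applied to the scaled values rather than the raw counts (otherwise the purchase count would dominate the score):
Score = (Rating × 0.5) + (Comments × 0.3) + (Purchases × 0.2)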

This score reflects both the social proof and the overall performance of the products.
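The scaling and weighting steps described above can be sketched as follows. This is a minimal illustration, not a reference implementation: it uses a hand-written min-max scaler in plain Python as a stand-in for MinMaxScaler, and the helper names `scale_to_range` and `weighted_scores` as well as Product C are hypothetical additions (with only two products, min-max scaling would map every value to either 1 or 5).

```python
def scale_to_range(values, low=1.0, high=5.0):
    """Min-max scale a list of numbers into the range [low, high]."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # All values identical: place them at the midpoint of the range.
        return [(low + high) / 2 for _ in values]
    return [low + (v - v_min) * (high - low) / (v_max - v_min) for v in values]


def weighted_scores(products, weights):
    """Return {name: composite score} built from scaled, weighted factors."""
    names = list(products)
    scores = {name: 0.0 for name in names}
    for factor, weight in weights.items():
        scaled = scale_to_range([products[name][factor] for name in names])
        for name, value in zip(names, scaled):
            scores[name] += weight * value
    return scores


products = {
    "A": {"rating": 4.8, "comments": 100, "purchases": 500},
    "B": {"rating": 4.6, "comments": 300, "purchases": 1000},
    "C": {"rating": 4.0, "comments": 50, "purchases": 200},  # hypothetical product
}
weights = {"rating": 0.5, "comments": 0.3, "purchases": 0.2}

scores = weighted_scores(products, weights)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

With these weights, Product B ends up ahead of Product A even though A has the higher rating, because comment and purchase counts together carry half of the total weight.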

The Bayesian Average Rating Score (BAR Score) relies solely on the ratings provided to products and considers their distributions for a more reliable ranking. If the sole focus is on ratings and their distributions, BAR Score is applicable. Unlike a simple average, BAR Score probabilistically evaluates the potential of a product based on the distribution of ratings.

For instance, in the same category, one product rated 5 stars by just three people and another rated 4.6 stars by 300 people would be balanced by BAR Score. It accounts for the potential of a product to receive ratings from a larger audience. This method yields statistically robust results but does not consider factors like comment count or purchase count.
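The article does not spell out the BAR formula. The sketch below follows one widely used formulation, a lower bound on the expected star rating under a uniform prior over the star levels; the function name, that choice of formula, and the exact split of the 300 ratings are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def bayesian_average_rating(star_counts, confidence=0.95):
    """Lower bound on the expected rating, given counts of 1..K star ratings."""
    total = sum(star_counts)
    if total == 0:
        return 0.0
    k = len(star_counts)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    mean = 0.0
    mean_sq = 0.0
    for stars, count in enumerate(star_counts, start=1):
        # Adding 1 to each count acts as a uniform prior:
        # one pseudo-rating per star level.
        prob = (count + 1) / (total + k)
        mean += stars * prob
        mean_sq += stars * stars * prob
    variance = mean_sq - mean * mean
    return mean - z * math.sqrt(variance / (total + k + 1))


few_ratings = [0, 0, 0, 0, 3]        # three 5-star ratings
many_ratings = [0, 0, 10, 100, 190]  # 300 ratings with a simple average of 4.6

bar_few = bayesian_average_rating(few_ratings)
bar_many = bayesian_average_rating(many_ratings)
print(round(bar_few, 3), round(bar_many, 3))
```

Under these assumptions the product with only three 5-star ratings scores well below the product with 300 ratings, which is the balancing effect described above.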

Scores derived from statistical methods like BAR Score may not be sufficient on their own.

Social proof factors like comment count and purchase count cannot simply be ignored, and this is where the hybrid sorting approach comes into play. Combining BAR Score with factors such as comment count and purchase count creates a hybrid ranking method. Hybrid Sorting integrates the potential of products with their existing social proof for a more balanced ranking.

This method combines the BAR Score variable and the Weighted Sorting Score (WSS) variable, which is calculated by weighting ratings, comments, and purchases, into a single function. The combined and weighted result constitutes hybrid sorting.

This approach allows products with high potential but insufficient social proof to stand out while ensuring that products with strong social proof retain their deserved positions.

In conclusion, the methods used in product sorting should be determined through an approach that harmonizes business knowledge with relevant factors. Rather than relying on a single method, combining multiple methods tailored to the situation and needs will yield more accurate and reliable results.

Sorting reviews is used to rank user feedback about a product in the most accurate way possible. This process considers not only the content of the reviews but also the quality of the reviewers and the reliability of the interactions. For instance, the opinion of a user who has purchased a single product is not equivalent to that of a user who has purchased 100 products. The goal is to support decision-making by presenting the most meaningful and trustworthy reviews.

3.1. Up-Down Difference Score

This method is based on the difference between positive (up) and negative (down) votes received by reviews.

Key features of the method:

Simplicity: It calculates the difference between the number of positive (up) and negative (down) votes.
Frequency independence: It focuses on the magnitude of the difference rather than the total number of votes.
Potential drawback: A review with a small total number of votes but a large difference may outrank highly voted reviews.

Example:

Review A: 20 up, 5 down → Score = 15
Review B: 100 up, 90 down → Score = 10

In this case, Review A ranks higher despite receiving fewer votes.

3.2. Average Rating Score

This method is based on the ratio of positive votes to total votes (up / total ratings).

Advantages and challenges:

Utility ratio: Useful for measuring the positivity of reviews.
Low-frequency bias: A review with very few votes, all of them positive (e.g., 3 up, 0 down), can rank highly.

Example:

Review C: 3 up, 0 down → Score = 3/3 = 1.00
Review D: 100 up, 50 down → Score = 100/150 = 0.67

Here, Review C ranks higher despite having significantly fewer interactions. This may lead to highly engaged but relatively lower-rated reviews being ranked lower.
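Both simple scores can be computed in a few lines. The sketch below reproduces the figures for Reviews A to D; the function names are illustrative only.

```python
def up_down_diff(up, down):
    """Up-Down Difference Score: positive votes minus negative votes."""
    return up - down


def average_rating(up, down):
    """Average Rating Score: share of positive votes among all votes."""
    total = up + down
    return up / total if total else 0.0


reviews = {"A": (20, 5), "B": (100, 90), "C": (3, 0), "D": (100, 50)}
for name, (up, down) in reviews.items():
    print(name, up_down_diff(up, down), round(average_rating(up, down), 2))
```

Neither score takes the total number of votes into account, which is why Review C can reach the top with only three votes.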

3.3. Wilson Lower Bound Score (WLB)

This method is used in rankings based on binary interactions (e.g., helpful/not helpful, like/dislike) and relies on statistical reliability. Its primary aim is to show the “worst-case scenario” reliability of a review and to prevent low-vote reviews from unfairly ranking high.

Mathematical basis:

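WLB treats each vote as a Bernoulli trial and calculates the lower bound of a confidence interval for the positive vote ratio.

Lower bound calculation:

WLB = (p + z²/(2n) - z × √((p(1 - p) + z²/(4n)) / n)) / (1 + z²/n)

Here p is the observed positive vote ratio, n is the total number of votes, and z is the z-score for the chosen confidence level (1.96 for 95%). WLB provides a more conservative lower bound as n decreases, making it harder for low-vote reviews to surpass high-vote ones.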

Example:

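Review E: 600 up, 400 down → p (positive rate) = 600/1000 = 0.6, n (total votes) = 1000
WLB score: 0.5693 (at the 95% confidence level, the true positive rate is estimated to be at least 56.93%).
Review F: 30 up, 10 down → p = 0.75, n = 40
WLB score: 0.5981 (at the 95% confidence level, the true positive rate is estimated to be at least 59.81%).

Review F still ranks slightly higher, but its small vote count pulls its score far below its raw positive rate (0.75 → 0.5981), while Review E's score stays close to its raw rate (0.60 → 0.5693). The 0.15 gap between the raw rates shrinks to about 0.03, so a couple of additional negative votes on Review F would be enough for Review E to overtake it.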
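The scores for Reviews E and F can be checked with a short function. This is a sketch based on the standard Wilson score interval; the function name and the default confidence level are assumptions, and the z-score is taken from the standard normal distribution in Python's `statistics` module.

```python
import math
from statistics import NormalDist


def wilson_lower_bound(up, down, confidence=0.95):
    """Lower bound of the Wilson score interval for the positive vote ratio."""
    n = up + down
    if n == 0:
        return 0.0
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = up / n
    centre = p_hat + z * z / (2 * n)
    margin = z * math.sqrt((p_hat * (1 - p_hat) + z * z / (4 * n)) / n)
    return (centre - margin) / (1 + z * z / n)


wlb_e = wilson_lower_bound(600, 400)  # Review E
wlb_f = wilson_lower_bound(30, 10)    # Review F
print(round(wlb_e, 4), round(wlb_f, 4))
```

The two results should match the 0.5693 and 0.5981 quoted for Reviews E and F up to rounding.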

Advantages:

Provides fairer rankings for low-frequency reviews.
Focuses on statistical reliability rather than pure ratios.

3.4. Advanced Techniques in Review Sorting

A. User Segmentation:

Prioritizing reviews from users with higher purchase volumes.
Assigning more weight to reviews from expert users.

B. NLP (Natural Language Processing):

Analyzing review text to derive meaningful ranking criteria:
Review length, use of positive/negative expressions.
For instance, expressions like “very bad” may lower a review’s score.

C. Hybrid Scores:

Combining WLB, average rating, and user behavior for hybrid rankings.

Example Hybrid Formula:

Final Score = 0.5 × WLB + 0.3 × Average Rating + 0.2 × User Engagement

3.5. Hybrid Sorting Examples

Social Proof + WLB:

Reviews with low counts but strong WLB scores are weighted with social proof factors.

Example:

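Here the final score is 0.5 × WLB + 0.5 × social proof score, where the social proof score is a value between 0 and 1 derived from the vote count (0.2 for 10 votes and 0.8 for 100 votes in this example). WLB is computed at the default 95% confidence level (z = 1.96).
Review G: WLB = 0.7, 10 votes → Final Score = 0.5 × 0.7 + 0.5 × 0.2 = 0.45
Review H: WLB = 0.6, 100 votes → Final Score = 0.5 × 0.6 + 0.5 × 0.8 = 0.70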

Review H ranks higher due to greater social proof.
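A hybrid score of this kind is simply a weighted sum of named components. The sketch below uses a hypothetical helper, `hybrid_score`, and reproduces the scores for Reviews G and H; the same function would work for a three-part formula such as WLB, average rating, and user engagement.

```python
def hybrid_score(components, weights):
    """Weighted sum of named score components; weights are expected to sum to 1."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[name] * components[name] for name in weights)


weights = {"wlb": 0.5, "social_proof": 0.5}
review_g = hybrid_score({"wlb": 0.7, "social_proof": 0.2}, weights)
review_h = hybrid_score({"wlb": 0.6, "social_proof": 0.8}, weights)
print(round(review_g, 2), round(review_h, 2))
```

Keeping the weights in a dictionary makes it easy to adjust them as business knowledge about the product changes.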

Comparison of Methods

Up-Down Difference Score: Simple but may favor low-frequency reviews.
Average Rating Score: Highlights highly positive reviews but can overemphasize low-frequency interactions.
Wilson Lower Bound Score: Combines frequency and ratio to provide more reliable rankings, ensuring statistical reliability.
Hybrid Methods: Combine multiple criteria to deliver more balanced rankings.

These methods can be used to rank the most helpful reviews about a product. However, the best results are achieved when these methods are combined with domain knowledge and tailored weights specific to the product.

A/B testing is a powerful method used in digital marketing and product development to optimize decision-making processes. This test helps determine which variant is more effective by comparing two or more variants.

4.1. Sampling

Sampling is one of the cornerstones of A/B testing. The phrase “The future of AI will be about less data, not more” highlights the importance of working with less, but higher-quality, data. Sampling means selecting a small subset of the population and using it to draw conclusions about the whole population, which makes it possible to reach reliable results with far less data. How well the sample represents the population determines how reliable those results are.

4.2. Descriptive Statistics

Descriptive statistics is the first step in understanding a dataset and is usually carried out as part of Exploratory Data Analysis (EDA). In this phase, basic statistics such as the mean, standard deviation, minimum, and maximum values of the variables in the dataset are calculated. This information helps us understand the general structure of the dataset and lays the groundwork for further analysis.

4.3. Confidence Intervals

A confidence interval is a range of values that is likely to contain the population parameter being estimated. For example, a 95% confidence interval for a sample mean is built so that, if the sampling were repeated many times, about 95% of the resulting intervals would contain the population mean. Reporting the interval alongside the point estimate makes the uncertainty explicit and supports more reliable decisions.
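As a rough sketch, a confidence interval for a mean can be computed with the standard library alone. The version below uses the normal approximation (a z-score) to stay dependency-free; for small samples a t-based interval would be somewhat wider and is usually preferred. The function name and the sample values are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    centre = mean(sample)
    half_width = z * stdev(sample) / math.sqrt(n)
    return centre - half_width, centre + half_width


# Hypothetical session lengths in minutes.
session_minutes = [12.1, 9.8, 11.4, 10.7, 13.2, 10.1, 11.9, 12.5, 9.5, 11.0]
low, high = mean_confidence_interval(session_minutes)
print(round(low, 2), round(high, 2))
```

Raising the confidence level (for example to 0.99) widens the interval, which reflects the trade-off between certainty and precision.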

4.4. Correlation

Correlation is a statistical measure of the direction and strength of the relationship between two variables, usually expressed as a coefficient between -1 and 1. Correlation analysis allows us to understand the relationships between variables and make predictions based on these relationships, although a correlation on its own does not show that one variable causes the other.
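One common choice is the Pearson correlation coefficient, sketched below in plain Python. The function name and the example data (comment counts versus purchase counts) are hypothetical.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equally long samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


comment_counts = [10, 20, 30, 40, 50]        # hypothetical
purchase_counts = [120, 190, 330, 410, 480]  # hypothetical
r = pearson_correlation(comment_counts, purchase_counts)
print(round(r, 3))
```

A value close to 1 indicates a strong positive linear relationship, a value close to -1 a strong negative one, and a value near 0 little or no linear relationship.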

4.5. Hypothesis Testing

Hypothesis testing examines whether observed differences between two or more groups are due to chance. For example, it may be used to test whether a change in the user interface of a mobile app increases the time users spend in the app. A p-value is used to determine whether there is a statistically significant difference.

4.6. A/B Test (Independent Two-Sample T Test)

How is A/B Testing Used in Industry?

A/B testing can be used for various purposes across different industries. Here are two examples:

Digital Marketing: A/B testing is used to measure the effectiveness of digital marketing elements such as email campaigns, website designs, and ad texts. For example, testing different subject lines in an email campaign can determine which one leads to a higher open rate.
E-Commerce: In e-commerce websites, different designs of product pages or different versions of payment processes can be tested to determine which design leads to more sales. This is critical for optimizing user experience and increasing conversion rates.

4.7. A/B Test Process

The A/B testing process follows specific steps:

4.7.1. Hypothesis Formation: The first step is determining the hypothesis to be tested. For example, a hypothesis could be “The conversion rate of the new design is higher than the old design.” Or, in a restaurant, the hypothesis could be that there is no difference between the tips given by smokers and non-smokers. This is typically expressed as H0: μ1 = μ2 (the means of the two groups are equal).

4.7.2. Assumption Control: The assumptions of the test are checked to ensure its validity. The Shapiro-Wilk test is applied for the normality assumption, and homogeneity of variance is checked with a test such as Levene's. If the p-value of the normality test is smaller than 0.05, the normality assumption is rejected, and non-parametric tests (e.g., Mann-Whitney U test) are used.

4.7.3. Applying the Test: If the assumptions are met, an independent two-sample t-test (parametric test) is applied. If the normality assumption is not met, a non-parametric alternative such as the rank-based Mann-Whitney U test is preferred.

4.7.4. Interpreting Results: Test results are interpreted based on the p-value. If the p-value is smaller than 0.05, the null hypothesis (H0) is rejected, and it is concluded that there is a statistically significant difference between the variants.

4.7.5. Decision Making: Based on the test results, a decision is made on which variant is more effective and that variant is implemented.
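The decision flow in these steps can be sketched as plain logic. The functions below only encode which test to pick and how to read the final p-value; they do not run the tests themselves. In practice the p-values could come from SciPy functions such as `scipy.stats.shapiro`, `scipy.stats.levene`, `scipy.stats.ttest_ind`, and `scipy.stats.mannwhitneyu`. The function names, the use of a separate variance-homogeneity p-value, and the Welch fallback are assumptions made for this illustration.

```python
def choose_ab_test(p_normality_a, p_normality_b, p_equal_variance, alpha=0.05):
    """Pick a test from the p-values of the assumption checks."""
    normal = p_normality_a >= alpha and p_normality_b >= alpha
    if not normal:
        # Normality rejected for at least one group: use a non-parametric test.
        return "Mann-Whitney U test"
    if p_equal_variance >= alpha:
        return "independent two-sample t-test"
    # Normal but unequal variances: the t-test without the equal-variance assumption.
    return "Welch's t-test"


def interpret(p_value, alpha=0.05):
    """Translate the p-value of the chosen test into a decision about H0."""
    return "reject H0" if p_value < alpha else "fail to reject H0"


# Hypothetical p-values from the assumption checks and the final test.
print(choose_ab_test(0.42, 0.31, 0.55))
print(interpret(0.012))
```

Separating the choice of test from the interpretation keeps the significance level (alpha) in one place and makes the procedure easy to audit.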

4.8. Two-Sample Proportion Test

The two-sample proportion test is a statistical method used to compare the proportions of two different groups. This test is used when comparing the frequency of a specific event or characteristic between two groups. For example, in a drug study, it can be used to compare the recovery rates between the treatment and placebo groups. This test helps determine whether there is a statistically significant difference between the proportions. In Python, the proportions_ztest function from the statsmodels library is commonly used for this analysis.
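To show what such a test computes, here is a hand-written two-sided z-test with a pooled proportion, using only the standard library. It is a sketch of the general technique, not a copy of any library's implementation, and the function name and the counts are hypothetical.

```python
import math
from statistics import NormalDist


def two_proportion_ztest(success_a, n_a, success_b, n_b):
    """Two-sided z-test for the difference between two proportions (pooled)."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical conversions: 300 of 1000 visitors vs. 250 of 1100 visitors.
z_stat, p_value = two_proportion_ztest(300, 1000, 250, 1100)
print(round(z_stat, 2), p_value < 0.05)
```

A p-value below 0.05 would lead to rejecting the hypothesis that the two proportions are equal.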

4.9. Comparing More Than Two Group Averages (ANOVA)

ANOVA (Analysis of Variance) is a statistical test used to compare the means of three or more groups. ANOVA determines whether there is a statistically significant difference between the groups. For example, it can be used to examine the impact of different teaching methods on students’ performance scores. ANOVA assumes normality and homogeneity of variances, so these assumptions are checked first. If they are met, parametric ANOVA is applied; otherwise, non-parametric tests like Kruskal-Wallis are used. A significant result only shows that at least one group mean differs from the others; it does not show which one. ANOVA is widely used in fields such as medicine, education, and social sciences, and is an effective tool for making comparisons among multiple groups. Together, the two-sample proportion test and ANOVA offer suitable statistical tools for different types of data and research questions.
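The core of one-way ANOVA is the F statistic, the ratio of between-group variance to within-group variance. The sketch below computes it by hand; turning it into a p-value requires the F distribution, which a library function such as `scipy.stats.f_oneway` handles. The function name and the example scores are hypothetical.

```python
def one_way_anova_f(groups):
    """Return (F statistic, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of the observations around their own group mean.
    ss_within = sum(
        sum((x - m) ** 2 for x in g) for g, m in zip(groups, group_means)
    )
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical scores for three teaching methods.
method_scores = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_between, df_within = one_way_anova_f(method_scores)
print(f_stat, df_between, df_within)
```

A large F value means the group means differ by more than the spread within the groups would suggest.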