Understanding Subscription Decline: A Machine Learning Approach: Part 2


Remember when flipping through a magazine was a common pastime? With the rise of digital content, one might expect magazine subscriptions to decline, but what if that’s not the full story? Even during periods when people spent more time at home, subscription rates didn’t see the expected boost. Instead, they dropped.

What’s driving this shift? Is it changing consumer preferences, economic constraints, or something deeper?

In this analysis, we take a data-driven approach to uncover the factors behind declining magazine subscriptions. By examining demographic, behavioral, and transactional data, we aim to identify key patterns using machine learning. After rigorous data cleaning and preparation, we’ll compare the performance of logistic regression and support vector machines (SVM) to determine which model provides the best insights.

Through this study, we hope to offer magazine companies actionable strategies to retain subscribers and adapt to evolving reader behaviors. Let’s dive in.

This is a continuation of Part 1. Access it here: https://medium.com/@maabena1859/understanding-subscription-decline-a-machine-learning-approach-part-1-5601bdbaedc9

In this phase, we build and evaluate machine learning models to understand the factors behind magazine subscription decline. We compare logistic regression and SVM to identify key drivers and improve retention strategies.

The dataset was split into an 80% training set and a 20% test set, and the numeric variables were standardized so that all features sit on a comparable scale. We then trained both Logistic Regression and Support Vector Machine (SVM) models to predict the target classes and evaluated their performance.
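As a rough sketch of this setup (names such as df, the Response target column, and the random seed are assumptions carried over from Part 1, not the exact notebook code):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# 'Response' is the target: 1 = subscriber, 0 = non-subscriber
X = df.drop(columns=["Response"])
y = df["Response"]

# 80/20 split, stratified so both sets keep the original class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Standard scaling: fit on the training set only, then apply to the test set
# (applied to all columns here for simplicity; the article scales the numeric variables)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Baseline models on the still-imbalanced training data
for name, model in [("Logistic Regression", LogisticRegression(max_iter=1000)),
                    ("SVM", SVC())]:
    model.fit(X_train_scaled, y_train)
    print(name)
    print(classification_report(y_test, model.predict(X_test_scaled)))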

Balancing the Data

The dataset in question exhibits a significant class imbalance, with 1,879 instances of non-subscribers (class 0) and only 333 instances of subscribers (class 1).
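A quick check of the target column (assuming it is named Response, as in Part 1) confirms the imbalance:

# Class distribution in the full dataset
print(df["Response"].value_counts())
# 0    1879
# 1     333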

This imbalance can lead to biased model performance, particularly in classification tasks, where the model tends to predict the majority class more often, neglecting the minority class. Before addressing the class imbalance, both the Logistic Regression and SVM models failed to predict subscribers (class 1) effectively. The precision and recall for Logistic Regression were 0.74 and 0.37, respectively, while for SVM, they were 0.81 and 0.31, respectively.

Given the importance of accurately predicting subscribers (class 1) for a magazine company trying to understand and maintain its subscriber base, addressing class imbalance is crucial. This is because:

Impact on Business Decisions

  • Subscribers (Class 1) are crucial for the magazine’s growth. Accurately predicting who is likely to subscribe or cancel helps the company improve customer retention, create targeted marketing campaigns, and prevent churn. Missing these predictions can lead to lost revenue and missed opportunities.

Increased Cost of Misclassification

  • Misclassifying subscribers (false negatives) as non-subscribers wastes marketing resources on the wrong customers, while ignoring those who need retention efforts. Misclassifying non-subscribers (false positives) can also result in unnecessary spending, offering subscriptions to people who aren’t interested.

Balanced Performance Using Balancing Techniques

We applied three balancing techniques — SMOTE, RandomUnderSampler, and SMOTETomek — to address the class imbalance in the dataset.
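The resampling itself can be done with the imbalanced-learn library. A minimal sketch, applied only to the scaled training split (variable names are illustrative):

import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTETomek

samplers = {
    "SMOTE": SMOTE(random_state=42),                             # synthesize new minority-class points
    "RandomUnderSampler": RandomUnderSampler(random_state=42),   # randomly drop majority-class rows
    "SMOTETomek": SMOTETomek(random_state=42),                   # SMOTE, then remove Tomek links at the boundary
}

for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X_train_scaled, y_train)
    counts = dict(zip(*np.unique(y_res, return_counts=True)))
    print(f"{name}: {counts}")   # class counts after resampling

Each resampled training set was then used to refit the two classifiers and produce the classification reports shown below.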

Among these techniques, SMOTETomek provided the best balance between precision and recall for both classes, resulting in improved model performance. Thus, the SMOTETomek approach was chosen.

Balanced Performance: SMOTETomek offers a good compromise between identifying subscribers (recall) and minimizing false positives (precision), ensuring a more accurate prediction of potential subscribers.

Improved Recall: While not as high as RandomUnderSampler's 0.76, a recall of 0.67 for subscribers represents a significant improvement over the original model, enabling the company to identify more potential subscribers.

Manageable False Positives: The precision of SMOTETomek is similar to RandomUnderSampler's, but because it flags fewer customers as likely subscribers overall, there are fewer false positives to manage, helping to optimize resource allocation.

Overall Accuracy: SMOTETomek maintains a good overall accuracy, which is essential for model reliability and consistency in predictions.

Representation of Both Classes: SMOTETomek both oversamples the minority class (subscribers) and cleans the class boundary, potentially providing a more representative dataset for the company’s analysis.

Hence for the magazine company’s goal of understanding and retaining subscribers, the SMOTETomek approach offers:

• A well-balanced approach to identifying potential subscribers without overwhelming the company with false positives.

• An improved ability to detect potential subscription issues compared to the original, imbalanced model.

• A more nuanced view of the data, potentially revealing insights that were previously obscured by the class imbalance.

While RandomUnderSampler showed a higher recall, the more balanced approach of SMOTETomek is likely to offer more reliable and actionable insights, which are crucial for the company’s strategy to address subscription declines effectively.

Classification Report - SMOTE:
              precision    recall  f1-score   support

           0       0.90      0.98      0.94       376
           1       0.74      0.39      0.51        67

    accuracy                           0.89       443
   macro avg       0.82      0.68      0.72       443
weighted avg       0.88      0.89      0.87       443

Classification Report - RandomUnderSampler:
              precision    recall  f1-score   support

           0       0.95      0.78      0.86       376
           1       0.38      0.76      0.51        67

    accuracy                           0.78       443
   macro avg       0.67      0.77      0.68       443
weighted avg       0.86      0.78      0.80       443

Logistic Regression

The Logistic Regression model using the SMOTETomek technique achieved an accuracy of 0.80. The precision for non-subscribers (class 0) is high at 0.93, meaning customers predicted as non-subscribers usually are non-subscribers. For subscribers (class 1), recall rises to 0.67, so the model now captures roughly two-thirds of potential subscribers, but precision falls to 0.41; the F1-score of 0.51 for class 1 reflects this trade-off between precision and recall. While the overall accuracy is solid, the model still struggles to identify subscribers cleanly, as shown by the lower precision for class 1. The weighted average F1-score of 0.82 indicates a decent balance overall.

Model Accuracy: 0.8036
SMOTETomek - Logistic Regression
              precision    recall  f1-score   support

           0       0.93      0.83      0.88       376
           1       0.41      0.67      0.51        67

    accuracy                           0.80       443
   macro avg       0.67      0.75      0.69       443
weighted avg       0.85      0.80      0.82       443

The logistic regression results indicate several significant variables impacting subscription behavior. Higher Total Spending and Campaign Acceptance positively influence the likelihood of subscription, suggesting that targeted campaigns to high-spending customers could increase subscriptions. Tenure also plays a key role, with longer customer relationships boosting the probability of subscribing. On the other hand, higher Income, more Recent Interactions, and Basic Education are associated with a decreased likelihood of subscribing. This is very interesting, especially since our exploratory analysis established that higher income is typically associated with a higher likelihood of subscribing.

Furthermore, Marital Status significantly influences subscription likelihood, with married, single, and widowed individuals showing lower chances of subscribing, which is an unexpected finding. This could suggest that family or relationship status, or associated financial priorities, might play a role in subscription behavior.

Among all the input variables, only Marital_Status_Other is not statistically significant, since its p-value is greater than 0.05.
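The coefficients and p-values discussed above (and printed in full below) come from a statsmodels logit fit. A minimal sketch, assuming X_res and y_res hold the SMOTETomek-resampled, scaled training data as a DataFrame/Series so that feature names appear in the summary (names are illustrative):

import statsmodels.api as sm

# Add an intercept term, then fit the logit model by maximum likelihood
X_logit = sm.add_constant(X_res)
result = sm.Logit(y_res, X_logit).fit()

# Coefficients, z-statistics, p-values and confidence intervals per predictor
print(result.summary())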

Optimization terminated successfully.
         Current function value: 0.381803
         Iterations 7
                           Logit Regression Results
==============================================================================
Dep. Variable:               Response   No. Observations:                 3002
Model:                          Logit   Df Residuals:                     2984
Method:                           MLE   Df Model:                           17
Date:                Thu, 30 Jan 2025   Pseudo R-squ.:                  0.4492
Time:                        11:48:09   Log-Likelihood:                -1146.2
converged:                       True   LL-Null:                       -2080.8
Covariance Type:            nonrobust   LLR p-value:                     0.000
===========================================================================================
                              coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------------
const                       1.7428      0.198      8.802      0.000       1.355       2.131
Income                     -0.2516      0.101     -2.482      0.013      -0.450      -0.053
Total_Children             -0.3773      0.095     -3.978      0.000      -0.563      -0.191
Total_Spending              0.7219      0.206      3.512      0.000       0.319       1.125
TotalPurchases             -0.7746      0.180     -4.300      0.000      -1.128      -0.422
Campaign_Acceptance         1.2993      0.087     14.943      0.000       1.129       1.470
Tenure                      1.0161      0.065     15.649      0.000       0.889       1.143
Recency                    -1.0233      0.060    -16.929      0.000      -1.142      -0.905
NumWebPurchases             0.4776      0.108      4.425      0.000       0.266       0.689
Education_Basic            -3.5465      0.797     -4.448      0.000      -5.109      -1.984
Education_Graduation       -1.2562      0.160     -7.848      0.000      -1.570      -0.942
Education_Master           -1.5755      0.204     -7.710      0.000      -1.976      -1.175
Education_PhD              -0.5239      0.181     -2.901      0.004      -0.878      -0.170
Marital_Status_Married     -2.2756      0.151    -15.106      0.000      -2.571      -1.980
Marital_Status_Other       -2.8335      1.589     -1.783      0.075      -5.948       0.281
Marital_Status_Single      -1.4430      0.159     -9.096      0.000      -1.754      -1.132
Marital_Status_Together    -2.7873      0.176    -15.826      0.000      -3.133      -2.442
Marital_Status_Widow       -2.4924      0.355     -7.012      0.000      -3.189      -1.796
===========================================================================================

SVM

The SMOTETomek technique with SVM achieved an overall accuracy of roughly 80%. It performed well on non-subscribers (class 0), with high precision (0.94) and recall (0.82), indicating accurate predictions for that group. For subscribers (class 1), however, precision (0.40) was lower, indicating a fair number of false positives, while recall (0.69) showed decent identification of actual subscribers. The F1-score for subscribers (0.51) reflects a moderate balance between precision and recall.

SMOTETomek - SVM
Accuracy: 0.7968397291196389
              precision    recall  f1-score   support

           0       0.94      0.82      0.87       376
           1       0.40      0.69      0.51        67

    accuracy                           0.80       443
   macro avg       0.67      0.75      0.69       443
weighted avg       0.85      0.80      0.82       443

The confusion matrix shows that the model correctly predicted 307 non-subscribers (true negatives) and 46 subscribers (true positives). However, it misclassified 69 non-subscribers as subscribers (false positives) and 21 subscribers as non-subscribers (false negatives). While the model is effective at identifying non-subscribers, it still faces challenges in precisely predicting subscribers, which is crucial for the business to target potential customers and prevent churn.
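A sketch of how this matrix is produced, assuming svm_model is the SVM fitted on the SMOTETomek-resampled training data (names are illustrative):

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

y_pred = svm_model.predict(X_test_scaled)

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Optional plot for the write-up
ConfusionMatrixDisplay(cm, display_labels=["Non-subscriber", "Subscriber"]).plot()
plt.show()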

Comparing the Models

Among the two models, SMOTETomek with SVM appears to be the better choice for the business context of a magazine company. While both models show similar accuracy, the key differentiators are recall and F1-score.

  • SMOTETomek with SVM has a higher recall of 0.69 compared to 0.67 with Logistic Regression, which is crucial for identifying subscribers (class 1) and minimizing missed opportunities.
  • The F1-score for subscribers is the same for both models (0.51), so neither has a clear edge in balancing precision and recall.
  • Precision: both models have essentially the same precision for identifying subscribers (0.41 for Logistic Regression versus 0.40 for SVM).
  • The AUC (Area Under the Curve) scores for Logistic Regression (0.8547) and SVM (0.8565) are very close, indicating that both models distinguish the two classes similarly well. Scores above 0.85 suggest strong predictive power for both, and the slightly higher AUC for SVM implies a marginal edge in overall classification performance (see the sketch after the figure below).
ROC curves for Logistic Regression and SVM (SMOTETomek)
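The AUC values and the ROC curves can be reproduced along these lines. This is a sketch: log_reg and svm_model are assumed to be the fitted SMOTETomek models, with the SVM scored via its decision function:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Scores for the positive class: predicted probability for LR, decision margin for SVM
lr_scores = log_reg.predict_proba(X_test_scaled)[:, 1]
svm_scores = svm_model.decision_function(X_test_scaled)

for name, scores in [("Logistic Regression", lr_scores), ("SVM", svm_scores)]:
    fpr, tpr, _ = roc_curve(y_test, scores)
    auc = roc_auc_score(y_test, scores)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {auc:.4f})")

plt.plot([0, 1], [0, 1], linestyle="--", color="grey")   # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()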

Given that the goal is to effectively target and retain subscribers, SMOTETomek with SVM identifies more of the potential subscribers while keeping false positives at a comparable level, making it the more reliable choice for the company’s strategy.

The SVM model identifies several key variables influencing subscription behavior. Total Spending (1.02) and Campaign Acceptance (0.65) are positively associated with a higher likelihood of subscribing, suggesting that customers who spend more and engage with campaigns are more likely to convert. In contrast, Income (-0.25) and Recency (-0.77) show negative relationships, indicating that wealthier customers or those with more recent interactions are less likely to subscribe, which aligns with unexpected findings from the logistic regression. Educational background, especially Basic Education (-2.10), plays a significant role, suggesting that lower education levels may reduce the likelihood of subscription. Additionally, Marital Status variables, such as Married, Single, and Widow, are linked to a lower probability of subscribing.
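These importance values behave like the coefficients of a linear-kernel SVM (an RBF kernel would not expose per-feature weights). A minimal sketch of how such weights can be extracted, assuming X_res and y_res are the resampled training data kept as a DataFrame/Series:

import pandas as pd
from sklearn.svm import SVC

# coef_ is only defined for a linear kernel
linear_svm = SVC(kernel="linear")
linear_svm.fit(X_res, y_res)

# Rank features by their signed weight in the decision function
weights = pd.Series(linear_svm.coef_[0], index=X_res.columns).sort_values()
print(weights)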

In conclusion, the decline in subscriptions may be influenced by factors such as recency, spending, and tenure. To address this, the magazine company should focus on building long-term relationships with customers, as longer tenure correlates with higher subscription likelihood. Targeted campaigns for high-spending customers and those accepting offers could also drive subscriptions. Additionally, reconsidering the timing of subscription offers may help mitigate the negative impact of recent interactions. Personalizing marketing efforts based on factors like education and marital status can further improve subscription rates and retention.

Access Part 1 here: https://medium.com/@maabena1859/understanding-subscription-decline-a-machine-learning-approach-part-1-5601bdbaedc9

#LogisticRegression #SVM #MachineLearning #Classification #SubscriptionDecline
