Detecting Online Fraud with Precision — A Machine Learning Case Study | by Rahul Khandelwal | Mar, 2025

Fraud or Fair? A Machine Learning Approach to Transaction Fraud Detection

With online transactions surging globally, fraud prevention is a top priority. Our team worked with the IEEE-CIS Fraud Detection dataset to design a robust solution that accurately identifies fraud, minimizes false alarms, and uncovers meaningful patterns in suspicious activity.

We focused on answering four key research questions:

How effective is our fraud detection, and which features matter most?
How can we minimize false positives while maximizing detection?
Can meaningful clusters be identified within the data?
What role does each feature play in predicting fraud?

Dataset overview:

Size: about 590K training records and 500K test records
Type: Credit/debit card transactions (mostly card-not-present)
Features: Transaction timestamps, amounts, card/email/domain info, and masked identity features

We imputed missing values in key features and engineered new features such as TransactionHour, TransactionAmt_bin, and card_type_combo.

Figure: counts for each value of the new “Card Type Combination” (card_type_combo) attribute
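To make the engineered features concrete, here is a minimal sketch using only the Python standard library. It assumes that TransactionDT is a seconds offset from some reference time and that card_type_combo joins the card network (card4) with the card type (card6). Neither assumption is confirmed by the write-up, and the amount bin edges are purely illustrative.

```python
from collections import Counter

SECONDS_PER_HOUR = 3600


def transaction_hour(transaction_dt: int) -> int:
    """Hour of day (0-23), treating TransactionDT as a seconds offset (assumption)."""
    return (transaction_dt // SECONDS_PER_HOUR) % 24


def amount_bin(amount: float) -> str:
    """Bucket an amount; these edges are illustrative, not the study's."""
    if amount < 50:
        return "low"
    if amount < 200:
        return "medium"
    if amount < 1000:
        return "high"
    return "very_high"


def card_type_combo(card4, card6) -> str:
    """Join card network and card type, making missing values explicit."""
    return f"{card4 or 'unknown'}_{card6 or 'unknown'}"


# Toy rows, not real data from the dataset.
rows = [
    {"TransactionDT": 86400, "TransactionAmt": 68.5, "card4": "visa", "card6": "debit"},
    {"TransactionDT": 90061, "TransactionAmt": 1200.0, "card4": "mastercard", "card6": "credit"},
    {"TransactionDT": 172799, "TransactionAmt": 25.0, "card4": "visa", "card6": None},
]

for row in rows:
    row["TransactionHour"] = transaction_hour(row["TransactionDT"])
    row["TransactionAmt_bin"] = amount_bin(row["TransactionAmt"])
    row["card_type_combo"] = card_type_combo(row["card4"], row["card6"])

# Counts per combination, the kind of summary the figure above describes.
combo_counts = Counter(row["card_type_combo"] for row in rows)
print(combo_counts)
```

On a full dataset the same logic would more likely be written with a dataframe library, but the transformations themselves stay the same.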

How effective is our fraud detection, and which features matter most?

We compared six models:

Random Forest
XGBoost
Bagging
Gradient Boosting
Logistic Regression
CART (Decision Tree)

ROC AUC: 0.929

Logistic Regression, limited by linearity and class imbalance, failed with an F1 of just 0.001. Tree-based models captured complex fraud behavior far more effectively.
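An F1 score that low usually means the model catches almost no fraud at all. The sketch below shows the arithmetic with made-up confusion-matrix counts (they are not the study's numbers): F1 is the harmonic mean of precision and recall, so a near-zero recall drags it toward zero no matter how many genuine transactions are classified correctly.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall, from confusion-matrix counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical: 3,000 actual fraud cases, and the model catches only 2 of them.
near_zero = f1_from_counts(tp=2, fp=5, fn=2998)

# Hypothetical: a model that catches 1,800 of them with 200 false alarms.
stronger = f1_from_counts(tp=1800, fp=200, fn=1200)

print(f"{near_zero:.4f} vs {stronger:.2f}")
```

Accuracy would look excellent in both cases because genuine transactions dominate, which is why F1 and ROC AUC are the more informative metrics here.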

How can we minimize false positives while maximizing detection?

Minimizing false alarms is critical to avoid investigation overload and maintain customer trust.

Stratified K-Fold CV: Preserved class distribution during validation
SMOTE: Generated synthetic fraud samples for better training coverage
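The two techniques above can be sketched in plain Python. This is a simplified illustration of the ideas, not the code used in the study: `stratified_folds` and `smote_like` are hypothetical helper names, and a real pipeline would normally rely on a tested library implementation. The first function spreads each class evenly across folds; the second creates synthetic minority points by interpolating between a fraud sample and one of its nearest fraud neighbors, which is the core idea behind SMOTE.

```python
import random


def stratified_folds(labels, n_splits, seed=0):
    """Assign each index to a fold so every fold keeps the overall class ratio."""
    rng = random.Random(seed)
    fold_of = [0] * len(labels)
    for cls in set(labels):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            fold_of[i] = pos % n_splits  # deal indices out round-robin
    return fold_of


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new points on segments between a minority sample and a near neighbor."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest other minority points by squared Euclidean distance.
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbors)
        t = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(base, other)))
    return synthetic


labels = [0] * 20 + [1] * 5  # toy data: 20 genuine, 5 fraud
folds = stratified_folds(labels, n_splits=5)

fraud_points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
synthetic = smote_like(fraud_points, n_new=6)
```

One design point worth keeping in mind: oversampling should be applied only to the training portion of each fold. If synthetic samples leak into the validation portion, the validation scores no longer reflect the real class distribution.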

Two-Stage Review System:

Random Forest flags high-risk transactions
Business rules/manual review filters final decisions
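A two-stage flow like this can be expressed as a small routing function. The sketch below is an assumption about how such a system might be wired: the thresholds, the rules, and the `route` function name are all hypothetical and are not taken from the study.

```python
SCORE_THRESHOLD = 0.5   # hypothetical stage-1 cut-off on the model's fraud score
AUTO_BLOCK_SCORE = 0.9  # hypothetical rule: very high scores are blocked outright
SMALL_AMOUNT = 20.0     # hypothetical rule: small flagged amounts are released


def route(fraud_score: float, amount: float) -> str:
    """Return 'approve', 'block', or 'manual_review' for one transaction."""
    # Stage 1: the model decides whether the transaction is flagged at all.
    if fraud_score < SCORE_THRESHOLD:
        return "approve"
    # Stage 2: business rules settle the clear cases; analysts see the rest.
    if fraud_score >= AUTO_BLOCK_SCORE:
        return "block"
    if amount < SMALL_AMOUNT:
        return "approve"
    return "manual_review"


print(route(0.10, 300.0))  # below the stage-1 threshold
print(route(0.95, 300.0))  # flagged and above the auto-block score
print(route(0.60, 300.0))  # flagged, no rule applies
```

Keeping the rules separate from the model means the review policy can be tuned without retraining.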

📉 Result:
Random Forest produced only 183 false positives out of ~91,200 non-fraud cases, a false positive rate of about 0.2%.
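The rate follows directly from the two counts. As a quick check of the arithmetic (treating the approximate figure of 91,200 non-fraud cases as exact):

```python
def false_positive_rate(false_positives: int, total_negatives: int) -> float:
    """FPR = FP / (FP + TN), where FP + TN is every genuine transaction."""
    return false_positives / total_negatives


fpr = false_positive_rate(183, 91_200)
print(f"{fpr:.2%}")  # prints 0.20%
```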

Can meaningful clusters be identified within the data?

Yes — clustering helped us segment transactions by risk level.

Applied PCA for dimensionality reduction
Tested K-Means and Gaussian Mixture Models (GMM)
Used Silhouette Score to measure cluster quality
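For readers unfamiliar with the metric, here is a minimal pure-Python sketch of the silhouette score using Euclidean distance. It is meant to show what the number measures, not to reproduce the study's code. For each point, `a` is the mean distance to the other members of its own cluster and `b` is the mean distance to the nearest other cluster; the score averages `(b - a) / max(a, b)` and approaches 1 when clusters are tight and well separated.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # convention: a singleton cluster contributes 0
        # Distance from p to itself is 0, so divide by the other members only.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Toy 2-D points forming two obvious groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = silhouette_score(points, [0, 0, 0, 1, 1, 1])  # matches the groups
bad = silhouette_score(points, [0, 1, 0, 1, 0, 1])   # mixes the groups
print(good > bad)
```

Comparing this score across K-Means and GMM assignments on the PCA-reduced data is one way to decide which clustering separates transactions more cleanly.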

➡️ GMM produced better-separated clusters than K-Means, making it the better fit for risk-based stratification.

What role does each feature play in predicting fraud?

TransactionAmt: Large deviations flagged as high risk
card_type_combo: Certain card types were more fraud-prone
TransactionHour: Fraud tended to spike during specific hours
DeviceType & DeviceInfo: Suspicious patterns in mobile vs. desktop usage
Email domains: Non-corporate emails (e.g., free providers) had higher fraud rates
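Patterns like these typically come from comparing the fraud rate across the values of a single feature. The sketch below shows that calculation on toy rows. The column names `P_emaildomain` and `isFraud` are assumed here for illustration, and the data is invented.

```python
from collections import defaultdict


def fraud_rate_by(rows, key):
    """Share of fraudulent rows for each distinct value of `key`."""
    totals = defaultdict(int)
    frauds = defaultdict(int)
    for row in rows:
        totals[row[key]] += 1
        frauds[row[key]] += row["isFraud"]
    return {value: frauds[value] / totals[value] for value in totals}


# Invented rows: a free email provider versus a corporate domain.
rows = [
    {"P_emaildomain": "gmail.com", "isFraud": 1},
    {"P_emaildomain": "gmail.com", "isFraud": 0},
    {"P_emaildomain": "gmail.com", "isFraud": 0},
    {"P_emaildomain": "gmail.com", "isFraud": 0},
    {"P_emaildomain": "corp.example", "isFraud": 0},
    {"P_emaildomain": "corp.example", "isFraud": 0},
]

rates = fraud_rate_by(rows, "P_emaildomain")
print(rates)
```

The same function works for any categorical feature, such as card_type_combo or TransactionHour, as long as each group has enough rows for the rate to be meaningful.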

We also visualized distributions using log scales to highlight hidden trends across wide transaction amount ranges.
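A log scale helps because transaction amounts span several orders of magnitude, so equal-width bins would squeeze most transactions into the first bar. As a small illustration with invented amounts, binning by order of magnitude gives each range its own bucket:

```python
import math
from collections import Counter


def log10_bin(amount: float) -> int:
    """Order of magnitude of a positive amount: 0 for 1-9.99, 1 for 10-99.99, ..."""
    return math.floor(math.log10(amount))


amounts = [3.5, 12.0, 45.0, 99.0, 250.0, 4800.0]  # invented values
histogram = Counter(log10_bin(a) for a in amounts)

# Text histogram: one row per order of magnitude.
for exponent in sorted(histogram):
    low, high = 10 ** exponent, 10 ** (exponent + 1)
    print(f"{low:>6} to <{high:<6} {'#' * histogram[exponent]}")
```

Amounts of zero or below need separate handling, since the logarithm is undefined for them.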

Key takeaways:

Tree-based models outperform linear methods in fraud tasks
Class imbalance demands deliberate techniques like SMOTE
Clustering enhances operational efficiency by tiering response strategies
Feature engineering remains critical for turning raw data into business value

Recommendations:

Deploy Random Forest as the primary fraud detection model
Prioritize review of GMM-defined high-risk clusters
Automate low-risk approvals to streamline operations
Continuously retrain models as fraud behavior evolves
Use explainable AI to gain trust and meet regulatory demands

