
Analysis Methodology

1. Data Acquisition: 5,630 customer records with 20 features from an e-commerce platform
2. EDA & Cleaning: statistical analysis, outlier detection, missing-value handling
3. Feature Engineering: created tenure buckets, interaction terms, behavioral scores
4. Model Training: Decision Tree, Logistic Regression, hyperparameter tuning

Model Performance Metrics

Best Model: Decision Tree Classifier

  • Overall Accuracy: 89.3%
  • Churner Recall (Sensitivity): 52%
  • Non-Churner Recall (Specificity): 96%
  • AUC-ROC Score: 0.88
  • F1-Score (Churner Class): 0.59

Logistic Regression (Baseline)

  • Overall Accuracy: 87.1%
  • Churner Recall: 45%
  • AUC-ROC Score: 0.85
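The recall and specificity figures above follow directly from the confusion matrix. A minimal sketch with illustrative labels (not the actual test-set counts), using scikit-learn's `[[TN, FP], [FN, TP]]` layout:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative labels only -- not the actual test-set predictions
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# scikit-learn orders the matrix as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)  # churner recall
specificity = tn / (tn + fp)  # non-churner recall

print(f"Churner recall: {sensitivity:.2f}")      # 0.50
print(f"Non-churner recall: {specificity:.2f}")  # 0.83
```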

Why This Matters

While the model achieves 89.3% overall accuracy, the 52% churner recall means we successfully identify about half of all customers who will churn. In business terms, this translates to catching approximately 490 out of 947 churners, allowing for proactive intervention. The high non-churner recall (96%) minimizes false alarms, ensuring retention efforts are efficiently targeted.

Feature Importance Analysis

Understanding which customer attributes drive churn predictions

  • Tenure: 52%
  • Complain: 14%
  • CashbackAmount: 9%
  • DaySinceLastOrder: 7%
  • SatisfactionScore: 6%
  • Other Features: 12%

Key Takeaways

  • Tenure dominates (52%): Time as a customer is the single most important predictor. The 0-3 month window is critical.
  • Complaints are red flags (14%): Customer service quality directly impacts retention.
  • Cashback drives engagement (9%): Financial incentives matter, but must be strategically deployed.
  • Recency matters (7%): Days since last order indicates declining engagement.
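The tenure buckets created during feature engineering can be built with `pd.cut`; a sketch in which the bucket edges are assumptions rather than values taken from the notebook, with the critical 0-3 month window getting its own bucket:

```python
import pandas as pd

# Hypothetical tenure values in months
df = pd.DataFrame({'Tenure': [0, 1, 2, 5, 9, 14, 25, 40]})

# Bucket edges are illustrative; intervals are right-closed by default
df['TenureBucket'] = pd.cut(
    df['Tenure'],
    bins=[-1, 3, 6, 12, 24, float('inf')],
    labels=['0-3m', '4-6m', '7-12m', '13-24m', '24m+']
)

print(df['TenureBucket'].value_counts().sort_index())
```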

Technical Implementation

Tech Stack

  • Python 3.9+ - Core language
  • Pandas & NumPy - Data manipulation
  • Scikit-learn - ML models & preprocessing
  • Matplotlib & Seaborn - Visualization
  • Jupyter Notebook - Analysis environment
  • Tableau - Interactive dashboards

Model Configuration

  • Algorithm: Decision Tree Classifier
  • Max Depth: 8 (to prevent overfitting)
  • Min Samples Split: 50
  • Class Weight: Balanced
  • Cross-Validation: 5-fold stratified
  • Train/Test Split: 80/20
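The 5-fold stratified cross-validation listed above can be run with `cross_val_score` over a `StratifiedKFold` splitter. A sketch on synthetic data standing in for the cleaned e-commerce features (the real notebook would pass `X_train`, `y_train`):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with a churn-like class imbalance
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.83, 0.17], random_state=42)

dt = DecisionTreeClassifier(max_depth=8, min_samples_split=50,
                            class_weight='balanced', random_state=42)

# 5-fold stratified CV preserves the class ratio in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(dt, X, y, cv=cv, scoring='roc_auc')

print(f"AUC per fold: {np.round(scores, 3)}")
print(f"Mean AUC: {scores.mean():.3f}")
```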

Data Processing

  • Missing Values: Median/mode imputation
  • Outliers: IQR method (removed 3%)
  • Encoding: Label + One-Hot encoding
  • Scaling: StandardScaler for numeric features
  • Sampling: Stratified to maintain class distribution
  • Feature Engineering: Created 5 new features
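The median imputation and IQR outlier steps above can be sketched as follows; the column name, values, and 1.5x multiplier are assumptions for illustration, not taken from the notebook:

```python
import pandas as pd

# Hypothetical numeric column with one missing value and one extreme value
df = pd.DataFrame({'CashbackAmount': [120.0, 135.5, None, 140.2, 980.0, 128.7]})

# Median imputation for the missing value
df['CashbackAmount'] = df['CashbackAmount'].fillna(df['CashbackAmount'].median())

# IQR method: keep rows inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df['CashbackAmount'].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df['CashbackAmount'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df_clean = df[mask]

print(f"Removed {len(df) - len(df_clean)} outlier row(s)")  # Removed 1 outlier row(s)
```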

Sample Code: Model Training

Key implementation code from the analysis notebook

# Import libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Prepare features and target
X = df_cleaned.drop(['Churn', 'CustomerID'], axis=1)
y = df_cleaned['Churn']

# Split data with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Initialize and train Decision Tree
dt_model = DecisionTreeClassifier(
    max_depth=8,
    min_samples_split=50,
    class_weight='balanced',
    random_state=42
)

# Train model
dt_model.fit(X_train, y_train)

# Make predictions
y_pred = dt_model.predict(X_test)
y_pred_proba = dt_model.predict_proba(X_test)[:, 1]

# Evaluate performance
accuracy = dt_model.score(X_test, y_test)
auc_score = roc_auc_score(y_test, y_pred_proba)

print(f"Accuracy: {accuracy:.3f}")
print(f"AUC-ROC: {auc_score:.3f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Feature importance
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': dt_model.feature_importances_
}).sort_values('importance', ascending=False)

print("\nTop 5 Features:")
print(feature_importance.head())

Explore the Interactive Dashboard

See the insights come to life through Tableau visualizations
