
Analysis Methodology

1. Data Acquisition: 5,630 customer records with 20 features from an e-commerce platform
2. EDA & Cleaning: statistical analysis, outlier detection, missing-value handling
3. Feature Engineering: created tenure buckets, interaction terms, behavioral scores
4. Model Training: Decision Tree, Logistic Regression, hyperparameter tuning

Model Performance Metrics

Best Model: Decision Tree Classifier

  • Overall Accuracy: 89.3%
  • Churner Recall (Sensitivity): 52%
  • Non-Churner Recall (Specificity): 96%
  • AUC-ROC Score: 0.88
  • F1-Score (Churner Class): 0.59

Logistic Regression (Baseline)

  • Overall Accuracy: 87.1%
  • Churner Recall: 45%
  • AUC-ROC Score: 0.85
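The recall and specificity figures above follow directly from the confusion matrix. A minimal sketch with illustrative labels (not the actual test-set counts), using scikit-learn's `[[TN, FP], [FN, TP]]` layout:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative labels only -- not the actual test-set predictions
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# scikit-learn orders the matrix as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)  # churner recall
specificity = tn / (tn + fp)  # non-churner recall

print(f"Churner recall: {sensitivity:.2f}")      # 0.50
print(f"Non-churner recall: {specificity:.2f}")  # 0.83
```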

Why This Matters

While the model achieves 89.3% overall accuracy, the 52% churner recall means we successfully identify about half of all customers who will churn. In business terms, this translates to catching approximately 490 out of 947 churners, allowing for proactive intervention. The high non-churner recall (96%) minimizes false alarms, ensuring retention efforts are efficiently targeted.

Feature Importance Analysis

Understanding which customer attributes drive churn predictions

  • Tenure: 52%
  • Complain: 14%
  • CashbackAmount: 9%
  • DaySinceLastOrder: 7%
  • SatisfactionScore: 6%
  • Other Features: 12%

Key Takeaways

  • Tenure dominates (52%): Time as a customer is the single most important predictor. The 0-3 month window is critical.
  • Complaints are red flags (14%): Customer service quality directly impacts retention.
  • Cashback drives engagement (9%): Financial incentives matter, but must be strategically deployed.
  • Recency matters (7%): Days since last order indicates declining engagement.
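The tenure buckets created during feature engineering can be built with `pd.cut`; a sketch in which the bucket edges are assumptions rather than values taken from the notebook, with the critical 0-3 month window getting its own bucket:

```python
import pandas as pd

# Hypothetical tenure values in months
df = pd.DataFrame({'Tenure': [0, 1, 2, 5, 9, 14, 25, 40]})

# Bucket edges are illustrative; intervals are right-closed by default
df['TenureBucket'] = pd.cut(
    df['Tenure'],
    bins=[-1, 3, 6, 12, 24, float('inf')],
    labels=['0-3m', '4-6m', '7-12m', '13-24m', '24m+']
)

print(df['TenureBucket'].value_counts().sort_index())
```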

Technical Implementation

Tech Stack

  • Python 3.9+ - Core language
  • Pandas & NumPy - Data manipulation
  • Scikit-learn - ML models & preprocessing
  • Matplotlib & Seaborn - Visualization
  • Jupyter Notebook - Analysis environment
  • Tableau - Interactive dashboards

Model Configuration

  • Algorithm: Decision Tree Classifier
  • Max Depth: 8 (to prevent overfitting)
  • Min Samples Split: 50
  • Class Weight: Balanced
  • Cross-Validation: 5-fold stratified
  • Train/Test Split: 80/20
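The 5-fold stratified cross-validation listed above can be run with `cross_val_score` over a `StratifiedKFold` splitter. A sketch on synthetic data standing in for the cleaned e-commerce features (the real notebook would pass `X_train`, `y_train`):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with a churn-like class imbalance
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.83, 0.17], random_state=42)

dt = DecisionTreeClassifier(max_depth=8, min_samples_split=50,
                            class_weight='balanced', random_state=42)

# 5-fold stratified CV preserves the class ratio in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(dt, X, y, cv=cv, scoring='roc_auc')

print(f"AUC per fold: {np.round(scores, 3)}")
print(f"Mean AUC: {scores.mean():.3f}")
```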

Data Processing

  • Missing Values: Median/mode imputation
  • Outliers: IQR method (removed 3%)
  • Encoding: Label + One-Hot encoding
  • Scaling: StandardScaler for numeric features
  • Sampling: Stratified to maintain class distribution
  • Feature Engineering: Created 5 new features
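The median imputation and IQR outlier steps above can be sketched as follows; the column name, values, and 1.5x multiplier are assumptions for illustration, not taken from the notebook:

```python
import pandas as pd

# Hypothetical numeric column with one missing value and one extreme value
df = pd.DataFrame({'CashbackAmount': [120.0, 135.5, None, 140.2, 980.0, 128.7]})

# Median imputation for the missing value
df['CashbackAmount'] = df['CashbackAmount'].fillna(df['CashbackAmount'].median())

# IQR method: keep rows inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df['CashbackAmount'].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df['CashbackAmount'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df_clean = df[mask]

print(f"Removed {len(df) - len(df_clean)} outlier row(s)")  # Removed 1 outlier row(s)
```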

Sample Code: Model Training

Key implementation code from the analysis notebook

# Import libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Prepare features and target
X = df_cleaned.drop(['Churn', 'CustomerID'], axis=1)
y = df_cleaned['Churn']

# Split data with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Initialize and train Decision Tree
dt_model = DecisionTreeClassifier(
    max_depth=8,
    min_samples_split=50,
    class_weight='balanced',
    random_state=42
)

# Train model
dt_model.fit(X_train, y_train)

# Make predictions
y_pred = dt_model.predict(X_test)
y_pred_proba = dt_model.predict_proba(X_test)[:, 1]

# Evaluate performance
accuracy = dt_model.score(X_test, y_test)
auc_score = roc_auc_score(y_test, y_pred_proba)

print(f"Accuracy: {accuracy:.3f}")
print(f"AUC-ROC: {auc_score:.3f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Feature importance
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': dt_model.feature_importances_
}).sort_values('importance', ascending=False)

print("\nTop 5 Features:")
print(feature_importance.head())

Explore the Interactive Dashboard

See the insights come to life through Tableau visualizations
