Igor's Techno Club

Third Week in Data Science: The 94% Accurate Model That Was Actually Terrible

This week shattered my assumptions about what makes a "good" machine learning model. Spoiler alert: it's not accuracy.

The Accuracy Trap

I built a cardiovascular disease detection model with 94% accuracy. Impressive, right? Dead wrong.

The confusion matrix revealed the ugly truth: the model caught only 33% of at-risk patients. With severe class imbalance (most people have healthy hearts), it had learned to predict "healthy" for almost everyone. High accuracy, terrible recall—and potentially fatal in medical contexts.

Lesson learned: Recall (aka Sensitivity) = Of all the truly at-risk patients, how many did the model actually catch? Low recall = your model is confidently missing the cases that matter.

How to fix low recall (fast):

- Lower the classification threshold so more borderline cases get flagged.
- Weight the minority class (e.g., class_weight="balanced") or oversample it.
- Select and tune models on recall (or F2), not accuracy.

Rule of thumb: If false negatives are dangerous, recall > accuracy, always.
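
To see the trap in numbers, here's a minimal sketch (the class split is invented for illustration) where a model that always predicts "healthy" scores high accuracy and zero recall:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels: 94 healthy (0), 6 at-risk (1) -- made-up proportions
y_true = np.array([0] * 94 + [1] * 6)
# A "model" that predicts healthy for everyone
y_pred = np.zeros(100, dtype=int)

acc = accuracy_score(y_true, y_pred)   # looks impressive
rec = recall_score(y_true, y_pred)     # misses every at-risk patient
```

The accuracy here is 0.94 while recall is 0.0: exactly the failure mode the confusion matrix exposed.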

Preprocessing: Where Models Are Actually Won

Working with NBA and housing datasets, I discovered that 300 duplicate rows can silently kill your model through data leakage. But the real revelation? Not all missing values are created equal.
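
A minimal pandas sketch of the fix (the toy frame and column names are hypothetical stand-ins for the NBA data): dedupe before splitting, so identical rows can't land on both sides of the train/test boundary:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the NBA data (columns are hypothetical)
df = pd.DataFrame({"pts": [10, 10, 25, 8],
                   "career_5yrs": [0, 0, 1, 1]})

# Drop exact duplicates BEFORE splitting: the same row landing in both
# train and test silently leaks answers into the evaluation
df = df.drop_duplicates()
train, test = train_test_split(df, test_size=0.5, random_state=42)
```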

In our housing data, NaN values in GarageFinish didn't mean "missing"—they meant "no garage." Blindly imputing them would have been disastrous. Always read your data dictionary first.
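
A sketch of the safer handling, using a hypothetical slice of that column: give the NaN its own explicit category instead of imputing it away:

```python
import numpy as np
import pandas as pd

# Hypothetical slice of the housing column
df = pd.DataFrame({"GarageFinish": ["Fin", np.nan, "Unf", np.nan]})

# Per the data dictionary, NaN means "no garage", not "value unknown":
# encode it as a real category rather than imputing the mode
df["GarageFinish"] = df["GarageFinish"].fillna("NoGarage")
```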

Scaling choices matter enormously too. Using the wrong scaler on our KNN model dropped performance from 65% to 61%—a 4-point swing that translates to thousands of dollars in real estate predictions.

The K-Nearest Neighbors Revelation

KNN taught me three critical lessons:

Scale sensitivity is real. After normalizing features, our model jumped from 60.8% to 64.9%. KNN calculates distances, and unscaled features dominate those calculations.
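
One way to keep scaling honest is to put the scaler inside a scikit-learn Pipeline, so it's fit on training data only and every feature contributes on equal terms to the distance math. A minimal sketch (fit on sklearn's bundled iris data purely as a stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Scaler inside the Pipeline: it is (re)fit on whatever data .fit()
# sees, so cross-validation never leaks test statistics into scaling
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

X, y = load_iris(return_X_y=True)  # stand-in dataset for illustration
model.fit(X, y)
acc = model.score(X, y)
```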

K is not just a number. Sweeping values from 1 to 25 made the bias-variance trade-off visible: a tiny K memorizes noise, a huge K blurs real structure, and the best value sits somewhere in between.
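
A sweep like that can be sketched with cross-validation (using sklearn's bundled breast-cancer data as a stand-in, since the original datasets aren't reproduced here):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in dataset

# Score each candidate K with 5-fold cross-validation
scores = {}
for k in range(1, 26):
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores[k] = cross_val_score(pipe, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
```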

Learning curves are diagnostic gold. They revealed exactly when models were overfitting, underfitting, or ready for production.
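
scikit-learn's learning_curve produces the raw numbers behind those plots; a minimal sketch, again on stand-in data:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in dataset
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# Score the model at increasing training-set sizes
sizes, train_scores, val_scores = learning_curve(
    pipe, X, y, cv=5, train_sizes=np.linspace(0.2, 1.0, 4))

# A persistent gap between the two mean curves points at overfitting;
# both curves low and flat point at underfitting
gap = train_scores.mean(axis=1) - val_scores.mean(axis=1)
```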

The Threshold Tuning Secret

Here's the power move nobody talks about: most tutorials stop at the default 0.5 classification threshold. But what if your business demands different trade-offs?

For player career predictions, I needed 90% precision to give a coach his "guarantee." The default model? Only 74% precision.

The solution: generate probability predictions, use precision_recall_curve to map every threshold to its metrics, then select the threshold meeting your requirement (0.86 in this case).
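
A sketch of those steps (the helper name and the toy labels/probabilities are my own, not from the original notebook), using precision_recall_curve to pick the smallest threshold that clears a precision target:

```python
from sklearn.metrics import precision_recall_curve

def threshold_for_precision(y_true, y_proba, target=0.90):
    """Smallest probability threshold whose precision meets the target
    (hypothetical helper for illustration)."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_proba)
    # precision has one more entry than thresholds; align before scanning
    for p, t in zip(precision[:-1], thresholds):
        if p >= target:
            return t
    return None  # target precision is unreachable on this data
```

Feed it the positive-class probabilities from predict_proba; the returned threshold then replaces the default 0.5 at prediction time.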

Result: 90% precision achieved. Without this tuning, we'd have been stuck at 74%—not good enough for real-world decisions.

The Elegant Cyclical Feature Fix

Linear models see December (12) and January (1) as far apart, but they're adjacent months. The solution? Transform into sine and cosine components:

sin_MoSold = sin(2π × MoSold / 12)
cos_MoSold = cos(2π × MoSold / 12)
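
A runnable version of the transform, plus a quick check that December and January really do end up close in the encoded space:

```python
import numpy as np

months = np.arange(1, 13)
sin_m = np.sin(2 * np.pi * months / 12)
cos_m = np.cos(2 * np.pi * months / 12)

def month_distance(a, b):
    """Distance between two months in the encoded (sin, cos) space."""
    return float(np.hypot(sin_m[a - 1] - sin_m[b - 1],
                          cos_m[a - 1] - cos_m[b - 1]))

# In raw month numbers, December (12) and January (1) are 11 apart;
# in the encoded space they are neighbors, while June stays far away
```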

This preserved time's cyclical nature and improved performance. Small, thoughtful transformations separate mediocre models from production-ready ones.

The Bottom Line

Machine learning isn't about finding the "best" algorithm. It's about:

- Understanding your data before you model it
- Choosing metrics that reflect the real cost of a mistake
- Iterating systematically on preprocessing, features, and thresholds

The journey from 57% to 87% R² on housing predictions wasn't magic—it was systematic problem-solving. Each "failure" taught something valuable.

Next time someone brags about their accuracy score, ask to see their confusion matrix. The real story is always in the details.