Week 4 in Data Science: Building ML Systems From the Ground Up
Hey there, fellow data enthusiasts! I'm buzzing with excitement to share what I've been working on during my fourth week of diving deep into data science. This week marked a pivotal shift in my journey: moving from using models as black boxes to building their core optimization machinery myself. Let me walk you through the highlights!
Going Under the Hood: Gradient Descent From Scratch
You know that feeling when you finally understand how something really works? That's exactly what happened when I implemented gradient descent from scratch. Instead of just calling .fit() and trusting the magic, I built every single component myself.
Here's a taste of what that looked like:
```python
import numpy as np
import matplotlib.pyplot as plt

def hypothesis(X, theta):
    """Linear hypothesis: h(x) = theta_0 + theta_1*x"""
    return X.dot(theta)

def compute_loss(X, y, theta):
    """Mean Squared Error loss function"""
    m = len(y)
    predictions = hypothesis(X, theta)
    loss = (1/(2*m)) * np.sum((predictions - y)**2)
    return loss

def compute_gradient(X, y, theta):
    """Compute gradient of the loss with respect to theta"""
    m = len(y)
    predictions = hypothesis(X, theta)
    gradient = (1/m) * X.T.dot(predictions - y)
    return gradient

def gradient_descent(X, y, theta, learning_rate, num_iterations):
    """Full gradient descent with loss tracking"""
    loss_history = []
    for i in range(num_iterations):
        # Compute gradient
        gradient = compute_gradient(X, y, theta)
        # Update parameters
        theta = theta - learning_rate * gradient
        # Track loss
        loss = compute_loss(X, y, theta)
        loss_history.append(loss)
        if i % 100 == 0:
            print(f"Iteration {i}: Loss = {loss:.4f}")
    return theta, loss_history

# Example usage
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Add bias term
X_b = np.c_[np.ones((100, 1)), X]

# Initialize parameters
theta = np.random.randn(2, 1)

# Run gradient descent
theta_final, loss_history = gradient_descent(X_b, y, theta, learning_rate=0.1, num_iterations=1000)

# Visualize the descent
plt.plot(loss_history)
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.title('Gradient Descent Convergence')
plt.show()
```
Watching that loss curve drop was incredibly satisfying! It's one thing to know gradient descent conceptually, but seeing those parameters converge based on code you wrote? That's where the real learning happens.
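One quick sanity check on a from-scratch optimizer (this snippet is a verification sketch of mine, not part of the original notebook) is to compare its answer against the closed-form normal equation, which solves the same least-squares problem directly:

```python
import numpy as np

# Recreate the same synthetic data as the gradient descent example
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]

# Normal equation: theta = (X^T X)^{-1} X^T y
theta_closed = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)

# Should land close to the true intercept 4 and slope 3 (noise shifts it slightly)
print(theta_closed.ravel())
```

If gradient descent has converged, its `theta_final` should agree with `theta_closed` to several decimal places; a large gap usually means the learning rate or iteration count needs attention.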
Exploring Loss Functions and Solvers
I also spent time getting hands-on with different loss functions using scikit-learn's SGDRegressor. Seeing how differently MSE and MAE behave, especially once features are properly scaled, was eye-opening:
```python
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Split the synthetic data from the gradient descent example
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# SGD is sensitive to feature scale, so standardize first
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# MSE loss (sensitive to outliers)
sgd_mse = SGDRegressor(loss='squared_error', max_iter=1000, tol=1e-3)
sgd_mse.fit(X_train_scaled, y_train.ravel())

# MAE loss (robust to outliers); epsilon_insensitive only equals MAE with epsilon=0
sgd_mae = SGDRegressor(loss='epsilon_insensitive', epsilon=0, max_iter=1000, tol=1e-3)
sgd_mae.fit(X_train_scaled, y_train.ravel())

print(f"MSE Model Score: {sgd_mse.score(X_test_scaled, y_test):.4f}")
print(f"MAE Model Score: {sgd_mae.score(X_test_scaled, y_test):.4f}")
```
This comparison taught me that choosing the right loss function isn't just academic—it has real implications for how your model handles messy, real-world data.
Stochastic Optimization with Logistic Regression
Moving into classification, I experimented with SGDClassifier to understand stochastic optimization better. The beauty of stochastic gradient descent is how it can handle massive datasets by updating parameters one sample (or one small mini-batch) at a time instead of computing gradients over the entire dataset:
```python
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification

# Generate classification dataset
X_clf, y_clf = make_classification(n_samples=1000, n_features=20,
                                   n_informative=15, n_redundant=5,
                                   random_state=42)

# Train with stochastic gradient descent
sgd_clf = SGDClassifier(loss='log_loss', max_iter=1000, tol=1e-3,
                        learning_rate='optimal', random_state=42)
sgd_clf.fit(X_clf, y_clf)

print(f"Training accuracy: {sgd_clf.score(X_clf, y_clf):.4f}")
```
What This Week Taught Me
Week 4 wasn't just about writing code—it was about developing intuition. By building gradient descent from scratch, I can now debug optimization issues, understand convergence problems, and make informed decisions about learning rates and iteration counts.
The shift from "black-box ML" to "understanding the mechanics" is transformative. When you know what's happening under the hood, you make better decisions about:
- Which optimizer to use
- How to set learning rates
- When to scale features
- How to diagnose training issues
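As a concrete example of that last point, here's a minimal sketch (my own toy reconstruction, not from the week's notebook) showing how an oversized learning rate makes the very same quadratic loss diverge instead of converge, which is often the first thing to check when a training curve explodes:

```python
import numpy as np

# Same kind of synthetic linear data as the from-scratch example
rng = np.random.RandomState(42)
X = 2 * rng.rand(100, 1)
y = 4 + 3 * X + rng.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]

def run_gd(lr, iters=30):
    """Run plain gradient descent and return the final MSE/2 loss."""
    theta = np.zeros((2, 1))
    m = len(y)
    for _ in range(iters):
        gradient = X_b.T.dot(X_b.dot(theta) - y) / m
        theta = theta - lr * gradient
    return np.sum((X_b.dot(theta) - y) ** 2) / (2 * m)

print(f"lr=0.1 -> final loss {run_gd(0.1):.4f}")   # settles near the noise floor
print(f"lr=1.5 -> final loss {run_gd(1.5):.4e}")   # overshoots and diverges
```

The step size that separates convergence from divergence depends on the curvature of the loss (for least squares, the largest eigenvalue of the feature covariance), which is exactly why feature scaling and learning-rate tuning interact.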
Looking Ahead
This foundation in optimization and loss functions set me up perfectly for the weeks ahead. I'm already seeing how these concepts connect to regularization, kernel methods, and eventually ensemble models. The AI revolution isn't just about using powerful tools—it's about understanding them deeply enough to push their boundaries.