COURSE - Cross Entropy
You: Write a course for me to learn about cross entropy. I am a software developer and do best when writing software and solving exercises and then getting it to run to visualize the results.
ChatGPT: Below is a structured course outline designed to help you learn about cross-entropy by exploring its concepts and then implementing them in code. The course is broken down into modules, each containing an explanation, a coding exercise, and a suggested way to visualize results. You’ll use Python and packages like NumPy, Matplotlib, and potentially PyTorch or TensorFlow for more advanced exercises. Throughout, you’ll deepen your understanding by actually computing cross-entropy for various distributions and modeling scenarios.
Prerequisites
- Mathematics: Basic understanding of probability distributions, sums, and logarithms.
- Python Skills: Familiarity with Python, NumPy, and basic plotting with Matplotlib.
- Machine Learning Basics: Understanding of concepts such as probability distributions, classification tasks, and model outputs (like probabilities).
Setup Instructions:
- Install Python 3.x
- Install NumPy, Matplotlib, and SciPy (SciPy's logsumexp is used in Module 6)
- (Later Modules) Install PyTorch or TensorFlow if desired:
pip install numpy matplotlib scipy torch # or pip install numpy matplotlib scipy tensorflow
Module 1: Introduction to Cross-Entropy
Goal: Understand the definition and intuition behind cross-entropy.
Concepts:
- What is Cross-Entropy? Cross-entropy measures the dissimilarity between two probability distributions (it is not a true distance: it is asymmetric). In machine learning, it’s often used to measure the difference between the predicted probability distribution and the target distribution.
- Mathematical Definition: Given two distributions $p$ (true) and $q$ (predicted) over discrete classes, cross-entropy is: $$H(p, q) = -\sum_{x} p(x) \log q(x)$$
- Intuition: If the predicted distribution $q$ matches the target $p$ exactly, cross-entropy is minimized (it equals the entropy of $p$). Deviations increase the value, indicating a poor fit.
Exercise:
Consider a simple binary classification problem. Suppose the true distribution for a label is $p = [1, 0]$ (class 0 is correct) and you have two different predictions:
- $q_1 = [0.9, 0.1]$
- $q_2 = [0.2, 0.8]$
Compute the cross-entropy for both and see which is lower.
Code Skeleton:
import numpy as np
# True distribution (class 0 is correct)
p = np.array([1.0, 0.0])
# Predicted distributions
q1 = np.array([0.9, 0.1])
q2 = np.array([0.2, 0.8])
def cross_entropy(p, q):
    # Ensure numerical stability: clip to avoid log(0)
    q = np.clip(q, 1e-12, 1.0)
    return -np.sum(p * np.log(q))
ce_q1 = cross_entropy(p, q1)
ce_q2 = cross_entropy(p, q2)
print("Cross-entropy for q1:", ce_q1)
print("Cross-entropy for q2:", ce_q2)
Visualization:
- Since these are single values, just print them out for now.
- You’ll see q1’s cross-entropy is lower because q1 places higher probability on the correct class.
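To see this as a curve rather than two isolated values, the self-contained sketch below (it re-defines the same `cross_entropy` helper) sweeps the predicted probability of the correct class and plots the resulting cross-entropy:

```python
import numpy as np
import matplotlib.pyplot as plt

def cross_entropy(p, q):
    q = np.clip(q, 1e-12, 1.0)  # avoid log(0)
    return -np.sum(p * np.log(q))

p = np.array([1.0, 0.0])  # class 0 is correct
q0 = np.linspace(0.01, 0.99, 99)  # candidate probabilities for class 0
ces = [cross_entropy(p, np.array([q, 1.0 - q])) for q in q0]

plt.plot(q0, ces)
plt.xlabel('Predicted probability of the correct class')
plt.ylabel('Cross-entropy')
plt.title('Cross-entropy vs. confidence in the correct class')
plt.show()
```

The curve decreases monotonically: the more probability the prediction places on the correct class, the lower the cross-entropy.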
Module 2: Cross-Entropy vs. Other Metrics
Goal: Distinguish cross-entropy from related metrics like KL divergence, mean squared error (MSE) for probability distributions, and categorical error rate.
Concepts:
- Cross-Entropy vs. KL Divergence: KL divergence is $\sum_x p(x)\log\frac{p(x)}{q(x)}$. Cross-entropy decomposes as $H(p, q) = H(p) + D_{KL}(p \| q)$, where $H(p)$ is the entropy of $p$. Since $H(p)$ is constant when $p$ is fixed, minimizing cross-entropy also minimizes KL divergence.
- Cross-Entropy vs. MSE: Using MSE on predicted probabilities produces small gradients when the model is confidently wrong, so training tends to converge more slowly. Cross-entropy corresponds to maximizing the likelihood of the data and penalizes confident mistakes sharply.
Exercise:
- Implement functions for cross-entropy, KL divergence, and MSE on the same distributions and compare the results.
- Given a small set of distributions, compute these metrics and see how they differ.
Code Skeleton:
def kl_divergence(p, q):
    q = np.clip(q, 1e-12, 1.0)
    p = np.clip(p, 1e-12, 1.0)
    return np.sum(p * np.log(p / q))

def mse(p, q):
    return np.mean((p - q)**2)
# Test on the same distributions from Module 1
print("KL Divergence:", kl_divergence(p, q1))
print("MSE:", mse(p, q1))
Visualization:
- Just print the values for these metrics. Note how values scale differently and think about what it means.
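The decomposition $H(p, q) = H(p) + D_{KL}(p \| q)$ from the concepts above can also be verified numerically. This is a minimal self-contained check; the two distributions are arbitrary examples:

```python
import numpy as np

def cross_entropy(p, q):
    q = np.clip(q, 1e-12, 1.0)
    return -np.sum(p * np.log(q))

def entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p))

def kl_divergence(p, q):
    p = np.clip(p, 1e-12, 1.0)
    q = np.clip(q, 1e-12, 1.0)
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])  # an arbitrary "true" distribution
q = np.array([0.5, 0.3, 0.2])  # an arbitrary prediction

lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)
print(lhs, rhs)  # the two sides agree
```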
Module 3: Cross-Entropy in Classification Tasks
Goal: Understand how cross-entropy loss is used as an objective function in machine learning classification tasks.
Concepts:
- Softmax Outputs: In many classification models (like neural networks), the final layer outputs logits (unnormalized scores) which are passed through a softmax to produce a probability distribution.
- Cross-Entropy Loss in Training: Minimizing cross-entropy encourages the model to assign high probability to the correct class.
Exercise:
- Implement a softmax function.
- Compute cross-entropy loss for a multi-class problem. For instance, imagine a 3-class classification where the model’s logits and the true class label are given.
Code Skeleton:
def softmax(logits):
    exps = np.exp(logits - np.max(logits))
    return exps / np.sum(exps, axis=-1, keepdims=True)
# Simulate a 3-class problem
logits = np.array([2.0, 0.5, -1.0]) # Model outputs
p = np.array([1.0, 0.0, 0.0]) # True class is class 0
q = softmax(logits)
ce_loss = cross_entropy(p, q)
print("Softmax probabilities:", q)
print("Cross-entropy Loss:", ce_loss)
Visualization:
- Plot the softmax probabilities for different logits values to see how the distribution changes. For example, vary one logit and visualize the resulting probability distribution.
import matplotlib.pyplot as plt
x_values = np.linspace(-2, 2, 50)
prob_values = []
for x in x_values:
    q_test = softmax(np.array([x, 0.5, -1.0]))
    prob_values.append(q_test[0])  # Probability of class 0
plt.plot(x_values, prob_values, label='P(class 0)')
plt.xlabel('Logit value for class 0')
plt.ylabel('Probability')
plt.title('Probability of class 0 as its logit changes')
plt.legend()
plt.show()
Module 4: Implementing Cross-Entropy Loss for a Simple Classifier
Goal: Apply cross-entropy to a simple model and watch it learn.
Concepts:
- Gradient Descent: Use cross-entropy as the loss function to drive gradient updates.
- Binary Classifier Example: Train a logistic regression model on a tiny dataset and monitor the loss over epochs.
Exercise:
- Generate a small 2D synthetic dataset (e.g., points from two Gaussian clusters).
- Implement logistic regression:
- Initialize weights and bias.
- Compute logits and apply the sigmoid (the two-class special case of softmax).
- Compute cross-entropy loss.
- Perform gradient descent updates.
- Plot the decision boundary and observe how it changes over training.
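Before writing the full training loop, here is a quick standalone check that the sigmoid really is the two-class softmax (the specific logit values are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(logits):
    exps = np.exp(logits - np.max(logits))
    return exps / np.sum(exps)

# softmax over [z, 0] puts probability e^z / (e^z + 1) = sigmoid(z)
# on the first class, so the two functions agree for any logit z.
for z in [-3.0, -0.5, 0.0, 1.7, 4.2]:
    assert np.isclose(sigmoid(z), softmax(np.array([z, 0.0]))[0])
print("sigmoid matches the two-class softmax")
```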
Code Skeleton:
# Synthetic binary dataset
np.random.seed(42)
num_samples = 100
X_class0 = np.random.randn(num_samples, 2) + np.array([-2, -2])
X_class1 = np.random.randn(num_samples, 2) + np.array([2, 2])
X = np.vstack([X_class0, X_class1])
y = np.array([0]*num_samples + [1]*num_samples)
# Initialize weights
W = np.zeros((2,))
b = 0.0
lr = 0.1
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def binary_cross_entropy(y_true, y_pred):
    y_pred = np.clip(y_pred, 1e-12, 1 - 1e-12)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
losses = []
for epoch in range(100):
    # Forward pass
    logits = X.dot(W) + b
    preds = sigmoid(logits)
    # Compute loss
    loss = binary_cross_entropy(y, preds)
    losses.append(loss)
    # Backprop
    grad_W = np.dot(X.T, (preds - y)) / len(y)
    grad_b = np.mean(preds - y)
    # Update
    W -= lr * grad_W
    b -= lr * grad_b
print("Final Weights:", W, "Final Bias:", b)
plt.plot(losses)
plt.title('Training Loss over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Cross-Entropy Loss')
plt.show()
# Visualize decision boundary
x_min, x_max = X[:,0].min()-1, X[:,0].max()+1
y_min, y_max = X[:,1].min()-1, X[:,1].max()+1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100), np.linspace(y_min, y_max, 100))
Z = sigmoid(np.c_[xx.ravel(), yy.ravel()].dot(W) + b)
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, levels=[0,0.5,1], alpha=0.5, cmap='RdBu')
plt.scatter(X[:num_samples,0], X[:num_samples,1], c='blue', label='Class 0')
plt.scatter(X[num_samples:,0], X[num_samples:,1], c='red', label='Class 1')
plt.legend()
plt.title('Decision Boundary')
plt.show()
Module 5: Cross-Entropy in Multi-Class Neural Networks (with PyTorch or TensorFlow)
Goal: Use a framework that has a built-in cross-entropy loss and train a simple neural network.
Concepts:
- Using Built-in Loss Functions: PyTorch’s nn.CrossEntropyLoss() or TensorFlow’s tf.keras.losses.CategoricalCrossentropy. Note that nn.CrossEntropyLoss expects raw logits (it applies log-softmax internally), while CategoricalCrossentropy expects probabilities unless from_logits=True is set.
- Training a Simple NN: Create a small neural network for a multi-class problem and observe how the cross-entropy loss decreases as the network learns.
Exercise:
- Generate a synthetic 3-class dataset (e.g., three clusters of points).
- Train a small feedforward network to classify the points into one of the three classes.
- Visualize the decision boundaries at the end of training.
Code Skeleton (PyTorch):
import torch
import torch.nn as nn
import torch.optim as optim
# Synthetic dataset
N = 300
X_class0 = np.random.randn(N, 2) + np.array([0,0])
X_class1 = np.random.randn(N, 2) + np.array([3,3])
X_class2 = np.random.randn(N, 2) + np.array([-3,3])
X_all = np.vstack([X_class0, X_class1, X_class2])
y_all = np.array([0]*N + [1]*N + [2]*N)
X_tensor = torch.tensor(X_all, dtype=torch.float32)
y_tensor = torch.tensor(y_all, dtype=torch.long)
model = nn.Sequential(
    nn.Linear(2, 3)
)
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)
losses = []
for epoch in range(200):
    optimizer.zero_grad()
    logits = model(X_tensor)
    loss = loss_fn(logits, y_tensor)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
plt.plot(losses)
plt.title('NN Training Loss (Cross-Entropy)')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()
Visualization:
- Plot the decision boundaries similarly to the logistic regression case, but now with a model that can classify three classes.
- After training, visualize how the cross-entropy goes down and the decision regions form around each cluster.
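One way to plot the three-class decision regions, sketched here as a self-contained script (it retrains the same linear model on the same synthetic clusters, so the seeds and hyperparameters are assumptions): evaluate the trained model on a grid of points and colour each point by its argmax class.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt

np.random.seed(0)
torch.manual_seed(0)

# Rebuild the three-cluster dataset from the skeleton above
N = 300
X_all = np.vstack([
    np.random.randn(N, 2) + np.array([0, 0]),
    np.random.randn(N, 2) + np.array([3, 3]),
    np.random.randn(N, 2) + np.array([-3, 3]),
])
y_all = np.array([0] * N + [1] * N + [2] * N)
X_tensor = torch.tensor(X_all, dtype=torch.float32)
y_tensor = torch.tensor(y_all, dtype=torch.long)

# Same model, loss, and optimizer as before
model = nn.Sequential(nn.Linear(2, 3))
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)
for _ in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(X_tensor), y_tensor)
    loss.backward()
    optimizer.step()

# Classify every point on a grid and colour it by the predicted class
x_min, x_max = X_all[:, 0].min() - 1, X_all[:, 0].max() + 1
y_min, y_max = X_all[:, 1].min() - 1, X_all[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                     np.linspace(y_min, y_max, 200))
grid = torch.tensor(np.c_[xx.ravel(), yy.ravel()], dtype=torch.float32)
with torch.no_grad():
    Z = model(grid).argmax(dim=1).numpy().reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.4, cmap='Set1')
plt.scatter(X_all[:, 0], X_all[:, 1], c=y_all, s=10, cmap='Set1')
plt.title('Decision regions of the 3-class model')
plt.show()
```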
Module 6: Practical Considerations
Goal: Learn about numerical stability, log-softmax implementations, and common pitfalls.
Concepts:
- Numerical Stability: Naively computing $\log(\text{softmax}(z))$ can overflow or underflow when logits are large. Many libraries implement a numerically stable log-softmax internally.
- Label Smoothing: Sometimes, pure one-hot targets (like [1,0,0]) are replaced with slightly softened targets (like [0.9,0.05,0.05]) to help generalization.
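A minimal sketch of label smoothing; the smoothing factor eps=0.15 is chosen purely to reproduce the [0.9, 0.05, 0.05] example above:

```python
import numpy as np

def cross_entropy(p, q):
    q = np.clip(q, 1e-12, 1.0)
    return -np.sum(p * np.log(q))

def smooth_labels(one_hot, eps):
    # Take eps of the probability mass from the hot class and
    # spread it uniformly over all classes
    return one_hot * (1.0 - eps) + eps / len(one_hot)

one_hot = np.array([1.0, 0.0, 0.0])
smoothed = smooth_labels(one_hot, eps=0.15)
print(smoothed)  # [0.9, 0.05, 0.05]

q = np.array([0.98, 0.01, 0.01])  # an over-confident prediction
print(cross_entropy(one_hot, q))   # low: one-hot targets reward extreme confidence
print(cross_entropy(smoothed, q))  # higher: smoothing penalizes over-confidence
```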
Exercise:
- Implement a numerically stable log-softmax using logsumexp.
- Compare the cross-entropy computed with a stable version vs. a naive implementation for a set of large logits.
Code Skeleton:
from scipy.special import logsumexp
def stable_log_softmax(logits):
    lse = logsumexp(logits)
    return logits - lse
logits = np.array([10.0, 20.0, 100.0]) # Large values
log_softmax_vals = stable_log_softmax(logits)
print("Stable log-softmax:", log_softmax_vals)
print("Exponentials sum to ~1:", np.exp(log_softmax_vals).sum())
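To see why the stable version matters, compare it against a naive implementation on logits large enough to overflow float64 (np.exp overflows around 709, so the value 1000 here is chosen to force it):

```python
import numpy as np
from scipy.special import logsumexp

def naive_log_softmax(logits):
    exps = np.exp(logits)  # exp(1000) overflows float64 -> inf
    return np.log(exps / np.sum(exps))

def stable_log_softmax(logits):
    return logits - logsumexp(logits)  # no large exponentials computed

logits = np.array([10.0, 20.0, 1000.0])
with np.errstate(over='ignore', invalid='ignore', divide='ignore'):
    naive = naive_log_softmax(logits)
stable = stable_log_softmax(logits)

print("Naive: ", naive)   # contains -inf / nan from the overflow
print("Stable:", stable)  # finite log-probabilities
```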
Going Further
- Cross-Entropy for Continuous Distributions: In more advanced scenarios (like language modeling), you’ll encounter cross-entropy over large vocabularies and continuous distributions.
- Advanced Visualization: Integrate TensorBoard or other visualization tools to track training progress over time.
- Hyperparameter Tuning: Explore how learning rate or batch size affects how quickly cross-entropy decreases.
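The learning-rate question from the last bullet can be explored with a small self-contained experiment reusing the Module 4 logistic-regression setup (the learning rates and cluster positions here are arbitrary choices):

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)

# Two separable Gaussian clusters, as in Module 4
n = 100
X = np.vstack([np.random.randn(n, 2) - 2, np.random.randn(n, 2) + 2])
y = np.array([0] * n + [1] * n)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(lr, epochs=100):
    W, b = np.zeros(2), 0.0
    losses = []
    for _ in range(epochs):
        preds = sigmoid(X.dot(W) + b)
        clipped = np.clip(preds, 1e-12, 1 - 1e-12)
        losses.append(-np.mean(y * np.log(clipped) + (1 - y) * np.log(1 - clipped)))
        W -= lr * X.T.dot(preds - y) / len(y)
        b -= lr * np.mean(preds - y)
    return losses

final = {}
for lr in [0.01, 0.1, 1.0]:
    losses = train(lr)
    final[lr] = losses[-1]
    plt.plot(losses, label=f'lr={lr}')
plt.xlabel('Epoch')
plt.ylabel('Cross-entropy loss')
plt.legend()
plt.title('Effect of learning rate on the loss curve')
plt.show()
```

On this easy dataset, larger learning rates drive the cross-entropy down faster; on harder problems, too large a rate can oscillate or diverge.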
Final Project
Description:
- Build a small neural network to classify MNIST digits (0-9) using cross-entropy loss and visualize the training curves.
- Inspect how cross-entropy evolves over epochs and compare it to accuracy.
- Plot some misclassified examples and analyze what the network got wrong and how that corresponds to the loss.
Outline:
- Load MNIST via PyTorch datasets or TensorFlow Datasets.
- Set up a simple MLP or CNN model.
- Train it with cross-entropy loss.
- Plot training and validation cross-entropy and accuracy.
- Visualize sample predictions and show how the cross-entropy of those predictions changes as you train more epochs.