Scapegoat TIL

COURSE - Cross Entropy

Original o1 Transcript

You: Write a course for me to learn about cross entropy. I am a software developer and do best when writing software and solving exercises and then getting it to run to visualize the results.

ChatGPT: Below is a structured course outline designed to help you learn about cross-entropy by exploring its concepts and then implementing them in code. The course is broken down into modules, each containing an explanation, a coding exercise, and a suggested way to visualize results. You’ll use Python and packages like NumPy, Matplotlib, and potentially PyTorch or TensorFlow for more advanced exercises. Throughout, you’ll deepen your understanding by actually computing cross-entropy for various distributions and modeling scenarios.


Prerequisites

Setup Instructions:


Module 1: Introduction to Cross-Entropy

Goal: Understand the definition and intuition behind cross-entropy.

Concepts:

Exercise:

  1. Consider a simple binary classification problem. Suppose the true distribution for a label is p = [1, 0] (class 0 is correct) and you have two different predictions:

    • q1 = [0.9, 0.1]
    • q2 = [0.2, 0.8]

    Compute the cross-entropy for both and see which is lower.

Code Skeleton:

import numpy as np

# True distribution (class 0 is correct)
p = np.array([1.0, 0.0])

# Predicted distributions
q1 = np.array([0.9, 0.1])
q2 = np.array([0.2, 0.8])

def cross_entropy(p, q):
    # Ensure numerical stability
    q = np.clip(q, 1e-12, 1.0)
    return -np.sum(p * np.log(q))

ce_q1 = cross_entropy(p, q1)
ce_q2 = cross_entropy(p, q2)

print("Cross-entropy for q1:", ce_q1)
print("Cross-entropy for q2:", ce_q2)

Visualization:
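One way to visualize the exercise above is to plot cross-entropy as the predicted probability of the true class varies. A minimal sketch: with p = [1, 0], the cross-entropy reduces to -log(q[0]), so sweeping q[0] shows why confident wrong predictions are punished so heavily.

```python
import numpy as np
import matplotlib.pyplot as plt

# Cross-entropy against p = [1, 0] reduces to -log(q[0]);
# sweep q[0] from near 0 to 1 to see the loss curve
q0 = np.linspace(0.01, 1.0, 100)
ce = -np.log(q0)

plt.plot(q0, ce)
plt.xlabel('Predicted probability of the true class')
plt.ylabel('Cross-entropy')
plt.title('Cross-entropy vs. predicted probability of the true class')
plt.show()
```

Note how the curve hits 0 at q0 = 1 and blows up as q0 approaches 0: the loss from q2 = [0.2, 0.8] in the exercise lands far up this curve compared to q1 = [0.9, 0.1].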


Module 2: Cross-Entropy vs. Other Metrics

Goal: Distinguish cross-entropy from related metrics like KL divergence, mean squared error (MSE) for probability distributions, and categorical error rate.

Concepts:

Exercise:

  1. Implement functions for cross-entropy, KL divergence, and MSE on the same distributions and compare the results.
  2. Given a small set of distributions, compute these metrics and see how they differ.

Code Skeleton:

def kl_divergence(p, q):
    q = np.clip(q, 1e-12, 1.0)
    p = np.clip(p, 1e-12, 1.0)
    return np.sum(p * np.log(p/q))

def mse(p, q):
    return np.mean((p - q)**2)

# Test on the same distributions from Module 1
print("KL Divergence:", kl_divergence(p, q1))
print("MSE:", mse(p, q1))

Visualization:
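A possible visualization for this module is a grouped bar chart of the three metrics on the Module 1 distributions. This sketch redefines the metric functions so it runs standalone; the chart layout (bar width, labels) is an arbitrary choice.

```python
import numpy as np
import matplotlib.pyplot as plt

# Self-contained versions of the metrics from Modules 1 and 2
def cross_entropy(p, q):
    q = np.clip(q, 1e-12, 1.0)
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    p = np.clip(p, 1e-12, 1.0)
    q = np.clip(q, 1e-12, 1.0)
    return np.sum(p * np.log(p / q))

def mse(p, q):
    return np.mean((p - q) ** 2)

p = np.array([1.0, 0.0])
preds = {'q1 = [0.9, 0.1]': np.array([0.9, 0.1]),
         'q2 = [0.2, 0.8]': np.array([0.2, 0.8])}
metrics = {'Cross-entropy': cross_entropy,
           'KL divergence': kl_divergence,
           'MSE': mse}

x = np.arange(len(preds))
width = 0.25
for i, (name, fn) in enumerate(metrics.items()):
    values = [fn(p, q) for q in preds.values()]
    plt.bar(x + i * width, values, width, label=name)
plt.xticks(x + width, list(preds))
plt.ylabel('Metric value')
plt.title('Cross-entropy vs. KL divergence vs. MSE')
plt.legend()
plt.show()
```

The chart should make one relationship obvious: when p is one-hot, its entropy is 0, so KL divergence and cross-entropy coincide, while MSE sits on a different scale entirely.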


Module 3: Cross-Entropy in Classification Tasks

Goal: Understand how cross-entropy loss is used as an objective function in machine learning classification tasks.

Concepts:

Exercise:

  1. Implement a softmax function.
  2. Compute cross-entropy loss for a multi-class problem. For instance, imagine a 3-class classification where the model’s logits and the true class label are given.

Code Skeleton:

def softmax(logits):
    exps = np.exp(logits - np.max(logits))
    return exps / np.sum(exps, axis=-1, keepdims=True)

# Simulate a 3-class problem
logits = np.array([2.0, 0.5, -1.0])  # Model outputs
p = np.array([1.0, 0.0, 0.0])       # True class is class 0

q = softmax(logits)
ce_loss = cross_entropy(p, q)
print("Softmax probabilities:", q)
print("Cross-entropy Loss:", ce_loss)

Visualization:

import matplotlib.pyplot as plt

x_values = np.linspace(-2, 2, 50)
prob_values = []
for x in x_values:
    q_test = softmax(np.array([x, 0.5, -1.0]))
    prob_values.append(q_test[0])  # Probability of class 0

plt.plot(x_values, prob_values, label='P(class 0)')
plt.xlabel('Logit value for class 0')
plt.ylabel('Probability')
plt.title('Probability of class 0 as its logit changes')
plt.legend()
plt.show()

Module 4: Implementing Cross-Entropy Loss for a Simple Classifier

Goal: Apply cross-entropy to a simple model and watch it learn.

Concepts:

Exercise:

  1. Generate a small 2D synthetic dataset (e.g., points from two Gaussian clusters).
  2. Implement logistic regression:
    • Initialize weights and bias.
    • Compute logits, apply the sigmoid (the two-class special case of softmax).
    • Compute cross-entropy loss.
    • Perform gradient descent updates.
  3. Plot the decision boundary and observe how it changes over training.

Code Skeleton:

# Synthetic binary dataset
np.random.seed(42)
num_samples = 100
X_class0 = np.random.randn(num_samples, 2) + np.array([-2, -2])
X_class1 = np.random.randn(num_samples, 2) + np.array([2, 2])

X = np.vstack([X_class0, X_class1])
y = np.array([0]*num_samples + [1]*num_samples)

# Initialize weights
W = np.zeros((2,))
b = 0.0
lr = 0.1

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def binary_cross_entropy(y_true, y_pred):
    y_pred = np.clip(y_pred, 1e-12, 1-1e-12)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true)*np.log(1 - y_pred))

losses = []
for epoch in range(100):
    # Forward pass
    logits = X.dot(W) + b
    preds = sigmoid(logits)
    
    # Compute loss
    loss = binary_cross_entropy(y, preds)
    losses.append(loss)
    
    # Backprop
    grad_W = np.dot(X.T, (preds - y)) / len(y)
    grad_b = np.mean(preds - y)
    
    # Update
    W -= lr * grad_W
    b -= lr * grad_b

print("Final Weights:", W, "Final Bias:", b)

plt.plot(losses)
plt.title('Training Loss over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Cross-Entropy Loss')
plt.show()

# Visualize decision boundary
x_min, x_max = X[:,0].min()-1, X[:,0].max()+1
y_min, y_max = X[:,1].min()-1, X[:,1].max()+1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100), np.linspace(y_min, y_max, 100))
Z = sigmoid(np.c_[xx.ravel(), yy.ravel()].dot(W) + b)
Z = Z.reshape(xx.shape)

plt.contourf(xx, yy, Z, levels=[0,0.5,1], alpha=0.5, cmap='RdBu')
plt.scatter(X[:num_samples,0], X[:num_samples,1], c='blue', label='Class 0')
plt.scatter(X[num_samples:,0], X[num_samples:,1], c='red', label='Class 1')
plt.legend()
plt.title('Decision Boundary')
plt.show()

Module 5: Cross-Entropy in Multi-Class Neural Networks (with PyTorch or TensorFlow)

Goal: Use a framework that has a built-in cross-entropy loss and train a simple neural network.

Concepts:

Exercise:

  1. Generate a synthetic 3-class dataset (e.g., three clusters of points).
  2. Train a small feedforward network to classify the points into one of the three classes.
  3. Visualize the decision boundaries at the end of training.

Code Skeleton (PyTorch):

import torch
import torch.nn as nn
import torch.optim as optim

# Synthetic dataset
N = 300
X_class0 = np.random.randn(N, 2) + np.array([0,0])
X_class1 = np.random.randn(N, 2) + np.array([3,3])
X_class2 = np.random.randn(N, 2) + np.array([-3,3])
X_all = np.vstack([X_class0, X_class1, X_class2])
y_all = np.array([0]*N + [1]*N + [2]*N)

X_tensor = torch.tensor(X_all, dtype=torch.float32)
y_tensor = torch.tensor(y_all, dtype=torch.long)

model = nn.Sequential(
    nn.Linear(2, 3)
)

loss_fn = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

losses = []
for epoch in range(200):
    optimizer.zero_grad()
    logits = model(X_tensor)
    loss = loss_fn(logits, y_tensor)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

plt.plot(losses)
plt.title('NN Training Loss (Cross-Entropy)')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()

Visualization:
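One way to visualize the trained network, mirroring the Module 4 boundary plot: evaluate the model on a grid and color each point by its argmax class. This sketch re-runs the training from the skeleton above so it is self-contained; the grid resolution and colormap are arbitrary choices.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt

np.random.seed(0)
torch.manual_seed(0)

# Recreate and train the linear 3-class model from the skeleton above
N = 300
X_all = np.vstack([np.random.randn(N, 2) + np.array([0, 0]),
                   np.random.randn(N, 2) + np.array([3, 3]),
                   np.random.randn(N, 2) + np.array([-3, 3])])
y_all = np.array([0] * N + [1] * N + [2] * N)
X_tensor = torch.tensor(X_all, dtype=torch.float32)
y_tensor = torch.tensor(y_all, dtype=torch.long)

model = nn.Sequential(nn.Linear(2, 3))
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)
for _ in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(X_tensor), y_tensor)
    loss.backward()
    optimizer.step()

# Evaluate on a grid and color each point by its predicted class
x_min, x_max = X_all[:, 0].min() - 1, X_all[:, 0].max() + 1
y_min, y_max = X_all[:, 1].min() - 1, X_all[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                     np.linspace(y_min, y_max, 200))
grid = torch.tensor(np.c_[xx.ravel(), yy.ravel()], dtype=torch.float32)
with torch.no_grad():
    Z = model(grid).argmax(dim=1).numpy().reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.4, cmap='viridis')
plt.scatter(X_all[:, 0], X_all[:, 1], c=y_all, s=8, cmap='viridis')
plt.title('Decision regions of the trained 3-class model')
plt.show()
```

Because the model is a single linear layer, the three decision regions are separated by straight lines; adding a hidden layer with a nonlinearity would let the boundaries curve.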


Module 6: Practical Considerations

Goal: Learn about numerical stability, log-softmax implementations, and common pitfalls.

Concepts:

Exercise:

  1. Implement a numerically stable log-softmax using logsumexp.
  2. Compare the cross-entropy computed with a stable version vs. a naive implementation for a set of large logits.

Code Skeleton:

from scipy.special import logsumexp

def stable_log_softmax(logits):
    lse = logsumexp(logits)
    return logits - lse

logits = np.array([10.0, 20.0, 100.0])  # Large values
log_softmax_vals = stable_log_softmax(logits)
print("Stable log-softmax:", log_softmax_vals)
print("Exponentials sum to ~1:", np.exp(log_softmax_vals).sum())
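For step 2 of the exercise, a minimal sketch of the naive-vs-stable comparison: with a large logit, a naive softmax overflows to inf/inf = nan, while the shifted version stays finite.

```python
import numpy as np

def naive_softmax(logits):
    exps = np.exp(logits)          # overflows to inf for large logits
    return exps / np.sum(exps)

def stable_softmax(logits):
    exps = np.exp(logits - np.max(logits))  # shift so the largest exponent is 0
    return exps / np.sum(exps)

p = np.array([0.0, 0.0, 1.0])        # true class is the one with the largest logit
logits = np.array([10.0, 20.0, 1000.0])

with np.errstate(over='ignore', invalid='ignore'):
    naive_ce = -np.sum(p * np.log(naive_softmax(logits)))
stable_ce = -np.sum(p * np.log(np.clip(stable_softmax(logits), 1e-12, 1.0)))

print("Naive cross-entropy:", naive_ce)    # nan: exp(1000) overflows to inf
print("Stable cross-entropy:", stable_ce)  # ~0: the model is certain and correct
```

This is the pitfall that built-in losses like PyTorch's CrossEntropyLoss avoid by fusing log-softmax into the loss instead of exponentiating raw logits.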

Going Further


Final Project

Description:

Outline:

  1. Load MNIST via PyTorch datasets or TensorFlow Datasets.
  2. Set up a simple MLP or CNN model.
  3. Train it with cross-entropy loss.
  4. Plot training and validation cross-entropy and accuracy.
  5. Visualize sample predictions and show how the cross-entropy of those predictions changes as you train more epochs.
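A minimal sketch of steps 2 and 3 in PyTorch: the layer sizes, optimizer, and learning rate here are illustrative choices, and the MNIST loading from step 1 is replaced by a random stand-in batch so the snippet runs on its own (any (batch, 1, 28, 28) tensor works).

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Illustrative MLP for 28x28 grayscale digits (sizes are arbitrary choices)
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)
loss_fn = nn.CrossEntropyLoss()  # expects raw logits; applies log-softmax internally
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# One training step on a stand-in batch; swap in real MNIST batches from step 1
images = torch.randn(32, 1, 28, 28)
labels = torch.randint(0, 10, (32,))

optimizer.zero_grad()
logits = model(images)
loss = loss_fn(logits, labels)
loss.backward()
optimizer.step()
print("Batch loss:", loss.item())
```

For the full project, wrap this step in an epoch loop over a DataLoader and record the loss and accuracy on both the training and validation splits after each epoch for the plots in step 4.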