Introduction

When people hear the term Neural Networks, it often sounds mysterious. Almost magical.

But under the hood, neural networks are not magic.

They are simply:

numbers

matrix multiplications

weighted sums

error calculations

repeated adjustments

This document explains neural networks from the ground up — the exact same ideas covered in the presentation — but in a more detailed and intuitive way.

By the end, you should understand:

what a neuron actually does

how neural networks learn

why activation functions are necessary

what loss functions are

how backpropagation works

why gradient descent matters

and how these simple ideas scale all the way to systems like GPT-4

From Rules to Patterns

Traditional Programming

Traditional programming works like this:

Input + Human Written Rules → Output

Example:

if temperature > 30:
    print("Hot")

This works well for simple systems.

But problems begin when tasks become too complex.

For example:

detecting cats in images

recognizing speech

generating human language

driving a car

understanding emotions

Humans cannot manually write rules for every possible situation.

Why?

Because the number of possibilities becomes enormous.

Traditional systems are also rigid.

If the input slightly differs from expected patterns, the system fails.

Machine Learning

Machine Learning changes the approach.

Instead of manually writing rules:

Data + Correct Answers → Learn Patterns Automatically

The system discovers mathematical relationships hidden inside data.

This is the central idea behind neural networks.

The real goal of machine learning is:

Discover patterns automatically from data.

The Simplest Prediction Problem

Let’s begin with a simple example.

Suppose we have house prices.

Humans immediately notice a pattern.

Every additional:

500 sq ft → +$100,000

Which means:

$200 per sq ft

So we can estimate:

2500 sq ft → $500,000

But how does a machine learn this automatically?

That is the core question neural networks try to solve.
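One simple way a machine could recover the $200-per-sq-ft rule is a least-squares fit. The sketch below assumes three hypothetical listings that follow the pattern described above (the sizes and prices are illustrative, not data from the presentation) and fits the model price = k × size.

```python
import numpy as np

# Hypothetical listings consistent with "+$100,000 per extra 500 sq ft"
sizes = np.array([1000.0, 1500.0, 2000.0])            # sq ft
prices = np.array([200_000.0, 300_000.0, 400_000.0])  # dollars

# Least-squares slope for the model price = k * size (no intercept)
k = np.sum(sizes * prices) / np.sum(sizes ** 2)

predicted_2500 = k * 2500
print(f"Learned price per sq ft: ${k:.2f}")
print(f"Estimated price for 2500 sq ft: ${predicted_2500:,.0f}")
```

A neural network reaches the same kind of answer by a different route: instead of solving for k in one shot, it adjusts k step by step.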

Machine Learning in One Sentence

Training is simply:

Guess → Measure Error → Correct → Repeat

That’s it.

Everything in deep learning eventually reduces to this loop.
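The loop can be sketched in a few lines. This is a toy illustration, assuming hypothetical house data (sizes in thousands of sq ft, prices in thousands of dollars), a single parameter k, and a hand-picked learning rate of 0.1.

```python
import numpy as np

# Hypothetical data: size in thousands of sq ft, price in thousands of dollars
sizes = np.array([1.0, 1.5, 2.0])
prices = np.array([200.0, 300.0, 400.0])

k = 0.0              # initial guess for the price per sq ft
learning_rate = 0.1  # assumed step size

for step in range(100):
    predictions = k * sizes                 # Guess
    errors = predictions - prices           # Measure error
    gradient = np.mean(2 * errors * sizes)  # slope of the mean squared error wrt k
    k -= learning_rate * gradient           # Correct, then repeat

print(f"Learned k: {k:.4f}")
```

Each pass shrinks the remaining error, so k drifts toward the value that fits the data.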

The Training Process

The Neuron — A Tiny Decision Maker

The fundamental building block of neural networks is the neuron. A neuron is NOT a real brain cell. It's just a mathematical function that takes numbers in, performs some basic arithmetic, and outputs a signal.

Inputs

Inputs are numerical representations of features.

Examples: house size, pixel brightness, age, salary, temperature, word embeddings

Suppose:

Weights and Summation

Every input has a weight. A weight is an importance score: it tells the model how much a particular input should influence the output. Weights are not limited to the range 0 to 1. They can be any real number, and a negative weight means the input pushes the output down instead of up.

If a weight is larger (in magnitude), that input has more influence on the result.

The neuron computes a weighted sum.

Mathematically (summation):

z = Σ (wᵢ × xᵢ) + b

The sigma symbol (Σ) means: sum everything together.

The bias, denoted by ‘b’ in the formula, is an additional adjustable parameter: an offset that lets the neuron shift its decision boundary.

Without bias:

the model becomes too restricted

outputs are forced through the origin

Bias gives flexibility.

After computing the weighted sum and adding the bias, we apply an activation function.

Formula:

output = f(Σ (wᵢ × xᵢ) + b)

The activation function f decides how strongly the neuron fires for a given weighted sum.

But what does that mean?
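Before answering that, it helps to see the arithmetic of one neuron in code. The following is a minimal sketch, assuming two made-up inputs, made-up weights and bias, and sigmoid as the activation function (any activation could be swapped in).

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical values chosen only for illustration
inputs = np.array([0.5, 0.2])
weights = np.array([0.4, -0.6])   # a negative weight pushes the output down
bias = 0.1

z = np.dot(weights, inputs) + bias   # weighted sum plus bias
output = sigmoid(z)                  # activation

print(f"weighted sum z = {z:.2f}")
print(f"neuron output  = {output:.4f}")
```

Everything a neuron does is in those two lines: one weighted sum, one activation.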

Why Activation Functions Matter

This is one of the most important ideas in deep learning.

Before understanding activation functions, let's first understand the difference between linear and non-linear data.

Linear Example

Suppose we have this dataset:

This relationship is linear.

Why?

Because as:

house size increases → price also increases proportionally

A simple straight-line equation can model this perfectly.

Something like:

price = k × size

works very well here.

This kind of data is easy for simple linear models.

Non-Linear Example

Now consider this dataset:

Now the relationship becomes much more complicated.

Notice:

house size remains SAME

but prices change drastically

Why?

Because now:

metro access matters

luxury locality matters

multiple features interact together

This relationship is NOT linear anymore.

A simple straight line cannot model this properly.

And this is exactly why we need:

non-linearity

Without non-linearity, neural networks would fail on real-world data.

Now, suppose we stack multiple layers WITHOUT activation functions. Take two of them.

The first layer computes:

y = W₁x + b₁

The second layer takes y as its input:

z = W₂y + b₂

Substitute y:

z = W₂(W₁x + b₁) + b₂

Expand:

z = (W₂W₁)x + (W₂b₁ + b₂)

This simplifies into:

z = W'x + b'

Meaning:

Multiple linear layers collapse into a single linear layer.

Even if you stack:

10 layers

100 layers

1000 layers

without activation functions, the network is still mathematically equivalent to ONE linear transformation.

This is called: Mathematical Collapse
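The collapse is easy to check numerically. The sketch below uses random matrices with arbitrary shapes (an assumption made only for the demonstration) and compares the two-layer result against the single combined layer W' = W₂W₁, b' = W₂b₁ + b₂.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two linear layers with no activation in between (shapes are arbitrary)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)

# Pass x through both layers, one after the other
y = W1 @ x + b1
z_two_layers = W2 @ y + b2

# Collapse both layers into one: W' = W2 W1, b' = W2 b1 + b2
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2
z_one_layer = W_prime @ x + b_prime

print(np.allclose(z_two_layers, z_one_layer))
```

The two results agree up to floating-point rounding, whatever input x is used.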

The Non-Linear Solution

Activation functions solve this problem.

They introduce:

non-linearity

Because of this:

layers no longer collapse

deep networks become meaningful

complex functions can be learned

Another Intuition for Non-Linearity

Suppose you have a graph with:

Something like this:

Without activation functions:

neural networks can only create straight-line boundaries

they fail on complex real-world patterns

With activation functions:

networks can create curved decision boundaries

learn complex patterns

separate complicated data distributions

So mathematically, after the summation step, we pass the output into an activation function.

That activation function introduces non-linearity into the network.

You can think of it as adding:

bends
curves
non-linear behavior

into the mathematical representation.

Without activation functions, every layer would only perform:

multiplication + summation

and the entire network would eventually collapse into just one big linear equation.

But because every layer now passes its output through an activation function:

Layer Output → Activation Function → Next Layer

each layer applies a genuinely different, non-linear transformation.

Because a non-linear function sits between every pair of layers, the weight matrices can no longer be multiplied together into one, so the layers do not collapse into a single linear transformation.
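A small example shows what a single activation buys. The sketch below assumes hand-picked weights (not learned ones) and uses ReLU between two tiny layers. The resulting network computes the absolute value |x|, a V-shaped function that no single linear layer can produce.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# Hand-picked weights (an illustrative assumption, not learned values)
W1 = np.array([[1.0], [-1.0]])   # 1 input -> 2 hidden neurons
W2 = np.array([[1.0, 1.0]])      # 2 hidden neurons -> 1 output

def tiny_network(x):
    hidden = relu(W1 @ np.array([x]))   # activation between the layers
    return (W2 @ hidden)[0]

for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(x, tiny_network(x))
```

Remove the relu call and the same weights give an output of 0 for every input, because the two hidden neurons cancel each other out.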

Universal Approximation

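A neural network with non-linear activations and enough hidden neurons can approximate:

Any continuous function (on a bounded input range, to any desired accuracy).

This is an incredibly important result. Note that it says such a network exists, not that training will always find it.

It means neural networks can, in principle, model: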

speech

images

language

physics simulations

medical diagnosis

human behavior

and much more

Types of Activation Functions:

Linear Activation Function

Non-Linear Activation Function

Modern architectures often use smoother versions of ReLU.

Examples: GELU and SiLU (also called Swish).
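For reference, here is a minimal NumPy sketch of ReLU next to two common smooth variants, SiLU (also called Swish) and GELU. The GELU shown is the widely used tanh approximation, not the exact definition, and the sample inputs are arbitrary.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def silu(x):
    # SiLU (Swish): x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def gelu(x):
    # Tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

xs = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print("relu:", relu(xs))
print("silu:", np.round(silu(xs), 4))
print("gelu:", np.round(gelu(xs), 4))
```

All three behave almost identically for large positive inputs. The smooth variants differ near zero, where they curve instead of bending sharply.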

The Loss Function

The loss function measures:

How wrong the model is.

Training goal: Minimize loss

Lower loss means:

predictions are improving

model is learning patterns correctly

Mean Squared Error (MSE)

A very common regression loss.

Formula:

L = (1/n) Σ(prediction - target)^2

Key idea:

(prediction - target)^2

The square does two important things:

Makes errors positive

Punishes large mistakes more heavily

Suppose:

Predicted = $350,000
Actual = $400,000

Error:

$50,000

Squared error:

2,500,000,000

Large mistakes become very expensive.

This pushes the model to reduce big errors aggressively.
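The house example can be reproduced directly. This is a small sketch with a helper named mse (a name chosen here for illustration). It also shows that doubling the error quadruples the loss.

```python
import numpy as np

def mse(predictions, targets):
    """Mean squared error between two equally sized sequences."""
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return np.mean((predictions - targets) ** 2)

# The example above: predicted $350,000, actual $400,000
single = mse([350_000], [400_000])
print(f"{single:,.0f}")

# Doubling the error to $100,000 quadruples the loss
doubled = mse([300_000], [400_000])
print(doubled / single)
```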

Backpropagation — Assigning Blame

The Chain of Responsibility

The Question: "The prediction was wrong. Which specific weights are responsible for this error?"

The Flow: Error signals travel backward from the output layer, through the hidden layers, toward the input.

Adjustment: Each weight is adjusted in proportion to its contribution to the final mistake.

Error begins at the output layer.

Then the signal travels backward through the network.

Every weight receives feedback about how much it contributed to the error.

Weights responsible for larger mistakes get adjusted more.

Weights responsible for smaller mistakes get adjusted less.

This is how learning happens.

Understanding Gradients and Backpropagation: The Cooking Analogy

What is a Gradient?

In deep learning, a gradient answers a simple question:

If I tweak this specific weight a little bit, how will the final loss change?

Mathematically, we write this as:

∂L / ∂W

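This notation simply means: the partial derivative of the loss L with respect to the weight W, that is, how much L changes when only W changes by a tiny amount.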

The Intuition — Who is Responsible?

Imagine your neural network has a specific weight set to:

The Forward Pass: The network makes a prediction.

The Outcome: The prediction is wrong, resulting in a high loss.

Now we ask:

Was this specific weight responsible for the mistake?

During backpropagation, the network checks the sensitivity of that weight.

If increasing the weight increases the loss: gradient is positive

If increasing the weight decreases the loss: gradient is negative

If changing the weight barely changes the loss: gradient is close to zero

A gradient measures: sensitivity

It tells the network exactly how sensitive the loss is to changes in a specific weight.
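Sensitivity can be measured directly by nudging a weight and watching the loss. The sketch below assumes a toy one-weight model (prediction = w × x, with made-up x and target) and estimates the gradient with a central finite difference. Real frameworks compute gradients analytically with backpropagation, not this way.

```python
def loss(w, x=2.0, target=4.0):
    """Squared error of a one-weight model: prediction = w * x."""
    return (w * x - target) ** 2

def numerical_gradient(f, w, eps=1e-5):
    """Central finite-difference approximation of df/dw."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)

# w = 3.0 overshoots the target, w = 1.0 undershoots, w = 2.0 is exactly right
for w in (3.0, 1.0, 2.0):
    print(w, numerical_gradient(loss, w))
```

The three weights give a positive, a negative, and a near-zero gradient, matching the three cases listed above.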

Imagine you are cooking a soup.

You taste the final dish and it tastes terrible.

In AI terms:

your loss score is high

Now you need to diagnose the problem.

You ask yourself:

Did I add too much salt?

Is there too much chili?

Did I add too little sugar?

Backpropagation is the mental process of looking backward at the recipe to figure out:

which ingredient caused the bad taste and by how much

The gradient is the exact numerical impact assigned to each ingredient.

Example:

Salt Contribution  = +8 badness
Chili Contribution = +2 badness
Sugar Contribution = 0 badness

Interpretation:

High positive gradient  → way too much salt
Low positive gradient   → slightly too spicy
Zero gradient           → sugar didn't cause the issue

Once the network knows the gradients:

(the impact numbers)

training becomes the process of adjusting the recipe.

So next time:

reduce the salt

slightly reduce the chili

leave the sugar unchanged

Repeat this process again and again until the soup tastes perfect.

This is exactly how neural networks learn.

What Backpropagation Actually Does

During training:

Step 1: Forward pass

The input goes through the network and generates a prediction.

Step 2: Compute loss

The loss function compares the predicted value with the actual value.

Step 3: Backpropagation

Now we move backward through the network. Using the chain rule from calculus, backprop computes the gradient ∂L / ∂W for every weight and bias.
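The three steps can be traced by hand for a single sigmoid neuron. This is a minimal sketch, assuming one made-up training example and made-up starting parameters. Each line of the backward pass is one factor of the chain rule.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical single training example and parameters
x, target = 1.5, 1.0
w, b = 0.4, -0.2

# Step 1: forward pass
z = w * x + b
a = sigmoid(z)

# Step 2: loss (squared error)
loss = (a - target) ** 2

# Step 3: backward pass, one chain-rule factor per forward step
dL_da = 2 * (a - target)   # how the loss changes with the activation
da_dz = a * (1 - a)        # sigmoid derivative
dz_dw = x                  # how the weighted sum changes with the weight
dz_db = 1.0                # how the weighted sum changes with the bias

dL_dw = dL_da * da_dz * dz_dw
dL_db = dL_da * da_dz * dz_db
print(dL_dw, dL_db)
```

Both gradients come out negative here, which says that increasing w or b would reduce the loss for this example.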

What a Gradient Value Means

Suppose ∂L / ∂W = 5 → high positive gradient
• What it means: A tiny increase in this weight increases the loss significantly.
• The Action: The optimizer should decrease this weight strongly to reduce the loss.

Suppose ∂L / ∂W = -3 → negative gradient
• What it means: Increasing this weight will reduce the loss.
• The Action: The optimizer should increase this weight to drive the loss down.

Learning Rate

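The learning rate controls:

How big each update step should be

It controls how quickly the model learns during training.

It determines the size of the step taken to minimize the loss function.

After backpropagation computes the gradients (how much each weight contributed to the error), the optimizer updates each weight like this:

new weight = old weight - (learning rate × gradient)

Imagine you're trying to reach the bottom of a valley.

If the learning rate is too small: you take tiny steps, and training is very slow.

If the learning rate is too large: you overshoot the bottom, and the loss oscillates or even explodes.

Ideal learning rate: large enough to make steady progress, small enough not to overshoot.

Ex: current weight = 0.8, gradient = 0.2, and learning rate = 0.1

new weight = 0.8 - (0.1 × 0.2) = 0.78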
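The update rule and the effect of the step size can both be checked in a few lines. The sketch below first applies the single update from the example above, then runs gradient descent on the toy loss L(w) = w² (an assumption chosen because its gradient, 2w, is easy to verify) with three different learning rates.

```python
# The single update from the example above
weight, gradient, learning_rate = 0.8, 0.2, 0.1
new_weight = weight - learning_rate * gradient
print(round(new_weight, 4))

def descend(learning_rate, steps=20, start=1.0):
    """Gradient descent on the toy loss L(w) = w**2, whose gradient is 2*w."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2 * w
    return w

# Too small, reasonable, and too large (the minimum is at w = 0)
for lr in (0.01, 0.1, 1.1):
    print(lr, descend(lr))
```

With the smallest rate the weight has barely moved after 20 steps, with the middle rate it is close to the minimum, and with the largest rate each step overshoots further than the last.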

Gradient Descent — Finding the Valley

Gradient descent is the optimization strategy.

Goal:

Reduce loss

The Gradient

Gradient means:

slope

It tells the model:

Which direction decreases loss?

Step Downhill

The process:

calculate gradient

move slightly downhill

reduce loss

repeat

Eventually the model reaches a low-error region.

This process is called:

Gradient Descent

Global Minimum

The ideal destination is:

Global Minimum

This is the point where error is as low as possible.

In practice:

loss surfaces are huge

many local minima exist

optimization is difficult

Yet gradient descent works surprisingly well.

The Complete Training Loop

Everything comes together in four major steps.

1. Forward Pass

Data moves through the network.

Inputs become predictions.

Example:

Input → Hidden Layer → Output

2. Calculate Loss

Prediction is compared with actual answer.

Loss measures error.

3. Backpropagate

Error flows backward.

The network computes:

Which weights caused the mistake?

4. Update Weights

Weights are adjusted using:

Gradient Descent

Goal:

Reduce future error

Scaling to Massive Models

The same exact principles scale from tiny toy networks to massive AI systems.

XOR Network

A tiny XOR neural network may have:

~20 parameters

Yet it still uses:

forward pass

loss calculation

backpropagation

weight updates

GPT-4 Scale

Modern large language models operate with:

hundreds of billions of parameters or more (the exact size of GPT-4 has not been published)

But fundamentally:

the learning loop is still the same.

Even massive AI systems are still doing:

Forward Pass
→ Loss
→ Backpropagation
→ Weight Updates

just at enormous scale.

Key Takeaways

1. Neural Networks are Simple Building Blocks

Neural networks are layers of neurons performing weighted sums.

The complexity emerges from scale and composition.

2. Training is Iterative Improvement

Training is simply:

Guess
→ Measure Error
→ Correct
→ Repeat

3. Backpropagation Assigns Responsibility

Backpropagation determines:

Which weights caused the error?

and adjusts them accordingly.

4. Gradient Descent Optimizes Learning

Gradient descent continuously moves weights toward lower loss.

This is the optimization engine behind deep learning.

Final Intuition

A neural network is fundamentally:

A giant mathematical function approximator.

It learns:

Input → Output relationships

from examples.

Whether the task is:

predicting house prices

recognizing faces

translating languages

generating code

creating images

or powering GPT models

The core principles remain exactly the same.

From Theory to Practice — Building a Neural Network for XOR

At this point, we understand the theory behind neural networks:

neurons

weights

activation functions

loss

backpropagation

gradient descent

But understanding theory alone is not enough.

Now comes the fun part:

Building an actual neural network from scratch.

And for that, we will use one of the most famous problems in deep learning:

The XOR Problem

At first glance, XOR looks simple.

But this tiny problem changed the history of neural networks.

Why?

Because:

A single linear neuron cannot solve XOR.

If we try to separate the outputs using one straight line:

it fails

The network needs:

hidden layers + non-linearity

This makes XOR the perfect beginner problem for understanding:

why hidden layers exist

why activation functions matter

how neural networks learn complex patterns
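One way to see the failure concretely is to ask for the best possible linear fit. The sketch below uses NumPy's least-squares solver on the four XOR points. It is an illustration of how badly a linear model does, not a formal proof that no separating line exists.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Add a column of ones so the fit can also learn a bias
X_with_bias = np.hstack([X, np.ones((4, 1))])

# Best linear fit in the least-squares sense
params, *_ = np.linalg.lstsq(X_with_bias, y, rcond=None)
predictions = X_with_bias @ params

print("weights and bias:", np.round(params, 4))
print("predictions:", np.round(predictions, 4))
```

The best a linear model can do is ignore both inputs and predict the average of the targets for every point, which is no better than a coin flip.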

What We'll Build

We'll build a neural network using only NumPy (basic math) and train it to solve a problem that a single neuron *cannot* solve.

What We'll Cover

1. The XOR Problem — Why we need hidden layers

2. Building a Neural Network — Forward pass from scratch

3. The Training Loop — Loss, backprop, weight updates

4. Watching It Learn — Visualizing training

5. Breaking It — What happens with bad hyperparameters

Part 1: The XOR Problem

Why XOR?

XOR (exclusive or) is a simple logical operation:

If inputs are different → output 1

If inputs are the same → output 0

import numpy as np
import matplotlib.pyplot as plt

# Our training data: XOR
X = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1]
])

y = np.array([
    [0],
    [1],
    [1],
    [0]
])

print("XOR Dataset:")
print("-" * 30)
for i in range(len(X)):
    print(f"Input: {X[i]} → Output: {y[i][0]}")

Visualizing the Problem (optional)

Let's plot the XOR data. You'll see why a straight line can't separate the classes.

plt.figure(figsize=(8, 6))

# Plot the points
for i in range(len(X)):
    color = 'red' if y[i][0] == 0 else 'blue'
    marker = 'o' if y[i][0] == 0 else 's'
    plt.scatter(X[i][0], X[i][1], c=color, s=200, marker=marker,
                edgecolors='black', linewidths=2)

plt.xlabel('Input A', fontsize=12)
plt.ylabel('Input B', fontsize=12)
plt.title('XOR Problem: Can you draw a single straight line to separate red from blue?',
          fontsize=12)
plt.xlim(-0.5, 1.5)
plt.ylim(-0.5, 1.5)
plt.grid(True, alpha=0.3)
plt.legend(['Class 0 (same)', 'Class 1 (different)'], loc='upper right')
plt.show()

print("\n❌ A single straight line CANNOT separate these classes.")
print("✅ This is why we need hidden layers — they create non-linear boundaries.")

Part 2: Building the Neural Network

Our Architecture

Input Layer (2 neurons) → Hidden Layer (4 neurons) → Output Layer (1 neuron)

Input: 2 values (the two XOR inputs)

Hidden: 4 neurons with sigmoid activation

Output: 1 neuron with sigmoid activation (gives us 0-1 probability)

Why Sigmoid?

For this educational example, we use sigmoid everywhere because:

Output is naturally between 0 and 1 (matches our target)

The math is clean and easy to follow

It's historically important

In practice, you'd use ReLU for hidden layers. But sigmoid helps us see what’s happening.

# Network architecture
INPUT_SIZE = 2
HIDDEN_SIZE = 4
OUTPUT_SIZE = 1

# Weights from input to hidden layer
weights_input_hidden = np.random.randn(INPUT_SIZE, HIDDEN_SIZE) * 0.5
bias_hidden = np.zeros((1, HIDDEN_SIZE))

# Weights from hidden to output layer
weights_hidden_output = np.random.randn(HIDDEN_SIZE, OUTPUT_SIZE) * 0.5
bias_output = np.zeros((1, OUTPUT_SIZE))

print("Network initialized with random weights:")
print(f"  Input → Hidden weights shape: {weights_input_hidden.shape}")
print(f"  Hidden → Output weights shape: {weights_hidden_output.shape}")
print(f"bias (input->hidden) + bias(hidden->output): {bias_hidden.size}, { bias_output.size}")
print(f"\nTotal parameters: {weights_input_hidden.size }+{ bias_hidden.size }+{ weights_hidden_output.size }+{ bias_output.size}={weights_input_hidden.size + bias_hidden.size + weights_hidden_output.size + bias_output.size}")

The Activation Function: Sigmoid

Sigmoid squashes any number into the range (0, 1):

Large positive numbers → close to 1

Large negative numbers → close to 0

Zero → exactly 0.5

We also need its derivative for backpropagation.

Sigmoid Function

σ(x) = 1 / (1 + e^(-x))

def sigmoid(x):
    """Squash values to range (0, 1)"""
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    """Derivative of sigmoid: σ(x) * (1 - σ(x))"""
    s = sigmoid(x)
    return s * (1 - s)

# Visualize sigmoid
x_range = np.linspace(-6, 6, 100)
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(x_range, sigmoid(x_range), 'b-', linewidth=2)
plt.axhline(y=0.5, color='gray', linestyle='--', alpha=0.5)
plt.axvline(x=0, color='gray', linestyle='--', alpha=0.5)
plt.xlabel('Input')
plt.ylabel('Output')
plt.title('Sigmoid Function: σ(x) = 1/(1+e⁻ˣ)')
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.plot(x_range, sigmoid_derivative(x_range), 'r-', linewidth=2)
plt.axhline(y=0.25, color='gray', linestyle='--', alpha=0.5, label='max = 0.25')
plt.xlabel('Input')
plt.ylabel('Derivative')
plt.title('Sigmoid Derivative (max value = 0.25)')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nNotice: The maximum derivative is only 0.25!")
print("   Multiplying many such factors is a key cause of the vanishing gradient problem.")
print("   10 layers: 0.25^10 =", 0.25**10)

Forward Pass

The forward pass is how data flows through the network:

Input → Hidden: Multiply inputs by weights, add bias, apply activation

Hidden → Output: Multiply hidden by weights, add bias, apply activation

Let's trace through exactly what happens.

def forward(X):
    """
    Forward pass through the network.
    Returns all intermediate values (we need them for backprop).
    """

    # Step 1: Input to Hidden
    z_hidden = np.dot(X, weights_input_hidden) + bias_hidden
    a_hidden = sigmoid(z_hidden)

    # Step 2: Hidden to Output
    z_output = np.dot(a_hidden, weights_hidden_output) + bias_output
    a_output = sigmoid(z_output)

    return z_hidden, a_hidden, z_output, a_output

# Test forward pass with untrained network
z_h, a_h, z_o, predictions = forward(X)

print("Forward pass with UNTRAINED network:")
print("-" * 50)
for i in range(len(X)):
    print(f"Input: {X[i]} → Prediction: {predictions[i][0]:.4f} (Target: {y[i][0]})")

print("\n❌ Predictions are garbage — the network hasn't learned anything yet.")

Loss Function: Mean Squared Error

Loss measures how wrong our predictions are. Lower = better.

MSE = mean((prediction - target)^2)

We square the error so:

all errors are positive

big errors are penalized more than small errors

def compute_loss(y_true, y_pred):
    """Mean Squared Error"""
    return np.mean((y_true - y_pred) ** 2)

# Calculate initial loss
initial_loss = compute_loss(y, predictions)
print(f"Initial Loss (untrained): {initial_loss:.4f}")
print("\nThis number should decrease as we train.")

Now we need to adjust the weights and biases so that this loss becomes as small as possible (close to 0). To do that, we must identify how much each parameter contributed to the loss, and that is exactly what backpropagation does.

Part 3: Backpropagation

This is where the magic happens.

Backprop answers:

"Which weights caused the error, and how much?"

The Chain of Blame

Calculate error at output

Figure out how much each output weight contributed

Propagate error back to hidden layer

Figure out how much each hidden weight contributed

Adjust all weights proportionally

The math uses the chain rule from calculus, but the intuition is simple: blame flows backward

def backward(X, y, z_hidden, a_hidden, z_output, a_output, learning_rate):
    """
    Backpropagation: compute gradients and update weights.
    """
    global weights_input_hidden, bias_hidden, weights_hidden_output, bias_output

    m = X.shape[0]

    # OUTPUT LAYER
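    # Error at output: difference between prediction and target
    # (the constant factor 2 from differentiating the squared error is
    # dropped here; it only rescales the learning rate)
    output_error = a_output - y
    # Gradient of loss wrt z_output (before activation), up to that constant
    output_delta = output_error * sigmoid_derivative(z_output)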

    # Gradient of loss wrt weights_hidden_output
    # How much did each weight contribute to the error?
    grad_weights_hidden_output = np.dot(a_hidden.T, output_delta) / m
    grad_bias_output = np.mean(output_delta, axis=0, keepdims=True)

    # HIDDEN LAYER
    # Propagate error back to hidden layer
    hidden_error = np.dot(output_delta, weights_hidden_output.T)
    # Gradient of loss wrt z_hidden
    hidden_delta = hidden_error * sigmoid_derivative(z_hidden)

    # Gradient of loss wrt weights_input_hidden
    grad_weights_input_hidden = np.dot(X.T, hidden_delta) / m
    grad_bias_hidden = np.mean(hidden_delta, axis=0, keepdims=True)

    # UPDATE WEIGHTS
    # Move weights in the opposite direction of the gradient
    # (gradient points uphill, we want to go downhill)
    weights_hidden_output -= learning_rate * grad_weights_hidden_output
    bias_output -= learning_rate * grad_bias_output
    weights_input_hidden -= learning_rate * grad_weights_input_hidden
    bias_hidden -= learning_rate * grad_bias_hidden

print("Backpropagation function defined.")
print("This is the learning part — adjusting weights to reduce error.")
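A standard sanity check for hand-written backprop is to compare it against a numerical gradient. The sketch below is self-contained and rebuilds a 2-4-1 sigmoid network with its own random weights (the helper forward_loss and the seed are assumptions made for this check). It tests only the hidden-to-output weights, using the same formula as backward().

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def forward_loss(params, X, y):
    """Forward pass plus MSE for a 2-4-1 sigmoid network."""
    W1, b1, W2, b2 = params
    a_hidden = sigmoid(X @ W1 + b1)
    a_output = sigmoid(a_hidden @ W2 + b2)
    return np.mean((a_output - y) ** 2), a_hidden, a_output

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
params = [rng.normal(size=(2, 4)) * 0.5, np.zeros((1, 4)),
          rng.normal(size=(4, 1)) * 0.5, np.zeros((1, 1))]

# Analytic gradient for the hidden-to-output weights, same form as backward()
_, a_hidden, a_output = forward_loss(params, X, y)
output_delta = (a_output - y) * a_output * (1 - a_output)
analytic = a_hidden.T @ output_delta / X.shape[0]

# Numerical gradient: nudge each weight up and down, watch the loss
numeric = np.zeros_like(params[2])
eps = 1e-6
for i in range(params[2].shape[0]):
    original = params[2][i, 0]
    params[2][i, 0] = original + eps
    loss_plus = forward_loss(params, X, y)[0]
    params[2][i, 0] = original - eps
    loss_minus = forward_loss(params, X, y)[0]
    params[2][i, 0] = original
    numeric[i, 0] = (loss_plus - loss_minus) / (2 * eps)

# The true MSE gradient carries a factor of 2 that backward() leaves out
print(np.allclose(numeric, 2 * analytic, atol=1e-7))
```

The match confirms the direction of the update is right. The missing factor of 2 only changes the effective learning rate.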

Why Multiply?

output_delta = output_error × sigmoid_derivative(z_output)

This combines:

(How wrong are we?) × (How sensitive is the neuron?)

Think About It Like This

Imagine:

error = 100

BUT neuron sensitivity:

0.00001

Then:

delta = 100 × 0.00001 = 0.001

Even though the error is huge, the neuron cannot change its output much here.

So the update should stay small.

Another Example

Suppose:

output_error = 0.8

and:

sigmoid_derivative = 0.25

Then:

output_delta = 0.8 × 0.25 = 0.2

This final value:

0.2

is the actual signal used to update weights.
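Both situations can be reproduced with the sigmoid derivative itself. In this sketch the pre-activation values z = 0 and z = 12 are assumptions chosen to represent a sensitive neuron and a saturated one.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

output_error = 0.8

# Neuron in its sensitive region (z = 0): the derivative is at its maximum, 0.25
delta_sensitive = output_error * sigmoid_derivative(0.0)

# Saturated neuron (z = 12 is an arbitrary large value): the derivative is tiny
delta_saturated = output_error * sigmoid_derivative(12.0)

print(delta_sensitive, delta_saturated)
```

The error is the same in both cases, yet the saturated neuron passes back a signal that is several orders of magnitude smaller.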

Part 4: The Training Loop

Now we put it all together:

for each iteration:
    1. Forward pass → get predictions
    2. Calculate loss → how wrong are we?
    3. Backward pass → compute gradients, update weights

Let's train for 10,000 iterations and watch the loss decrease.

# Reset weights
np.random.seed(42)
weights_input_hidden = np.random.randn(INPUT_SIZE, HIDDEN_SIZE) * 0.5
bias_hidden = np.zeros((1, HIDDEN_SIZE))
weights_hidden_output = np.random.randn(HIDDEN_SIZE, OUTPUT_SIZE) * 0.5
bias_output = np.zeros((1, OUTPUT_SIZE))

# Hyperparameters
learning_rate = 2.0
iterations = 10000

# Track loss over time
loss_history = []

print("Training started...")
print("-" * 50)

for i in range(iterations):
    z_h, a_h, z_o, predictions = forward(X)

    loss = compute_loss(y, predictions)
    loss_history.append(loss)

    backward(X, y, z_h, a_h, z_o, predictions, learning_rate)

    if i % 2000 == 0:
        print(f"Iteration {i:5d} | Loss: {loss:.6f}")

Let's See the Results!

# Final predictions
_, _, _, final_predictions = forward(X)

print("Final Results After Training:")
print("-" * 50)
print(f"{'Input':<12} {'Target':<10} {'Prediction':<12} {'Rounded':<10}")
print("-" * 50)

for i in range(len(X)):
    pred = final_predictions[i][0]
    rounded = round(pred)
    status = "✅" if rounded == y[i][0] else "❌"
    print(f"{str(X[i]):<12} {y[i][0]:<10} {pred:<12.4f} {rounded:<10} {status}")

print("-" * 50)
print("\n🎉 The network learned XOR from random weights!")

# Plot the loss curve
plt.figure(figsize=(10, 5))
plt.plot(loss_history, 'b-', linewidth=0.5)
plt.xlabel('Iteration', fontsize=12)
plt.ylabel('Loss (MSE)', fontsize=12)
plt.title('Training Loss Over Time', fontsize=14)
plt.grid(True, alpha=0.3)
plt.show()

print("The loss started high (random guessing) and decreased (learning).")

Part 5: Breaking It (Experiments)

Understanding what breaks a network teaches you more than seeing it work.

Experiment 1: Learning Rate Too High

# Reset and train with learning rate = 100
np.random.seed(42)
weights_input_hidden = np.random.randn(INPUT_SIZE, HIDDEN_SIZE) * 0.5
bias_hidden = np.zeros((1, HIDDEN_SIZE))
weights_hidden_output = np.random.randn(HIDDEN_SIZE, OUTPUT_SIZE) * 0.5
bias_output = np.zeros((1, OUTPUT_SIZE))

lr_high = 100.0
loss_high_lr = []

for i in range(1000):
    z_h, a_h, z_o, pred = forward(X)
    loss_high_lr.append(compute_loss(y, pred))
    backward(X, y, z_h, a_h, z_o, pred, lr_high)

plt.figure(figsize=(10, 4))
plt.plot(loss_high_lr, 'r-', linewidth=1)
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.title(f'Learning Rate = {lr_high} (TOO HIGH) — Loss explodes or oscillates')
plt.grid(True, alpha=0.3)
plt.show()

Experiment 2: Learning Rate Too Low

# Reset and train with learning rate = 0.001
np.random.seed(42)
weights_input_hidden = np.random.randn(INPUT_SIZE, HIDDEN_SIZE) * 0.5
bias_hidden = np.zeros((1, HIDDEN_SIZE))
weights_hidden_output = np.random.randn(HIDDEN_SIZE, OUTPUT_SIZE) * 0.5
bias_output = np.zeros((1, OUTPUT_SIZE))

lr_low = 0.001
loss_low_lr = []

for i in range(10000):
    z_h, a_h, z_o, pred = forward(X)
    loss_low_lr.append(compute_loss(y, pred))
    backward(X, y, z_h, a_h, z_o, pred, lr_low)

plt.figure(figsize=(10, 4))
plt.plot(loss_low_lr, 'orange', linewidth=1)
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.title(f'Learning Rate = {lr_low} (TOO LOW) — Barely moves after 10,000 iterations')
plt.grid(True, alpha=0.3)
plt.show()

Experiment 3: Not Enough Hidden Neurons

# Try with only 2 hidden neurons
np.random.seed(42)
w_ih_small = np.random.randn(2, 2) * 0.5
b_h_small = np.zeros((1, 2))
w_ho_small = np.random.randn(2, 1) * 0.5
b_o_small = np.zeros((1, 1))

def forward_small(X):
    z_h = np.dot(X, w_ih_small) + b_h_small
    a_h = sigmoid(z_h)
    z_o = np.dot(a_h, w_ho_small) + b_o_small
    a_o = sigmoid(z_o)
    return z_h, a_h, z_o, a_o

def backward_small(X, y, z_h, a_h, z_o, a_o, lr):
    global w_ih_small, b_h_small, w_ho_small, b_o_small
    m = X.shape[0]

    # Compute both deltas BEFORE touching any weights, so the hidden-layer
    # gradient uses the same weights that produced the forward pass
    output_delta = (a_o - y) * sigmoid_derivative(z_o)
    hidden_delta = np.dot(output_delta, w_ho_small.T) * sigmoid_derivative(z_h)

    w_ho_small -= lr * np.dot(a_h.T, output_delta) / m
    b_o_small -= lr * np.mean(output_delta, axis=0, keepdims=True)
    w_ih_small -= lr * np.dot(X.T, hidden_delta) / m
    b_h_small -= lr * np.mean(hidden_delta, axis=0, keepdims=True)

loss_small = []
for i in range(10000):
    z_h, a_h, z_o, pred = forward_small(X)
    loss_small.append(compute_loss(y, pred))
    backward_small(X, y, z_h, a_h, z_o, pred, 2.0)

plt.figure(figsize=(10, 4))
plt.plot(loss_small, 'purple', linewidth=1)
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.title('Only 2 Hidden Neurons — Network struggles to learn XOR')
plt.grid(True, alpha=0.3)
plt.show()

final_small = forward_small(X)[3]
print("Predictions with only 2 hidden neurons:")
for i in range(len(X)):
    print(f"  {X[i]} → {final_small[i][0]:.4f} (target: {y[i][0]})")
print("\n⚠️  Two hidden neurons can represent XOR in principle, but with so little spare capacity training is more likely to get stuck.")

What Comes Next?

After understanding neural networks, the next major concept is:

Transformers & Attention

These architectures power:

ChatGPT

Claude

Gemini

modern LLMs

image generation systems

multimodal AI

And underneath them all:

the same neural network fundamentals still apply.

THE END… until next time… till then, happy learning!