Neural Networks - Basics with Hands-On
Introduction
When people hear the term Neural Networks, it often sounds mysterious. Almost magical.
But under the hood, neural networks are not magic.
They are simply:
- numbers
- matrix multiplications
- weighted sums
- error calculations
- repeated adjustments
This document explains neural networks from the ground up — the exact same ideas covered in the presentation — but in a more detailed and intuitive way.
By the end, you should understand:
- what a neuron actually does
- how neural networks learn
- why activation functions are necessary
- what loss functions are
- how backpropagation works
- why gradient descent matters
- and how these simple ideas scale all the way to systems like GPT-4
From Rules to Patterns
Traditional Programming
Traditional programming works like this:
Input + Human Written Rules → Output
Example:
if temperature > 30:
    print("Hot")
This works well for simple systems.
But problems begin when tasks become too complex.
For example:
- detecting cats in images
- recognizing speech
- generating human language
- driving a car
- understanding emotions
Humans cannot manually write rules for every possible situation.
Why?
Because the number of possibilities becomes enormous.
Traditional systems are also rigid.
If the input slightly differs from expected patterns, the system fails.
Machine Learning
Machine Learning changes the approach.
Instead of manually writing rules:
Data + Correct Answers → Learn Patterns Automatically
The system discovers mathematical relationships hidden inside data.
This is the central idea behind neural networks.
The real goal of machine learning is:
Discover patterns automatically from data.
The Simplest Prediction Problem
Let’s begin with a simple example.
Suppose we have house prices.
Humans immediately notice a pattern.
Every additional:
500 sq ft → +$100,000
Which means:
$200 per sq ft
So we can estimate:
2500 sq ft → $500,000
But how does a machine learn this automatically?
That is the core question neural networks try to solve.
Machine Learning in One Sentence
Training is simply:
Guess → Measure Error → Correct → Repeat
That’s it.
Everything in deep learning eventually reduces to this loop.
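To make the loop concrete, here is a minimal sketch that learns the price per square foot from the house example. The three data points are hypothetical values chosen to match the $200 per sq ft pattern, and the names k and learning_rate, the step count, and the unit scaling (thousands of sq ft, thousands of dollars) are assumptions made for illustration.

```python
# Hypothetical data consistent with the $200 per sq ft pattern.
sizes = [1.0, 1.5, 2.0]          # thousands of sq ft
prices = [200.0, 300.0, 400.0]   # thousands of dollars

k = 0.0               # initial guess for the price per sq ft
learning_rate = 0.1   # assumed step size

for step in range(200):
    # Guess: predict every price with the current k
    preds = [k * s for s in sizes]
    # Measure error: slope of the mean squared error with respect to k
    grad = sum(2 * (p - t) * s for p, t, s in zip(preds, prices, sizes)) / len(sizes)
    # Correct: nudge k in the direction that lowers the error
    k -= learning_rate * grad

# Use the learned k to estimate a 2500 sq ft house (in thousands of dollars)
estimate_2500 = k * 2.5
print(round(k, 2), round(estimate_2500, 1))
```

The loop never sees the rule "$200 per sq ft"; it only sees examples and keeps correcting its guess.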
The Training Process
The Neuron — A Tiny Decision Maker
The fundamental building block of neural networks is the neuron. A neuron is NOT a real brain cell. It's just a mathematical function that takes numbers in, performs basic arithmetic operations, and outputs a signal.
Inputs
Inputs are numerical representations of features.
Examples: house size, pixel brightness, age, salary, temperature, word embeddings
Suppose:
Weights and Summation
Every input has a weight, and the weight determines how much that input matters. Think of it as an importance score: the larger its magnitude, the more the input influences the output, and a weight near 0 means the input is mostly ignored. Weights are not limited to the range 0 to 1; they can be any real number, and a negative weight means the input pushes the output down.
The neuron computes a weighted sum.
Mathematically (summation):
z = Σ (wᵢ × xᵢ) + b
The sigma symbol (Σ) means: sum everything together.
Bias, denoted by 'b' in the formula, is an additional adjustable parameter: an offset that lets the neuron shift its decision boundary.
Without bias:
- the model becomes too restricted
- outputs are forced through the origin
Bias gives flexibility.
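As a small sketch of the weighted sum plus bias, the snippet below computes it for one neuron with two inputs. All numbers are hypothetical, and the variable names x, w, b, and z are assumptions chosen for illustration.

```python
import numpy as np

x = np.array([2.0, 3.0])   # hypothetical inputs (two features)
w = np.array([0.8, 0.1])   # hypothetical weights, one per input
b = 0.5                    # hypothetical bias

# Weighted sum: multiply each input by its weight, add them up, then add the bias
z = np.dot(w, x) + b       # 0.8*2 + 0.1*3 + 0.5
print(z)
```

Setting b to 0 forces z to be 0 whenever every input is 0, which is the "forced through the origin" restriction.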
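After computing the weighted sum and adding the bias, we apply an activation function.
Formula:
output = f(z), where z = Σ (wᵢ × xᵢ) + b and f is the activation function
The activation decides how much of that signal the neuron passes on.
But what does that actually mean?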
Why Activation Functions Matter
This is one of the most important ideas in deep learning.
Before understanding activation functions, let's first understand the difference between linear and non-linear data.
Linear Example
Suppose we have this dataset:
This relationship is linear.
Why?
Because as:
house size increases → price also increases proportionally
A simple straight-line equation can model this perfectly.
Something like:
price = k × size
works very well here.
This kind of data is easy for simple linear models.
Non-Linear Example
Now consider this dataset:
Now the relationship becomes much more complicated.
Notice:
- house size remains SAME
- but prices change drastically
Why?
Because now:
- metro access matters
- luxury locality matters
- multiple features interact together
This relationship is NOT linear anymore.
A simple straight line cannot model this properly.
And this is exactly why we need:
non-linearity
Without non-linearity, neural networks would fail on real-world data.
Now, suppose we stack two layers WITHOUT activation functions.
y = W₁x + b₁
z = W₂y + b₂
Substitute y:
z = W₂(W₁x + b₁) + b₂
Expand:
z = (W₂W₁)x + (W₂b₁ + b₂)
This simplifies into:
z = W'x + b'
Meaning:
Multiple linear layers collapse into a single linear layer.
Even if you stack:
- 10 layers
- 100 layers
- 1000 layers
without activation functions, the network is still mathematically equivalent to ONE linear transformation.
This is called: Mathematical Collapse
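The collapse can be checked numerically. The sketch below builds two linear layers with random weights and confirms they give the same output as one combined layer. The layer sizes and the random seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two linear layers with arbitrary (assumed) sizes: 2 -> 3 -> 1
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)
x = rng.normal(size=2)

# Pass x through both layers, with no activation in between
two_layers = W2 @ (W1 @ x + b1) + b2

# Build the single equivalent layer: W' = W2 W1, b' = W2 b1 + b2
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2
one_layer = W_prime @ x + b_prime

collapsed = bool(np.allclose(two_layers, one_layer))
print(collapsed)
```

The same check passes for any number of stacked linear layers, since each extra layer just multiplies into W' and b'.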
The Non-Linear Solution
Activation functions solve this problem.
They introduce:
non-linearity
Because of this:
- layers no longer collapse
- deep networks become meaningful
- complex functions can be learned
Another Intuition for Non-Linearity
Suppose you have a graph with:
Something like this:
Without activation functions:
- neural networks can only create straight-line boundaries
- they fail on complex real-world patterns
With activation functions:
- networks can create curved decision boundaries
- learn complex patterns
- separate complicated data distributions
So mathematically, after the summation step, we pass the output into an activation function.
That activation function introduces non-linearity into the network.
You can think of it as adding:
bends
curves
non-linear behavior
into the mathematical representation.
Without activation functions, every layer would only perform:
multiplication + summation
and the entire network would eventually collapse into just one big linear equation.
But because every layer now passes its output through an activation function:
Layer Output → Activation Function → Next Layer
each layer starts doing something different.
This makes the outputs of different layers unique and prevents all the layers from collapsing into a single linear transformation.
Universal Approximation
A neural network with non-linear activations and enough hidden neurons can approximate:
Any continuous function (on a bounded input range, to any desired accuracy).
This is an incredibly important result.
It means neural networks can model:
- speech
- images
- language
- physics simulations
- medical diagnosis
- human behavior
- and much more
Types of Activation Functions:
Linear Activation Function
Non-Linear Activation Function
Modern architectures often use smoother versions of ReLU.
Examples:
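As a rough sketch, ReLU and two smoother functions often mentioned alongside it, GELU and SiLU (also called Swish), can be written in a few lines of NumPy. Picking these two is an assumption made for illustration, and the GELU shown is the common tanh approximation, not the exact definition.

```python
import numpy as np

def relu(x):
    """ReLU: pass positive values through, clip negatives to 0."""
    return np.maximum(0.0, x)

def silu(x):
    """SiLU / Swish: x multiplied by sigmoid(x), a smooth ReLU-like curve."""
    return x / (1.0 + np.exp(-x))

def gelu_tanh(x):
    """GELU, tanh approximation (assumed variant for this sketch)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

xs = np.array([-2.0, 0.0, 2.0])
relu_out = relu(xs)
silu_out = silu(xs)
gelu_out = gelu_tanh(xs)
print(relu_out, silu_out, gelu_out)
```

Unlike ReLU, the smooth versions let a small negative signal through for negative inputs and have no sharp corner at 0.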
The Loss Function
The loss function measures:
How wrong the model is.
Training goal: Minimize loss
Lower loss means:
- predictions are improving
- model is learning patterns correctly
Mean Squared Error (MSE)
A very common regression loss.
Formula:
L = (1/n) Σ(prediction - target)^2
Key idea:
(prediction - target)^2
The square does two important things:
- Makes errors positive
- Punishes large mistakes more heavily
Suppose:
Predicted = $350,000
Actual = $400,000
Error:
$50,000
Squared error:
2,500,000,000
Large mistakes become very expensive.
This pushes the model to reduce big errors aggressively.
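The arithmetic above can be reproduced with a few lines of plain Python. This is a minimal sketch; the function name mean_squared_error and the second, smaller miss of $10,000 are assumptions added to show how squaring scales the penalty.

```python
def mean_squared_error(predictions, targets):
    """Average of the squared differences between predictions and targets."""
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared_errors) / len(squared_errors)

# The example from the text: predicted $350,000, actual $400,000
one_house = mean_squared_error([350_000], [400_000])

# A hypothetical smaller miss of $10,000 for comparison
small_miss = mean_squared_error([390_000], [400_000])

# The error is 5 times larger, but the penalty is 25 times larger
ratio = one_house / small_miss
print(one_house, small_miss, ratio)
```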
Backpropagation — Assigning Blame
The Chain of Responsibility
The Question: "The prediction was wrong. Which specific weights are responsible for this error?"
The Flow: Error signals travel backward from the output layer through the hidden layers to the input.
Adjustment: Each weight is adjusted proportionally to its contribution to the final mistake.
- Error begins at the output layer.
- Then the signal travels backward through the network.
- Every weight receives feedback about how much it contributed to the error.
- Weights responsible for larger mistakes get adjusted more.
- Weights responsible for smaller mistakes get adjusted less.
This is how learning happens.
Understanding Gradients and Backpropagation: The Cooking Analogy
What is a Gradient?
In deep learning, a gradient answers a simple question:
If I tweak this specific weight a little bit, how will the final loss change?
Mathematically, we write this as:
∂L / ∂W
This notation simply means: how much the loss L changes when the weight W changes by a tiny amount.
The Intuition — Who is Responsible?
Imagine your neural network has a specific weight set to:
The Forward Pass: The network makes a prediction.
The Outcome: The prediction is wrong, resulting in a high loss.
Now we ask:
Was this specific weight responsible for the mistake?
During backpropagation, the network checks the sensitivity of that weight.
- If increasing the weight increases the loss: gradient is positive
- If increasing the weight decreases the loss: gradient is negative
- If changing the weight barely changes the loss: gradient is close to zero
A gradient measures: sensitivity
It tells the network exactly how sensitive the loss is to changes in a specific weight.
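One way to see this sensitivity directly is to nudge a weight and watch the loss. The sketch below estimates the gradient by finite differences for a one-weight model. The loss function, the input x = 2, and the target 3 are hypothetical choices; real networks compute gradients analytically with backpropagation, not with this nudging trick.

```python
def loss(w, x=2.0, target=3.0):
    """Squared error of a one-weight model: prediction = w * x."""
    return (w * x - target) ** 2

def numerical_gradient(f, w, eps=1e-5):
    """Estimate dL/dw by nudging w slightly up and down."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)

grad_low = numerical_gradient(loss, 1.0)   # weight too small: negative gradient
grad_high = numerical_gradient(loss, 2.0)  # weight too large: positive gradient
grad_best = numerical_gradient(loss, 1.5)  # prediction matches target: near zero
print(grad_low, grad_high, grad_best)
```

The sign says which way to move the weight, and the magnitude says how strongly the loss reacts.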
Imagine you are cooking a soup.
You taste the final dish and it tastes terrible.
In AI terms:
your loss score is high
Now you need to diagnose the problem.
You ask yourself:
- Did I add too much salt?
- Is there too much chili?
- Did I add too little sugar?
Backpropagation is the mental process of looking backward at the recipe to figure out:
which ingredient caused the bad taste and by how much
The gradient is the exact numerical impact assigned to each ingredient.
Example:
Salt Contribution = +8 badness
Chili Contribution = +2 badness
Sugar Contribution = 0 badness
Interpretation:
High positive gradient → way too much salt
Low positive gradient → slightly too spicy
Zero gradient → sugar didn't cause the issue
Once the network knows the gradients:
(the impact numbers)
training becomes the process of adjusting the recipe.
So next time:
- reduce the salt
- slightly reduce the chili
- leave the sugar unchanged
Repeat this process again and again until the soup tastes perfect.
This is exactly how neural networks learn.
What Backpropagation Actually Does
During training:
Step 1: Forward pass
- the input goes through the network and generates a prediction
Step 2: Compute loss
- compare the predicted value with the actual value
Step 3: Backpropagation
- move backward through the network using the chain rule from calculus; backprop computes the gradient ∂L / ∂W for every weight
What Gradient value means
Suppose ∂L / ∂W = 5 → high positive gradient
- What it means: A tiny increase in this weight increases the loss significantly.
- The Action: The optimizer needs to decrease this weight strongly to drop the loss.
If ∂L / ∂W = -3 → negative gradient
- What it means: Increasing this weight will reduce the loss.
- The Action: The optimizer needs to increase this weight to drive the loss down.
Learning Rate
Learning rate controls:
How big each update step should be
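- it controls how quickly the model learns during training
- it determines the size of the step taken to minimize the loss function
After backpropagation computes the gradient (how much each weight contributed to the error), the optimizer updates each weight like this:
new weight = current weight - learning rate × gradient
Imagine you're trying to reach the bottom of a valley:
- If the learning rate is too small: each step is tiny, so reaching the bottom takes a very long time.
- If the learning rate is too large: you overshoot the bottom, and the loss oscillates or explodes.
- Ideal learning rate: steps large enough to make steady progress, but small enough not to overshoot.
Ex: current weight = 0.8, gradient = 0.2, learning rate = 0.1 → new weight = 0.8 - (0.1 × 0.2) = 0.78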
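The update can be written as a one-line function. The sketch below applies it to the example values (weight 0.8, gradient 0.2, learning rate 0.1) and to the gradients 5 and -3 discussed earlier. The function name update_weight is an assumption, and reusing the starting weight 0.8 for the other two cases is done only for illustration.

```python
def update_weight(weight, gradient, learning_rate):
    """Move the weight against the gradient, scaled by the learning rate."""
    return weight - learning_rate * gradient

# The worked example: 0.8 - 0.1 * 0.2
new_weight = update_weight(0.8, 0.2, 0.1)

# Positive gradient (5): the weight is pushed down
after_positive = update_weight(0.8, 5.0, 0.1)

# Negative gradient (-3): the weight is pushed up
after_negative = update_weight(0.8, -3.0, 0.1)

print(new_weight, after_positive, after_negative)
```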
Gradient Descent — Finding the Valley
Gradient descent is the optimization strategy.
Goal:
Reduce loss
The Gradient
Gradient means:
slope
It tells the model:
Which direction decreases loss?
Step Downhill
The process:
- calculate gradient
- move slightly downhill
- reduce loss
- repeat
Eventually the model reaches a low-error region.
This process is called:
Gradient Descent
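The downhill walk can be sketched on the simplest possible valley. The loss below, (w - 3)², is a hypothetical stand-in for a real loss surface, and the starting point, learning rate, and step count are arbitrary assumptions.

```python
def loss(w):
    """A hypothetical one-dimensional valley with its bottom at w = 3."""
    return (w - 3.0) ** 2

def gradient(w):
    """Slope of the valley at w."""
    return 2.0 * (w - 3.0)

w = 0.0               # assumed starting point
learning_rate = 0.1   # assumed step size
history = []

for step in range(100):
    history.append(loss(w))            # record the loss before each step
    w -= learning_rate * gradient(w)   # step downhill, against the slope

print(w, history[0], history[-1])
```

Each step moves w a fraction of the way toward the bottom, so the recorded loss keeps shrinking.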
Global Minimum
The ideal destination is:
Global Minimum
This is the point where error is as low as possible.
In practice:
- loss surfaces are huge
- many local minima exist
- optimization is difficult
Yet gradient descent works surprisingly well.
The Complete Training Loop
Everything comes together in four major steps.
1. Forward Pass
Data moves through the network.
Inputs become predictions.
Example:
Input → Hidden Layer → Output
2. Calculate Loss
Prediction is compared with actual answer.
Loss measures error.
3. Backpropagate
Error flows backward.
The network computes:
Which weights caused the mistake?
4. Update Weights
Weights are adjusted using:
Gradient Descent
Goal:
Reduce future error
Scaling to Massive Models
The same exact principles scale from tiny toy networks to massive AI systems.
XOR Network
A tiny XOR neural network may have:
~20 parameters
Yet it still uses:
- forward pass
- loss calculation
- backpropagation
- weight updates
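For a sense of scale, the parameter count of a small fully connected network is easy to compute: each layer has one weight per input-output pair plus one bias per output. The helper below is a hypothetical sketch; the layer sizes 2, 4, 1 match the XOR network built later in this document.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out   # weights between the two layers
        total += n_out          # one bias per neuron in the next layer
    return total

# 2 inputs -> 4 hidden neurons -> 1 output: (2*4 + 4) + (4*1 + 1)
xor_params = count_parameters([2, 4, 1])
print(xor_params)
```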
GPT-4 Scale
Modern large language models operate with:
hundreds of billions to roughly a trillion parameters (GPT-4's exact size has not been published)
But fundamentally:
the learning loop is still the same.
Even massive AI systems are still doing:
Forward Pass
→ Loss
→ Backpropagation
→ Weight Updates
just at enormous scale.
Key Takeaways
1. Neural Networks are Simple Building Blocks
Neural networks are layers of neurons performing weighted sums.
The complexity emerges from scale and composition.
2. Training is Iterative Improvement
Training is simply:
Guess
→ Measure Error
→ Correct
→ Repeat
3. Backpropagation Assigns Responsibility
Backpropagation determines:
Which weights caused the error?
and adjusts them accordingly.
4. Gradient Descent Optimizes Learning
Gradient descent continuously moves weights toward lower loss.
This is the optimization engine behind deep learning.
Final Intuition
A neural network is fundamentally:
A giant mathematical function approximator.
It learns:
Input → Output relationships
from examples.
Whether the task is:
- predicting house prices
- recognizing faces
- translating languages
- generating code
- creating images
- or powering GPT models
The core principles remain exactly the same.
From Theory to Practice — Building a Neural Network for XOR
At this point, we understand the theory behind neural networks:
- neurons
- weights
- activation functions
- loss
- backpropagation
- gradient descent
But understanding theory alone is not enough.
Now comes the fun part:
Building an actual neural network from scratch.
And for that, we will use one of the most famous problems in deep learning:
The XOR Problem
At first glance, XOR looks simple.
But this tiny problem changed the history of neural networks.
Why?
Because:
A single linear neuron cannot solve XOR.
If we try to separate the outputs using one straight line:
it fails
The network needs:
hidden layers + non-linearity
This makes XOR the perfect beginner problem for understanding:
- why hidden layers exist
- why activation functions matter
- how neural networks learn complex patterns
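A quick way to get a feel for the limitation is to try many straight lines and count how many XOR points each one gets right. The sketch below searches a coarse grid of weights and biases for a single thresholded linear neuron. The grid range and spacing are arbitrary assumptions, and a grid search is an illustration, not a proof.

```python
import itertools
import numpy as np

# The four XOR points and their labels
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# Candidate values for w1, w2, and b (assumed range and spacing)
grid = np.linspace(-2.0, 2.0, 21)

best_correct = 0
for w1, w2, b in itertools.product(grid, grid, grid):
    # A single linear neuron with a hard threshold at 0
    predictions = (X[:, 0] * w1 + X[:, 1] * w2 + b > 0).astype(int)
    best_correct = max(best_correct, int(np.sum(predictions == y)))

print(best_correct)
```

No line on the grid classifies all four points correctly; the best ones manage three.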
What We'll Build
We'll build a neural network using only NumPy (basic math) and train it to solve a problem that a single neuron cannot solve.
What We'll Cover
1. The XOR Problem — Why we need hidden layers
2. Building a Neural Network — Forward pass from scratch
3. The Training Loop — Loss, backprop, weight updates
4. Watching It Learn — Visualizing training
5. Breaking It — What happens with bad hyperparameters
Part 1: The XOR Problem
Why XOR?
XOR (exclusive or) is a simple logical operation:
- If inputs are different → output 1
- If inputs are the same → output 0
import numpy as np
import matplotlib.pyplot as plt
# Our training data: XOR
X = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1]
])
y = np.array([
    [0],
    [1],
    [1],
    [0]
])
print("XOR Dataset:")
print("-" * 30)
for i in range(len(X)):
    print(f"Input: {X[i]} → Output: {y[i][0]}")
Visualizing the Problem (optional)
Let's plot the XOR data. You'll see why a straight line can't separate the classes.
plt.figure(figsize=(8, 6))
# Plot the points
for i in range(len(X)):
    color = 'red' if y[i][0] == 0 else 'blue'
    marker = 'o' if y[i][0] == 0 else 's'
    plt.scatter(X[i][0], X[i][1], c=color, s=200, marker=marker,
                edgecolors='black', linewidths=2)
plt.xlabel('Input A', fontsize=12)
plt.ylabel('Input B', fontsize=12)
plt.title('XOR Problem: Can you draw a single straight line to separate red from blue?',
fontsize=12)
plt.xlim(-0.5, 1.5)
plt.ylim(-0.5, 1.5)
plt.grid(True, alpha=0.3)
plt.legend(['Class 0 (same)', 'Class 1 (different)'], loc='upper right')
plt.show()
print("\n❌ A single straight line CANNOT separate these classes.")
print("✅ This is why we need hidden layers — they create non-linear boundaries.")
Part 2: Building the Neural Network
Our Architecture
Input Layer (2 neurons) → Hidden Layer (4 neurons) → Output Layer (1 neuron)
- Input: 2 values (the two XOR inputs)
- Hidden: 4 neurons with sigmoid activation
- Output: 1 neuron with sigmoid activation (gives us 0-1 probability)
Why Sigmoid?
For this educational example, we use sigmoid everywhere because:
- Output is naturally between 0 and 1 (matches our target)
- The math is clean and easy to follow
- It's historically important
In practice, you'd use ReLU for hidden layers. But sigmoid helps us see what’s happening.
# Network architecture
INPUT_SIZE = 2
HIDDEN_SIZE = 4
OUTPUT_SIZE = 1
# Weights from input to hidden layer
weights_input_hidden = np.random.randn(INPUT_SIZE, HIDDEN_SIZE) * 0.5
bias_hidden = np.zeros((1, HIDDEN_SIZE))
# Weights from hidden to output layer
weights_hidden_output = np.random.randn(HIDDEN_SIZE, OUTPUT_SIZE) * 0.5
bias_output = np.zeros((1, OUTPUT_SIZE))
print("Network initialized with random weights:")
print(f" Input → Hidden weights shape: {weights_input_hidden.shape}")
print(f" Hidden → Output weights shape: {weights_hidden_output.shape}")
print(f"bias (input->hidden) + bias(hidden->output): {bias_hidden.size}, { bias_output.size}")
print(f"\nTotal parameters: {weights_input_hidden.size }+{ bias_hidden.size }+{ weights_hidden_output.size }+{ bias_output.size}={weights_input_hidden.size + bias_hidden.size + weights_hidden_output.size + bias_output.size}")
The Activation Function: Sigmoid
Sigmoid squashes any number into the range (0, 1):
- Large positive numbers → close to 1
- Large negative numbers → close to 0
- Zero → exactly 0.5
We also need its derivative for backpropagation.
Sigmoid Function
σ(x) = 1 / (1 + e^(-x))
def sigmoid(x):
    """Squash values to range (0, 1)"""
    return 1 / (1 + np.exp(-x))
def sigmoid_derivative(x):
    """Derivative of sigmoid: σ(x) * (1 - σ(x))"""
    s = sigmoid(x)
    return s * (1 - s)
# Visualize sigmoid
x_range = np.linspace(-6, 6, 100)
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(x_range, sigmoid(x_range), 'b-', linewidth=2)
plt.axhline(y=0.5, color='gray', linestyle='--', alpha=0.5)
plt.axvline(x=0, color='gray', linestyle='--', alpha=0.5)
plt.xlabel('Input')
plt.ylabel('Output')
plt.title('Sigmoid Function: σ(x) = 1/(1+e⁻ˣ)')
plt.grid(True, alpha=0.3)
plt.subplot(1, 2, 2)
plt.plot(x_range, sigmoid_derivative(x_range), 'r-', linewidth=2)
plt.axhline(y=0.25, color='gray', linestyle='--', alpha=0.5, label='max = 0.25')
plt.xlabel('Input')
plt.ylabel('Derivative')
plt.title('Sigmoid Derivative (max value = 0.25)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("\nNotice: The maximum derivative is only 0.25!")
print(" This is what drives the vanishing gradient problem: the signal shrinks at every sigmoid layer.")
print(" 10 layers: 0.25^10 = ", 0.25**10)
Forward Pass
The forward pass is how data flows through the network:
- Input → Hidden: Multiply inputs by weights, add bias, apply activation
- Hidden → Output: Multiply hidden by weights, add bias, apply activation
Let's trace through exactly what happens.
def forward(X):
    """
    Forward pass through the network.
    Returns all intermediate values (we need them for backprop).
    """
    # Step 1: Input to Hidden
    z_hidden = np.dot(X, weights_input_hidden) + bias_hidden
    a_hidden = sigmoid(z_hidden)
    # Step 2: Hidden to Output
    z_output = np.dot(a_hidden, weights_hidden_output) + bias_output
    a_output = sigmoid(z_output)
    return z_hidden, a_hidden, z_output, a_output
# Test forward pass with untrained network
z_h, a_h, z_o, predictions = forward(X)
print("Forward pass with UNTRAINED network:")
print("-" * 50)
for i in range(len(X)):
    print(f"Input: {X[i]} → Prediction: {predictions[i][0]:.4f} (Target: {y[i][0]})")
print("\n❌ Predictions are garbage — the network hasn't learned anything yet.")
Loss Function: Mean Squared Error
Loss measures how wrong our predictions are. Lower = better.
MSE = mean((prediction - target)^2)
We square the error so:
- all errors are positive
- big errors are penalized more than small errors
def compute_loss(y_true, y_pred):
    """Mean Squared Error"""
    return np.mean((y_true - y_pred) ** 2)
# Calculate initial loss
initial_loss = compute_loss(y, predictions)
print(f"Initial Loss (untrained): {initial_loss:.4f}")
print("\nThis number should decrease as we train.")
Now we need to adjust the weights and biases so that this loss becomes as small as possible (close to 0). To do that, we first need to identify how much each parameter contributed to the loss, and that is what backpropagation does.
Part 3: Backpropagation
This is where the magic happens.
Backprop answers:
"Which weights caused the error, and how much?"
The Chain of Blame
- Calculate error at output
- Figure out how much each output weight contributed
- Propagate error back to hidden layer
- Figure out how much each hidden weight contributed
- Adjust all weights proportionally
The math uses the chain rule from calculus, but the intuition is simple: blame flows backward
def backward(X, y, z_hidden, a_hidden, z_output, a_output, learning_rate):
    """
    Backpropagation: compute gradients and update weights.
    """
    global weights_input_hidden, bias_hidden, weights_hidden_output, bias_output
    m = X.shape[0]
    # OUTPUT LAYER
    # Error at output: difference between prediction and target
    # (the constant factor 2 from the MSE derivative is folded into the learning rate)
    output_error = a_output - y
    # Gradient of loss wrt z_output (before activation)
    output_delta = output_error * sigmoid_derivative(z_output)
    # Gradient of loss wrt weights_hidden_output
    # How much did each weight contribute to the error?
    grad_weights_hidden_output = np.dot(a_hidden.T, output_delta) / m
    grad_bias_output = np.mean(output_delta, axis=0, keepdims=True)
    # HIDDEN LAYER
    # Propagate error back to hidden layer
    hidden_error = np.dot(output_delta, weights_hidden_output.T)
    # Gradient of loss wrt z_hidden
    hidden_delta = hidden_error * sigmoid_derivative(z_hidden)
    # Gradient of loss wrt weights_input_hidden
    grad_weights_input_hidden = np.dot(X.T, hidden_delta) / m
    grad_bias_hidden = np.mean(hidden_delta, axis=0, keepdims=True)
    # UPDATE WEIGHTS
    # Move weights in the opposite direction of the gradient
    # (gradient points uphill, we want to go downhill)
    weights_hidden_output -= learning_rate * grad_weights_hidden_output
    bias_output -= learning_rate * grad_bias_output
    weights_input_hidden -= learning_rate * grad_weights_input_hidden
    bias_hidden -= learning_rate * grad_bias_hidden
print("Backpropagation function defined.")
print("This is the learning part — adjusting weights to reduce error.")
Why Multiply?
output_delta = output_error × sigmoid_derivative(z_output)
This combines:
How wrong are we? × How sensitive is the neuron?
Think About It Like This
Imagine:
error = 100
BUT neuron sensitivity:
0.00001
Then:
delta = 100 × 0.00001 = 0.001
Even though error is huge,
the neuron cannot change output much here.
So update should stay small.
Another Example
Suppose:
output_error=0.8
and:
sigmoid_derivative=0.25
Then:
output_delta = 0.8 × 0.25 = 0.2
This final value:
0.2
is the actual signal used to update weights.
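A common sanity check for hand-written backpropagation is to compare its gradients against finite differences. The sketch below does this for the input-to-hidden weights of a 2-4-1 sigmoid network on XOR. It is self-contained and uses its own names (W1, b1, W2, b2, loss_fn), which are assumptions and not the variables of the network above. The loss is taken as half the mean squared error so that the formulas match without the constant factor 2.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def loss_fn(W1, b1, W2, b2, X, y):
    """Half mean squared error of a 2-layer sigmoid network (assumed loss)."""
    a_h = sigmoid(X @ W1 + b1)
    a_o = sigmoid(a_h @ W2 + b2)
    return 0.5 * np.mean((a_o - y) ** 2)

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
W1, b1 = rng.normal(size=(2, 4)) * 0.5, np.zeros((1, 4))
W2, b2 = rng.normal(size=(4, 1)) * 0.5, np.zeros((1, 1))

# Analytic gradient for W1, using the delta rule described above
m = X.shape[0]
a_h = sigmoid(X @ W1 + b1)
a_o = sigmoid(a_h @ W2 + b2)
output_delta = (a_o - y) * a_o * (1 - a_o)              # error x sigmoid slope
hidden_delta = (output_delta @ W2.T) * a_h * (1 - a_h)  # blame passed backward
analytic = X.T @ hidden_delta / m

# Numerical gradient: nudge each weight up and down and measure the loss change
eps = 1e-6
numeric = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        W_plus, W_minus = W1.copy(), W1.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        numeric[i, j] = (loss_fn(W_plus, b1, W2, b2, X, y)
                         - loss_fn(W_minus, b1, W2, b2, X, y)) / (2 * eps)

max_diff = float(np.max(np.abs(analytic - numeric)))
print(max_diff)
```

If the two gradients agree to many decimal places, the chain-rule bookkeeping is very likely correct.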
Part 4: The Training Loop
Now we put it all together:
for each iteration:
1. Forward pass → get predictions
2. Calculate loss → how wrong are we?
3. Backward pass → compute gradients, update weights
Let's train for 10,000 iterations and watch the loss decrease.
# Reset weights
np.random.seed(42)
weights_input_hidden = np.random.randn(INPUT_SIZE, HIDDEN_SIZE) * 0.5
bias_hidden = np.zeros((1, HIDDEN_SIZE))
weights_hidden_output = np.random.randn(HIDDEN_SIZE, OUTPUT_SIZE) * 0.5
bias_output = np.zeros((1, OUTPUT_SIZE))
# Hyperparameters
learning_rate = 2.0
iterations = 10000
# Track loss over time
loss_history = []
print("Training started...")
print("-" * 50)
for i in range(iterations):
    z_h, a_h, z_o, predictions = forward(X)
    loss = compute_loss(y, predictions)
    loss_history.append(loss)
    backward(X, y, z_h, a_h, z_o, predictions, learning_rate)
    if i % 2000 == 0:
        print(f"Iteration {i:5d} | Loss: {loss:.6f}")
Let's See the Results!
# Final predictions
_, _, _, final_predictions = forward(X)
print("Final Results After Training:")
print("-" * 50)
print(f"{'Input':<12} {'Target':<10} {'Prediction':<12} {'Rounded':<10}")
print("-" * 50)
for i in range(len(X)):
    pred = final_predictions[i][0]
    rounded = round(pred)
    status = "✅" if rounded == y[i][0] else "❌"
    print(f"{str(X[i]):<12} {y[i][0]:<10} {pred:<12.4f} {rounded:<10} {status}")
print("-" * 50)
print("\n🎉 The network learned XOR from random weights!")
# Plot the loss curve
plt.figure(figsize=(10, 5))
plt.plot(loss_history, 'b-', linewidth=0.5)
plt.xlabel('Iteration', fontsize=12)
plt.ylabel('Loss (MSE)', fontsize=12)
plt.title('Training Loss Over Time', fontsize=14)
plt.grid(True, alpha=0.3)
plt.show()
print("The loss started high (random guessing) and decreased (learning).")
Part 5: Breaking It (Experiments)
Understanding what breaks a network teaches you more than seeing it work.
Experiment 1: Learning Rate Too High
# Reset and train with learning rate = 100
np.random.seed(42)
weights_input_hidden = np.random.randn(INPUT_SIZE, HIDDEN_SIZE) * 0.5
bias_hidden = np.zeros((1, HIDDEN_SIZE))
weights_hidden_output = np.random.randn(HIDDEN_SIZE, OUTPUT_SIZE) * 0.5
bias_output = np.zeros((1, OUTPUT_SIZE))
lr_high = 100.0
loss_high_lr = []
for i in range(1000):
    z_h, a_h, z_o, pred = forward(X)
    loss_high_lr.append(compute_loss(y, pred))
    backward(X, y, z_h, a_h, z_o, pred, lr_high)
plt.figure(figsize=(10, 4))
plt.plot(loss_high_lr, 'r-', linewidth=1)
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.title(f'Learning Rate = {lr_high} (TOO HIGH) — Loss explodes or oscillates')
plt.grid(True, alpha=0.3)
plt.show()
Experiment 2: Learning Rate Too Low
# Reset and train with learning rate = 0.001
np.random.seed(42)
weights_input_hidden = np.random.randn(INPUT_SIZE, HIDDEN_SIZE) * 0.5
bias_hidden = np.zeros((1, HIDDEN_SIZE))
weights_hidden_output = np.random.randn(HIDDEN_SIZE, OUTPUT_SIZE) * 0.5
bias_output = np.zeros((1, OUTPUT_SIZE))
lr_low = 0.001
loss_low_lr = []
for i in range(10000):
    z_h, a_h, z_o, pred = forward(X)
    loss_low_lr.append(compute_loss(y, pred))
    backward(X, y, z_h, a_h, z_o, pred, lr_low)
plt.figure(figsize=(10, 4))
plt.plot(loss_low_lr, 'orange', linewidth=1)
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.title(f'Learning Rate = {lr_low} (TOO LOW) — Barely moves after 10,000 iterations')
plt.grid(True, alpha=0.3)
plt.show()
Experiment 3: Not Enough Hidden Neurons
# Try with only 2 hidden neurons
np.random.seed(42)
w_ih_small = np.random.randn(2, 2) * 0.5
b_h_small = np.zeros((1, 2))
w_ho_small = np.random.randn(2, 1) * 0.5
b_o_small = np.zeros((1, 1))
def forward_small(X):
    z_h = np.dot(X, w_ih_small) + b_h_small
    a_h = sigmoid(z_h)
    z_o = np.dot(a_h, w_ho_small) + b_o_small
    a_o = sigmoid(z_o)
    return z_h, a_h, z_o, a_o
def backward_small(X, y, z_h, a_h, z_o, a_o, lr):
    global w_ih_small, b_h_small, w_ho_small, b_o_small
    m = X.shape[0]
    output_delta = (a_o - y) * sigmoid_derivative(z_o)
    # Compute the hidden delta BEFORE updating w_ho_small, so the gradient
    # uses the same weights that produced this forward pass
    hidden_delta = np.dot(output_delta, w_ho_small.T) * sigmoid_derivative(z_h)
    w_ho_small -= lr * np.dot(a_h.T, output_delta) / m
    b_o_small -= lr * np.mean(output_delta, axis=0, keepdims=True)
    w_ih_small -= lr * np.dot(X.T, hidden_delta) / m
    b_h_small -= lr * np.mean(hidden_delta, axis=0, keepdims=True)
loss_small = []
for i in range(10000):
    z_h, a_h, z_o, pred = forward_small(X)
    loss_small.append(compute_loss(y, pred))
    backward_small(X, y, z_h, a_h, z_o, pred, 2.0)
plt.figure(figsize=(10, 4))
plt.plot(loss_small, 'purple', linewidth=1)
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.title('Only 2 Hidden Neurons — Network struggles to learn XOR')
plt.grid(True, alpha=0.3)
plt.show()
final_small = forward_small(X)[3]
print("Predictions with only 2 hidden neurons:")
for i in range(len(X)):
    print(f" {X[i]} → {final_small[i][0]:.4f} (target: {y[i][0]})")
print("\n⚠️ Two hidden neurons can represent XOR in principle, but with so few neurons training is more likely to get stuck.")
What Comes Next?
After understanding neural networks, the next major concept is:
Transformers & Attention
These architectures power:
- ChatGPT
- Claude
- Gemini
- modern LLMs
- image generation systems
- multimodal AI
And underneath them all:
the same neural network fundamentals still apply.
THE END... until next time. Till then, happy learning!