Building tinygrad - inspired by micrograd
Building a Neural Network by Understanding How Autograd Systems Like PyTorch Work
A Step-by-Step Journey Into Backpropagation and Autograd
Modern deep learning frameworks like PyTorch make neural networks feel almost magical.
loss.backward()
One line.
This blog is a deep, step-by-step walkthrough inspired by Andrej Karpathy’s micrograd project, where we build a tinygrad (tiny autograd engine) from scratch and then use it to build a neural network.
The goal is not just to use neural networks.
The goal is to understand:
how neural networks actually learn under the hood
What We’re Going to Build
We will build:
- A tiny autograd engine
- Automatic backpropagation
- A computation graph system
- A neuron
- A neural network
- Gradient descent training
And by the end, you’ll understand what frameworks like PyTorch are doing internally.
The Overall Learning Flow
Derivatives
↓
Computation Graph
↓
Local Gradients
↓
Chain Rule
↓
Backpropagation
↓
Autograd Engine
↓
Neuron
↓
Layer
↓
Neural Network
↓
Training
PART 1: Understanding Derivatives
Before building autograd, we first need to understand gradients.
Because neural networks learn through gradients.
Derivative: a fundamental tool that quantifies how sensitive a function's output is to a change in its input.
dy/dx represents the instantaneous rate of change of y with respect to x. It tells you how fast one variable changes relative to another at an exact moment.
What is dy/dx? In calculus, dy/dx is called the derivative.
- 📈 It is a curve's slope.
- 🚗 It is instantaneous speed.
- 🎯 It is the slope of the tangent line.
The Limit Definition
The true definition of the derivative is:
f'(x) = lim h→0 (f(x + h) - f(x)) / h
Here is how to read the components:
- (f(x + h) - f(x)): the vertical change (rise).
- h: the horizontal change (run).
- lim h→0: shrinks the step size toward zero.
What the derivative formula means
Take a tiny change in x:
x → x + h
Then the function changes from:
f(x) to f(x + h)
The change in output is:
f(x + h) - f(x)
Divide by the change in input h:
(f(x + h) - f(x)) / h
That gives the average rate of change.
Then let h become extremely small:
h → 0
and you get the instantaneous rate of change: the derivative.
Step 1: What Needs To Be Implemented and Why?
Neural networks learn by adjusting weights.
But how does the network know:
which weight caused the error?
That’s where gradients come in.
A gradient tells us:
How much changing a parameter changes the final loss.
Mathematically:
dLoss / dWeight
This is the foundation of backpropagation.
Step 2: What Are We Implementing?
We start with a simple mathematical function.
def f(x):
    return 3*x**2 - 4*x + 5
Step 3: What Is This Doing?
This function gives us a curve.
We now ask:
if x changes slightly, how much does f(x) change?
That is the derivative.
We numerically approximate it like this:
h = 0.000001
x = 2/3
(f(x + h) - f(x)) / h
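To see how good the approximation is, the sketch below compares it with the exact derivative, which for this polynomial is 6x - 4 by the power rule. The helper names numerical_derivative and analytic_derivative are made up for this illustration.

```python
def f(x):
    return 3 * x**2 - 4 * x + 5

def numerical_derivative(func, x, h=1e-6):
    # Forward-difference approximation of the slope at x.
    return (func(x + h) - func(x)) / h

def analytic_derivative(x):
    # Power rule applied to 3x^2 - 4x + 5.
    return 6 * x - 4

for x in (3.0, -3.0, 2 / 3):
    print(x, numerical_derivative(f, x), analytic_derivative(x))
```

At x = 2/3 both values are essentially zero: that is the bottom of the curve, where the slope is flat.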
Step 4: What’s Next?
Now that we understand derivatives, we need to understand:
how derivatives flow through multiple operations
That leads us to computation graphs.
PART 2: Building the Computation Graph
Neural networks are not one equation.
They are chains of many small mathematical operations.
To track gradients through these operations, we need a graph.
Step 1: What Needs To Be Implemented and Why?
If a loss depends on many operations:
weight → multiply → activation → output → loss
then we need a way to remember:
- where values came from
- which operations created them
- how gradients should flow backward
This is called a computation graph.
Step 2: What Are We Implementing?
We create a small expression:
a = 2.0
b = -3.0
c = 10.0
d = a * b + c
Step 3: What Is This Doing?
Even though it looks simple, this creates a graph.
Each node depends on previous nodes.
This is exactly how neural networks work internally.
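Before building any machinery, we can already ask how sensitive d is to each input by nudging one input at a time. This is only a numerical sketch; the helper d_of and the step size h are chosen for illustration.

```python
def d_of(a, b, c):
    return a * b + c

h = 1e-4
a, b, c = 2.0, -3.0, 10.0
base = d_of(a, b, c)                      # 4.0

dd_da = (d_of(a + h, b, c) - base) / h    # close to b = -3.0
dd_db = (d_of(a, b + h, c) - base) / h    # close to a = 2.0
dd_dc = (d_of(a, b, c + h) - base) / h    # close to 1.0
print(base, dd_da, dd_db, dd_dc)
```

These three numbers are the gradients that the autograd engine will later compute for us without any nudging.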
Step 4: What’s Next?
Now we need a way for every number to remember:
how it was created
That leads us to the Value object.
PART 3: Building the Value Object
This is the heart of the autograd engine.
Step 1: What Needs To Be Implemented and Why?
Normal Python numbers do not remember:
- previous operations
- gradients
- dependencies
But autograd systems need this information.
So we need a custom object.
Step 2: What Are We Implementing?
We build a Value class.
class Value:
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op
        self.label = label
    def __repr__(self):
        return f"Value(data={self.data})"
The __repr__ method controls how a Value is printed, so that we see the data it holds.
Step 3: What Is This Doing?
Each Value object stores:
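- data: the raw number
- grad: the derivative of the final output with respect to this value, 0.0 at the start
- _backward: a function that passes this node's gradient to its inputs, a no-op by default
- _prev: the set of Values this one was computed from
- _op: the operation that produced it
- label: an optional name for display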
This means every value now remembers:
where it came from
That’s the foundation of autograd.
Step 4: What’s Next?
Now we need mathematical operations to build graphs automatically.
PART 4: Implementing Mathematical Operations
Step 1: What Needs To Be Implemented and Why?
When we do:
c = a + b
we want:
- a new Value object
- graph tracking
- gradient tracking
So operations themselves must become graph-aware.
Step 2: What Are We Implementing?
Addition:
def __add__(self, other):
    out = Value(self.data + other.data, (self, other), '+')
    return out
Multiplication:
def __mul__(self, other):
    out = Value(self.data * other.data, (self, other), '*')
    return out
Add these methods to our Value class so that Value objects can be added and multiplied.
Step 3: What Is This Doing?
Every operation now:
creates a new node and links parent nodes
This builds the computation graph dynamically.
Exactly like PyTorch.
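To make the graph visible, the sketch below rebuilds the expression d = a * b + c from Part 2 with a stripped-down Value (no gradients yet) and inspects what the result remembers. The trimmed class is a reduced copy made for this example, not a replacement for the full one.

```python
class Value:
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self._prev = set(_children)
        self._op = _op
        self.label = label

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), '+')

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), '*')

a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = Value(10.0, label='c')
d = a * b + c

print(d)         # Value(data=4.0)
print(d._op)     # +
print(d._prev)   # the product node and c, in no particular order
```

The final node d only knows its two direct parents. The product node in turn knows a and b, and following these links is how the whole graph can be walked later.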
Step 4: What’s Next?
Now we need gradients.
We need to teach every operation:
how its output changes with respect to its inputs
That leads to local gradients.
PART 5: Local Gradients
Step 1: What Needs To Be Implemented and Why?
Every mathematical operation has its own derivative rules.
Example:
For multiplication:
z = x * y
Then:
dz/dx = y
dz/dy = x
These are local gradients.
Without them, backpropagation cannot happen.
Step 2: What Are We Implementing?
Inside multiplication:
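def __mul__(self, other):
    # wrap plain numbers so that Value * 2.0 also works
    other = other if isinstance(other, Value) else Value(other)
    out = Value(self.data * other.data, (self, other), '*')
    def _backward():
        # local gradient of x*y is y for x and x for y
        self.grad += other.data * out.grad
        other.grad += self.data * out.grad
    out._backward = _backward
    return out
Do the same for addition: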
def __add__(self, other):
    other = other if isinstance(other, Value) else Value(other)
    out = Value(self.data + other.data, (self, other), '+')
    def _backward():
        self.grad += 1.0 * out.grad
        other.grad += 1.0 * out.grad
    out._backward = _backward
    return out
Step 3: What Is This Doing?
This is extremely important.
Each operation now knows:
how gradients should flow backward
The operation stores its own tiny backward rule.
That’s the core idea behind autograd.
Step 4: What’s Next?
Now we need to combine local gradients across the graph.
That leads us to the chain rule.
PART 6: Chain Rule and Backpropagation
This is the core mathematical idea behind neural network learning.
Step 1: What Needs To Be Implemented and Why?
A weight does not affect the loss directly.
It affects:
weight → neuron → activation → output → loss
So gradients must flow through many operations.
This is done using the chain rule.
Step 2: What Are We Implementing?
The chain rule:
dLoss/dw = (dLoss/da) * (da/dz) * (dz/dw)
Step 3: What Is This Doing?
Backpropagation is simply:
multiplying many tiny local derivatives together
across the graph.
That’s it.
There is no magic.
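As a small worked case, take one weight w feeding z = w*x + b, then a = tanh(z), then Loss = (a - y)^2. The sketch below multiplies the three local derivatives and checks the product against a numerical estimate. The input values and the helper loss_fn are arbitrary choices made for illustration.

```python
import math

def loss_fn(w, x=0.5, b=0.1, y=1.0):
    z = w * x + b
    a = math.tanh(z)
    return (a - y) ** 2

w, x, b, y = 0.8, 0.5, 0.1, 1.0

# Forward pass, keeping the intermediate values.
z = w * x + b
a = math.tanh(z)

# Local derivative of each step.
dloss_da = 2 * (a - y)      # d/da of (a - y)^2
da_dz = 1 - a ** 2          # derivative of tanh
dz_dw = x                   # d/dw of w*x + b

# Chain rule: multiply the local derivatives along the path.
dloss_dw = dloss_da * da_dz * dz_dw

# Numerical check with a central difference.
h = 1e-6
numeric = (loss_fn(w + h) - loss_fn(w - h)) / (2 * h)
print(dloss_dw, numeric)
```

The two printed numbers should agree to several decimal places, which is the chain rule doing exactly what the nudging experiment measures.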
Step 4: What’s Next?
Now we need to automate gradient flow.
That leads us to backward().
PART 7: Automating Backpropagation
Step 1: What Needs To Be Implemented and Why?
Manually calling _backward() on every node, in the right order, is impractical in large networks.
We need automatic backpropagation.
Step 2: What Are We Implementing?
We need a backward() method that computes all the gradients automatically. The order matters: a node can only pass its gradient on to its inputs once its own gradient is complete, which means every node that uses it must already have been processed. A topological sort of the graph gives us exactly that order.
Add this to the Value class:
def backward(self):
    topo = []
    visited = set()
    def build_topo(v):
        if v not in visited:
            visited.add(v)
            for child in v._prev:
                build_topo(child)
            topo.append(v)
    build_topo(self)
    self.grad = 1.0
    for node in reversed(topo):
        node._backward()
Step 3: What Is This Doing?
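build_topo appends a node only after all the nodes it was computed from, so in topo every node comes after its inputs.
Walking that list in reverse starts at the output and reaches each node only after every node that uses it has already passed its gradient back:
for node in reversed(topo):
    node._backward()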
This is the complete autograd engine.
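Putting the pieces from Parts 3 to 7 together gives the following minimal sketch. It assumes Value-only operands (no plain numbers) and drops label to keep the example short.

```python
class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()

a, b, c = Value(2.0), Value(-3.0), Value(10.0)
d = a * b + c
d.backward()
print(a.grad, b.grad, c.grad)   # -3.0 2.0 1.0

# A value used more than once receives gradient from every use.
x = Value(3.0)
y = x * x + x      # dy/dx = 2x + 1
y.backward()
print(x.grad)      # 7.0
```

The second example is why every _backward uses += rather than =: x appears three times in the expression, and each appearance contributes its share of the gradient.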
Step 4: What’s Next?
Now that autograd works, we can finally build neural networks.
PART 8: Building a Neuron
Step 1: What Needs To Be Implemented and Why?
A neuron is the basic building block of neural networks.
We need a structure that:
- stores weights
- combines inputs
- produces output
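A basic neuron in a neural network computes a weighted sum of its inputs plus a bias, then passes the result through an activation function.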
We use tanh as the activation function.
This introduces non-linearity into the network.
To support this operation, we must also add tanh inside our Value class.
Adding tanh To The Value Class
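# needs `import math` at the top of the file
def tanh(self):
    x = self.data
    t = (math.exp(2*x) - 1)/(math.exp(2*x) + 1)
    out = Value(t, (self, ), 'tanh')
    def _backward():
        self.grad += (1 - t**2) * out.grad
    out._backward = _backward
    return out
This function computes tanh(x) in the forward pass and registers the rule for sending the gradient back to its input.
The derivative of tanh is:
1 - tanh²(x)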
That gradient is used during backpropagation.
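Both formulas can be checked with plain floats. The sketch below compares the exponential form of tanh with math.tanh, and the derivative 1 - tanh²(x) with a numerical estimate; the test point x = 0.7 is arbitrary.

```python
import math

x = 0.7

# Forward value: the exponential form used in the method above.
t = (math.exp(2 * x) - 1) / (math.exp(2 * x) + 1)

# Derivative: closed form versus a central-difference estimate.
h = 1e-6
analytic = 1 - t ** 2
numeric = (math.tanh(x + h) - math.tanh(x - h)) / (2 * h)

print(t, math.tanh(x))
print(analytic, numeric)
```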
Expanding The Value Class
Now that we are supporting more mathematical operations, we should continue extending the Value class.
Real autograd systems like PyTorch support many operations.
So we gradually add more functionality.
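The training loop later in this post subtracts targets, squares the result, and calls sum(), so Value needs a few more operators. The sketch below shows one way to add them, modeled on how micrograd handles the same operations; whether the author's notebook uses exactly these definitions is an assumption. backward() is left out here to keep the example short.

```python
class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def __pow__(self, k):
        # k is a plain int or float here, not a Value
        out = Value(self.data ** k, (self,), f'**{k}')
        def _backward():
            self.grad += k * self.data ** (k - 1) * out.grad
        out._backward = _backward
        return out

    def __neg__(self):             # -a
        return self * -1

    def __sub__(self, other):      # a - b
        return self + (-other)

    def __radd__(self, other):     # 2 + a, and the 0 + a inside sum()
        return self + other

    def __rmul__(self, other):     # 2 * a
        return self * other

    def __rsub__(self, other):     # 2 - a
        return other + (-self)

err = (Value(0.5) - 1.0) ** 2
print(err.data)                    # 0.25

total = sum([Value(1.0), Value(2.0)])
print(total.data)                  # 3.0

x = Value(3.0)
y = x ** 2
y.grad = 1.0
y._backward()                      # apply only the local rule of **
print(x.grad)                      # 6.0
```

Subtraction and negation are built from addition and multiplication, so they inherit correct gradients without needing their own _backward.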
With multiplication, addition, and tanh in place, we can already express a neuron with two inputs and two weights: multiply each input by its weight, add the products together, and pass the result through the tanh non-linearity to get the final output.
We can also calculate the gradients through it.
But a real neural network does not contain just one neuron. It has an entire network of neurons arranged in multiple layers, for example several hidden layers.
So it is better to create a class for each of these pieces.
Step 2: What Are We Implementing?
# needs `import random` at the top of the file
class Neuron:
    def __init__(self, nin):
        # one random weight per input, plus a bias
        self.w = [Value(random.uniform(-1,1)) for _ in range(nin)]
        self.b = Value(0)
    def __call__(self, x):
        # w * x + b
        act = sum((wi*xi for wi,xi in zip(self.w, x)), self.b)
        # pass the weighted sum through tanh()
        out = act.tanh()
        return out
    def parameters(self):
        # all trainable parameters: the weights plus the bias
        return self.w + [self.b]
Step 3: What Is This Doing?
A neuron computes:
weighted sum + activation
Mathematically:
z = w·x + b
then:
a = tanh(z)
The weights determine:
which inputs matter more
parameters() returns the list of all trainable values of the neuron: its weights and its bias.
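The same computation with plain floats makes the formula concrete. This sketch leaves out Value and gradients entirely; the function neuron_forward and the numbers are made up for illustration.

```python
import math

def neuron_forward(w, x, b):
    # Weighted sum of the inputs plus the bias, then tanh.
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return math.tanh(z)

w = [0.5, -1.0]   # illustrative weights
x = [2.0, 1.0]    # illustrative inputs
b = 0.0

print(neuron_forward(w, x, b))                     # 0.0, because z = 1.0 - 1.0 + 0.0
print(neuron_forward([1.0, 1.0], [1.0, 1.0], 0.0)) # tanh(2.0), roughly 0.964
```

In the first call the two inputs cancel each other out. Change either weight and the output moves, which is what "the weights determine which inputs matter more" means in practice.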
Step 4: What’s Next?
Now we need many neurons together. That leads us to layers.
PART 9: Building Layers and MLP
Step 1: What Needs To Be Implemented and Why?
One neuron is too weak.
Neural networks become powerful by stacking neurons.
Step 2: What Are We Implementing?
Now that we have a single neuron, we need a way to combine many neurons together.
That leads us to:
Neuron → Layer → Multi Layer Perceptron (MLP)
A single neuron can only learn very simple patterns.
But deep learning becomes powerful when we stack:
- multiple neurons into layers
- multiple layers into networks
Building The Layer
A layer is simply:
a collection of neurons working in parallel
Visually:
          Neuron 1
        /
Input ---- Neuron 2
        \
          Neuron 3
Each neuron receives the same input.
But:
each neuron learns different weights
So each neuron learns different patterns.
Implementing Layer
class Layer:
    def __init__(self, nin, nout):
        self.neurons = [Neuron(nin) for _ in range(nout)]
    def __call__(self, x):
        outs = [n(x) for n in self.neurons]
        return outs[0] if len(outs) == 1 else outs
    def parameters(self):
        return [p for neuron in self.neurons for p in neuron.parameters()]
Building The MLP (Multi Layer Perceptron)
Now we stack multiple layers together.
Visually:
Input Layer
↓
Hidden Layer 1
↓
Hidden Layer 2
↓
Output Layer
This entire structure is called: Multi Layer Perceptron (MLP)
Implementing The MLP
class MLP:
    def __init__(self, nin, nouts):
        sz = [nin] + nouts
        self.layers = [Layer(sz[i], sz[i+1]) for i in range(len(nouts))]
    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x
    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]
Step 3: What Is This Doing?
A layer learns multiple patterns.
An MLP stacks layers together.
Input → Hidden Layers → Output
Stacking layers like this is what makes a network "deep".
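One way to check your understanding of how the sizes chain together is to count the trainable parameters by hand. The helper count_parameters below is a hypothetical function written for this purpose; it mirrors the sz list in MLP.__init__ and should give the same number as the length of the list returned by parameters().

```python
def count_parameters(nin, nouts):
    sizes = [nin] + nouts
    total = 0
    for fan_in, fan_out in zip(sizes, sizes[1:]):
        # Each neuron in the layer has fan_in weights and one bias.
        total += fan_out * (fan_in + 1)
    return total

print(count_parameters(2, [4, 4, 1]))   # 37
print(count_parameters(3, [4, 4, 1]))   # 41
```

For 2 inputs the layers contribute 4*(2+1) = 12, 4*(4+1) = 20 and 1*(4+1) = 5 parameters.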
Step 4: What’s Next?
Now we need learning.
That leads us to training.
PART 10: Training the Neural Network
This is where the network finally learns.
Step 1: What Needs To Be Implemented and Why?
We need:
- predictions
- loss calculation
- gradient computation
- parameter updates
This entire cycle is learning.
Step 2: What Are We Implementing?
Define the MLP:
n = MLP(2, [4, 4, 1])  # an MLP with 2 inputs, two hidden layers of size 4, and 1 output
inputs:
xs = [
[0.0, 0.0],
[1.0, 0.0],
[0.0, 1.0],
[1.0, 1.0]
]
ys = [0.0, 1.0, 1.0, 0.0] # desired targets
Training loop:
for k in range(1000):
    # forward pass
    ypred = [n(x) for x in xs]
    # compute the loss (sum of squared errors);
    # this needs Value to support -, ** and 0 + Value (via __radd__)
    loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))
    # backward pass: reset old gradients, then backpropagate
    for p in n.parameters():
        p.grad = 0.0
    loss.backward()
    # update: step against the gradient with learning rate 0.1
    for p in n.parameters():
        p.data += -0.1 * p.grad
    if k % 200 == 0:
        print(k, loss.data)
Forward pass:
ypred = [n(x) for x in xs]
Loss:
loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))
Backward:
loss.backward()
Update:
for p in n.parameters():
    p.data += -0.1 * p.grad
Step 3: What Is This Doing?
This is gradient descent.
The network:
makes prediction
measures error
finds responsibility
adjusts weights
repeatedly.
Over time:
loss decreases
predictions improve
That is learning.
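The same update rule can be watched on a much smaller problem: the function f(x) from Part 1, with x playing the role of a single parameter and f(x) the role of the loss. This is only a sketch; the starting point, learning rate, and number of steps are arbitrary.

```python
def f(x):
    return 3 * x**2 - 4 * x + 5

def df(x):
    # Exact derivative of f by the power rule.
    return 6 * x - 4

x = 3.0
learning_rate = 0.1
for step in range(50):
    # Move against the gradient, exactly like p.data += -lr * p.grad.
    x += -learning_rate * df(x)

print(x, f(x))
```

x settles at 2/3, the point where the derivative is zero and f reaches its minimum of 11/3. Training a network is this same loop, repeated for every parameter at once.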
PART 11: Results
Now that the neural network has been trained, we can finally inspect its predictions.
This helps us verify:
whether the network actually learned the patterns in the data
Viewing The Predictions
for i in range(len(xs)):
print(f"input: {xs[i]}, pred: {ypred[i].data:.3f} | target: {ys[i]}")
Output (yours may differ):
input: [0.0, 0.0], pred: -0.016 | target: 0.0
input: [1.0, 0.0], pred: 0.990 | target: 1.0
input: [0.0, 1.0], pred: 0.992 | target: 1.0
input: [1.0, 1.0], pred: -0.021 | target: 0.0
If you compare the predictions with the actual targets:
the network's predictions are already very close
Even though:
- this is a very small neural network
- only a few neurons are involved
- the autograd engine was built completely from scratch
it still learns surprisingly well.
This demonstrates how powerful backpropagation + gradients + gradient descent really are.
With larger networks, more layers, more neurons, and more training data:
this exact same idea scales into modern deep learning systems
including:
- image recognition
- language models
- generative AI
- recommendation systems
- speech recognition
All powered by the same core principles you implemented in this project.
Final Mental Model
This entire project can be understood using one simple flow.
Forward pass:
Make prediction
Loss:
Measure how wrong
Backward pass:
Find who caused the error
Gradient:
Measure contribution
Gradient descent:
Adjust parameters
Final Thoughts
The most powerful insight from building tinygrad is this:
PyTorch is not magic.
Modern frameworks are doing the exact same thing:
- computation graphs
- local derivatives
- chain rule
- backward propagation
- gradient descent
The only difference is scale and optimization.
By building a tiny autograd engine yourself, you strip away the abstractions and finally understand:
how neural networks actually learn
And once you understand that, deep learning becomes far less mysterious.
colab link: https://colab.research.google.com/drive/1W5shDFS-I1jECzouBGsW9qJOJqVWMIx2#scrollTo=gbq6HNQ1RNGz
THE END
Happy learning…Until next time….
blog by mahendra