Building a Neural Network by Understanding How Autograd Systems Like PyTorch Work

A Step-by-Step Journey Into Backpropagation and Autograd

Modern deep learning frameworks like PyTorch make neural networks feel almost magical.

loss.backward()

One line.

This blog is a deep, step-by-step walkthrough inspired by Andrej Karpathy’s micrograd project, where we build a tinygrad (tiny autograd engine) from scratch and then use it to build a neural network.

The goal is not just to use neural networks.

The goal is to understand:

how neural networks actually learn under the hood

What We’re Going to Build

We will build:

A tiny autograd engine

Automatic backpropagation

A computation graph system

A neuron

A neural network

Gradient descent training

And by the end, you’ll understand what frameworks like PyTorch are doing internally.

The Overall Learning Flow

Derivatives
    ↓
Computation Graph
    ↓
Local Gradients
    ↓
Chain Rule
    ↓
Backpropagation
    ↓
Autograd Engine
    ↓
Neuron
    ↓
Layer
    ↓
Neural Network
    ↓
Training

PART 1: Understanding Derivatives

Before building autograd, we first need to understand gradients.

Because neural networks learn through gradients.

Derivative: a fundamental tool that quantifies how sensitive a function's output is to a change in its input.

dy/dx represents the instantaneous rate of change of y with respect to x. It tells you how fast one variable changes relative to another at an exact moment.

What is dy/dx? In calculus, dy/dx is called the derivative. You can read it in three ways:

• 📈 The slope of a curve at a point.

• 🚗 An instantaneous speed.

• 🎯 The slope of the tangent line.

The Limit Definition

The formal definition of the derivative is:

f'(x) = lim h→0 [ f(x + h) - f(x) ] / h

Here is how to read the components:

• 🔹 f(x + h) - f(x): the vertical change (rise).

• 🔹 h: the horizontal change (run).

• 🔹 lim h→0: shrinks the step size toward zero.

What the derivative formula means

Take a tiny change in x:

x → x + h

Then the function changes from:

f(x) to f(x + h)

The change in output is:

f(x + h) - f(x)

Divide by the change in input h:

(f(x + h) - f(x)) / h

That gives the average rate of change.

Then let h become extremely small:

h → 0

and you get the instantaneous rate of change: the derivative.

Step 1: What Needs To Be Implemented and Why?

Neural networks learn by adjusting weights.

But how does the network know:

which weight caused the error?

That’s where gradients come in.

A gradient tells us:

How much changing a parameter changes the final loss.

Mathematically:

dLoss / dWeight

This is the foundation of backpropagation.

Step 2: What Are We Implementing?

We start with a simple mathematical function.

def f(x):
    return 3*x**2 - 4*x + 5

Step 3: What Is This Doing?

This function gives us a curve.

We now ask:

if x changes slightly, how much does f(x) change?

That is the derivative.

We numerically approximate it like this:

h = 0.000001
x = 2/3

(f(x + h) - f(x)) / h
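To see that this approximation behaves as expected, here is a small sketch that compares it with the derivative worked out by hand, 6x - 4. The helper names numerical_derivative and analytic_derivative are made up for this example and are not part of the original code.

```python
def f(x):
    return 3 * x**2 - 4 * x + 5

def numerical_derivative(func, x, h=1e-6):
    # Forward-difference version of the limit definition, with a small but finite h.
    return (func(x + h) - func(x)) / h

def analytic_derivative(x):
    # Differentiating 3x^2 - 4x + 5 by hand gives 6x - 4.
    return 6 * x - 4

for x in (3.0, -3.0, 2 / 3):
    approx = numerical_derivative(f, x)
    exact = analytic_derivative(x)
    print(f"x={x:.4f}  numerical={approx:.6f}  analytic={exact:.6f}")
```

At x = 2/3 both values are close to zero: that is the bottom of the curve, where the slope is flat.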

Step 4: What’s Next?

Now that we understand derivatives, we need to understand:

how derivatives flow through multiple operations

That leads us to computation graphs.

PART 2: Building the Computation Graph

Neural networks are not one equation.

They are chains of many small mathematical operations.

To track gradients through these operations, we need a graph.

Step 1: What Needs To Be Implemented and Why?

If a loss depends on many operations:

weight → multiply → activation → output → loss

then we need a way to remember:

where values came from

which operations created them

how gradients should flow backward

This is called a computation graph.

Step 2: What Are We Implementing?

We create a small expression:

a = 2.0
b = -3.0
c = 10.0

d = a * b + c

Step 3: What Is This Doing?

Even though it looks simple, this creates a graph.

Each node depends on previous nodes.

This is exactly how neural networks work internally.

Step 4: What’s Next?

Now we need a way for every number to remember:

how it was created

That leads us to the Value object.

PART 3: Building the Value Object

This is the heart of the autograd engine.

Step 1: What Needs To Be Implemented and Why?

Normal Python numbers do not remember:

previous operations

gradients

dependencies

But autograd systems need this information.

So we need a custom object.

Step 2: What Are We Implementing?

We build a Value class.

class Value:
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self.grad = 0.0                  # no gradient known yet
        self._backward = lambda: None    # leaf nodes have nothing to propagate
        self._prev = set(_children)
        self._op = _op
        self.label = label

    def __repr__(self):
        return f"Value(data={self.data})"

The __repr__ method controls how a Value object is displayed when it is printed.

Step 3: What Is This Doing?

Each Value object stores:

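data: the number this node holds

grad: the derivative of the final output with respect to this node (starts at 0.0)

_backward: a function that passes this node's gradient on to its inputs (does nothing by default)

_prev: the set of Value objects this node was computed from

_op: the operation that produced this node

label: an optional name for display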

This means every value now remembers:

where it came from

That’s the foundation of autograd.

Step 4: What’s Next?

Now we need mathematical operations to build graphs automatically.

PART 4: Implementing Mathematical Operations

Step 1: What Needs To Be Implemented and Why?

When we do:

c = a + b

we want:

a new Value object

graph tracking

gradient tracking

So operations themselves must become graph-aware.

Step 2: What Are We Implementing?

Addition:

def __add__(self, other):
    out = Value(self.data + other.data, (self, other), '+')
    return out

Multiplication:

def __mul__(self, other):
    out = Value(self.data * other.data, (self, other), '*')
    return out

Add these methods to our Value class so that we can perform addition and multiplication.

Step 3: What Is This Doing?

Every operation now:

creates a new node and links parent nodes

This builds the computation graph dynamically.

Much like PyTorch, which also builds its graph dynamically as operations run.
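Here is a minimal sketch of the class so far, rebuilding the expression d = a * b + c from Part 2 and inspecting the graph it leaves behind. It only combines the snippets above; no gradients are involved yet.

```python
class Value:
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op
        self.label = label

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), '+')

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), '*')

a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = Value(10.0, label='c')
d = a * b + c

print(d)        # Value(data=4.0)
print(d._op)    # +
print(d._prev)  # the two parents: the product a*b and c (set order may vary)
```

Notice that the intermediate product a * b became a node of its own, even though we never gave it a name.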

Step 4: What’s Next?

Now we need gradients.

We need to teach every operation:

how its output changes with respect to its inputs

That leads to local gradients.

PART 5: Local Gradients

Step 1: What Needs To Be Implemented and Why?

Every mathematical operation has its own derivative rules.

Example:

For multiplication:

z = x * y

Then:

dz/dx = y
dz/dy = x

These are local gradients.

Without them, backpropagation cannot happen.

Step 2: What Are We Implementing?

Inside multiplication:

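def __mul__(self, other):
    # wrap plain numbers so that Value * 2.0 also works
    other = other if isinstance(other, Value) else Value(other)
    out = Value(self.data * other.data, (self, other), '*')

    def _backward():
        # d(out)/d(self) = other.data, d(out)/d(other) = self.data
        self.grad += other.data * out.grad
        other.grad += self.data * out.grad

    out._backward = _backward

    return out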

Do the same for addition:

def __add__(self, other):
    other = other if isinstance(other, Value) else Value(other)
    out = Value(self.data + other.data, (self, other), '+')

    def _backward():
        # addition passes the gradient through unchanged to both inputs
        self.grad += 1.0 * out.grad
        other.grad += 1.0 * out.grad

    out._backward = _backward

    return out

Step 3: What Is This Doing?

This is extremely important.

Each operation now knows:

how gradients should flow backward

The operation stores its own tiny backward rule.

That’s the core idea behind autograd.
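To make this concrete, here is a sketch that calls those tiny backward rules by hand on d = a * b + c. Nothing is automated yet, so we set the starting gradient ourselves and call _backward() in the right order. The variable name e for the intermediate product is only for this example.

```python
class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

a, b, c = Value(2.0), Value(-3.0), Value(10.0)
e = a * b
d = e + c

d.grad = 1.0    # the output changes one-for-one with itself
d._backward()   # the '+' rule fills e.grad and c.grad
e._backward()   # the '*' rule fills a.grad and b.grad

print(a.grad, b.grad, c.grad)
```

a.grad comes out equal to b's value and b.grad equal to a's value, which is exactly the multiplication rule dz/dx = y, dz/dy = x.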

Step 4: What’s Next?

Now we need to combine local gradients across the graph.

That leads us to the chain rule.

PART 6: Chain Rule and Backpropagation

This is the core mathematical idea behind neural network learning.

Step 1: What Needs To Be Implemented and Why?

A weight does not affect the loss directly.

It affects:

weight → neuron → activation → output → loss

So gradients must flow through many operations.

This is done using the chain rule.

Step 2: What Are We Implementing?

The chain rule:

dLoss/dw = (dLoss/da) * (da/dz) * (dz/dw)

Step 3: What Is This Doing?

Backpropagation is simply:

multiplying many tiny local derivatives together

across the graph.

That’s it.

There is no magic.
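A worked example may help here. The sketch below uses one weight, one input, a tanh activation, and a squared-error loss, then multiplies the three local derivatives and compares the product with a numerical estimate. The numbers x, y, and w are arbitrary values picked for illustration.

```python
import math

x, y = 0.5, 1.0   # hypothetical input and target
w = 0.8           # hypothetical weight

def loss_fn(w):
    z = w * x             # dz/dw = x
    a = math.tanh(z)      # da/dz = 1 - a^2
    return (a - y) ** 2   # dLoss/da = 2 * (a - y)

# Forward pass, keeping the intermediate values
z = w * x
a = math.tanh(z)

# Local derivatives, one per operation
dL_da = 2 * (a - y)
da_dz = 1 - a**2
dz_dw = x

# Chain rule: multiply them together
dL_dw = dL_da * da_dz * dz_dw

# Numerical check with a central difference
h = 1e-6
numeric = (loss_fn(w + h) - loss_fn(w - h)) / (2 * h)

print(f"chain rule: {dL_dw:.8f}")
print(f"numerical:  {numeric:.8f}")
```

The two numbers should agree to several decimal places.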

Step 4: What’s Next?

Now we need to automate gradient flow.

That leads us to backward()

PART 7: Automating Backpropagation

Step 1: What Needs To Be Implemented and Why?

Manually computing gradients for every node is impractical in large networks.

We need automatic backpropagation.

Step 2: What Are We Implementing?

We need a backward() method that computes all the gradients automatically. The order matters: a node's gradient is only complete once every node that uses it has passed its contribution back. So we first sort the graph topologically, then walk it in reverse, starting from the output.

Add this to the Value class:

def backward(self):
    topo = []
    visited = set()

    def build_topo(v):
        # add a node only after all the nodes it was computed from
        if v not in visited:
            visited.add(v)
            for child in v._prev:
                build_topo(child)
            topo.append(v)

    build_topo(self)

    # the output's gradient with respect to itself is 1
    self.grad = 1.0
    for node in reversed(topo):
        node._backward()

Step 3: What Is This Doing?

Topological sorting ensures:

every node appears after the nodes it was computed from

Walking that order in reverse starts at the output, so each node's gradient is complete before it is passed further back.

for node in reversed(topo):
    node._backward()

This is the complete autograd engine.
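Putting the pieces together, here is a compact sketch of the engine with a single backward() call doing all the work. The second example reuses the same value twice, which shows why the backward rules accumulate with += rather than assign with =.

```python
class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()

# Example 1: d = a * b + c
a, b, c = Value(2.0), Value(-3.0), Value(10.0)
d = a * b + c
d.backward()
print(a.grad, b.grad, c.grad)

# Example 2: x is used three times, so its gradient is a sum of contributions
x = Value(3.0)
y = x * x + x   # dy/dx = 2x + 1
y.backward()
print(x.grad)
```

With x = 3.0 the second gradient is 2 * 3 + 1 = 7. If the rules used = instead of +=, the later contributions would overwrite the earlier ones.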

Step 4: What’s Next?

Now that autograd works, we can finally build neural networks.

PART 8: Building a Neuron

Step 1: What Needs To Be Implemented and Why?

A neuron is the basic building block of neural networks.

We need a structure that:

stores weights

combines inputs

produces output

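A basic neuron in a neural network takes a weighted sum of its inputs, adds a bias, and passes the result through an activation function.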

We use tanh as the activation function.

This introduces non-linearity into the network.

To support this operation, we must also add tanh inside our Value class.

Adding tanh To The Value Class

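# requires `import math` at the top of the file
def tanh(self):
    x = self.data
    t = (math.exp(2*x) - 1) / (math.exp(2*x) + 1)  # tanh(x)
    out = Value(t, (self,), 'tanh')

    def _backward():
        # local gradient of tanh is 1 - tanh(x)^2
        self.grad += (1 - t**2) * out.grad
    out._backward = _backward

    return out

This method computes tanh(x) in the forward pass and registers its local gradient for the backward pass.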

The derivative of tanh is:

1 - tanh²(x)

That gradient is used during backpropagation.
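If you want to convince yourself of that derivative, a quick numerical check like the one below is enough. This is only a sanity-check sketch using Python's built-in math.tanh; the helper name tanh_grad is made up for this example.

```python
import math

def tanh_grad(x):
    # Local gradient used in the backward rule: 1 - tanh(x)^2
    t = math.tanh(x)
    return 1 - t**2

h = 1e-6
for x in (-2.0, 0.0, 0.5, 2.0):
    # Central difference estimate of the slope of tanh at x
    numeric = (math.tanh(x + h) - math.tanh(x - h)) / (2 * h)
    print(f"x={x:+.1f}  formula={tanh_grad(x):.6f}  numerical={numeric:.6f}")
```

The gradient is largest at x = 0 and shrinks toward zero as |x| grows, which is where tanh flattens out.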

Expanding The Value Class

Now that we are supporting more mathematical operations, we should continue extending the Value class.

Real autograd systems like PyTorch support many operations.

So we gradually add more functionality.
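The training loop later in this post subtracts values, squares them, and adds them up with sum(), which starts from the plain integer 0. The sketch below shows one possible set of extra methods that would cover those cases, in the spirit of micrograd. Treat it as an assumption about what is needed rather than the exact code from the notebook: __pow__ here only handles plain number exponents.

```python
class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def __pow__(self, k):
        # Sketch limitation: the exponent must be a plain int or float.
        assert isinstance(k, (int, float))
        out = Value(self.data ** k, (self,), f'**{k}')
        def _backward():
            # power rule: d(x^k)/dx = k * x^(k-1)
            self.grad += k * self.data ** (k - 1) * out.grad
        out._backward = _backward
        return out

    def __neg__(self):            # -self
        return self * -1

    def __sub__(self, other):     # self - other
        return self + (-other)

    def __radd__(self, other):    # other + self, e.g. the 0 that sum() starts with
        return self + other

    def __rmul__(self, other):    # other * self, e.g. 2 * Value(3.0)
        return self * other

    def backward(self):
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()

y_pred = Value(0.5)
loss = sum([(y_pred - 1.0) ** 2])   # works because of __sub__, __pow__, and __radd__
loss.backward()
print(loss.data, y_pred.grad)
```

Subtraction and negation are built from addition and multiplication, so they need no backward rules of their own.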

With addition, multiplication, and tanh in place, we can already express a neuron with two inputs and two weights. We multiply each input by its weight, add the products together, and pass the result through the tanh non-linearity to get the final output.

We can also compute gradients through it.
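Here is a sketch of such a two-input neuron written directly with Value objects. The inputs, weights, and the bias b are illustrative values chosen so that the numbers come out round; they are not taken from the notebook.

```python
import math

class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        x = self.data
        t = (math.exp(2 * x) - 1) / (math.exp(2 * x) + 1)
        out = Value(t, (self,), 'tanh')
        def _backward():
            self.grad += (1 - t**2) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()

# Illustrative inputs, weights, and bias (hypothetical values)
x1, x2 = Value(2.0), Value(0.0)
w1, w2 = Value(-3.0), Value(1.0)
b = Value(6.8813735870195432)

n = x1 * w1 + x2 * w2 + b   # weighted sum plus bias
o = n.tanh()                # activation
o.backward()

print(f"output  = {o.data:.4f}")
print(f"x1.grad = {x1.grad:.4f}, w1.grad = {w1.grad:.4f}")
print(f"x2.grad = {x2.grad:.4f}, w2.grad = {w2.grad:.4f}")
```

Notice that w2.grad is zero: its input x2 is 0, so changing w2 cannot change the output at this point.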

But a real neural network contains more than one neuron. It is an entire network of neurons arranged in multiple layers, for example several hidden layers.

So it is cleaner to wrap each of these pieces in its own class.

Step 2: What Are We Implementing?

import random

class Neuron:
    def __init__(self, nin):
        # one random weight per input, plus a bias
        self.w = [Value(random.uniform(-1, 1)) for _ in range(nin)]
        self.b = Value(0.0)

    def __call__(self, x):
        # w * x + b
        act = sum((wi * xi for wi, xi in zip(self.w, x)), self.b)
        # pass the weighted sum through tanh
        out = act.tanh()
        return out

    def parameters(self):
        # all trainable parameters: the weights and the bias
        return self.w + [self.b]

Step 3: What Is This Doing?

A neuron computes:

weighted sum + activation

Mathematically:

z = w·x + b

then:

a = tanh(z)

The weights determine:

which inputs matter more

parameters() returns the list of all trainable values in the neuron: its weights plus its bias.
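To try the neuron on its own, here is a small sketch. To keep it short it uses a forward-only stand-in for Value (no gradients), which is an assumption made just for this example; in the real project you would use the full Value class built above.

```python
import math
import random

class Value:
    """Forward-only stand-in for the Value class (no gradients), for brevity."""
    def __init__(self, data):
        self.data = data

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data)

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data)

    def tanh(self):
        return Value(math.tanh(self.data))

class Neuron:
    def __init__(self, nin):
        self.w = [Value(random.uniform(-1, 1)) for _ in range(nin)]
        self.b = Value(0.0)

    def __call__(self, x):
        act = sum((wi * xi for wi, xi in zip(self.w, x)), self.b)
        return act.tanh()

    def parameters(self):
        return self.w + [self.b]

random.seed(0)                 # fixed seed so repeated runs give the same weights
neuron = Neuron(2)             # a neuron with two inputs
out = neuron([1.0, -2.0])

print(out.data)                  # some value between -1 and 1
print(len(neuron.parameters()))  # 3: two weights and one bias
```

Because of tanh, the output always lies between -1 and 1, no matter how large the weighted sum gets.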

Step 4: What’s Next?

Now we need many neurons together. That leads us to layers.

PART 9: Building Layers and MLP

Step 1: What Needs To Be Implemented and Why?

One neuron is too weak.

Neural networks become powerful by stacking neurons.

Step 2: What Are We Implementing?

Now that we have a single neuron, we need a way to combine many neurons together.

That leads us to:

Neuron → Layer → Multi Layer Perceptron (MLP)

A single neuron can only learn very simple patterns.

But deep learning becomes powerful when we stack:

multiple neurons into layers

multiple layers into networks

Building The Layer

A layer is simply:

a collection of neurons working in parallel

Visually:

            Neuron 1
           /
Input ---- Neuron 2
           \
            Neuron 3

Each neuron receives the same input.

But:

each neuron learns different weights

So each neuron learns different patterns.

Implementing Layer

class Layer:

  def __init__(self, nin, nout):
    self.neurons = [Neuron(nin) for _ in range(nout)]

  def __call__(self, x):
    outs = [n(x) for n in self.neurons]
    # return a single Value instead of a one-element list
    return outs[0] if len(outs) == 1 else outs

  def parameters(self):
    return [p for neuron in self.neurons for p in neuron.parameters()]

Building The MLP (Multi Layer Perceptron)

Now we stack multiple layers together.

Visually:

Input Layer
     ↓
Hidden Layer 1
     ↓
Hidden Layer 2
     ↓
Output Layer

This entire structure is called: Multi Layer Perceptron (MLP)

Implementing The MLP

class MLP:
  
  def __init__(self, nin, nouts):
    sz = [nin] + nouts
    self.layers = [Layer(sz[i], sz[i+1]) for i in range(len(nouts))]
  
  def __call__(self, x):
    for layer in self.layers:
      x = layer(x)
    return x
  
  def parameters(self):
    return [p for layer in self.layers for p in layer.parameters()]

Step 3: What Is This Doing?

A layer learns multiple patterns.

An MLP stacks layers together.

Input → Hidden Layers → Output

Stacking layers like this is what makes a network deep.
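One way to get a feel for how these pieces stack up is to count the parameters. The sketch below mirrors the sz = [nin] + nouts logic from the MLP class: every neuron has one weight per input plus one bias. The function name count_parameters is made up for this example.

```python
def count_parameters(nin, nouts):
    sizes = [nin] + nouts
    total = 0
    # Pair each layer's input size with its output size, as the MLP constructor does.
    for fan_in, fan_out in zip(sizes, sizes[1:]):
        # fan_out neurons, each with fan_in weights and 1 bias
        total += fan_out * (fan_in + 1)
    return total

# 3 inputs, two hidden layers of 4 neurons, 1 output neuron
print(count_parameters(3, [4, 4, 1]))  # 4*4 + 4*5 + 1*5 = 41
```

This should match len(mlp.parameters()) for an MLP built with the same sizes.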

Step 4: What’s Next?

Now we need learning.

That leads us to training.

PART 10: Training the Neural Network

This is where the network finally learns.

Step 1: What Needs To Be Implemented and Why?

We need:

predictions

loss calculation

gradient computation

parameter updates

This entire cycle is learning.

Step 2: What Are We Implementing?

Define the MLP:

n = MLP(2, [4, 4, 1])  # an MLP with 2 inputs, two hidden layers of size 4, and 1 output

Inputs and targets (the XOR function):

xs = [
  [0.0, 0.0],
  [1.0, 0.0],
  [0.0, 1.0],
  [1.0, 1.0]
]
ys = [0.0, 1.0, 1.0, 0.0] # desired targets

Training loop:

for k in range(1000):
  
  # forward pass
  ypred = [n(x) for x in xs]
  
  # calc the loss
  loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))
  
  # backward pass
  for p in n.parameters():
    p.grad = 0.0
  loss.backward()
  
  # update
  for p in n.parameters():
    p.data += -0.1 * p.grad  # 0.1 is the learning rate; the minus sign steps against the gradient
  if k % 200 == 0:
    print(k, loss.data)

Forward pass:

ypred = [n(x) for x in xs]

Loss:

loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))

Backward:

loss.backward()

Update:

for p in n.parameters():
    p.data += -0.1 * p.grad

Step 3: What Is This Doing?

This is gradient descent.

The network:

makes a prediction
measures the error
finds which parameters are responsible
adjusts the weights

repeatedly.

Over time:

loss decreases
predictions improve

That is learning.
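The same update rule can be seen in miniature on the function from Part 1. In this sketch the single "parameter" is x, the "loss" is f(x), and the gradient is written out by hand instead of coming from backward(). The starting point and step count are arbitrary choices for illustration.

```python
def f(x):
    return 3 * x**2 - 4 * x + 5

def grad_f(x):
    # derivative of f, worked out by hand
    return 6 * x - 4

x = 3.0               # arbitrary starting point
learning_rate = 0.1

for step in range(50):
    # step against the gradient, exactly like p.data += -lr * p.grad
    x -= learning_rate * grad_f(x)
    if step % 10 == 0:
        print(step, x, f(x))

print("final x:", x)
```

x settles near 2/3, the point where the slope is zero and f is at its minimum. Training a network does the same thing, only with many parameters at once.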

PART 11: Results

Now that the neural network has been trained, we can finally inspect its predictions.

This helps us verify:

whether the network actually learned the patterns in the data

Viewing The Predictions

for i in range(len(xs)):
  print(f"input: {xs[i]}, pred: {ypred[i].data:.3f} | target: {ys[i]}")

Output (yours may differ, because the weights are initialized randomly):

input: [0.0, 0.0], pred: -0.016 | target: 0.0
input: [1.0, 0.0], pred: 0.990 | target: 1.0
input: [0.0, 1.0], pred: 0.992 | target: 1.0
input: [1.0, 1.0], pred: -0.021 | target: 0.0

If you compare the predictions with the actual targets:

the network predictions are already very close

Even though:

this is a very small neural network

only a few neurons are involved

the autograd engine was built completely from scratch

it still learns surprisingly well.

This demonstrates how powerful backpropagation + gradients + gradient descent really are.

With larger networks, more layers, more neurons, and more training data:

this exact same idea scales into modern deep learning systems

including:

image recognition

language models

generative AI

recommendation systems

speech recognition

All powered by the same core principles you implemented in this project.

Final Mental Model

This entire project can be understood using one simple flow.

Forward pass:
	Make prediction

Loss:
	Measure how wrong

Backward pass:
	Find who caused the error

Gradient:
	Measure contribution

Gradient descent:
	Adjust parameters

Final Thoughts

The most powerful insight from building tinygrad is this:

PyTorch is not magic.

Modern frameworks are doing the exact same thing:

computation graphs

local derivatives

chain rule

backward propagation

gradient descent

The only difference is scale and optimization.

By building a tiny autograd engine yourself, you strip away the abstractions and finally understand:

how neural networks actually learn

And once you understand that, deep learning becomes far less mysterious.

colab link: https://colab.research.google.com/drive/1W5shDFS-I1jECzouBGsW9qJOJqVWMIx2#scrollTo=gbq6HNQ1RNGz

THE END

Happy learning…Until next time….

blog by mahendra