
My POV on “Efficient Estimation of Word Representations in Vector Space”


Research paper link: https://arxiv.org/pdf/1301.3781


The other day, I started reading the famous Word2Vec paper:

“Efficient Estimation of Word Representations in Vector Space” by Tomas Mikolov and team.

At first, I thought:

“Okay cool… word embeddings.”

But the deeper I went, the more I realized this paper quietly changed NLP forever.

And honestly, I didn’t fully understand it just by reading.

I had to:

Only then things started clicking.

So this blog is basically me walking through the paper in the same order I understood it.

Not as a researcher.

Just as someone trying to genuinely understand what the authors were trying to solve.


The First Thing the Paper Says

The paper starts with this line:

NLP systems treat words as atomic units.

This sentence looks simple.

But it’s actually the whole problem.


What Does “Atomic Unit” Mean?

Basically:

word_to_id= {
"king":0,
"queen":1,
"banana":2
}

To older NLP systems:

That’s it.

The machine has absolutely no idea:

Every word is isolated.

When I understood this, I realized:

traditional NLP was mostly memorization.

Not understanding.
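To make the "isolated words" point concrete, here is a minimal sketch of my own (not from the paper). It turns each ID into a one-hot vector and measures overlap with a dot product. The helper names `one_hot` and `dot` are just made up for this illustration.

```python
# Same toy vocabulary as the word_to_id mapping above.
word_to_id = {"king": 0, "queen": 1, "banana": 2}

def one_hot(word):
    """Return a vector with a single 1 at the word's ID position."""
    vec = [0] * len(word_to_id)
    vec[word_to_id[word]] = 1
    return vec

def dot(a, b):
    """Plain dot product, used here as a crude overlap score."""
    return sum(x * y for x, y in zip(a, b))

# Any two different words overlap by exactly 0, so "king" is as
# unrelated to "queen" as it is to "banana".
print(dot(one_hot("king"), one_hot("queen")))   # 0
print(dot(one_hot("king"), one_hot("banana")))  # 0
print(dot(one_hot("king"), one_hot("king")))    # 1
```

With IDs alone, there is simply nowhere for similarity to live.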


Then the Paper Talks About N-Grams

This part is important historically.

Before neural networks became dominant,

N-grams were everywhere.

The paper mentions N-gram models because they were the standard approach for language modeling.


So What is an N-Gram?

Simple idea.

Predict the next word using previous words.

Example:

sentence= ["I","love","machine"]

# predict next word

Maybe:

"learning"

If using:

Example:

P("learning"|"machine")

The model just learns probabilities from huge text datasets.
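Here is a rough sketch of how a bigram model could estimate something like P("learning" | "machine") by counting. The three-sentence corpus and the function name `bigram_prob` are my own assumptions for illustration; real N-gram systems are trained on far more text and add smoothing.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus, far smaller than anything used in practice.
corpus = [
    "i love machine learning",
    "i love machine translation",
    "machine learning is fun",
]

# bigram_counts[prev][nxt] = how often `nxt` directly follows `prev`
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        bigram_counts[prev][nxt] += 1

def bigram_prob(nxt, prev):
    """Estimate P(nxt | prev) as count(prev, nxt) / count(prev, anything)."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][nxt] / total if total else 0.0

# "machine" is followed by "learning" twice and "translation" once.
print(bigram_prob("learning", "machine"))  # 2/3
```

Nothing here knows what "learning" means. It is pure counting.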


Why N-Grams Worked

Because they were:

You could train them on billions of words.

And for years, that was enough.

But then the paper says something important:

simple techniques are reaching their limits

Because N-grams memorize exact sequences.

They don’t understand meaning.


The Limitation That Changed Everything

Imagine the model sees:

"I love cats"

but never sees:

"I love dogs"

A human instantly understands both.

Because:

cat≈dog

But N-grams don’t understand similarity.

They only understand exact patterns.
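A small sketch of that failure, assuming an unsmoothed bigram model and a made-up training text in which "dogs" never follows "love":

```python
from collections import Counter

# Hypothetical training text: "love cats" appears, "love dogs" never does.
tokens = "i love cats . cats love fish . i see dogs".split()

pair_counts = Counter(zip(tokens, tokens[1:]))  # counts of (prev, next)
prev_counts = Counter(tokens[:-1])              # counts of prev words

def prob(nxt, prev):
    """Unsmoothed bigram estimate of P(nxt | prev)."""
    if prev_counts[prev] == 0:
        return 0.0
    return pair_counts[(prev, nxt)] / prev_counts[prev]

print(prob("cats", "love"))  # seen pair: 0.5
print(prob("dogs", "love"))  # unseen pair: 0.0
```

The model has seen "dogs", and it has seen "love", but it has no way to transfer what it knows about cats to dogs.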

That’s where the paper introduces a massive shift:

distributed representations of words

Distributed Representations (The Core Idea)

Fancy term.

Simple meaning.

Instead of storing words as IDs:

"king"=0

represent them as vectors:

king= [0.21,-0.44,0.78]
queen= [0.19,-0.40,0.80]

At first these numbers looked meaningless to me.

But then I realized:

vectors create geometry.

Now:

Meaning becomes measurable.
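One common way to measure it is cosine similarity. Below is a minimal sketch using the king and queen vectors from above. The `banana` vector is a hypothetical one I added just for contrast.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two vectors (1 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

king = [0.21, -0.44, 0.78]
queen = [0.19, -0.40, 0.80]
banana = [-0.65, 0.30, 0.05]  # hypothetical vector, not from the post

print(cosine(king, queen))   # very close to 1
print(cosine(king, banana))  # negative: pointing in a different direction
```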


The Moment That Blew My Mind

Then comes the famous equation:

king - man + woman ≈ queen

The paper explains that simple vector arithmetic can recover semantic relationships.

This genuinely changed how I think about embeddings.

Because the model was not memorizing.

It was learning relationships geometrically.

That’s insane.

Because if each word was represented by just a single number, this would make absolutely no sense.

How can:

5 - 2 + 3 = queen ?

That’s impossible.

But then I realized:

the model is not representing words as single values.

It represents them as high-dimensional vectors.

And suddenly this starts becoming intuitive.

For example, imagine:

Dimension 1 -> gender
Dimension 2 -> royalty
Dimension 3 -> age
Dimension 4 -> power
...

Then maybe:

So now:

king - man

could roughly mean:

“remove masculine features from king”, which still leaves “royalty” and “power”

and:

+ woman

could mean:

“add feminine features” alongside royalty and power

which moves the vector toward:

queen

Of course, the model never explicitly labels dimensions like “gender” or “royalty”.

That’s the fascinating part.

These relationships emerge naturally from training on context.
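To play with the intuition, here is a toy sketch where I hand-assign the dimensions (gender, royalty, age, power). This is purely an illustration under that assumption: a trained model does not have labelled dimensions, and every number below is made up.

```python
import math

# Hand-made vectors: [gender (+1 male, -1 female), royalty, age, power].
# Illustration only; real embeddings are learned, not hand-labelled.
vectors = {
    "king":  [ 1.0, 1.0, 0.6, 0.9],
    "queen": [-1.0, 1.0, 0.6, 0.9],
    "man":   [ 1.0, 0.0, 0.5, 0.3],
    "woman": [-1.0, 0.0, 0.5, 0.3],
    "boy":   [ 1.0, 0.0, 0.1, 0.1],
    "girl":  [-1.0, 0.0, 0.1, 0.1],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# king - man + woman, computed dimension by dimension
result = [
    k - m + w
    for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])
]

# Search every word except the three used in the arithmetic.
candidates = {w: v for w, v in vectors.items() if w not in ("king", "man", "woman")}
best = max(candidates, key=lambda w: cosine(result, candidates[w]))
print(best)  # queen
```

The gender component flips while royalty and power survive, which is exactly the "remove masculine, add feminine" story.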


What the Authors Actually Wanted

At this point I thought:

“Okay embeddings are cool.”

But then I understood the real goal of the paper.

The paper wasn’t just trying to create embeddings.

It was trying to create:

The authors repeatedly focus on computational complexity and scalability.

That’s the real innovation.


Then the Paper Starts Discussing Older Neural Models

This section confused me initially.

Because suddenly:

started appearing everywhere.

So I had to slow down.


LSA and LDA

The paper briefly mentions:

These are older NLP techniques.


LSA

LSA tries to find relationships between words using matrix decomposition.

Very roughly:

word-document matrix
 -> compress it
 -> discover latent meaning

Kind of like:

But:


LDA

LDA is more topic-based.

It tries to discover:

Example:

But the paper says:

LDA becomes computationally expensive on huge datasets.

And scalability is the whole focus here.


SGD and Backpropagation

The paper mentions:

models are trained using stochastic gradient descent and backpropagation

This part became much clearer after implementing tiny neural networks and backpropagation/autograd by myself. (ref blog for NN here, backprop/autograd here)


SGD (Stochastic Gradient Descent)

Very simple intuition:

prediction ---> calculate error ---> slightly adjust weights

Repeated millions of times.

Tiny improvements over time.

That’s learning.
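A minimal sketch of that loop, assuming a one-weight model `prediction = w * x` and made-up data generated from y = 3x. The learning rate and step count are arbitrary choices of mine.

```python
import random

random.seed(0)

# Hypothetical data generated from y = 3x, so the ideal weight is 3.0.
data = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0]]

w = 0.0    # start with a bad guess
lr = 0.05  # learning rate: how big each adjustment is

for step in range(200):
    x, y = random.choice(data)  # "stochastic": one random example per step
    prediction = w * x
    error = prediction - y
    gradient = 2 * error * x    # derivative of (w*x - y)**2 with respect to w
    w -= lr * gradient          # slightly adjust the weight

print(round(w, 3))  # ends up close to 3.0
```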


Backpropagation

This was honestly mysterious to me at first.

But now I think of it as:

“How does the model know which weights caused the error?”

Backpropagation calculates gradients:

Without backprop:

neural networks basically cannot learn efficiently. (more here)
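Here is a tiny sketch of the chain rule that backpropagation applies, for a hypothetical two-weight network `y = w2 * tanh(w1 * x)` with squared error. The network shape and the numbers are my own assumptions. The finite-difference check at the end is only there to confirm the hand-derived gradients.

```python
import math

def loss_and_grads(w1, w2, x, target):
    # forward pass
    h = math.tanh(w1 * x)
    y = w2 * h
    loss = (y - target) ** 2

    # backward pass: walk the chain rule from the loss back to each weight
    dloss_dy = 2 * (y - target)
    dloss_dw2 = dloss_dy * h                # y = w2 * h
    dloss_dh = dloss_dy * w2
    dloss_dw1 = dloss_dh * (1 - h * h) * x  # d tanh(z)/dz = 1 - tanh(z)**2
    return loss, dloss_dw1, dloss_dw2

w1, w2, x, target = 0.5, -0.3, 1.2, 1.0
loss, g1, g2 = loss_and_grads(w1, w2, x, target)

# Numerical check: nudge each weight a little and see how the loss moves.
eps = 1e-6
num_g1 = (loss_and_grads(w1 + eps, w2, x, target)[0]
          - loss_and_grads(w1 - eps, w2, x, target)[0]) / (2 * eps)
num_g2 = (loss_and_grads(w1, w2 + eps, x, target)[0]
          - loss_and_grads(w1, w2 - eps, x, target)[0]) / (2 * eps)

print(g1, num_g1)  # the two numbers should agree closely
print(g2, num_g2)
```

Each gradient answers the question from above: how much did this particular weight contribute to the error?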


Feedforward Neural Network Language Model (NNLM)

Then the paper introduces NNLM.

This is where embeddings started becoming practical.

At this point in the paper, I realized almost everything revolves around one simple task:

predicting the next word.

This task is called:

language modeling

For example:

"I love machine"

A language model tries to predict:

"learning"

Or:

"The king rules the"

Predict:

"kingdom"

That’s basically what language modeling is:

learning the probability of what word comes next given previous words.

Before Word2Vec, researchers had already started using neural networks for this task.

One important approach was called:

Feedforward Neural Network Language Model (NNLM)

At first the name sounded complicated to me, but it’s actually pretty simple.

A feedforward neural network is basically:

-> input -> hidden layer -> output

Information only moves forward.

No memory.

No recurrence.

Just forward computation.

So in NNLM, the pipeline becomes:

previous words
→ embeddings
→ neural network
→ predict next word

And this was a huge improvement over traditional N-grams because now the model could learn similarities between words instead of just memorizing exact sequences.

My Understanding of NNLM

The flow is basically:

words -> embeddings -> hiddenlayer -> predict next word

Example:

["I","love","machine"]
----> predict:
							"learning"

The huge realization here was:

embeddings are learned while solving another task.

The model isn’t directly told:

“learn meaning.”

It learns meaning while trying to predict words.

That’s beautiful.
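For a feel of the words -> embeddings -> hidden layer -> prediction flow, here is a rough forward-pass sketch in plain Python. It is untrained, leaves out biases, and uses tiny sizes I picked arbitrarily, so treat it as a picture of the data flow rather than the paper's exact model.

```python
import math
import random

random.seed(0)

vocab = ["i", "love", "machine", "learning"]
word_to_idx = {w: i for i, w in enumerate(vocab)}
EMB, HIDDEN, CONTEXT = 3, 4, 3  # hypothetical sizes

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

embeddings = rand_matrix(len(vocab), EMB)      # one row per word
W_hidden = rand_matrix(HIDDEN, CONTEXT * EMB)  # hidden layer weights
W_out = rand_matrix(len(vocab), HIDDEN)        # output layer weights

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def nnlm_forward(context_words):
    # 1. look up and concatenate the context embeddings
    x = [value for w in context_words for value in embeddings[word_to_idx[w]]]
    # 2. hidden layer with a tanh non-linearity
    h = [math.tanh(a) for a in matvec(W_hidden, x)]
    # 3. one probability per vocabulary word
    return softmax(matvec(W_out, h))

probs = nnlm_forward(["i", "love", "machine"])
print(dict(zip(vocab, probs)))
```

Training would adjust `embeddings`, `W_hidden`, and `W_out` together, which is why the embeddings come out as a by-product of the prediction task.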


Then Comes RNNLM

The FFNN was working fine, so why did people move beyond simple feedforward networks?

This part matters historically because RNNs solved an important limitation.


Limitation of Feedforward Models

Feedforward models use a fixed context.

Example:

last 3 words only 

So if the model was trained with a context window of 3:

"The king who ruled the ancient"

it might only see:

["ruled","the","ancient"]

Everything before that is forgotten.
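In code, the fixed window is nothing more than a slice. A tiny sketch, with `context_size` as my own variable name:

```python
sentence = "The king who ruled the ancient".split()
context_size = 3

visible = sentence[-context_size:]    # what the model gets to see
forgotten = sentence[:-context_size]  # what never reaches the model

print(visible)    # ['ruled', 'the', 'ancient']
print(forgotten)  # ['The', 'king', 'who']
```

Notice that "king", the subject of the sentence, is in the forgotten part.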

But real language doesn’t work like that.

Meaning often depends on words that appeared much earlier in the sentence.

But language isn’t fixed length.

RNNs introduced memory.


My Understanding of RNNs

RNNs introduced the idea of memory.

To understand why that mattered, I first had to understand the limitation of feedforward networks.

In a normal feedforward neural network, we basically do:

input words
→ embeddings
→ multiply with weights
→ hidden layer
→ predict next word

The problem is that this happens independently for every prediction.

The model only sees a fixed context window like:

last 3 words only

Anything before that is completely forgotten.

So even though feedforward models were powerful, their “memory” was limited by the context size we manually chose.

But language doesn’t really work that way.

Sometimes a word earlier in the sentence completely changes the meaning later on.

That’s where RNNs changed things.

Very roughly:

previous hidden state + current word → next hidden state

So now the network keeps updating an internal memory while reading words one by one.

Unlike feedforward networks, the model is no longer restricted to only a fixed window of previous words.

That’s why the paper says RNNs can form a kind of “short term memory.”

And honestly, this made much more sense to me once I stopped thinking of RNNs as complicated equations and started thinking of them as:

a neural network that remembers previous context while reading a sentence.
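The "previous hidden state + current word → next hidden state" rule can be sketched in a few lines. The weights and the 2-dimensional word vectors below are made-up fixed numbers, and I left out biases and the output layer, so this only shows the memory update.

```python
import math

def rnn_step(hidden, word_vec, W_h, W_x):
    """One recurrence: new_hidden = tanh(W_h @ hidden + W_x @ word_vec)."""
    new_hidden = []
    for row_h, row_x in zip(W_h, W_x):
        total = (sum(a * b for a, b in zip(row_h, hidden))
                 + sum(a * b for a, b in zip(row_x, word_vec)))
        new_hidden.append(math.tanh(total))
    return new_hidden

# Hypothetical fixed weights and word vectors, chosen by hand.
W_h = [[0.5, -0.2], [0.1, 0.4]]
W_x = [[0.3, 0.7], [-0.6, 0.2]]
word_vectors = {"the": [1.0, 0.0], "king": [0.0, 1.0], "rules": [0.5, 0.5]}

def encode(words):
    hidden = [0.0, 0.0]  # empty memory before reading anything
    for w in words:
        hidden = rnn_step(hidden, word_vectors[w], W_h, W_x)
    return hidden

print(encode(["the", "king", "rules"]))
print(encode(["king", "rules"]))  # dropping the first word changes the final state
```

The final state depends on every word that came before, not just the last few.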

Then the Paper Makes a Very Important Decision

This was the turning point.

The authors realized:

hidden layers are expensive.

So instead of making bigger neural nets…

they simplified the architecture.

This decision eventually leads to:

And honestly this simplicity is what made Word2Vec revolutionary.


CBOW: Continuous Bag of Words

CBOW predicts the current word using surrounding context.

Example:

context= ["I","love","learning"]
target="machine"

The model sees surrounding words and predicts the center word.


My Understanding of CBOW

What CBOW is really doing is:

compressing context into a single representation.

In the implementation:

embeds.mean(dim=0)

we literally average all surrounding word vectors.

That averaged vector becomes:

Why It’s Called “Bag of Words”

Because:

So:

["I","love","deep"]

and:

["deep","love","I"]

are treated similarly.

That’s why it’s called:

Continuous Bag of Words
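A quick sketch of why order does not matter once you average. The 3-dimensional embeddings are made-up numbers, and `average_context` is my own helper name.

```python
import math

# Hypothetical 3-dimensional embeddings.
embeddings = {
    "i":    [0.1, 0.4, -0.2],
    "love": [0.7, -0.3, 0.5],
    "deep": [-0.2, 0.8, 0.3],
}

def average_context(words):
    """Average the word vectors dimension by dimension."""
    vectors = [embeddings[w] for w in words]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

a = average_context(["i", "love", "deep"])
b = average_context(["deep", "love", "i"])

# Same bag of words, different order, same averaged vector
# (up to floating-point rounding).
print(a)
print(b)
print(all(math.isclose(x, y, abs_tol=1e-12) for x, y in zip(a, b)))
```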


But CBOW Has a Limitation

CBOW is great at:

But because it averages context,

it sometimes loses fine semantic detail.

That’s why the paper later shows:

And that honestly makes sense intuitively.


Skip-Gram (My Favorite Part)

Skip-Gram flips the entire prediction direction.

Instead of:

context → word

it does:

word → context

Example:

Input:
"king"

Targets:
["queen","royal","kingdom"]

Now one word tries to predict its surrounding neighbors.

Example:

input_word="machine"

targets= [
"I",
"love",
"learning"
]

The paper says Skip-Gram performs especially well on semantic relationships.

And honestly that makes intuitive sense.

Because one word is learning how it relates to many neighboring words.
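Here is a small sketch of how (input word, target word) pairs could be generated for Skip-Gram. The function name `skipgram_pairs` and the fixed window are my assumptions for illustration.

```python
def skipgram_pairs(tokens, window_size):
    """Pair every word with each neighbor inside the window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:  # a word does not predict itself
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["i", "love", "machine", "learning"], window_size=2)
machine_pairs = [p for p in pairs if p[0] == "machine"]
print(machine_pairs)
# [('machine', 'i'), ('machine', 'love'), ('machine', 'learning')]
```

So a single occurrence of "machine" produces three training examples, one per neighbor.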


The Results Section Was Actually Crazy

The paper starts showing relationships like:

Paris - France + Italy ≈ Rome

At first I thought:

“this has to be cherry-picked.”

But then they systematically evaluate relationships:

And the vectors actually capture them.

That’s when embeddings stopped feeling like random math.


What I Finally Understood

At some point while reading this paper,

I stopped thinking:

“these are vectors”

and started thinking:

“this is a learned semantic space.”

Words become points.

Relationships become directions.

Meaning becomes geometry.

And modern NLP basically starts from here.


So I Tried Building a Tiny Version Myself

At this point, I felt like I understood the paper conceptually.

Words become vectors.

Similar words become nearby vectors.

CBOW predicts a word from its surrounding context.

But I still had one question:

How do these embeddings actually get created?

So instead of reading more papers, I decided to build a tiny version myself.

Not a production-ready Word2Vec.

Just enough code to understand how a neural network can learn meaningful word representations.


Tiny Word2Vec (Tiny Educational Version):

Step 0: Imports

import torch
import torch.nn as nn
import torch.optim as optim

Step 1: Starting With Raw Text

I began with a tiny corpus:

# ---------------------------------------------------
# 1. Tiny Dataset
# ---------------------------------------------------

text = """
i love deep learning
i love neural networks
deep learning loves data
neural networks learn patterns
"""

The first thing I needed was to break this text into individual words.

words = text.lower().split()  # entire dataset as one flat list of tokens
print(words)

which gives: ["i", "love", "deep", "learning", ...]

This process is called tokenization. Almost every NLP pipeline starts here.

Step 2: Creating a Vocabulary

Neural networks cannot understand words. They only understand numbers. So the next step is creating a vocabulary.

# ---------------------------------------------------
# 2. Vocabulary
# ---------------------------------------------------

vocab = sorted(set(words)) #total unique vocab
print(f"vocab: {vocab}")
print(f"vocab size: {len(vocab)}")
word_to_idx = {
    word: idx
    for idx, word in enumerate(vocab)
}

idx_to_word = {
    idx: word
    for word, idx in word_to_idx.items()
}

vocab_size = len(vocab)

print("Vocabulary:")
print(word_to_idx)

Understanding the code block

Step 3: Creating Training Data

From the tokens we have, we now create (context, target) training pairs.

# ---------------------------------------------------
# 3. Create Training Data (CBOW)
# ---------------------------------------------------
# context -> target

window_size = 2

training_data = []

for i in range(window_size, len(words) - window_size):

    # window_size words on each side of position i
    context = (
        words[i - window_size:i]
        + words[i + 1:i + window_size + 1]
    )

    target = words[i]

    training_data.append((context, target))

print("\nSample Training Pairs:")
for pair in training_data[:3]:
    print(pair)

Understanding this code block

Step 4: Creating the Embedding Layer

This is where Word2Vec finally starts.

# ---------------------------------------------------
# 4. CBOW Model
# ---------------------------------------------------

class CBOW(nn.Module):

    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        # embedding table
        self.embeddings = nn.Embedding(
            vocab_size,
            embedding_dim
        )
        # output layer
        self.linear = nn.Linear(
            embedding_dim,
            vocab_size
        )

    def forward(self, context_idxs):
        # look up the embedding of every context word
        # resulting shape: (num_context_words, embedding_dim)
        embeds = self.embeddings(context_idxs)
        # average context embeddings - CBOW's main idea
        # Take surrounding words:
        # - average them
        # - create single context representation
        # This is why it’s called: Continuous Bag of Words
        # Because:
        #   word order mostly ignored
        #   context compressed into one vector
        context_vector = embeds.mean(dim=0)
        # score every vocabulary word as the candidate center word
        output = self.linear(context_vector)
        return output

Understanding this code block

Step 5: Hyperparameters config

# ---------------------------------------------------
# 5. Hyperparameters
# ---------------------------------------------------

EMBEDDING_DIM = 10 # every word vector has 10 numbers (the paper's models use roughly 50 to 1000)

model = CBOW(vocab_size, EMBEDDING_DIM)

loss_function = nn.CrossEntropyLoss() #Measures: how wrong prediction is

optimizer = optim.SGD(   #Stochastic Gradient Descent, updating weights
    model.parameters(),
    lr=0.01
)

Understanding this code block
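As one way to read the loss function in the block above: `nn.CrossEntropyLoss` takes raw scores (logits), applies a log-softmax, and returns the negative log-probability of the correct word. Below is a plain-Python sketch of that calculation for a single example. The function name `cross_entropy` and the example scores are my own.

```python
import math

def cross_entropy(logits, target_index):
    """Negative log-probability of the target under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    log_prob_target = logits[target_index] - log_sum_exp
    return -log_prob_target

print(cross_entropy([2.0, 0.1, -1.0], 0))  # highest score on the right word: small loss
print(cross_entropy([2.0, 0.1, -1.0], 2))  # highest score on the wrong word: large loss
print(cross_entropy([0.0, 0.0, 0.0], 1))   # uniform guess over 3 words: log(3)
```

The lower this number, the more probability the model put on the correct center word.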

Step 6: Training

# ---------------------------------------------------
# 6. Training
# ---------------------------------------------------

print("\nTraining...\n")

for epoch in range(200):  # one epoch = one full pass over the training pairs

    total_loss = 0
    for context, target in training_data:

        # context -> tensor ids
        context_idxs = torch.tensor(
            [word_to_idx[w] for w in context],
            dtype=torch.long
        )

        # target -> tensor id
        target_idx = torch.tensor(
            [word_to_idx[target]],
            dtype=torch.long
        )

        # forward pass
        output = model(context_idxs)

        # calculate loss
        loss = loss_function(
            output.unsqueeze(0),
            target_idx
        )

        # reset gradients
        optimizer.zero_grad()

        # backpropagation
        loss.backward()

        # update weights
        optimizer.step()

        total_loss += loss.item()

    if epoch % 20 == 0:
        print(f"Epoch {epoch} | Loss: {total_loss:.4f}")

normal neural network flow:

Step 7: Result - Looking at the Learned Embeddings

# ---------------------------------------------------
# 7. Learned Embeddings
# ---------------------------------------------------

print("\nLearned Word Embeddings:\n")

embeddings = model.embeddings.weight.data

for word in vocab:

    idx = word_to_idx[word]

    print(f"{word} -> {embeddings[idx].numpy()}")

You’ll get output something like this (yours might be different):

Learned Word Embeddings:

data -> [ 0.204588    0.34369263 -1.2781278   0.6347771  -0.19352344  0.40877014
  0.6538011   1.1686823  -0.28810385 -0.675569  ]
deep -> [ 1.4658912   1.2604588  -0.94967735 -0.12230076 -1.1447756   0.42462495
  1.4930336  -0.60545194 -0.6058509  -0.62027574]
i -> [ 0.8208627   0.6121605  -0.6962656  -0.416866    2.387134   -0.6415344
 -0.36190453 -0.27344874  1.5900056   0.07299932]
learn -> [ 0.34396684  0.26548082  0.35114777  1.7791262   1.4267697  -0.75057405
 -0.6348938  -0.65018314 -0.23072183  1.2093976 ]
learning -> [ 0.42603555 -0.71520233  1.0555758  -0.508294    0.43078062  1.1194822
 -0.03498755  0.4830245   1.606832   -0.7192453 ]
love -> [-1.4685085   1.654144    0.37540713  3.5980105   1.0338693   0.5566743
  1.4982488   1.3331736   0.66575694  0.32241437]
loves -> [ 2.1571102  -0.40720007 -0.47596204  0.31325144  0.50651723 -0.02485154
  1.4894956   0.44100833 -1.1547507  -0.19835488]
networks -> [-0.29807022 -0.5435951  -0.5359694  -1.330048    0.393193   -0.89274454
  0.06493169  0.29302877 -0.5873003   1.002373  ]
neural -> [ 0.6281754   0.3660114  -0.7708631   0.15852615 -0.6642887   1.2953672
 -2.0957208  -0.21021275 -0.07943218 -0.38959318]
patterns -> [ 0.13621739  1.6686039   0.21825774  0.9016406   1.7864395  -0.5738791
 -0.9760383  -1.3823657  -1.7776685   0.3002549 ]

At this point, the vectors are no longer random. They contain information learned from context.

And that's exactly what this entire paper is about:

learning meaning from usage.

This implementation is tiny.

But honestly, this is where the paper truly became understandable to me.

Because now:

all became real. Not just theory.


Can We Reproduce king - man + woman ≈ queen?

One of the most famous examples associated with Word2Vec is:

king - man + woman ≈ queen

When I first saw this online, I honestly thought it looked like a cool mathematical trick.

After reading the paper, I understood why it happens. But I still wanted to see it myself.

Here's how I implemented a tiny Word2Vec from scratch and verified whether the famous king - man + woman ≈ queen relationship emerges.

1. Importing the Required Libraries

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

2. Defining the CBOW Model

class CBOW(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(
            vocab_size,
            embedding_dim
        )
        self.linear = nn.Linear(
            embedding_dim,
            vocab_size
        )

    def forward(self, context_idxs):
        embeds = self.embeddings(context_idxs)
        context_vector = embeds.mean(dim=0)
        output = self.linear(context_vector)
        return output

Next, I defined a simple CBOW (Continuous Bag of Words) model. It is exactly the same as the one defined above, so if you are running this in a notebook you can skip this step and reuse that CBOW block.


3. Creating a Small Dataset

text="""
king is royal male
queen is royal female
...
"""

Since my goal was to test whether relationships like king - man + woman ≈ queen could emerge, I created a small corpus focused on royalty, gender, and related concepts.

The dataset repeatedly reinforces relationships such as:

king ↔ queen
man ↔ woman
prince ↔ princess

This gives the model a chance to learn meaningful patterns despite the small amount of data.


4. Building the Vocabulary and Training Examples

# 1. Vocab creation
words = text.lower().split()
vocab = sorted(set(words))
vocab_size = len(vocab)
word_to_idx = {word: idx for idx, word in enumerate(vocab)}
idx_to_word = {idx: word for word, idx in word_to_idx.items()}

# 2. Window processing per sentence (window_size = 1 or 2)
window_size = 1
training_data = []
sentences = [line.lower().split() for line in text.strip().split('\n') if line.strip()]

for sentence_words in sentences:
    for i in range(len(sentence_words)):
        target = sentence_words[i]
        # Get context words strictly within this sentence bounds
        context = [
            sentence_words[j] 
            for j in range(max(0, i - window_size), min(len(sentence_words), i + window_size + 1)) 
            if j != i
        ]
        if context:
            training_data.append((context, target))

Neural networks cannot process raw text directly, so the first step is converting words into integer IDs.

After building the vocabulary, I generated CBOW training examples by creating context-target pairs. For every word, the surrounding words become the input context, while the center word becomes the prediction target.
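To see what those pairs look like, here is the same windowing logic wrapped in a small function and run on one sentence from the corpus. The function name `cbow_pairs` is mine.

```python
def cbow_pairs(sentence_words, window_size):
    """Build (context, target) pairs without crossing the sentence boundary."""
    pairs = []
    for i, target in enumerate(sentence_words):
        lo = max(0, i - window_size)
        hi = min(len(sentence_words), i + window_size + 1)
        context = [sentence_words[j] for j in range(lo, hi) if j != i]
        if context:
            pairs.append((context, target))
    return pairs

for context, target in cbow_pairs("king is royal male".split(), window_size=1):
    print(context, "->", target)
# ['is'] -> king
# ['king', 'royal'] -> is
# ['is', 'male'] -> royal
# ['royal'] -> male
```

Words at the edge of a sentence simply get a shorter context instead of borrowing words from the next sentence.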


5. Defining the Hyperparameters

# 3. Hyperparameters (starting LR of 0.1, decayed during training)
EMBEDDING_DIM = 20
model = CBOW(vocab_size, EMBEDDING_DIM)
loss_function = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.1) 

Here I configured the model.

The embedding dimension controls how many numbers represent each word vector, while the loss function measures prediction error. I used Adam instead of Stochastic Gradient Descent (SGD) as the optimizer. Adam adapts the step size for each parameter, and here it worked better than plain SGD.


6. Training the Model

epochs = 4000
for epoch in range(epochs):
    total_loss = 0
    optimizer.zero_grad()  # Reset gradients once per epoch
    
    # Linear learning rate decay
    current_lr = 0.1 * (1.0 - (epoch / epochs))
    for param_group in optimizer.param_groups:
        param_group['lr'] = current_lr

    # We use a running batch loss to prevent gradient explosion
    batch_loss = 0
    
    for context, target in training_data:
        context_idxs = torch.tensor([word_to_idx[w] for w in context], dtype=torch.long)
        target_idx = torch.tensor([word_to_idx[target]], dtype=torch.long)
        
        output = model(context_idxs)
        loss = loss_function(output.unsqueeze(0), target_idx)
        
        # Divide by total training data size so gradients are averaged, not summed infinitely
        loss = loss / len(training_data) 
        loss.backward()  
        
        batch_loss += loss.item() * len(training_data) # track true total loss for printing
        
    optimizer.step()  # Apply stable, averaged updates
    
    if epoch % 500 == 0:
        print(f"Epoch {epoch} | Loss: {batch_loss:.4f} | LR: {current_lr:.4f}")

This is where the actual learning happens.

For every context-target pair:

  1. The model makes a prediction.
  2. CrossEntropyLoss measures how wrong the prediction is.
  3. Backpropagation computes gradients, which accumulate across all pairs in the epoch.

Then, once per epoch, Adam applies a single update to the weights using the averaged gradients.

Repeating this process thousands of times gradually transforms random embeddings into meaningful representations. I kept it at 4000 epochs; try increasing or decreasing that number to see how it affects the output.
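The linear learning rate decay used in the training loop is easy to inspect on its own. A small sketch, with `linear_decay` as my own helper name and the same 0.1 starting value:

```python
def linear_decay(epoch, epochs, base_lr=0.1):
    """Shrink the learning rate linearly from base_lr toward 0."""
    return base_lr * (1.0 - epoch / epochs)

for epoch in (0, 1000, 2000, 3000, 3999):
    print(epoch, round(linear_decay(epoch, 4000), 6))
```

Early epochs take big steps and later epochs take tiny ones, so the embeddings settle instead of bouncing around.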


7. Performing the Famous Vector Arithmetic

embeddings = model.embeddings.weight.data
def get_embedding(word):
    idx = word_to_idx[word]
    return embeddings[idx]

result = (
    get_embedding("king")
    - get_embedding("man")
    + get_embedding("woman")
)

Once training was complete, I extracted the learned embeddings and recreated the famous Word2Vec analogy.

The idea is to move through the embedding space by removing the "male" component from king and adding the "female" component from woman.

The resulting vector should ideally point toward queen.


8. Finding the Closest Matching Word

best_word = None
best_similarity = 1

excluded_words = [
    "king",
    "man",
    "woman"
]
similarities = []

for word in vocab:

    # skip input words
    if word in excluded_words:
        continue

    vector = get_embedding(word)

    similarity = F.cosine_similarity(
        result,
        vector,
        dim=0
    ).item()
    similarities.append((word, similarity))

# sort descending
similarities = sorted(
    similarities,
    key=lambda x: x[1],
    reverse=True
)

eq = "king - man + woman"  # label used in the printout below
print(f"\nSimilarity Scores for '{eq}':\n")

for word, score in similarities:
    print(f"{word}: {score:.4f}")

best_word, best_similarity = similarities[0]

print("\nBest Match:", best_word)
print("Similarity:", round(best_similarity, 4))

The vector produced by the arithmetic operation is not itself a word. It is simply another point in the embedding space.

To determine which word it represents, I compared it against every learned embedding using cosine similarity and selected the closest match.

If the embeddings successfully capture semantic relationships, the nearest word should be something very close to queen.

output:

Similarity Scores for 'king - man + woman':

queen: 0.6053
prince: 0.4930
princess: 0.4485
dress: 0.3918
girl: 0.3009
boy: 0.2913
young: 0.1904
field: 0.1722
....

As you can see, queen is on top. That means the vector we get from ‘king - man + woman’ has a higher cosine similarity with queen’s embedding than with any other word in the vocabulary (apart from the three input words, which we excluded).




Final Thoughts

I think the biggest lesson from this paper is:

meaning does not need to be manually defined.

If a model sees enough context, relationships emerge naturally.

And that idea became the foundation for:

This paper looks simple at first.

I genuinely think it’s one of the most important papers in modern AI.


Colab link: https://colab.research.google.com/drive/1nyZDsVg3_P_ebNGEo8fGhJQT1PJDN3Ou?usp=sharing


THE END


Happy learning…Until next time….

blog by mahendra