The “What, Why, How” Guide to Machine Learning - ML Basics
Machine Learning Foundations - From Zero to Linear Models
If your end goal is Generative AI, you must understand one truth:
Large Language Models are not magic.
Before Transformers, before Attention, before LLMs — there was Linear Regression.
This blog builds your foundation the right way using:
What → Why → How
What is Artificial Intelligence (AI)?
AI = Artificial + Intelligence
What is “Artificial”?
Artificial means:
- Man-made
- Not naturally occurring
- Built by humans using algorithms + hardware
Example:
- A human brain → natural
- A neural network running on GPU → artificial
So AI systems are engineered systems that mimic certain capabilities of humans.
What is “Intelligence”?
Intelligence means the ability to:
- Learn from experience
- Reason
- Solve problems
- Understand language
- Perceive the environment
- Adapt to new situations
In humans:
- We see → understand → decide → act.
In machines:
- Input → process → output (with learning).
So What is AI?
Artificial Intelligence = Artificial systems that can perform tasks requiring intelligence.
Examples:
- Playing chess
- Detecting spam
- Recommending videos
- Conversational chatbots
- Self-driving perception systems
Important:
AI does not mean consciousness.
It means decision-making or learning capability.
Now: ML vs DL vs AI vs GenAI
Think of it like nested circles.
AI is the big field.
ML is one way to achieve AI.
Deep Learning is one type of ML.
Generative AI is built mostly on Deep Learning.
Simple Concept Diagram
Artificial Intelligence (AI)
│
├── Rule-Based Systems
│
└── Machine Learning (ML)
    │
    ├── Classical ML (SVM, Trees, etc.)
    │
    └── Deep Learning (DL)
        │
        ├── CNNs
        ├── RNNs
        └── Transformers
            │
            ├── Large Language Models (LLMs)
            │
            └── Generative AI
Visual Nested View (Very Simple)
AI
└── ML
    └── Deep Learning
        └── Transformers
            └── LLMs
                └── Generative AI
Where Does Generative AI Fit?
Generative AI is:
- A subset of Deep Learning
- Mostly built on the Transformer architecture
- Often implemented as LLMs or diffusion models
It focuses on:
- Generating new content
- Not just predicting labels
Example:
- Spam detection → ML
- Image classification → Deep Learning
- ChatGPT writing essays → Generative AI
One-Line Difference Summary
- AI → Big goal: make machines intelligent
- ML → Machines learn from data
- Deep Learning → ML using neural networks
- Generative AI → Deep learning models that create new content
# Introduction to Machine Learning
What
Machine Learning (ML) is a branch of Artificial Intelligence where systems learn patterns from data instead of being explicitly programmed with rules.
Traditional Programming:
Rules + Data → Output
Machine Learning:
Data + Output → Model
Model + New Data → Prediction
Instead of writing rules manually, we let the machine discover patterns.
Example:
Instead of writing:
if email_contains "lottery" → spam
We give the system:
- Thousands of spam emails
- Thousands of non-spam emails
The model learns patterns on its own.
Why
Because real-world problems are too complex for rule-based systems.
Consider:
- Image recognition
- Speech recognition
- Fraud detection
- Language generation
You cannot manually write rules for:
- Recognizing a cat in 10 million possible image variations.
- Detecting sarcasm in text.
- Generating a Shakespeare-style paragraph.
Data-driven learning scales.
Manual rule-writing does not.
In Generative AI:
We don’t write rules for writing poetry.
We train models on massive datasets.
How
The core ML workflow:
This entire loop is used even when training billion-parameter models.
# Why Machine Learning?
What
A philosophical question:
Why do we need ML instead of traditional programming?
Why
Three core reasons:
1️⃣ No Clear Rules
How do you write rules to:
- Translate English to Japanese?
- Generate human-like conversation?
Impossible manually.
2️⃣ Massive Data Availability
The internet created:
- Petabytes of text
- Billions of images
- Clickstream behavior
ML thrives on data.
3️⃣ Adaptability
ML models:
- Adapt to new data
- Improve over time
Traditional systems:
- Require manual updates.
How
Think of ML as:
Function Approximation
We try to approximate an unknown function f that maps inputs to outputs:
y = f(x)
Generative AI is the same principle at a bigger scale.
# Types of Machine Learning
🔹 Supervised Learning
What
Learning from labeled data.
Input (X) → Output (Y)
Example:
- House size → Price
- Email text → Spam/Not spam
Why
Used when we know the correct answers.
How
Two main categories:
1️⃣ Regression
2️⃣ Classification
🔹 Unsupervised Learning
What
Learning patterns without labels.
Why
Often, labels don’t exist.
Example:
Customer segmentation: the process of dividing a customer base into distinct groups (segments) that share similar characteristics, such as demographics, behaviors, or needs. This enables targeted marketing, improved product development, and higher conversion rates.
How
Model finds:
- Clusters
- Hidden structure
- Data compression
Used in:
- Dimensionality reduction (e.g., PCA): reduces the number of features by converting high-dimensional data into a lower-dimensional space while preserving the important information.
- Clustering
- Pre-training embeddings
🔹 Reinforcement Learning
What
Learning through reward and penalty.
Why
Used in:
- Game playing
- Robotics
- Sequential decision-making
How
Agent:
- Takes action
- Gets reward
- Updates policy
LLMs use:
Reinforcement Learning from Human Feedback (RLHF).
# Supervised Machine Learning Algos
◻️ 1. Linear Regression - The Foundation
What
One of the fundamental supervised machine learning algorithms.
A model that describes the relationship between variables using a straight line:
y = wx + b
Where:
- x → input
- w → weight (slope)
- b → bias
- y → prediction
Why
Why are we learning this?
At its core, this is about predicting a value (y) from an input (x).
The key point:
👉 The output y is continuous (not categories).
Example: house price, temperature, salary, etc.
It is the simplest form of:
Parameter learning through optimization.
Understanding Linear Regression means understanding:
- Neural network layers
- Gradient descent
- Loss functions
- Weight updates
It is the seed of deep learning.
How
But how does it actually work?
We model the relationship like this:
y = wx + b
- x (input) → what you give as an input (e.g., house size)
- w (weight) → how strongly x influences y
- b (bias) → a constant offset (baseline value)
So basically:
👉 We take the input, scale it using a weight, and shift it using a bias.
Example:
Predict salary based on experience.
Model learns slope.
Training means:
Find best w and b.
What are we really learning?
We’re not learning “the answer” directly.
We’re learning:
👉 the best values of w and b such that predictions are as close as possible to real data.
So basically: we learn the parameters (w, b) by minimizing error through optimization.
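To make "as close as possible to real data" concrete, here is a minimal sketch that scores two candidate (w, b) pairs on a tiny salary dataset. The numbers are made up for illustration (they are generated from y = 5x + 30), and the helper names `predict` and `mean_squared_error` are hypothetical.

```python
import numpy as np

# Toy data, invented for illustration: years of experience -> salary (in $1000s).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([35.0, 40.0, 45.0, 50.0, 55.0])  # generated from y = 5x + 30


def predict(x, w, b):
    """Scale the input by the weight, then shift it by the bias."""
    return w * x + b


def mean_squared_error(y_true, y_pred):
    """Average squared gap between actual and predicted values."""
    return float(np.mean((y_true - y_pred) ** 2))


# Two candidate parameter settings: a rough guess and the line that generated the data.
error_guess = mean_squared_error(y, predict(x, w=3.0, b=30.0))
error_best = mean_squared_error(y, predict(x, w=5.0, b=30.0))

print("error with w=3, b=30:", error_guess)
print("error with w=5, b=30:", error_best)
```

Training is just an automated search for the (w, b) pair with the lowest error.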
▨ Ordinary Least Squares (OLS)
What
Method to find best fitting line by minimizing squared errors.
Ordinary Least Squares (OLS) is a mathematical method used to find the best-fitting line in Linear Regression.
It chooses the line that minimizes the total squared difference between the predicted values and the actual values.
Those differences are called errors or residuals.
Example:
Suppose we want to predict house price from size.
OLS finds the line:
price = w × size + b
that best fits the observed data points.
Why
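We need a mathematical definition of "best fit."
In real data, points almost never lie perfectly on a straight line.
(Scatter plot: data points spread around a line rather than exactly on it.)
If we draw a random line, the prediction errors will be large.
OLS helps us find the optimal line such that the sum of squared errors:
SSE = Σ (yᵢ − ŷᵢ)²
is as small as possible, where yᵢ is the actual value and ŷᵢ is the predicted value.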
Why squared error?
- Prevents positive and negative errors from canceling out
- Penalizes large mistakes more heavily
- Makes the math easier to optimize
How
Two approaches:
1. Closed Form Solution
The Closed Form Solution is a direct mathematical formula that calculates the optimal regression parameters in one step using linear algebra.
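For linear regression, the parameters are computed as:
w = (XᵀX)⁻¹ Xᵀ y
Where:
- X → matrix of input features (with a column of ones for the bias)
- y → vector of actual target values
- w → vector of learned parameters (weights and bias)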
Limitations:
- Requires matrix inversion
- Becomes computationally expensive when the number of features is large
- Memory-heavy for large datasets
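A minimal NumPy sketch of the closed form, assuming the standard normal equation w = (XᵀX)⁻¹ Xᵀ y. The data is synthetic (generated from y = 3x + 2), and the code solves the linear system instead of forming the inverse explicitly, which is the usual numerically safer choice.

```python
import numpy as np

# Synthetic data for illustration: exactly y = 3x + 2.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x + 2.0

# Design matrix with a column of ones so the bias is learned like any other weight.
X = np.column_stack([x, np.ones_like(x)])

# Normal equation (X^T X) theta = X^T y, solved directly in one step.
theta = np.linalg.solve(X.T @ X, X.T @ y)
w, b = theta

print("w =", w, "b =", b)
```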
2. Gradient Descent
Iterative optimization.
Gradient Descent is an iterative optimization algorithm that gradually updates the model parameters to minimize the cost function.
Instead of solving a formula directly, it takes small steps toward the minimum error.
Why
Closed form becomes impractical for large datasets or high-dimensional data.
Gradient descent works well because:
- Scales to millions of data points
- Works with complex models
- Does not require matrix inversion
This is why deep learning and neural networks rely on gradient descent.
How
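Step-by-step:
1. Start with initial values for w and b (for example, zeros or small random numbers).
2. Compute predictions and the cost on the training data.
3. Compute the gradient of the cost with respect to w and b.
4. Update each parameter by a small step against its gradient: w = w − α · ∂J/∂w and b = b − α · ∂J/∂b, where α is the learning rate.
5. Repeat until the cost stops decreasing.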
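One way to write the loop down, as a minimal sketch: batch gradient descent on mean squared error for a single-feature model. The data is synthetic (y = 3x + 2), and the learning rate and step count are arbitrary choices that happen to converge here.

```python
import numpy as np

# Synthetic data for illustration: exactly y = 3x + 2.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x + 2.0

w, b = 0.0, 0.0          # initial guess
learning_rate = 0.05     # step size (an assumption; tune per problem)

for step in range(5000):
    y_pred = w * x + b                  # current predictions
    error = y_pred - y                  # signed prediction errors
    grad_w = 2.0 * np.mean(error * x)   # d(MSE)/dw
    grad_b = 2.0 * np.mean(error)       # d(MSE)/db
    w -= learning_rate * grad_w         # step against the gradient
    b -= learning_rate * grad_b

print("w =", w, "b =", b)
```

A learning rate that is too large makes the updates overshoot and diverge; one that is too small makes training slow.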
Intuition
Imagine you are standing on a mountain trying to reach the lowest valley.
- Closed Form → helicopter directly drops you at the lowest point
- Gradient Descent → you walk downhill step by step
▨ Cost Functions
What
Function that measures model error.
A cost function is a mathematical formula that quantifies the difference between a machine learning model's predicted values and the actual target values.
Loss vs. Cost: While often used interchangeably, a "loss function" typically refers to the error for a single data point, whereas a "cost function" is the average or sum of these losses over the entire dataset.
- MSE (Mean Squared Error): MSE = (1/n) Σ (yᵢ − ŷᵢ)²
- RMSE (Root Mean Squared Error): RMSE = √MSE
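A small worked example of both metrics, using the usual definitions (MSE is the mean of squared errors, RMSE is its square root). The three values are invented for illustration.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0])   # actual values (made up)
y_pred = np.array([2.0, 5.0, 9.0])   # model predictions (made up)

squared_errors = (y_true - y_pred) ** 2   # [1, 0, 4]
mse = float(np.mean(squared_errors))      # (1 + 0 + 4) / 3
rmse = float(np.sqrt(mse))                # same units as y

print("MSE:", mse)
print("RMSE:", rmse)
```

RMSE is often easier to interpret because it is expressed in the same units as the target.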
Why
Without a cost function:
We don’t know how wrong we are.
Cost function guides learning.
How
Training = minimize cost.
In Neural Networks:
- Cross entropy
- KL divergence
Same philosophy.
▨ Regularization (Ridge, Lasso, ElasticNet)
What
Add penalty to large weights.
Regularization in Machine Learning is a technique used to prevent a model from memorizing the training data too much so that it can work well on new, unseen data.
Why/The Problem: Overfitting
Sometimes a model learns the training data too perfectly. It memorizes noise and small details that don’t matter.
This problem is called Overfitting.
Example:
- Imagine studying for an exam by memorizing exact questions from last year.
- If the exam questions change slightly, you might fail.
A model that overfits behaves the same way.
Imagine fitting a line through data.
Without regularization:
Model tries to perfectly fit every point
With regularization:
Model tries to capture the overall trend
Regularization forces the model to prefer simpler solutions.
What Regularization Does
Regularization adds a penalty to the model for becoming too complex.
So the model is encouraged to:
- keep weights small
- stay simple
- focus on general patterns instead of noise
Think of it like a teacher telling you to understand the concepts instead of memorizing the answers.
Simple Analogy
Imagine drawing a curve to fit points on a graph.
Without regularization:
- The curve twists and turns to pass through every point.
With regularization:
- The curve stays smooth and simple, capturing the overall trend.
Types of Regularization
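1. Ridge Regression (L2 Regularization): adds a penalty proportional to the sum of squared weights, which shrinks weights toward zero.
2. Lasso Regression (L1 Regularization): adds a penalty proportional to the sum of absolute weights, which can set some weights exactly to zero.
3. ElasticNet: combines the L1 and L2 penalties.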
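To see the three penalties side by side, here is a hedged sketch using scikit-learn's `Ridge`, `Lasso`, and `ElasticNet`. The dataset is synthetic (only the first two of five features influence the target), and the `alpha` values are arbitrary picks for illustration, not recommendations.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Synthetic target: only features 0 and 1 matter, the other three are pure noise.
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

models = {
    "ols": LinearRegression(),                            # no penalty
    "ridge": Ridge(alpha=10.0),                           # L2 penalty
    "lasso": Lasso(alpha=0.5),                            # L1 penalty
    "elasticnet": ElasticNet(alpha=0.5, l1_ratio=0.5),    # mix of L1 and L2
}

coefs = {name: model.fit(X, y).coef_ for name, model in models.items()}
for name, coef in coefs.items():
    print(f"{name:>10}: {np.round(coef, 2)}")
```

On data like this, Ridge shrinks all weights a little, while Lasso tends to zero out the weights of the irrelevant features entirely.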
▨ Polynomial Regression
What
Extend linear regression with higher degree terms:
y = ax² + bx + c
Polynomial regression is a supervised machine learning algorithm that models non-linear relationships by fitting a curve to the data. It transforms input features into polynomial terms (e.g., squared or cubed inputs) to capture patterns such as accelerating growth or other rapid, non-linear changes. It is used for regression tasks where a straight line is too simple.
Why
Many real-world relationships are nonlinear.
Example: Predicting Salary from Experience:
Sometimes the relationship is not a straight line.
- With experience, salary might grow slowly at first and then accelerate.
A straight line (linear regression) can’t capture this curve properly.
Polynomial Regression Idea:
Instead of just using x, we also use higher powers:
y = w₁x + w₂x² + w₃x³ + ... + b
This makes the model curved instead of straight.
Concrete Data Example:
👉 Notice:
- Growth is not linear
- It accelerates
What the Model Learns:
Instead of:
y = wx + b (straight line ❌)
It learns something like:
y = 0.5x² + 1.2x + 2 (curve ✅)
👉 Now the curve bends upward and fits the data better.
Intuition (Important):
- x → basic effect
- x² → captures acceleration (curvature)
- x³ → captures more complex bends
So:
👉 Polynomial regression = linear model on transformed features (x, x², x³...)
Real-world Use Cases
- House pricing (area + area² effects)
- Stock trends (non-linear movement)
- Growth patterns (population, revenue)
- Physics (motion equations)
How
Use PolynomialFeatures in sklearn.
Risk:
High degree → overfitting.
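A minimal sketch of the `PolynomialFeatures` approach, assuming noise-free data generated from the example curve y = 0.5x² + 1.2x + 2 quoted earlier. The pipeline expands x into (x, x²) and then fits an ordinary linear model on those features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic, noise-free data drawn from the example curve.
x = np.arange(0, 10, dtype=float).reshape(-1, 1)
y = 0.5 * x.ravel() ** 2 + 1.2 * x.ravel() + 2.0

# Straight-line baseline.
linear = LinearRegression().fit(x, y)

# Degree-2 polynomial: expand features to (x, x^2), then fit a linear model.
poly = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
).fit(x, y)

fitted = poly.named_steps["linearregression"]
print("R² linear:", linear.score(x, y))
print("R² poly:  ", poly.score(x, y))
print("learned weights for (x, x²):", fitted.coef_, "bias:", fitted.intercept_)
```

Raising `degree` makes the curve more flexible, which is exactly where the overfitting risk comes from.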
▨ Bias-Variance Tradeoff
What
Bias means the model is too simple to learn the true pattern in the data. The model makes strong assumptions.
Variance means the model is too sensitive to the training data. It learns noise instead of the real pattern.
The Bias–Variance Tradeoff describes the balance between:
Bias → error from overly simple models (Model too simple → underfitting.)
Variance → error from overly complex models (Model too complex → overfitting.)
A good model must balance both.
Too much of either leads to poor predictions.
LLMs reduce bias by scaling model size.
Regularization reduces variance.
Why
Explains model performance behavior.
1. Bias
What
Bias means the model is too simple to learn the true pattern in the data.
The model makes strong assumptions.
Example:
Trying to fit a straight line to data that is actually curved.
Actual pattern:
(Plot: data points following a curve.)
Model prediction (line):
---------
The model cannot capture the real pattern.
Result
High Bias → Underfitting
The model performs badly on both training and test data.
2. Variance
What
Variance means the model is too sensitive to the training data.
It learns noise instead of the real pattern.
Example:
Training points:
(Plot: a few scattered training points.)
The model creates a very wiggly curve to match every point.
(Plot: a wiggly curve bending through each training point.)
Result
High Variance → Overfitting
The model performs:
Training accuracy → very high
Test accuracy → very low
Visual Intuition
Underfitting        Good Model        Overfitting
High Bias           Balanced          High Variance
    |                   |                  |
----|-------------------|------------------|----→ Model Complexity
Why It's Called a Tradeoff
Reducing one usually increases the other.
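Example:
- Making the model more complex → bias goes down, variance goes up
- Making the model simpler → variance goes down, bias goes up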
So ML training tries to find the sweet spot.
How We Control Bias and Variance
Several techniques help manage this balance.
Example
Suppose you predict house prices.
Model A
Price = a + b * size
Too simple → High bias
Model B
Price = a + b1x + b2x² + b3x³ + ... + b20x²⁰
Too complex → High variance
Best Model
Price = a + b1x + b2x²
Balanced.
Mathematical View
Total prediction error can be decomposed as:
Total Error = Bias² + Variance + Noise
Where:
- Bias² → error from wrong assumptions
- Variance → error from model sensitivity
- Noise → unavoidable randomness in data
Why This Matters for AI / Deep Learning
Almost every technique in modern AI deals with this tradeoff.
Examples:
- Regularization → reduces variance
- Scaling up model size → reduces bias
Even LLM training is influenced by this principle.
Real life example / Intuition
Imagine learning for an exam.
High Bias student
Studies only basic summary
→ cannot solve real questions
High Variance student
Memorizes exact past papers
→ fails when questions change
Balanced student
Understands concepts
→ performs well in new questions
▨ Feature Scaling
What
Normalize input features.
Why
Speeds up gradient descent.
Especially important for models that are sensitive to feature scale:
- KNN
- SVM
- Neural networks
How
Standardization:
z = (x − mean) / std
After scaling, each feature has:
Mean = 0
Std = 1
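A short sketch of standardization, done once by hand and once with scikit-learn's `StandardScaler`. The two-column array (house size, number of rooms) is invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up features on very different scales: [house size, number of rooms].
X = np.array([[50.0, 1.0], [80.0, 2.0], [120.0, 3.0], [200.0, 4.0]])

# By hand: subtract each column's mean, divide by its standard deviation.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

# Same transformation with scikit-learn.
scaled = StandardScaler().fit_transform(X)

print(np.round(scaled, 3))
print("column means:", scaled.mean(axis=0))
print("column stds: ", scaled.std(axis=0))
```

In practice the scaler is fitted on the training set only and then applied to the test set, so no information leaks from test data.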
▨ Train-Test Split
What
Divide dataset into:
- Training
- Testing
Why
To measure generalization.
Overfitting happens when:
Model memorizes training data.
How
- Train-test split - The dataset is randomly divided into two subsets: a training set (typically 70-80% of the data) for the model to learn from, and a separate testing set (20-30%) to evaluate its final performance on unseen data. This method is simple, fast, and ideal for large datasets or computationally expensive models.
- Train-Validation-Test split - The data is divided into three parts: a training set to fit the model, a validation set to tune hyperparameters and perform model selection, and a final, held-out test set for an unbiased evaluation of the final model's performance.
- Cross Validation (k-fold cross-validation) - The dataset is divided into k equally sized "folds". The model is trained k times; each time, one fold is used as the test set and the remaining k-1 folds as the training set. The k performance results are then averaged to provide a more robust estimate of model performance.
▨ Cross Validation
What
Cross Validation is a technique used to evaluate how well a machine learning model will perform on unseen data.
Instead of splitting the dataset once into train/test, we split it multiple times and train/test the model several times.
This gives a more reliable estimate of model performance.
Why
If we use only one train-test split, the result may depend too much on which data points ended up in the test set.
Example:
Dataset = 100 samples
Train = 80
Test = 20
Maybe those 20 test samples were very easy or very difficult, which can give a misleading accuracy.
Cross-validation solves this by testing the model on multiple different splits.
How
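The most common method is K-Fold Cross Validation:
1. Split the dataset into k equally sized folds.
2. Train the model on k-1 folds and test it on the remaining fold.
3. Repeat k times so that each fold is used as the test set once.
4. Average the k results.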
Visual Intuition
Dataset
|F1|F2|F3|F4|F5|
Test each fold once
Train on the remaining folds
Why Cross Validation is Important
It helps with:
- Better model evaluation
- Detecting overfitting
- Choosing the best hyperparameters
- Selecting the best model
Example:
Model A → 88%
Model B → 92%
Choose Model B
Where It Fits in ML
Dataset
│
Train/Test Split
│
Cross Validation
│
Model Evaluation
│
Model Selection
Example in Python (scikit-learn)
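A minimal sketch of 5-fold cross validation with scikit-learn's `cross_val_score`. The regression dataset is synthetic (from `make_regression`), and the choice of `LinearRegression` and R² scoring is just an assumption for the demo.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data for illustration.
X, y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=0)

# 5 folds, shuffled so each fold is a random slice of the dataset.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# One R² score per fold: train on 4 folds, test on the held-out fold.
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

print("per-fold R²:", scores)
print("mean R²:    ", scores.mean())
```

Looking at the spread of the per-fold scores, not just the mean, shows how much the result depends on the split.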
Simple One-Line Intuition
Cross Validation = testing the model multiple times on different parts of the dataset to get a reliable performance estimate.
▨ Model Evaluation Techniques
Model Evaluation
│
├── Data Splitting Methods
│   ├── Train–Test Split
│   ├── Train–Validation–Test Split
│   └── Cross Validation
│       ├── K-Fold Cross Validation
│       ├── Stratified K-Fold
│       └── Leave-One-Out CV
│
└── Evaluation Metrics
    ├── Regression Metrics
    │   ├── MSE
    │   ├── RMSE
    │   ├── MAE
    │   └── R² Score
    │
    └── Classification Metrics
        ├── Accuracy
        ├── Precision
        ├── Recall
        ├── F1 Score
        └── ROC-AUC
So evaluation has two parts.
1. Data Splitting Techniques
These determine how we test the model.
Train–Test Split
Simplest method.
Dataset
│
├── Train
└── Test
Used to check generalization.
Train–Validation–Test Split
Used when we need hyperparameter tuning.
Dataset
│
├── Train
├── Validation
└── Test
Cross Validation
Used to get more reliable performance estimates.
Example (5-fold):
|F1|F2|F3|F4|F5|
Each fold becomes test once.
2. Evaluation Metrics
These measure how good the model predictions are.
Regression Metrics
- MSE, RMSE, MAE, R² Score
Classification Metrics
- Accuracy, Precision, Recall, F1 Score, ROC-AUC
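For the classification side, here is a small sketch using scikit-learn's metric functions on hand-made labels (1 = spam, 0 = not spam). The labels are invented so the numbers are easy to check by hand: 3 true positives, 1 false negative, 1 false positive, 5 true negatives.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up ground truth and predictions (1 = spam, 0 = not spam).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

accuracy = accuracy_score(y_true, y_pred)     # correct / total = 8 / 10
precision = precision_score(y_true, y_pred)   # TP / (TP + FP) = 3 / 4
recall = recall_score(y_true, y_pred)         # TP / (TP + FN) = 3 / 4
f1 = f1_score(y_true, y_pred)                 # harmonic mean of precision and recall

print(accuracy, precision, recall, f1)
```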
Simple Mental Model
Think of evaluation as two questions.
1️⃣ How do we test the model?
→ Train-Test Split
→ Cross Validation
2️⃣ How do we measure performance?
→ Accuracy
→ MSE
→ F1 Score
Simple Example
Suppose we build a house price prediction model.
Evaluation pipeline:
Dataset
↓
Train-Test Split
↓
Train Model
↓
Predict Test Data
↓
Compute RMSE
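The pipeline above could look roughly like this in scikit-learn. This is a sketch under stated assumptions: the house data is synthetic (price is a noisy linear function of size), and the 80/20 split and `LinearRegression` are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Dataset: synthetic house sizes and prices (invented for illustration).
rng = np.random.default_rng(42)
size = rng.uniform(50, 250, size=200).reshape(-1, 1)
price = 1000.0 * size.ravel() + 50_000.0 + rng.normal(scale=10_000.0, size=200)

# Train-Test Split: hold out 20% of the rows.
X_train, X_test, y_train, y_test = train_test_split(
    size, price, test_size=0.2, random_state=42
)

# Train Model.
model = LinearRegression().fit(X_train, y_train)

# Predict Test Data, then Compute RMSE.
y_pred = model.predict(X_test)
rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))

print("test RMSE:", round(rmse, 2))
```

Because the noise in this synthetic data has a standard deviation of 10,000, a test RMSE in that neighborhood means the model has learned about as much as the data allows.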
One-Line Summary
Model evaluation = data splitting technique + evaluation metric.
◻️ 2. Classification
Classification means:
Predicting a category or class label.
Examples:
- Email → Spam / Not Spam
- Image → Cat / Dog
- Medical test → Disease / No Disease
▨ 1. Linear Classifiers
A linear classifier separates classes using a straight line (or hyperplane).
Example in 2D:
Cats ● ● ● ●
Dogs ○ ○ ○ ○
Decision boundary:
-----------------------
In higher dimensions this becomes a hyperplane.
Key idea
y = w₁x₁ + w₂x₂ + ... + b
If the result is above a threshold → Class A
Else → Class B
i. Logistic Regression
ii. Single-Layer Perceptron
iii. Linear SVM
iv. Naive Bayes
v. SGD Classifier
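As one concrete instance of the weighted-sum-plus-threshold idea, here is a minimal sketch with scikit-learn's `LogisticRegression`. The six 2D points are made up, and the manual score shows that the model's prediction is just the linear score compared against zero.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up 2D points: class 0 near (1.5, 1.5), class 1 near (6.5, 6.5).
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
              [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

# The linear score w1*x1 + w2*x2 + b; a positive score means class 1.
score = X @ clf.coef_.ravel() + clf.intercept_[0]
manual_labels = (score > 0).astype(int)

print("weights:", clf.coef_.ravel(), "bias:", clf.intercept_[0])
print("manual labels:", manual_labels)
print("model labels: ", clf.predict(X))
```

Logistic regression additionally passes the score through a sigmoid to turn it into a probability, but the decision boundary itself stays a straight line.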
▨ 2. Non-Linear Classifiers
When data cannot be separated with a straight line, we need non-linear models.
Example:
○ ○ ○
○ ● ● ○
○ ○ ○
A line cannot separate them.
i. Kernel SVM
ii. Decision Tree
iii. Random Forest
iv. K Nearest Neighbors (KNN)
v. Gradient Boosting
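To illustrate the difference, this sketch compares a linear classifier with a kernel SVM on data shaped like the picture above: one class inside a ring of the other. The dataset comes from scikit-learn's `make_circles`; the specific parameters are assumptions chosen to make the rings clearly separated.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data: an inner cluster surrounded by an outer ring.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A linear model can only draw a straight line through the plane.
linear_acc = LogisticRegression().fit(X_train, y_train).score(X_test, y_test)

# An RBF-kernel SVM can draw a closed, curved boundary.
kernel_acc = SVC(kernel="rbf").fit(X_train, y_train).score(X_test, y_test)

print("linear classifier accuracy:", linear_acc)
print("kernel SVM accuracy:       ", kernel_acc)
```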
# Unsupervised Machine Learning Algos
What
Unsupervised Learning is a type of machine learning where the data does not contain labels or correct answers.
Example dataset:
There is no label like "rich" or "poor".
The algorithm must discover patterns on its own.
Why
Many real-world datasets do not have labeled data because labeling is expensive or impossible.
Examples:
- Customer segmentation
- Detecting fraud
- Finding similar documents
- Reducing data size
- Detecting unusual behavior
How
Unsupervised learning works by finding structure in the data, such as:
- grouping similar data → Clustering
- compressing information → Dimensionality Reduction
- discovering relationships → Association Rules
- detecting unusual points → Anomaly Detection
So the structure becomes:
Unsupervised Learning
│
├── Clustering
├── Dimensionality Reduction
├── Association Rules
└── Anomaly Detection
Now let's go deeper.
1. Clustering
What
Clustering means grouping similar data points together.
Example:
Customer data may automatically group into:
Group 1 → Students
Group 2 → Working professionals
Group 3 → High-income customers
Even though the model was never told these labels.
Why
Used for:
- Customer segmentation
- Social network analysis
- Document grouping
- Image segmentation
- Market analysis
How
Clustering algorithms measure distance or similarity between data points.
Common distance metric:
Euclidean Distance: d(p, q) = √Σ (pᵢ − qᵢ)²
Points closer together → same cluster.
i. K-Means Clustering
ii. Hierarchical Clustering
iii. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
iv. Mean Shift
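A minimal K-Means sketch with scikit-learn's `KMeans`, assuming three well-separated synthetic blobs. The model is told only how many clusters to look for, never which point belongs where.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three synthetic groups of 2D points (invented for illustration).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
group_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
group_c = rng.normal(loc=[0.0, 5.0], scale=0.5, size=(50, 2))
X = np.vstack([group_a, group_b, group_c])

# Fit K-Means with k=3; no labels are passed in.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_

print("cluster centers:\n", np.round(kmeans.cluster_centers_, 2))
print("cluster sizes:", np.bincount(labels))
```

The cluster numbers themselves (0, 1, 2) are arbitrary; only the grouping matters.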
2. Dimensionality Reduction
What
Reduces the number of features (variables) in a dataset while preserving important information.
Example:
Original features:
Height
Weight
Age
Income
Education
Country
Purchasing behavior
Reduce to:
2 or 3 dimensions
Why
Reasons:
- High dimensional data is hard to visualize
- Training becomes slow
- Some features are redundant
- Avoid curse of dimensionality
How
Algorithms transform data into lower dimensional space.
PCA (Principal Component Analysis)
What
Transforms data into new axes called principal components.
These components capture maximum variance in the data.
Why
Used for:
- noise reduction
- data compression
- visualization
- preprocessing before ML
How
Steps:
- Standardize data
- Compute covariance matrix
- Compute eigenvectors
- Choose top components
- Project data onto them
Example:
10 features → reduce to 2
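The five steps can be followed almost line by line in NumPy. This is a sketch on synthetic data (10 random features, two of them made strongly correlated); in practice `sklearn.decomposition.PCA` does the same job with a more robust implementation.

```python
import numpy as np

# Synthetic data: 200 samples, 10 features, with features 0 and 1 correlated.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X[:, 1] = 2.0 * X[:, 0] + 0.1 * X[:, 1]

# 1. Standardize each feature.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Compute the covariance matrix (features as columns).
cov = np.cov(Z, rowvar=False)

# 3. Compute eigenvalues and eigenvectors (eigh suits symmetric matrices).
eigvals, eigvecs = np.linalg.eigh(cov)

# 4. Choose the top 2 components by eigenvalue (largest variance first).
order = np.argsort(eigvals)[::-1][:2]
components = eigvecs[:, order]

# 5. Project the data onto them: 10 features -> 2.
X_reduced = Z @ components

print("reduced shape:", X_reduced.shape)
print("share of variance kept:", eigvals[order].sum() / eigvals.sum())
```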
t-SNE (t-distributed Stochastic Neighbor Embedding)
What
A dimensionality reduction algorithm designed for visualizing high-dimensional data.
Often used in 2D or 3D visualization.
Why
Preserves local relationships between points.
Example:
Used in visualizing:
- word embeddings
- image embeddings
- neural network features
How
Instead of preserving global distances, it tries to keep similar points close together.
Commonly used for visualizing deep learning embeddings.
UMAP (Uniform Manifold Approximation and Projection)
What
A modern dimensionality reduction technique.
Similar to t-SNE but typically faster and more scalable.
Why
Advantages over t-SNE:
- faster
- preserves more global structure
- works well on large datasets
Used heavily in:
- bioinformatics
- embedding visualization
- deep learning pipelines
How
UMAP builds a graph of data points and then finds a lower dimensional representation preserving structure.
3. Association Rules
What
Finds relationships between items in datasets.
Classic example:
Market Basket Analysis
Example rule:
If a person buys bread → they are likely to also buy butter
Why
Used for:
- recommendation systems
- product placement
- shopping analysis
Example:
Amazon: "Customers also bought..."
How
Uses metrics:
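- Support: how often the items appear together across all transactions
- Confidence: how often the rule holds when the left-hand item is present
- Lift: how much more often the items occur together than expected if they were independent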
Example rule:
Bread → Butter
Apriori Algorithm
What
Finds frequent itemsets in transactional data.
Why
Used in market basket analysis.
Example:
Transactions:
Milk, Bread
Milk, Butter
Bread, Butter
Milk, Bread, Butter
Apriori finds frequent combinations.
How
Key idea:
If a set is frequent
→ all subsets must also be frequent
Steps:
- Find frequent items
- Build larger itemsets
- Generate association rules
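Here is a compact, unoptimized sketch of the Apriori idea on the four transactions listed above. The function names and the `min_support=0.5` threshold are hypothetical choices for illustration; real projects would normally use a library implementation instead.

```python
from itertools import combinations

transactions = [
    {"Milk", "Bread"},
    {"Milk", "Butter"},
    {"Bread", "Butter"},
    {"Milk", "Bread", "Butter"},
]


def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def apriori(min_support):
    """Level-wise search: build k-itemsets only from frequent (k-1)-itemsets."""
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([i]) for i in items if support([i]) >= min_support]
    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        k += 1
        # Join step: combine frequent itemsets into candidates of size k.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: keep a candidate only if all its subsets are frequent.
        current = [
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
            and support(c) >= min_support
        ]
    return frequent


frequent = apriori(min_support=0.5)

# Rule metrics for Bread -> Butter.
confidence = support({"Bread", "Butter"}) / support({"Bread"})
lift = confidence / support({"Butter"})

print("frequent itemsets:", {tuple(sorted(k)): v for k, v in frequent.items()})
print("confidence:", confidence, "lift:", lift)
```

In this tiny dataset the lift of Bread → Butter comes out below 1, meaning the two items appear together slightly less often than independence would predict.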
Eclat Algorithm
What
Another method for finding frequent itemsets.
Why
Often faster than Apriori on large datasets, because it avoids repeated database scans.
How
Instead of scanning database repeatedly:
Eclat uses vertical data format
Example:
Milk → T1, T2, T4
Bread → T1, T3, T4
Then intersections are used to compute frequency.
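The intersection step is easy to show with Python sets, using the two transaction-ID lists from the example above. This is only the core idea, not a full Eclat implementation.

```python
# Vertical format: each item maps to the set of transaction IDs containing it.
tidsets = {
    "Milk": {"T1", "T2", "T4"},
    "Bread": {"T1", "T3", "T4"},
}

# Transactions containing both items = intersection of the two ID sets.
milk_and_bread = tidsets["Milk"] & tidsets["Bread"]
support_count = len(milk_and_bread)

print("transactions with Milk and Bread:", sorted(milk_and_bread))
print("support count:", support_count)
```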
4. Anomaly Detection
What
Detects unusual or rare data points.
Example:
Normal credit card transactions → $20, $50, $30
Anomaly → $10,000 purchase in another country
Why
Used in:
- fraud detection
- cybersecurity
- medical diagnosis
- manufacturing defects
How
The model learns normal behavior, then detects points that deviate.
Isolation Forest
What
A tree-based algorithm for detecting anomalies.
Why
Efficient for large datasets.
How
Key idea:
Anomalies are easier to isolate
Example:
normal points → need many splits
anomalies → isolated quickly
Shorter path length in tree → anomaly.
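A hedged sketch with scikit-learn's `IsolationForest`, echoing the credit card example: 200 synthetic "normal" amounts around $40 plus one $10,000 purchase. The data is invented, and the default settings are used without tuning.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic transaction amounts: mostly small, plus one extreme purchase.
rng = np.random.default_rng(0)
normal_amounts = rng.normal(loc=40.0, scale=10.0, size=200)
amounts = np.append(normal_amounts, 10_000.0).reshape(-1, 1)

forest = IsolationForest(random_state=0).fit(amounts)

# predict() returns -1 for anomalies and 1 for normal points.
outlier_label = int(forest.predict([[10_000.0]])[0])

# decision_function(): lower values mean "easier to isolate", i.e. more anomalous.
outlier_score = float(forest.decision_function([[10_000.0]])[0])
typical_score = float(forest.decision_function([[40.0]])[0])

print("label for $10,000:", outlier_label)
print("score for $10,000:", outlier_score)
print("score for $40:    ", typical_score)
```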
One-Class SVM
What
A variation of Support Vector Machine used for anomaly detection.
Why
Used when only normal data is available.
Example:
Train on normal behavior, detect abnormal points.
How
The algorithm finds a boundary around normal data.
Inside boundary → normal
Outside boundary → anomaly
Simple Big Picture
Unsupervised Learning
│
├── Clustering
│ ├── K-Means
│ ├── Hierarchical
│ ├── DBSCAN
│ └── Mean Shift
│
├── Dimensionality Reduction
│ ├── PCA
│ ├── t-SNE
│ └── UMAP
│
├── Association Rules
│ ├── Apriori
│ └── Eclat
│
└── Anomaly Detection
├── Isolation Forest
└── One-Class SVM
The END… till then, Happy learning!