The “What, Why, How” Guide to Machine Learning - ML Basics

Machine Learning Foundations - From Zero to Linear Models

If your end goal is Generative AI, you must understand one truth:

Large Language Models are not magic.

Before Transformers, before Attention, before LLMs — there was Linear Regression.

This blog builds your foundation the right way using:

What → Why → How

What is Artificial Intelligence (AI)?

AI = Artificial + Intelligence


What is “Artificial”?

Artificial means:

Example:

So AI systems are engineered systems that mimic certain capabilities of humans.


What is “Intelligence”?

Intelligence means the ability to:

In humans:

In machines:


So What is AI?

Artificial Intelligence = Artificial systems that can perform tasks requiring intelligence.

Examples:

Important:

AI does not mean consciousness.

It means decision-making or learning capability.


Now: ML vs DL vs AI vs GenAI

Think of it like nested circles.

AI is the big field.

ML is one way to achieve AI.

Deep Learning is one type of ML.

Generative AI is built mostly on Deep Learning.


Simple Concept Diagram

Artificial Intelligence (AI)
│
├── Rule-Based Systems
│
└── Machine Learning (ML)
     │
     ├── Classical ML (SVM, Trees, etc.)
     │
     └── Deep Learning (DL)
          │
          ├── CNNs
          ├── RNNs
          └── Transformers
                │
                ├── Large Language Models (LLMs)
                │
                └── Generative AI

Visual Nested View (Very Simple)

AI
 └── ML
      └── Deep Learning
            └── Transformers
                  └── LLMs
                        └── Generative AI

Where Does Generative AI Fit?

Generative AI is:

It focuses on:

Example:


One-Line Difference Summary

# Introduction to Machine Learning


What

Machine Learning (ML) is a branch of Artificial Intelligence where systems learn patterns from data instead of being explicitly programmed with rules.

Traditional Programming:

Rules + Data → Output

Machine Learning:

Data + Output → Model
Model + New Data → Prediction

Instead of writing rules manually, we let the machine discover patterns.

Example:

Instead of writing:

if email_contains "lottery" → spam

We give the system:

The model learns patterns on its own.


Why

Because real-world problems are too complex for rule-based systems.

Consider:

You cannot manually write rules for:

Data-driven learning scales.

Manual rule-writing does not.

In Generative AI:

We don’t write rules for writing poetry.

We train models on massive datasets.


How

The core ML workflow:

This entire loop is used even when training billion-parameter models.


# Why Machine Learning?


What

A philosophical question:

Why do we need ML instead of traditional programming?


Why

Three core reasons:

1️⃣ No Clear Rules

How do you write rules to:

Impossible manually.


2️⃣ Massive Data Availability

Internet created:

ML thrives on data.


3️⃣ Adaptability

ML models:

Traditional systems:


How

Think of ML as:

Function Approximation

We try to approximate an unknown function:

Generative AI is:

Same principle. Bigger scale.


# Types of Machine Learning


🔹 Supervised Learning


What

Learning from labeled data.

Input (X) → Output (Y)

Example:


Why

Used when we know the correct answers.


How

Two main categories:

1️⃣ Regression

2️⃣ Classification


🔹 Unsupervised Learning


What

Learning patterns without labels.


Why

Often, labels don’t exist.

Example:

Customer segmentation: the process of dividing a customer base into distinct groups (segments) that share similar characteristics, such as demographics, behaviors, or needs, to enable targeted marketing, better product development, and higher conversion rates.


How

Model finds:

Used in:


🔹 Reinforcement Learning


What

Learning through reward and penalty.


Why

Used in:


How

Agent:

  1. Takes action
  2. Gets reward
  3. Updates policy

LLMs use:

Reinforcement Learning from Human Feedback (RLHF).


# Supervised Machine Learning Algos


◻️ 1. Linear Regression - The Foundation

What

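Linear Regression is one of the fundamental supervised machine learning algorithms.

A model that describes the relationship between variables using a straight line:

y = wx + b

Where:

x → input (feature)
y → predicted output
w → weight (slope)
b → bias (intercept)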


Why

Why are we learning this?

At its core, this is about predicting a value (y) from an input (x).

The key point:

👉 The output y is continuous (not categories).

Example: house price, temperature, salary, etc.

It is the simplest form of:

Parameter learning through optimization.

Understanding Linear Regression means understanding:

It is the seed of deep learning.


How

But how does it actually work?

We model the relationship like this:

y = wx + b

So basically:

👉 We take the input, scale it using a weight, and shift it using a bias.

Example:

Predict salary based on experience.

Model learns slope.

Training means:

Find best w and b.

What are we really learning?

We’re not learning “the answer” directly.

We’re learning:

👉 the best values of w and b such that predictions are as close as possible to real data.

So, basically: we learn the parameters (w, b) by minimizing error through optimization.


▨ Ordinary Least Squares (OLS)


What

A method to find the best-fitting line by minimizing squared errors.

Ordinary Least Squares (OLS) is a mathematical method used to find the best-fitting line in Linear Regression.

It chooses the line that minimizes the total squared difference between the predicted values and the actual values.

Those differences are called errors or residuals.

Example:

Suppose we want to predict house price from size.

OLS finds the line:

price = w × size + b

that best fits these points.


Why

We need a mathematical definition of "best fit."

In real data, points never lie perfectly on a straight line.

Example:

   *
        *
 *
           *

If we draw a line at random, the prediction errors will usually be large.

OLS helps us find the optimal line such that the sum of squared errors:

Σ (yᵢ − ŷᵢ)²   (yᵢ = actual value, ŷᵢ = predicted value)

is as small as possible.

Why squared error?

  1. Prevents positive and negative errors from canceling out
  2. Penalizes large mistakes more heavily
  3. Makes the math easier to optimize

How

Two approaches:

1. Closed Form Solution

The Closed Form Solution is a direct mathematical formula that calculates the optimal regression parameters in one step using linear algebra.

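For linear regression, the parameters are computed as:

θ = (XᵀX)⁻¹ Xᵀy

Where:

θ → vector of parameters (weights and bias)
X → matrix of input features (one row per sample, plus a column of ones for the bias)
y → vector of actual target values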
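To make the closed-form idea concrete, here is a minimal sketch that evaluates the normal equation θ = (XᵀX)⁻¹Xᵀy with NumPy. The four data points are made up for illustration (they follow y = 2x + 1 exactly), and the variable names are assumptions, not part of any library.

```python
import numpy as np

# Hypothetical data: y = 2x + 1 exactly, so the fit should recover w = 2, b = 1.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Design matrix with a column of ones so the bias b is learned as a parameter.
X = np.column_stack([np.ones_like(x), x])

# Normal equation: theta = (X^T X)^(-1) X^T y.
# np.linalg.solve solves the same system without forming the explicit inverse.
theta = np.linalg.solve(X.T @ X, X.T @ y)
b, w = theta
print(f"w = {w:.3f}, b = {b:.3f}")
```

Solving the linear system instead of inverting XᵀX is a common numerical choice; the result is the same line.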

Limitations:

2. Gradient Descent

Iterative optimization.

Gradient Descent is an iterative optimization algorithm that gradually updates the model parameters to minimize the cost function.

Instead of solving a formula directly, it takes small steps toward the minimum error.


Why

Closed form becomes impractical for large datasets or high-dimensional data.

Gradient descent works well because:

This is why deep learning and neural networks rely on gradient descent.

How

Step-by-step:


Intuition

Imagine you are standing on a mountain trying to reach the lowest valley.
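The "small steps downhill" idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the data is made up (it follows y = 2x + 1), the cost is mean squared error, and the learning rate and number of steps are hand-picked values, not recommendations.

```python
# Hypothetical data generated from y = 2x + 1 (no noise).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

w, b = 0.0, 0.0          # start from an arbitrary guess
learning_rate = 0.05     # step size (an assumed value, tune per problem)
n = len(xs)

for step in range(5000):
    # Gradients of the mean squared error with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    # Step downhill: move each parameter against its gradient.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"w = {w:.3f}, b = {b:.3f}")
```

With a learning rate that is too large the updates overshoot and diverge; with one that is too small, training takes many more steps.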


▨ Cost Functions


What

Function that measures model error.

A cost function is a mathematical formula that quantifies the difference between a machine learning model's predicted values and the actual target values.

Loss vs. Cost: While often used interchangeably, a "loss function" typically refers to the error for a single data point, whereas a "cost function" is the average or sum of these losses over the entire dataset.
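The loss-versus-cost distinction can be shown with squared error. The function names and the three data points below are illustrative assumptions; the cost here is the average (mean squared error) rather than the sum.

```python
def squared_loss(y_true, y_pred):
    """Loss: the error for a single data point."""
    return (y_true - y_pred) ** 2


def mse_cost(y_true_list, y_pred_list):
    """Cost: the average of the per-point losses over the dataset."""
    losses = [squared_loss(t, p) for t, p in zip(y_true_list, y_pred_list)]
    return sum(losses) / len(losses)


# Hypothetical actual values and predictions.
actual = [3.0, 5.0, 7.0]
predicted = [2.5, 5.0, 8.0]

cost = mse_cost(actual, predicted)
print(cost)  # (0.25 + 0.0 + 1.0) / 3
```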


Why

Without a cost function:

We don’t know how wrong we are.

Cost function guides learning.


How

Training = minimize cost.

In Neural Networks:

Same philosophy.


▨ Regularization (Ridge, Lasso, ElasticNet)


What

Add a penalty for large weights.

Regularization in Machine Learning is a technique used to prevent a model from memorizing the training data too much so that it can work well on new, unseen data.

Why/The Problem: Overfitting

Sometimes a model learns the training data too perfectly. It memorizes noise and small details that don’t matter.

This problem is called Overfitting.

Example:

A model that overfits behaves the same way.

Imagine fitting a line through data.

Without regularization:

Model tries to perfectly fit every point

With regularization:

Model tries to capture the overall trend

Regularization forces the model to prefer simpler solutions.


What Regularization Does

Regularization adds a penalty to the model for becoming too complex.

So the model is encouraged to:

Think of it like:

A teacher telling you:

Simple Analogy

Imagine drawing a curve to fit points on a graph.

Without regularization:

With regularization:


Types of Regularization

1. Ridge Regression (L2 Regularization)

2. Lasso Regression (L1 Regularization)

3. ElasticNet
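As a rough sketch of how the three types differ, the penalty terms can be written out by hand. This assumes a mean-squared-error data term and one common way of mixing the penalties (`alpha` for strength, `l1_ratio` for the L1/L2 mix); exact scaling constants vary between textbooks and libraries, so treat the formula as illustrative.

```python
def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def l2_penalty(weights):
    """Ridge-style penalty: sum of squared weights."""
    return sum(w ** 2 for w in weights)


def l1_penalty(weights):
    """Lasso-style penalty: sum of absolute weights."""
    return sum(abs(w) for w in weights)


def regularized_cost(y_true, y_pred, weights, alpha=1.0, l1_ratio=0.0):
    """Data-fit error plus a weight penalty.

    l1_ratio=0 gives a pure L2 (Ridge) penalty, l1_ratio=1 a pure L1 (Lasso)
    penalty, and anything in between an ElasticNet-style mix.
    """
    penalty = l1_ratio * l1_penalty(weights) + (1 - l1_ratio) * l2_penalty(weights)
    return mse(y_true, y_pred) + alpha * penalty


weights = [3.0, -4.0]
print(l2_penalty(weights))  # 25.0
print(l1_penalty(weights))  # 7.0
```

Because the penalty grows with the weights, the optimizer has to trade a slightly worse fit on the training data for smaller, simpler parameters.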


▨ Polynomial Regression


What

Extend linear regression with higher degree terms:

y = ax² + bx + c

Polynomial regression is a supervised machine learning algorithm used to model non-linear relationships by fitting a curved line to data. It transforms input features into polynomial terms (e.g., squaring or cubing inputs) to model complex patterns, such as growth rates or rapid, non-linear changes in data. It is frequently used for regression tasks that are more complex than simple linear regression.


Why

Real-world relationships are nonlinear.

Example: Predicting Salary from Experience:

Sometimes the relationship is not a straight line.

A straight line (linear regression) can’t capture this curve properly


Polynomial Regression Idea:

Instead of just using x, we also use higher powers:

This makes the model curved instead of straight.


Concrete Data Example:

👉 Notice:


What the Model Learns:

Instead of:

y = wx + b   (straight line ❌)

It learns something like:

y = 0.5x² + 1.2x + 2   (curve ✅)

👉 Now the curve bends upward and fits the data better.


Intuition (Important):

So:

👉 Polynomial regression = linear model on transformed features (x, x², x³...)


Real-world Use Cases


How

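Use PolynomialFeatures from scikit-learn (sklearn.preprocessing) to generate the polynomial terms, then fit an ordinary linear regression on them.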

Risk:

High degree → overfitting.
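Here is a small sketch of PolynomialFeatures in use, assuming scikit-learn and NumPy are installed. The data is synthetic and follows the example curve y = 0.5x² + 1.2x + 2 exactly, so the fitted coefficients should come out close to those values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data that follows y = 0.5x^2 + 1.2x + 2 exactly.
x = np.arange(0, 10, dtype=float).reshape(-1, 1)
y = 0.5 * x.ravel() ** 2 + 1.2 * x.ravel() + 2

# degree=2 adds an x^2 column; the regression itself stays linear.
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
model.fit(x, y)

linreg = model.named_steps["linearregression"]
print(linreg.coef_, linreg.intercept_)  # roughly [1.2, 0.5] and 2.0
```

Raising `degree` makes the curve more flexible, which is exactly where the overfitting risk comes from.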


▨ Bias-Variance Tradeoff


What

Bias means the model is too simple to learn the true pattern in the data. The model makes strong assumptions.

Variance means the model is too sensitive to the training data. It learns noise instead of the real pattern.

The Bias–Variance Tradeoff describes the balance between:

Bias  → error from overly simple models (Model too simple → underfitting.)
Variance → error from overly complex models (Model too complex → overfitting.)

A good model must balance both.

Too much of either leads to poor predictions.

LLMs reduce bias by scaling model size.

Regularization reduces variance.


Why

Explains model performance behavior.


1. Bias

What

Bias means the model is too simple to learn the true pattern in the data.

The model makes strong assumptions.

Example:

Trying to fit a straight line to data that is actually curved.

Actual pattern:
      *
   *
 *
      *
   *

Model prediction (line):
---------

The model cannot capture the real pattern.


Result

High Bias → Underfitting

The model performs badly on both training and test data.


2. Variance

What

Variance means the model is too sensitive to the training data.

It learns noise instead of the real pattern.

Example:

Training points:
   *
      *
 *
        *

The model creates a very wiggly curve to match every point.

Crazy curve:
   *__
      \__
  *      \_
        *   \__

Result

High Variance → Overfitting

The model performs:

Training accuracy → very high
Test accuracy → very low

Visual Intuition

Model Complexity
        ↑

Underfitting       Good Model        Overfitting
High Bias          Balanced          High Variance

     |                  |                 |
-----|------------------|-----------------|-----

Why It's Called a Tradeoff

Reducing one usually increases the other.

Example:

So ML training tries to find the sweet spot.


How We Control Bias and Variance

Several techniques help manage this balance.

Example

Suppose you predict house prices.

Model A

Price = a + b * size

Too simple → High bias


Model B

Price = a + b1x + b2x² + b3x³ + ... + b20x²⁰

Too complex → High variance


Best Model

Price = a + b1x + b2x²

Balanced.


Mathematical View

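Total prediction error can be approximated as:

Total Error = Bias² + Variance + Irreducible Error

Where:

Bias² → error from overly strong assumptions (model too simple)
Variance → error from sensitivity to the training data (model too complex)
Irreducible Error → noise in the data that no model can remove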


Why This Matters for AI / Deep Learning

Almost every technique in modern AI deals with this tradeoff.

Examples:

Even LLM training is influenced by this principle.


Real life example / Intuition

Imagine learning for an exam.

High Bias student

Studies only basic summary
→ cannot solve real questions

High Variance student

Memorizes exact past papers
→ fails when questions change

Balanced student

Understands concepts
→ performs well in new questions

▨ Feature Scaling

What

Rescale input features so they share a comparable range.


Why

Speeds up gradient descent.

Required for:


How

Standardization:

Mean = 0

Std = 1
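Standardization is usually written as z = (x − mean) / std. A minimal sketch with the standard library follows; the house sizes are made-up values, and the population standard deviation is used here.

```python
import statistics


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]


# Hypothetical feature: house sizes in square feet.
sizes = [800.0, 1200.0, 1500.0, 2500.0]
scaled = standardize(sizes)
print(scaled)
```

In practice the mean and standard deviation should be computed on the training set only and then reused for the test set, so that no information leaks from the test data.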


▨ Train-Test Split


What

Divide dataset into:


Why

To measure generalization.

Overfitting happens when:

Model memorizes training data.


How

  1. Train-test split - The dataset is randomly divided into two subsets: a training set (typically 70-80% of the data) for the model to learn from, and a separate testing set (20-30%) to evaluate its final performance on unseen data. This method is simple, fast, and ideal for large datasets or computationally expensive models.
  2. Train-Validation-Test split - The data is divided into three parts: a training set to fit the model, a validation set to tune hyperparameters and perform model selection, and a final, held-out test set for an unbiased evaluation of the final model's performance.
  3. Cross Validation (k-fold cross-validation) - The dataset is divided into k equally sized "folds". The model is trained k times; each time, one fold is used as the test set and the remaining k-1 folds as the training set. The k performance results are then averaged to provide a more robust estimate of model performance.
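The first two splitting strategies can be sketched with scikit-learn's `train_test_split`. The dataset below is a placeholder of 100 numbered samples, and the 60/20/20 proportions are just one reasonable choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder dataset: 100 samples, 1 feature.
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# Hold out 20% for testing; random_state fixes the shuffle for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Split the remaining 80 samples again to carve out a validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```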

▨ Cross Validation


What

Cross Validation is a technique used to evaluate how well a machine learning model will perform on unseen data.

Instead of splitting the dataset once into train/test, we split it multiple times and train/test the model several times.

This gives a more reliable estimate of model performance.


Why

If we use only one train-test split, the result may depend too much on which data points ended up in the test set.

Example:

Dataset = 100 samples

Train = 80
Test  = 20

Maybe those 20 test samples were very easy or very difficult, which can give a misleading accuracy.

Cross-validation solves this by testing the model on multiple different splits.


How

The most common method is K-Fold Cross Validation.


Visual Intuition

Dataset

|F1|F2|F3|F4|F5|

Test each fold once
Train on the remaining folds

Why Cross Validation is Important

It helps with:

  1. Better model evaluation
  2. Detecting overfitting
  3. Choosing the best hyperparameters
  4. Selecting the best model

Example:

Model A → 88%
Model B → 92%

Choose Model B

Where It Fits in ML

Dataset
   │
Train/Test Split
   │
Cross Validation
   │
Model Evaluation
   │
Model Selection

Example in Python (scikit-learn)
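A minimal sketch of 5-fold cross validation, assuming scikit-learn is installed. The data is synthetic (generated by `make_regression`), and a plain linear regression stands in for whatever model you actually want to evaluate.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (100 samples), a stand-in for a real dataset.
X, y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=0)

# 5 folds: each fold is the test set once, the other 4 are used for training.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

print(scores)         # one R² score per fold
print(scores.mean())  # averaged estimate of performance
```

The spread of the per-fold scores is informative too: a large spread suggests the result depends heavily on which samples land in the test fold.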


Simple One-Line Intuition

Cross Validation = testing the model multiple times on different parts of the dataset to get a reliable performance estimate.

▨ Model Evaluation Techniques

Model Evaluation
│
├── Data Splitting Methods
│   ├── Train–Test Split
│   ├── Train–Validation–Test Split
│   └── Cross Validation
│        ├── K-Fold Cross Validation
│        ├── Stratified K-Fold
│        └── Leave-One-Out CV
│
└── Evaluation Metrics
    ├── Regression Metrics
    │     ├── MSE
    │     ├── RMSE
    │     ├── MAE
    │     └── R² Score
    │
    └── Classification Metrics
          ├── Accuracy
          ├── Precision
          ├── Recall
          ├── F1 Score
          └── ROC-AUC

So evaluation has two parts.


1. Data Splitting Techniques

These determine how we test the model.

Train–Test Split

Simplest method.

Dataset
│
├── Train
└── Test

Used to check generalization.


Train–Validation–Test Split

Used when we need hyperparameter tuning.

Dataset
│
├── Train
├── Validation
└── Test

Cross Validation

Used to get more reliable performance estimates.

Example (5-fold):

|F1|F2|F3|F4|F5|

Each fold becomes test once.


2. Evaluation Metrics

These measure how good the model predictions are.


Regression Metrics

Classification Metrics


Simple Mental Model

Think of evaluation as two questions.

1️⃣ How do we test the model?
   → Train-Test Split
   → Cross Validation

2️⃣ How do we measure performance?
   → Accuracy
   → MSE
   → F1 Score

Simple Example

Suppose we build a house price prediction model.

Evaluation pipeline:

Dataset
   ↓
Train-Test Split
   ↓
Train Model
   ↓
Predict Test Data
   ↓
Compute RMSE
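The last step of this pipeline, computing the metric, can be sketched in plain Python. The prices and predictions below are invented numbers; the function computes RMSE along with the other regression metrics listed in the tree above.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE and R² for paired actual/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e ** 2 for e in errors) / n
    rmse = math.sqrt(mse)
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total variation
    ss_res = sum(e ** 2 for e in errors)                # unexplained variation
    r2 = 1 - ss_res / ss_tot
    return {"mse": mse, "rmse": rmse, "mae": mae, "r2": r2}


# Hypothetical test-set prices (in thousands) and model predictions.
actual = [200.0, 250.0, 300.0, 350.0]
predicted = [210.0, 240.0, 310.0, 340.0]

metrics = regression_metrics(actual, predicted)
print(metrics)  # mse=100.0, rmse=10.0, mae=10.0, r2=0.968
```

RMSE is in the same unit as the target (here, thousands), which often makes it easier to interpret than MSE.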

One-Line Summary

Model evaluation = data splitting technique + evaluation metric.



◻️ 2. Classification


Classification means:

Predicting a category or class label.

Examples:

▨ 1. Linear Classifiers

A linear classifier separates classes using a straight line (or hyperplane).

Example in 2D:

Cats  ● ● ● ●
Dogs  ○ ○ ○ ○

Decision boundary:
-----------------------

In higher dimensions this becomes a hyperplane.

Key idea

y = w₁x₁ + w₂x₂ + ... + b

If the result is above a threshold → Class A

Else → Class B
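The decision rule above fits in a few lines. The weights, bias, and threshold here are hand-picked for illustration; a real linear classifier learns them from labeled data.

```python
def linear_classifier(features, weights, bias, threshold=0.0):
    """Return "A" if the weighted sum exceeds the threshold, else "B"."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return "A" if score > threshold else "B"


# Hypothetical parameters; in practice they are learned during training.
weights = [1.0, -1.0]
bias = 0.0

print(linear_classifier([3.0, 1.0], weights, bias))  # score = 2.0  -> "A"
print(linear_classifier([1.0, 3.0], weights, bias))  # score = -2.0 -> "B"
```

The algorithms listed next differ mainly in how they choose the weights and bias, not in the shape of this decision rule.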

i. Logistic Regression

ii. Single-Layer Perceptron

iii. Linear SVM

iv. Naive Bayes

v. SGD Classifier

▨ 2. Non-Linear Classifiers

When data cannot be separated with a straight line, we need non-linear models.

Example:

    ○ ○ ○
   ○ ● ● ○
    ○ ○ ○

A line cannot separate them.


i. Kernel SVM

ii. Decision Tree

iii. Random Forest

iv. K Nearest Neighbors (KNN)

v. Gradient Boosting

# Unsupervised Machine Learning Algos


What

Unsupervised Learning is a type of machine learning where the data does not contain labels or correct answers.

Example dataset:

There is no label like "rich" or "poor".

The algorithm must discover patterns on its own.


Why

Many real-world datasets do not have labeled data because labeling is expensive or impossible.

Examples:


How

Unsupervised learning works by finding structure in the data, such as:

So the structure becomes:

Unsupervised Learning
│
├── Clustering
├── Dimensionality Reduction
├── Association Rules
└── Anomaly Detection

Now let's go deeper.

1. Clustering

What

Clustering means grouping similar data points together.

Example:

Customer data may automatically group into:

Group 1 → Students
Group 2 → Working professionals
Group 3 → High-income customers

Even though the model was never told these labels.


Why

Used for:


How

Clustering algorithms measure distance or similarity between data points.

Common distance metric:

Euclidean Distance

Points closer together → same cluster.
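As a small illustration of distance-based grouping, the sketch below assigns a point to the nearest of three cluster centers using Euclidean distance. The centers and the (age, income) features are invented for the example; a real algorithm such as K-Means would learn the centers from data.

```python
import math


def nearest_cluster(point, centroids):
    """Return the index of the centroid closest to the point."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))


# Hypothetical cluster centers in (age, income in thousands) space.
centroids = [(20.0, 15.0), (35.0, 60.0), (50.0, 150.0)]

print(math.dist((0, 0), (3, 4)))                  # 5.0
print(nearest_cluster((22.0, 18.0), centroids))   # 0
print(nearest_cluster((48.0, 140.0), centroids))  # 2
```

Because distances depend on the scale of each feature, clustering is usually run on scaled features.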


i. K-Means Clustering

ii. Hierarchical Clustering

iii. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

iv. Mean Shift

2. Dimensionality Reduction

What

Reduces the number of features (variables) in a dataset while preserving important information.

Example:

Original features:
Height
Weight
Age
Income
Education
Country
Purchasing behavior

Reduce to:

2 or 3 dimensions

Why

Reasons:

  1. High dimensional data is hard to visualize
  2. Training becomes slow
  3. Some features are redundant
  4. Avoid curse of dimensionality

How

Algorithms transform data into lower dimensional space.


PCA (Principal Component Analysis)

What

Transforms data into new axes called principal components.

These components capture maximum variance in the data.


Why

Used for:


How

Steps:

  1. Standardize data
  2. Compute covariance matrix
  3. Compute eigenvectors
  4. Choose top components
  5. Project data onto them
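The five steps map almost line for line onto NumPy. This is a teaching sketch, not a replacement for a library implementation: the `pca` function name is my own, and the input is random data used only to show the shapes.

```python
import numpy as np


def pca(X, n_components):
    # 1. Standardize each feature to mean 0 and standard deviation 1.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix of the standardized features.
    cov = np.cov(X_std, rowvar=False)
    # 3. Eigen-decomposition (eigh is intended for symmetric matrices).
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # 4. Keep the components with the largest eigenvalues (most variance).
    order = np.argsort(eigenvalues)[::-1][:n_components]
    components = eigenvectors[:, order]
    # 5. Project the data onto the chosen components.
    return X_std @ components


# Random data: 50 samples, 10 features (for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))

X_reduced = pca(X, n_components=2)
print(X_reduced.shape)  # (50, 2)
```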

Example:

10 features → reduce to 2

t-SNE (t-distributed Stochastic Neighbor Embedding)

What

A dimensionality reduction algorithm designed for visualizing high-dimensional data.

Often used in 2D or 3D visualization.


Why

Preserves local relationships between points.

Example:

Used in visualizing:


How

Instead of preserving global distances, it tries to keep similar points close together.

Commonly used for visualizing deep learning embeddings.


UMAP (Uniform Manifold Approximation and Projection)

What

A modern dimensionality reduction technique.

Similar to t-SNE, but typically faster and more scalable.


Why

Advantages over t-SNE:

Used heavily in:


How

UMAP builds a graph of data points and then finds a lower dimensional representation preserving structure.


3. Association Rules

What

Finds relationships between items in datasets.

Classic example:

Market Basket Analysis

Example rule:

If a person buys bread → they are likely to also buy butter

Why

Used for:

Example:

Amazon: "Customers also bought..."

How

Uses metrics:

Support
Confidence
Lift

Example rule:

Bread → Butter
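The three metrics can be computed by hand for the rule Bread → Butter. The sketch below uses the four small transactions from the Apriori example in this post; the function names are my own.

```python
# Four small transactions, matching the Apriori example in this post.
transactions = [
    {"Milk", "Bread"},
    {"Milk", "Butter"},
    {"Bread", "Butter"},
    {"Milk", "Bread", "Butter"},
]


def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent):
    """How often the consequent appears when the antecedent is present."""
    return support(antecedent | consequent) / support(antecedent)


def lift(antecedent, consequent):
    """Confidence relative to how common the consequent is overall."""
    return confidence(antecedent, consequent) / support(consequent)


bread, butter = {"Bread"}, {"Butter"}
print(support(bread | butter))    # 0.5
print(confidence(bread, butter))  # about 0.667
print(lift(bread, butter))        # about 0.889
```

A lift above 1 means the items appear together more often than chance would suggest; in this tiny dataset the lift is slightly below 1, so the rule is not actually a strong one here.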

Apriori Algorithm

What

Finds frequent itemsets in transactional data.


Why

Used in market basket analysis.

Example:

Transactions:

Milk, Bread
Milk, Butter
Bread, Butter
Milk, Bread, Butter

Apriori finds frequent combinations.


How

Key idea:

If a set is frequent
→ all subsets must also be frequent

Steps:

  1. Find frequent items
  2. Build larger itemsets
  3. Generate association rules
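The first two steps can be sketched on the four transactions above. This is a simplified illustration, not an optimized Apriori implementation: the minimum support of 0.5 is an assumed threshold, and rule generation (step 3) is left out.

```python
from itertools import combinations

transactions = [
    {"Milk", "Bread"},
    {"Milk", "Butter"},
    {"Bread", "Butter"},
    {"Milk", "Bread", "Butter"},
]
min_support = 0.5  # assumed threshold: itemset must appear in >= 50% of baskets


def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


# Step 1: frequent single items.
items = sorted({item for t in transactions for item in t})
frequent = [frozenset([i]) for i in items if support({i}) >= min_support]
all_frequent = list(frequent)

# Step 2: grow itemsets one item at a time, keeping only frequent ones.
k = 2
while frequent:
    candidates = {a | b for a, b in combinations(frequent, 2) if len(a | b) == k}
    # Apriori pruning: every (k-1)-subset of a candidate must itself be frequent.
    candidates = {
        c for c in candidates
        if all(frozenset(s) in frequent for s in combinations(c, k - 1))
    }
    frequent = [c for c in candidates if support(c) >= min_support]
    all_frequent.extend(frequent)
    k += 1

for itemset in all_frequent:
    print(sorted(itemset), support(itemset))
```

Here all three single items and all three pairs are frequent, while {Milk, Bread, Butter} appears in only one of four baskets and is dropped.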

Eclat Algorithm

What

Another method for finding frequent itemsets.


Why

Often faster than Apriori, since it avoids repeated scans of the database.


How

Instead of scanning the database repeatedly, Eclat uses a vertical data format.

Example:

Milk → T1, T2, T4
Bread → T1, T3, T4

Then intersections are used to compute frequency.
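With Python sets, the intersection step looks like this. The transaction IDs are the ones from the example above; the `tidsets` name is just a label for the vertical format.

```python
# Vertical format: each item maps to the set of transaction IDs containing it.
tidsets = {
    "Milk": {"T1", "T2", "T4"},
    "Bread": {"T1", "T3", "T4"},
}

# The support count of {Milk, Bread} is the size of the intersection.
milk_and_bread = tidsets["Milk"] & tidsets["Bread"]

print(sorted(milk_and_bread))  # ['T1', 'T4']
print(len(milk_and_bread))     # 2
```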


4. Anomaly Detection

What

Detects unusual or rare data points.

Example:

Normal credit card transactions → $20, $50, $30
Anomaly → $10,000 purchase in another country

Why

Used in:


How

The model learns normal behavior, then detects points that deviate.


Isolation Forest

What

A tree-based algorithm for detecting anomalies.


Why

Efficient for large datasets.


How

Key idea:

Anomalies are easier to isolate

Example:

normal points → need many splits
anomalies → isolated quickly

Shorter path length in tree → anomaly.
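A short sketch with scikit-learn's `IsolationForest`, assuming the library is installed. The transaction amounts are invented, and `contamination=0.1` encodes the assumption that about 10% of the points are anomalies.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical transaction amounts: nine ordinary purchases and one outlier.
amounts = np.array([20, 50, 30, 25, 40, 35, 45, 22, 38, 10_000], dtype=float)
X = amounts.reshape(-1, 1)

# contamination is the assumed share of anomalies in the data (10% here).
model = IsolationForest(n_estimators=100, contamination=0.1, random_state=0)
labels = model.fit_predict(X)  # 1 = normal, -1 = anomaly

for amount, label in zip(amounts, labels):
    print(amount, "anomaly" if label == -1 else "normal")
```

The $10,000 purchase is separated from the rest by almost any random split, so its average path length is short and it is the point flagged as an anomaly.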


One-Class SVM

What

A variation of Support Vector Machine used for anomaly detection.


Why

Used when only normal data is available.

Example:

Train on normal behavior, detect abnormal points.


How

The algorithm finds a boundary around normal data.

Inside boundary → normal
Outside boundary → anomaly

Simple Big Picture

Unsupervised Learning
│
├── Clustering
│   ├── K-Means
│   ├── Hierarchical
│   ├── DBSCAN
│   └── Mean Shift
│
├── Dimensionality Reduction
│   ├── PCA
│   ├── t-SNE
│   └── UMAP
│
├── Association Rules
│   ├── Apriori
│   └── Eclat
│
└── Anomaly Detection
    ├── Isolation Forest
    └── One-Class SVM

The END. Until next time, happy learning!