Machine Learning Foundations - From Zero to Linear Models

If your end goal is Generative AI, you must understand one truth:

Large Language Models are not magic.

Before Transformers, before Attention, before LLMs — there was Linear Regression.

This blog builds your foundation the right way using:

What → Why → How

What is Artificial Intelligence (AI)?

AI = Artificial + Intelligence

What is “Artificial”?

Artificial means:

Man-made

Not naturally occurring

Built by humans using algorithms + hardware

Example:

A human brain → natural

A neural network running on a GPU → artificial

So AI systems are engineered systems that mimic certain capabilities of humans.

What is “Intelligence”?

Intelligence means the ability to:

Learn from experience

Reason

Solve problems

Understand language

Perceive the environment

Adapt to new situations

In humans:

We see → understand → decide → act.

In machines:

Input → process → output (with learning).

So What is AI?

Artificial Intelligence = Artificial systems that can perform tasks requiring intelligence.

Examples:

Playing chess

Detecting spam

Recommending videos

Conversational chatbots

Self-driving perception systems

Important:

AI does not mean consciousness.

It means decision-making or learning capability.

Now: ML vs DL vs AI vs GenAI

Think of it like nested circles.

AI is the big field.

ML is one way to achieve AI.

Deep Learning is one type of ML.

Generative AI is built mostly on Deep Learning.

Simple Concept Diagram

Artificial Intelligence (AI)
│
├── Rule-Based Systems
│
└── Machine Learning (ML)
     │
     ├── Classical ML (SVM, Trees, etc.)
     │
     └── Deep Learning (DL)
          │
          ├── CNNs
          ├── RNNs
          └── Transformers
                │
                ├── Large Language Models (LLMs)
                │
                └── Generative AI

Visual Nested View (Very Simple)

AI
 └── ML
      └── Deep Learning
            └── Transformers
                  └── LLMs
                        └── Generative AI

Where Does Generative AI Fit?

Generative AI is:

A subset of Deep Learning

Commonly built on the Transformer architecture

Often implemented as LLMs or diffusion models

It focuses on:

Generating new content

Not just predicting labels

Example:

Spam detection → ML

Image classification → Deep Learning

ChatGPT writing essays → Generative AI

One-Line Difference Summary

AI → Big goal: make machines intelligent

ML → Machines learn from data

Deep Learning → ML using neural networks

Generative AI → Deep learning models that create new content

# Introduction to Machine Learning

What

Machine Learning (ML) is a branch of Artificial Intelligence where systems learn patterns from data instead of being explicitly programmed with rules.

Traditional Programming:

Rules + Data → Output

Machine Learning:

Data + Output → Model
Model + New Data → Prediction

Instead of writing rules manually, we let the machine discover patterns.

Example:

Instead of writing:

if email_contains "lottery" → spam

We give the system:

Thousands of spam emails

Thousands of non-spam emails

The model learns patterns on its own.

Why

Because real-world problems are too complex for rule-based systems.

Consider:

Image recognition

Speech recognition

Fraud detection

Language generation

You cannot manually write rules for:

Recognizing a cat in 10 million possible image variations.

Detecting sarcasm in text.

Generating a Shakespeare-style paragraph.

Data-driven learning scales.

Manual rule-writing does not.

In Generative AI:

We don’t write rules for writing poetry.

We train models on massive datasets.

How

The core ML workflow:

This entire loop is used even when training billion-parameter models.

# Why Machine Learning?

What

A philosophical question:

Why do we need ML instead of traditional programming?

Why

Three core reasons:

1️⃣ No Clear Rules

How do you write rules to:

Translate English to Japanese?

Generate human-like conversation?

Practically impossible to do by hand at scale.

2️⃣ Massive Data Availability

The internet created:

Petabytes of text

Billions of images

Clickstream behavior

ML thrives on data.

3️⃣ Adaptability

ML models:

Adapt to new data

Improve over time

Traditional systems:

Require manual updates.

How

Think of ML as:

Function Approximation

We try to approximate an unknown function:

y = f(x)

Generative AI is:

Same principle. Bigger scale.

# Types of Machine Learning

🔹 Supervised Learning

What

Learning from labeled data.

Input (X) → Output (Y)

Example:

House size → Price

Email text → Spam/Not spam

Why

Used when we know the correct answers.

How

Two main categories:

1️⃣ Regression

2️⃣ Classification

🔹 Unsupervised Learning

What

Learning patterns without labels.

Why

Often, labels don’t exist.

Example:

Customer segmentation - the process of dividing a customer base into distinct groups (segments) that share similar characteristics, such as demographics, behaviors, or needs, to enable targeted marketing, improved product development, and higher conversion rates.

How

Model finds:

Clusters

Hidden structure

Data compression

Used in:

Dimensionality reduction (PCA) - reduces the number of features while retaining key information, by converting high-dimensional data into a lower-dimensional space that preserves the important details.

Clustering

Pre-training embeddings

🔹 Reinforcement Learning

What

Learning through reward and penalty.

Why

Used in:

Game playing

Robotics

Sequential decision-making

How

Agent:

Takes action

Gets reward

Updates policy

Many LLMs are fine-tuned with:

Reinforcement Learning from Human Feedback (RLHF).

# Supervised Machine Learning Algos

◻️ 1. Linear Regression - The Foundation

What

One of the fundamental supervised machine learning algorithms.

A model that describes the relationship between variables using a straight line:

y = wx + b

Where:

w → weight (slope)

b → bias

y → prediction

Why

Why are we learning this?

At its core, this is about predicting a value (y) from an input (x).

The key point:

👉 The output y is continuous (not categories).

Example: house price, temperature, salary, etc.

It is the simplest form of:

Parameter learning through optimization.

Understanding Linear Regression means understanding:

Neural network layers

Gradient descent

Loss functions

Weight updates

It is the seed of deep learning.

How

But how does it actually work?

We model the relationship like this:

y = wx + b

x (input) → what you give as an input (e.g., house size)

w (weight) → how strongly x influences y

b (bias) → a constant offset (baseline value)

So basically:

👉 We take the input, scale it using a weight, and shift it using a bias.

Example:

Predict salary based on experience.

Model learns slope.

Training means:

Find best w and b.

What are we really learning?

We’re not learning “the answer” directly.

We’re learning:

👉 the best values of w and b such that predictions are as close as possible to real data.

So basically: we learn the parameters (w, b) by minimizing error through optimization.
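Here is a minimal sketch of that idea, assuming a tiny made-up dataset of experience (x) versus salary (y) that lies exactly on the line y = 2x + 1. The numbers and variable names are illustrative only. NumPy's `polyfit` with degree 1 returns the slope and intercept that minimize squared error.

```python
import numpy as np

# Hypothetical toy data: years of experience (x) and salary in arbitrary units (y).
# The points are constructed to lie exactly on y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0

# Fit a degree-1 polynomial (a straight line); polyfit returns [slope, intercept].
w, b = np.polyfit(x, y, deg=1)
print(f"learned w = {w:.3f}, b = {b:.3f}")

# Use the learned parameters to predict for a new input.
x_new = 6.0
y_new = w * x_new + b
print(f"prediction for x = {x_new}: {y_new:.3f}")
```

Training recovered the slope and bias from the data alone; nobody typed "2" or "1" into the model.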

▨ Ordinary Least Squares (OLS)

What

A method for finding the best-fitting line by minimizing squared errors.

Ordinary Least Squares (OLS) is a mathematical method used to find the best-fitting line in Linear Regression.

It chooses the line that minimizes the total squared difference between the predicted values and the actual values.

Those differences are called errors or residuals.

Example:

Suppose we want to predict house price from size.

OLS finds the line:

y = wx + b

that best fits the data points.

Why

We need a mathematical definition of "best fit."

In real data, points rarely lie perfectly on a straight line.

Example:

If we draw any random line, the prediction errors will be large.

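OLS helps us find the optimal line such that the sum of squared errors:

Σ (yᵢ − ŷᵢ)²

is as small as possible. Here yᵢ is the actual value and ŷᵢ is the value predicted by the line.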

Why squared error?

Prevents positive and negative errors from canceling out

Penalizes large mistakes more heavily

Makes the math easier to optimize
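The first two points are easy to check with plain arithmetic. This is a small sketch using made-up error values, not real model output.

```python
# Two hypothetical prediction errors of equal size and opposite sign.
errors = [2.0, -2.0]

raw_sum = sum(errors)                      # positive and negative errors cancel out
squared_sum = sum(e ** 2 for e in errors)  # squaring keeps both contributions

# Same total absolute error (4), spread differently.
small_errors = [1.0, 1.0, 1.0, 1.0]
one_large_error = [4.0]
small_penalty = sum(e ** 2 for e in small_errors)
large_penalty = sum(e ** 2 for e in one_large_error)

print(raw_sum, squared_sum)          # 0.0 8.0
print(small_penalty, large_penalty)  # 4.0 16.0
```

The raw errors sum to zero even though both predictions are wrong, and one big mistake costs four times as much as four small ones.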

How

Two approaches:

1. Closed Form Solution

The Closed Form Solution is a direct mathematical formula that calculates the optimal regression parameters in one step using linear algebra.

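For linear regression, the parameters are computed as:

θ = (XᵀX)⁻¹ Xᵀ y

Where:

X → matrix of input features (one row per data point, plus a column of ones for the bias)

y → vector of actual target values

θ → vector of learned parameters (the weights w and the bias b)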

Limitations:

Requires matrix inversion

Becomes computationally expensive when the number of features is large

Memory-heavy for large datasets
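A minimal sketch of the closed-form approach, assuming toy data generated from y = 3x + 2. It solves the normal equation θ = (XᵀX)⁻¹Xᵀy, where the symbol θ and the variable names are choices made for this example. Instead of computing the inverse explicitly, it uses `np.linalg.solve`, which is usually the numerically safer route.

```python
import numpy as np

# Hypothetical toy data generated from y = 3x + 2 (no noise).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3.0 * x + 2.0

# Design matrix: a column of ones (for the bias b) next to the feature column.
X = np.column_stack([np.ones_like(x), x])

# Normal equation: solve (X^T X) theta = X^T y in a single step.
theta = np.linalg.solve(X.T @ X, X.T @ y)
b, w = theta

print(f"b = {b:.3f}, w = {w:.3f}")
```

There is no loop and no learning rate: one linear solve gives the answer. That is also why the cost grows quickly as the number of features grows.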

2. Gradient Descent

Iterative optimization.

Gradient Descent is an iterative optimization algorithm that gradually updates the model parameters to minimize the cost function.

Instead of solving a formula directly, it takes small steps toward the minimum error.

Why

Closed form becomes impractical for large datasets or high-dimensional data.

Gradient descent works well because:

Scales to millions of data points

Works with complex models

Does not require matrix inversion

This is why deep learning and neural networks rely on gradient descent.

How

Step-by-step:
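One common way to write the loop is sketched below, assuming MSE as the cost, a hand-picked learning rate of 0.05, and toy data on the line y = 2x + 1. All three are illustrative assumptions.

```python
import numpy as np

# Hypothetical toy data lying on y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

w, b = 0.0, 0.0        # 1. start from an initial guess
learning_rate = 0.05   # step size (an illustrative choice)
n = len(x)

for step in range(2000):
    y_pred = w * x + b                       # 2. predict with the current parameters
    error = y_pred - y
    cost = np.mean(error ** 2)               # 3. measure the error (MSE)
    grad_w = (2.0 / n) * np.sum(error * x)   # 4. gradient of the cost w.r.t. w
    grad_b = (2.0 / n) * np.sum(error)       #    gradient of the cost w.r.t. b
    w -= learning_rate * grad_w              # 5. take a small step downhill
    b -= learning_rate * grad_b              # 6. repeat until the cost stops improving

print(f"w = {w:.4f}, b = {b:.4f}, final cost = {cost:.2e}")
```

If the learning rate is too large the steps overshoot and the cost blows up; if it is too small, training crawls. Picking it is part of the craft.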

Intuition

Imagine you are standing on a mountain trying to reach the lowest valley.

Closed Form → helicopter directly drops you at the lowest point

Gradient Descent → you walk downhill step by step

▨ Cost Functions

What

Function that measures model error.

A cost function is a mathematical formula that quantifies the difference between a machine learning model's predicted values and the actual target values.

Loss vs. Cost: While often used interchangeably, a "loss function" typically refers to the error for a single data point, whereas a "cost function" is the average or sum of these losses over the entire dataset.

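MSE (Mean Squared Error):

MSE = (1/n) Σ (yᵢ − ŷᵢ)²

RMSE (Root Mean Squared Error):

RMSE = √MSE

Here yᵢ is the actual value, ŷᵢ is the predicted value, and n is the number of data points.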
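As a quick worked example, MSE averages the squared errors and RMSE takes the square root of that average. The three actual/predicted pairs below are made up for illustration.

```python
import math

# Hypothetical actual values and model predictions.
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

# Squared error per data point: (3-2)^2 = 1, (5-5)^2 = 0, (7-9)^2 = 4
squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]

mse = sum(squared_errors) / len(squared_errors)  # (1 + 0 + 4) / 3
rmse = math.sqrt(mse)                            # back in the same units as y

print(f"MSE = {mse:.4f}, RMSE = {rmse:.4f}")
```

RMSE is often easier to read because it is in the same units as the target (for example, dollars instead of dollars squared).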

Why

Without a cost function:

We don’t know how wrong we are.

Cost function guides learning.

How

Training = minimize cost.

In Neural Networks:

Cross entropy

KL divergence

Same philosophy.

▨ Regularization (Ridge, Lasso, ElasticNet)

What

Add a penalty for large weights.

Regularization in Machine Learning is a technique used to prevent a model from memorizing the training data too much so that it can work well on new, unseen data.

Why/The Problem: Overfitting

Sometimes a model learns the training data too perfectly. It memorizes noise and small details that don’t matter.

This problem is called Overfitting.

Example:

Imagine studying for an exam by memorizing exact questions from last year.

If the exam questions change slightly, you might fail.

A model that overfits behaves the same way.

Imagine fitting a line through data.

Without regularization:

Model tries to perfectly fit every point

With regularization:

Model tries to capture the overall trend

Regularization forces the model to prefer simpler solutions.

What Regularization Does

Regularization adds a penalty to the model for becoming too complex.

So the model is encouraged to:

keep weights small

stay simple

focus on general patterns instead of noise

Think of it like:

A teacher telling you:

Simple Analogy

Imagine drawing a curve to fit points on a graph.

Without regularization:

The curve twists and turns to pass through every point.

With regularization:

The curve stays smooth and simple, capturing the overall trend.

Types of Regularization

1. Ridge Regression (L2 Regularization)

2. Lasso Regression (L1 Regularization)

3. ElasticNet
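Roughly speaking, Ridge penalizes the sum of squared weights, Lasso penalizes the sum of absolute weights, and ElasticNet mixes the two. The sketch below compares them in scikit-learn on synthetic data where only the first of five features matters. The data, the `alpha` values, and the `l1_ratio` are assumptions picked for illustration, not recommended settings.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

# Synthetic data: 5 features, but only the first one actually drives y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

models = {
    "no regularization": LinearRegression(),
    "ridge (L2)": Ridge(alpha=10.0),              # shrinks all weights toward zero
    "lasso (L1)": Lasso(alpha=0.5),               # can set weights exactly to zero
    "elasticnet (L1 + L2)": ElasticNet(alpha=0.5, l1_ratio=0.5),
}

coefs = {}
for name, model in models.items():
    model.fit(X, y)
    coefs[name] = model.coef_
    print(f"{name:>22}: {np.round(model.coef_, 3)}")
```

On this data Lasso zeroes out the four irrelevant features, which is why it is often described as doing feature selection, while Ridge keeps every feature but with smaller weights.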

▨ Polynomial Regression

What

Extend linear regression with higher degree terms:

y = ax² + bx + c

Polynomial regression is a supervised machine learning algorithm used to model non-linear relationships by fitting a curved line to data. It transforms input features into polynomial terms (e.g., squaring or cubing inputs) to model complex patterns, such as growth rates or rapid, non-linear changes in data. It is frequently used for regression tasks that are more complex than simple linear regression.

Why

Many real-world relationships are nonlinear.

Example: Predicting Salary from Experience:

Sometimes the relationship is not a straight line.

With experience, salary might:

A straight line (linear regression) can’t capture this curve properly.

Polynomial Regression Idea:

Instead of just using x, we also use higher powers:

y = w₁x + w₂x² + w₃x³ + ... + b

This makes the model curved instead of straight.

Concrete Data Example:

👉 Notice:

Growth is not linear

It accelerates

What the Model Learns:

Instead of:

y = wx + b   (straight line ❌)

It learns something like:

y = 0.5x² + 1.2x + 2   (curve ✅)

👉 Now the curve bends upward and fits the data better.

Intuition (Important):

x → basic effect

x² → captures acceleration (curvature)

x³ → captures more complex bends

So:

👉 Polynomial regression = linear model on transformed features (x, x², x³...)

Real-world Use Cases

House pricing (area + area² effects)

Stock trends (non-linear movement)

Growth patterns (population, revenue)

Physics (motion equations)

How

Use PolynomialFeatures in sklearn.

Risk:

High degree → overfitting.
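A minimal sketch with `PolynomialFeatures`, assuming noise-free toy data generated from the curve y = 0.5x² + 1.2x + 2. The data is synthetic and the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Toy data generated from y = 0.5x^2 + 1.2x + 2 (no noise).
x = np.arange(0, 10, dtype=float).reshape(-1, 1)
y = 0.5 * x[:, 0] ** 2 + 1.2 * x[:, 0] + 2.0

# Expand x into the columns [x, x^2].
# include_bias=False because LinearRegression fits the intercept itself.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(x)

# An ordinary linear model on the transformed features.
model = LinearRegression().fit(X_poly, y)

print("weights for [x, x^2]:", np.round(model.coef_, 3))
print("intercept:", round(float(model.intercept_), 3))
```

The model itself is still plain linear regression. Only the features changed, which is exactly the "linear model on transformed features" idea.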

▨ Bias-Variance Tradeoff

What

Bias means the model is too simple to learn the true pattern in the data. The model makes strong assumptions.

Variance means the model is too sensitive to the training data. It learns noise instead of the real pattern.

The Bias–Variance Tradeoff describes the balance between:

Bias  → error from overly simple models (Model too simple → underfitting.)
Variance → error from overly complex models (Model too complex → overfitting.)

A good model must balance both.

Too much of either leads to poor predictions.

LLMs reduce bias by scaling model size.

Regularization reduces variance.

Why

Explains model performance behavior.

1. Bias

What

Bias means the model is too simple to learn the true pattern in the data.

The model makes strong assumptions.

Example:

Trying to fit a straight line to data that is actually curved.

Actual pattern:
      *
   *
 *
      *
   *

Model prediction (line):
---------

The model cannot capture the real pattern.

Result

High Bias → Underfitting

The model performs badly on both training and test data.

2. Variance

What

Variance means the model is too sensitive to the training data.

It learns noise instead of the real pattern.

Example:

Training points:
   *
      *
 *
        *

The model creates a very wiggly curve to match every point.

Crazy curve:
   *__
      \__
  *      \_
        *   \__

Result

High Variance → Overfitting

The model performs:

Training accuracy → very high
Test accuracy → very low

Visual Intuition

Model Complexity
        ↑

Underfitting       Good Model        Overfitting
High Bias          Balanced          High Variance

     |                  |                 |
-----|------------------|-----------------|-----

Why It's Called a Tradeoff

Reducing one usually increases the other.

Example:

So ML training tries to find the sweet spot.

How We Control Bias and Variance

Several techniques help manage this balance.

Example

Suppose you predict house prices.

Model A

Price = a + b * size

Too simple → High bias

Model B

Price = a + b1x + b2x² + b3x³ + ... + b20x²⁰

Too complex → High variance

Best Model

Price = a + b1x + b2x²

Balanced.
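To see the three cases side by side, here is a rough sketch that fits polynomials of degree 1, 2 and 9 to noisy quadratic data. The data is synthetic, and degree 9 stands in for the degree-20 model above to keep the numerics well behaved. Both are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a quadratic trend plus noise, standing in for size vs. price.
def make_data(n):
    x = rng.uniform(-1.0, 1.0, size=n)
    y = 1.0 + 2.0 * x + 3.0 * x ** 2 + rng.normal(scale=0.3, size=n)
    return x, y

x_train, y_train = make_data(15)    # small training set
x_test, y_test = make_data(200)     # fresh data the model never saw

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

train_mse, test_mse = {}, {}
for degree in (1, 2, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_mse[degree] = mse(coeffs, x_train, y_train)
    test_mse[degree] = mse(coeffs, x_test, y_test)
    print(f"degree {degree}: train MSE = {train_mse[degree]:.3f}, "
          f"test MSE = {test_mse[degree]:.3f}")
```

Training error keeps shrinking as the degree grows. Test error typically drops and then climbs again once the model starts fitting noise, though the exact numbers depend on the random seed.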

Mathematical View

Total prediction error can be approximated as:

Total Error = Bias² + Variance + Noise

Where:

Bias² → error from wrong assumptions

Variance → error from model sensitivity

Noise → unavoidable randomness in data

Why This Matters for AI / Deep Learning

Almost every technique in modern AI deals with this tradeoff.

Examples:

Even LLM training is influenced by this principle.

Real life example / Intuition

Imagine learning for an exam.

High Bias student

Studies only basic summary
→ cannot solve real questions

High Variance student

Memorizes exact past papers
→ fails when questions change

Balanced student

Understands concepts
→ performs well in new questions

▨ Feature Scaling

What

Normalize input features.

Why

Speeds up gradient descent.

Especially important for:

Neural networks

How

Standardization:

Mean = 0

Std = 1
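Standardization subtracts each feature's mean and divides by its standard deviation, z = (x − mean) / std. A small sketch with NumPy, using a made-up table of house sizes and room counts:

```python
import numpy as np

# Hypothetical feature matrix: house size (sq ft) and number of rooms.
X = np.array([[1400.0, 3.0],
              [1600.0, 3.0],
              [1700.0, 4.0],
              [1875.0, 4.0],
              [2350.0, 5.0]])

mean = X.mean(axis=0)          # per-feature mean
std = X.std(axis=0)            # per-feature standard deviation
X_scaled = (X - mean) / std    # z = (x - mean) / std

print(np.round(X_scaled.mean(axis=0), 6))  # approximately 0 for each column
print(np.round(X_scaled.std(axis=0), 6))   # approximately 1 for each column
```

In practice scikit-learn's `StandardScaler` does the same computation. Fit it on the training data only and reuse the same mean and std on the test data.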

▨ Train-Test Split

What

Divide dataset into:

Training

Testing

Why

To measure generalization.

Overfitting happens when:

Model memorizes training data.

How

Train-test split - The dataset is randomly divided into two subsets: a training set (typically 70-80% of the data) for the model to learn from, and a separate testing set (20-30%) to evaluate its final performance on unseen data. This method is simple, fast, and ideal for large datasets or computationally expensive models.

Train-Validation-Test split - The data is divided into three parts: a training set to fit the model, a validation set to tune hyperparameters and perform model selection, and a final, held-out test set for an unbiased evaluation of the final model's performance.

Cross Validation (k-fold cross-validation) - The dataset is divided into k equally sized "folds". The model is trained k times; each time, one fold is used as the test set and the remaining k-1 folds as the training set. The k performance results are then averaged to provide a more robust estimate of model performance.
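A minimal sketch of a train-validation-test split using scikit-learn's `train_test_split` twice. The 60/20/20 proportions and the dummy data are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy dataset of 100 samples.
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# First split: hold out 20% as the final test set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Second split: 25% of the remaining 80 samples becomes the validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

Fixing `random_state` makes the split reproducible, which helps when comparing models.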

▨ Cross Validation

What

Cross Validation is a technique used to evaluate how well a machine learning model will perform on unseen data.

Instead of splitting the dataset once into train/test, we split it multiple times and train/test the model several times.

This gives a more reliable estimate of model performance.

Why

If we use only one train-test split, the result may depend too much on which data points ended up in the test set.

Example:

Dataset = 100 samples

Train = 80
Test  = 20

Maybe those 20 test samples were very easy or very difficult, which can give a misleading accuracy.

Cross-validation solves this by testing the model on multiple different splits.

How

The most common method is K-Fold Cross Validation.

Visual Intuition

Dataset

|F1|F2|F3|F4|F5|

Test each fold once
Train on the remaining folds

Why Cross Validation is Important

It helps with:

Better model evaluation

Detecting overfitting

Choosing the best hyperparameters

Selecting the best model

Example:

Model A → 88%
Model B → 92%

Choose Model B

Where It Fits in ML

Dataset
   │
Train/Test Split
   │
Cross Validation
   │
Model Evaluation
   │
Model Selection

Example in Python (scikit-learn)
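A minimal sketch of 5-fold cross-validation, assuming a synthetic dataset and a `LinearRegression` model. Both are chosen here purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data: y is roughly 3x + 2 with a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=0.5, size=100)

# 5 folds; shuffle so each fold is a random slice of the data.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# One R² score per fold: train on 4 folds, test on the remaining one.
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

print("R² per fold:", np.round(scores, 3))
print("mean R²:", round(float(scores.mean()), 3))
```

Report the mean together with the spread across folds; a large spread is a hint that a single train-test split would have been misleading.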

Simple One-Line Intuition

Cross Validation = testing the model multiple times on different parts of the dataset to get a reliable performance estimate.

▨ Model Evaluation Techniques

Model Evaluation
│
├── Data Splitting Methods
│   ├── Train–Test Split
│   ├── Train–Validation–Test Split
│   └── Cross Validation
│        ├── K-Fold Cross Validation
│        ├── Stratified K-Fold
│        └── Leave-One-Out CV
│
└── Evaluation Metrics
    ├── Regression Metrics
    │     ├── MSE
    │     ├── RMSE
    │     ├── MAE
    │     └── R² Score
    │
    └── Classification Metrics
          ├── Accuracy
          ├── Precision
          ├── Recall
          ├── F1 Score
          └── ROC-AUC

So evaluation has two parts.

1. Data Splitting Techniques

These determine how we test the model.

Train–Test Split

Simplest method.

Dataset
│
├── Train
└── Test

Used to check generalization.

Train–Validation–Test Split

Used when we need hyperparameter tuning.

Dataset
│
├── Train
├── Validation
└── Test

Cross Validation

Used to get more reliable performance estimates.

Example (5-fold):

|F1|F2|F3|F4|F5|

Each fold becomes test once.

2. Evaluation Metrics

These measure how good the model predictions are.

Regression Metrics

Classification Metrics
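The four most common classification metrics can be computed by hand from the counts of true/false positives and negatives. A small sketch on made-up labels (1 = positive, 0 = negative):

```python
# Hypothetical true labels and model predictions.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly predicted positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correctly predicted negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed positives

accuracy = (tp + tn) / len(pairs)    # share of all predictions that are right
precision = tp / (tp + fp)           # of predicted positives, how many are real
recall = tp / (tp + fn)              # of real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(tp, tn, fp, fn)                # 3 3 1 1
print(accuracy, precision, recall, f1)
```

Accuracy alone can mislead on imbalanced data (for example, rare diseases), which is where precision, recall and F1 earn their keep.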

Simple Mental Model

Think of evaluation as two questions.

1️⃣ How do we test the model?
   → Train-Test Split
   → Cross Validation

2️⃣ How do we measure performance?
   → Accuracy
   → MSE
   → F1 Score

Simple Example

Suppose we build a house price prediction model.

Evaluation pipeline:

Dataset
   ↓
Train-Test Split
   ↓
Train Model
   ↓
Predict Test Data
   ↓
Compute RMSE

One-Line Summary

Model evaluation = data splitting technique + evaluation metric.

◻️ 2. Classification

Classification means:

Predicting a category or class label.

Examples:

Email → Spam / Not Spam

Image → Cat / Dog

Medical test → Disease / No Disease

▨ 1. Linear Classifiers

A linear classifier separates classes using a straight line (or hyperplane).

Example in 2D:

Cats  ● ● ● ●
Dogs  ○ ○ ○ ○

Decision boundary:
-----------------------

In higher dimensions this becomes a hyperplane.

Key idea

y = w₁x₁ + w₂x₂ + ... + b

If the result is above a threshold → Class A

Else → Class B
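In code, the decision rule is just a weighted sum followed by a comparison. The weights, bias and threshold below are hypothetical values picked by hand; a real classifier would learn them from data.

```python
import numpy as np

# Hypothetical parameters of a 2-feature linear classifier.
w = np.array([1.0, -1.0])
b = 0.0
threshold = 0.0

def classify(x):
    """Return the class for one input vector x using a linear decision rule."""
    score = np.dot(w, x) + b          # y = w1*x1 + w2*x2 + b
    return "Class A" if score > threshold else "Class B"

label_1 = classify(np.array([3.0, 1.0]))   # score = 3 - 1 = 2
label_2 = classify(np.array([1.0, 3.0]))   # score = 1 - 3 = -2

print(label_1, label_2)   # Class A Class B
```

Every point where the score equals the threshold lies on the decision boundary, which is a straight line in 2D.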

i. Logistic Regression

ii. Single-Layer Perceptron

iii. Linear SVM

iv. Naive Bayes

v. SGD Classifier

▨ 2. Non-Linear Classifiers

When data cannot be separated with a straight line, we need non-linear models.

Example:

    ○ ○ ○
   ○ ● ● ○
    ○ ○ ○

A line cannot separate them.

i. Kernel SVM

ii. Decision Tree

iii. Random Forest

iv. K Nearest Neighbors (KNN)

v. Gradient Boosting

# Unsupervised Machine Learning Algos

What

Unsupervised Learning is a type of machine learning where the data does not contain labels or correct answers.

Example dataset:

There is no label like "rich" or "poor".

The algorithm must discover patterns on its own.

Why

Many real-world datasets do not have labeled data because labeling is expensive or impossible.

Examples:

Customer segmentation

Detecting fraud

Finding similar documents

Reducing data size

Detecting unusual behavior

How

Unsupervised learning works by finding structure in the data, such as:

grouping similar data → Clustering

compressing information → Dimensionality Reduction

discovering relationships → Association Rules

detecting unusual points → Anomaly Detection

So the structure becomes:

Unsupervised Learning
│
├── Clustering
├── Dimensionality Reduction
├── Association Rules
└── Anomaly Detection

Now let's go deeper.

1. Clustering

What

Clustering means grouping similar data points together.

Example:

Customer data may automatically group into:

Group 1 → Students
Group 2 → Working professionals
Group 3 → High-income customers

Even though the model was never told these labels.

Why

Used for:

Customer segmentation

Social network analysis

Document grouping

Image segmentation

Market analysis

How

Clustering algorithms measure distance or similarity between data points.

Common distance metric:

Euclidean Distance

Points closer together → same cluster.
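A small sketch of that idea: compute the Euclidean distance from each point to a set of cluster centers and assign each point to the nearest one. The points and the two centers are made up for illustration; algorithms like K-Means would also move the centers.

```python
import numpy as np

# Euclidean distance between (0, 0) and (3, 4): sqrt(3^2 + 4^2) = 5
d = np.linalg.norm(np.array([3.0, 4.0]) - np.array([0.0, 0.0]))

# Hypothetical 2D data points and two hand-picked cluster centers.
points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
centroids = np.array([[1.0, 1.5], [8.5, 9.0]])

# distances[i, j] = Euclidean distance from point i to centroid j
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)

# Each point joins the cluster whose center is closest.
labels = distances.argmin(axis=1)

print(d)        # 5.0
print(labels)   # [0 0 1 1]
```

Because distances depend on feature scale, clustering usually works better after feature scaling.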

i. K-Means Clustering

ii. Hierarchical Clustering

iii. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

iv. Mean Shift

2. Dimensionality Reduction

What

Reduces the number of features (variables) in a dataset while preserving important information.

Example:

Original features:
Height
Weight
Age
Income
Education
Country
Purchasing behavior

Reduce to:

2 or 3 dimensions

Why

Reasons:

High dimensional data is hard to visualize

Training becomes slow

Some features are redundant

Avoid the curse of dimensionality

How

Algorithms transform data into lower dimensional space.

PCA (Principal Component Analysis)

What

Transforms data into new axes called principal components.

These components capture maximum variance in the data.

Why

Used for:

noise reduction

data compression

visualization

preprocessing before ML

How

Steps:

Standardize data

Compute covariance matrix

Compute eigenvectors and eigenvalues

Choose top components

Project data onto them

Example:

10 features → reduce to 2
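The steps above can be followed literally with NumPy. This is a sketch on random synthetic data with 10 features, meant to mirror the steps rather than replace a library implementation such as scikit-learn's `PCA`.

```python
import numpy as np

# Synthetic data: 200 samples, 10 features, with one deliberately redundant feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X[:, 1] = 2.0 * X[:, 0] + 0.1 * X[:, 1]

# 1. Standardize each feature.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the features (10 x 10).
cov = np.cov(X_std, rowvar=False)

# 3. Eigenvalues and eigenvectors (eigh is meant for symmetric matrices).
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Keep the 2 components with the largest eigenvalues (most variance).
order = np.argsort(eigenvalues)[::-1]
top2 = eigenvectors[:, order[:2]]

# 5. Project the data onto those components: 10 features -> 2.
X_reduced = X_std @ top2
explained = eigenvalues[order[:2]] / eigenvalues.sum()

print(X_reduced.shape)                       # (200, 2)
print("variance explained:", np.round(explained, 3))
```

The "variance explained" ratio tells you how much information the two kept components retain.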

t-SNE (t-distributed Stochastic Neighbor Embedding)

What

A dimensionality reduction algorithm designed for visualizing high-dimensional data.

Often used in 2D or 3D visualization.

Why

Preserves local relationships between points.

Example:

Used in visualizing:

word embeddings

image embeddings

neural network features

How

Instead of preserving global distances, it tries to keep similar points close together.

Commonly used for visualizing deep learning embeddings.

UMAP (Uniform Manifold Approximation and Projection)

What

A modern dimensionality reduction technique.

Similar to t-SNE but typically faster and more scalable.

Why

Advantages over t-SNE:

faster

preserves more global structure

works well on large datasets

Used heavily in:

bioinformatics

embedding visualization

deep learning pipelines

How

UMAP builds a graph of data points and then finds a lower dimensional representation preserving structure.

3. Association Rules

What

Finds relationships between items in datasets.

Classic example:

Market Basket Analysis

Example rule:

If a person buys bread → they are likely to also buy butter

Why

Used for:

recommendation systems

product placement

shopping analysis

Example:

Amazon: "Customers also bought..."

How

Uses metrics:

Support
Confidence
Lift

Example rule:

Bread → Butter
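Here is how the three metrics work out for the rule Bread → Butter, using the four toy transactions that appear in the Apriori example. The helper function `support` is a name introduced for this sketch.

```python
# Toy transactions: each one is the set of items in a basket.
transactions = [
    {"Milk", "Bread"},
    {"Milk", "Butter"},
    {"Bread", "Butter"},
    {"Milk", "Bread", "Butter"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# Support: how often Bread and Butter appear together (2 of 4 baskets).
support_rule = support({"Bread", "Butter"})

# Confidence: of the baskets with Bread, how many also contain Butter.
confidence = support_rule / support({"Bread"})

# Lift: confidence compared with how common Butter is overall.
lift = confidence / support({"Butter"})

print(f"support = {support_rule:.2f}, confidence = {confidence:.2f}, lift = {lift:.2f}")
```

A lift above 1 suggests the items appear together more often than chance would predict. In this tiny dataset the lift is below 1, so the rule is not a strong one here.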

Apriori Algorithm

What

Finds frequent itemsets in transactional data.

Why

Used in market basket analysis.

Example:

Transactions:

Milk, Bread
Milk, Butter
Bread, Butter
Milk, Bread, Butter

Apriori finds frequent combinations.

How

Key idea:

If a set is frequent
→ all subsets must also be frequent

Steps:

Find frequent items

Build larger itemsets

Generate association rules
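A compact sketch of the first two steps (finding frequent itemsets) on the four transactions above, assuming a minimum support of 0.5. Rule generation is left out to keep it short, and this is a teaching version rather than an optimized implementation.

```python
from itertools import combinations

transactions = [
    {"Milk", "Bread"},
    {"Milk", "Butter"},
    {"Bread", "Butter"},
    {"Milk", "Bread", "Butter"},
]
min_support = 0.5   # illustrative threshold

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# Step 1: frequent single items.
items = sorted({item for t in transactions for item in t})
current = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]

frequent = {}
k = 1
while current:
    for itemset in current:
        frequent[itemset] = support(itemset)
    # Step 2: build candidates of size k + 1 by joining frequent k-itemsets.
    k += 1
    current_set = set(current)
    candidates = {a | b for a in current for b in current if len(a | b) == k}
    # Prune: every (k-1)-subset of a candidate must itself be frequent.
    candidates = [c for c in candidates
                  if all(frozenset(s) in current_set for s in combinations(c, k - 1))]
    current = [c for c in candidates if support(c) >= min_support]

for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

All three single items and all three pairs pass the threshold, while the full triple appears in only one basket and is dropped.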

Eclat Algorithm

What

Another method for finding frequent itemsets.

Why

Often faster than Apriori on large datasets.

How

Instead of scanning the database repeatedly:

Eclat uses vertical data format

Example:

Milk → T1, T2, T4
Bread → T1, T3, T4

Then intersections are used to compute frequency.
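With Python sets the intersection step is a one-liner. The sketch below uses the Milk and Bread transaction IDs from the example; the dictionary name `tidsets` is just a label chosen here.

```python
# Vertical format: each item maps to the set of transaction IDs that contain it.
tidsets = {
    "Milk": {"T1", "T2", "T4"},
    "Bread": {"T1", "T3", "T4"},
}

# Transactions containing both Milk and Bread = intersection of the two ID sets.
common = tidsets["Milk"] & tidsets["Bread"]
support_count = len(common)

print(sorted(common), support_count)   # ['T1', 'T4'] 2
```

No pass over the original transactions is needed; the ID sets already carry all the information.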

4. Anomaly Detection

What

Detects unusual or rare data points.

Example:

Normal credit card transactions → $20, $50, $30
Anomaly → $10,000 purchase in another country

Why

Used in:

fraud detection

cybersecurity

medical diagnosis

manufacturing defects

How

The model learns normal behavior, then detects points that deviate.

Isolation Forest

What

A tree-based algorithm for detecting anomalies.

Why

Efficient for large datasets.

How

Key idea:

Anomalies are easier to isolate

Example:

normal points → need many splits
anomalies → isolated quickly

Shorter path length in tree → anomaly.
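A rough sketch with scikit-learn's `IsolationForest`, assuming synthetic transaction amounts clustered around $40 plus one $10,000 purchase. The data and parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: 200 ordinary transaction amounts plus one extreme purchase.
rng = np.random.default_rng(0)
normal_amounts = rng.normal(loc=40.0, scale=10.0, size=200)
amounts = np.append(normal_amounts, 10_000.0).reshape(-1, 1)

model = IsolationForest(n_estimators=100, random_state=0)
model.fit(amounts)

# predict returns -1 for anomalies and +1 for normal points.
outlier_label = model.predict([[10_000.0]])[0]

# score_samples: the lower the score, the more abnormal the point.
outlier_score = model.score_samples([[10_000.0]])[0]
typical_score = model.score_samples([[40.0]])[0]

print("label for $10,000:", outlier_label)
print("scores:", round(float(outlier_score), 3), "vs", round(float(typical_score), 3))
```

The extreme purchase gets separated from the rest after very few random splits, which is what drives its score down.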

One-Class SVM

What

A variation of Support Vector Machine used for anomaly detection.

Why

Used when only normal data is available.

Example:

Train on normal behavior, detect abnormal points.

How

The algorithm finds a boundary around normal data.

Inside boundary → normal
Outside boundary → anomaly

Simple Big Picture

Unsupervised Learning
│
├── Clustering
│   ├── K-Means
│   ├── Hierarchical
│   ├── DBSCAN
│   └── Mean Shift
│
├── Dimensionality Reduction
│   ├── PCA
│   ├── t-SNE
│   └── UMAP
│
├── Association Rules
│   ├── Apriori
│   └── Eclat
│
└── Anomaly Detection
    ├── Isolation Forest
    └── One-Class SVM

The end... until next time, happy learning!