danielljeon.github.io

Machine Learning


Table of Contents

1 Intro and Data


2 Optimization

2.1 The Cost Function

The cost function, also known as the loss function (often written as the function $J$), measures how well a machine learning model's predictions match the actual target values. It quantifies the error between the expected and predicted outputs, providing the feedback signal that guides learning.

In essence, the objective is to minimize the cost function, allowing the model to achieve the most accurate and generalizable predictions possible.

2.2 Local Optimization

Finds the nearest locally optimal solution (which might not be the global optimum).

2.2.1 Gradient Descent

Gradient descent is an iterative optimization algorithm that minimizes a cost function by moving the current guess down the slope (opposite to the gradient), taking small incremental steps.

$$a_{n+1} = a_n - \eta_n \nabla J(a_n)$$

where:

  • $a_n$: The current parameter estimate at iteration $n$.
  • $\eta_n$: The learning rate (step size) at iteration $n$.
  • $\nabla J(a_n)$: The gradient of the cost function evaluated at $a_n$.

Gradient descent generally stops under one of the following conditions:

  1. The change in the cost function is smaller than a chosen threshold.
  2. The magnitude of the gradient becomes very small (derivative/slope is near zero).
  3. The maximum number of iterations is reached.
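A minimal sketch of this update rule and the stopping checks, assuming a simple quadratic example cost $J(a) = (a - 3)^2$ (the cost, learning rate, and tolerance here are illustrative, not prescribed above):

```python
import numpy as np

def gradient_descent(grad, a0, eta=0.1, tol=1e-8, max_iter=10_000):
    """Step opposite the gradient in small increments until converged."""
    a = a0
    for _ in range(max_iter):              # condition 3: max iterations reached
        step = eta * grad(a)
        a = a - step
        if np.linalg.norm(step) < tol:     # condition 2: gradient near zero
            break
    return a

# J(a) = (a - 3)^2 has gradient 2(a - 3), so the minimum is at a = 3.
a_min = gradient_descent(lambda a: 2 * (a - 3), a0=np.array([0.0]))
```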

Heuristic optimization methods often incorporate randomization or other exploratory strategies to reduce the chance of getting trapped in local minima and to improve the likelihood of finding a global minimum.

YouTube, 3Blue1Brown: Gradient descent, how neural networks learn | Deep Learning Chapter 2

2.3 Global Optimization

Finds the absolute best solution across the entire problem space, not just the nearest one.

2.3.1 Simulated Annealing (SA)

Background: The term "annealing" comes from metallurgy: the process of heating a metal and then cooling it slowly so that its atoms settle into a low-energy (stable) crystalline structure.

  • Cool too quickly = atoms "freeze" in random positions, locking in defects.
  • Cool slowly = atoms can rearrange to a lower total energy (optimal structure).

The SA algorithm searches for the global minimum of an objective function $J(x)$ by:

  1. Start at a random point $x_0$.
  2. Make small random moves.
  3. Accept not only improvements, but sometimes even worse moves.
  4. Gradually reduce the probability of accepting worse moves over time.

The change in the cost function is given by:

$$\Delta E = J(x') - J(x)$$

The acceptance likelihood for the new solution is given by the following equation:

$$P = e^{-\Delta E/T}$$

where:

  • $P$: The probability of accepting the new (worse) solution $x'$.
  • $\Delta E$: The change in the cost function.
  • $T$: The current temperature, gradually reduced over time.
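A compact sketch of this loop (the objective, step size, and cooling schedule below are illustrative choices, not values prescribed above):

```python
import math
import random

def simulated_annealing(J, x0, step=0.1, T0=1.0, cooling=0.999, T_min=1e-6):
    """Search for a global minimum by occasionally accepting worse moves."""
    x, T = x0, T0
    while T > T_min:
        x_new = x + random.uniform(-step, step)        # small random move
        dE = J(x_new) - J(x)                           # change in cost
        if dE < 0 or random.random() < math.exp(-dE / T):
            x = x_new                                  # accept (P = e^{-dE/T} if worse)
        T *= cooling                                   # gradually reduce temperature
    return x

# Example: a bumpy 1-D objective with many local minima.
x_best = simulated_annealing(lambda x: x**2 + 3 * math.sin(5 * x), x0=5.0)
```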

2.4 Regularization

Regularization adds a penalty term to the cost function to discourage overly complex models (large weights). This prevents overfitting and encourages generalization.

2.4.1 L1 Lasso Regression

L1 regularization (Lasso) adds a penalty equal to the sum of the absolute values of the model weights. This encourages sparsity by pushing some coefficients exactly to zero, effectively performing feature selection and simplifying the model.

L1 penalty term: $$R_{L1}(\omega) = \sum_{j=1}^{M} |\omega_j| = ||\omega||_1$$

L1 regularized cost function (component form): $$J_{L1}(\omega) = \frac{1}{2N} \sum_{i=1}^{N} (y_i - \hat y_i)^2 + \lambda \sum_{j=1}^{M} |\omega_j|$$

L1 regularized cost function (vector form): $$J_{L1}(\omega) = \frac{1}{2N} ||y - X\omega||_2^2 + \lambda ||\omega||_1$$

2.4.2 L2 Ridge Regression

L2 regularization (Ridge) adds a penalty equal to the sum of the squared weights, shrinking all coefficients smoothly toward zero without eliminating them. This prevents large weights, stabilizes learning, and reduces overfitting while maintaining all features' contributions.

L2 penalty term: $$R_{L2}(\omega) = \sum_{j=1}^{M} \omega_j^2 = ||\omega||_2^2$$

L2 regularized cost function (component form): $$J_{L2}(\omega) = \frac{1}{2N} \sum_{i=1}^{N} (y_i - \hat y_i)^2 + \lambda \sum_{j=1}^{M} \omega_j^2$$

L2 regularized cost function (vector form): $$J_{L2}(\omega) = \frac{1}{2N} ||y - X\omega||_2^2 + \lambda ||\omega||_2^2$$
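Both regularized costs transcribe directly to code; a small sketch assuming NumPy, with `lam` standing in for the regularization strength $\lambda$:

```python
import numpy as np

def l1_cost(w, X, y, lam):
    """L1 (Lasso) regularized MSE, matching the vector form above."""
    return np.sum((y - X @ w) ** 2) / (2 * len(y)) + lam * np.sum(np.abs(w))

def l2_cost(w, X, y, lam):
    """L2 (Ridge) regularized MSE, matching the vector form above."""
    return np.sum((y - X @ w) ** 2) / (2 * len(y)) + lam * np.sum(w ** 2)
```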


3 Supervised Prediction Models

3.1 Overview

3.1.1 Overfitting and Overtraining

Overfitting happens when a prediction model captures noise or random fluctuations in the training data instead of the underlying patterns that generalize to unseen data. In other words, the model memorizes the training data instead of learning general rules.

3.1.2 Bias

Bias is the error due to overly simplistic assumptions about the data.

3.1.3 Variance

Variance measures how much the model's predictions would change if it were trained on a different dataset.

3.2 Linear Regression: Continuous Outcome Prediction

Linear regression models the relationship between input features and a continuous output by fitting a straight line (or hyperplane) through the data. The model learns weights that minimize the squared error between predictions and actual values, either by solving a closed-form equation (normal equation) or using gradient descent to iteratively adjust parameters.

Linear:

$$h(x, \omega) = \omega_0 + \omega_1 x_1 + \omega_2 x_2 + \cdots + \omega_M x_M = \omega_0 + \sum_{j=1}^{M} \omega_j x_j$$

Higher order polynomial:

$$h \left( x_i, \omega \right) = \sum_{j=0}^M \omega_j x_i^j$$

where:

  • $h(x, \omega)$ / $h(x_i, \omega)$: The model's prediction (hypothesis) for the input.
  • $\omega_j$: The model weights (parameters).
  • $x$ / $x_i$: The input feature value (for sample $i$).
  • $M$: The number of weights (or the polynomial order).

3.2.1 Gradient Descent Approach: Mean Squared Error (MSE)

The objective function is the mathematical expression that should be minimized in order to find the best-fitting line (or hyperplane); in linear regression this is known as the Mean Squared Error (MSE):

$$J(\omega) = \frac{1}{2 N} \sum_{i=1}^{N} \left( h(x_i, \omega) - t_i \right)^2$$

3.2.2 Normal Equation Approach

The normal equation is a direct algebraic solution for the model parameters (weights) without needing iterative updates like gradient descent.

$$\omega = (X^T X)^{-1} X^T y$$

where $X$ is the design matrix of input features and $y$ is the vector of target values.
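A short sketch of the normal equation on synthetic data, assuming NumPy (`np.linalg.solve` is used rather than an explicit inverse for numerical stability):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.uniform(0, 10, 100)])  # bias column + one feature
y = 2.0 + 0.5 * X[:, 1] + rng.normal(0, 0.3, 100)             # noisy line, w = [2.0, 0.5]

# Solve (X^T X) w = X^T y directly, without iterative updates.
w = np.linalg.solve(X.T @ X, X.T @ y)
```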

3.3 Logistic Regression: Probability Based Classification

Logistic regression predicts the probability of a binary outcome by applying the sigmoid function to a linear combination of input features.

Instead of minimizing squared error (as in linear regression), logistic regression uses maximum likelihood estimation (MLE) to find the model parameters that make the observed labels most probable under the model. In other words, it adjusts the weights to maximize how well the predicted probabilities align with the actual outcomes.

Although logistic regression produces nonlinear (S-shaped) sigmoid outputs, it remains a linear model at its core: the decision boundary is defined by a linear equation of the input features. The sigmoid function simply transforms this linear relationship into a continuous probability between 0 and 1, enabling the model to perform smooth, probabilistic classification.

3.3.1 Sigmoid

The sigmoid is a mathematical function that maps any real number into the range (0, 1), making it useful for modeling probabilities. In logistic regression, the sigmoid function transforms a linear combination of inputs into a probability, allowing for prediction of binary outcomes.

$$\sigma \left( z \right) = \frac{1}{1 + e^{-z}}$$

The sigmoid can be rewritten to represent the probability of getting an $\mathrm{output} = 1$:

$$p(C = 1 \mid x) = \frac{1}{1 + e^{-(\omega_0 + \omega^T x)}}$$

where (similar to before):

  • $\omega_0$: The bias (intercept) term.
  • $\omega$: The weight vector.
  • $x$: The input feature vector.

3.3.2 Decision Boundary

The decision boundary is said to occur when the sigmoid = 0.5:

$$\sigma \left( z \right) = 0.5$$ $$p \left( C = 0 \mid x \right) = p \left( C = 1 \mid x \right) = 0.5$$

which happens when:

$$\omega_0 + \omega^T x = 0$$

3.3.3 Maximum Likelihood

Maximum Likelihood Estimation (MLE) chooses the parameter values that make an observed data set most likely under the model.

$$\max_{\omega} L(\omega) = \max_{\omega} \prod_{i=1}^N p(t=1 \mid x_i; \omega)^{t_i} \Big(1 - p(t=1 \mid x_i; \omega)\Big)^{1 - t_i}$$

where:

  • $t_i$: The true binary label ($0$ or $1$) of sample $i$.
  • $p(t=1 \mid x_i; \omega)$: The model's predicted probability that sample $i$ belongs to class $1$.
  • $N$: The number of training samples.

3.3.3.1 Maximum Log-Likelihood

In order to make finding the maximum likelihood easier, the log can be taken, converting the product over $N$ samples into a sum. This allows the maximum log-likelihood to be used as the objective function with gradient ascent/descent.

$$\ell_{log}(\omega) = \sum_{i=1}^N \Big[t_i \log p(t=1 \mid x_i; \omega) +(1 - t_i) \log \big(1 - p(t=1 \mid x_i; \omega)\big)\Big]$$

3.3.3.2 Binary Cross Entropy (BCE)

Real-life optimizers that minimize a cost function typically use the negative of the log-likelihood (averaged over the $N$ samples), known as the Binary Cross Entropy (BCE):

$$J(\omega) = - \frac{1}{N} \ell_{log}(\omega)$$
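A direct transcription of the BCE cost, assuming NumPy; the `eps` clipping is a practical guard against $\log(0)$, not part of the formula:

```python
import numpy as np

def sigmoid(z):
    """Map any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def bce_cost(w, b, X, t, eps=1e-12):
    """Binary cross entropy: the negative mean log-likelihood of the labels."""
    p = sigmoid(X @ w + b)            # predicted P(C = 1 | x) per sample
    p = np.clip(p, eps, 1 - eps)      # guard against log(0)
    return -np.mean(t * np.log(p) + (1 - t) * np.log(1 - p))
```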

3.4 Decision Trees: Rules-based Hierarchical Splits

In a decision tree, each node represents a subset of the training data.

At the root node, all data samples are present. As the tree grows and splits on feature thresholds, the dataset is divided into smaller subsets that pass down to the child nodes.

Thus, the number of data points contained in each node decreases with depth, as each branching point partitions the data further.

The impurity of a node measures how mixed or homogeneous the data is with respect to the target variable (class labels).

The goal of a decision tree algorithm is to select splits that reduce impurity the most, producing child nodes that are more homogeneous than their parent. This reduction in impurity is quantified using measures such as Information Gain (based on entropy) or Gini Reduction (based on the Gini Index).

Basic Process:

  1. Determine the best feature and split threshold using a greedy search algorithm.
    • For each feature, trial all possible splits (thresholds or category partitions).
    • Evaluate each candidate split using an impurity measure:
      • Information Gain with entropy
      • Gini Reduction with the Gini Index and weighted sum
    • Select the split that maximizes Information Gain (reduces impurity the most).
  2. Split the dataset into subsets according to the selected feature's values.
  3. Evaluate purity of resulting nodes (check if all samples belong to one class).
  4. Repeat recursively for each subset until:
    1. Nodes are pure (zero impurity/uncertainty), which amounts to memorizing the training data (bad).
    2. A stopping condition is met (e.g., max depth, min samples per leaf).

3.4.1 Information Gain and Impurity

Information Gain (IG) is a measure used in decision trees to decide which attribute to split on at each step. It quantifies how much "impurity" in the dataset can be reduced after splitting the data based on a given attribute.

$$IG = I_{parent} - \sum_{i \in V(A)} \frac{N_i}{N} I_i$$

where:

  • $I_{parent}$: The impurity of the parent node.
  • $V(A)$: The set of child nodes produced by splitting on attribute $A$.
  • $N_i$: The number of samples in child node $i$.
  • $N$: The number of samples in the parent node.
  • $I_i$: The impurity of child node $i$.

3.4.2 Entropy

The entropy or uncertainty is calculated as follows:

$$H(S) = -p(+) \log_{2} p(+) - p(-) \log_{2} p(-)$$

where:

  • $p(+)$: The proportion of positive samples in $S$.
  • $p(-)$: The proportion of negative samples in $S$.

3.4.3 Gini Index and Weighted Sum

The Gini Index is another common impurity measure used in decision trees. It quantifies how often a randomly chosen sample from the dataset would be incorrectly labeled if it were randomly assigned a label according to the distribution of labels in that node.

$$Gini = 1 - \sum_{i=1}^{n} p_i^2$$

where:

  • $p_i$: The proportion of samples in the node belonging to class $i$.
  • $n$: The number of classes.
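The three impurity quantities above transcribe directly; a sketch assuming NumPy, where `information_gain` takes the parent's label array and a list of child label arrays:

```python
import numpy as np

def entropy(labels):
    """H(S): uncertainty of the class distribution in a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    """Gini Index: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, children):
    """IG: parent impurity minus the weighted sum of child impurities (entropy here)."""
    N = len(parent)
    return entropy(parent) - sum(len(c) / N * entropy(c) for c in children)
```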

3.4.4 Overfitting and Pruning

By nature, decision trees tend to be low bias and overfit, meaning they often capture very specific noise or learn the training data outright. When a decision tree overfits, pruning is used to simplify it.

The idea is to replace certain unnecessary subtrees with leaf nodes that predict the majority class. This is tested on a validation set (separate from the training and test sets). If accuracy improves or stays the same, the prune is kept; if not, it's reverted. This prevents the model from memorizing noise and helps it generalize better.

This slightly increases bias while greatly reducing variance.

3.4.5 Decision Regression Trees

A Regression Tree is a type of decision tree designed for predicting continuous target values. Unlike classical decision trees, which are generally used for categorical outcomes, regression trees recursively split the data into regions (nodes) that minimize prediction error, typically measured by variance within each subset. Here, variance serves as the analogue of "impurity" in classification trees.

The objective is to create leaf nodes that represent smaller, more homogeneous ranges of the overall continuous target space, ensuring that each leaf provides a good local approximation of the target value.

Decision tree splits occur in the input (x) feature space, but the choice of the best split is always determined by how much it improves purity or reduces variance in the output (y) space.

When regression trees talk about variance, they are not referring to the spread of the input feature values (x); they are referring to the spread of the target values (y or t) within that node around their mean prediction.

Process:

  1. Select an attribute (feature) to test.

    • Eventually, all attributes may be considered as potential split candidates.
    • The selection uses a greedy search for the split that minimizes prediction error.
  2. Identify possible split points for each continuous attribute: $$n_{\mathrm{samples}} - 1$$

    • Typically, the midpoints between sorted values.
  3. Compute the mean target value for each subset formed by a split: $$\hat y_{\tau} = \frac{1}{N_{\tau}} \sum_{x_i \in Y_{\tau}} t_i$$

    where:

    • $\hat{y}_{\tau}$: The predicted output for subset (leaf) $\tau$.
    • $N_{\tau}$: The number of samples in subset $Y_{\tau}$.
    • $t_i$: The target value (label/output) for sample point $x_i$.
    • $x_i$: A single data sample/observation.
    • $Y_{\tau}$: The subset of data samples belonging to region/leaf $\tau$.
    • $N$: The total number of samples in the dataset.
  4. Calculate the error (Sum of Squared Residuals) for each subset:

    $$E_{\tau} = \sum_{x_i \in Y_{\tau}} \left( t_i - \hat{y}_{\tau} \right)^2$$

    where:

    • $E_{\tau}$: The total squared error for subset $\tau$.
    • $t_i$: The observed/true target value for sample $x_i$.
    • $x_i \in Y_{\tau}$: The samples belonging to subset $\tau$.

    The total error for each split is given by:

    $$E = \sum_{\tau=1}^{T} E_{\tau}$$

    where:

    • $E$: The total error across all subsets.
    • $\tau$: The index of the subset (e.g., a region, cluster, or leaf in a tree).
  5. Select the split that yields the lowest total error.

  6. Repeat recursively for each new subset until a stopping criterion is met.
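A sketch of steps 2 through 5 for a single continuous feature, assuming NumPy (the helper name `best_split` is illustrative):

```python
import numpy as np

def best_split(x, t):
    """Greedy search over midpoints for the split that minimizes total SSE."""
    order = np.argsort(x)
    x, t = x[order], t[order]
    best_threshold, best_error = None, np.inf
    for i in range(1, len(x)):                     # n_samples - 1 candidate splits
        threshold = (x[i - 1] + x[i]) / 2          # midpoint between sorted values
        left, right = t[:i], t[i:]
        error = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if error < best_error:                     # keep the lowest total error E
            best_threshold, best_error = threshold, error
    return best_threshold, best_error
```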

3.4.6 Random Forest: Bagging

A Random Forest is an ensemble of decision trees trained on random subsets of the data and/or features. Each tree acts as an independent predictor, and the final prediction is the average (for regression) or majority vote (for classification) of all trees.

$$\hat{h}_{bag}(x) = \frac{1}{N_{trees}} \sum^{N_{trees}}_{i=1} h_i(x)$$

Process:

  1. Divide the training data set into $N_{trees}$ random subsets (e.g., bootstrap samples).
  2. Build $N_{trees}$ trees to generate multiple hypotheses $h_i(x)$.
    • Consider only a random subset of features at each split throughout the forest.
  3. Use the forest to predict the target variable of test instances as the average of all trees in the forest.
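For illustration, the same process via scikit-learn's RandomForestRegressor on a synthetic dataset (assuming scikit-learn is available):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=0.5, random_state=0)

# Each tree trains on a bootstrap sample; max_features limits features per split.
forest = RandomForestRegressor(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)
y_pred = forest.predict(X[:5])  # average of all trees' predictions
```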

3.4.7 Adaboost: Adaptive Boosting on Trees

  1. Assign equal weights to the dataset for all $i$. $$\omega_i = \frac{1}{N}$$

  2. Train a weak learner (e.g., a decision stump).

  3. Calculate the learner's weighted error: $$\text{Error} = \frac{\sum_i \omega_i I \left( y_i \ne h(x_i)\right)}{\sum_i \omega_i}$$

    where:

    • $I$: The Iverson bracket (also known as the indicator function).
  4. Calculate the importance factor $\alpha$ for the weak learner. $$\alpha = \frac{1}{2} \ln \left( \frac{1 - \text{Error}}{\text{Error}} \right)$$

    where:

    • $\alpha$: The weight assigned to a weak learner (in boosting).
    • $\text{Error}$: The weighted error rate of the weak learner.
      • For example: the sum of misclassified sample weights divided by the total.
  5. Reweight the data points with the calculated importance factors.

    • Correct sample: $\omega_{i} \leftarrow \omega_{i} e^{-\alpha}$.
    • Incorrect sample: $\omega_{i} \leftarrow \omega_{i} e^{\alpha}$.

    where:

    • $\omega_i$: The weight (importance factor) of the $i$-th training sample.
  6. Normalize the new weights. $$\omega_{i} \leftarrow \frac{\omega_{i}}{\sum_{i=1}^{N} \omega_{i}}$$

    where:

    • $\sum_{i=1}^{N} \omega_i$: Is the normalization constant, it ensures all weights sum to 1.
  7. Repeat the process for several rounds to build multiple weak learners.

  8. Final prediction:

$$h(x) = \text{sign} \left( \sum^{N_{stumps}}_{i=1} \alpha_i h_i (x) \right)$$

where:

  • $h(x)$: The final ensemble prediction.
  • $\alpha_i$: The weight (importance factor) of weak learner $i$.
  • $h_i(x)$: The prediction of weak learner $i$.
  • $N_{stumps}$: The number of weak learners.
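A sketch of one boosting round (steps 3 through 6), assuming NumPy arrays with labels in $\{-1, +1\}$; the helper name `adaboost_round` is illustrative:

```python
import numpy as np

def adaboost_round(w, y, pred):
    """One boosting round: weighted error, learner weight alpha, sample reweighting."""
    miss = y != pred                                # Iverson bracket I(y_i != h(x_i))
    error = np.sum(w * miss) / np.sum(w)            # weighted error rate
    alpha = 0.5 * np.log((1 - error) / error)       # importance factor
    w = w * np.exp(np.where(miss, alpha, -alpha))   # up-weight mistakes, down-weight hits
    return w / w.sum(), alpha                       # normalized weights
```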

3.4.7.1 Iverson Bracket and Signum Operator

$$I(\text{condition}) = \begin{cases} 1, & \text{if condition is true (misclassified)} \\ 0, & \text{if condition is false (classified correctly)} \end{cases}$$

$$\text{sign}(x) = \begin{cases} +1, & \text{if } x > 0 \\ 0, & \text{if } x = 0 \\ -1, & \text{if } x < 0 \end{cases}$$

3.5 Support Vector Machines (SVM)

SVM tries to maximize the margin between two classes: the margin is the distance between the decision boundary and the closest data points from each class (called support vectors).

Suppose there are two features; SVM finds a line that separates the classes:

$$\omega^T x + b = 0$$

where:

  • $\omega$: The weight vector (normal to the separating hyperplane).
  • $x$: The input feature vector.
  • $b$: The bias (offset) term.

SVM solves an optimization problem to maximize the margin, which is equivalent to minimizing $|| \omega ||^2$, subject to all points being correctly classified (or mostly correct if soft-margin). In other words, find the flattest separating hyperplane (smallest $\omega$) that correctly divides the data, to maximize the margin between classes.

The objective function is given by:

$$J(\omega, b) = \frac{1}{2} || \omega ||^2$$

where:

  • $||\omega||$: The norm of the weight vector; minimizing it maximizes the margin, which equals $2 / ||\omega||$.
  • $b$: The bias term, which enters through the classification constraints.

To optimize the SVM cost function, Lagrange duality can be used, giving a closed-form relationship at optimality:

$$\omega = \sum^N_{i=1} \alpha_i y_i x_i$$

where:

  • $\alpha_i$: The Lagrange multipliers; nonzero only for the support vectors.
  • $y_i$: The class label of sample $i$ ($\pm 1$).
  • $x_i$: Training sample $i$.

The classification decision is made by the $\text{sign}$ of $\omega^T x + b$:

$$\hat{y} = \begin{cases} +1, & \text{if } \omega^T x + b \geq 0 \\ -1, & \text{if } \omega^T x + b < 0 \end{cases}$$

3.5.1 Kernel

A kernel allows SVMs to draw curved, non-linear decision boundaries in the original feature space by performing a linear separation in a higher-dimensional (often implicit) feature space. This process is done efficiently using the kernel trick, which avoids explicitly computing the transformation into that higher-dimensional space.

The kernel function $K(x_i, x_j)$ replaces the dot product between data points in the Lagrange dual formulation, resulting in the decision function:

$$f(x) = \text{sign}\left( \sum_i \alpha_i y_i K(x_i, x) + b \right)$$
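The notes above do not fix a particular kernel; as one common example, a Gaussian (RBF) kernel could be sketched as:

```python
import numpy as np

def rbf_kernel(x_i, x_j, gamma=1.0):
    """Gaussian (RBF) kernel: a linear separation in an implicit feature space."""
    diff = x_i - x_j
    return np.exp(-gamma * np.dot(diff, diff))
```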

3.5.2 Slack Variables: Regularization Effect

The slack variable $\xi_i$ allows some points to violate the margin constraint; it measures how much each point breaks the rule. The SVM then balances two competing goals:

  1. Keep the margin as wide as possible.
  2. Keep total violations (sum of $\xi_i$) as small as possible.

$$\xi_i = \max\left(0,1 - y_i(\omega^T x_i + b)\right)$$

| $\xi_i$ | Meaning | Classification |
| --- | --- | --- |
| $\xi_i = 0$ | Perfectly outside margin | Correct |
| $0 < \xi_i < 1$ | Inside margin but on correct side | Correct but close |
| $\xi_i > 1$ | Crossed margin / on wrong side | Misclassified |

Thus, the soft-margin SVM optimization objective is expressed as:

$$J(\omega, b, \xi) = \frac{1}{2}||\omega||^2 + C \sum_i \xi_i$$
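Here $C$ controls the trade-off between margin width and total violations. The objective transcribes directly; a sketch assuming NumPy with labels in $\{-1, +1\}$:

```python
import numpy as np

def soft_margin_cost(w, b, X, y, C):
    """Soft-margin objective: margin term plus C-weighted slack penalties."""
    xi = np.maximum(0.0, 1.0 - y * (X @ w + b))  # slack xi_i per sample
    return 0.5 * np.dot(w, w) + C * np.sum(xi)
```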

3.6 K-Nearest Neighbour (KNN): Hard Classification and Regression

K-Nearest Neighbours (KNN) is a supervised learning algorithm used for classification or regression. However, unlike other prediction models, KNN does not train a model; it lazily stores the training data and uses it directly to make predictions, as the steps below (and the sketch that follows them) show.

  1. Store the training data.
  2. Select a number of neighbours to consider $k$.
  3. When a new data point comes in, measure its distance to all training points.
  4. Make a prediction:
    1. Classification: Vote within the labels of the $k$ nearest neighbours of the new data point, and whichever label appears most often "wins."
    2. Regression: Compute the average (or sometimes weighted average) of the values from the $k$ nearest neighbours.
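A minimal sketch of the classification variant, assuming NumPy arrays and the Euclidean distance described in the next subsection (`knn_predict` is an illustrative helper):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x_new, axis=1)    # Euclidean distance to all points
    nearest = np.argsort(distances)[:k]                    # indices of the k closest
    return Counter(y_train[nearest]).most_common(1)[0][0]  # most frequent label "wins"
```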

3.6.1 Distance Measurements

Euclidean Distance: essentially the Pythagorean theorem summed over all dimensions to find the straight-line distance between points.

$$D(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

where:

  • $x_i$, $y_i$: The $i$-th components of points $x$ and $y$.
  • $n$: The number of dimensions (features).

Hamming Distance: the number of bit positions at which two binary numbers differ, for example:

```python
def hamming_distance(a: int, b: int) -> int:
    return bin(a ^ b).count("1")  # XOR marks differing bits, then count the ones

hamming_distance(0b100011, 0b110110)  # 0b100011 XOR 0b110110 = 0b010101 -> 3
```

Cosine Distance: Calculating the cosine of the angle between points (as vectors) to measure the distance.

$$D(x, y) = 1 - \frac{x \cdot y}{||x|| ||y||}$$

where:

  • $x \cdot y$: The dot product of vectors $x$ and $y$.
  • $||x||$, $||y||$: The magnitudes (norms) of the vectors.

3.6.2 Weighted K-Nearest Neighbour

In Weighted KNN, not all neighbours are treated equally. Closer neighbours are assumed to be more relevant to the prediction than farther ones. So, each neighbour is assigned a weight based on its distance to the query point.

Various weighting functions exist; for example, inverse distance weighting is given by:

$$w_i = \frac{1}{D(x, x_i)}$$

where $D(x, x_i)$ is the distance between the query point $x$ and neighbour $x_i$.


4 Unsupervised Prediction Models

4.1 K-Means Clustering: Hard Clustering

Unlike KNN, K-Means Clustering is an unsupervised prediction method, working to find natural groups in unlabeled data.

Decision boundaries are made separating points (or feature values) into zones for prediction.

By cutting the line between each pair of centroids at its midpoint with an equidistant, perpendicular boundary, the space is split into regions, forming what is known as a Voronoi Diagram.

Process:

  1. Select a number of clusters or groups $k$.
  2. Plot the cluster centroids randomly in the problem space.
  3. Calculate the distance between each feature data point to each cluster centroid.
    • See KNN distance measurement notes.
  4. Assign each feature data point to its respective shortest distance cluster.
  5. Update the cluster centroids to the mean (average) of all the points assigned to that cluster.

Thus, the cost function is defined by:

$$J ( C_i, \mu_i ) = \sum_{i=1}^{k} \sum_{x_j \in C_i} || x_j - \mu_i ||^2$$

where:

  • $k$: The number of clusters.
  • $C_i$: The set of points assigned to cluster $i$.
  • $\mu_i$: The centroid of cluster $i$.
  • $x_j$: A data point assigned to cluster $C_i$.

4.1.1 Coordinate Descent: Lloyd's Algorithm

The cost function for K-Means Clustering is optimized by coordinate descent. This effectively alternates between optimizing two variables:

  1. Cluster assignments $C_i$: which cluster each point belongs to.
  2. Centroids $\mu_i$: the cluster centers.

To do this, Lloyd's algorithm is used:

  1. Assignment step (each point joins the cluster of its nearest centroid): $$C_i^{(t+1)} = \{ x_j : i = \arg\min_{l} ||x_j - \mu_l^{(t)}||^2 \}$$
  2. Update step: $$\mu_i^{(t+1)} = \frac{1}{|C_i^{(t+1)}|} \sum_{x_j \in C_i^{(t+1)}} x_j$$

Each iteration strictly reduces $J$ (or leaves it unchanged), so the algorithm converges to a local minimum. Unlike gradient descent, where small steps are modulated by a learning rate, K-Means takes an exact step in one variable by directly solving $\frac{\partial J}{\partial \mu_i} = 0$, then holds that variable fixed while the other is updated.
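A compact sketch of Lloyd's algorithm, assuming NumPy (the empty-cluster edge case is ignored for brevity):

```python
import numpy as np

def lloyd_kmeans(X, k, iters=100, seed=0):
    """Alternate exact assignment and update steps until the centroids stop moving."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]   # random initial centroids
    for _ in range(iters):
        dist = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        assign = dist.argmin(axis=1)                    # assignment step
        new_mu = np.array([X[assign == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_mu, mu):                     # J can no longer decrease
            break
        mu = new_mu                                     # update step
    return mu, assign
```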

4.1.2 Standardization and Normalization

For datasets or applications whose features have large differences in units or scale, the data can be preprocessed so that distance comparisons are meaningful.

Values can be normalized to be in the range 0 to 1.

Values can also be standardized by scaling each feature's values ensuring a mean of $\mu = 0$ and a variance of $\sigma^2 = 1$.

$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N} = 1$$
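Both preprocessing steps as a sketch, assuming NumPy arrays with one feature per column:

```python
import numpy as np

def normalize(X):
    """Rescale each feature column into the range 0 to 1."""
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

def standardize(X):
    """Scale each feature column to mean 0 and variance 1."""
    return (X - X.mean(axis=0)) / X.std(axis=0)
```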

4.2 Gaussian Mixture Models (GMM): Soft Classification

A Gaussian Mixture Model (GMM) assumes data points come from a mix of several Gaussian (bell-shaped) distributions, each with its own mean and spread. Instead of hard-assigning each point to a single cluster like k-means, GMM gives probabilities (soft assignments) of belonging to each cluster. This makes GMM more flexible, since it can model elliptical clusters of different sizes and orientations, while k-means only captures spherical clusters of equal variance. In short, GMM is a probabilistic, more general version of clustering compared to k-means.

$$p(X \mid \mu, \Sigma) = \frac{1}{\sqrt{(2 \pi)^d \lvert \Sigma \rvert}} \, e^{-\tfrac{1}{2} (X - \mu)^{T} \Sigma^{-1} (X - \mu)}$$

where:

  • $X$: A $d$-dimensional data point.
  • $\mu$: The mean vector of the Gaussian.
  • $\Sigma$: The covariance matrix ($\lvert \Sigma \rvert$ is its determinant).
  • $d$: The number of dimensions (features).
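For illustration, scikit-learn's GaussianMixture fits the means and covariances and returns the soft assignments described above (the two synthetic blobs are made up for the example):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1.0, (100, 2)),    # two synthetic Gaussian blobs
               rng.normal(4, 1.5, (100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
probs = gmm.predict_proba(X[:3])  # soft assignments: P(cluster | x) for each point
```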