Optimization Algorithms in Machine Learning

Muhammed ÇELİK
11 min read · Oct 28, 2023

Once upon a time in the world of computing, there was a quest to teach machines to learn and make smart decisions. This journey was not just about giving machines knowledge but also the wisdom to refine that knowledge.

Imagine a group of dedicated explorers known as “Data Scientists.” They embarked on this quest armed with a powerful tool: Optimization Algorithms. These algorithms were like magic spells that fine-tuned the abilities of machines to predict, classify, and uncover hidden patterns.

In their bag of tricks, they had the mighty “Gradient Descent,” a wand that optimized models to minimize errors. They also had “K-Means Clustering,” a compass that guided them through data’s uncharted territories.

But this adventure wasn’t just about technology; it was also a journey of understanding, like discovering the secret language of “t-SNE” to visualize data’s stories. They learned to solve complex riddles using “Q-Learning” and “Actor-Critic” methods, teaching machines how to make the best decisions.

Supervised Machine Learning

Supervised machine learning is all about teaching models to make informed predictions based on labeled data. Optimization algorithms are the powerhouse behind fine-tuning model parameters to minimize errors. Here’s a glance at some common optimization algorithms in this domain:

01. Closed-Form Solutions (Normal Equation Analytical Solution)

  • Some linear models, like linear regression, have closed-form solutions for optimal parameter estimation.
  • A closed-form solution, in the context of mathematics and optimization, refers to a solution that can be expressed using a finite number of mathematical operations. It is a solution that can be computed directly, often in a single step or formula, without the need for iterative methods or approximation algorithms. They’re computationally efficient but may not be applicable to complex models.

Formula:

w = ( np.linalg.inv(X.T @ X) @ X.T ) @ y

Where:

  • w represents the optimal coefficients (weights).
  • X is the feature matrix.
  • y is the target vector.
  • X.T represents the transpose of the feature matrix X.
  • np.linalg.inv(X.T @ X) represents the matrix inverse of the product of the transpose of X and X.

This equation allows you to find the optimal coefficients for a linear regression model without the need for iterative optimization algorithms like gradient descent. Instead, you can compute w directly using this closed-form solution, provided that the matrix (X^T X) is invertible (non-singular). This is a powerful and exact way to find the best-fit linear model when it’s applicable.
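As a minimal sketch, the normal equation can be tried on a tiny synthetic dataset. The helper name fit_normal_equation and the data are assumptions made for illustration. The sketch also calls np.linalg.solve instead of forming the inverse explicitly; this gives the same w whenever X.T @ X is invertible and is usually more numerically stable.

```python
import numpy as np

def fit_normal_equation(X, y):
    """Solve (X.T @ X) w = X.T @ y for w (hypothetical helper)."""
    # Same result as np.linalg.inv(X.T @ X) @ X.T @ y when X.T @ X is invertible.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Synthetic, noise-free data generated from y = 1 + 2 * x (illustration only).
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])  # bias column plus one feature
y = 1.0 + 2.0 * x

w = fit_normal_equation(X, y)  # w[0] is the intercept, w[1] the slope
```

Because the data here is exactly linear, the recovered intercept and slope match the values used to generate it.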

Real-world example: Training a linear regression model to predict house prices based on features like square footage, number of bedrooms, and location.

02. Iterative Optimization Methods (Gradient Descent)

  • Batch Gradient Descent: A method that uses the entire training dataset for every parameter update. It reaches the global minimum when the loss is convex (as in linear regression) and the learning rate is small enough, but each step can be slow on large datasets.
  • Stochastic Gradient Descent (SGD): Parameters are updated using a single random training example, which makes each step cheap and suits large datasets. It usually makes faster progress per pass over the data, but the cost function oscillates more from step to step.
  • Mini-Batch Gradient Descent: A balanced approach using random data subsets for optimization. It combines the strengths of both batch and stochastic gradient descent.

Formula:

Weight (W) = Weight (W) - Learning Rate (α) * Gradient of the Loss Function (∇J(W))

Parameters:

  • Weight (W): Model parameters to be updated during optimization.
  • Learning Rate (α): A small value controlling the step size in weight updates.
  • Gradient of the Loss Function (∇J(W)): The gradient points in the direction of steepest ascent of the loss, so subtracting it moves the weights in the direction of steepest descent.
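The three variants differ only in how many examples feed each update, which the following sketch tries to make concrete. It assumes a mean-squared-error loss for linear regression; the function names, the learning rate, and the synthetic data are illustrative choices, not part of the formula above.

```python
import numpy as np

def mse_gradient(W, X, y):
    """Gradient of J(W) = mean((X @ W - y) ** 2) with respect to W."""
    return 2.0 * X.T @ (X @ W - y) / len(y)

def gradient_descent(X, y, lr=0.3, epochs=500, batch_size=None, seed=0):
    """batch_size=None gives batch GD, 1 gives SGD, anything else mini-batch GD."""
    rng = np.random.default_rng(seed)
    n = len(y)
    batch_size = n if batch_size is None else batch_size
    W = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the examples every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            # Update rule from the text: W = W - lr * gradient of J at W
            W = W - lr * mse_gradient(W, X[idx], y[idx])
    return W

# Synthetic, noise-free data from y = 1 + 2 * x (assumed for illustration).
x = np.linspace(0.0, 1.0, 20)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

W_batch = gradient_descent(X, y)               # full dataset per update
W_mini = gradient_descent(X, y, batch_size=5)  # 5 examples per update
```

On this small convex problem both variants approach the same weights; on real data the learning rate typically needs tuning.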

03. Regularized Models

Ridge and Lasso Regression:

  • These employ optimization techniques tailored to the L2 and L1 regularization terms, respectively. They’re effective in combating overfitting in linear models.
  • Ridge has a closed-form solution, but gradient-based solvers are a practical alternative when the dataset is too large for forming and inverting X.T @ X to be affordable.

Explanation Formula for Ridge with closed-form solution:

(np.linalg.inv(X.T @ X + (alpha * Identity)) @ X.T) @ y

  1. X.T: Transpose of the feature matrix X. This operation flips the rows and columns of X, making it suitable for matrix multiplication.
  2. X.T @ X: The matrix product of the transposed feature matrix and the original feature matrix. The result is a square matrix with one row and one column per feature; each entry is the dot product of two feature columns of X.
  3. alpha * Identity: The regularization term, where alpha is the regularization parameter, and Identity is the identity matrix. This term is added to the result of X.T @ X to apply L2 (ridge) regularization. The identity matrix has diagonal elements set to 1 and off-diagonal elements set to 0.
  4. X.T @ X + (alpha * Identity): This is the sum of the matrix X.T @ X and the regularization term. It combines the data term (sum of products of elements) and the penalty term (encouraging smaller coefficients).
  5. np.linalg.inv(...): The np.linalg.inv function computes the matrix inverse of the matrix inside the parentheses. In this context, it calculates the inverse of the combined matrix that includes the data term and the regularization term.
  6. (np.linalg.inv(X.T @ X + (alpha * Identity)) @ X.T): After computing the inverse, this part multiplies it by the transposed feature matrix X.T. The result is an intermediate matrix that maps a target vector to model parameters; the parameters themselves are not computed until the next step.
  7. (np.linalg.inv(X.T @ X + (alpha * Identity)) @ X.T) @ y: Finally, the result of the previous step is multiplied by the target vector y. This yields the weight vector w, that is, the model parameters. Predictions for the training data are then obtained as X @ w.

In summary, the formula (np.linalg.inv(X.T @ X + (alpha * Identity)) @ X.T) @ y is the closed-form solution for linear regression with L2 (ridge) regularization. It combines the data term, the regularization term, and the linear regression equation to calculate the model parameters w that minimize the loss function. The regularization term encourages smaller coefficients in the model to prevent overfitting.
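The ridge formula can be sketched in a few lines. The helper name fit_ridge and the toy data are assumptions for illustration, and np.linalg.solve stands in for the explicit inverse. The toy data has no bias column; with one, this version would penalize the intercept too.

```python
import numpy as np

def fit_ridge(X, y, alpha):
    """Closed-form ridge: solve (X.T @ X + alpha * I) w = X.T @ y (hypothetical helper)."""
    identity = np.eye(X.shape[1])  # one row and column per feature
    return np.linalg.solve(X.T @ X + alpha * identity, X.T @ y)

# Toy data where y = 1 * x1 + 2 * x2 exactly (illustration only).
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])

w_ols = fit_ridge(X, y, alpha=0.0)     # alpha = 0 recovers ordinary least squares
w_ridge = fit_ridge(X, y, alpha=10.0)  # a larger alpha shrinks the weights toward zero
```

Comparing the norms of w_ols and w_ridge shows the shrinkage effect of the penalty term.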

Elastic Net:

  • Combining L1 and L2 regularization, it provides a flexible approach to regularization.
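  • Lasso and Elastic Net have no closed-form solution in general, because the L1 term is not differentiable at zero. They are typically fitted with iterative methods such as coordinate descent (which updates one weight at a time and is not a form of gradient descent) or subgradient and proximal gradient methods.
  • J(W) = L(W) + ( a * ||W||₁ + 0.5 * b * ||W||₂² ), where the term in parentheses is the regularization term.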

Real-world example: Applying Lasso regression for feature selection in a machine learning model for medical diagnosis.

In scikit-learn, the optimization method for Ridge is chosen with the solver parameter:

  • 'auto': chooses the solver automatically based on the type of data.
  • 'svd': singular value decomposition, a closed-form solution for Ridge.
  • 'cholesky': uses the standard scipy.linalg.solve function to obtain a closed-form solution.
  • 'lsqr': uses the dedicated regularized least-squares routine scipy.sparse.linalg.lsqr.
  • 'sag' uses Stochastic Average Gradient descent, and 'saga' uses its improved, unbiased version named SAGA. Both are iterative and are often faster than the other solvers when both n_samples and n_features are large. Note that the fast convergence of 'sag' and 'saga' is only guaranteed on features with approximately the same scale.
  • scikit-learn's Lasso and ElasticNet estimators do not take this parameter; they are fitted by coordinate descent. 'saga' is a variant of SAG rather than of coordinate descent, and in LogisticRegression it is the solver that supports the elastic-net penalty (as well as L1).

Explanation Formula for Lasso, Elastic Net and Logistic Regression with Gradient Descent:

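Objective Function J(W) = Loss (W) + Regularization Term

Regularization = ( Regularization Term₁ (L1) + Regularization Term₂ (L2) )

Regularization = ( a * L₁ Norm of Weight (W) + 0.5 * b * L₂ Norm of Weight (W)² )

The update subtracts the gradient of the whole objective, so the penalty enters through its (sub)gradient, a * sign(W) + b * W, rather than through its value:

W = W - α * ∇J(W), where ∇J(W) = ∇Loss (W) + a * sign(W) + b * W

Weight (W) = Weight (W) - Learning Rate (α) * ( Gradient of Loss (W) + alpha * l1_ratio * sign(Weight (W)) + alpha * (1 - l1_ratio) * Weight (W) )

Here α is the learning rate, while alpha is the regularization strength in scikit-learn's naming; they are different quantities. Because ||W||₁ is not differentiable at zero, sign(W) is a subgradient there.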

Parameters:

  • Weight (W):
    The model parameters to be updated during optimization. These parameters are what the algorithm is learning to fit the data.
  • Learning Rate (α):
    A small value that controls the step size in weight updates. It determines how much the model’s weights are adjusted during each iteration.
  • Gradient of the Loss Function (∇J(W)):
    This vector points in the direction of steepest ascent of the objective; subtracting it moves the model's weights in the direction of steepest descent.
  • L1 regularization strength (λ) = a = alpha * l1_ratio:
    A parameter that controls the strength of the L1 (Lasso) regularization term. L1 regularization encourages sparsity in the model by adding a penalty for absolute values of weights. Alpha corresponds to 1 / (2C) in other linear models such as LogisticRegression or LinearSVC.
  • L₁ Norm of Weight (W) = ||W||₁
    represents the L1 Norm (Manhattan Norm) of the weight vector W.
  • L2 regularization strength (λ) = b = alpha * (1 - l1_ratio):
    A parameter that controls the strength of the L2 (Ridge) regularization term. L2 regularization discourages large weights by adding a penalty for their squared values. Alpha corresponds to 1 / (2C) in other linear models such as LogisticRegression or LinearSVC.
  • L₂ Norm of Weight (W)² = ||W||₂²
    represents the L2 Norm (Euclidean Norm) of the weight vector W squared.
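To see how these parameters interact, here is a small sketch of an elastic net fit. Note the substitution: it uses proximal gradient descent (a gradient step on the smooth part, then soft-thresholding for the L1 part) rather than plain gradient descent, because that handles the non-differentiable L1 term cleanly and produces exact zeros. The loss is assumed to be squared error scaled by 1 / (2 * n_samples); the function names, step size, and data are illustrative.

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink each entry of v toward zero by t; entries within [-t, t] become 0."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def elastic_net_proximal_gd(X, y, alpha, l1_ratio, lr=0.5, n_iter=200):
    """Minimize ||X W - y||^2 / (2 n) + a * ||W||_1 + 0.5 * b * ||W||_2^2."""
    a = alpha * l1_ratio          # L1 strength
    b = alpha * (1.0 - l1_ratio)  # L2 strength
    n = len(y)
    W = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # Gradient of the smooth part: squared-error loss plus the L2 penalty.
        grad = X.T @ (X @ W - y) / n + b * W
        # Gradient step, then soft-thresholding accounts for the L1 penalty.
        # lr is assumed small enough for the smooth part (it is for this toy data).
        W = soft_threshold(W - lr * grad, lr * a)
    return W

# Toy orthogonal design so the effect on each weight is easy to follow.
X = np.sqrt(2.0) * np.eye(2)
y = np.sqrt(2.0) * np.array([3.0, 0.1])
W = elastic_net_proximal_gd(X, y, alpha=1.0, l1_ratio=0.5)
```

In this toy setup the weakly supported second weight is driven exactly to zero while the first is only shrunk, which is the sparsity effect described for the L1 term.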

04. Logistic Regression

This is widely used for classification tasks and employs optimization methods such as gradient descent or quasi-Newton methods. Logistic regression is fundamental for binary and multiclass classification.
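A minimal sketch of the gradient descent route for binary logistic regression follows. It assumes the mean log loss with no regularization; the function names, learning rate, and tiny dataset are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss_gradient(W, X, y):
    """Gradient of the mean log loss for labels y in {0, 1}."""
    return X.T @ (sigmoid(X @ W) - y) / len(y)

def fit_logistic_gd(X, y, lr=0.5, n_iter=2000):
    W = np.zeros(X.shape[1])
    for _ in range(n_iter):
        W = W - lr * log_loss_gradient(W, X, y)  # same update rule as for regression
    return W

# Tiny, deliberately non-separable dataset (illustration only).
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
X = np.column_stack([np.ones_like(x), x])  # bias column plus one feature
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

W = fit_logistic_gd(X, y)
predicted = (sigmoid(X @ W) > 0.5).astype(int)  # threshold probabilities at 0.5
```

Quasi-Newton methods such as L-BFGS minimize the same objective but use curvature information to take better-scaled steps.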

Real-world example: Using logistic regression to classify emails as spam or not spam based on text features.

05. Support Vector Machines (SVM)

Finding the maximum-margin hyperplane is a constrained quadratic optimization problem, commonly solved with specialized algorithms such as Sequential Minimal Optimization (SMO). SVMs are used for both classification and regression tasks.

Real-world example: Using an SVM for image classification, distinguishing between cats and dogs based on image features.

06. Neural Networks

Deep learning models depend heavily on optimization methods such as Stochastic Gradient Descent (SGD), Adam, RMSprop, and more. Deep learning has revolutionized various areas of machine learning, including image recognition, natural language processing, and reinforcement learning.

Real-world example: Training a deep neural network for natural language processing to perform sentiment analysis on customer reviews.

07. Decision Trees and Random Forests

For a single decision tree, optimization means greedily choosing, at each node, the split point that most reduces impurity (for example, Gini impurity or entropy). Random forests are an ensemble technique: they combine the predictions of many individual decision trees (often called “base models” or “weak learners”), each trained on a random sample of the data and features, to make a more accurate and robust prediction and to reduce overfitting.
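The split search at a single node can be sketched as follows. This assumes Gini impurity and one numeric feature; the function names and data are illustrative, and real tree libraries use faster, more elaborate versions of this search.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    """Return (threshold, weighted impurity) of the best split on one feature."""
    values = np.unique(x)                        # sorted unique feature values
    thresholds = (values[:-1] + values[1:]) / 2  # midpoints between neighbors
    best = (None, gini(y))                       # baseline: impurity without splitting
    for t in thresholds:
        left, right = y[x <= t], y[x > t]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if impurity < best[1]:
            best = (t, impurity)
    return best

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
threshold, impurity = best_split(x, y)
```

A full tree repeats this search recursively on the left and right subsets, over every feature, until a stopping rule is met.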

Real-world example: Using a random forest model to predict customer churn in a subscription-based service.

08. Gradient Boosting

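Algorithms like XGBoost, LightGBM, and CatBoost build an ensemble one tree at a time: each new tree is fitted to the negative gradient of the loss with respect to the current ensemble's predictions, so adding it reduces the remaining error. They excel in both classification and regression problems.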

Real-world example: Using XGBoost to predict stock price movements based on historical market data.

09. Linear Discriminant Analysis (LDA)

  • LDA’s primary goal in the context of classification is to find a linear combination of features that best separates two or more classes. It does this by maximizing the differences between class means while minimizing the variations within each class.
  • LDA seeks to optimize the linear coefficients that define this combination of features to achieve the objectives mentioned above. The optimization problem in LDA typically involves finding the coefficients that maximize a certain objective function, such as the ratio of between-class variance to within-class variance (Fisher’s discriminant criterion). This is usually done using techniques like eigenvalue decomposition or singular value decomposition, which help determine the optimal linear coefficients. These coefficients are used to project the data into a lower-dimensional space, making LDA a powerful method for feature extraction and classification.
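For two classes, Fisher's criterion has a direct solution: the best direction is proportional to the inverse within-class scatter matrix times the difference of the class means. The sketch below illustrates that special case; the function name and the toy data are assumptions.

```python
import numpy as np

def fisher_direction(X0, X1):
    """Unit vector maximizing between-class over within-class variance (two classes)."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter: summed outer products of the centered samples.
    Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    w = np.linalg.solve(Sw, m1 - m0)  # proportional to inv(Sw) @ (m1 - m0)
    return w / np.linalg.norm(w)

# Toy data: both classes spread widely along the first axis but differ
# only along the second axis (illustration only).
X0 = np.array([[0.0, 0.0], [1.0, 0.2], [2.0, -0.1], [3.0, 0.1]])
X1 = X0 + np.array([0.0, 1.0])

w = fisher_direction(X0, X1)
proj0, proj1 = X0 @ w, X1 @ w  # one-dimensional projections of each class
```

The direction ignores the high-variance first axis, since it carries no class information, and the projected classes do not overlap.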

10. Generalized Linear Models (GLM)

  • GLMs are a class of supervised learning models commonly used for regression and classification tasks. They extend traditional linear regression to a broader range of target variable types, not limited to continuous values: they can handle continuous, binary, count, and other kinds of response variables.
  • The optimization method used for fitting GLMs typically involves maximizing the likelihood function. The specific optimization algorithm employed may vary depending on the software or library used for modeling. Common optimization techniques for GLMs include iterative algorithms like Newton-Raphson, Fisher scoring, and gradient descent methods, depending on the specific problem and software implementation.

Unsupervised Machine Learning

Unsupervised machine learning explores data without explicit labels or targets, aiming to uncover hidden patterns, clusters, or representations. Optimization comes into play for tasks like clustering, dimensionality reduction, and density estimation. Here are some common optimization algorithms:

1. K-Means Clustering

This algorithm optimizes cluster centroids to minimize the within-cluster sum of squares. It is widely used for data clustering and segmentation.
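A common way to carry out this minimization is Lloyd's algorithm, which alternates between assigning points to the nearest centroid and moving each centroid to the mean of its points. The following is a minimal sketch of it; the function names, the fixed iteration count, and the choice of initial centroids are assumptions for illustration.

```python
import numpy as np

def assign(X, centroids):
    """Index of the nearest centroid (squared Euclidean distance) for each point."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(X, init_centroids, n_iter=10):
    centroids = np.array(init_centroids, dtype=float)
    for _ in range(n_iter):
        labels = assign(X, centroids)    # assignment step
        for k in range(len(centroids)):  # update step
            if np.any(labels == k):      # leave an empty cluster's centroid unchanged
                centroids[k] = X[labels == k].mean(axis=0)
    labels = assign(X, centroids)
    wcss = ((X - centroids[labels]) ** 2).sum()  # within-cluster sum of squares
    return centroids, labels, wcss

# Two well-separated groups of points (illustration only).
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
centroids, labels, wcss = kmeans(X, init_centroids=X[[0, 3]])
```

Each of the two steps can only lower the within-cluster sum of squares or leave it unchanged, but the result depends on the initial centroids and may be a local minimum.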

Real-world example: Segmenting customers into groups for targeted marketing based on their purchase histories.

2. Hierarchical Clustering

Hierarchical algorithms optimize cluster linkage at various levels, providing a tree-like structure of cluster relationships. They are valuable for understanding data hierarchy.

Real-world example: Analyzing genetic data to identify evolutionary relationships among species.

3. Principal Component Analysis (PCA)

PCA optimizes orthogonal axes to capture maximum data variance. It is a foundational technique for dimensionality reduction and data compression.
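One standard way to compute these axes is the singular value decomposition of the centered data, sketched below. The function name and the toy data are assumptions; the sign of each component is arbitrary.

```python
import numpy as np

def pca(X, n_components):
    """Principal axes, explained variances, and projected data via SVD."""
    Xc = X - X.mean(axis=0)  # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]  # rows are orthonormal principal axes
    explained_variance = S[:n_components] ** 2 / (len(X) - 1)
    projected = Xc @ components.T   # coordinates in the reduced space
    return components, explained_variance, projected

# Points lying exactly on the line y = 2 * x (illustration only).
X = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
components, explained_variance, projected = pca(X, n_components=1)
```

Since the points lie on a line, a single component along that line captures all of the variance.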

Real-world example: Reducing the dimensionality of images for facial recognition systems.

4. t-Distributed Stochastic Neighbor Embedding (t-SNE)

This method uses gradient descent to minimize the Kullback-Leibler divergence between pairwise similarity distributions in the high-dimensional and low-dimensional spaces, making it valuable for visualization and for preserving local structure.

Real-world example: Visualizing high-dimensional data, such as word embeddings in natural language processing.

5. Autoencoders

These neural network-based models facilitate dimensionality reduction by optimizing encoder and decoder networks. They are versatile for feature learning and data generation tasks.

Real-world example: Anomaly detection in network security using autoencoders to detect unusual network activity.

6. Gaussian Mixture Models (GMM)

GMMs estimate the mean, covariance, and mixing weight of each Gaussian component, typically with the Expectation-Maximization (EM) algorithm, to model data distributions. They are used in data modeling and clustering tasks.

Real-world example: Modeling the distribution of colors in an image to separate foreground and background objects.

7. Latent Dirichlet Allocation (LDA)

LDA estimates topic-word and document-topic distributions for topic modeling, typically with variational inference or Gibbs sampling. It is essential for uncovering latent themes in text data.

Real-world example: Analyzing a collection of news articles to identify the prevalent topics in a given time period.

8. Variational Autoencoders (VAE)

These models combine optimization with probabilistic modeling for unsupervised learning and data generation. They offer a probabilistic approach to generative modeling.

Real-world example: Generating realistic human faces for use in video games or simulations.

9. Isomap and Locally Linear Embedding (LLE)

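These techniques find a lower-dimensional representation of the data that preserves its manifold structure: Isomap preserves geodesic distances measured along a neighborhood graph, while LLE preserves each point's local linear relationship to its neighbors. They are valuable for nonlinear dimensionality reduction.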

Real-world example: Visualizing the structure of a complex biological network for better understanding and analysis.

Reinforcement Machine Learning

Reinforcement learning trains agents to make a sequence of decisions or actions that maximize cumulative rewards. Optimization algorithms are pivotal in finding the optimal policy or value function. Here are some common optimization algorithms in reinforcement learning:

1. Q-Learning

An off-policy method that optimizes the Q-value function to estimate the expected cumulative reward for each action-state pair. It is a foundational algorithm for reinforcement learning.
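The core of tabular Q-learning is a single update rule, sketched below on a made-up three-state corridor. The environment, the function names, and the hyperparameters are hypothetical. Actions are chosen uniformly at random to show the off-policy property: the learned values describe the greedy policy, not the random one that collected the data.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, done, lr=0.5, gamma=0.9):
    """One Q-learning step: move Q[s, a] toward r + gamma * max over a' of Q[s_next, a']."""
    target = r if done else r + gamma * Q[s_next].max()
    Q[s, a] += lr * (target - Q[s, a])

def step(s, a):
    """Hypothetical corridor: states 0, 1, 2; action 0 = left, 1 = right; state 2 ends the episode."""
    s_next = max(s - 1, 0) if a == 0 else s + 1
    reward = 1.0 if s_next == 2 else 0.0
    return s_next, reward, s_next == 2

rng = np.random.default_rng(0)
Q = np.zeros((3, 2))  # one row per state, one column per action
for _ in range(500):  # episodes
    s, done = 0, False
    while not done:
        a = int(rng.integers(2))  # uniformly random behavior policy
        s_next, r, done = step(s, a)
        q_update(Q, s, a, r, s_next, done)
        s = s_next

greedy_policy = Q[:2].argmax(axis=1)  # best action in each non-terminal state
```

With a discount factor of 0.9, the values settle so that moving right is preferred in both non-terminal states.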

Real-world example: Training an autonomous drone to navigate a complex environment and avoid obstacles.

2. Policy Gradient Methods

These methods optimize the policy directly by adjusting its parameters to maximize expected rewards. This includes algorithms like REINFORCE, which are essential for policy optimization.

Real-world example: Teaching a computer program to play chess by maximizing its chances of winning.

3. Deep Q-Networks (DQN)

DQNs combine Q-learning with deep neural networks and stabilize training with techniques like experience replay and target networks. They have been influential in applying deep learning to reinforcement learning.

Real-world example: Developing an AI agent to play and excel at video games like Atari’s Breakout.

4. Proximal Policy Optimization (PPO)

PPO is a policy gradient method that optimizes the policy while ensuring stable and efficient learning. It is known for its robustness and performance in various tasks.

Real-world example: Training a robot to perform complex tasks in an unstructured environment, such as a warehouse.

5. Actor-Critic Methods

These methods utilize both policy (actor) and value function (critic) networks and optimize them using various techniques, including advantage-based methods. The critic's value estimates reduce the variance of the actor's policy-gradient updates.

Real-world example: Autonomous vehicles using actor-critic methods to navigate traffic and make safe driving decisions.

6. Monte Carlo Tree Search (MCTS)

An algorithm used in decision-making for games and planning tasks, MCTS optimizes the tree of possible actions. It is prominent in game AI and strategic decision-making.

Real-world example: Designing a computer program to play the board game Go at a superhuman level.

Optimization algorithms play a critical role in machine learning, shaping the way models learn from data, discover patterns, and make decisions. They are foundational to a wide range of applications, from predicting financial market trends and autonomous vehicle navigation to healthcare diagnostics and natural language understanding. Understanding and applying these algorithms is not only valuable for data scientists and machine learning practitioners but also for anyone seeking to leverage the power of AI in today’s data-driven world.
