Gradient Descent is one of the most popular optimization algorithms in machine learning. Its main aim is to minimize the error between a model’s predicted outputs and the actual (expected) results. This article explains what Gradient Descent is in machine learning and covers its basics.
It is also known as the Steepest Descent algorithm. It is one of the most widely used optimization methods in machine learning, iteratively minimizing a cost function, and it helps determine a function’s local minimum.
The simplest way to find a function’s local minimum or local maximum using gradient descent is as follows:
- Moving in the direction of the negative gradient (away from the gradient) of the function at the current point leads to the function’s local minimum.
- Moving in the direction of the positive gradient (towards the gradient) of the function at the current point leads to the function’s local maximum.
The two steps used to achieve the goal of gradient descent are as follows:
Step 1: Calculate the function’s first-order derivative to determine the gradient (slope) at the current position.
Step 2: Move away from the direction of the gradient, stepping from the current position by alpha times the slope, where alpha is the learning rate. The learning rate is a tuning parameter used in the optimization process that controls the size of each step.
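As an illustration, the following minimal sketch applies these two steps to a hypothetical function f(x) = (x - 3)², whose derivative is 2(x - 3); the starting point and learning rate are arbitrary choices made for the example:

```python
def f_prime(x):
    # Step 1: first-order derivative (gradient/slope) of f(x) = (x - 3)**2
    return 2 * (x - 3)

x = 10.0      # arbitrary starting position
alpha = 0.1   # learning rate: controls the size of each step

for _ in range(100):
    x = x - alpha * f_prime(x)   # Step 2: step against the gradient's direction

print(x)  # approaches 3.0, the minimum of f
```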
Types of Gradient Descent in machine learning:
The Gradient Descent learning algorithm is classified into three types based on how much of the training data is used to compute each update: batch gradient descent, stochastic gradient descent, and mini-batch gradient descent.
Batch Gradient Descent:
Batch gradient descent (BGD) computes the error for every point in the training set and updates the model only after all training examples have been evaluated; one full pass over the training set is called a training epoch. Simply put, it is a greedy method in which we must sum over all cases for each update (a minimal sketch follows the list of benefits below).
Benefits of batch gradient descent:
- It generates less noise than other gradient descent algorithms.
- It achieves consistent gradient descent convergence.
- Because the gradient is computed over all training samples in a single pass, each update makes efficient use of computational resources.
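To make this concrete, here is a minimal batch gradient descent sketch for a simple one-variable linear regression with a mean squared error cost; the data, learning rate, and iteration count are made up for illustration:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])   # generated from y = 2x + 1

w, b = 0.0, 0.0
alpha = 0.05
n = len(X)

for epoch in range(2000):
    error = (w * X + b) - y
    # One update per epoch: the gradient is averaged over all training cases
    grad_w = (2.0 / n) * np.dot(error, X)
    grad_b = (2.0 / n) * error.sum()
    w -= alpha * grad_w
    b -= alpha * grad_b

print(w, b)   # converges to roughly w = 2, b = 1
```

Note that every pass over the data produces exactly one, relatively smooth, parameter update.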
Stochastic Gradient Descent:
Stochastic gradient descent (SGD) is a gradient descent algorithm that processes one training sample per iteration: within each training epoch, it updates the parameters after every individual example. Because it only needs one training sample at a time, it is easy to keep in memory. However, the frequent updates mean it loses the computational efficiency of batch gradient descent, since it cannot take advantage of processing the whole dataset at once. Moreover, because each update is based on a single example, the gradient is noisy. That noise can occasionally be useful, helping the optimizer escape local minima and find the global minimum (a minimal sketch follows the list of advantages below).
Stochastic gradient descent offers the following advantages over other types of gradient descent:
- It requires less memory, since only one training sample is held at a time.
- It is faster to compute than batch gradient descent.
- For huge datasets, it is more efficient.
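The sketch below adapts the same illustrative linear-regression example to stochastic gradient descent, updating the parameters after every individual sample:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])   # generated from y = 2x + 1

w, b = 0.0, 0.0
alpha = 0.01
rng = np.random.default_rng(0)

for epoch in range(500):
    for i in rng.permutation(len(X)):        # shuffle, then one update per sample
        error = (w * X[i] + b) - y[i]
        w -= alpha * 2 * error * X[i]        # noisy, single-sample gradient
        b -= alpha * 2 * error

print(w, b)   # approaches w = 2, b = 1, along a noisier path than batch updates
```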
Mini Batch Gradient Descent:
Mini-batch gradient descent combines batch gradient descent with stochastic gradient descent. It splits the training dataset into small batches and performs an update after each batch. Splitting the data into batches strikes a compromise between the computational efficiency of batch gradient descent and the speed of stochastic gradient descent. As a result, we obtain a form of gradient descent that is both more computationally efficient and less noisy (a minimal sketch follows the list of advantages below).
Mini Batch gradient descent has the following advantages:
- It is easy to fit into available memory.
- High computational efficiency.
- It achieves consistent gradient descent convergence.
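A minimal mini-batch sketch, again on a made-up linear-regression example, splits the data into small batches and updates once per batch:

```python
import numpy as np

X = np.arange(1.0, 9.0)    # 8 illustrative samples
y = 2 * X + 1              # targets from y = 2x + 1

w, b = 0.0, 0.0
alpha = 0.005
batch_size = 2
rng = np.random.default_rng(0)

for epoch in range(1000):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        error = (w * X[batch] + b) - y[batch]
        # One update per mini-batch: gradient averaged over batch_size samples
        w -= alpha * (2.0 / batch_size) * np.dot(error, X[batch])
        b -= alpha * (2.0 / batch_size) * error.sum()

print(w, b)   # approaches w = 2, b = 1
```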
Challenges of Gradient Descent in machine learning:
Local Minima and Saddle Point:
For convex problems, gradient descent can readily find the global minimum; for non-convex problems, however, it can be difficult to locate the global minimum, where machine learning models perform best.
The model stops learning once the slope of the cost function is zero or close to zero. Besides the global minimum, two other situations can exhibit this slope: local minima and saddle points. A local minimum has a shape similar to the global minimum: the slope of the cost function increases on both sides of the current point.
At a saddle point, in contrast, the negative gradient occurs on only one side of the point, so the point behaves like a local maximum along one direction and a local minimum along another. The saddle point takes its name from the shape of a horse’s saddle.
The term “local minimum” refers to a point where the loss function takes its smallest value within a surrounding local region, whereas the global minimum is the point where the loss function takes its smallest value over its entire domain.
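The saddle-point issue can be seen with a small hypothetical example: the function f(x, y) = x² - y² has a saddle at (0, 0), a minimum along x but a maximum along y. If gradient descent starts exactly on the y = 0 axis, it slides into the saddle and stalls there because the gradient vanishes, even though (0, 0) is not a minimum:

```python
def grad(x, y):
    # Partial derivatives of f(x, y) = x**2 - y**2
    return 2 * x, -2 * y

x, y = 1.0, 0.0    # starting point chosen exactly on the y = 0 axis
alpha = 0.1

for _ in range(200):
    gx, gy = grad(x, y)
    x, y = x - alpha * gx, y - alpha * gy

print(x, y)   # both go to 0: the slope is zero at the saddle, so learning stops
```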
Vanishing and Exploding Gradients:
Two other difficulties can arise when a deep neural network is trained with gradient descent and backpropagation: vanishing gradients and exploding gradients.
Vanishing Gradients:
A vanishing gradient occurs when the gradient becomes much smaller than expected. During backpropagation, the gradient shrinks as it passes backwards through the layers, so the network’s early layers learn much more slowly than the later layers. When this happens, the weight updates become so small that the early layers effectively stop learning.
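A rough back-of-the-envelope illustration, assuming a sigmoid activation whose derivative is at most 0.25: the gradient reaching an early layer is (roughly) a product of per-layer factors, so it shrinks exponentially with depth:

```python
factor_per_layer = 0.25   # maximum derivative of the sigmoid activation
gradient = 1.0

for layer in range(20):   # a hypothetical 20-layer network
    gradient *= factor_per_layer

print(gradient)           # about 9e-13: updates to the earliest layers become negligible
```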
Exploding Gradients:
An exploding gradient occurs when the gradient becomes too large, producing an unstable model. In this case the model weights grow excessively and may end up being represented as NaN. This challenge may be tackled using the dimensionality reduction approach, which helps to reduce the complexity of the model.
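Conversely, a rough illustration of the exploding case, assuming each layer multiplies the backpropagated gradient by a factor larger than one: the product grows exponentially with depth, and in a real model such values can overflow to inf or NaN:

```python
factor_per_layer = 5.0    # hypothetical per-layer factor greater than 1
gradient = 1.0

for layer in range(60):   # a hypothetical 60-layer network
    gradient *= factor_per_layer

print(gradient)           # about 8.7e41: huge updates destabilize the weights
```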
This article has discussed what Gradient Descent is in machine learning, its basics, its main types, and the challenges that arise when using it.