What is Gradient Descent?
Gradient descent is an optimization algorithm used to reduce a model's error. It is one of the most important ideas in machine learning because many models learn by minimizing a loss function.
A loss function measures how wrong the model is. A high loss means bad predictions; a low loss means better predictions. Gradient descent changes the model parameters step by step so that the loss becomes smaller.
Imagine standing on a hill in fog. You cannot see the whole landscape, but you can feel the slope under your feet. To go downhill, you step in the direction where the ground decreases most quickly. Gradient descent does the same thing, except the "height" is model error and the "position" is the model's parameters.
Concept adapted from Tpoint Tech/Javatpoint gradient-descent material and standard optimization diagrams.
The Core Idea
For a model parameter w, gradient descent repeats this update:
new weight = old weight - learning rate * gradient
The gradient tells us the direction in which the loss increases fastest. Because we want the loss to decrease, we move in the opposite direction.
In compact notation:
w <- w - alpha * dL/dw
where:
wis a model parameter.Lis the loss function.dL/dwis the gradient of the loss with respect tow.alphais the learning rate.
Step-by-Step Training Loop
- Initialize parameters: Start with random or small values.
- Make predictions: Use the current model to predict outputs.
- Compute loss: Compare predictions with true values.
- Compute gradients: Measure how each parameter contributed to the loss.
- Update parameters: Move each parameter in the direction that lowers loss.
- Repeat: Continue until the loss stops improving or the maximum number of iterations is reached.
Learning Rate
The learning rate controls the step size. It is one of the most important hyperparameters in ML.
If the Learning Rate is Too Small
Training is stable, but very slow. The model may need thousands or millions of updates to reach a useful solution.
If the Learning Rate is Too Large
The model may jump over the minimum again and again. Loss can oscillate or even explode.
If the Learning Rate is Reasonable
The model moves quickly at first and then settles near a good minimum.
In real training, the best learning rate often changes over time. Many deep learning systems use learning-rate schedules, warmup, decay, or adaptive optimizers.
Batch, Stochastic, and Mini-Batch Gradient Descent
Batch Gradient Descent
Batch gradient descent uses the entire training dataset before making one update.
Advantages:
- Stable gradient estimate.
- Good for small datasets.
Disadvantages:
- Very slow for large datasets.
- Requires loading all data for each update.
Stochastic Gradient Descent
Stochastic gradient descent (SGD) updates the model using one training example at a time.
Advantages:
- Fast updates.
- Can escape shallow local minima due to noisy updates.
Disadvantages:
- Noisy learning path.
- Loss may fluctuate strongly.
Mini-Batch Gradient Descent
Mini-batch gradient descent uses a small batch, such as 16, 32, 64, or 128 examples, before each update. This is the common choice in modern deep learning.
Advantages:
- More stable than SGD.
- Faster than full batch training.
- Works efficiently on GPUs.
Why Gradients Matter in Deep Learning
Deep learning models may have millions or billions of parameters. Manually adjusting them is impossible. Gradients provide an automatic way to know how each parameter should change.
The backpropagation algorithm efficiently computes these gradients layer by layer. Rumelhart, Hinton, and Williams popularized backpropagation for neural networks in their 1986 Nature paper.
Common Problems
Local Minima
For simple convex problems, there may be one global minimum. For deep neural networks, the loss surface is complex. Gradient descent may end in a local minimum or flat region. In practice, saddle points and poor conditioning are often more troublesome than simple local minima.
Vanishing Gradients
In deep networks, gradients can become extremely small as they move backward through layers. Early layers then learn very slowly. Activation functions, normalization, residual connections, and better initialization help reduce this problem.
Exploding Gradients
Gradients can also become too large, causing unstable updates. Gradient clipping is a common solution, especially in recurrent neural networks.
Overfitting
Gradient descent can reduce training loss while the model becomes worse on new data. Regularization, dropout, early stopping, and validation monitoring help prevent this.
Popular Optimizers Built on Gradient Descent
- Momentum: Smooths updates by remembering the previous direction.
- RMSProp: Adapts the learning rate for each parameter using recent squared gradients.
- Adam: Combines momentum and adaptive learning rates. It is widely used in deep learning.
- AdamW: A variant of Adam with improved weight decay behavior.
Python: Gradient Descent Through scikit-learn
You normally do not write gradient descent manually for classical ML. You choose a model that uses it internally. SGDClassifier and SGDRegressor are useful when the dataset is large or arrives in batches.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
X, y = make_classification(
n_samples=2000,
n_features=20,
n_informative=8,
random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
model = make_pipeline(
StandardScaler(),
SGDClassifier(loss="log_loss", learning_rate="optimal", random_state=42)
)
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
Python: Manual Gradient Descent for Linear Regression
This tiny example shows the update rule directly. It is not meant to replace libraries, but it makes the mechanics visible.
import numpy as np
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 1))
y = 3 * X[:, 0] + 2 + rng.normal(scale=0.2, size=100)
w = 0.0
b = 0.0
alpha = 0.05
for epoch in range(500):
y_hat = w * X[:, 0] + b
error = y_hat - y
dw = (2 / len(X)) * np.sum(error * X[:, 0])
db = (2 / len(X)) * np.sum(error)
w -= alpha * dw
b -= alpha * db
print("learned weight:", round(w, 2))
print("learned bias:", round(b, 2))
Practical Checklist
- Normalize or standardize features before training.
- Start with a reasonable learning rate and monitor the loss curve.
- If loss explodes, reduce the learning rate.
- If loss decreases too slowly, try a larger learning rate or an adaptive optimizer.
- Use a validation set to detect overfitting.
- For neural networks, try Adam or AdamW first, then tune.
Takeaway
Gradient descent is the engine behind many learning algorithms. It repeatedly asks: "Which small parameter change will reduce the error most?" The learning rate controls how bold each step is, while the optimizer controls how those steps are shaped.
References and Further Reading
- D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors", Nature, vol. 323, pp. 533-536, 1986.
- I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, 2016.
- D. P. Kingma and J. Ba, "Adam: A Method for Stochastic Optimization", arXiv:1412.6980, 2014.
- Tpoint Tech/Javatpoint, "Gradient Descent in Machine Learning".