Why These Parameters Matter
Machine learning is not only about choosing an algorithm. The final result also depends on many settings, choices, and training conditions.
Some of these are learned by the model:
- Weights.
- Bias terms.
- Neural-network layer parameters.
- Q-values or policy parameters in reinforcement learning.
Some are chosen by the researcher or engineer before training:
- Learning rate.
- Batch size.
- Number of epochs.
- Random seed.
- Model depth.
- Regularization strength.
- Discount factor in reinforcement learning.
These user-chosen settings are called hyperparameters. They strongly affect speed, stability, accuracy, generalization, and reproducibility.
Parameter vs. Hyperparameter
| Term | Meaning | Example |
|---|---|---|
| Parameter | Learned from data during training | Weight w, coefficient β, neural-network bias |
| Hyperparameter | Chosen before or during training setup | Learning rate, batch size, number of layers |
| Metric | Used to evaluate performance | Accuracy, F1-score, MAE, reward |
| Loss | Function the model tries to minimize | Cross-entropy, MSE |
Simple rule:
If the model learns it, it is usually a parameter.
If you set it before training, it is usually a hyperparameter.
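To make the distinction concrete, here is a minimal sketch in pure Python that fits a line y = w·x + b by gradient descent. The function name `fit_line` and the toy data are invented for illustration. `learning_rate` and `n_epochs` are hyperparameters set by hand; `w` and `b` are parameters that the loop learns.

```python
def fit_line(xs, ys, learning_rate=0.05, n_epochs=2000):
    """Fit y = w*x + b with full-batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0  # parameters: learned from the data
    n = len(xs)
    for _ in range(n_epochs):
        # Gradients of the mean squared error with respect to w and b
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (illustrative only)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
print(f"learned w={w:.3f}, b={b:.3f}")
```

Changing `learning_rate` or `n_epochs` changes how the fit proceeds, but neither value is ever updated by the training loop itself.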
Random Seed
A seed controls randomness so experiments can be repeated.
Randomness appears in:
- Train-test splitting.
- Weight initialization.
- Data shuffling.
- Dropout.
- Mini-batch selection.
- Reinforcement-learning exploration.
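If two experiments use the same seed, the same code, and the same software and hardware setup, their results should match closely. Exact equality is not guaranteed, because some operations (for example, certain parallel GPU computations) are not deterministic.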

    import random
    import numpy as np

    random.seed(42)
    np.random.seed(42)

Effect of Seed
| Seed behavior | Effect |
|---|---|
| Fixed seed | Better reproducibility |
| Different seeds | Better estimate of result stability |
| Reporting only one lucky seed | Can exaggerate performance |
Good practice is to run important experiments with multiple seeds and report the mean and standard deviation of the results.
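A minimal sketch of multi-seed reporting, using only the standard library. `run_experiment` is a hypothetical stand-in for a real training run; the score it returns is synthetic noise around 0.80, not a real result.

```python
import random
import statistics


def run_experiment(seed):
    """Stand-in for a training run: returns a noisy, made-up 'accuracy'."""
    rng = random.Random(seed)          # local generator, isolated per run
    return 0.80 + rng.gauss(0, 0.02)   # hypothetical score, not real data


seeds = [0, 1, 2, 3, 4]
scores = [run_experiment(s) for s in seeds]
mean = statistics.mean(scores)
std = statistics.stdev(scores)  # sample standard deviation across seeds
print(f"accuracy = {mean:.3f} +/- {std:.3f} over {len(seeds)} seeds")
```

Using a separate `random.Random(seed)` per run, rather than the global generator, keeps each run repeatable regardless of what else has consumed random numbers.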
Dataset Split
A dataset is usually divided into:
| Split | Purpose |
|---|---|
| Training set | Used to fit the model |
| Validation set | Used to tune hyperparameters |
| Test set | Used once at the end for final evaluation |
The test set should not influence training decisions. If it does, the evaluation becomes overly optimistic.
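One way to produce the three splits is to shuffle once with a fixed seed and then cut the shuffled list. The sketch below assumes the examples are independent, so a random split is appropriate; the function name and the 70/15/15 fractions are illustrative choices, not a recommendation.

```python
import random


def train_val_test_split(items, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle once with a fixed seed, then cut into three disjoint parts."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = round(len(items) * test_frac)
    n_val = round(len(items) * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test


train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```

Because the three slices never overlap, no test example can leak into training or hyperparameter tuning.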
Feature Scaling
Feature scaling changes feature ranges so one feature does not dominate another.
Common methods:
| Method | Formula | Meaning |
|---|---|---|
| Standardization | z = (x − μ) / σ | Mean 0 and standard deviation 1 |
| Min-max scaling | x' = (x − x_min) / (x_max − x_min) | Range between 0 and 1 |
Scaling is important for:
- KNN.
- SVM/SVR.
- Ridge and Lasso.
- Kernel Ridge Regression.
- Neural networks.
- Gradient descent.
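Tree-based models (decision trees, random forests, gradient boosting) split on thresholds of individual features, so they are largely insensitive to feature scaling.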
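The two scaling formulas can be sketched in a few lines of standard-library Python. The function names and the sample values are illustrative, and the code assumes the feature is not constant (otherwise the denominator would be zero).

```python
import statistics


def standardize(values):
    """z = (x - mean) / std, using the population standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(x - mu) / sigma for x in values]


def min_max_scale(values):
    """x' = (x - x_min) / (x_max - x_min), mapped to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]


incomes = [20_000, 35_000, 50_000, 80_000]  # illustrative values
print(standardize(incomes))
print(min_max_scale(incomes))
```

In a real pipeline, the mean, standard deviation, minimum, and maximum should be computed on the training set only and then reused for the validation and test sets.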
Loss Function
The loss function measures how wrong the model is during training.
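For regression, a common loss is mean squared error:
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
For classification, a common loss is cross-entropy:
$$L_{\mathrm{CE}} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{C} y_{i,c} \log \hat{p}_{i,c}$$
Here y_{i,c} is 1 if example i belongs to class c and 0 otherwise, and p̂_{i,c} is the predicted probability of class c for example i.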
Effect of Loss Function
| Loss choice | Effect |
|---|---|
| MSE | Strongly punishes large regression errors |
| MAE | More robust to outliers |
| Cross-entropy | Standard for classification probabilities |
| Huber loss | Mixes MSE and MAE behavior |
| RL reward objective | Optimizes long-term return, not ordinary prediction error |
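The sketch below implements the prediction losses from the table in plain Python, so their behavior on an outlier can be compared directly. The function names, the `delta` default, and the sample numbers are assumptions made for illustration; the cross-entropy here is for a single example.

```python
import math


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def huber(y_true, y_pred, delta=1.0):
    """Quadratic for small errors, linear for errors larger than delta."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        err = abs(t - p)
        total += 0.5 * err ** 2 if err <= delta else delta * (err - 0.5 * delta)
    return total / len(y_true)


def cross_entropy(true_class, probs, eps=1e-12):
    """Negative log-probability assigned to the correct class (one example)."""
    return -math.log(max(probs[true_class], eps))


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 14.0]  # the last prediction is a large miss
print(mse(y_true, y_pred), mae(y_true, y_pred), huber(y_true, y_pred))
print(cross_entropy(0, [0.7, 0.2, 0.1]))
```

On this data the single large error dominates MSE far more than it dominates MAE or Huber loss, which is the robustness difference the table describes.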
Learning Rate
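The learning rate controls how large each update step is during optimization. For plain gradient descent, the update rule is:
$$\theta \leftarrow \theta - \eta \, \nabla L(\theta)$$
Where: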
| Symbol | Meaning |
|---|---|
| θ | model parameters |
| η | learning rate |
| ∇L(θ) | gradient of the loss |
Effect of Learning Rate
| Learning rate | Effect |
|---|---|
| Too small | Training is stable but slow |
| Good value | Training improves smoothly |
| Too large | Loss may jump, oscillate, or explode |
In deep learning, learning-rate schedules often reduce η during training.
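The three regimes in the table can be reproduced on a toy problem. This sketch minimizes L(θ) = θ², chosen only because its gradient 2θ is easy to check by hand; the specific learning rates are illustrative.

```python
def gradient_descent(lr, theta=1.0, steps=20):
    """Minimize L(theta) = theta**2, whose gradient is 2*theta."""
    for _ in range(steps):
        grad = 2 * theta
        theta = theta - lr * grad   # theta <- theta - eta * grad L(theta)
    return theta


for lr in (0.01, 0.4, 1.1):
    print(f"lr={lr}: theta after 20 steps = {gradient_descent(lr):.4g}")
```

With the smallest rate, θ moves only part of the way toward the minimum at 0; with the middle rate it gets very close; with the largest rate each step overshoots and |θ| grows instead of shrinking.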
Batch Size
Batch size is the number of training examples used before one parameter update.
| Batch size | Effect |
|---|---|
| Small batch | Noisier updates, can generalize well, slower per epoch |
| Medium batch | Common practical balance |
| Large batch | Stable gradients, faster hardware use, may generalize worse if not tuned |
Example:
Dataset size = 10,000
Batch size = 100
Iterations per epoch = 10,000 / 100 = 100
Epoch
One epoch means the model has seen the full training dataset once.
1 epoch = one complete pass through the training set
Effect of Epochs
| Number of epochs | Effect |
|---|---|
| Too few | Underfitting |
| Enough | Good learning |
| Too many | Possible overfitting |
Early stopping can stop training when validation performance stops improving.
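A common way to implement early stopping is a patience counter. The sketch below is one such variant, assuming a list of per-epoch validation losses is available; the function name, the `patience` default, and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: all epochs were run


val_losses = [0.90, 0.70, 0.60, 0.58, 0.59, 0.61, 0.60, 0.65]
stop = early_stopping_epoch(val_losses)
print(f"stop at epoch index {stop}, best loss was {min(val_losses[:stop + 1])}")
```

In practice the model weights from the best epoch are usually saved and restored, since the final epochs before stopping are by definition no better.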
Iteration
An iteration is one parameter update.
For mini-batch training:
iterations per epoch = number of training samples / batch size (rounded up when the last batch is smaller than the others)
Epoch and iteration are related, but they are not the same.
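The relationship between epochs, batches, and iterations can be sketched as a nested loop. The helper name `iterate_minibatches` is hypothetical, and no real training happens here; the loop only counts where parameter updates would occur.

```python
import math


def iterate_minibatches(n_samples, batch_size):
    """Yield (start, end) index ranges that cover the dataset exactly once."""
    for start in range(0, n_samples, batch_size):
        yield start, min(start + batch_size, n_samples)


n_samples, batch_size, n_epochs = 10_000, 100, 3
iterations = 0
for epoch in range(n_epochs):              # one epoch = one full pass
    for start, end in iterate_minibatches(n_samples, batch_size):
        iterations += 1                    # one parameter update would happen here

print(iterations)  # 3 epochs * 100 iterations per epoch = 300

# When the dataset size is not a multiple of the batch size, round up
print(math.ceil(10_050 / 100))
```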
Weight
A weight controls how strongly an input feature affects the output.
For a linear model:
$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p$$
The coefficients β₁, β₂, ..., βₚ are weights.
Large weights make predictions sensitive to small changes in the inputs. Regularization penalizes large weights to keep them under control.
Bias
The word bias has two common meanings in machine learning.
Bias as an Intercept
In a model equation, bias is the constant term β₀ (often written b in neural networks):
$$\hat{y} = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$$
It shifts the prediction up or down.
Bias as Model Assumption
Bias can also mean the error caused by overly simple assumptions.
High bias often means:
- The model is too simple.
- Training error is high.
- Validation error is high.
- The model underfits.
Variance
Variance measures how much the model's predictions change when it is trained on a different sample of training data.
High variance often means:
- The model is too flexible.
- Training error is low.
- Validation/test error is high.
- The model overfits.
Underfitting
Underfitting happens when the model is too simple to learn the real pattern.
Signs:
- High training error.
- High validation error.
- Model predictions are too crude.
Common fixes:
- Use a more expressive model.
- Train longer.
- Add useful features.
- Reduce excessive regularization.
Overfitting
Overfitting happens when the model learns the training data too closely, including noise.
Signs:
- Very low training error.
- Much higher validation/test error.
- Model performs poorly on new data.
Common fixes:
- Use more data.
- Use regularization.
- Use dropout.
- Use early stopping.
- Use data augmentation.
- Reduce model complexity.
Bias-Variance Tradeoff
The model should be complex enough to learn real structure, but not so complex that it memorizes noise.
| Training error | Validation error | Meaning |
|---|---|---|
| High | High | Underfitting |
| Low | High | Overfitting |
| Low | Low | Good fit |
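The table can be turned into a rough rule of thumb. In the sketch below, the function name and both thresholds (`high` and `gap`) are arbitrary assumptions; sensible values depend entirely on the task and the error metric.

```python
def diagnose_fit(train_error, val_error, high=0.20, gap=0.10):
    """Rough label from training and validation error (thresholds are arbitrary)."""
    if train_error >= high:
        return "underfitting"   # the model cannot even fit the training data
    if val_error - train_error >= gap:
        return "overfitting"    # fits training data, fails to generalize
    return "good fit"


print(diagnose_fit(0.30, 0.32))
print(diagnose_fit(0.02, 0.25))
print(diagnose_fit(0.05, 0.07))
```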
Regularization
Regularization discourages unnecessary complexity.
Common types:
| Method | Effect |
|---|---|
| L1 regularization | Can push some weights to zero |
| L2 regularization | Shrinks weights smoothly |
| Dropout | Randomly disables neurons during training |
| Early stopping | Stops before validation performance gets worse |
| Data augmentation | Creates more varied training examples |
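Ridge Regression uses L2 regularization:
$$\min_{\beta} \; \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$
where λ ≥ 0 is the regularization strength.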
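The shrinking effect of L2 regularization is easiest to see with a single feature and no intercept, where the ridge solution has the closed form β = Σxy / (Σx² + λ). The sketch below assumes that simplified setting; `ridge_slope`, `lam`, and the data are illustrative.

```python
def ridge_slope(xs, ys, lam):
    """Closed-form ridge solution for y ≈ beta * x (one feature, no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]   # exactly y = 2x
for lam in (0.0, 1.0, 10.0, 100.0):
    print(lam, round(ridge_slope(xs, ys, lam), 3))
```

With `lam = 0` the fit recovers the true slope of 2; as `lam` grows, the estimated slope is pulled toward zero.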
Dropout
Dropout is mainly used in neural networks. During training, some neurons are randomly ignored.
Effect:
- Reduces dependence on individual neurons.
- Helps prevent overfitting.
- Can slow training slightly.
Dropout is usually turned off during validation, testing, and inference, so that the full network is used to make predictions.
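A minimal sketch of the "inverted dropout" variant, which is the form most deep-learning libraries use: surviving activations are scaled up during training so that nothing needs to change at evaluation time. The function signature here is an assumption for illustration, not any library's API.

```python
import random


def dropout(activations, p=0.5, training=True, rng=random):
    """Inverted dropout: zero each value with probability p while training,
    and scale the survivors by 1/(1-p) so the expected value is unchanged."""
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
hidden = [1.0, 2.0, 3.0, 4.0]
print(dropout(hidden, p=0.5, training=True, rng=rng))   # some entries may be zeroed
print(dropout(hidden, p=0.5, training=False))           # unchanged at evaluation
```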
Optimizer
An optimizer decides how parameters are updated.
| Optimizer | Common use |
|---|---|
| SGD | Simple, strong baseline |
| Momentum | Smooths noisy updates |
| RMSProp | Useful for non-stationary gradients |
| Adam | Common default in deep learning |
| AdamW | Adam with better weight decay behavior |
Optimizer choice affects training speed and stability.
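To illustrate how the update rule itself changes training, the sketch below compares plain SGD with classical (heavy-ball) momentum on the toy loss L(θ) = θ². The function names, the learning rate, and `beta = 0.9` are illustrative assumptions; adaptive optimizers such as Adam are not shown.

```python
def sgd_step(theta, grad, lr):
    """Plain gradient step."""
    return theta - lr * grad


def momentum_step(theta, velocity, grad, lr, beta=0.9):
    """Classical momentum: velocity is a decaying sum of past gradients."""
    velocity = beta * velocity + grad
    return theta - lr * velocity, velocity


lr, steps = 0.01, 50
theta_sgd = 1.0
theta_mom, velocity = 1.0, 0.0
for _ in range(steps):
    theta_sgd = sgd_step(theta_sgd, 2 * theta_sgd, lr)   # gradient of theta**2
    theta_mom, velocity = momentum_step(theta_mom, velocity, 2 * theta_mom, lr)

print(f"SGD:      theta = {theta_sgd:.4f}")
print(f"Momentum: theta = {theta_mom:.4f}")
```

With the same small learning rate, the momentum run ends closer to the minimum at 0 because the accumulated velocity makes its effective step larger.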
Activation Function
Activation functions introduce non-linearity in neural networks.
Without activation functions, a stack of linear layers collapses into a single linear transformation, so a deep network would be no more expressive than one linear model.
| Activation | Effect |
|---|---|
| Sigmoid | Outputs 0 to 1, can saturate |
| Tanh | Outputs -1 to 1 |
| ReLU | Fast and common, can create dead neurons |
| GELU | Common in transformer models |
| Softmax | Converts class scores into probabilities |
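The sketch below writes out three of the activations from the table and checks, for the scalar case, that two linear layers with no activation between them equal a single linear layer. All weights and inputs are made-up numbers.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    return max(0.0, x)


def softmax(scores):
    """Subtract the max score first so the exponentials cannot overflow."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Two linear layers with no activation collapse into one linear layer:
# w2 * (w1 * x + b1) + b2 == (w2 * w1) * x + (w2 * b1 + b2)
w1, b1, w2, b2 = 2.0, 1.0, -3.0, 0.5
x = 1.5
stacked = w2 * (w1 * x + b1) + b2
single = (w2 * w1) * x + (w2 * b1 + b2)
print(stacked, single)

# Inserting ReLU between the layers breaks this equivalence for some inputs
x_neg = -2.0
with_relu = w2 * relu(w1 * x_neg + b1) + b2
without = w2 * (w1 * x_neg + b1) + b2
print(with_relu, without)
```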
DNN-Specific Parameters
| Parameter | Effect |
|---|---|
| Number of layers | More depth can learn more complex patterns but can overfit |
| Hidden units | More units increase capacity |
| Batch normalization | Stabilizes layer inputs |
| Layer normalization | Common in transformers |
| Residual connection | Helps train deep networks |
| Weight initialization | Affects early training stability |
| Gradient clipping | Prevents exploding gradients |
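Of the entries in the table, gradient clipping is the simplest to show in isolation. The sketch below is one common variant, clipping by the global L2 norm; the function name `clip_by_norm` and the threshold are illustrative and do not correspond to a particular library call.

```python
import math


def clip_by_norm(grads, max_norm):
    """Rescale the gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)          # small gradients pass through unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]


print(clip_by_norm([3.0, 4.0], max_norm=1.0))   # norm 5 is rescaled to norm 1
print(clip_by_norm([0.3, 0.4], max_norm=1.0))   # norm 0.5 is left as it is
```

Clipping by norm keeps the direction of the gradient and only limits its length, which is why it prevents exploding updates without changing where the step points.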
ML vs. DL vs. DNN vs. DRL
| Term | Meaning |
|---|---|
| ML | General field where models learn from data |
| DL | Deep learning using neural networks with many layers |
| DNN | Deep neural network architecture |
| RL | Reinforcement learning through actions and rewards |
| DRL | Deep reinforcement learning, where deep networks are used inside RL |
In short:
DL is a part of ML.
DNN is a type of DL model.
DRL combines deep learning with reinforcement learning.
DRL-Specific Parameters
Deep Reinforcement Learning has extra parameters because the agent learns by interacting with an environment.
| Term | Meaning |
|---|---|
| State s | What the agent observes |
| Action a | What the agent chooses |
| Reward r | Feedback from the environment |
| Policy `π(a \| s)` | How the agent chooses action a given state s |
| Episode | One full interaction run |
| Return | Total future reward |
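The common discounted return is:
$$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$
where γ, with 0 ≤ γ ≤ 1, is the discount factor.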
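The discounted return of a finite episode can be computed with a single backward pass over the rewards. In this sketch the function name and the toy episode (a constant reward of 1 at every step) are assumptions chosen to make the effect of γ easy to see.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the k-th future reward is weighted by gamma**k."""
    g = 0.0
    for r in reversed(rewards):   # work back to front: G = r + gamma * G_next
        g = r + gamma * g
    return g


rewards = [1.0] * 1000   # constant reward of 1 for 1000 steps (toy episode)
for gamma in (0.5, 0.9, 0.99):
    print(gamma, round(discounted_return(rewards, gamma), 2))
```

For a constant reward, the return approaches 1 / (1 − γ), so a larger γ makes the agent weigh rewards further into the future.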
Important DRL Hyperparameters
| Hyperparameter | Effect |
|---|---|
| Discount factor γ | Controls how much future rewards matter |
| Exploration rate ε | Controls random exploration in epsilon-greedy methods |
| Replay buffer size | Controls how many past experiences are stored |
| Target-network update rate | Stabilizes Q-learning-style methods |
| Reward scaling | Changes the magnitude of learning signals |
| Episode length | Controls how far the agent can interact before reset |
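Two of these hyperparameters, the exploration rate ε and the replay buffer size, map directly onto small pieces of code. The sketch below uses only the standard library; the function name, the tiny buffer size, and the fake transitions are illustrative assumptions, not a complete agent.

```python
import random
from collections import deque


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Replay buffer: a fixed-size queue that silently drops the oldest transitions
replay_buffer = deque(maxlen=3)   # tiny size, for illustration only
for step in range(5):
    # (state, action, reward, next_state) with placeholder values
    transition = (f"s{step}", 0, 1.0, f"s{step + 1}")
    replay_buffer.append(transition)

batch = random.Random(0).sample(list(replay_buffer), k=2)  # uniform sampling

print(epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0))  # always greedy when epsilon is 0
print(list(replay_buffer))
print(batch)
```

After five appends only the three most recent transitions remain, which shows how the buffer size bounds how far back the agent can learn from.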
Effects in DRL
| Setting | If too small | If too large |
|---|---|---|
| γ | Agent becomes short-sighted | Agent may overvalue distant uncertain rewards |
| ε | Agent may stop exploring too early | Agent may behave too randomly |
| Replay buffer | Learns from too little experience | Uses more memory and may learn from stale data |
| Learning rate | Slow learning | Unstable Q-values or policy updates |
| Reward scale | Weak gradients | Exploding or unstable updates |
Practical Tuning Order
For many ML/DL projects, tune in this order:
1. Fix the data split and random seed.
2. Choose a simple baseline.
3. Check underfitting or overfitting.
4. Tune learning rate.
5. Tune batch size and epochs.
6. Add regularization if overfitting.
7. Increase model capacity only when the model underfits.
8. Test with multiple seeds.
Quick Diagnosis Table
| Observation | Likely issue | What to try |
|---|---|---|
| Training and validation errors are both high | Underfitting | Bigger model, better features, lower regularization |
| Training error is low but validation error is high | Overfitting | More data, regularization, dropout, early stopping |
| Loss is unstable or explodes | Learning rate too high | Lower learning rate, gradient clipping |
| Training is very slow | Learning rate too low or model too large | Increase learning rate carefully, simplify model |
| Results change a lot across runs | Seed sensitivity | Run multiple seeds and report mean/std |
| DRL reward is noisy | Exploration or environment variance | Tune ε, replay buffer, reward scaling |
Takeaway
Parameters and hyperparameters control how a model learns, how stable it is, and how well it generalizes. For beginners, the most important ideas are seed, train-validation-test split, learning rate, batch size, epoch, bias, variance, overfitting, underfitting, and regularization. In DRL, also pay close attention to exploration, discount factor, replay buffer, and reward design.
References and Further Reading
- I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, 2016.
- R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed., MIT Press, 2018.
- C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
- Scikit-learn documentation, "Model evaluation".