Tarek Hassan

Basic Parameters in ML, DL, DNN, and DRL

Why These Parameters Matter

Machine learning is not only about choosing an algorithm. The final result also depends on many settings, choices, and training conditions.

Some of these are learned by the model:

  • Weights.
  • Bias terms.
  • Neural-network layer parameters.
  • Q-values or policy parameters in reinforcement learning.

Some are chosen by the researcher or engineer before training:

  • Learning rate.
  • Batch size.
  • Number of epochs.
  • Random seed.
  • Model depth.
  • Regularization strength.
  • Discount factor in reinforcement learning.

These user-chosen settings are called hyperparameters. They strongly affect speed, stability, accuracy, generalization, and reproducibility.

Parameter vs. Hyperparameter

| Term | Meaning | Example |
| --- | --- | --- |
| Parameter | Learned from data during training | Weight w, coefficient β, neural-network bias |
| Hyperparameter | Chosen before or during training setup | Learning rate, batch size, number of layers |
| Metric | Used to evaluate performance | Accuracy, F1-score, MAE, reward |
| Loss | Function the model tries to minimize | Cross-entropy, MSE |

Simple rule:

If the model learns it, it is usually a parameter.
If you set it before training, it is usually a hyperparameter.

Random Seed

A seed controls randomness so experiments can be repeated.

Randomness appears in:

  • Train-test splitting.
  • Weight initialization.
  • Data shuffling.
  • Dropout.
  • Mini-batch selection.
  • Reinforcement-learning exploration.

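If two experiments use the same seed, the same code, and the same software and hardware, their results should match closely. Exact equality is not guaranteed, because some operations (for example, certain parallel GPU kernels) are nondeterministic.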

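```python
import random
import numpy as np

# Seed Python's built-in generator and NumPy's legacy global generator.
# Deep-learning frameworks keep their own generators and need separate seeding.
random.seed(42)
np.random.seed(42)
```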

Effect of Seed

| Seed behavior | Effect |
| --- | --- |
| Fixed seed | Better reproducibility |
| Different seeds | Better estimate of result stability |
| Reporting only one lucky seed | Can exaggerate performance |

Good practice is to run important experiments with multiple seeds and report the mean and standard deviation of the results.
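
The multi-seed practice can be sketched as follows. The function `run_experiment` is a hypothetical stand-in for a real training run; its score formula is invented purely for illustration.

```python
import numpy as np

def run_experiment(seed: int) -> float:
    """Hypothetical stand-in for a training run: returns a noisy 'accuracy'."""
    rng = np.random.default_rng(seed)  # local generator, no global state
    return 0.85 + 0.02 * rng.standard_normal()

seeds = [0, 1, 2, 3, 4]
scores = np.array([run_experiment(s) for s in seeds])

# Report the spread across seeds, not a single (possibly lucky) run.
print(f"mean = {scores.mean():.3f}, std = {scores.std(ddof=1):.3f}")

# The same seed reproduces the same result.
assert run_experiment(0) == run_experiment(0)
```

Using a local generator per run, rather than the global seed, keeps each experiment independent of whatever random calls happened before it.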

Dataset Split

A dataset is usually divided into:

| Split | Purpose |
| --- | --- |
| Training set | Used to fit the model |
| Validation set | Used to tune hyperparameters |
| Test set | Used once at the end for final evaluation |

The test set should not influence training decisions. If it does, the evaluation becomes overly optimistic.
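
A minimal sketch of a three-way split using only NumPy. The helper name `split_indices` and the 70/15/15 proportions are assumptions chosen for illustration, not a recommendation from the text.

```python
import numpy as np

def split_indices(n_samples, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle indices once, then cut them into train/validation/test parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_frac))
    n_val = int(round(n_samples * val_frac))
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 700 150 150
```

Fixing the seed here keeps the split identical across runs, so changes in the score come from the model and not from a different partition.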

Feature Scaling

Feature scaling changes feature ranges so one feature does not dominate another.

Common methods:

| Method | Formula | Meaning |
| --- | --- | --- |
| Standardization | z = (x − μ) / σ | Mean 0 and standard deviation 1 |
| Min-max scaling | x' = (x − xₘᵢₙ) / (xₘₐₓ − xₘᵢₙ) | Range between 0 and 1 |

Scaling is important for:

  • KNN.
  • SVM/SVR.
  • Ridge and Lasso.
  • Kernel Ridge Regression.
  • Neural networks.
  • Any model trained with gradient descent.

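Tree-based models (decision trees, random forests, gradient boosting) usually do not need scaling, because their splits depend on the ordering of feature values, not on their magnitude.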
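
Both scaling formulas can be written in a few lines of NumPy. The toy arrays are made up, and computing the statistics on the training set only (then reusing them for the test set) is an assumption of this sketch that follows the usual practice of keeping test data out of training decisions.

```python
import numpy as np

X_train = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
X_test = np.array([[2.5, 400.0]])

# Standardization: z = (x - mu) / sigma, statistics from the training set only.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)
Z_train = (X_train - mu) / sigma
Z_test = (X_test - mu) / sigma

# Min-max scaling: x' = (x - x_min) / (x_max - x_min).
x_min = X_train.min(axis=0)
x_max = X_train.max(axis=0)
M_train = (X_train - x_min) / (x_max - x_min)
M_test = (X_test - x_min) / (x_max - x_min)

print(Z_train.mean(axis=0), Z_train.std(axis=0))
print(M_test)  # [[0.75 0.75]]
```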

Loss Function

The loss function measures how wrong the model is during training.

For regression, a common loss is mean squared error:

Mean Squared Error
MSE = (1/n)Σᵢ(yᵢ − ŷᵢ)²

For classification, a common loss is cross-entropy:

Binary Cross-Entropy
L = −[y log(p) + (1 − y)log(1 − p)]

Effect of Loss Function

| Loss choice | Effect |
| --- | --- |
| MSE | Strongly punishes large regression errors |
| MAE | More robust to outliers |
| Cross-entropy | Standard for classification probabilities |
| Huber loss | Mixes MSE and MAE behavior |
| RL reward objective | Optimizes long-term return, not ordinary prediction error |
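
The losses in the table can be compared on a small made-up example with one large error. The Huber threshold `delta=1.0` and the clipping constant `eps` are assumptions of this sketch.

```python
import numpy as np

def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))

def mae(y, y_hat):
    return float(np.mean(np.abs(y - y_hat)))

def huber(y, y_hat, delta=1.0):
    """Quadratic for small errors, linear for errors larger than delta."""
    err = np.abs(y - y_hat)
    quadratic = 0.5 * err ** 2
    linear = delta * (err - 0.5 * delta)
    return float(np.mean(np.where(err <= delta, quadratic, linear)))

def binary_cross_entropy(y, p, eps=1e-12):
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    return float(np.mean(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))))

y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.0, 2.0, 3.0, 14.0])  # one large miss

print(mse(y, y_hat))    # 25.0
print(mae(y, y_hat))    # 2.5
print(huber(y, y_hat))  # 2.375
print(binary_cross_entropy(np.array([1.0]), np.array([0.9])))
```

The single error of 10 contributes 100 to the squared sum but only 10 to the absolute sum, which is why MSE reacts much more strongly to it than MAE or Huber.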

Learning Rate

The learning rate controls how large each update step is during optimization.

Gradient Descent Update
θₜ₊₁ = θₜ − η∇L(θₜ)

Where:

| Symbol | Meaning |
| --- | --- |
| θ | model parameters |
| η | learning rate |
| ∇L(θ) | gradient of the loss |

Effect of Learning Rate

| Learning rate | Effect |
| --- | --- |
| Too small | Training is stable but slow |
| Good value | Training improves smoothly |
| Too large | Loss may jump, oscillate, or explode |

In deep learning, learning-rate schedules often reduce η during training.
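
The three regimes can be seen on a toy loss L(θ) = θ², whose gradient is 2θ. The loss and the three values of η are assumptions picked for illustration. Each update multiplies θ by (1 − 2η), so the behavior is easy to check by hand.

```python
def gradient_descent(eta, theta0=1.0, steps=20):
    """Run plain gradient descent on L(theta) = theta**2."""
    theta = theta0
    for _ in range(steps):
        grad = 2.0 * theta          # dL/dtheta
        theta = theta - eta * grad  # theta_{t+1} = theta_t - eta * grad
    return theta

for eta in (0.01, 0.3, 1.1):
    print(f"eta = {eta}: theta after 20 steps = {gradient_descent(eta):.3g}")
```

With η = 0.01 the parameter shrinks slowly, with η = 0.3 it reaches the minimum at 0 almost exactly, and with η = 1.1 the factor (1 − 2η) = −1.2 makes θ flip sign and grow at every step.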

Batch Size

Batch size is the number of training examples used to compute one parameter update.

| Batch size | Effect |
| --- | --- |
| Small batch | Noisier updates, can generalize well, slower per epoch |
| Medium batch | Common practical balance |
| Large batch | Stable gradients, faster hardware use, may generalize worse if not tuned |

Example:

Dataset size = 10,000
Batch size = 100
Iterations per epoch = 10,000 / 100 = 100

Epoch

One epoch means the model has seen the full training dataset once.

1 epoch = one complete pass through the training set

Effect of Epochs

| Number of epochs | Effect |
| --- | --- |
| Too few | Underfitting |
| Enough | Good learning |
| Too many | Possible overfitting |

Early stopping can stop training when validation performance stops improving.
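
A minimal sketch of early stopping with a patience counter. The `patience` value and the list of validation losses are made up for illustration. In a real run the losses would come from evaluating the model after each epoch.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 0-based.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.60]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)  # 6 3
```

In practice the weights saved at `best_epoch` are the ones kept, since later epochs did not improve validation performance.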

Iteration

An iteration is one parameter update.

For mini-batch training:

iterations per epoch = number of training samples / batch size (rounded up when the last batch is smaller than the others)

Epoch and iteration are related, but they are not the same.
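
The relation between epochs, batches, and iterations can be checked by counting updates in a skeleton training loop. The generator `iterate_minibatches` is a hypothetical helper, and no model is trained here: the comment marks where an update would happen.

```python
import math
import numpy as np

def iterate_minibatches(n_samples, batch_size, rng):
    """Yield index arrays covering the shuffled dataset once (one epoch)."""
    order = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(0)
n_samples, batch_size, n_epochs = 10_000, 100, 3

iterations = 0
for epoch in range(n_epochs):
    for batch in iterate_minibatches(n_samples, batch_size, rng):
        iterations += 1  # one parameter update would happen here

print(iterations)               # 300 = 3 epochs x 100 iterations per epoch
print(math.ceil(10_050 / 100))  # 101: the last batch holds only 50 samples
```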

Weight

A weight controls how strongly an input feature affects the output.

For a linear model:

ŷ = β₀ + β₁x₁ + β₂x₂ + ⋯ + βₚxₚ

The coefficients β₁, β₂, ..., βₚ are weights.

Large weights make the output sensitive to small changes in the input. Regularization penalizes large weights to keep them under control.

Bias

The word bias has two common meanings in machine learning.

Bias as an Intercept

In a model equation, bias is the constant term:

ŷ = β₀ + β₁x
Here, β₀ is the bias or intercept.

It shifts the prediction up or down.

Bias as Model Assumption

Bias can also mean the error caused by overly simple assumptions.

High bias often means:

  • The model is too simple.
  • Training error is high.
  • Validation error is high.
  • The model underfits.

Variance

Variance measures how much the model's predictions change when it is trained on a different sample of the training data.

High variance often means:

  • The model is too flexible.
  • Training error is low.
  • Validation/test error is high.
  • The model overfits.

Underfitting

Underfitting happens when the model is too simple to learn the real pattern.

Signs:

  • High training error.
  • High validation error.
  • Model predictions are too crude.

Common fixes:

  • Use a more expressive model.
  • Train longer.
  • Add useful features.
  • Reduce excessive regularization.

Overfitting

Overfitting happens when the model learns the training data too closely, including noise.

Signs:

  • Very low training error.
  • Much higher validation/test error.
  • Model performs poorly on new data.

Common fixes:

  • Use more data.
  • Use regularization.
  • Use dropout.
  • Use early stopping.
  • Use data augmentation.
  • Reduce model complexity.

Bias-Variance Tradeoff

The model should be complex enough to learn real structure, but not so complex that it memorizes noise.

| Training error | Validation error | Meaning |
| --- | --- | --- |
| High | High | Underfitting |
| Low | High | Overfitting |
| Low | Low | Good fit |
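
The tradeoff can be illustrated by fitting polynomials of increasing degree to noisy samples of a sine curve. The data, the noise level, and the degrees 1, 3, and 9 are all assumptions of this sketch. The exact numbers depend on the noise draw, so no outputs are quoted.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_sine(x):
    """Made-up ground truth plus Gaussian noise."""
    return np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.shape)

x_train = np.linspace(0.0, 1.0, 20)
x_val = np.linspace(0.01, 0.99, 200)
y_train, y_val = noisy_sine(x_train), noisy_sine(x_val)

def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))

results = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares fit
    train_err = mse(y_train, np.polyval(coeffs, x_train))
    val_err = mse(y_val, np.polyval(coeffs, x_val))
    results[degree] = (train_err, val_err)
    print(f"degree {degree}: train MSE = {train_err:.3f}, val MSE = {val_err:.3f}")
```

The straight line (degree 1) should show high error on both sets, which is the underfitting row of the table. Training error can only fall as the degree grows, so the comparison that matters is the validation column.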

Regularization

Regularization discourages unnecessary complexity.

Common types:

| Method | Effect |
| --- | --- |
| L1 regularization | Can push some weights to zero |
| L2 regularization | Shrinks weights smoothly |
| Dropout | Randomly disables neurons during training |
| Early stopping | Stops before validation performance gets worse |
| Data augmentation | Creates more varied training examples |

Ridge Regression uses L2 regularization, with λ ≥ 0 controlling the regularization strength:

minimize   Σᵢ(yᵢ − ŷᵢ)² + λΣⱼβⱼ²
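
A minimal sketch of the ridge objective solved in closed form, β = (XᵀX + λI)⁻¹Xᵀy. It assumes there is no intercept term (the synthetic data is generated without one), and the data and λ values are made up.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution without an intercept term."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta + 0.1 * rng.standard_normal(50)

norms = []
for lam in (0.0, 10.0, 1000.0):
    beta = ridge_fit(X, y, lam)
    norms.append(float(np.linalg.norm(beta)))
    print(f"lambda = {lam:7.1f}  beta = {np.round(beta, 3)}  norm = {norms[-1]:.3f}")
```

As λ grows, the weight vector shrinks toward zero. With λ = 0 the result is the ordinary least-squares solution.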

Dropout

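Dropout is mainly used in neural networks. During training, each neuron's output is set to zero with a chosen probability (the dropout rate), so a different random subset of neurons is ignored at every update.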

Effect:

  • Reduces dependence on individual neurons.
  • Helps prevent overfitting.
  • Can slow training slightly.

Dropout is normally turned off whenever the model is validated, tested, or used for prediction.
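
A minimal NumPy sketch of dropout. It uses the common "inverted" variant, which rescales the surviving activations by 1 / (1 − p) during training so that nothing needs to change at evaluation time. That rescaling is an assumption of this sketch and is not described in the text above.

```python
import numpy as np

def dropout(activations, p_drop, rng, training=True):
    """Zero each activation with probability p_drop (training only)."""
    if not training or p_drop == 0.0:
        return activations  # evaluation: pass through unchanged
    keep_prob = 1.0 - p_drop
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob  # rescale to keep the expected value

rng = np.random.default_rng(0)
a = np.ones((1000, 100))

train_out = dropout(a, p_drop=0.5, rng=rng, training=True)
eval_out = dropout(a, p_drop=0.5, rng=rng, training=False)

print(np.unique(train_out))  # [0. 2.]: dropped or kept-and-rescaled
print(train_out.mean())      # close to 1.0
print(np.array_equal(eval_out, a))  # True
```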

Optimizer

An optimizer decides how parameters are updated.

| Optimizer | Common use |
| --- | --- |
| SGD | Simple, strong baseline |
| SGD with momentum | Smooths noisy updates |
| RMSProp | Useful for non-stationary gradients |
| Adam | Common default in deep learning |
| AdamW | Adam with decoupled weight decay |

Optimizer choice affects training speed and stability.
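
The update rules for three of these optimizers can be sketched in NumPy on a toy loss L(θ) = Σθ². The loss, the learning rates, and the step count are assumptions for illustration. The Adam defaults (β₁ = 0.9, β₂ = 0.999, ε = 1e-8) are the commonly used ones.

```python
import numpy as np

def loss_grad(theta):
    """Gradient of the toy loss L(theta) = sum(theta**2)."""
    return 2.0 * theta

def sgd_step(theta, grad, lr):
    return theta - lr * grad

def momentum_step(theta, grad, velocity, lr, beta=0.9):
    velocity = beta * velocity + grad        # decaying sum of past gradients
    return theta - lr * velocity, velocity

def adam_step(theta, grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad       # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2  # second-moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias correction, t starts at 1
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

start = np.array([1.0, -2.0])
theta_sgd = start.copy()
theta_mom, vel = start.copy(), np.zeros(2)
theta_adam, m, v = start.copy(), np.zeros(2), np.zeros(2)

for t in range(1, 201):
    theta_sgd = sgd_step(theta_sgd, loss_grad(theta_sgd), lr=0.1)
    theta_mom, vel = momentum_step(theta_mom, loss_grad(theta_mom), vel, lr=0.05)
    theta_adam, m, v = adam_step(theta_adam, loss_grad(theta_adam), m, v, t, lr=0.05)

for name, theta in [("SGD", theta_sgd), ("Momentum", theta_mom), ("Adam", theta_adam)]:
    print(f"{name:8s} final loss = {np.sum(theta ** 2):.2e}")
```

On its first step Adam moves every coordinate by about the learning rate, whatever the size of the gradient, because the gradient is divided by its own magnitude.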

Activation Function

Activation functions introduce non-linearity in neural networks.

Without activation functions, stacking layers only composes linear maps, so a deep network would collapse into a single linear model.

Single Neuron
a = g(β₀ + β₁x₁ + β₂x₂ + ⋯ + βₚxₚ)

| Activation | Effect |
| --- | --- |
| Sigmoid | Outputs 0 to 1, can saturate |
| Tanh | Outputs -1 to 1 |
| ReLU | Fast and common, can create dead neurons |
| GELU | Common in transformer models |
| Softmax | Converts class scores into probabilities |
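
A few of these activations can be written directly in NumPy. The last part checks the claim that two linear layers without an activation equal one linear layer. The weight matrices are random values chosen only for the demonstration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    shifted = z - np.max(z, axis=-1, keepdims=True)  # for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z))   # values in (0, 1)
print(np.tanh(z))   # values in (-1, 1)
print(relu(z))      # [0. 0. 3.]
print(softmax(z))   # non-negative, sums to 1

# Two linear layers with no activation collapse into one linear map.
rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((4, 3)), rng.standard_normal((2, 4))
x = rng.standard_normal(3)
two_layers = W2 @ (W1 @ x)
one_layer = (W2 @ W1) @ x
print(np.allclose(two_layers, one_layer))  # True

# With ReLU between the layers, the network is no longer a single linear map.
with_relu = W2 @ relu(W1 @ x)
```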

DNN-Specific Parameters

| Parameter | Effect |
| --- | --- |
| Number of layers | More depth can learn more complex patterns but can overfit |
| Hidden units | More units increase capacity |
| Batch normalization | Stabilizes layer inputs |
| Layer normalization | Common in transformers |
| Residual connection | Helps train deep networks |
| Weight initialization | Affects early training stability |
| Gradient clipping | Prevents exploding gradients |
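
As one concrete example from the table, gradient clipping by global norm can be sketched as below. The helper name `clip_by_global_norm` and the gradient values are made up. Clipping by global norm is one common variant, and clipping each value individually is another.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total_norm = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]  # joint norm = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = float(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))

print(norm_before)  # 13.0
print(norm_after)   # approximately 1.0
```

The direction of the update is preserved and only its length is limited. Gradients that are already below the threshold pass through unchanged.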

ML vs. DL vs. DNN vs. DRL

| Term | Meaning |
| --- | --- |
| ML | General field where models learn from data |
| DL | Deep learning using neural networks with many layers |
| DNN | Deep neural network architecture |
| RL | Reinforcement learning through actions and rewards |
| DRL | Deep reinforcement learning, where deep networks are used inside RL |

In short:

DL is a part of ML.
DNN is a type of DL model.
DRL combines deep learning with reinforcement learning.

DRL-Specific Parameters

Deep Reinforcement Learning has extra parameters because the agent learns by interacting with an environment.

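| Term | Meaning |
| --- | --- |
| State s | What the agent observes |
| Action a | What the agent chooses |
| Reward r | Feedback from the environment |
| Policy π(a∣s) | How the agent chooses an action a in state s |
| Episode | One full interaction run |
| Return | Total future reward |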

The common discounted return, with discount factor 0 ≤ γ ≤ 1, is:

Discounted Return
Gₜ = rₜ + γrₜ₊₁ + γ²rₜ₊₂ + ⋯
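
The discounted return can be computed by working backward through the rewards, using G = r + γ·(return of the remaining steps). The reward sequence is a made-up example.

```python
def discounted_return(rewards, gamma):
    """Return G_0 = r_0 + gamma * r_1 + gamma**2 * r_2 + ... for a finite episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, gamma=0.9))  # about 2.71 = 1 + 0.9 + 0.81
print(discounted_return(rewards, gamma=0.0))  # 1.0: only the immediate reward counts
print(discounted_return(rewards, gamma=1.0))  # 3.0: all rewards count equally
```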

Important DRL Hyperparameters

| Hyperparameter | Effect |
| --- | --- |
| Discount factor γ | Controls how much future rewards matter |
| Exploration rate ε | Controls random exploration in epsilon-greedy methods |
| Replay buffer size | Controls how many past experiences are stored |
| Target-network update rate | Stabilizes Q-learning-style methods |
| Reward scaling | Changes the magnitude of learning signals |
| Episode length | Controls how far the agent can interact before reset |
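
Two of these hyperparameters can be made concrete with a short sketch: a fixed-capacity replay buffer and a soft target-network update. The class `ReplayBuffer`, the function `soft_update`, the transition layout, and the rate `tau` are illustrative assumptions, not a specific library's API.

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Stores the most recent `capacity` transitions and samples them uniformly."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)  # oldest items are dropped first
        self.rng = random.Random(seed)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

def soft_update(target_weights, online_weights, tau):
    """Move each target weight a fraction tau toward the online weight."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_weights, online_weights)]

buffer = ReplayBuffer(capacity=3)
for step in range(5):
    # Hypothetical layout: (state, action, reward, next_state, done)
    buffer.push((step, 0, 1.0, step + 1, False))

print(len(buffer))                    # 3: the two oldest transitions were discarded
print([t[0] for t in buffer.buffer])  # [2, 3, 4]

target = soft_update([np.zeros(2)], [np.ones(2)], tau=0.1)
print(target[0])                      # [0.1 0.1]
```

A small `tau` makes the target network change slowly, which is what gives Q-learning-style methods a more stable training target.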

Effects in DRL

| Setting | If too small | If too large |
| --- | --- | --- |
| γ | Agent becomes short-sighted | Agent may overvalue distant uncertain rewards |
| ε | Agent may stop exploring too early | Agent may behave too randomly |
| Replay buffer | Learns from too little experience | Uses more memory and may learn from stale data |
| Learning rate | Slow learning | Unstable Q-values or policy updates |
| Reward scale | Weak gradients | Exploding or unstable updates |
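
The role of ε can be sketched with an epsilon-greedy action rule and a linear decay schedule. The schedule shape and its constants (`eps_start`, `eps_end`, `decay_steps`) are assumptions for illustration, and the Q-values are made up.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the best-valued one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def decayed_epsilon(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    """Linearly reduce epsilon from eps_start to eps_end, then hold it constant."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.9, 0.3])

for step in (0, 5_000, 10_000, 50_000):
    eps = decayed_epsilon(step)
    actions = [epsilon_greedy(q_values, eps, rng) for _ in range(1000)]
    greedy_share = actions.count(1) / len(actions)
    print(f"step {step:6d}: epsilon = {eps:.3f}, greedy action share = {greedy_share:.2f}")
```

Early in training the agent acts almost at random. As ε decays, it picks the highest-valued action more and more often while keeping a small amount of exploration.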

Practical Tuning Order

For many ML/DL projects, tune in this order:

  1. Fix the data split and random seed.
  2. Choose a simple baseline.
  3. Check underfitting or overfitting.
  4. Tune learning rate.
  5. Tune batch size and epochs.
  6. Add regularization if overfitting.
  7. Increase model capacity only when the model underfits.
  8. Test with multiple seeds.

Quick Diagnosis Table

| Observation | Likely issue | What to try |
| --- | --- | --- |
| Training and validation errors are both high | Underfitting | Bigger model, better features, lower regularization |
| Training error is low but validation error is high | Overfitting | More data, regularization, dropout, early stopping |
| Loss is unstable or explodes | Learning rate too high | Lower learning rate, gradient clipping |
| Training is very slow | Learning rate too low or model too large | Increase learning rate carefully, simplify model |
| Results change a lot across runs | Seed sensitivity | Run multiple seeds and report mean/std |
| DRL reward is noisy | Exploration or environment variance | Tune ε, replay buffer, reward scaling |

Takeaway

Parameters and hyperparameters control how a model learns, how stable it is, and how well it generalizes. For beginners, the most important ideas are seed, train-validation-test split, learning rate, batch size, epoch, bias, variance, overfitting, underfitting, and regularization. In DRL, also pay close attention to exploration, discount factor, replay buffer, and reward design.
