Md Tarek Hassan | PhD Researcher in 6G Wireless, RIS and ISAC

Why These Parameters Matter

Machine learning is not only about choosing an algorithm. The final result also depends on many settings, choices, and training conditions.

Some of these are learned by the model:

Weights.
Bias terms.
Neural-network layer parameters.
Q-values or policy parameters in reinforcement learning.

Some are chosen by the researcher or engineer before training:

Learning rate.
Batch size.
Number of epochs.
Random seed.
Model depth.
Regularization strength.
Discount factor in reinforcement learning.

These user-chosen settings are called hyperparameters. They strongly affect speed, stability, accuracy, generalization, and reproducibility.

Parameter vs. Hyperparameter

Term	Meaning	Example
Parameter	Learned from data during training	Weight `w`, coefficient `β`, neural-network bias
Hyperparameter	Chosen before or during training setup	Learning rate, batch size, number of layers
Metric	Used to evaluate performance	Accuracy, F1-score, MAE, reward
Loss	Function the model tries to minimize	Cross-entropy, MSE

Simple rule:

If the model learns it, it is usually a parameter.
If you set it before training, it is usually a hyperparameter.

Random Seed

A seed controls randomness so experiments can be repeated.

Randomness appears in:

Train-test splitting.
Weight initialization.
Data shuffling.
Dropout.
Mini-batch selection.
Reinforcement-learning exploration.

If two experiments use the same seed on the same software and hardware stack, their results are much more likely to match.

import random
import numpy as np

random.seed(42)     # seeds Python's built-in random module
np.random.seed(42)  # seeds NumPy's legacy global random generator

Effect of Seed

Seed behavior	Effect
Fixed seed	Better reproducibility
Different seeds	Better estimate of result stability
Reporting only one lucky seed	Can exaggerate performance

Good practice is to run important experiments with multiple seeds and report the mean and variation.
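The multi-seed practice above can be sketched as follows. This is a minimal illustration, not a real training run: `run_experiment` is a hypothetical stand-in that returns a noisy accuracy, and the seed list is arbitrary.

```python
import random
import statistics

def run_experiment(seed: int) -> float:
    """Hypothetical stand-in for a full train-and-evaluate run.

    Returns a noisy "accuracy" so the effect of the seed is visible.
    """
    rng = random.Random(seed)           # local generator, isolated from global state
    return 0.90 + rng.gauss(0.0, 0.01)  # pretend accuracy scattered around 0.90

seeds = [0, 1, 2, 3, 4]
scores = [run_experiment(s) for s in seeds]

mean_score = statistics.mean(scores)
std_score = statistics.stdev(scores)    # sample standard deviation across seeds
print(f"accuracy = {mean_score:.3f} +/- {std_score:.3f} over {len(seeds)} seeds")
```

Reporting the mean together with the spread shows how much of a result is due to the method and how much is due to a favorable seed.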

Dataset Split

A dataset is usually divided into:

Split	Purpose
Training set	Used to fit the model
Validation set	Used to tune hyperparameters
Test set	Used once at the end for final evaluation

The test set should not influence training decisions. If it does, the evaluation becomes overly optimistic.
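A three-way split can be sketched with the standard library alone. The function name `train_val_test_split` and the 70/15/15 fractions are assumptions made for this example; libraries such as scikit-learn provide their own utilities.

```python
import random

def train_val_test_split(items, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle once with a fixed seed, then cut into three disjoint parts."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```

Because the three parts come from one shuffled list, no sample can appear in more than one split.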

Feature Scaling

Feature scaling changes feature ranges so one feature does not dominate another.

Common methods:

Method	Formula	Meaning
Standardization	`z = (x − μ) / σ`	Mean 0 and standard deviation 1
Min-max scaling	`x' = (x − xmin) / (xmax − xmin)`	Range between 0 and 1

Scaling is important for:

KNN.
SVM/SVR.
Ridge and Lasso.
Kernel Ridge Regression.
Neural networks.
Gradient descent.

Tree-based models usually need less scaling.
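The two scaling formulas above can be written directly in Python. This is a small sketch; the feature values in `distances_m` are made up for illustration.

```python
import statistics

def standardize(values):
    """z = (x - mean) / std, using the population standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(x - mu) / sigma for x in values]

def min_max_scale(values):
    """x' = (x - min) / (max - min), mapping the data onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

distances_m = [120.0, 450.0, 800.0, 1500.0]  # hypothetical feature with a wide range
z = standardize(distances_m)
scaled = min_max_scale(distances_m)
print(z)
print(scaled)
```

In a real pipeline the mean, standard deviation, minimum, and maximum should be computed on the training set only and then reused for the validation and test sets, so that no information leaks from the held-out data.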

Loss Function

The loss function measures how wrong the model is during training.

For regression, a common loss is mean squared error:

Mean Squared Error: MSE = (1/n) Σᵢ (yᵢ - ŷᵢ)²

For classification, a common loss is cross-entropy:

Binary Cross-Entropy: L = -[y log(p) + (1 - y) log(1 - p)]

Effect of Loss Function

Loss choice	Effect
MSE	Strongly punishes large regression errors
MAE	More robust to outliers
Cross-entropy	Standard for classification probabilities
Huber loss	Mixes MSE and MAE behavior
RL reward objective	Optimizes long-term return, not ordinary prediction error
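The losses in the table can be compared on a small example. The sketch below uses plain Python; the data contains one deliberately large error to show how differently each loss reacts, and the Huber threshold `delta=1.0` is an assumed default.

```python
import math

def mse(y_true, y_pred):
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)

def huber(y_true, y_pred, delta=1.0):
    """Quadratic for small errors, linear for errors larger than delta."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        err = abs(y - p)
        total += 0.5 * err ** 2 if err <= delta else delta * (err - 0.5 * delta)
    return total / len(y_true)

def binary_cross_entropy(y, p, eps=1e-12):
    """Loss for one example with label y in {0, 1} and predicted probability p."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 14.0]  # the last prediction is off by 10

print(mse(y_true, y_pred), mae(y_true, y_pred), huber(y_true, y_pred))
print(binary_cross_entropy(1, 0.9), binary_cross_entropy(1, 0.1))
```

The single large error dominates the MSE far more than the MAE or Huber loss, and cross-entropy grows sharply when the model assigns low probability to the correct class.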

Learning Rate

The learning rate controls how large each update step is during optimization.

Gradient Descent Update: θₜ₊₁ = θₜ - η∇L(θₜ)

Where:

Symbol	Meaning
`θ`	model parameters
`η`	learning rate
`∇L(θ)`	gradient of the loss

Effect of Learning Rate

Learning rate	Effect
Too small	Training is stable but slow
Good value	Training improves smoothly
Too large	Loss may jump, oscillate, or explode

In deep learning, learning-rate schedules often reduce η during training.
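The effect of the learning rate is easy to see on a toy problem. The sketch below applies the update rule to the loss L(θ) = θ², whose gradient is 2θ; the loss function, starting point, and the three learning rates are chosen only for illustration.

```python
def gradient_descent(lr, theta0=5.0, steps=20):
    """Minimize L(theta) = theta**2 with a fixed learning rate."""
    theta = theta0
    for _ in range(steps):
        grad = 2.0 * theta          # dL/dtheta
        theta = theta - lr * grad   # gradient descent update
    return theta

for lr in (0.01, 0.3, 1.1):
    print(f"lr={lr}: theta after 20 steps = {gradient_descent(lr):.6g}")
```

With the small rate θ is still far from the minimum at 0, with the moderate rate it is essentially there, and with the large rate each step overshoots and the magnitude of θ grows.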

Batch Size

Batch size is the number of training examples used to compute one parameter update.

Batch size	Effect
Small batch	Noisier updates, can generalize well, slower per epoch
Medium batch	Common practical balance
Large batch	Stable gradients, faster hardware use, may generalize worse if not tuned

Example:

Dataset size = 10,000
Batch size = 100
Iterations per epoch = 10,000 / 100 = 100

Epoch

One epoch means the model has seen the full training dataset once.

1 epoch = one complete pass through the training set

Effect of Epochs

Number of epochs	Effect
Too few	Underfitting
Enough	Good learning
Too many	Possible overfitting

Early stopping can stop training when validation performance stops improving.
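A patience-based version of early stopping might look like the sketch below. The function name, the `patience` value, and the list of validation losses are all hypothetical; deep-learning frameworks ship their own callbacks for this.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch at which training would stop."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch        # no improvement for `patience` epochs in a row
    return len(val_losses) - 1      # never triggered: ran all epochs

val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.60]
stop = early_stopping_epoch(val_losses, patience=3)
print("stop at epoch index", stop)
```

In practice the model weights from the best validation epoch are usually saved and restored after stopping.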

Iteration

An iteration is one parameter update.

For mini-batch training:

iterations per epoch = number of training samples / batch size (rounded up if the last batch is smaller)

Epoch and iteration are related, but they are not the same.
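The relationship between samples, batch size, and iterations can be checked with a short sketch. The helper names are illustrative, and the 1,050-sample dataset is an assumed example where the last batch is only partly filled.

```python
import math

def iterations_per_epoch(n_samples, batch_size):
    """Number of parameter updates in one pass over the data."""
    return math.ceil(n_samples / batch_size)

def minibatches(data, batch_size):
    """Yield consecutive slices of the data; the last one may be shorter."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

print(iterations_per_epoch(10_000, 100))

data = list(range(1050))
batches = list(minibatches(data, 100))
print(len(batches), len(batches[-1]))
```

Training for several epochs multiplies the iteration count: the total number of updates is the number of epochs times the iterations per epoch.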

Weight

A weight controls how strongly an input feature affects the output.

For a linear model:

ŷ = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ

The coefficients β₁, β₂, ..., βₚ are weights.

Large weights can make a model overly sensitive to small changes in its inputs. Regularization keeps the weights small.

Bias

The word bias has two common meanings in machine learning.

Bias as an Intercept

In a model equation, bias is the constant term:

ŷ = β₀ + β₁x

Here, β₀ is the bias or intercept.

It shifts the prediction up or down.
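The roles of weights and bias in a linear model can be shown in a few lines. The numbers below are arbitrary example values.

```python
def predict(features, weights, bias):
    """Linear model: bias plus the weighted sum of the features."""
    return bias + sum(w * x for w, x in zip(weights, features))

weights = [2.0, -0.5]
x = [3.0, 4.0]

y0 = predict(x, weights, bias=0.0)
y1 = predict(x, weights, bias=1.5)            # same weights, shifted output
y2 = predict([4.0, 4.0], weights, bias=0.0)   # first feature increased by 1
print(y0, y1, y2)
```

Changing the bias moves every prediction by the same amount, while increasing a feature by one unit changes the prediction by that feature's weight.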

Bias as Model Assumption

Bias can also mean the error caused by overly simple assumptions.

High bias often means:

The model is too simple.
Training error is high.
Validation error is high.
The model underfits.

Variance

Variance measures how much the model's predictions change when the training data changes.

High variance often means:

The model is too flexible.
Training error is low.
Validation/test error is high.
The model overfits.

Underfitting

Underfitting happens when the model is too simple to learn the real pattern.

Signs:

High training error.
High validation error.
Model predictions are too crude.

Common fixes:

Use a more expressive model.
Train longer.
Add useful features.
Reduce excessive regularization.

Overfitting

Overfitting happens when the model learns the training data too closely, including noise.

Signs:

Very low training error.
Much higher validation/test error.
Model performs poorly on new data.

Common fixes:

Use more data.
Use regularization.
Use dropout.
Use early stopping.
Use data augmentation.
Reduce model complexity.

Bias-Variance Tradeoff

The model should be complex enough to learn real structure, but not so complex that it memorizes noise.

Training error	Validation error	Situation
High	High	Underfitting
Low	High	Overfitting
Low	Low	Good fit
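The table can be turned into a rough rule of thumb. In the sketch below, the thresholds `high_error` and `max_gap` are hypothetical and would need to be chosen for each task and metric.

```python
def diagnose_fit(train_error, val_error, high_error=0.20, max_gap=0.05):
    """Classify a training run from its training and validation errors.

    high_error and max_gap are illustrative thresholds, not standard values.
    """
    if train_error > high_error:
        return "underfitting"       # the model cannot even fit the training set
    if val_error - train_error > max_gap:
        return "overfitting"        # fits training data but does not generalize
    return "good fit"

print(diagnose_fit(0.30, 0.32))
print(diagnose_fit(0.02, 0.25))
print(diagnose_fit(0.05, 0.07))
```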

Regularization

Regularization discourages unnecessary complexity.

Common types:

Method	Effect
L1 regularization	Can push some weights to zero
L2 regularization	Shrinks weights smoothly
Dropout	Randomly disables neurons during training
Early stopping	Stops before validation performance gets worse
Data augmentation	Creates more varied training examples

Ridge Regression uses L2 regularization:

minimize Σᵢ(yᵢ - ŷᵢ)² + λΣⱼβⱼ²
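For a single feature and no intercept, the ridge objective has a closed-form solution, β = Σxᵢyᵢ / (Σxᵢ² + λ), which makes the shrinking effect of λ visible. The sketch below assumes this simplified one-dimensional case and made-up data.

```python
def ridge_slope(xs, ys, lam):
    """Minimizer of sum((y - beta*x)**2) + lam * beta**2 for one feature."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]   # exactly y = 2x

for lam in (0.0, 1.0, 14.0, 100.0):
    print(f"lambda={lam}: beta={ridge_slope(xs, ys, lam):.4f}")
```

With λ = 0 the ordinary least-squares slope is recovered, and larger values of λ pull the coefficient toward zero.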

Dropout

Dropout is mainly used in neural networks. During training, some neurons are randomly ignored.

Effect:

Reduces dependence on individual neurons.
Helps prevent overfitting.
Can slow training slightly.

Dropout is turned off during validation, testing, and inference, so every neuron contributes to the prediction.
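A common formulation is "inverted" dropout, where the surviving activations are scaled up during training so that nothing needs to change at evaluation time. The sketch below assumes that variant; the function signature is illustrative and not a framework API.

```python
import random

def dropout(activations, drop_prob, training, rng=random):
    """Inverted dropout on a list of activations."""
    if not training or drop_prob == 0.0:
        return list(activations)              # evaluation: pass through unchanged
    keep_prob = 1.0 - drop_prob
    return [
        a / keep_prob if rng.random() < keep_prob else 0.0   # keep and rescale, or drop
        for a in activations
    ]

rng = random.Random(0)
acts = [1.0] * 10_000
train_out = dropout(acts, drop_prob=0.5, training=True, rng=rng)
eval_out = dropout(acts, drop_prob=0.5, training=False)

print(sum(train_out) / len(train_out))   # close to 1.0 on average
print(eval_out[:3])
```

Rescaling by 1 / keep_prob keeps the expected activation the same in training and evaluation.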

Optimizer

An optimizer decides how parameters are updated.

Optimizer	Common use
SGD	Simple, strong baseline
Momentum	Smooths noisy updates
RMSProp	Useful for non-stationary gradients
Adam	Common default in deep learning
AdamW	Adam with better weight decay behavior

Optimizer choice affects training speed and stability.
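To make the difference between update rules concrete, the sketch below compares plain SGD with one common form of SGD with momentum on a single scalar parameter. The momentum coefficient `beta=0.9` is an assumed typical value; adaptive methods such as Adam are not shown.

```python
def sgd_step(theta, grad, lr):
    """Plain SGD: step directly against the gradient."""
    return theta - lr * grad

def momentum_step(theta, velocity, grad, lr, beta=0.9):
    """SGD with momentum: step along a running sum of past gradients."""
    velocity = beta * velocity + grad
    theta = theta - lr * velocity
    return theta, velocity

# Two steps on L(theta) = theta**2, whose gradient is 2 * theta.
theta_sgd = 1.0
theta_mom, v = 1.0, 0.0
for _ in range(2):
    theta_sgd = sgd_step(theta_sgd, 2.0 * theta_sgd, lr=0.1)
    theta_mom, v = momentum_step(theta_mom, v, 2.0 * theta_mom, lr=0.1)

print(theta_sgd, theta_mom)
```

After the first step both methods agree; from the second step on, momentum moves further because the previous gradient is still contributing.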

Activation Function

Activation functions introduce non-linearity in neural networks.

Without activation functions, a deep network would behave like one linear model.

Single Neuron: a = g(β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ)

Activation	Effect
Sigmoid	Outputs 0 to 1, can saturate
Tanh	Outputs -1 to 1
ReLU	Fast and common, can create dead neurons
GELU	Common in transformer models
Softmax	Converts class scores into probabilities
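Several of the activations in the table, and the single-neuron formula, can be written with the standard `math` module. This is a small sketch with numerically stable versions of sigmoid and softmax; the example weights and inputs are arbitrary.

```python
import math

def sigmoid(z):
    """Stable sigmoid: avoids overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def relu(z):
    return max(0.0, z)

def softmax(scores):
    """Subtract the maximum before exponentiating for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def neuron(x, weights, bias, g):
    """a = g(bias + sum of weight * input)."""
    return g(bias + sum(w * xi for w, xi in zip(weights, x)))

print(sigmoid(0.0), math.tanh(0.0), relu(-3.0), relu(2.5))
print(softmax([1.0, 2.0, 3.0]))
print(neuron([1.0, 2.0], [0.5, -0.25], bias=0.0, g=relu))
```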

DNN-Specific Parameters

Parameter	Effect
Number of layers	More depth can learn more complex patterns but can overfit
Hidden units	More units increase capacity
Batch normalization	Stabilizes layer inputs
Layer normalization	Common in transformers
Residual connection	Helps train deep networks
Weight initialization	Affects early training stability
Gradient clipping	Prevents exploding gradients
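Gradient clipping, the last entry in the table, is often implemented by rescaling the gradient when its norm exceeds a threshold. The sketch below shows that norm-based variant on a flat list of gradient values; the function name and threshold are illustrative.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale the gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)          # already small enough: leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]

clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)   # original norm is 5
print(clipped)
```

Clipping by norm shortens the step but keeps the direction of the gradient.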

ML vs. DL vs. DNN vs. DRL

Term	Meaning
ML	General field where models learn from data
DL	Deep learning using neural networks with many layers
DNN	Deep neural network architecture
RL	Reinforcement learning through actions and rewards
DRL	Deep reinforcement learning, where deep networks are used inside RL

In short:

DL is a part of ML.
DNN is a type of DL model.
DRL combines deep learning with reinforcement learning.

DRL-Specific Parameters

Deep Reinforcement Learning has extra parameters because the agent learns by interacting with an environment.

Term	Meaning
State `s`	What the agent observes
Action `a`	What the agent chooses
Reward `r`	Feedback from the environment
Policy `π(a|s)`	How the agent chooses an action given a state
Episode	One full interaction run
Return	Total future reward

The common discounted return is:

Discounted Return: Gₜ = rₜ + γrₜ₊₁ + γ²rₜ₊₂ + ...
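For a finite episode, the discounted return can be computed by working backward through the rewards. The sketch below assumes a short list of example rewards.

```python
def discounted_return(rewards, gamma):
    """G = r_0 + gamma * r_1 + gamma**2 * r_2 + ... for a finite reward list."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g       # G_t = r_t + gamma * G_{t+1}
    return g

rewards = [1.0, 1.0, 1.0]
for gamma in (0.0, 0.5, 1.0):
    print(gamma, discounted_return(rewards, gamma))
```

With γ = 0 only the immediate reward counts, and with γ = 1 all rewards count equally.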

Important DRL Hyperparameters

Hyperparameter	Effect
Discount factor `γ`	Controls how much future rewards matter
Exploration rate `ε`	Controls random exploration in epsilon-greedy methods
Replay buffer size	Controls how many past experiences are stored
Target-network update rate	Stabilizes Q-learning-style methods
Reward scaling	Changes the magnitude of learning signals
Episode length	Controls how far the agent can interact before reset
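Two of these hyperparameters map directly onto small pieces of code. The sketch below shows a fixed-capacity replay buffer and a soft target-network update, assuming the common form target ← τ · online + (1 − τ) · target; the class and function names are illustrative, and network parameters are represented as plain lists of floats.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores the most recent `capacity` transitions."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest items are dropped automatically

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

def soft_update(target_params, online_params, tau):
    """Move each target parameter a fraction tau toward the online parameter."""
    return [tau * o + (1.0 - tau) * t for t, o in zip(target_params, online_params)]

buf = ReplayBuffer(capacity=3)
for step in range(5):
    buf.add(state=step, action=0, reward=1.0, next_state=step + 1, done=False)

print(len(buf), buf.sample(2, rng=random.Random(0)))
print(soft_update([0.0, 0.0], [1.0, 2.0], tau=0.1))
```

A small τ makes the target network change slowly, which is the stabilizing effect described in the table.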

Effects in DRL

Setting	If too small	If too large
`γ`	Agent becomes short-sighted	Agent may overvalue distant uncertain rewards
`ε`	Agent may stop exploring too early	Agent may behave too randomly
Replay buffer	Learns from too little experience	Uses more memory and may learn from stale data
Learning rate	Slow learning	Unstable Q-values or policy updates
Reward scale	Weak gradients	Exploding or unstable updates
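The exploration trade-off in the `ε` row is often handled by starting with a large `ε` and decaying it over time. The sketch below shows epsilon-greedy action selection with a linear decay schedule; the schedule shape and its default values are assumptions for illustration.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit

def decayed_epsilon(step, eps_start=1.0, eps_end=0.05, decay_steps=1000):
    """Linearly decay epsilon from eps_start to eps_end, then hold it."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

q_values = [0.1, 0.9, 0.3]
rng = random.Random(0)
for step in (0, 500, 1000, 5000):
    eps = decayed_epsilon(step)
    print(step, round(eps, 3), epsilon_greedy(q_values, eps, rng))
```

Early in training the agent acts almost randomly; later it mostly follows its current value estimates while keeping a small amount of exploration.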

Practical Tuning Order

For many ML/DL projects, tune in this order:

1. Fix the data split and random seed.
2. Choose a simple baseline.
3. Check underfitting or overfitting.
4. Tune learning rate.
5. Tune batch size and epochs.
6. Add regularization if overfitting.
7. Increase model capacity only when the model underfits.
8. Test with multiple seeds.

Quick Diagnosis Table

Observation	Likely issue	What to try
Training and validation errors are both high	Underfitting	Bigger model, better features, lower regularization
Training error is low but validation error is high	Overfitting	More data, regularization, dropout, early stopping
Loss is unstable or explodes	Learning rate too high	Lower learning rate, gradient clipping
Training is very slow	Learning rate too low or model too large	Increase learning rate carefully, simplify model
Results change a lot across runs	Seed sensitivity	Run multiple seeds and report mean/std
DRL reward is noisy	Exploration or environment variance	Tune `ε`, replay buffer, reward scaling

Takeaway

Parameters and hyperparameters control how a model learns, how stable it is, and how well it generalizes. For beginners, the most important ideas are seed, train-validation-test split, learning rate, batch size, epoch, bias, variance, overfitting, underfitting, and regularization. In DRL, also pay close attention to exploration, discount factor, replay buffer, and reward design.

References and Further Reading

I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, 2016.
R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed., MIT Press, 2018.
C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
Scikit-learn documentation, "Model evaluation".

Basic Parameters in ML, DL, DNN, and DRL