Tarek Hassan

Reinforcement Learning & DRL

What is Reinforcement Learning?

Reinforcement Learning (RL) is about learning how to act. An agent interacts with an environment, takes actions, receives rewards, and gradually learns a strategy that produces high long-term reward.

Unlike in supervised learning, the agent is not given the correct answer for every situation. It must learn from the consequences of its actions.

Examples:

  • A robot learns to walk.
  • A game agent learns to win.
  • A wireless controller learns channel access or power allocation.
  • A vehicle learns safe merging behavior in simulation.

The RL Loop

Every RL problem has a feedback loop:

  1. The agent observes the current state.
  2. The agent chooses an action.
  3. The environment changes.
  4. The agent receives a reward.
  5. The agent observes the next state.

Diagram: State -> Agent -> Action -> Environment -> Reward + Next State
(Concept adapted from Tpoint Tech/Javatpoint reinforcement-learning diagrams and Sutton-Barto RL notation.)
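
In code, this loop is only a few lines. A minimal sketch, assuming a Gymnasium-style environment (CartPole is just a convenient example, and the random policy is a placeholder for a learned one):

import gymnasium as gym

env = gym.make("CartPole-v1")
state, info = env.reset()

for _ in range(200):
    action = env.action_space.sample()  # placeholder policy: pick a random action
    next_state, reward, terminated, truncated, info = env.step(action)
    state = next_state
    if terminated or truncated:
        state, info = env.reset()

env.close()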

Key Components

Agent

The learner or decision maker.

Examples: robot, game-playing AI, scheduler, radio resource controller.

Environment

The world the agent interacts with.

Examples: a game, a robot simulation, a wireless network, a traffic system.

State

The information available to the agent at a given time.

Examples: robot joint positions, channel state, queue length, game screen, or vehicle speed.

Action

The choice made by the agent.

Examples: move left, increase transmit power, select a beam, allocate a resource block.

Reward

The feedback signal.

Examples: +1 for success, -1 for collision, high reward for throughput, penalty for energy use.

Policy

The policy is the agent's behavior rule. It maps states to actions.

policy: state -> action

The goal of RL is to learn a policy that maximizes long-term reward.
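
As a toy illustration, a policy can be as simple as a lookup table from states to actions; the states and actions below are made up:

policy = {
    "battery_low": "return_to_dock",
    "path_clear": "move_forward",
    "obstacle_ahead": "turn_left",
}

action = policy["path_clear"]  # the policy maps the current state to an action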

Immediate Reward vs. Long-Term Return

RL is difficult because the action with the best long-term outcome may not give the best immediate reward. The agent must think about future consequences.

Example:

  • A robot may take a slow step now to avoid falling later.
  • A network scheduler may serve a weaker user now to improve fairness over time.
  • A game agent may sacrifice a piece now to win later.

The discounted return is a common way to value future rewards:

return = r1 + gamma*r2 + gamma^2*r3 + ...

where gamma is the discount factor, a number between 0 and 1. A larger gamma makes the agent care more about future rewards.
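
A small sketch of this formula in Python (the reward list is arbitrary):

def discounted_return(rewards, gamma=0.99):
    # Accumulate from the last reward backwards: G_t = r_t + gamma * G_{t+1}.
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 0.0, 10.0], gamma=0.9))  # later rewards count less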

Exploration vs. Exploitation

The agent must balance two behaviors:

  • Exploration: Try new actions to discover better strategies.
  • Exploitation: Use the best-known action to collect reward.

If the agent only exploits, it may get stuck with a mediocre strategy. If it only explores, it may never use what it has learned.

Common exploration methods (the first two are sketched in code after this list):

  • Epsilon-greedy action selection.
  • Softmax action selection.
  • Upper confidence bound.
  • Entropy bonuses in policy-gradient methods.
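
A minimal sketch of epsilon-greedy and softmax selection, using hypothetical Q-values for a single state:

import numpy as np

rng = np.random.default_rng(0)
q_values = np.array([1.0, 0.5, 0.2])   # hypothetical action values for one state

# Epsilon-greedy: act randomly with probability epsilon, otherwise act greedily.
epsilon = 0.1
if rng.random() < epsilon:
    action = int(rng.integers(len(q_values)))
else:
    action = int(np.argmax(q_values))

# Softmax: sample actions with probability proportional to exp(Q / temperature).
temperature = 0.5
probs = np.exp(q_values / temperature)
probs /= probs.sum()
action = int(rng.choice(len(q_values), p=probs))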

Value Functions

A value function estimates how good a state or action is.

State Value

The state value estimates the expected future reward from a state if the agent follows a policy.

V(s) = expected return from state s

Action Value

The action value estimates the expected future reward after taking action a in state s.

Q(s, a) = expected return after action a in state s

Q-values are central to Q-learning and Deep Q-Networks.

Q-Learning

Q-learning is a classic value-based RL algorithm. It learns a table of action values.

For each state-action pair, the agent updates its estimate using the reward received and the estimated value of the next state.
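
Written in the same plain notation as the other formulas on this page, the standard Q-learning update is:

Q(s, a) <- Q(s, a) + alpha * (r + gamma * max over a' of Q(s', a') - Q(s, a))

where alpha is the learning rate and r is the reward received after taking action a in state s.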

Intuition

If an action leads to good future rewards, increase its Q-value. If it leads to bad outcomes, decrease its Q-value.

Limitation

Q-learning works well when the state and action spaces are small. It becomes impractical when states are huge, such as raw images or continuous sensor readings.

Policy Gradient Methods

Policy gradient methods learn the policy directly instead of learning a value table first.

Intuition

If an action leads to high reward, make that action more likely in similar situations. If it leads to poor reward, make it less likely.
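
In the simplest policy-gradient method (REINFORCE), each update moves the policy parameters in the direction

return * gradient of log policy(action | state)

so actions that were followed by high return become more probable in similar states.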

These methods are useful for continuous action spaces, such as robot control or power-control problems.

Deep Reinforcement Learning

Deep Reinforcement Learning (DRL) combines neural networks with RL. The neural network approximates a policy, a value function, or both.

High-dimensional state -> Neural network -> Action or value estimate

The breakthrough result from DeepMind showed that a Deep Q-Network could learn to play Atari games directly from raw pixels. This connected representation learning with reinforcement learning at scale.
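
As a minimal sketch of the idea, a small PyTorch network can map a state vector to one value estimate per action; the layer sizes and dimensions below are illustrative:

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action."""

    def __init__(self, state_dim=4, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, state):
        return self.net(state)

q_net = QNetwork()
q_values = q_net(torch.rand(1, 4))   # one row of Q-values for a single state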

Deep Q-Network

A Deep Q-Network (DQN) replaces the Q-table with a neural network.

Key Ideas

  • Experience replay: Store past transitions and sample them randomly during training.
  • Target network: Use a slowly updated copy of the network to stabilize targets.
  • CNN feature extraction: Learn useful representations from game images or other high-dimensional observations.

These ideas made value-based deep RL much more stable. A minimal experience-replay sketch is shown below.
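
The sketch uses a plain Python deque; the class name and default sizes are illustrative:

import random
from collections import deque

class ReplayBuffer:
    """Stores past transitions and samples random minibatches for training."""

    def __init__(self, capacity=50_000):
        self.buffer = deque(maxlen=capacity)  # old transitions are dropped automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)

The target network is conceptually simpler: keep a second copy of the Q-network and copy the online weights into it only occasionally, so the bootstrap targets change slowly.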

Actor-Critic Methods

Actor-critic methods combine two models:

  • Actor: Chooses actions.
  • Critic: Estimates how good those actions are.

The critic helps the actor learn more efficiently. Many modern DRL methods are actor-critic variants.
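
A common way the critic helps is through an advantage signal:

advantage = reward + gamma * V(next state) - V(current state)

The actor makes actions with positive advantage more likely, while the critic is trained so that its value estimates are consistent with the rewards actually observed.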

RL in Wireless and Communication Systems

RL is attractive in communication systems because many problems involve sequential decisions under uncertainty.

Examples:

  • Beam selection.
  • Power control.
  • Spectrum access.
  • Handover decisions.
  • Edge computing offloading.
  • RIS phase configuration.
  • Resource allocation under changing traffic.

However, RL must be used carefully. Real wireless systems have safety, latency, sample-efficiency, and explainability constraints. Simulation-to-reality mismatch is also a major issue.

Practical Challenges

  • Sample inefficiency: RL may need many interactions.
  • Reward design: A poorly designed reward can produce unwanted behavior.
  • Stability: Training can be noisy or unstable.
  • Safety: Exploration can be risky in real systems.
  • Generalization: A policy trained in one environment may fail in another.
  • Evaluation: Results should be tested across many random seeds and scenarios.

Python: Minimal Q-Learning Table

This tiny example shows the skeleton of tabular Q-learning. It uses a small Gymnasium environment with the standard reset() and step(action) interface; any discrete environment with a handful of states and actions works the same way.

import numpy as np
import gymnasium as gym

# Any small discrete environment works here; FrozenLake's 4x4 grid happens to
# have exactly 16 states and 4 actions, matching the table shape below.
env = gym.make("FrozenLake-v1")

n_states = 16
n_actions = 4
q = np.zeros((n_states, n_actions))

alpha = 0.1      # learning rate
gamma = 0.99     # discount factor
epsilon = 0.1    # exploration probability
rng = np.random.default_rng(42)

for episode in range(1000):
    state = env.reset()[0]
    done = False

    while not done:
        # Epsilon-greedy action selection.
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))
        else:
            action = int(np.argmax(q[state]))

        next_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated

        # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
        best_next = np.max(q[next_state])
        q[state, action] += alpha * (
            reward + gamma * best_next - q[state, action]
        )

        state = next_state

Python: Load a DRL Agent with Stable-Baselines3

For practical DRL, it is better to start with a tested library. Stable-Baselines3 provides implementations of DQN, PPO, A2C, SAC, TD3, and other algorithms.

# pip install stable-baselines3 gymnasium

import gymnasium as gym
from stable_baselines3 import DQN

env = gym.make("CartPole-v1")

model = DQN(
    "MlpPolicy",
    env,
    learning_rate=1e-3,
    buffer_size=50_000,
    learning_starts=1_000,
    batch_size=32,
    gamma=0.99,
    verbose=1,
)

model.learn(total_timesteps=20_000)
model.save("dqn_cartpole")

loaded_model = DQN.load("dqn_cartpole", env=env)
obs, info = env.reset()

# Roll out the trained policy greedily for a few hundred steps.
for _ in range(500):
    action, _ = loaded_model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()

Python: PPO for Continuous or More Stable Control

PPO is often a strong first choice because it is relatively stable across many environments. The example below uses CartPole for brevity; the same call works for continuous-action environments such as Pendulum-v1.

import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")

agent = PPO("MlpPolicy", env, verbose=1)
agent.learn(total_timesteps=20_000)

agent.save("ppo_cartpole")

For wireless-system research, the main work is usually not calling DQN() or PPO(). The hard part is designing the environment: state representation, action space, reward function, constraints, and realistic channel/network dynamics.
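
To make that concrete, here is a minimal sketch of a custom Gymnasium environment skeleton. The two-dimensional state, the four power levels, and the "rate minus energy penalty" reward are placeholders, not a real wireless model:

import numpy as np
import gymnasium as gym
from gymnasium import spaces

class ToyLinkEnv(gym.Env):
    """Illustrative only: choose one of four power levels to trade rate against energy."""

    def __init__(self):
        super().__init__()
        # State: a made-up pair of (channel quality, last power level), both in [0, 1].
        self.observation_space = spaces.Box(low=0.0, high=1.0, shape=(2,), dtype=np.float32)
        # Action: one of four discrete transmit power levels.
        self.action_space = spaces.Discrete(4)
        self._t = 0

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self._t = 0
        obs = self.np_random.uniform(0.0, 1.0, size=2).astype(np.float32)
        return obs, {}

    def step(self, action):
        self._t += 1
        channel = self.np_random.uniform(0.0, 1.0)
        power = action / 3.0
        # Placeholder reward: a concave "rate" term minus a small energy penalty.
        reward = float(np.log1p(channel * power) - 0.1 * power)
        obs = np.array([channel, power], dtype=np.float32)
        terminated = False
        truncated = self._t >= 100  # fixed episode length
        return obs, reward, terminated, truncated, {}

Once an environment follows this interface, it can be passed directly to DQN or PPO in place of gym.make(...).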

Takeaway

Reinforcement learning is about learning actions from reward. Value-based methods estimate how good actions are, policy-gradient methods learn behavior directly, and deep RL uses neural networks for large or complex state spaces. It is powerful, but it needs careful reward design, reliable simulation, and strong evaluation.

References and Further Reading