Reinforcement Learning (RL) is a type of machine learning in which an agent learns to make decisions by interacting with an environment. This article covers the basic concepts of RL, including states, actions, rewards, policies, and the Markov Decision Process (MDP). By the end, you will understand how RL works and how to implement it in Python.
Key Concepts in Reinforcement Learning
Reinforcement Learning (RL) involves several core ideas that shape how machines learn from experience and make decisions:
- Agent: The decision-maker that interacts with its environment.
- Environment: The external system with which the agent interacts.
- State: A representation of the current situation of the environment.
- Action: Choices that the agent can take in a given state.
- Reward: Immediate feedback the agent gets after taking an action in a state.
- Policy: A set of rules the agent follows to decide its actions based on states.
- Value Function: Estimates the expected long-term reward from a specific state under a policy.
Markov Decision Process
A Markov Decision Process (MDP) is a mathematical framework that gives a structured way to describe the environment in reinforcement learning.
An MDP is defined by the tuple (S, A, T, R, γ), whose components are described below; a small worked example follows the list.
- States (S): A set of all possible states in the environment.
- Actions (A): A set of all possible actions the agent can take.
- Transition Model (T): The probability of transitioning from one state to another after taking a given action.
- Reward Function (R): The immediate reward received after transitioning from one state to another.
- Discount Factor (γ): A factor between 0 and 1 that represents the importance of future rewards.
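To make the tuple concrete, here is a small, made-up two-state MDP written out as plain Python data structures. The state names, probabilities, and rewards below are illustrative only and are not part of the FrozenLake example used later in this article.

# A hypothetical two-state MDP spelled out as plain Python data structures.
# All names and numbers here are illustrative.
S = ["s0", "s1"]                      # states
A = ["stay", "move"]                  # actions
T = {                                 # T[(state, action)] -> {next_state: probability}
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "move"): {"s1": 0.9, "s0": 0.1},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 1.0},
}
R = {                                 # R[(state, action, next_state)] -> immediate reward
    ("s0", "move", "s1"): 1.0,        # unlisted transitions give reward 0
}
gamma = 0.9                           # discount factor between 0 and 1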
Bellman Equation
The Bellman equation calculates the value of being in a state or taking an action based on the expected future rewards.
It breaks down the expected total reward into two parts: the immediate reward received and the discounted value of future rewards. This decomposition helps agents make decisions that maximize their long-term return.
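Using the MDP components defined above, the Bellman optimality equation for the value of a state s can be written (with the reward expressed as a function of state and action) as:

V(s) = max_a [ R(s, a) + γ · Σ_{s'} T(s' | s, a) · V(s') ]

Here R(s, a) is the immediate reward, and the summation is the discounted, probability-weighted value of the possible next states.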
Steps of Reinforcement Learning
- Define the Environment: Specify the states, actions, transition rules, and rewards.
- Initialize Policies and Value Functions: Set up initial strategies for decision-making and value estimations.
- Observe the Initial State: Gather information about the initial conditions of the environment.
- Choose an Action: Decide on an action based on current strategies.
- Observe the Outcome: Receive feedback in the form of a new state and reward from the environment.
- Update Strategies: Adjust decision-making policies and value estimations based on the received feedback, as in the minimal loop sketched below.
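The sketch below walks through these steps using the gym library and the classic gym interface (env.reset() returning the state and env.step() returning four values), matching the Q-Learning implementation later in this article. A random action stands in for a real decision strategy, and no policy or value function is initialized here.

import gym

env = gym.make("FrozenLake-v1", is_slippery=False)   # Define the Environment
state = env.reset()                                   # Observe the Initial State
done = False

while not done:
    action = env.action_space.sample()                # Choose an Action (random placeholder)
    next_state, reward, done, _ = env.step(action)    # Observe the Outcome (new state and reward)
    # Update Strategies would happen here (see the Q-Learning section below)
    state = next_state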
Reinforcement Learning Algorithms
Several algorithms are commonly used in reinforcement learning:
- Q-Learning: A model-free algorithm that learns the value of actions in a state-action space.
- Deep Q-Network (DQN): An extension of Q-Learning using deep neural networks to handle large state spaces.
- Policy Gradient Methods: Directly optimize the policy by adjusting the policy parameters using gradient ascent.
- Actor-Critic Methods: Combine value-based and policy-based methods. The actor updates the policy, and the critic evaluates the action.
Q-Learning Algorithm
Q-Learning is a key algorithm in reinforcement learning. It is a model-free method, meaning it does not need a model of the environment; instead, it learns the value of actions by interacting with the environment directly. Its main goal is to find the action-selection policy that maximizes cumulative reward.
Key Concepts
- Q-Value: The Q-value, denoted as Q(s,a), represents the expected cumulative reward of taking a specific action in a specific state and following the policy thereafter.
- Q-Table: A table where each cell Q(s,a) corresponds to the Q-value for a state-action pair. This table is continually updated as the agent learns from its experiences.
- Learning Rate (α): A factor between 0 and 1 that determines how much new information overrides old information.
- Discount Factor (γ): A factor between 0 and 1 that reduces the value of future rewards. Both α and γ appear in the Q-value update rule shown below.
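After taking action a in state s and observing reward r and next state s', the Q-value is updated as:

Q(s, a) ← Q(s, a) + α · [ r + γ · max_a' Q(s', a') - Q(s, a) ]

This is exactly the update performed inside the training loop in the implementation below.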
Implementation of Q-Learning with Python
Import required libraries
Import the necessary libraries: gym is used to create and interact with the environment, and numpy is used for numerical operations. This example uses the classic gym interface, in which env.reset() returns the state directly and env.step() returns four values.
import gym
import numpy as np
Initialize the Environment and Q-Table
Create the FrozenLake environment and initialize the Q-table with zeros.
env = gym.make("FrozenLake-v1", is_slippery=False)
Q = np.zeros((env.observation_space.n, env.action_space.n))
Define Hyperparameters
Define the hyperparameters for the Q-Learning algorithm.
learning_rate = 0.8
discount_factor = 0.95
epsilon = 0.1
episodes = 10000
max_steps = 100
Implementing Q-Learning
Implement the Q-Learning algorithm on the above setup.
for episode in range(episodes):
    state = env.reset()
    done = False

    for _ in range(max_steps):
        # Choose action (epsilon-greedy strategy)
        if np.random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(Q[state, :])

        # Perform action and observe the outcome
        next_state, reward, done, _ = env.step(action)

        # Update Q-value using the Bellman equation
        Q[state, action] = Q[state, action] + learning_rate * (reward + discount_factor * np.max(Q[next_state, :]) - Q[state, action])

        # Transition to next state
        state = next_state

        # If the episode is finished, break the loop
        if done:
            break
Evaluate the Trained Agent
Calculate the total reward collected as the agent interacts with the environment.
state = env.reset()
done = False
total_reward = 0

# Follow the greedy policy derived from the learned Q-table
while not done:
    action = np.argmax(Q[state, :])
    next_state, reward, done, _ = env.step(action)
    total_reward += reward
    state = next_state
    env.render()

print("Total reward:", total_reward)
Conclusion
This article introduces fundamental principles and offers a beginner-friendly example of reinforcement learning. As you explore further, you’ll encounter advanced methods such as deep reinforcement learning. This approach integrates RL with neural networks to manage complex state and action spaces effectively.