Unpacking MDPs for Beginners


MDPs Explained by a Beginner for Beginners!!

Hi everyone! I’m new to reinforcement learning, and of course I’m curious to dig deeper into RL. I’ve been thinking: how can a machine learn by itself? The answer is machine learning. So how can a machine improve by itself, iteratively? That is reinforcement learning. Okay, let me give an example: think about a toddler learning to stand up and walk. The child doesn’t know how at the start; it keeps trying, applying various strategies to get that toy on the table (you know, that’s the goal), and finally gets the toy after many failures. So basically it is a trial-and-error method. I think you now have a basic idea of reinforcement learning.

Let’s understand some terms: Reinforcement learning maps states (situations) to actions in order to maximize some numerical reward. That is, the algorithm knows about the current input (the state) and the possible things it can do (the actions), and its aim is to maximize the reward. There is a clear distinction drawn between the agent, which does the learning, and the environment, which is where the agent acts and which produces the states and the rewards.
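
One way to picture the agent/environment split is with a tiny code sketch. This is purely illustrative: the Environment and Agent classes, the room names, and the rewards below are all made up for this post, not part of any library.

```python
import random

class Environment:
    """The environment: produces states and rewards. A made-up two-room world."""
    def reset(self):
        return "Room A"  # the starting state

    def step(self, state, action):
        # Made-up dynamics: where you end up is a bit random,
        # and searching in Room A (where the treasure is) pays off.
        next_state = random.choice(["Room A", "Room B"])
        reward = 10 if (state == "Room A" and action == "search") else -1
        return next_state, reward

class Agent:
    """The agent: decides which action to take in the current state."""
    def act(self, state):
        return random.choice(["search", "move"])  # no learning yet, just acting

env, agent = Environment(), Agent()
state = env.reset()
for t in range(5):  # a short episode
    action = agent.act(state)                # the agent picks an action...
    state, reward = env.step(state, action)  # ...the environment answers with a new state and a reward
    print(t, action, state, reward)
```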

Just as the toddler learns through trial and error, robots in RL face similar challenges. A classic example in reinforcement learning involves an agent, such as a robot, operating in a real-world environment filled with vast amounts of data and uncertainty. Let’s consider a simple scenario: an apple is on a table, and it falls to the other side, out of the robot’s sight. The robot doesn’t realize the apple has fallen; it thinks the apple has simply disappeared. In reality, the robot should move to the other side of the table to check for the apple. This situation highlights a partially observable environment, where the robot can’t see the full state of the world (the apple’s location). In RL, the robot learns through rewards: it might receive a small reward for exploring the other side of the table, or a larger reward at the end for finding the apple, depending on the task’s design. This setup encourages the robot to explore and adapt, even when it can’t see everything. The apple scenario hints at a partially observable MDP, but let’s start with the basics of MDPs, where the robot can see everything.

An agent has to decide the best action to select based on its current state. When this decision step is repeated over and over, the problem is known as a Markov Decision Process (MDP).

MDP Terms

Okay, we’ll understand these through an example: a TREASURE HUNT ADVENTURE.

States (S): These are all the possible spots you can be in within the maze. For example, in our maze, states are the different rooms the robot can be in, such as Room A or Room B.

Actions (A): These are the moves you can make in each spot. In the maze, that might be “search” or “move” to the next room; for a helicopter, the actions might be pushing the control stick in various directions.

Transition Probabilities (P): This is the tricky part of the maze: when you make a move, you don’t always know exactly where you’ll end up! For each spot and action, there’s a chance of landing in different rooms, and every room has its own probabilities.

Rewards (R): These are the gold coins you receive along the way. Every time you take an action in a room, you get a reward. For example, finding treasure in Room A might give you +10, but slipping into the wrong room might cost you -1.

Discount factor (γ): This is a number between 0 and 1 (like 0.9) that makes future coins worth less than coins you grab now. As you keep playing, a high γ makes you value future coins almost as much as current ones, so you’re willing to accept small penalties now and take a longer path if it leads to a bigger payoff later. A low γ might lead you to rush for whatever reward is nearest, ignoring the long-term benefit.
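
To see what γ actually does, here is a tiny, made-up calculation: the same reward sequence (three -1 “slips” and then the +10 treasure) adds up very differently depending on γ.

```python
# Discounted return: G = r0 + γ·r1 + γ²·r2 + ...
rewards = [-1, -1, -1, 10]  # made-up: three small penalties, then the treasure

def discounted_return(rewards, gamma):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return(rewards, 0.9))  # ≈ 4.58  -> with a high γ, the +10 at the end still matters a lot
print(discounted_return(rewards, 0.1))  # ≈ -1.10 -> with a low γ, the future treasure barely counts
```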

Let’s walk through the process quickly (there’s a small code sketch right after the list):

  1. You start in a room (let’s say Room A).
  2. You pick an action, like “search.”
  3. The maze decides where you go next based on the transition probabilities, say Room A (80%) or Room B (20%). Let’s say you land in Room B.
  4. You get a reward of -1 coin because there’s no treasure in Room B.
  5. This keeps going: Room C, Room D, and so on.
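
Here is a minimal sketch of that loop in Python, just to make S, A, P, R and γ concrete. The numbers are the made-up ones from the story (treasure in Room A, 80/20 transitions), not anything standard.

```python
import random

states  = ["A", "B"]
actions = ["search", "move"]

# P[(state, action)] -> {next_state: probability}
P = {("A", "search"): {"A": 0.8, "B": 0.2},
     ("A", "move"):   {"A": 0.2, "B": 0.8},
     ("B", "search"): {"A": 0.2, "B": 0.8},
     ("B", "move"):   {"A": 0.8, "B": 0.2}}

# R[(state, action)] -> immediate reward (the treasure is in Room A)
R = {("A", "search"): 10, ("A", "move"): -1,
     ("B", "search"): -1, ("B", "move"): -1}

gamma = 0.9
state, total = "A", 0.0
for t in range(5):
    action = random.choice(actions)  # no strategy yet, just trying things
    next_state = random.choices(list(P[(state, action)]),
                                weights=list(P[(state, action)].values()))[0]
    reward = R[(state, action)]
    total += (gamma ** t) * reward   # discounted running total of coins
    print(f"step {t}: in Room {state}, do '{action}', land in Room {next_state}, reward {reward}")
    state = next_state
print("discounted return:", round(total, 2))
```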

Let’s explore the two classic algorithms for picking the best actions. Before that, you should know about the Bellman equation:

The Bellman equation is a key concept in reinforcement learning (RL) used to solve Markov Decision Processes (MDPs). It helps find the optimal policy (best actions to take) and optimal value functions (how good each state is).

Bellman equation

The optimal value function V*(s) gives the maximum possible value (expected future reward) for a state s. It’s calculated by the three steps below (written out as a single equation after the list):

  1. Choosing the action that maximizes the value.
  2. Taking the immediate reward for that action in the current state.
  3. Adding the discounted future value of the next state (using a discount factor, usually denoted γ, to weigh future rewards less).
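
Putting those three steps together, in the usual textbook notation (with the rewards R, transition probabilities P, and discount factor γ from the treasure hunt terms above), the Bellman optimality equation is:

```latex
V^{*}(s) = \max_{a \in A} \Big[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V^{*}(s') \Big]
```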

Now, how does the robot find the best actions? One way is value iteration, where we keep updating the values until they become optimal.

Algorithm of Value Iteration

How It Works:

  1. Start with V(s) = 0 for every state.
  2. Update each state using the Bellman equation above: look at the reward now, plus the discounted value of the future states, and take the action that maximizes that total.
  3. Repeat until the values stop changing much.
  4. Pick the actions that lead to the highest values. (A tiny code sketch of this loop follows the list.)
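
Here is a short value iteration sketch on the two-room maze from earlier (same made-up P, R, and γ); it is just an illustration of the loop, not a production implementation.

```python
states, actions, gamma = ["A", "B"], ["search", "move"], 0.9
P = {("A", "search"): {"A": 0.8, "B": 0.2}, ("A", "move"): {"A": 0.2, "B": 0.8},
     ("B", "search"): {"A": 0.2, "B": 0.8}, ("B", "move"): {"A": 0.8, "B": 0.2}}
R = {("A", "search"): 10, ("A", "move"): -1, ("B", "search"): -1, ("B", "move"): -1}

def q_value(s, a, V):
    # reward now + discounted value of wherever you might land next
    return R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())

V = {s: 0.0 for s in states}                       # step 1: start with V(s) = 0
for sweep in range(1000):
    new_V = {s: max(q_value(s, a, V) for a in actions) for s in states}  # step 2: Bellman update
    if max(abs(new_V[s] - V[s]) for s in states) < 1e-6:                 # step 3: values barely change
        V = new_V
        break
    V = new_V

# step 4: pick the action with the highest value in each state
policy = {s: max(actions, key=lambda a: q_value(s, a, V)) for s in states}
print(V)       # how good each room is
print(policy)  # e.g. {'A': 'search', 'B': 'move'}
```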

Policy Iteration is a method for finding the best way to act in a situation, like a game or task, by repeatedly evaluating the current plan and then improving it.

Algorithm of Policy Iteration

How It Works:

  1. Start with a random policy (e.g., “always search”).
  2. Evaluate: Calculate the value of each state under this policy by solving the Bellman equation for that fixed policy.
  3. Improve: Update the policy to pick the best action based on those values.
  4. Repeat until the policy doesn’t change. (A matching code sketch follows the list.)
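
And here is a matching policy iteration sketch on the same toy maze. For the “Evaluate” step it uses simple iterative evaluation instead of solving the linear system exactly, which is a common shortcut.

```python
states, actions, gamma = ["A", "B"], ["search", "move"], 0.9
P = {("A", "search"): {"A": 0.8, "B": 0.2}, ("A", "move"): {"A": 0.2, "B": 0.8},
     ("B", "search"): {"A": 0.2, "B": 0.8}, ("B", "move"): {"A": 0.8, "B": 0.2}}
R = {("A", "search"): 10, ("A", "move"): -1, ("B", "search"): -1, ("B", "move"): -1}

policy = {s: "search" for s in states}            # step 1: start with a fixed guess
while True:
    # step 2 (evaluate): value of each state if we follow the current policy forever
    V = {s: 0.0 for s in states}
    for _ in range(500):
        V = {s: R[(s, policy[s])] + gamma * sum(p * V[s2]
                 for s2, p in P[(s, policy[s])].items())
             for s in states}
    # step 3 (improve): act greedily with respect to those values
    new_policy = {s: max(actions, key=lambda a: R[(s, a)] + gamma * sum(
                      p * V[s2] for s2, p in P[(s, a)].items()))
                  for s in states}
    if new_policy == policy:                      # step 4: stop when the policy stops changing
        break
    policy = new_policy

print(policy)  # e.g. {'A': 'search', 'B': 'move'}
print(V)
```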

QUICK SUMMARY:

  • Value Iteration: The robot guesses the value of each maze spot, updates each guess by combining the current reward with the best discounted future value, and repeats until the values settle. It then picks the best moves. It scales well to big mazes, but can take many update sweeps.
  • Policy Iteration: The robot starts with a random plan, evaluates it by calculating the expected rewards, tweaks the plan to improve it, and repeats until the plan stops changing. It usually needs fewer rounds, so it’s fast for small mazes, but each round involves solving a system of equations, which gets expensive for big ones.
