Applying reinforcement learning (RL)-based techniques to recommender systems (RecSys) has gained increasing attention over the years. And while industrial R&D teams have demonstrated that RL can play a significant role in RecSys (see appendix), problem formulation and code implementation can still be a challenge.
This article will introduce a framework for understanding how to apply RL to RecSys. To do this, we’ll highlight a few foundational RL concepts (just the ones we need), clarify their implications for RecSys, and illustrate how they can be applied to the supervised learning (SL) methods we may already know. Our goal is to understand SL-based RecSys modeling in RL terms. This will prepare us for subsequent posts, in which we dive deeper into applying RL to RecSys.
We won’t review code here, but the interested reader can check out tf_vertex_agents, where I’m creating end-to-end (e2e) code examples for implementing RL and bandit algorithms with TF-Agents and Vertex AI.
In the general RL framework, an agent learns to perform actions by interacting with an environment, where the goal is to find the sequence of actions that leads to an optimal objective (e.g., maximum cumulative reward). The loop looks like this (and is sketched in code after the list below):
- Environment starts with an initial state: s(t)
- Agent takes an action: a(t)
- Environment returns a reward: r(t+1)
- Environment transitions to a new state: s(t+1), and the loop repeats
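To make this loop concrete, here is a minimal Python sketch. The `Environment` and `Agent` interfaces are hypothetical placeholders (not tied to TF-Agents or any other library); they only illustrate where the state, action, and reward flow through the loop.

```python
# Minimal sketch of the agent-environment loop. `Environment` and `Agent`
# are hypothetical interfaces used only to illustrate the flow of
# state -> action -> reward -> next state.

class Environment:
    def reset(self):
        """Return the initial state s(t)."""
        raise NotImplementedError

    def step(self, action):
        """Apply action a(t); return (reward r(t+1), next state s(t+1), done)."""
        raise NotImplementedError


class Agent:
    def act(self, state):
        """Select an action for the current state (this is the policy)."""
        raise NotImplementedError

    def learn(self, state, action, reward, next_state):
        """Update the policy from the observed transition."""
        raise NotImplementedError


def run_episode(env: Environment, agent: Agent) -> float:
    state = env.reset()                                  # initial state s(t)
    total_reward, done = 0.0, False
    while not done:
        action = agent.act(state)                        # agent takes action a(t)
        reward, next_state, done = env.step(action)      # env returns r(t+1), s(t+1)
        agent.learn(state, action, reward, next_state)   # agent updates from experience
        total_reward += reward
        state = next_state
    return total_reward  # the agent's objective: maximize the sum of rewards
```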
RL can play a significant role in RecSys because it provides a mechanism for better decision-making that can (1) optimize for long-term objectives, (2) adapt quickly to dynamic user preferences, and (3) efficiently explore regions of uncertainty. To make this clearer, let’s review a few foundational RL concepts.
Environment simulation
In RL, we design agents that learn by interacting with an environment. The environment represents the task or problem to be solved. At every time step, it provides the agent with an observation, the agent chooses an action, the action is applied to the environment, and the environment returns a reward and a new observation. By iterating through the environment’s time steps, the agent eventually learns a policy that selects the actions which maximize the sum of all rewards.
With RecSys, we use simulated environments to stand in for the online serving environment, essentially preparing the agent for its real-world task(s). Conceptually, this is like preparing a robotic agent for the real world by teaching it how to walk in a “3D replica” of the world.
For use cases with large action spaces, simulating specific individual events can be extremely challenging, but it’s also unnecessary. Instead, we try to generate events with frequencies similar to those observed in real-world data. The goal is typically to address broader questions with aggregated statistics (e.g., “how does a policy adapt to an abrupt shift in user behavior vs. an incremental drift over time?”). Ultimately, we want to simulate the known factors of variation in the environment that are important for decision-making.
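As a toy illustration (hypothetical code, not from the repo), the sketch below simulates users whose topic preferences either drift incrementally or shift abruptly, which is exactly the kind of aggregate behavior we’d want to test a policy against:

```python
import numpy as np

class DriftingUserSimulator:
    """Toy user simulator: clicks are sampled from topic preferences that
    change over time, either incrementally (drift) or all at once (shift)."""

    def __init__(self, num_topics: int = 10, drift_rate: float = 0.01, seed: int = 0):
        self.rng = np.random.default_rng(seed)
        self.prefs = self.rng.dirichlet(np.ones(num_topics))  # current topic preferences
        self.drift_rate = drift_rate

    def step_preferences(self, abrupt_shift: bool = False):
        if abrupt_shift:
            # Abrupt shift: resample the preference distribution entirely.
            self.prefs = self.rng.dirichlet(np.ones(len(self.prefs)))
        else:
            # Incremental drift: nudge preferences with small noise, then renormalize.
            noise = self.rng.normal(0.0, self.drift_rate, size=len(self.prefs))
            self.prefs = np.clip(self.prefs + noise, 1e-6, None)
            self.prefs /= self.prefs.sum()

    def click(self, recommended_topic: int) -> int:
        # Click probability is proportional to the user's preference for the topic.
        return int(self.rng.random() < self.prefs[recommended_topic])
```

Running a candidate policy against a simulator like this and comparing aggregate click-through before and after a shift is the kind of broader, statistical question the simulation is meant to answer.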
RL Agents
RL agents are the learner entities that interact with environments, and their goal is to learn the optimal sequence of actions that leads to the highest reward. There are three properties of an agent to highlight for our discussion:
- Policy: maps the current state (observation) to an action
- Value function: estimates the value of taking an action in a given state
- Exploration strategy: determines when the agent deviates from its highest-value action to gather new information
To understand these in terms of RecSys, let’s consider a policy that produces a slate of items to display to users. Let’s also assume we have a list of brand-new items recently added to our catalog. For a given user, the value function predicts scores for individual items. The policy fills the slate in rank order according to the predicted values, and epsilon (ε)% of the time it replaces the 3rd item in the slate with a random item from our list (this general concept is called epsilon-greedy).
Here, the value function tells us the predicted value of each action (item) for an environment state (user and context). The policy uses these predicted values and an epsilon-greedy algorithm to produce the final ranking response, where the 3rd slate position has a probability of ε of being replaced with a random item from our list.
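A minimal sketch of this slate-building logic might look like the following (the function and argument names are hypothetical; the scored items are assumed to come from whatever value model we already have):

```python
import random

def build_slate(scored_items, new_items, slate_size=5, epsilon=0.1, explore_slot=2):
    """Fill a slate by predicted value, with epsilon-greedy exploration.

    scored_items: list of (item_id, predicted_value) pairs from the value function
    new_items:    recently added items with little or no feedback
    explore_slot: 0-indexed slate position to replace (2 -> the 3rd position)
    """
    # Exploit: rank candidates by the value function's predictions.
    ranked = sorted(scored_items, key=lambda pair: pair[1], reverse=True)
    slate = [item_id for item_id, _ in ranked[:slate_size]]

    # Explore: with probability epsilon, swap in a random brand-new item.
    if new_items and random.random() < epsilon:
        slate[explore_slot] = random.choice(new_items)
    return slate
```

The slot to replace and the value of ε are tuning choices; the point is that the policy is the combination of the value function’s predictions and the rule that turns them into a response.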
Over time, as the agent learns from its actions and updates its policy, it learns when to recommend the brand-new items in the future.
Exploring information to exploit
Recommended reading | viewing: Exploration in Recommender Systems by Minmin Chen (paper | recording)
The explore-vs-exploit tradeoff is faced by all RL and bandit agents: exploit actions known to produce high rewards, or explore unseen actions to potentially discover better outcomes. Exploration is needed in RL because the agent has an incomplete view of the environment. As an agent explores different actions and observes their outcomes, it learns more about the environment and, in turn, gets closer to an optimal policy.
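Epsilon-greedy explores uniformly at random; another common approach is to direct exploration toward the actions we are most uncertain about. Below is a minimal sketch of the classic UCB1 rule for a generic bandit setting (illustrative only, not tied to the repo):

```python
import math

class UCB1:
    """UCB1 bandit: pick the action with the highest optimistic value estimate,
    so rarely-tried (more uncertain) actions get explored first."""

    def __init__(self, num_actions: int):
        self.counts = [0] * num_actions    # times each action was tried
        self.values = [0.0] * num_actions  # running mean reward per action
        self.total = 0

    def select_action(self) -> int:
        # Try every action once before applying the UCB rule.
        for action, count in enumerate(self.counts):
            if count == 0:
                return action
        # Estimated value plus an uncertainty bonus that shrinks with more data.
        ucb = [
            self.values[a] + math.sqrt(2 * math.log(self.total) / self.counts[a])
            for a in range(len(self.counts))
        ]
        return max(range(len(ucb)), key=lambda a: ucb[a])

    def update(self, action: int, reward: float):
        self.total += 1
        self.counts[action] += 1
        # Incremental mean update.
        self.values[action] += (reward - self.values[action]) / self.counts[action]
```

Unlike epsilon-greedy, the exploration here is targeted: the bonus term shrinks as an action accumulates observations, so the agent spends its exploration budget where uncertainty is highest.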
Balancing this tradeoff is crucial for optimizing long-term objectives in the face of uncertainty. Common sources of uncertainty in RecSys include:
- Brand-new items (and new users) with little or no interaction history
- Dynamic user preferences that shift over time
- Partial feedback: we only observe outcomes for the items we actually recommended
Training SL-based RecSys models with historical interaction data reinforces patterns learned from past system behavior, creating a feedback loop that leads to two limitations (a toy simulation of this loop follows the list below):
- Myopic recommendations: users are shown the same familiar content likely to lead to an immediate response (instead of content with longer-term user value)
- System bias: we only observe feedback on items previously recommended
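To make the feedback loop concrete, here is a toy simulation (purely illustrative, with made-up numbers): a greedy policy recommends only its highest-scored item and updates its estimate from the observed click, so items it never shows receive no feedback and their estimates never improve.

```python
import numpy as np

rng = np.random.default_rng(0)
num_items = 20
true_ctr = rng.uniform(0.01, 0.30, size=num_items)   # unknown ground truth
est_ctr = rng.uniform(0.01, 0.30, size=num_items)    # model's (imperfect) initial estimates
feedback_counts = np.zeros(num_items)

for _ in range(10_000):
    # Pure exploitation: always recommend the item with the highest estimated CTR.
    item = int(np.argmax(est_ctr))
    click = float(rng.random() < true_ctr[item])

    # We only observe feedback for the item we recommended (system bias),
    # so only that item's estimate ever gets corrected.
    feedback_counts[item] += 1
    est_ctr[item] += (click - est_ctr[item]) / feedback_counts[item]

never_shown = int((feedback_counts == 0).sum())
print(f"{never_shown}/{num_items} items never received any feedback")
```

Items whose initial estimates happen to be low are never shown, so the model can never discover whether they are actually good; the loop only reinforces what it already believes.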
Exploration is a key mechanism for escaping this feedback loop and optimizing for long-term objectives. To do this efficiently, we consider three kinds of exploration:
- User exploration: helping users discover new interests beyond the familiar content they already engage with
- System exploration: surfacing items the system has rarely (or never) recommended, so we can collect the feedback we’re missing
- Online exploration: exploring at serving time, using real-time feedback to adjust the policy
Online exploration is especially critical to an efficient exploration strategy because it uses real-time feedback to self-correct errors made during user and system exploration. We’ll revisit this in subsequent sections.
Exploration implies taking risks to acquire new information about the environment. Because these risks translate to real costs in our RecSys, understanding the cost and value of this information helps us guide RL agents to explore efficiently.