How Regret Bound works, part 8 (Machine Learning Optimization) | by Monodeep Mukherjee | May 2024


  1. Tackling Heavy-Tailed Rewards in Reinforcement Learning with Function Approximation: Minimax Optimal and Instance-Dependent Regret Bounds

Authors: Jiayi Huang, Han Zhong, Liwei Wang, Lin F. Yang

Abstract: While numerous works have focused on devising efficient algorithms for reinforcement learning (RL) with uniformly bounded rewards, it remains an open question whether sample- or time-efficient algorithms for RL with large state-action spaces exist when the rewards are \emph{heavy-tailed}, i.e., with only finite $(1+\epsilon)$-th moments for some $\epsilon \in (0,1]$. In this work, we address the challenge of such rewards in RL with linear function approximation. We first design an algorithm, \textsc{Heavy-OFUL}, for heavy-tailed linear bandits, achieving an \emph{instance-dependent} $T$-round regret of $\tilde{O}\big(d T^{\frac{1-\epsilon}{2(1+\epsilon)}} \sqrt{\sum_{t=1}^T \nu_t^2} + d T^{\frac{1-\epsilon}{2(1+\epsilon)}}\big)$, the \emph{first} of this kind. Here, $d$ is the feature dimension, and $\nu_t^{1+\epsilon}$ is the $(1+\epsilon)$-th central moment of the reward at the $t$-th round. We further show the above bound is minimax optimal when applied to the worst-case instances in stochastic and deterministic linear bandits. We then extend this algorithm to the RL setting with linear function approximation. Our algorithm, termed \textsc{Heavy-LSVI-UCB}, achieves the \emph{first} computationally efficient, \emph{instance-dependent} $K$-episode regret of $\tilde{O}\big(d \sqrt{H \mathcal{U}^*} K^{\frac{1}{1+\epsilon}} + d \sqrt{H \mathcal{V}^* K}\big)$. Here, $H$ is the length of the episode, and $\mathcal{U}^*, \mathcal{V}^*$ are instance-dependent quantities scaling with the central moments of the reward and value functions, respectively. We also provide a matching minimax lower bound $\Omega\big(d H K^{\frac{1}{1+\epsilon}} + d \sqrt{H^3 K}\big)$ to demonstrate the optimality of our algorithm in the worst case. Our result is achieved via a novel robust self-normalized concentration inequality that may be of independent interest in handling heavy-tailed noise in general online regression problems.
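
To get a feel for the setting, here is a minimal, hypothetical sketch of a heavy-tailed linear bandit loop in Python. It is not the paper's Heavy-OFUL; it only illustrates why a robust (Huber-type) reweighting of observations matters when reward noise has very few finite moments. The arm set, the Lomax noise model, the bonus schedule beta, the threshold tau, and the helper names huber_weight / heavy_tailed_linear_bandit_demo are all assumptions made for this demo, not taken from the paper.

```python
import numpy as np

def huber_weight(residual, tau):
    """Downweight observations whose residual exceeds tau (illustrative robustification)."""
    r = abs(residual)
    return 1.0 if r <= tau else tau / r

def heavy_tailed_linear_bandit_demo(T=2000, d=5, eps=0.5, seed=0):
    """Toy heavy-tailed linear bandit: reward noise has a finite (1+eps)-th moment
    but (for small eps) infinite variance. A simple OFUL-style loop with Huber-type
    reweighting of each new observation; all constants are arbitrary demo choices."""
    rng = np.random.default_rng(seed)
    theta_star = rng.normal(size=d)
    theta_star /= np.linalg.norm(theta_star)

    lam = 1.0
    V = lam * np.eye(d)      # regularized design matrix
    b = np.zeros(d)          # reweighted sum of x_t * r_t
    regret = 0.0

    for t in range(1, T + 1):
        arms = rng.normal(size=(20, d))
        arms /= np.linalg.norm(arms, axis=1, keepdims=True)

        theta_hat = np.linalg.solve(V, b)
        beta = 1.0 + np.sqrt(np.log(t + 1))          # heuristic exploration bonus
        V_inv = np.linalg.inv(V)
        widths = np.sqrt(np.einsum('ij,jk,ik->i', arms, V_inv, arms))
        x = arms[np.argmax(arms @ theta_hat + beta * widths)]

        # Lomax (Pareto II) noise with shape 1.5 + eps, centered at zero:
        # moments of order < 1.5 + eps are finite, so the (1+eps)-th moment exists,
        # while the variance need not.
        shape = 1.5 + eps
        noise = rng.pareto(shape) - 1.0 / (shape - 1.0)
        r = x @ theta_star + noise

        # Huber-style reweighting of the new observation before the update.
        w = huber_weight(r - x @ theta_hat, tau=5.0)
        V += w * np.outer(x, x)
        b += w * r * x

        regret += np.max(arms @ theta_star) - x @ theta_star

    return regret

if __name__ == "__main__":
    print("cumulative regret (toy run):", heavy_tailed_linear_bandit_demo())
```

If the reweighting is removed (w = 1), occasional huge noise draws tend to dominate the least-squares estimate and degrade the regret; that fragility is the kind of effect the paper's robust self-normalized concentration inequality is designed to control.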

2. Regret Bounds for Risk-sensitive Reinforcement Learning with Lipschitz Dynamic Risk Measures

Authors: Hao Liang, Zhi-quan Luo

Abstract: We study finite episodic Markov decision processes that incorporate dynamic risk measures to capture risk sensitivity. To this end, we present two model-based algorithms for \emph{Lipschitz} dynamic risk measures, a broad class of risk measures that subsumes spectral risk measures, optimized certainty equivalents, and distortion risk measures, among others. We establish both regret upper bounds and lower bounds. Notably, our upper bounds demonstrate optimal dependence on the number of actions and episodes, while reflecting the inherent trade-off between risk sensitivity and sample complexity. Additionally, we substantiate our theoretical results through numerical experiments.
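
To make the objects concrete, here is a small Python sketch, assuming nothing beyond the abstract, of two static risk measures inside the Lipschitz class the paper covers: conditional value-at-risk (a standard example of a spectral and distortion risk measure) and the entropic risk (an optimized certainty equivalent with exponential utility). This is not the paper's model-based algorithm; it only shows what such risk measures evaluate on a distribution of returns.

```python
import numpy as np

def cvar(samples, alpha=0.1):
    """Conditional value-at-risk at level alpha: average of the worst
    alpha-fraction of returns (lower tail)."""
    samples = np.sort(np.asarray(samples, dtype=float))
    k = max(1, int(np.ceil(alpha * len(samples))))
    return samples[:k].mean()

def entropic_risk(samples, beta=1.0):
    """Entropic risk -(1/beta) * log E[exp(-beta * X)], the optimized certainty
    equivalent for the exponential utility u(x) = (1 - exp(-beta * x)) / beta."""
    samples = np.asarray(samples, dtype=float)
    return -np.log(np.mean(np.exp(-beta * samples))) / beta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    returns = rng.normal(loc=1.0, scale=2.0, size=10_000)   # toy episode returns
    print("mean        :", returns.mean())
    print("CVaR(0.1)   :", cvar(returns, alpha=0.1))          # pessimistic tail average
    print("entropic(1) :", entropic_risk(returns, beta=1.0))  # risk-averse certainty equivalent
```

Both quantities sit below the mean on this toy return distribution, which is the risk-averse behavior a dynamic (per-step, nested) application of such measures carries into the episodic MDP setting studied in the paper.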
