Intrinsic Rewards for Exploration without Harm from Observational Noise: A Simulation Study Based on the Free Energy Principle

2405.07473

Published 5/14/2024 by Theodore Jerome Tinker, Kenji Doya, Jun Tani

Abstract

In Reinforcement Learning (RL), artificial agents are trained to maximize numerical rewards by performing tasks. Exploration is essential in RL because agents must discover information before exploiting it. Two rewards encouraging efficient exploration are the entropy of action policy and curiosity for information gain. Entropy is well-established in literature, promoting randomized action selection. Curiosity is defined in a broad variety of ways in literature, promoting discovery of novel experiences. One example, prediction error curiosity, rewards agents for discovering observations they cannot accurately predict. However, such agents may be distracted by unpredictable observational noises known as curiosity traps. Based on the Free Energy Principle (FEP), this paper proposes hidden state curiosity, which rewards agents by the KL divergence between the predictive prior and posterior probabilities of latent variables. We trained six types of agents to navigate mazes: baseline agents without rewards for entropy or curiosity, and agents rewarded for entropy and/or either prediction error curiosity or hidden state curiosity. We find entropy and curiosity result in efficient exploration, especially both employed together. Notably, agents with hidden state curiosity demonstrate resilience against curiosity traps, which hinder agents with prediction error curiosity. This suggests implementing the FEP may enhance the robustness and generalization of RL models, potentially aligning the learning processes of artificial and biological agents.

Overview

The paper explores the role of entropy and curiosity in encouraging efficient exploration in reinforcement learning (RL) agents.
Two types of curiosity are examined: prediction error curiosity and hidden state curiosity based on the Free Energy Principle (FEP).
Agents are trained to navigate mazes, with different reward schemes including entropy, prediction error curiosity, and hidden state curiosity.
The authors find that entropy and curiosity, especially when used together, lead to efficient exploration, and that hidden state curiosity helps agents avoid "curiosity traps".

Plain English Explanation

In reinforcement learning, artificial agents are trained to perform tasks in order to maximize numerical rewards. Exploration is crucial in this process, as agents need to discover new information before they can exploit it. Two ways to encourage efficient exploration are rewarding agents for the entropy of their action policy (promoting randomized action selection) and curiosity for information gain.
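As a rough illustration of what an entropy reward measures, the sketch below computes the Shannon entropy of an action distribution. It assumes a small discrete action set purely for readability; the summary does not say whether the paper's agents use discrete or continuous actions, and the function name `policy_entropy` is hypothetical.

```python
import math

def policy_entropy(action_probs):
    """Shannon entropy (in nats) of a discrete action distribution.

    Assumption: actions are discrete. This is an illustrative choice,
    not a detail taken from the paper.
    """
    return -sum(p * math.log(p) for p in action_probs if p > 0.0)

# A uniform policy is maximally random; a deterministic one has zero entropy.
uniform = [0.25, 0.25, 0.25, 0.25]
greedy = [1.0, 0.0, 0.0, 0.0]

uniform_entropy = policy_entropy(uniform)
greedy_entropy = policy_entropy(greedy)
```

An agent rewarded in proportion to this quantity is pushed toward the uniform end of the spectrum, which is what "promoting randomized action selection" means in practice.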

Curiosity can be defined in various ways, such as prediction error curiosity, which rewards agents for discovering observations they cannot accurately predict. However, this approach can lead to agents being distracted by unpredictable "noise" in their observations, known as "curiosity traps".
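A minimal sketch of why noise is a problem for this kind of reward, assuming the prediction error is measured as a mean squared error (the summary does not specify the exact error measure, and `prediction_error_reward` is a hypothetical name):

```python
def prediction_error_reward(predicted, observed):
    """Mean squared error between a predicted and an actual observation.

    Assumption: squared error is used here for illustration only.
    """
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(predicted)

predicted = [1.0, 2.0]

# A perfectly predicted observation earns no curiosity reward.
clean_obs = [1.0, 2.0]
clean_reward = prediction_error_reward(predicted, clean_obs)

# The same scene with unpredictable noise added still earns a reward,
# even though there is nothing further to learn from it.
noisy_obs = [1.5, 1.5]
noisy_reward = prediction_error_reward(predicted, noisy_obs)
```

Because the noise cannot be predicted away, the reward in the noisy case never decays, so the agent keeps being paid for staying near the noise source.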

Based on the Free Energy Principle (FEP), this paper proposes hidden state curiosity, which rewards agents according to how much a new observation changes their beliefs about the environment's hidden (latent) state. Concretely, the reward is the KL divergence between the predictive prior and the posterior over latent variables.
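The reward can be written out as follows. The notation is an assumption introduced here for illustration: $z_t$ is the latent variable, $o_t$ the observation, $h_{t-1}$ the agent's previous internal state, $p$ the predictive prior, $q$ the posterior, and $\eta$ a scaling coefficient. The direction of the divergence (posterior relative to prior) and the diagonal Gaussian form in the second line are also assumptions, not details confirmed by this summary.

```latex
r^{\text{curiosity}}_t
  = \eta \, D_{\mathrm{KL}}\!\left[\, q(z_t \mid o_t, h_{t-1}) \,\middle\|\, p(z_t \mid h_{t-1}) \,\right]

% Closed form if both distributions are diagonal Gaussians
% q = N(\mu_q, \sigma_q^2), p = N(\mu_p, \sigma_p^2):
D_{\mathrm{KL}}[q \,\|\, p]
  = \sum_i \left( \log\frac{\sigma_{p,i}}{\sigma_{q,i}}
      + \frac{\sigma_{q,i}^2 + (\mu_{q,i} - \mu_{p,i})^2}{2\,\sigma_{p,i}^2}
      - \frac{1}{2} \right)
```

The divergence is zero when the observation leaves the agent's beliefs unchanged, so an observation only earns a reward to the extent that it is informative about the hidden state.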

The researchers trained six types of agents to navigate mazes: baseline agents without entropy or curiosity rewards, and agents rewarded for entropy, for prediction error curiosity, for hidden state curiosity, or for entropy combined with one of the two curiosity types. They found that entropy and curiosity, especially when used together, led to efficient exploration. Importantly, agents with hidden state curiosity were resilient to curiosity traps, which hindered agents with prediction error curiosity.

This suggests that incorporating the FEP into reinforcement learning models could enhance their robustness and ability to generalize, potentially aligning artificial agents' learning processes with those of biological agents.

Technical Explanation

The paper examines the role of entropy and curiosity in encouraging efficient exploration in reinforcement learning (RL) agents. Two types of curiosity are investigated: prediction error curiosity and hidden state curiosity based on the Free Energy Principle (FEP).

The authors trained six types of agents to navigate mazes:

Baseline agents without rewards for entropy or curiosity.
Agents rewarded for entropy of their action policy.
Agents rewarded for prediction error curiosity.
Agents rewarded for hidden state curiosity.
Agents rewarded for both entropy and prediction error curiosity.
Agents rewarded for both entropy and hidden state curiosity.
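The six agent types listed above can be read as combinations of two switches: whether entropy is rewarded, and which kind of curiosity (if any) is rewarded. The sketch below encodes that reading. The additive reward form, the coefficient names `alpha` and `eta`, their default values, and all identifiers are assumptions made for illustration, not the paper's implementation.

```python
# Each agent type: is entropy rewarded, and which curiosity (if any)?
AGENT_TYPES = {
    "baseline": {"entropy": False, "curiosity": None},
    "entropy": {"entropy": True, "curiosity": None},
    "prediction_error": {"entropy": False, "curiosity": "prediction_error"},
    "hidden_state": {"entropy": False, "curiosity": "hidden_state"},
    "entropy_prediction_error": {"entropy": True, "curiosity": "prediction_error"},
    "entropy_hidden_state": {"entropy": True, "curiosity": "hidden_state"},
}

def total_reward(agent_type, extrinsic, entropy, prediction_error,
                 hidden_state_kl, alpha=0.1, eta=0.1):
    """Combine extrinsic reward with whichever intrinsic terms are enabled.

    alpha and eta are hypothetical weights for entropy and curiosity.
    """
    config = AGENT_TYPES[agent_type]
    reward = extrinsic
    if config["entropy"]:
        reward += alpha * entropy
    if config["curiosity"] == "prediction_error":
        reward += eta * prediction_error
    elif config["curiosity"] == "hidden_state":
        reward += eta * hidden_state_kl
    return reward

# Example: the same experience scored by two different agent types.
baseline_r = total_reward("baseline", 1.0, 2.0, 3.0, 4.0)
combined_r = total_reward("entropy_hidden_state", 1.0, 2.0, 3.0, 4.0,
                          alpha=0.5, eta=0.25)
```

Keeping the configuration in a table like this makes it easy to run the same experiment across all six conditions while changing nothing but the reward.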

They found that entropy and curiosity, especially when used together, led to efficient exploration. Notably, agents with hidden state curiosity demonstrated resilience against "curiosity traps" (distractions caused by unpredictable observational noise), which hindered agents with prediction error curiosity.
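One way to build intuition for this result is a toy model that is much simpler than the paper's agents: a single scalar latent with a Gaussian prior, observed through Gaussian noise of known variance. This setup, and every name in the sketch, is an assumption chosen for illustration. In it, the KL divergence between posterior and prior shrinks as the observation gets noisier, while the squared prediction error grows.

```python
import math

def posterior_after_observation(prior_mean, prior_var, obs, noise_var):
    """Conjugate Gaussian update of a scalar latent given one noisy observation."""
    post_var = 1.0 / (1.0 / prior_var + 1.0 / noise_var)
    post_mean = post_var * (prior_mean / prior_var + obs / noise_var)
    return post_mean, post_var

def kl_gaussian(mean_q, var_q, mean_p, var_p):
    """KL(q || p) between two scalar Gaussians."""
    return 0.5 * (math.log(var_p / var_q)
                  + (var_q + (mean_q - mean_p) ** 2) / var_p
                  - 1.0)

def compare_rewards(noise_var, prior_mean=0.0, prior_var=1.0):
    """Return (KL-based reward, squared prediction error) for one observation.

    The observation is placed one typical deviation from the prediction,
    i.e. sqrt(prior_var + noise_var) away from the prior mean.
    """
    obs = prior_mean + math.sqrt(prior_var + noise_var)
    post_mean, post_var = posterior_after_observation(
        prior_mean, prior_var, obs, noise_var)
    kl_reward = kl_gaussian(post_mean, post_var, prior_mean, prior_var)
    error_reward = (obs - prior_mean) ** 2
    return kl_reward, error_reward

kl_low_noise, err_low_noise = compare_rewards(noise_var=1.0)
kl_high_noise, err_high_noise = compare_rewards(noise_var=100.0)
```

In this toy setting a very noisy observation barely moves the posterior away from the prior, so it earns almost no KL-based reward, whereas the prediction error reward keeps growing with the noise. Whether the same mechanism fully explains the paper's results is not established by this summary.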

The authors suggest that implementing the FEP in RL models may enhance their robustness and generalization, potentially aligning the learning processes of artificial and biological agents.

Critical Analysis

The paper provides a novel approach to encouraging exploration in RL agents by drawing on the Free Energy Principle (FEP). The authors' comparison of prediction error curiosity and hidden state curiosity is a valuable contribution, as it highlights how the former is vulnerable to "curiosity traps" and how the latter resists them.

However, the paper does not delve into the specific mechanisms by which hidden state curiosity outperforms prediction error curiosity. A more detailed explanation of the underlying FEP-based formulation and its advantages would strengthen the technical understanding of the proposed approach.

Additionally, the experiments are limited to navigating mazes, which may not fully capture the complexities of real-world environments. Further research is needed to assess the performance of the FEP-based curiosity reward in more diverse and challenging RL tasks, such as open-ended exploration or multi-agent scenarios.

Overall, the paper presents a promising direction for enhancing exploration in RL by grounding the curiosity reward in the Free Energy Principle. Further research to address the limitations and elucidate the underlying mechanisms could solidify the potential of this approach.

Conclusion

This paper explores the use of entropy and curiosity to encourage efficient exploration in reinforcement learning agents. The authors propose a novel approach based on the Free Energy Principle, called "hidden state curiosity," which outperforms the more commonly used "prediction error curiosity" in avoiding "curiosity traps."

The findings suggest that incorporating the Free Energy Principle into reinforcement learning models could enhance their robustness and generalization, potentially aligning the learning processes of artificial and biological agents. While further research is needed to fully understand the mechanisms and expand the evaluation to more complex scenarios, this work represents an important step towards developing more efficient and adaptive reinforcement learning systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Surprise-Adaptive Intrinsic Motivation for Unsupervised Reinforcement Learning

Adriana Hugessen, Roger Creus Castanyer, Faisal Mohamed, Glen Berseth

Both entropy-minimizing and entropy-maximizing (curiosity) objectives for unsupervised reinforcement learning (RL) have been shown to be effective in different environments, depending on the environment's level of natural entropy. However, neither method alone results in an agent that will consistently learn intelligent behavior across environments. In an effort to find a single entropy-based method that will encourage emergent behaviors in any environment, we propose an agent that can adapt its objective online, depending on the entropy conditions by framing the choice as a multi-armed bandit problem. We devise a novel intrinsic feedback signal for the bandit, which captures the agent's ability to control the entropy in its environment. We demonstrate that such agents can learn to control entropy and exhibit emergent behaviors in both high- and low-entropy regimes and can learn skillful behaviors in benchmark tasks. Videos of the trained agents and summarized findings can be found on our project page https://sites.google.com/view/surprise-adaptive-agents

5/28/2024

cs.LG cs.AI

Uncertainty-Aware Reward-Free Exploration with General Function Approximation

Junkai Zhang, Weitong Zhang, Dongruo Zhou, Quanquan Gu

Mastering multiple tasks through exploration and learning in an environment poses a significant challenge in reinforcement learning (RL). Unsupervised RL has been introduced to address this challenge by training policies with intrinsic rewards rather than extrinsic rewards. However, current intrinsic reward designs and unsupervised RL algorithms often overlook the heterogeneous nature of collected samples, thereby diminishing their sample efficiency. To overcome this limitation, in this paper, we propose a reward-free RL algorithm called GFA-RFE. The key idea behind our algorithm is an uncertainty-aware intrinsic reward for exploring the environment and an uncertainty-weighted learning process to handle heterogeneous uncertainty in different samples. Theoretically, we show that in order to find an $\epsilon$-optimal policy, GFA-RFE needs to collect $\tilde{O}(H^2 \log N_{\mathcal F}(\epsilon) \, \mathrm{dim}(\mathcal F) / \epsilon^2)$ number of episodes, where $\mathcal F$ is the value function class with covering number $N_{\mathcal F}(\epsilon)$ and generalized eluder dimension $\mathrm{dim}(\mathcal F)$. Such a result outperforms all existing reward-free RL algorithms. We further implement and evaluate GFA-RFE across various domains and tasks in the DeepMind Control Suite. Experiment results show that GFA-RFE outperforms or is comparable to the performance of state-of-the-art unsupervised RL algorithms.

7/2/2024

cs.LG cs.AI

Beyond Optimism: Exploration With Partially Observable Rewards

Simone Parisi, Alireza Kazemipour, Michael Bowling

Exploration in reinforcement learning (RL) remains an open challenge. RL algorithms rely on observing rewards to train the agent, and if informative rewards are sparse the agent learns slowly or may not learn at all. To improve exploration and reward discovery, popular algorithms rely on optimism. But what if sometimes rewards are unobservable, e.g., situations of partial monitoring in bandits and the recent formalism of monitored Markov decision process? In this case, optimism can lead to suboptimal behavior that does not explore further to collapse uncertainty. With this paper, we present a novel exploration strategy that overcomes the limitations of existing methods and guarantees convergence to an optimal policy even when rewards are not always observable. We further propose a collection of tabular environments for benchmarking exploration in RL (with and without unobservable rewards) and show that our method outperforms existing ones.

6/21/2024

cs.LG

RLeXplore: Accelerating Research in Intrinsically-Motivated Reinforcement Learning

Mingqi Yuan, Roger Creus Castanyer, Bo Li, Xin Jin, Glen Berseth, Wenjun Zeng

Extrinsic rewards can effectively guide reinforcement learning (RL) agents in specific tasks. However, extrinsic rewards frequently fall short in complex environments due to the significant human effort needed for their design and annotation. This limitation underscores the necessity for intrinsic rewards, which offer auxiliary and dense signals and can enable agents to learn in an unsupervised manner. Although various intrinsic reward formulations have been proposed, their implementation and optimization details are insufficiently explored and lack standardization, thereby hindering research progress. To address this gap, we introduce RLeXplore, a unified, highly modularized, and plug-and-play framework offering reliable implementations of eight state-of-the-art intrinsic reward algorithms. Furthermore, we conduct an in-depth study that identifies critical implementation details and establishes well-justified standard practices in intrinsically-motivated RL. The source code for RLeXplore is available at https://github.com/RLE-Foundation/RLeXplore.

5/31/2024

cs.LG