minimax: Efficient Baselines for Autocurricula in JAX

Read original: arXiv:2311.12716 - Published 8/27/2024 by Minqi Jiang, Michael Dennis, Edward Grefenstette, Tim Rocktäschel


Overview

Unsupervised Environment Design (UED) is a form of automatic curriculum learning for training robust decision-making agents to perform well in unseen environments.
UED experiments have traditionally required lengthy training times, often taking several weeks, which has been a major obstacle to rapid innovation in the field.
This work introduces the minimax library, which enables accelerated UED training through the use of JAX for fully-tensorized environments and autocurriculum algorithms.

Plain English Explanation

The paper discusses a technique called Unsupervised Environment Design (UED), which is a way of automatically creating a curriculum of training tasks for reinforcement learning (RL) agents. The goal is to train agents that can perform well in a wide variety of environments, even ones they've never seen before.

Previous UED experiments have been computationally expensive, often taking weeks to train, because they collected experience with CPU rollouts and updated the model on a GPU. This has made it difficult for researchers to quickly test and improve these methods. The researchers introduce the minimax library, which uses a programming framework called JAX to speed up the training process. Because both the environments and the autocurriculum algorithms are implemented as tensor operations, the whole training loop can be compiled and run on hardware accelerators.
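To make the idea of a tensorized environment concrete, here is a minimal sketch in JAX. It is not taken from minimax: the grid size, the `step` function, and the reward rule are hypothetical and chosen only for illustration. The point is that an environment step written as pure array operations can be batched with `jax.vmap` and compiled with `jax.jit`.

```python
import jax
import jax.numpy as jnp

GRID_SIZE = 5  # hypothetical grid width and height

# Row/column offsets for the four actions: up, right, down, left.
MOVES = jnp.array([[-1, 0], [0, 1], [1, 0], [0, -1]], dtype=jnp.int32)


def step(pos, action, goal):
    """Advance one agent by one action using only array operations."""
    # Clipping keeps the agent inside the grid without Python branching.
    new_pos = jnp.clip(pos + MOVES[action], 0, GRID_SIZE - 1)
    # Reward is 1.0 when the agent lands on the goal cell, else 0.0.
    reward = jnp.all(new_pos == goal).astype(jnp.float32)
    return new_pos, reward


# vmap turns the single-environment step into a batched one;
# jit compiles the batched function for the available accelerator.
batched_step = jax.jit(jax.vmap(step))

positions = jnp.array([[0, 0], [2, 2], [4, 3]], dtype=jnp.int32)
actions = jnp.array([1, 0, 1], dtype=jnp.int32)  # right, up, right
goals = jnp.array([[0, 1], [0, 0], [4, 4]], dtype=jnp.int32)

new_positions, rewards = batched_step(positions, actions, goals)
```

Here three independent environments advance in a single call. The same pattern extends to thousands of environments without changing `step`, which is what lets rollouts move off the CPU.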

The minimax library includes a tensorized grid-world environment based on the MiniGrid benchmark, as well as reusable code for running UED experiments in procedurally-generated environments. With these components, the library provides strong UED baselines, including new parallelized variants, that train over 120 times faster in wall-clock time than previous implementations when using equal batch sizes.

Technical Explanation

The paper introduces the minimax library, which is designed to enable rapid experimentation with Unsupervised Environment Design (UED) techniques for training robust reinforcement learning agents.

The key innovations of the minimax library are:

Fully-Tensorized Environments and Algorithms: The library uses JAX to implement both the environments and the autocurriculum algorithms as tensor operations, so they can run on accelerators rather than on the CPU.
Compiled Training Loop: Because the entire training loop is compiled, minimax achieves speedups of over 120x in wall time compared to previous UED implementations when training with equal batch sizes.
Reusable Abstractions: The library includes a tensorized grid-world environment based on MiniGrid, as well as modular components for conducting UED experiments in procedurally-generated environments.
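The sketch below illustrates how procedural level generation and rollouts can live inside one compiled function. It is an illustration under stated assumptions, not minimax's API: `sample_level`, `rollout`, the wall density, and the random policy are all hypothetical stand-ins. It uses `jax.lax.scan` for the time loop, `jax.vmap` for the batch of levels, and `jax.jit` to compile the result.

```python
import jax
import jax.numpy as jnp

GRID_SIZE = 8   # hypothetical level size
NUM_STEPS = 32  # hypothetical rollout length
MOVES = jnp.array([[-1, 0], [0, 1], [1, 0], [0, -1]], dtype=jnp.int32)


def sample_level(key):
    """Procedurally generate one level: a wall mask plus a goal cell."""
    wall_key, goal_key = jax.random.split(key)
    walls = jax.random.bernoulli(wall_key, 0.2, (GRID_SIZE, GRID_SIZE))
    goal = jax.random.randint(goal_key, (2,), 0, GRID_SIZE)
    # Keep the start cell and the goal cell free of walls.
    walls = walls.at[0, 0].set(False).at[goal[0], goal[1]].set(False)
    return walls, goal


def rollout(key):
    """Sample a level, then run a random policy on it for NUM_STEPS steps."""
    level_key, policy_key = jax.random.split(key)
    walls, goal = sample_level(level_key)

    def step(pos, step_key):
        # A uniformly random action stands in for a learned policy.
        action = jax.random.randint(step_key, (), 0, 4)
        candidate = jnp.clip(pos + MOVES[action], 0, GRID_SIZE - 1)
        # Stay in place if the candidate cell is a wall.
        blocked = walls[candidate[0], candidate[1]]
        new_pos = jnp.where(blocked, pos, candidate)
        reward = jnp.all(new_pos == goal).astype(jnp.float32)
        return new_pos, reward

    start = jnp.zeros(2, dtype=jnp.int32)
    step_keys = jax.random.split(policy_key, NUM_STEPS)
    # scan expresses the time loop so it compiles as a single operation.
    _, rewards = jax.lax.scan(step, start, step_keys)
    # Total reward: the number of steps on which the agent ends at the goal cell.
    return rewards.sum()


# One compiled call covers level generation and rollouts for the whole batch.
batched_rollout = jax.jit(jax.vmap(rollout))
returns = batched_rollout(jax.random.split(jax.random.PRNGKey(0), 16))
```

Because nothing in `rollout` leaves JAX, there is no per-step transfer between host and accelerator. A full training loop would add policy and curriculum updates inside the same compiled function, which this sketch omits.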

Using these components, the researchers provide strong UED baselines, including new parallelized variants, that train far faster than previous implementations. The minimax library is available under the Apache 2.0 license at https://github.com/facebookresearch/minimax.

Critical Analysis

The paper presents a compelling approach to accelerating Unsupervised Environment Design (UED) research through the use of the minimax library. By leveraging JAX and hardware acceleration, the researchers have significantly reduced the computational burden of UED experiments, making it more feasible for researchers to rapidly iterate on and improve these techniques.

However, the paper does not address some potential limitations or areas for further research. For example, it is unclear how the minimax library's performance scales to more complex environments or larger models. Additionally, the paper does not provide a detailed analysis of the trade-offs between the speedups achieved and any potential impacts on the quality or robustness of the trained agents.

Further research could explore ways to extend the minimax library to support a wider range of environments and RL algorithms, as well as investigate the long-term implications of using accelerated UED training for real-world applications.

Conclusion

The minimax library represents an important step forward in enabling rapid innovation in the field of Unsupervised Environment Design (UED) for training robust reinforcement learning agents. By leveraging JAX and hardware acceleration, the library can achieve significant speedups in UED training, making it more feasible for researchers to quickly test and improve these techniques.

The availability of the minimax library under an open-source license is also a valuable contribution, as it provides a reusable platform for the RL community to build upon and advance the state of the art in UED and related areas of reinforcement learning and meta-learning.



Related Papers


minimax: Efficient Baselines for Autocurricula in JAX

Minqi Jiang, Michael Dennis, Edward Grefenstette, Tim Rocktäschel

Unsupervised environment design (UED) is a form of automatic curriculum learning for training robust decision-making agents to zero-shot transfer into unseen environments. Such autocurricula have received much interest from the RL community. However, UED experiments, based on CPU rollouts and GPU model updates, have often required several weeks of training. This compute requirement is a major obstacle to rapid innovation for the field. This work introduces the minimax library for UED training on accelerated hardware. Using JAX to implement fully-tensorized environments and autocurriculum algorithms, minimax allows the entire training loop to be compiled for hardware acceleration. To provide a petri dish for rapid experimentation, minimax includes a tensorized grid-world based on MiniGrid, in addition to reusable abstractions for conducting autocurricula in procedurally-generated environments. With these components, minimax provides strong UED baselines, including new parallelized variants, which achieve over 120× speedups in wall time compared to previous implementations when training with equal batch sizes. The minimax library is available under the Apache 2.0 license at https://github.com/facebookresearch/minimax.

8/27/2024

No Regrets: Investigating and Improving Regret Approximations for Curriculum Discovery

Alexander Rutherford, Michael Beukman, Timon Willi, Bruno Lacerda, Nick Hawes, Jakob Foerster

What data or environments to use for training to improve downstream performance is a longstanding and very topical question in reinforcement learning. In particular, Unsupervised Environment Design (UED) methods have gained recent attention as their adaptive curricula enable agents to be robust to in- and out-of-distribution tasks. We ask to what extent these methods are themselves robust when applied to a novel setting, closely inspired by a real-world robotics problem. Surprisingly, we find that the state-of-the-art UED methods either do not improve upon the naive baseline of Domain Randomisation (DR), or require substantial hyperparameter tuning to do so. Our analysis shows that this is due to their underlying scoring functions failing to predict intuitive measures of "learnability", i.e., in finding the settings that the agent sometimes solves, but not always. Based on this, we instead directly train on levels with high learnability and find that this simple and intuitive approach outperforms UED methods and DR in several binary-outcome environments, including on our domain and the standard UED domain of Minigrid. We further introduce a new adversarial evaluation procedure for directly measuring robustness, closely mirroring the conditional value at risk (CVaR). We open-source all our code and present visualisations of final policies here: https://github.com/amacrutherford/sampling-for-learnability.

8/30/2024

NAVIX: Scaling MiniGrid Environments with JAX

Eduardo Pignatelli, Jarek Liesen, Robert Tjarko Lange, Chris Lu, Pablo Samuel Castro, Laura Toni

As Deep Reinforcement Learning (Deep RL) research moves towards solving large-scale worlds, efficient environment simulations become crucial for rapid experimentation. However, most existing environments struggle to scale to high throughput, setting back meaningful progress. Interactions are typically computed on the CPU, limiting training speed and throughput, due to slower computation and communication overhead when distributing the task across multiple machines. Ultimately, Deep RL training is CPU-bound, and developing batched, fast, and scalable environments has become a frontier for progress. Among the most used Reinforcement Learning (RL) environments, MiniGrid is at the foundation of several studies on exploration, curriculum learning, representation learning, diversity, meta-learning, credit assignment, and language-conditioned RL, and still suffers from the limitations described above. In this work, we introduce NAVIX, a re-implementation of MiniGrid in JAX. NAVIX achieves over 200 000x speed improvements in batch mode, supporting up to 2048 agents in parallel on a single Nvidia A100 80 GB. This reduces experiment times from one week to 15 minutes, promoting faster design iterations and more scalable RL model development.

7/30/2024

UniZero: Generalized and Efficient Planning with Scalable Latent World Models

Yuan Pu, Yazhe Niu, Jiyuan Ren, Zhenjie Yang, Hongsheng Li, Yu Liu

Learning predictive world models is essential for enhancing the planning capabilities of reinforcement learning agents. Notably, the MuZero-style algorithms, based on the value equivalence principle and Monte Carlo Tree Search (MCTS), have achieved superhuman performance in various domains. However, in environments that require capturing long-term dependencies, MuZero's performance deteriorates rapidly. We identify that this is partially due to the entanglement of latent representations with historical information, which results in incompatibility with the auxiliary self-supervised state regularization. To overcome this limitation, we present UniZero, a novel approach that disentangles latent states from implicit latent history using a transformer-based latent world model. By concurrently predicting latent dynamics and decision-oriented quantities conditioned on the learned latent history, UniZero enables joint optimization of the long-horizon world model and policy, facilitating broader and more efficient planning in latent space. We demonstrate that UniZero, even with single-frame inputs, matches or surpasses the performance of MuZero-style algorithms on the Atari 100k benchmark. Furthermore, it significantly outperforms prior baselines in benchmarks that require long-term memory. Lastly, we validate the effectiveness and scalability of our design choices through extensive ablation studies, visual analyses, and multi-task learning results. The code is available at https://github.com/opendilab/LightZero.

6/18/2024