Adversarial Environment Design via Regret-Guided Diffusion Models




Spotlight, NeurIPS 2024

Video

Abstract

Training agents that are robust to environmental changes remains a significant challenge in deep reinforcement learning (RL). Unsupervised environment design (UED) has recently emerged to address this issue by generating a set of training environments tailored to the agent's capabilities. While prior works demonstrate that UED has the potential to learn a robust policy, their performance is constrained by the capabilities of the environment generator. To this end, we propose a novel UED algorithm, adversarial environment design via regret-guided diffusion models (ADD). The proposed method guides the diffusion-based environment generator with the regret of the agent to produce environments that the agent finds challenging but conducive to further improvement. By exploiting the representation power of diffusion models, ADD can directly generate adversarial environments while maintaining the diversity of training environments, enabling the agent to effectively learn a robust policy. Our experimental results demonstrate that the proposed method successfully generates an instructive curriculum of environments, outperforming UED baselines in zero-shot generalization across novel, out-of-distribution environments.

Overview



First, a diffusion-based environment generator, pre-trained on a dataset of randomly generated environments, produces a set of environments for the agent. After the agent interacts with the generated environments and is trained via reinforcement learning, the episodic results are used to update the environment critic. The environment critic then estimates the agent's regret in a differentiable form and guides the reverse process of the diffusion-based generator, yielding environment parameters that the agent finds challenging but conducive to further improvement. By repeating this process, the agent learns a policy that is robust to variations in the environment.
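
As a rough sketch of this loop, the regret-guided reverse step can be written as follows. This is not the authors' implementation: eps_model, regret_critic, the DDPM noise schedule, and the guidance scale are assumed names and hyperparameters.

```python
# Minimal sketch of regret-guided reverse diffusion under standard DDPM notation.
# eps_model (pre-trained noise predictor over flattened environment parameters)
# and regret_critic (environment critic giving a differentiable regret estimate)
# are hypothetical stand-ins, not the authors' code.
import torch

T = 1000
betas = torch.linspace(1e-4, 2e-2, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def unguided_mean(eps_model, x_t, t):
    """Posterior mean of the pre-trained (unguided) reverse process at step t."""
    eps = eps_model(x_t, torch.tensor([t]))
    coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
    return (x_t - coef * eps) / torch.sqrt(alphas[t])

def regret_guided_sample(eps_model, regret_critic, shape, guidance_scale=1.0):
    """Sample environment parameters while nudging each reverse step toward
    higher estimated regret via the critic's gradient."""
    x_t = torch.randn(shape)
    for t in reversed(range(T)):
        mean = unguided_mean(eps_model, x_t, t)
        # The differentiable regret estimate guides the reverse process.
        x_in = x_t.detach().requires_grad_(True)
        regret = regret_critic(x_in, torch.tensor([t])).sum()
        grad = torch.autograd.grad(regret, x_in)[0]
        mean = mean + guidance_scale * betas[t] * grad
        noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
        x_t = mean + torch.sqrt(betas[t]) * noise
    return x_t  # decoded downstream into concrete training environments
```

In the full method, the episodic results collected from the agent's rollouts are then used to update the environment critic, closing the loop described above.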

Experiments



Minigrid Test Environments



We train the diffusion-based environment generator on 10M random environments whose number of blocks is uniformly sampled between 0 and 60. We then train an LSTM-based policy for 250M environmental steps and evaluate its zero-shot performance on 12 challenging test environments from prior works [1, 2].



[1] Dennis et al., "Emergent complexity and zero-shot transfer via unsupervised environment design," NeurIPS, 2020.

[2] Jiang et al., "Replay-guided adversarial environment design," NeurIPS, 2021.
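
The exact environment parameterization is not reproduced here; the sketch below only illustrates, under assumed layout conventions (grid size and cell encodings are hypothetical), how a random pre-training dataset with a uniformly sampled block count could be generated.

```python
# Illustrative sketch of random environment sampling for generator pre-training.
# Grid size and cell encodings (0 = empty, 1 = block, 2 = start, 3 = goal) are
# assumptions for illustration, not the exact parameterization used in the paper.
import numpy as np

def sample_random_env(grid_size=15, max_blocks=60, rng=None):
    rng = rng or np.random.default_rng()
    grid = np.zeros((grid_size, grid_size), dtype=np.int8)
    n_blocks = int(rng.integers(0, max_blocks + 1))  # uniform in [0, 60]
    cells = rng.choice(grid_size * grid_size, size=n_blocks + 2, replace=False)
    blocks, start, goal = cells[:-2], cells[-2], cells[-1]
    grid.flat[blocks] = 1
    grid.flat[start] = 2
    grid.flat[goal] = 3
    return grid

# Scale the count up (e.g., to 10M) to build the pre-training dataset.
dataset = np.stack([sample_random_env() for _ in range(1000)])
```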



Minigrid Results



The proposed method consistently outperforms the baselines in the challenging test environments. It also successfully generates adversarial environments while preserving diversity.



Minigrid Controllable Generation



ADD allows us to control the difficulty of the generated environments by guiding the generator with the probability of achieving a specific return. This enables the learned generator to be reused in various applications, such as benchmark generation.
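
A plausible sketch of this difficulty control is shown below, reusing T, betas, and unguided_mean from the Overview sketch and assuming a hypothetical return_classifier that predicts the probability of each discretized return bin from noisy environment parameters.

```python
# Sketch of difficulty-controlled generation: the reverse process is guided by
# the log-probability of a target return bin instead of the regret estimate.
# return_classifier and target_bin are assumed names; T, betas, and
# unguided_mean are reused from the Overview sketch above.
import torch

def difficulty_guided_sample(eps_model, return_classifier, shape,
                             target_bin, guidance_scale=1.0):
    x_t = torch.randn(shape)
    for t in reversed(range(T)):
        mean = unguided_mean(eps_model, x_t, t)
        x_in = x_t.detach().requires_grad_(True)
        logits = return_classifier(x_in, torch.tensor([t]))
        log_prob = torch.log_softmax(logits, dim=-1)[:, target_bin].sum()
        grad = torch.autograd.grad(log_prob, x_in)[0]
        mean = mean + guidance_scale * betas[t] * grad
        noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
        x_t = mean + torch.sqrt(betas[t]) * noise
    return x_t
```

Choosing a different target_bin (or guidance scale) then steers sampling toward easier or harder environments for the current agent.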



BipedalWalker Test Environments



We train the RL agent for two billion environmental steps and evaluate the zero-shot performance on six test environments from prior work [3].



[3] Parker-Holder et al., "Evolving curricula with regret-based environment design," ICML, 2022.



BipedalWalker Results



The proposed algorithm achieves the highest return across all test environments, with an average of 149.6. This suggests that it generates environments that are not merely more difficult, but also conducive to the agent's learning process.



BipedalWalker Controllable Generation