Meta-Explore: Exploratory Hierarchical Vision-and-Language Navigation Using Scene Object Spectrum Grounding

Minyoung Hwang1, Jaeyeon Jeong1, Minsoo Kim3, Yoonseon Oh2*, Songhwai Oh1*

1Electrical and Computer Engineering and ASRI, Seoul National University, 2Department of Electronic Engineering, Hanyang University, 3Interdisciplinary Major in Artificial Intelligence, Seoul National University

*Corresponding authors

CVPR 2023 accepted


Abstract

The main challenge in vision-and-language navigation (VLN) is understanding natural-language instructions in an unseen environment. A key limitation of conventional VLN algorithms is that when the agent takes a mistaken action, it fails to follow the instructions or explores unnecessary regions, ending up on an irrecoverable path. To tackle this problem, we propose Meta-Explore, a hierarchical navigation method that deploys an exploitation policy to correct recent misled actions. We show that an exploitation policy that moves the agent toward a well-chosen local goal among unvisited but observable states outperforms a method that moves the agent to a previously visited state. We also highlight the need to imagine regretful explorations with semantically meaningful clues. The key to our approach is understanding the object placements around the agent in the spectral domain. Specifically, we present a novel visual representation, called scene object spectrum (SOS), which applies a category-wise 2D Fourier transform to detected objects. Combining the exploitation policy and SOS features, the agent can correct its path by choosing a promising local goal. We evaluate our method on three VLN benchmarks: R2R, SOON, and REVERIE. Meta-Explore outperforms other baselines and shows significant generalization performance. In addition, local goal search using the proposed spectral-domain SOS features significantly improves the success rate by 17.1% and SPL by 20.6% on the SOON benchmark.



Illustration of Meta-Explore


In each episode, the agent is given a natural-language instruction to navigate to a goal location. The agent explores the environment and constructs a topological map by recording the nodes it has visited and the nodes reachable in the next step. To solve the regretful exploration problem, the agent chooses an unvisited local goal.

Hierarchical Exploration: We propose a hierarchical navigation method deploying a learnable mode selector.
Local Goal Search: The exploitation policy finds an unvisited and near-optimal local goal instead of simply backtracking.
Scene Object Spectrum: We assert the necessity of spectral grounding of objects in hierarchical exploration.

Hierarchical Exploration


Meta-Explore is a hierarchical navigation method deploying an exploitation policy to correct misled recent actions.


Figure: Traditional Exploitation Methods vs. Our Solution

We show that an exploitation policy, which moves the agent toward a well-chosen local goal among unvisited but observable states, outperforms a method which moves the agent to a previously visited state.
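
The contrast between backtracking and local goal search can be illustrated with a toy topological map. The sketch below is only an illustration under stated assumptions: the adjacency dictionary `graph`, the `scores` dictionary, and the helper names are hypothetical, and the candidate scores are hand-written numbers standing in for whatever learned scoring the method actually uses. It collects nodes that are observable from visited nodes but not yet visited, picks the highest-scoring one as the local goal, and plans a path to it through already-visited nodes.

```python
from collections import deque

def unvisited_frontier(graph, visited):
    """Nodes seen as neighbors of visited nodes but not yet visited."""
    visited = set(visited)
    frontier = set()
    for node in visited:
        for neighbor in graph.get(node, ()):
            if neighbor not in visited:
                frontier.add(neighbor)
    return frontier

def choose_local_goal(graph, visited, scores):
    """Pick the best-scoring frontier node (scores are hypothetical inputs)."""
    frontier = unvisited_frontier(graph, visited)
    if not frontier:
        return None
    # sorted() makes tie-breaking deterministic.
    return max(sorted(frontier), key=lambda n: scores.get(n, float("-inf")))

def path_through_visited(graph, start, goal, visited):
    """Breadth-first search that only walks through visited nodes and the goal."""
    allowed = set(visited) | {goal}
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in graph.get(path[-1], ()):
            if neighbor in allowed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

# Toy map: the agent has walked a -> b -> d and is now at d.
graph = {
    "a": ["b", "c"],
    "b": ["a", "d", "e"],
    "c": ["a", "f"],
    "d": ["b"],
    "e": ["b"],
    "f": ["c"],
}
visited = ["a", "b", "d"]
scores = {"c": 0.2, "e": 0.4, "f": 0.9}  # "f" has not been observed yet

local_goal = choose_local_goal(graph, visited, scores)
path = path_through_visited(graph, "d", local_goal, visited)
print(local_goal, path)
```

Note that node `f` is never selected even though it has the highest score, because it is not adjacent to any visited node and therefore is not yet observable. A backtracking baseline would instead return to one of `a` or `b`.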


Scene Object Spectrum


Meta-Explore imagines regretful explorations with semantically meaningful clues. We present a novel visual representation, called scene object spectrum (SOS), which applies a category-wise 2D Fourier transform to detected objects.
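
To make the idea of a category-wise 2D Fourier transform concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the bounding-box input format, the binary per-category masks, the grid size, and the choice to keep only a `k` by `k` block of low-frequency magnitudes are all assumptions made for illustration. The function name `scene_object_spectrum` is likewise hypothetical.

```python
import numpy as np

def scene_object_spectrum(detections, num_categories, height, width, k=4):
    """Toy SOS-style feature (assumed design, not the paper's exact recipe).

    detections: list of (category_id, (x0, y0, x1, y1)) boxes in pixel units.
    Returns an array of shape (num_categories, k, k) holding the magnitudes
    of a low-frequency block of each category's 2D spectrum.
    """
    # One binary occupancy mask per object category.
    masks = np.zeros((num_categories, height, width), dtype=np.float64)
    for category, (x0, y0, x1, y1) in detections:
        masks[category, y0:y1, x0:x1] = 1.0

    # 2D FFT applied independently to each category's mask.
    spectrum = np.fft.fft2(masks, axes=(-2, -1))

    # Keep only the first k x k coefficients (a simplification that ignores
    # the negative-frequency half of the spectrum).
    return np.abs(spectrum[:, :k, :k])

# Two hypothetical detections on a 16 x 32 view: category 0 and category 2.
detections = [(0, (2, 2, 6, 5)), (2, (10, 1, 14, 8))]
sos = scene_object_spectrum(detections, num_categories=3, height=16, width=32)
print(sos.shape)
```

In this sketch the zero-frequency coefficient of each category equals the total mask area for that category, so a category with no detections yields an all-zero slice.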


Toy Example


Combining the exploitation policy and SOS features, the agent can correct its path by choosing a promising local goal. We score the corrected trajectories to measure the alignment with the language instruction L.



Experiments in Discrete Environments

VLN Benchmark - R2R


Meta-Explore outperforms other baselines and shows significant generalization performance.


Local goal search using the proposed spectral-domain SOS features significantly improves the success rate and SPL.
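
For readers unfamiliar with the two metrics, the sketch below computes success rate and SPL (success weighted by path length) as they are commonly defined in the navigation literature: each episode contributes its success indicator times the ratio of the shortest-path length to the larger of the shortest-path length and the agent's actual path length. The episode values are made up for illustration and are not results from the paper, and the function names are hypothetical. The code assumes strictly positive path lengths.

```python
def success_rate(successes):
    """Fraction of episodes that ended in success (successes are 0 or 1)."""
    return sum(successes) / len(successes)

def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by path length, averaged over episodes."""
    total = 0.0
    for s, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        # A successful episode scores 1 only if the agent took a shortest path.
        total += s * shortest / max(taken, shortest)
    return total / len(successes)

# Three made-up episodes: a detour, a failure, and an optimal run.
successes = [1, 0, 1]
shortest_lengths = [10.0, 8.0, 5.0]
path_lengths = [20.0, 8.0, 5.0]

sr = success_rate(successes)
spl_value = spl(successes, shortest_lengths, path_lengths)
print(sr, spl_value)
```

Because SPL penalizes detours, an exploitation policy that reaches a good local goal with less wandering can raise SPL even when the success rate is unchanged.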


Experiments in Continuous Environments

Image-Goal Navigation


We extend Meta-Explore to the image-goal navigation task in continuous environments to assess the impact of hierarchical exploration in realistic environments.


Bibtex

@inproceedings{2023metaexplore,
  title={Meta-Explore: Exploratory Hierarchical Vision-and-Language Navigation Using Scene Object Spectrum Grounding},
  author={Hwang, Minyoung and Jeong, Jaeyeon and Kim, Minsoo and Oh, Yoonseon and Oh, Songhwai},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2023}
}