Verifiably Following Complex Robot Instructions with Foundation Models

Brown University

Abstract

Enabling mobile robots to follow complex natural language instructions is an important yet challenging problem. People want to flexibly express constraints, refer to arbitrary landmarks, and verify behavior when instructing robots. Conversely, robots must disambiguate human instructions into specifications and ground instruction referents in the real world. We propose Language Instruction grounding for Motion Planning (LIMP), an approach that enables robots to verifiably follow expressive and complex open-ended instructions in real-world environments without prebuilt semantic maps. LIMP constructs a symbolic instruction representation that reveals the robot's alignment with an instructor's intended motives and affords the synthesis of robot behaviors that are correct-by-construction. We perform a large-scale evaluation and demonstrate our approach on 150 instructions in five real-world environments, showing the generality of our approach and the ease of deployment in novel unstructured domains. In our experiments, LIMP performs comparably with state-of-the-art LLM task planners and LLM code-writing planners on standard open-vocabulary tasks, and additionally achieves a 79% success rate on complex spatiotemporal instructions, while LLM and code-writing planners both achieve 38%.

LIMP teaser image.

This figure visualizes our approach, Language Instruction grounding for Motion Planning (LIMP), executing the instruction: "Bring the green plush toy to the whiteboard in front of it, watch out for the robot in front of the toy". LIMP has no prior semantic information about the environment; instead, at runtime, our approach leverages VLMs and spatial reasoning to detect and ground open-vocabulary instruction referents. LIMP then generates a verifiably correct task and motion plan that enables the robot to navigate from its start location (yellow, A) to the green plush toy (green, B), execute a pick skill that searches for and grasps the object, then navigate to the whiteboard (blue, C) while avoiding the robot in the space (red circles), and finally execute a place skill that sets the object down.
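The staged plan described in the caption can be illustrated as a simple goal-sequence-with-constraint structure. This is only an illustrative sketch with hypothetical names (`Subgoal`, `Plan`), not LIMP's actual representation or API:

```python
from dataclasses import dataclass

@dataclass
class Subgoal:
    skill: str    # e.g. "navigate", "pick", "place"
    target: str   # open-vocabulary referent, grounded at runtime

@dataclass
class Plan:
    subgoals: list  # ordered subgoals to achieve
    avoid: list     # referents to stay clear of throughout execution

# Hypothetical encoding of the teaser instruction's stages (A -> B -> C).
teaser_plan = Plan(
    subgoals=[
        Subgoal("navigate", "green plush toy"),   # A -> B
        Subgoal("pick", "green plush toy"),
        Subgoal("navigate", "whiteboard"),        # B -> C
        Subgoal("place", "green plush toy"),
    ],
    avoid=["robot"],  # constraint enforced during every navigation step
)
```

The key point this structure captures is that the avoidance constraint is not a subgoal of its own: it applies across the whole execution, which is what distinguishes temporal constraints from simple goal sequencing.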

Real-World Demonstration
(All robot videos are 1x speed)

Below are additional real-world examples of LIMP generating task and motion plans to follow expressive instructions with complex spatiotemporal constraints. Each demonstration has two videos: the top video visualizes instruction translation, referent grounding, task-progression semantic maps, and computed motion plans for each example; the bottom video shows a robot executing the generated plan in the real world. Please see our paper for more details on our approach.


Demo 1

Demo 2

Demo 3

Baseline Comparison
(All robot videos are 1x speed)

We compare LIMP with baseline implementations of an LLM task planner (NLMap-SayCan) and an LLM code-writing planner (Code-as-Policies), representing state-of-the-art approaches for open-ended language-conditioned robot instruction following. To ensure competitive performance, we integrate our spatial grounding module and low-level robot control into these baselines, allowing them to query our module for 3D object positions, execute manipulation options, and use our path planner. We observe that LLM and code-writing planners are quite adept at generating sequential subgoals but struggle to adhere to temporal constraints. In contrast, our approach ensures each robot step adheres to constraints while achieving subgoals, as illustrated in the example below.
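The difference between checking a constraint only at subgoal boundaries and checking it at every step can be sketched with a toy verifier. The helper names and the 1 m clearance radius below are hypothetical, chosen only to illustrate the idea; this is not LIMP's verification machinery:

```python
import math

def violates(position, obstacle, radius=1.0):
    """True if the robot is within `radius` of the obstacle (illustrative)."""
    return math.dist(position, obstacle) < radius

def verify_trajectory(waypoints, obstacle, radius=1.0):
    """Check a 'globally avoid' constraint at every waypoint along the path,
    mirroring step-wise verification rather than checking only at the start
    and end of each subgoal."""
    return all(not violates(p, obstacle, radius) for p in waypoints)

# Trash bin at (2, 2). A path that detours around it by more than 1 m:
safe_path = [(0, 0), (1, 0), (2, 0), (3, 1), (3, 2)]
print(verify_trajectory(safe_path, (2, 2)))    # True

# A path that reaches the same endpoint but cuts through the bin's vicinity:
unsafe_path = [(0, 0), (1, 1), (2, 2), (3, 2)]
print(verify_trajectory(unsafe_path, (2, 2)))  # False
```

Both paths satisfy the subgoal (reaching the endpoint), so a planner that reasons only about subgoal completion cannot tell them apart; a step-wise check rejects the second one.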

Instruction
"Hey, I want you to bring the plush toy on the table to the tree, make sure to avoid the trash bin when bringing the toy"


LIMP (Ours)

NLMap-SayCan

Code-as-Policies

More Demonstration Videos

We run a comprehensive evaluation on 150 natural language instructions across multiple real-world environments. See below for additional instruction-following videos.

BibTeX


        @article{quartey2024verifiably,
          title={Verifiably Following Complex Robot Instructions with Foundation Models},
          author={Quartey, Benedict and Rosen, Eric and Tellex, Stefanie and Konidaris, George},
          journal={arXiv preprint arXiv:2402.11498},
          year={2024}
        }

Special thanks to Nerfies for an awesome website template!