Dynamic Open-Vocabulary 3D Scene Graphs for Long-term Language-Guided Mobile Manipulation

1Beihang University, 2Nanyang Technological University, 3Minzu University of China, 4Afanti Tech LLC

Abstract

Enabling mobile robots to perform long-term tasks in dynamic real-world environments is a formidable challenge, especially when the environment changes frequently due to human-robot interactions or the robot's own actions. Traditional methods typically assume static scenes, which limits their applicability in the continuously changing real world. To overcome these limitations, we present DovSG, a novel mobile manipulation framework that leverages dynamic open-vocabulary 3D scene graphs and a language-guided task planning module for long-term task execution. DovSG takes RGB-D sequences as input and uses vision-language models (VLMs) for object detection to obtain high-level object semantic features. From the segmented objects, it generates a structured 3D scene graph that captures low-level spatial relationships. Furthermore, an efficient mechanism for locally updating the scene graph allows the robot to adjust parts of the graph dynamically during interactions without full scene reconstruction. This mechanism is particularly valuable in dynamic environments, enabling the robot to continually adapt to scene changes and effectively support the execution of long-term tasks. We validated our system in real-world environments with varying degrees of manual modification, demonstrating its effectiveness and superior performance in long-term tasks.



DovSG is a mobile robotic system designed to perform long-term tasks in real-world environments. It can detect changes in the scene during task execution, ensuring that subsequent subtasks are completed correctly. The system consists of five main components: perception, memory, task planning, navigation, and manipulation. The memory module includes a lower-level semantic memory and a higher-level scene graph, both of which are continuously updated as the robot explores the environment. This enables the robot to promptly detect manual changes (e.g., keys being moved from cabinet to table) and make the necessary adjustments for subsequent tasks (such as correctly executing Task 2-2).

DovSG constructs a Dynamic 3D Scene Graph and leverages task decomposition with large language models, enabling localized updates of the 3D scene graphs during interactive exploration. This assists mobile robots in accurately executing long-term tasks, even in scenarios where human modifications to the environment are present.

Contributions

  • We propose a novel robotic framework that integrates dynamic open-vocabulary 3D scene graphs with language-guided task planning, enabling accurate long-term task execution in dynamic and interactive environments.
  • We construct dynamic 3D scene graphs that capture rich object semantics and spatial relations, performing localized updates as the robot interacts with its environment, allowing it to adapt efficiently to incremental modifications.
  • We develop a task planning method that decomposes complex tasks into manageable subtasks, including pick-up, place, and navigation, enhancing the robot’s flexibility and scalability in long-term missions.
  • We implement DovSG on real-world mobile robots and demonstrate its capabilities across dynamic environments, showing excellent performance in both long-term tasks and subtasks like navigation and manipulation.
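The task decomposition named in the third contribution can be pictured as follows. This is a minimal sketch, assuming the language model is prompted to return a JSON list of subtasks; the schema, the action names, and `parse_plan` are hypothetical illustrations and are not taken from the DovSG implementation.

```python
import json
from dataclasses import dataclass
from typing import List

# Hypothetical action vocabulary, mirroring the three subtask types named above.
ALLOWED_ACTIONS = {"navigate", "pick_up", "place"}


@dataclass(frozen=True)
class Subtask:
    action: str  # one of ALLOWED_ACTIONS
    target: str  # open-vocabulary object description, e.g. "red pepper"


def parse_plan(llm_output: str) -> List[Subtask]:
    """Parse a JSON list of {"action", "target"} items and reject unknown actions."""
    plan = []
    for item in json.loads(llm_output):
        if item["action"] not in ALLOWED_ACTIONS:
            raise ValueError("unsupported action: " + str(item["action"]))
        plan.append(Subtask(item["action"], item["target"]))
    return plan


# Assumed model output for "Please move the red pepper to the plate".
example_output = """[
  {"action": "navigate", "target": "red pepper"},
  {"action": "pick_up",  "target": "red pepper"},
  {"action": "navigate", "target": "plate"},
  {"action": "place",    "target": "plate"}
]"""

plan = parse_plan(example_output)
```

Validating the plan against a closed action set before execution means a malformed model response fails early instead of reaching the navigation or manipulation modules.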

Demos

Our DovSG enables long-term task execution by constructing dynamic scene graphs and leveraging large language models, even in scenarios where human interactions alter the environment.

Minor Adjustment: Please move the red pepper to the plate, then move the green pepper to plate.

Appearance: Grab the blue toy and place it to the green toy, then move the green pepper to plate.
Minor Adjustment: Move the green pepper to the plate, then throw the Coca-Cola bottle into the trash can.
Appearance: Please move the eggplant to the yellow cabinet, and place the bananas on the plate.
Positional Shift: Place the green pepper in the plate, and the red pepper in the plate.
Minor Adjustment: Put the banana onto the plate on the white table, and then put the blue dinosaur from the box onto the green mouse pad.
Appearance: Move the corn to the plate in the cabinet, and put the green pepper onto the plate in the cabinet.
Positional Shift: Set the red pepper in the green container, and set the corn in the green container.


Point Cloud to 3D Scene Graphs with Semantic Memory

DovSG can accurately create 3D Scene Graphs that effectively guide robot navigation.



Initialization and Construction of 3D Scene Graphs

We first use the RGB-D-based DROID-SLAM model to predict the pose of each frame in the scene. Then, we apply an advanced open-vocabulary segmentation model to segment regions in the RGB images, extract semantic feature vectors for each region, and project them onto a 3D point cloud. Based on semantic, geometric, and CLIP feature similarities, the same object captured from multiple views is gradually associated and fused, resulting in a series of 3D objects. Next, we infer the relationships between objects based on their spatial positions and generate edges connecting these objects, forming a scene graph. This scene graph provides a structured and comprehensive understanding of the scene: it allows efficient localization of target objects, can be easily rebuilt and updated in dynamic environments, and supports task planning with large language models.
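The multi-view association step can be illustrated with a small example. This is a simplified sketch under stated assumptions, not the DovSG implementation: each detection is reduced to a 3D centroid and one feature vector, geometric similarity is approximated from centroid distance, and the `scale` and `threshold` values are arbitrary placeholders.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class MapObject:
    centroid: List[float]  # 3D position in the world frame (metres)
    feature: List[float]   # visual embedding, e.g. a CLIP vector
    n_obs: int = 1         # number of detections fused so far


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def geometric_sim(c1: List[float], c2: List[float], scale: float = 0.5) -> float:
    """Map centroid distance to (0, 1]; 1 means identical positions."""
    return math.exp(-math.dist(c1, c2) / scale)


def associate(objects: List[MapObject], centroid: List[float],
              feature: List[float], threshold: float = 1.2) -> MapObject:
    """Fuse a detection into the best-matching object, or start a new object."""
    best, best_score = None, threshold
    for obj in objects:
        score = cosine(obj.feature, feature) + geometric_sim(obj.centroid, centroid)
        if score > best_score:
            best, best_score = obj, score
    if best is None:
        objects.append(MapObject(list(centroid), list(feature)))
        return objects[-1]
    n = best.n_obs
    # Running average keeps the fused estimate weighted by observation count.
    best.centroid = [(n * a + b) / (n + 1) for a, b in zip(best.centroid, centroid)]
    best.feature = [(n * a + b) / (n + 1) for a, b in zip(best.feature, feature)]
    best.n_obs = n + 1
    return best


objects: List[MapObject] = []
associate(objects, [0.0, 0.0, 0.0], [1.0, 0.0])   # first view of an object
associate(objects, [0.05, 0.0, 0.0], [0.9, 0.1])  # second view, fused into it
associate(objects, [3.0, 0.0, 0.0], [0.0, 1.0])   # a different object elsewhere
```

A fuller system would compare point cloud overlap rather than centroids and would keep semantic and CLIP similarities as separate terms; the sketch only shows the associate-then-fuse pattern.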

Adaptation in interactions with manually modified scenes.

(1) We train the scene-specific regression MLP of the ACE model using RGB images and their poses, making the process highly efficient. (2) After manual scene modification, multi-view observations allow rough global pose estimation via ACE, refined further using LightGlue and ICP. The new viewpoint’s point cloud closely aligns with the stored pose. (3) The bottom image shows accurate local updates to the scene based on observations from the new viewpoint.
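A local update of this kind can be sketched on a toy scene graph. The example below is an illustration under assumptions rather than the DovSG algorithm: objects are axis-aligned boxes with persistent names, the only relation is "on", and `region` is a hypothetical box covering what the new viewpoint can see. Only nodes and edges touching the observed region are rewritten; the rest of the graph is reused.

```python
from typing import Dict, Set, Tuple

Vec = Tuple[float, float, float]
Box = Tuple[Vec, Vec]        # axis-aligned box: (min corner, max corner)
Edge = Tuple[str, str, str]  # (subject, relation, object)


def center(box: Box) -> Vec:
    lo, hi = box
    return tuple((a + b) / 2 for a, b in zip(lo, hi))


def inside(p: Vec, region: Box) -> bool:
    lo, hi = region
    return all(lo[i] <= p[i] <= hi[i] for i in range(3))


def is_on(top: Box, bottom: Box, tol: float = 0.05) -> bool:
    """True if footprints overlap in x-y and top's base meets bottom's upper face."""
    (tlo, thi), (blo, bhi) = top, bottom
    overlap_xy = all(tlo[i] < bhi[i] and blo[i] < thi[i] for i in range(2))
    return overlap_xy and abs(tlo[2] - bhi[2]) <= tol


def local_update(boxes: Dict[str, Box], edges: Set[Edge],
                 observed: Dict[str, Box], region: Box):
    """Rewrite only the nodes and edges affected by a new observation."""
    # Nodes previously stored in the region, plus everything seen now.
    changed = {n for n, b in boxes.items() if inside(center(b), region)} | set(observed)
    new_boxes = {n: b for n, b in boxes.items() if n not in changed}
    new_boxes.update(observed)  # stale in-region nodes that were not re-seen are dropped
    kept = {e for e in edges if e[0] not in changed and e[2] not in changed}
    fresh = {(a, "on", b) for a in new_boxes for b in new_boxes
             if a != b and (a in changed or b in changed)
             and is_on(new_boxes[a], new_boxes[b])}
    return new_boxes, kept | fresh


table = ((3.0, 0.0, 0.0), (4.0, 1.0, 0.8))
boxes = {
    "cabinet": ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0)),
    "table": table,
    "keys": ((0.4, 0.4, 1.0), (0.5, 0.5, 1.02)),  # initially on the cabinet
    "cup": ((3.1, 0.1, 0.8), (3.2, 0.2, 0.9)),    # initially on the table
}
edges = {("keys", "on", "cabinet"), ("cup", "on", "table")}

# New view of the table: the cup is gone and the keys have been moved here.
observed = {"table": table, "keys": ((3.4, 0.4, 0.8), (3.5, 0.5, 0.82))}
region = ((2.5, -0.5, 0.0), (4.5, 1.5, 2.0))
new_boxes, new_edges = local_update(boxes, edges, observed, region)
```

Here the cabinet node is carried over untouched, the cup is removed because it lay in the observed region but was not seen again, and the keys gain a new "on" edge to the table.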

Two proposed grasp strategies in DovSG.

The first row shows the first strategy: we crop the point cloud passed to AnyGrasp to a limited range around the target object, so that AnyGrasp focuses on the target without compromising the generation of collision-free grasps. We then filter the candidate grasps by translational and rotational cost; the red grasps indicate the highest confidence. The second row shows our heuristic grasp strategy, which uses the object's bounding box information to rotate and select the most appropriate grasp.
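The crop-then-filter idea in the first strategy can be sketched as follows. The cost definitions, weights, and cutoff are assumptions made for illustration (the page does not specify them), and the code does not call the AnyGrasp API: translational cost is taken as the distance from the grasp to the object centre, and rotational cost as the angle between the approach direction and a preferred top-down direction.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple

Vec = Tuple[float, float, float]


@dataclass(frozen=True)
class Grasp:
    position: Vec  # gripper position in the world frame (metres)
    approach: Vec  # unit approach direction


def crop_points(points: List[Vec], center: Vec, radius: float) -> List[Vec]:
    """Keep only the points within `radius` of the target object centre."""
    return [p for p in points if math.dist(p, center) <= radius]


def grasp_cost(g: Grasp, target: Vec, preferred: Vec = (0.0, 0.0, -1.0),
               w_t: float = 1.0, w_r: float = 0.5) -> float:
    """Weighted sum of translational (m) and rotational (rad) cost; weights are assumed."""
    trans = math.dist(g.position, target)
    cos_angle = sum(a * b for a, b in zip(g.approach, preferred))
    rot = math.acos(max(-1.0, min(1.0, cos_angle)))  # clamp against rounding error
    return w_t * trans + w_r * rot


def select_grasp(grasps: List[Grasp], target: Vec,
                 max_cost: float = 0.5) -> Optional[Grasp]:
    """Drop grasps above `max_cost` and return the cheapest one, if any remain."""
    feasible = [g for g in grasps if grasp_cost(g, target) <= max_cost]
    return min(feasible, key=lambda g: grasp_cost(g, target)) if feasible else None


target = (0.0, 0.0, 0.0)
cloud = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (1.0, 0.0, 0.0)]
cropped = crop_points(cloud, target, radius=0.2)

candidates = [
    Grasp((0.0, 0.0, 0.05), (0.0, 0.0, -1.0)),  # close and top-down
    Grasp((0.0, 0.0, 0.02), (1.0, 0.0, 0.0)),   # close but sideways
    Grasp((0.3, 0.0, 0.0), (0.0, 0.0, -1.0)),   # top-down but far away
]
best = select_grasp(candidates, target)
```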



Detailed Relocalization Pipeline.

(a) The robot captures multiple images from various angles to obtain a wide field of view. These images include RGB images and depth maps (RGBDs), with each frame corresponding to its pose in the robot's base coordinate system.
(b) Since the pose information for each frame is known in the robot's base coordinate system, the relative poses between frames within the robot's coordinate system can be accurately calculated. This provides important geometric constraints for the subsequent relocalization steps.
(c) Using the LightGlue feature matching algorithm, the newly collected RGBDs are matched with the image data retained from the previous scene scanning phase. From the matched historical frames, the RGBDs most similar to the newly collected data and their corresponding poses in the world coordinate system are extracted.
(d) Based on the ACE model, the rough pose of each newly observed frame in the world coordinate system is estimated, and each estimate is assigned a confidence score. The pose with the highest confidence is then selected to transform the new observation point cloud into the world coordinate system.
(e) The historical image information associated with the new observation data (matched via LightGlue) is projected into the 3D point cloud within the world coordinate system, generating the target point cloud.
(f) Using the transformed point cloud from step (d) as the source point cloud and the target point cloud from step (e) as the reference, the Iterative Closest Point (ICP) algorithm is employed for precise registration of the two. After ICP optimization, the source point cloud is accurately aligned with the world coordinate system.
(g) Based on the above process, the pose of the new observation point cloud is accurately estimated. Panel (g) shows the point cloud result after relocalization, where the new observation point cloud successfully integrates into the global map with precise positioning.
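Steps (b) and (d) can be combined in a short example. This is one possible reading of those steps, not the authors' code: the most confident coarse estimate fixes the base-to-world transform, and the known relative poses carry it to every other frame. Poses are 4x4 homogeneous matrices stored as nested lists, the function names are hypothetical, and the ICP refinement of step (f) is not shown.

```python
from typing import List, Tuple

Mat = List[List[float]]  # 4x4 homogeneous rigid transform


def matmul(A: Mat, B: Mat) -> Mat:
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def invert_rigid(T: Mat) -> Mat:
    """Invert a rigid transform: R becomes R^T and t becomes -R^T t."""
    R_t = [[T[j][i] for j in range(3)] for i in range(3)]
    t = [T[i][3] for i in range(3)]
    new_t = [-sum(R_t[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [R_t[i] + [new_t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def propagate_world_poses(base_poses: List[Mat],
                          estimates: List[Tuple[Mat, float]]) -> List[Mat]:
    """base_poses: camera poses in the robot base frame.
    estimates: per-frame (coarse world pose, confidence) pairs."""
    best = max(range(len(estimates)), key=lambda i: estimates[i][1])
    # The most confident frame fixes the base-to-world transform.
    T_world_base = matmul(estimates[best][0], invert_rigid(base_poses[best]))
    return [matmul(T_world_base, T) for T in base_poses]


def transform_points(T: Mat, points):
    """Apply a rigid transform to a list of 3D points."""
    return [tuple(sum(T[i][k] * p[k] for k in range(3)) + T[i][3] for i in range(3))
            for p in points]


def translation(x: float, y: float, z: float) -> Mat:
    return [[1.0, 0.0, 0.0, x], [0.0, 1.0, 0.0, y],
            [0.0, 0.0, 1.0, z], [0.0, 0.0, 0.0, 1.0]]


# Two frames 1 m apart in the base frame; only the first estimate is trusted.
base_poses = [translation(0, 0, 0), translation(1, 0, 0)]
estimates = [(translation(10, 0, 0), 0.9), (translation(50, 0, 0), 0.2)]
world_poses = propagate_world_poses(base_poses, estimates)
origin_in_world = transform_points(world_poses[1], [(0.0, 0.0, 0.0)])
```

In this toy case the unreliable second estimate is ignored, and the second frame lands 1 m from the first in the world frame, as the base-frame geometry requires.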



We tested the effectiveness of DovSG in four rooms of varying sizes and complexity.

Room 1 is approximately 60 m², Room 2 is about 40 m², Room 3 is around 100 m², and Room 4 is about 150 m². Rooms 1, 2, and 4 have relatively complex scenes, while Room 3 is simpler.



The Success Rate of Long-term Tasks and SubTasks.

We compared DovSG with OK-Robot; the success rates on the subtasks and long-term tasks are shown in the table above.

YouTube Video

BibTeX

@misc{yan2024dynamicopenvocabulary3dscene,
        title={Dynamic Open-Vocabulary 3D Scene Graphs for Long-term Language-Guided Mobile Manipulation}, 
        author={Zhijie Yan and Shufei Li and Zuoxu Wang and Lixiu Wu and Han Wang and Jun Zhu and Lijiang Chen and Jihong Liu},
        year={2024},
        eprint={2410.11989},
        archivePrefix={arXiv},
        primaryClass={cs.RO},
        url={https://arxiv.org/abs/2410.11989}, 
  }