LiveScene: Language Embedding Interactive Radiance Fields for Physical Scene Rendering and Control

TL;DR: LiveScene embeds language features into interactive physical scenes, so that interactable objects can be grounded and manipulated with language instructions. It also contributes an interaction space factorization and new scene-level interactive datasets.

Overview

LiveScene aims to advance interactive scene reconstruction in the physical world by extending interactive object reconstruction from the single-object level to the complex-scene level. To accurately model the interactive motions of multiple objects in a complex scene, LiveScene proposes an efficient factorization that decomposes the scene into multiple local deformable fields, each of which reconstructs one interactive object separately. This achieves the first accurate and independent control of multiple interactive objects in a complex scene. LiveScene also introduces an interaction-aware language embedding method that generates varying language embeddings to localize individual interactive objects under different interactive states, enabling control of interactive objects using natural language.

Contributions:

The first scene-level language-embedded interactive radiance field, which efficiently reconstructs and controls complex physical scenes and allows multiple interactive objects within the neural scene to be controlled under diverse interaction variations and language instructions.
An efficient space factorization and sampling technique that decomposes interactive scenes into local deformable fields and samples interaction-relevant 3D points to control individual interactive objects in a complex scene.
An interaction-aware language embedding method that generates embeddings varying with the interaction state to localize and control interactive objects.
The first scene-level physical interaction datasets, OminiSim and InterReal, containing 28 subsets and 70 interactive objects. Extensive experiments demonstrate state-of-the-art performance in novel view synthesis, video frame interpolation, and interactive scene control.

Pipeline

Given a camera view and the control variable \(\boldsymbol{\kappa}\) of one specific interactive object, a series of 3D points is sampled in the local deformable field that models the interactive motion of that object, and the object is then rendered in its novel interactive state via volume rendering. In addition, an interaction-aware language embedding is used to localize and control individual interactive objects using natural language.
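The rendering step can be illustrated with a minimal sketch. It assumes the standard NeRF-style quadrature for volume rendering; the page does not state the exact form LiveScene uses. The function `toy_local_field` is a hypothetical stand-in for a learned local deformable field, and all names and parameters here are illustrative, not taken from the LiveScene code.

```python
import math

def composite_ray(densities, colors, deltas):
    """Blend per-sample colors along one ray with NeRF-style quadrature."""
    transmittance = 1.0  # fraction of light not yet absorbed along the ray
    pixel = [0.0, 0.0, 0.0]
    for sigma, rgb, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this ray segment
        weight = transmittance * alpha
        for c in range(3):
            pixel[c] += weight * rgb[c]
        transmittance *= 1.0 - alpha
    return pixel, transmittance

def toy_local_field(point, kappa):
    """Hypothetical field: a thin red slab whose depth along z moves with kappa."""
    _, _, z = point
    inside = abs(z - (1.0 + kappa)) < 0.1
    return (50.0 if inside else 0.0), (1.0, 0.0, 0.0)

def render(origin, direction, kappa, near=0.0, far=4.0, n_samples=64):
    """Sample points on the ray, query the field with kappa, and composite."""
    step = (far - near) / n_samples
    densities, colors = [], []
    for i in range(n_samples):
        t = near + (i + 0.5) * step  # midpoint of the i-th segment
        p = tuple(o + t * d for o, d in zip(origin, direction))
        sigma, rgb = toy_local_field(p, kappa)
        densities.append(sigma)
        colors.append(rgb)
    return composite_ray(densities, colors, [step] * n_samples)
```

Calling `render` on the same ray with different values of `kappa` moves the slab, which is the toy analogue of rendering an object in a new interactive state from a fixed camera view.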

Multi-scale Interaction Space Factorization

LiveScene maintains multiple local deformable fields \(\left \{\mathcal{R}_1, \mathcal{R}_2, \cdots \mathcal{R}_\alpha \right \}\), one for each interactive object in the 4D space, and projects high-dimensional interaction features into a compact multi-scale 4D space. During training, LiveScene applies a feature repulsion loss to amplify the feature differences between distinct deformable scenes, which relieves conflicts in ray sampling and feature storage at field boundaries.
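The page does not give the formula for the feature repulsion loss. The sketch below shows one plausible form only, as an assumption: a hinge penalty on the cosine similarity between feature vectors drawn from different local fields, so that similar features from distinct fields are pushed apart. The function names and the `margin` parameter are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def repulsion_loss(field_features, margin=0.0):
    """Mean hinge penalty over all pairs of fields (assumed form, not the paper's).

    field_features[i] is one feature vector sampled from local field i.
    A pair contributes only when its similarity exceeds `margin`.
    """
    n = len(field_features)
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            sim = cosine_similarity(field_features[i], field_features[j])
            total += max(0.0, sim - margin)
            pairs += 1
    return total / pairs if pairs else 0.0
```

With this form, orthogonal features incur no penalty and identical features incur the maximum penalty, which matches the stated goal of amplifying differences between fields.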

Interaction-Aware Language Embedding

LiveScene leverages the proposed multi-scale interaction space factorization to store language features efficiently in lightweight planes, indexed by maximum-probability sampling, instead of in a 3D field as in LERF. For any sampling point \(\mathbf{p}\), it retrieves the local language feature group and performs bilinear interpolation over the surrounding CLIP features to obtain a language embedding that adapts to changes in the interactive variable.
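The lookup described above can be sketched as follows, under stated assumptions: each interactive region owns one small 2D grid of feature vectors, a per-point probability list selects the region, and the two plane coordinates `u` and `v` (for example a spatial coordinate and the control variable, already scaled to grid units) are hypothetical. The actual plane layout and feature dimensions are not given on this page.

```python
def bilinear_lookup(plane, u, v):
    """Bilinearly interpolate a grid of feature vectors.

    plane[row][col] is a feature vector; u is a continuous column
    coordinate and v a continuous row coordinate, both in grid units.
    """
    rows, cols = len(plane), len(plane[0])
    u = min(max(u, 0.0), cols - 1.0)  # clamp to the grid
    v = min(max(v, 0.0), rows - 1.0)
    c0, r0 = int(u), int(v)
    c1, r1 = min(c0 + 1, cols - 1), min(r0 + 1, rows - 1)
    fu, fv = u - c0, v - r0  # fractional offsets inside the cell
    dim = len(plane[0][0])
    return [
        (1 - fu) * (1 - fv) * plane[r0][c0][k]
        + fu * (1 - fv) * plane[r0][c1][k]
        + (1 - fu) * fv * plane[r1][c0][k]
        + fu * fv * plane[r1][c1][k]
        for k in range(dim)
    ]

def language_embedding(region_probs, planes, u, v):
    """Select the most probable region's plane, then interpolate on it."""
    region = max(range(len(region_probs)), key=lambda i: region_probs[i])
    return region, bilinear_lookup(planes[region], u, v)
```

Because one of the plane axes can carry the control variable, the interpolated embedding changes smoothly as the object's state changes, while storage stays at the cost of a few small 2D grids.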

Interactive Dataset

To our knowledge, existing view synthesis datasets for interactive scene rendering are limited to a few interactive objects, which makes it impractical to scale up to real scenarios involving multi-object interactions. To bridge this gap, we construct two scene-level, high-quality annotated datasets to advance research in reconstructing and understanding interactive scenes: OminiSim and InterReal. Together they contain 28 subsets and 70 interactive objects with 2 million samples, and provide RGB-D images, camera trajectories, interactive object masks, prompt captions, and the corresponding object state quantities at each time step.
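The annotations listed above suggest a per-time-step record along the following lines. This is a hypothetical schema for illustration only: the class name, field names, and file layout are assumptions, and the released datasets may organize their data differently.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class InteractiveSample:
    """One time step of one scene (hypothetical layout)."""
    timestep: int
    rgb_path: str                        # color image
    depth_path: str                      # depth image
    camera_to_world: List[List[float]]   # 4x4 camera pose
    mask_paths: Dict[str, str]           # object id -> mask image
    captions: Dict[str, str]             # object id -> prompt caption
    states: Dict[str, float]             # object id -> state quantity

def state_trajectory(samples, object_id):
    """Return one object's state quantities ordered by time step."""
    ordered = sorted(samples, key=lambda s: s.timestep)
    return [s.states[object_id] for s in ordered if object_id in s.states]
```

A trajectory extracted this way is the kind of per-object control signal that the pipeline's control variable \(\boldsymbol{\kappa}\) would be supervised with.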

Acknowledgement

We adapt code from several excellent repositories, including Nerfstudio, OmniGibson, K-Planes, LERF, and CoNeRF. Thanks for making the code available! 🤗

Citation

  If you use this work or find it helpful, please consider citing:
  @misc{livescene2024,
    title={LiveScene: Language Embedding Interactive Radiance Fields for Physical Scene Rendering and Control},
    author={Delin Qu and Qizhi Chen and Pingrui Zhang and Xianqiang Gao and Bin Zhao and Zhigang Wang and Dong Wang and Xuelong Li},
    year={2024},
    eprint={2406.16038},
    archivePrefix={arXiv}
  }
