EgoSim: Egocentric World Simulator for Embodiment Interaction Generation

Jinkun Hao1*, Mingda Jia2*, Ruiyan Wang1, Xihui Liu3, Ran Yi1†, Lizhuang Ma1†, Jiangmiao Pang2, Xudong Xu2
1 Shanghai Jiao Tong University    2 Shanghai AI Laboratory    3 The University of Hong Kong
* Equal Contribution    † Corresponding Authors

Abstract

World simulators generate realistic synthetic observations from an initial environment state and the actions of embodiments within the world. A generalized egocentric world simulator should generate diverse embodiment-object interactions with high spatial consistency across a variety of real-life scenes. It is also critical to memorize and update the environment state from generated observations to enable continuous simulation. To address these challenges, we propose EgoSim, an egocentric world simulator that generates high-quality interactions from dexterous action inputs while maintaining an updatable, interaction-aware 3D state for continuous simulation. To improve generalization, we design a scalable data pipeline that extracts high-quality scene-interaction pairs from in-the-wild egocentric videos. Extensive experiments demonstrate that EgoSim outperforms existing methods in interaction quality, diversity, spatial consistency, and generalization, while also enabling continuous generation.

EgoSim Teaser

EgoSim. Given an initial 3D state and a sequence of actions, EgoSim generates temporally and spatially consistent egocentric observations and high-quality dexterous interactions. EgoSim also persistently updates a 3D state for continuous simulation. We propose a data curation pipeline that strengthens the generalization ability of EgoSim with scalable scene-interaction pairs. EgoSim can transfer to in-the-wild real scenes and multiple embodiments with few-shot demonstrations.
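The continuous-simulation loop described above, in which each generated chunk of observations is folded back into a persistent 3D state that conditions the next chunk, can be sketched as follows. This is a minimal illustrative sketch: all class and function names (`SceneState`, `simulate_chunk`, `update_state`, `rollout`) are assumptions for exposition, not the authors' actual API, and the generator is a stub rather than the real video model.

```python
from dataclasses import dataclass, field

@dataclass
class SceneState:
    """Stand-in for EgoSim's interaction-aware 3D state (hypothetical)."""
    frame: int = 0                                   # frames simulated so far
    history: list = field(default_factory=list)      # accumulated observations

def simulate_chunk(state: SceneState, actions: list) -> list:
    """Stub generator: one placeholder 'observation' per input action.
    In the real system this would be the video model conditioned on the
    current 3D state and the dexterous action sequence."""
    return [f"obs_{state.frame + i}" for i in range(len(actions))]

def update_state(state: SceneState, observations: list) -> SceneState:
    """Fold generated observations back into the persistent 3D state."""
    state.history.extend(observations)
    state.frame += len(observations)
    return state

def rollout(state: SceneState, action_chunks: list) -> list:
    """Continuous simulation: generate a chunk (e.g. frames 0-60),
    update the state, then condition the next chunk (frames 60-120) on it."""
    all_obs = []
    for actions in action_chunks:
        obs = simulate_chunk(state, actions)
        state = update_state(state, obs)
        all_obs.extend(obs)
    return all_obs
```

The key design point this sketch mirrors is that the state update happens between chunks, so later chunks remain spatially consistent with what was already generated rather than with only the initial scene.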

Continuous Generation

Scenario: Add / Remove Lid

Example 1: Condition (frames 0-60) → Prediction (frames 0-60) → Updated 3D State → Condition (frames 60-120) → Prediction (frames 60-120)
Example 2: Condition (frames 0-60) → Prediction (frames 0-60) → Updated 3D State → Condition (frames 60-120) → Prediction (frames 60-120)

Scenario: Make Sandwich

Example 1: Condition (frames 0-60) → Prediction (frames 0-60) → Updated 3D State → Condition (frames 60-120) → Prediction (frames 60-120)
Example 2: Condition (frames 0-60) → Prediction (frames 0-60) → Updated 3D State → Condition (frames 60-120) → Prediction (frames 60-120)

Visual Comparisons

We compare our method with various baselines on the EgoDex and EgoVid datasets. Use the arrows to switch scenes and the buttons to select baseline methods.

EgoDex

EgoVid

EgoCap Results

Results of our method on the EgoCap dataset.

Agibot Ablation

Ablation study on the effect of hand-data pretraining.