Shaofei Cai · Researcher at DeepSeek AI · Ph.D. at Peking University

Selected Publications

Scalable Multi-Task Reinforcement Learning for Generalizable Spatial Intelligence in Visuomotor Agents

Shaofei Cai*, Zhancun Mu*, Haiwen Xia, Bowei Zhang, Anji Liu, Yitao Liang (* equal contribution)

arXiv 2025

The first multi-task policy trained with reinforcement learning in Minecraft, demonstrating zero-shot generalization to other 3D domains.

[Paper] [Code] [Page]

ROCKET-2: Steering Visuomotor Policy via Cross-View Goal Alignment

Shaofei Cai, Zhancun Mu, Anji Liu, Yitao Liang

Proceedings of the AAAI Conference on Artificial Intelligence (AAAI'26) 2025

We aim to develop a goal specification method that is semantically clear, spatially sensitive, and intuitive for human users to guide agent interactions in embodied environments. Specifically, we propose a novel cross-view goal alignment framework that allows users to specify target objects using segmentation masks from their own camera views rather than the agent's observations.

[Paper] [Code] [Page]

ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting

Shaofei Cai, Zihao Wang, Kewei Lian, Zhancun Mu, Xiaojian Ma, Anji Liu, Yitao Liang

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR'25) 2025

We propose visual-temporal context prompting, a novel communication protocol between VLMs and policy models. This protocol leverages object segmentation from past observations to guide policy-environment interactions. Using this approach, we train ROCKET-1, a low-level policy that predicts actions based on concatenated visual observations and segmentation masks, supported by real-time object tracking from SAM-2.

[Paper] [Code] [Demo] [Page] [Video] [Twitter]

GROOT: Learning to Follow Instructions by Watching Gameplay Videos

Shaofei Cai, Bowei Zhang, Zihao Wang, Xiaojian Ma, Anji Liu, Yitao Liang

International Conference on Learning Representations (ICLR'24) 2024, Spotlight (top 6.2%)

This paper studies the problem of building a controller that can follow open-ended instructions in open-world environments. We propose to follow reference videos as instructions, which offer expressive goal specifications while eliminating the need for expensive text-gameplay annotations. A new learning framework is derived to allow learning such instruction-following controllers from gameplay videos while producing a video instruction encoder that induces a structured goal space.

[Paper] [Code] [Page] [Twitter]

Automatic Relation-aware Graph Network Proliferation

Shaofei Cai, Liang Li, Xinzhe Han, Jiebo Luo, Zhengjun Zha, Qingming Huang

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR'22) 2022, Oral (top 4.2%)

This paper proposes Automatic Relation-aware Graph Network Proliferation (ARGNP) for efficiently searching GNNs with a relation-guided message passing mechanism. Specifically, we first devise a novel dual relation-aware graph search space that comprises both node and relation learning operations. These operations can extract hierarchical node/relational information and provide anisotropic guidance for message passing on a graph. Second, analogous to cell proliferation, we design a network proliferation search paradigm to progressively determine the GNN architectures by iteratively performing network division and differentiation.

[Paper] [Code] [Poster]
