Shaofei Cai (蔡少斐)
Ph.D. Candidate at the Institute for Artificial Intelligence, Peking University

My dream is to witness the emergence of artificial general intelligence and to contribute to bringing it about.

My doctoral research centers on multi-task agents operating in open-world settings. A core thread of this work is designing task representations that are highly expressive, unambiguous, and scalable for efficient training. In 3D video games, I led the development of the GROOT and ROCKET series as first author; these projects enable AI agents to carry out intricate tasks in Minecraft from human instructions, repeatedly pushing the frontier of agent capabilities in that environment. I have also explored reinforcement learning techniques for strengthening the visual reasoning of visuomotor agents, and I am especially enthusiastic about building agents that perceive and remember with the fluidity and coherence of human cognition. More broadly, I closely follow progress on general agents, and I believe that scalable task generation and verification in digital worlds is currently the most viable route to Artificial General Intelligence (AGI).

Research Interests: Machine Learning, Computer Vision, Reinforcement Learning, AI Agents, Embodied Intelligence


Education
  • Peking University
    Institute for Artificial Intelligence
    Ph.D. Candidate
    Sep. 2022 - Jul. 2026
  • University of Chinese Academy of Sciences
    Institute of Computing Technology, Chinese Academy of Sciences
    M.S. in Computer Science
    Sep. 2019 - Jul. 2022
  • Xi'an Jiaotong University
    Software College
    B.Eng. in Software Engineering
    Sep. 2015 - Jul. 2019
Experience
  • Beijing Institute for General Artificial Intelligence
    Research Intern
    Mar. 2022 - Sep. 2024
  • Beijing ByteDance Technology Co., Ltd.
    Research Intern
    Oct. 2020 - May 2021
Honors & Awards
  • Gold Medal 🏅 Top 3.3%
    at the 43rd International Collegiate Programming Contest (ICPC), Asia Regional
    2018
  • Outstanding Student Award
    at Xi’an Jiaotong University for undergraduate students
    2019
  • 3rd Academic Star Award (Top 5 Students)
    at Peking University
    2025
  • Winner (1st Place)
    at the ATEC 2025 Robotics and AI Challenge, Software Track
    2025
News
2025
We have released "ROCKET-2", a state-of-the-art Minecraft agent that supports cross-view goal specification!
Mar 05
We are happy that "ROCKET-1" has been accepted by the Computer Vision and Pattern Recognition (CVPR) 2025!
Feb 28
We are happy that "GROOT-2" has been accepted by the International Conference on Learning Representations (ICLR) 2025!
Jan 23
2024
We have released the first comprehensive Minecraft AI agent development toolkit, "MineStudio"!
Oct 19
We have released "ROCKET-1", the first AI agent to master all the interaction tasks in the Minecraft! Code Demo
Oct 19
We are happy that "OmniJARVIS" has been accepted by the Neural Information Processing Systems (NeurIPS) 2024!
Oct 01
We are happy that "GROOT" was accepted by the International Conference on Learning Representations (ICLR) 2024! Spotlight Top 6.2%
Jan 30
Selected Publications
ROCKET-2: Steering Visuomotor Policy via Cross-View Goal Alignment

Shaofei Cai, Zhancun Mu, Anji Liu, Yitao Liang

arXiv 2025

We aim to develop a goal specification method that is semantically clear, spatially sensitive, and intuitive for human users to guide agent interactions in embodied environments. Specifically, we propose a novel cross-view goal alignment framework that allows users to specify target objects using segmentation masks from their own camera views rather than the agent's observations.
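
To make the framework concrete, the sketch below shows one way a policy could consume a goal specified in a different camera view: the user's frame and mask are fused into a goal token that the agent's own observation attends to. All module names, shapes, and the cross-attention fusion are illustrative assumptions, not the released ROCKET-2 architecture.

```python
import torch
import torch.nn as nn

class CrossViewGoalPolicy(nn.Module):
    """Minimal sketch of a cross-view goal-conditioned policy.

    The user marks the target object with a segmentation mask on *their*
    camera view; the policy must ground that goal in the agent's own
    observation. Everything here is illustrative, not ROCKET-2 itself.
    """

    def __init__(self, dim=256, n_actions=100):
        super().__init__()
        # Shared visual backbone for agent frames and user-view frames.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(64 * 16, dim),
        )
        # The user-view mask enters as a 1-channel image.
        self.mask_enc = nn.Sequential(
            nn.Conv2d(1, 16, 8, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(16 * 16, dim),
        )
        # Cross-attention: the agent's view queries the (frame, mask) goal.
        self.cross_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.action_head = nn.Linear(dim, n_actions)

    def forward(self, agent_obs, user_frame, user_mask):
        q = self.backbone(agent_obs).unsqueeze(1)            # (B, 1, dim)
        goal = self.backbone(user_frame) + self.mask_enc(user_mask)
        goal = goal.unsqueeze(1)                             # (B, 1, dim)
        fused, _ = self.cross_attn(q, goal, goal)
        return self.action_head(fused.squeeze(1))            # action logits

# One 128x128 agent frame plus a goal mask drawn on the user's own view.
policy = CrossViewGoalPolicy()
logits = policy(torch.rand(1, 3, 128, 128),
                torch.rand(1, 3, 128, 128),
                torch.rand(1, 1, 128, 128))
```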

ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting

Shaofei Cai, Zihao Wang, Kewei Lian, Zhancun Mu, Xiaojian Ma, Anji Liu, Yitao Liang

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2025

We propose visual-temporal context prompting, a novel communication protocol between VLMs and policy models. This protocol leverages object segmentation from past observations to guide policy-environment interactions. Using this approach, we train ROCKET-1, a low-level policy that predicts actions based on concatenated visual observations and segmentation masks, supported by real-time object tracking from SAM-2.
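
The sketch below illustrates the shape of this protocol: each RGB frame is concatenated channel-wise with the object mask for that timestep, and a causal transformer fuses the visual-temporal context into per-step actions. Shapes and module choices are assumptions for illustration, not ROCKET-1's actual implementation; in the real system the masks come from SAM-2 tracking rather than random tensors.

```python
import torch
import torch.nn as nn

class VisualTemporalPolicy(nn.Module):
    """Sketch of visual-temporal context prompting (names illustrative).

    Each timestep's RGB frame is concatenated with a binary mask that
    highlights the object to interact with; a causal transformer then
    turns the history of (frame, mask) pairs into action predictions.
    """

    def __init__(self, dim=256, n_actions=100, max_len=128):
        super().__init__()
        self.encoder = nn.Sequential(            # 4 channels: RGB + mask
            nn.Conv2d(4, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(2), nn.Flatten(),
            nn.Linear(64 * 4, dim),
        )
        self.pos = nn.Embedding(max_len, dim)
        layer = nn.TransformerEncoderLayer(dim, 4, dim * 4, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, n_actions)

    def forward(self, frames, masks):
        # frames: (B, T, 3, H, W); masks: (B, T, 1, H, W)
        b, t = frames.shape[:2]
        x = torch.cat([frames, masks], dim=2).flatten(0, 1)  # (B*T, 4, H, W)
        x = self.encoder(x).view(b, t, -1) + self.pos(torch.arange(t))
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        x = self.temporal(x, mask=causal)        # no peeking at the future
        return self.head(x)                      # per-step action logits

policy = VisualTemporalPolicy()
logits = policy(torch.rand(2, 8, 3, 128, 128), torch.rand(2, 8, 1, 128, 128))
```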

MineStudio: A Streamlined Package for Minecraft AI Agent Development

Shaofei Cai*, Zhancun Mu*, Kaichen He, Bowei Zhang, Xinyue Zheng, Anji Liu, Yitao Liang (* equal contribution)

arXiv 2024

MineStudio is an open-source software package designed to streamline embodied policy development in Minecraft. MineStudio represents the first comprehensive integration of seven critical engineering components: simulator, data, model, offline pretraining, online finetuning, inference, and benchmark, thereby allowing users to concentrate their efforts on algorithm innovation.
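
As a flavor of the simulator component, a minimal interaction loop might look like the sketch below. It assumes a Gymnasium-style interface; the class name MinecraftSim and its default construction are assumptions based on the package's documented style, so consult the MineStudio docs for the exact entry points.

```python
# Minimal interaction loop, assuming a Gymnasium-style simulator API.
# `MinecraftSim` and its defaults are assumptions; check the MineStudio
# documentation for the actual class names and arguments.
from minestudio.simulator import MinecraftSim

sim = MinecraftSim()                       # assumed default construction
obs, info = sim.reset()
for _ in range(100):
    action = sim.action_space.sample()     # stand-in for a trained policy
    obs, reward, terminated, truncated, info = sim.step(action)
    if terminated or truncated:
        obs, info = sim.reset()
sim.close()
```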

GROOT: Learning to Follow Instructions by Watching Gameplay Videos

Shaofei Cai, Bowei Zhang, Zihao Wang, Xiaojian Ma, Anji Liu, Yitao Liang

International Conference on Learning Representations (ICLR) 2024 Spotlight Top 6.2%

This paper studies the problem of building a controller that can follow open-ended instructions in open-world environments. We propose to follow reference videos as instructions, which offer expressive goal specifications while eliminating the need for expensive text-gameplay annotations. A new learning framework is derived to allow learning such instruction-following controllers from gameplay videos while producing a video instruction encoder that induces a structured goal space.
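
The sketch below captures the self-supervised recipe in miniature: a gameplay clip is encoded into a goal latent and serves as its own instruction, so the policy learns to follow video goals by behavior cloning on unlabeled videos. The module choices (GRU encoders, a plain cross-entropy loss) are simplifying assumptions, not the exact GROOT design.

```python
import torch
import torch.nn as nn

class VideoInstructionAgent(nn.Module):
    """Sketch of instruction-following learned from unlabeled gameplay.

    Training is self-supervised: a clip is both the instruction and the
    demonstration. The video encoder compresses it into a goal latent g,
    and the policy must reproduce the clip's actions conditioned on g,
    which shapes a structured goal space without text annotations.
    (Module choices are illustrative, not the exact GROOT design.)
    """

    def __init__(self, dim=256, n_actions=100):
        super().__init__()
        self.frame_enc = nn.Sequential(
            nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(2), nn.Flatten(), nn.Linear(32 * 4, dim),
        )
        self.video_enc = nn.GRU(dim, dim, batch_first=True)   # clip -> goal
        self.policy = nn.GRU(2 * dim, dim, batch_first=True)
        self.head = nn.Linear(dim, n_actions)

    def encode_video(self, clip):                # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        f = self.frame_enc(clip.flatten(0, 1)).view(b, t, -1)
        _, g = self.video_enc(f)
        return g.squeeze(0)                      # goal latent (B, dim)

    def forward(self, obs, goal):                # obs: (B, T, 3, H, W)
        b, t = obs.shape[:2]
        f = self.frame_enc(obs.flatten(0, 1)).view(b, t, -1)
        g = goal.unsqueeze(1).expand(-1, t, -1)
        h, _ = self.policy(torch.cat([f, g], dim=-1))
        return self.head(h)                      # per-step action logits

# Self-imitation step: the clip is both instruction and demonstration.
agent = VideoInstructionAgent()
clip = torch.rand(2, 8, 3, 64, 64)
actions = torch.randint(0, 100, (2, 8))
logits = agent(clip, agent.encode_video(clip))
loss = nn.functional.cross_entropy(logits.flatten(0, 1), actions.flatten())
```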

Automatic Relation-aware Graph Network Proliferation

Shaofei Cai, Liang Li, Xinzhe Han, Jiebo Luo, Zhengjun Zha, Qingming Huang

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022 Oral Top 4.2%

This paper proposes Automatic Relation-aware Graph Network Proliferation (ARGNP) for efficiently searching GNNs with a relation-guided message passing mechanism. Specifically, we first devise a novel dual relation-aware graph search space that comprises both node and relation learning operations. These operations can extract hierarchical node/relational information and provide anisotropic guidance for message passing on a graph. Second, analogous to cell proliferation, we design a network proliferation search paradigm to progressively determine the GNN architectures by iteratively performing network division and differentiation.
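
As a minimal illustration of the relation-guided message passing that the search space is built from (a hand-written layer, not a searched architecture), the sketch below gates each message by a learned function of its edge's relation vector and updates relation features alongside node features.

```python
import torch
import torch.nn as nn

class RelationAwareLayer(nn.Module):
    """One relation-guided message-passing step (illustrative sketch).

    Unlike isotropic GNN layers that aggregate neighbors uniformly, each
    message is gated by a learned function of the edge's relation vector,
    giving anisotropic, edge-dependent propagation. ARGNP searches over
    operations of this kind; the residual relation update here is a
    fixed simplification.
    """

    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)            # relation -> gate
        self.rel_upd = nn.Linear(3 * dim, dim)     # (src, dst, rel) -> rel
        self.out = nn.Linear(2 * dim, dim)

    def forward(self, x, edge_index, rel):
        # x: (N, dim) nodes; edge_index: (2, E); rel: (E, dim) relations
        src, dst = edge_index
        gates = torch.sigmoid(self.gate(rel))      # anisotropic edge gates
        msgs = gates * self.msg(x[src])            # (E, dim) messages
        agg = torch.zeros_like(x).index_add_(0, dst, msgs)  # sum per node
        new_rel = rel + self.rel_upd(torch.cat([x[src], x[dst], rel], -1))
        return self.out(torch.cat([x, agg], -1)), new_rel

# Tiny graph: 3 nodes, 2 directed edges with relation features.
layer = RelationAwareLayer(dim=16)
x = torch.rand(3, 16)
edge_index = torch.tensor([[0, 1], [1, 2]])        # edges 0->1 and 1->2
rel = torch.rand(2, 16)
x, rel = layer(x, edge_index, rel)
```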
