Shaofei Cai (蔡少斐)
Ph.D. Candidate at the Institute for Artificial Intelligence, Peking University

My dream is to witness the emergence of artificial general intelligence and to contribute my own efforts toward it.

During my undergraduate years, I was passionate about classical algorithms and competed in programming contests, which not only honed my programming skills but also taught me to approach problems with computational thinking. In graduate school, I delved into deep learning, working on neural architecture search and computer vision; these experiences drew me into the fascinating world of artificial intelligence and marked the beginning of my journey as a researcher. My doctoral research focuses on open-world instruction-following agents, spanning both the technical and theoretical sides of the field. Over time, I have learned to examine machine learning problems from a broader, more unified perspective, often through the lens of probability and learning. I am deeply passionate about the theory and application of general-purpose foundation models for decision-making and am committed to advancing research in this area.

Research Interests: Machine Learning, Generative Models, Computer Vision, Robotics, Sequential Control


Education
    Peking University
    Institute for Artificial Intelligence
    Ph.D. Candidate
    Sep. 2022 - Jul. 2026
    University of Chinese Academy of Sciences
    Institute of Computing Technology, Chinese Academy of Sciences
    M.S. in Computer Science
    Sep. 2019 - Jul. 2022
    Xi'an Jiaotong University
    Software College
    B.Eng. in Software Engineering
    Sep. 2015 - Jul. 2019
Experience
    Beijing Institute for General Artificial Intelligence
    Research Intern
    Mar. 2022 - Sep. 2024
    Beijing ByteDance Technology Co., Ltd.
    Research Intern
    Oct. 2020 - May 2021
Honors & Awards
  • Gold Medal 🏅 Top 3.3%
    at the 43rd International Collegiate Programming Contest (ICPC), Asia Regional
    2018
  • Outstanding Student Award
    for undergraduate students at Xi’an Jiaotong University
    2019
News
2025
We are happy that "GROOT-2" has been accepted at the International Conference on Learning Representations (ICLR) 2025!
Jan 23
2024
We have released the first comprehensive Minecraft AI agent development toolkit, "MineStudio"! Doc PyPI Dataset
Oct 19
We have released "ROCKET-1", the first AI agent to master all the interaction tasks in Minecraft! Code Demo
Oct 19
We are happy that "OmniJARVIS" has been accepted at the Conference on Neural Information Processing Systems (NeurIPS) 2024!
Oct 01
We are happy that "GROOT" was accepted at the International Conference on Learning Representations (ICLR) 2024 as a Spotlight (Top 6.2%)!
Jan 30
Selected Publications (view all)
MineStudio: A Streamlined Package for Minecraft AI Agent Development

Shaofei Cai*, Zhancun Mu*, Kaichen He, Bowei Zhang, Xinyue Zheng, Anji Liu, Yitao Liang (* equal contribution)

arXiv preprint, 2024

MineStudio is an open-source software package designed to streamline embodied policy development in Minecraft. MineStudio represents the first comprehensive integration of seven critical engineering components: simulator, data, model, offline pretraining, online finetuning, inference, and benchmark, thereby allowing users to concentrate their efforts on algorithm innovation.

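To make the seven-component pipeline concrete, the sketch below wires toy stand-ins for the stages end to end. Every class and function name here is an illustrative placeholder, not MineStudio's actual API.

```python
# Toy end-to-end sketch of the seven-stage pipeline described above:
# simulator, data, model, offline pretraining, online finetuning,
# inference, and benchmark. All names are illustrative stand-ins,
# not MineStudio's actual API.

class Simulator:
    """Stand-in for a Minecraft environment wrapper."""
    def reset(self):
        return {"pov": None}
    def step(self, action):
        return {"pov": None}, 0.0, False  # obs, reward, done

class TrajectoryDataset:
    """Stand-in for a dataset of recorded gameplay trajectories."""
    def __iter__(self):
        yield {"obs": {"pov": None}, "action": 0}

class Policy:
    """Stand-in policy model mapping observations to actions."""
    def act(self, obs):
        return 0
    def update(self, batch):
        pass  # one gradient step in a real implementation

def pretrain_offline(policy, dataset):
    for batch in dataset:          # behavior-cloning-style pretraining
        policy.update(batch)
    return policy

def finetune_online(policy, sim, steps=10):
    obs = sim.reset()              # e.g. RL or interactive finetuning
    for _ in range(steps):
        obs, reward, done = sim.step(policy.act(obs))
    return policy

def run_benchmark(policy, sim):
    return {"success_rate": None}  # placeholder inference + evaluation

if __name__ == "__main__":
    sim, data, policy = Simulator(), TrajectoryDataset(), Policy()
    policy = pretrain_offline(policy, data)
    policy = finetune_online(policy, sim)
    print(run_benchmark(policy, sim))
```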

ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting

Shaofei Cai, Zihao Wang, Kewei Lian, Zhancun Mu, Xiaojian Ma, Anji Liu, Yitao Liang

arXiv preprint, 2024

We propose visual-temporal context prompting, a novel communication protocol between VLMs and policy models. This protocol leverages object segmentation from past observations to guide policy-environment interactions. Using this approach, we train ROCKET-1, a low-level policy that predicts actions based on concatenated visual observations and segmentation masks, supported by real-time object tracking from SAM-2.

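A minimal sketch of the interface this describes: the policy takes an RGB observation concatenated channel-wise with a binary segmentation mask (such as one produced by an external tracker like SAM-2) and outputs action logits. The shapes and the tiny convolutional network are assumptions for illustration, not ROCKET-1's actual architecture.

```python
# Illustrative sketch: a policy conditioned on an RGB frame plus an
# object-segmentation mask, concatenated along the channel dimension.
# Shapes and the network are assumptions, not ROCKET-1's architecture.
import torch
import torch.nn as nn

class MaskConditionedPolicy(nn.Module):
    def __init__(self, num_actions: int = 8):
        super().__init__()
        # 3 RGB channels + 1 segmentation-mask channel
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, num_actions)

    def forward(self, rgb: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        x = torch.cat([rgb, mask], dim=1)   # concatenate along channels
        return self.head(self.encoder(x))   # action logits

# Example: one 128x128 frame plus a mask highlighting a hypothetical tracked object.
policy = MaskConditionedPolicy()
rgb = torch.rand(1, 3, 128, 128)
mask = torch.zeros(1, 1, 128, 128)
mask[..., 40:80, 40:80] = 1.0               # hypothetical tracked region
logits = policy(rgb, mask)                   # -> shape (1, 8)
```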

GROOT: Learning to Follow Instructions by Watching Gameplay Videos

Shaofei Cai, Bowei Zhang, Zihao Wang, Xiaojian Ma, Anji Liu, Yitao Liang

International Conference on Learning Representations (ICLR) 2024, Spotlight (Top 6.2%)

This paper studies the problem of building a controller that can follow open-ended instructions in open-world environments. We propose to follow reference videos as instructions, which offer expressive goal specifications while eliminating the need for expensive text-gameplay annotations. A new learning framework is derived to allow learning such instruction-following controllers from gameplay videos while producing a video instruction encoder that induces a structured goal space.

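As a rough illustration of this setup, the sketch below encodes a reference video into a goal embedding and conditions a policy on it together with the current observation. The dimensions and modules are assumptions for illustration, not GROOT's actual architecture.

```python
# Conceptual sketch: a video instruction encoder producing a goal
# embedding, and a policy conditioned on that goal. Dimensions and
# modules are illustrative assumptions, not GROOT's architecture.
import torch
import torch.nn as nn

class VideoInstructionEncoder(nn.Module):
    """Encodes a reference clip (T frames of features) into a goal vector."""
    def __init__(self, feat_dim=64, goal_dim=32):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, goal_dim, batch_first=True)

    def forward(self, video_feats):            # (B, T, feat_dim)
        _, h = self.rnn(video_feats)
        return h[-1]                            # (B, goal_dim)

class GoalConditionedPolicy(nn.Module):
    def __init__(self, obs_dim=64, goal_dim=32, num_actions=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + goal_dim, 64), nn.ReLU(),
            nn.Linear(64, num_actions),
        )

    def forward(self, obs, goal):
        return self.net(torch.cat([obs, goal], dim=-1))

encoder, policy = VideoInstructionEncoder(), GoalConditionedPolicy()
video = torch.rand(1, 16, 64)                   # a 16-frame reference clip (features)
goal = encoder(video)                           # goal embedding from the video
action_logits = policy(torch.rand(1, 64), goal)
```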

Automatic Relation-aware Graph Network Proliferation

Shaofei Cai, Liang Li, Xinzhe Han, Jiebo Luo, Zhengjun Zha, Qingming Huang

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022, Oral (Top 4.2%)

This paper proposes Automatic Relation-aware Graph Network Proliferation (ARGNP) for efficiently searching GNNs with a relation-guided message passing mechanism. Specifically, we first devise a novel dual relation-aware graph search space that comprises both node and relation learning operations. These operations can extract hierarchical node/relational information and provide anisotropic guidance for message passing on a graph. Second, analogous to cell proliferation, we design a network proliferation search paradigm to progressively determine the GNN architectures by iteratively performing network division and differentiation.

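The sketch below illustrates the kind of relation-guided message passing the search space builds on: per-edge relation features gate the messages exchanged between nodes, giving anisotropic aggregation. This is a generic illustration, not one of ARGNP's searched operations.

```python
# Generic relation-aware message-passing layer: edge (relation)
# features modulate node-to-node messages. Illustrative only, not
# an ARGNP operation.
import torch
import torch.nn as nn

class RelationAwareLayer(nn.Module):
    def __init__(self, node_dim=16, edge_dim=8):
        super().__init__()
        self.msg = nn.Linear(node_dim + edge_dim, node_dim)
        self.upd = nn.Linear(2 * node_dim, node_dim)

    def forward(self, h, e, edges):
        # h: (N, node_dim) node features; e: (E, edge_dim) relation features
        # edges: (E, 2) tensor of (source, target) indices
        src, dst = edges[:, 0], edges[:, 1]
        m = torch.relu(self.msg(torch.cat([h[src], e], dim=-1)))  # relation-gated messages
        agg = torch.zeros_like(h).index_add_(0, dst, m)           # sum messages per target node
        return torch.relu(self.upd(torch.cat([h, agg], dim=-1)))  # node update

layer = RelationAwareLayer()
h = torch.rand(4, 16)                          # 4 nodes
edges = torch.tensor([[0, 1], [1, 2], [2, 3]]) # directed edges
e = torch.rand(3, 8)                           # one relation vector per edge
h_new = layer(h, e, edges)                     # -> (4, 16)
```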

All publications