I am a second-year M.S. student in Computer Science at UMass Amherst, advised by Prof. Chuang Gan. I also work closely with Frank Dou from MIT CSAIL. In addition, I collaborate with Dr. Julian Tanke and Dr. Takashi Shibuya at SonyAI on ongoing research projects. I received my B.Eng. degree from Fudan University, where I was fortunate to be advised by Prof. Yaqian Zhou. During my senior year, I was a research intern at SRIBD, supervised by Prof. Haizhou Li.
My research interest lies in human-centered animation. More specifically, I work on character animation, computer graphics, physics-based simulation, and motion estimation. My ultimate goal is to build a neural-based physics engine, which I refer to as Clotho, for simulating human motion as well as human–object–scene interactions.
🔥 News
- 2025.10: 🎉🎉 One co-first-author paper, TalkCuts, has been accepted to NeurIPS 2025.
- 2025.07: 🎉🎉 One paper, RapVerse, has been accepted to ICCV 2025.
📝 Publications
* indicates equal contributions.
NeurIPS 2025
TalkCuts: A Large-Scale Dataset for Multi-Shot Human Speech Video Generation
Neural Information Processing Systems (NeurIPS), 2025
project page / paper / code
In this work, we present TalkCuts, a large-scale benchmark dataset designed to facilitate the study of multi-shot human speech video generation.
ICCV 2025
RapVerse: Coherent Vocals and Whole-Body Motions Generations from Text
International Conference on Computer Vision (ICCV), 2025
project page / paper / code
In this paper, we introduce a challenging task for simultaneously generating 3D holistic body motions and singing vocals directly from textual lyrics inputs. To facilitate this, we first collect the RapVerse dataset, a large dataset containing synchronous rapping vocals, lyrics, and high-quality 3D holistic body meshes.
🛠️ Project
SpeechGPT2: End-to-End Speech-Centric Large Language Model for Unified Listening, Speaking, and Understanding
SpeechGPT2 is an end-to-end speech-centric large language model designed to unify listening, speech understanding, and spoken response generation. The model supports natural multi-turn audio-based interaction **without** requiring intermediate text conversion.
📖 Education
💻 Experience
🎬 Life
- I love anime and I’m a pretty hardcore ACG fan. I still follow new series every month. My favorite anime is Evangelion. I also watch a lot of movies in my spare time.
- My favorite game is Baldur’s Gate 3. My dream is to use AI to rebuild Baldur’s Gate as a far more open world that gives players much greater freedom and agency.
- The two courses I enjoyed the most in college were A Reading of Descartes’ Meditations on First Philosophy and Film Appreciation.