I am a fourth-year PhD student at CMU's Robotics Institute, advised by Prof. Kris Kitani. I previously received my bachelor's degree
in Computer Science from CMU in 2022, where I also worked with Prof. Kris Kitani.
I had the opportunity to conduct research at the MSC Lab at UC Berkeley for two summers, advised by Prof. Masayoshi Tomizuka and Dr. Wei Zhan.
I previously interned at Meta working on 3D panoptic reconstruction, parametric human body modeling, and promptable mesh recovery.
I have also interned at Applied Intuition working on 3D occupancy estimation.
I'm broadly interested in computer vision, joint 2D/3D understanding, human motion modeling, and vision-centric humanoid control. Much
of my research focuses on bridging 2D and 3D representations for a cohesive
understanding of the world.
SAM 3D Body is a promptable full-body mesh recovery model built on MHR that uses 2D keypoint/mask prompts and large-scale data curation for robust body and hand pose estimation in the wild.
MHR is a parametric human body model incorporating ATLAS with a production-ready decoupled skeleton/shape rig, semantic expression blendshapes, and sparse pose correctives for expressive, anatomically plausible animation.
S2GO is a streaming, sparse query-based 3D occupancy framework that decodes queries into semantic Gaussians and uses a denoising rendering objective to capture scene geometry, achieving state-of-the-art accuracy and 5.9x faster inference.
Enforcing temporal consistency and leveraging forward-backward ensembling of temporal models improves semi-supervised learning for camera-based 3D detection.
Aligning predicted depth maps with observed depth points by propagating depth corrections improves
depth completion for sparse and varying input point densities.
Combining long-term, low-resolution and short-term, high-resolution matching for temporal stereo
yields efficient and performant camera-only 3D detectors.
Consistency between 2D and 3D pseudo-labels for joint 2D-3D semi-supervised learning stymies
single-modality error propagation and improves performance.
Multi-modal fusion with prediction consistency between a privileged teacher and a noisy student
alleviates collapse in difficult capture conditions and improves performance in ideal conditions.