I am a fourth-year PhD student at CMU's Robotics Institute, advised by Prof. Kris Kitani. I received my bachelor's degree
in Computer Science from CMU in 2022, where I also worked with Prof. Kris Kitani.
I had the opportunity to conduct research at the MSC Lab at UC Berkeley for two summers, advised by Prof. Masayoshi Tomizuka and Dr. Wei Zhan.
I previously interned at Meta working on 3D panoptic reconstruction, parametric human body modeling, and promptable mesh recovery.
I have also interned at Applied Intuition working on 3D occupancy estimation.
I'm broadly interested in computer vision, joint 2D/3D understanding, human motion modeling, and vision-centric humanoid control. Much
of my research focuses on bridging 2D and 3D representations for a cohesive
understanding of the world.
LatentWorld represents driving scenes as sparse, grounded 3D latents (one editable latent per actor, many for background) and factorizes 4D generation into layout, feature, and motion diffusion, yielding controllable, temporally coherent 4D occupancy with state-of-the-art results on CarlaSC and Waymo.
SAM 3D Body is a promptable full-body mesh recovery model built on MHR that uses 2D keypoint/mask prompts and large-scale data curation for robust body and hand pose estimation in the wild.
S2GO is a streaming, sparse query-based 3D occupancy framework that decodes queries into semantic Gaussians and uses a denoising rendering objective to capture scene geometry, achieving state-of-the-art accuracy and 5.9x faster inference.
MHR is a parametric human body model incorporating ATLAS with a production-ready decoupled skeleton/shape rig, semantic expression blendshapes, and sparse pose correctives for expressive, anatomically plausible animation.
Enforcing temporal consistency and leveraging forward-backward ensembling of temporal models improves semi-supervised learning for camera-based 3D detection.
Aligning predicted depth maps with observed depth points by propagating depth corrections improves
depth completion for sparse and varying input point densities.
Combining long-term, low-resolution and short-term, high-resolution matching for temporal stereo
yields efficient and performant camera-only 3D detectors.
Consistency between 2D and 3D pseudo-labels for joint 2D-3D semi-supervised learning stymies
single-modality error propagation and improves performance.
Multi-modal fusion with prediction consistency between a privileged teacher and a noisy student
alleviates collapse in difficult capture conditions and improves performance in ideal conditions.