Despite the efficiency and performance of sparse query-based representations for detection, state-of-the-art 3D occupancy estimation methods still rely on voxel-based or dense Gaussian-based 3D representations. However, dense representations are slow, and they lack flexibility in capturing the temporal dynamics of driving scenes. Distinct from prior work, we instead summarize the scene into a compact set of 3D queries which are propagated through time in an online, streaming fashion. These queries are then decoded into semantic Gaussians at each timestep. We couple our framework with a denoising rendering objective to guide the queries and their constituent Gaussians in effectively capturing scene geometry. Due to its efficient, query-based representation, S2GO achieves state-of-the-art performance on the nuScenes and KITTI occupancy benchmarks, outperforming prior art (e.g., GaussianWorld) by 2.7 IoU with 5.9x faster inference.
Core idea: we represent the dense 3D world using a small set of persistent 3D queries. These queries act as a compact, streaming summary of the scene: they carry long-horizon context forward, interact globally in sparse query space, and are refined online using the current camera views.
Queries are the streaming state that we propagate over time. This makes long-term temporal integration efficient and flexible, since global feature interaction happens over ~1k queries instead of a dense 3D representation. At each timestep, we decode the current queries into a denser set of semantic Gaussians to produce occupancy.
Streaming state matters. Using persistent queries as the streaming state preserves a clean, object-level memory over time, which helps avoid instance merging artifacts in long-horizon sequences.
At each timestep, S2GO maintains a set of 3D queries with associated locations and features. We refine these queries using cross-attention to the current multi-view image features and self-attention to a short queue of past queries, so temporal fusion happens in sparse query space rather than in a dense 3D representation. From the refined query features, the model predicts a per-query position offset, velocity, and Gaussian parameters. Each query then expands into a small bundle of Gaussians that capture local geometry and semantics.
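To make this concrete, here is a minimal PyTorch sketch of one refinement step. Standard multi-head attention stands in for whatever sparse or deformable attention the full model uses, and all module names, shapes, and head sizes are our assumptions rather than the released code (layer norms omitted for brevity).

```python
# Minimal sketch of one S2GO-style streaming refinement step (assumed
# shapes and names, not the authors' code). Queries cross-attend to the
# current image features, self-attend jointly with a short queue of past
# queries, and are passed through lightweight prediction heads.
import torch
import torch.nn as nn

class QueryRefineBlock(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.offset_head = nn.Linear(dim, 3)    # per-query position update
        self.velocity_head = nn.Linear(dim, 3)  # per-query velocity

    def forward(self, queries, img_feats, past_queries):
        # queries: (B, Nq, C), img_feats: (B, Npix, C), past_queries: (B, Nh, C)
        attn, _ = self.cross_attn(queries, img_feats, img_feats)
        queries = queries + attn
        kv = torch.cat([queries, past_queries], dim=1)  # temporal fusion
        attn, _ = self.self_attn(queries, kv, kv)
        queries = queries + attn
        queries = queries + self.ffn(queries)
        return queries, self.offset_head(queries), self.velocity_head(queries)

block = QueryRefineBlock()
q, off, vel = block(torch.randn(1, 900, 256),   # 900 queries (S2GO-Small)
                    torch.randn(1, 4096, 256),  # flattened image features
                    torch.randn(1, 900, 256))   # one past frame of queries
print(q.shape, off.shape, vel.shape)  # (1, 900, 256) (1, 900, 3) (1, 900, 3)
```

Note that temporal fusion here is just attention over a concatenation of current and past queries, so its cost scales with the query count rather than with a dense 3D grid.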
Concretely, our streaming state is extremely compact: roughly 1–2k queries (900 for S2GO-Small, 1800 for S2GO-Base). This stands in contrast to dense Gaussian occupancy methods that rely on tens to hundreds of thousands of Gaussians (e.g., 25.6k–144k) to cover the scene.
Each query is decoded into a local set of Gaussians (a minimal decoding sketch follows this list), predicting:
Position offset (a correction for the entire Gaussian group)
Semantic class (and RGB during pretraining)
Query opacity (overall occupancy confidence)
Gaussians (multiple Gaussians for more precise local modeling)
Velocity (for dynamic modeling and multi-frame supervision)
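The sketch below shows one way such a decoding head could be structured; the bundle size K, class count, and activation choices are our assumptions, not the paper's exact parameterization.

```python
# Sketch of a per-query Gaussian decoding head matching the outputs listed
# above (assumed sizes and activations; not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianDecoder(nn.Module):
    def __init__(self, dim=256, num_gaussians=4, num_classes=17):
        super().__init__()
        self.K = num_gaussians
        self.offset = nn.Linear(dim, 3)       # correction for the whole group
        self.velocity = nn.Linear(dim, 3)     # dynamics / multi-frame sup.
        self.opacity = nn.Linear(dim, 1)      # overall occupancy confidence
        self.semantics = nn.Linear(dim, num_classes)
        # per-Gaussian local mean (3), log-scale (3), rotation quaternion (4)
        self.gaussians = nn.Linear(dim, self.K * 10)

    def forward(self, feats, positions):
        # feats: (B, Nq, C); positions: (B, Nq, 3) current query locations
        B, Nq, _ = feats.shape
        centers = positions + self.offset(feats)          # (B, Nq, 3)
        g = self.gaussians(feats).view(B, Nq, self.K, 10)
        return {
            "means": centers.unsqueeze(2) + g[..., :3],   # (B, Nq, K, 3)
            "scales": g[..., 3:6].exp(),
            "rotations": F.normalize(g[..., 6:10], dim=-1),
            "opacity": self.opacity(feats).sigmoid(),
            "velocity": self.velocity(feats),
            "logits": self.semantics(feats),
        }
```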
Direct semantic occupancy supervision alone provides an ambiguous signal for query motion. The supervision is highly local and can lead to poor local minima: when queries start in empty space, the model can reduce voxel loss through small, local changes in the decoded Gaussians instead of moving queries through free space to reach occupied regions. This often produces stagnant query behavior and poor geometric coverage.
Stage 1 (geometry denoising pretraining). We initialize query locations at noised LiDAR points and train the network to recover geometry through a denoising objective. To capture fine-grained local shape, decoded Gaussians are rendered from the current and neighboring views and supervised with RGB and depth.
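As a rough illustration of this initialization and objective (our reading of the stage; `sigma` and the L1 denoising target are assumptions), noised LiDAR queries might be produced like this:

```python
# Sketch of Stage-1 query initialization and the denoising target (assumed
# formulation): queries start at noised LiDAR samples, and predicted offsets
# should move them back onto the clean surface points.
import torch
import torch.nn.functional as F

def init_queries_from_lidar(points, num_queries=900, sigma=0.5):
    # points: (N, 3) LiDAR point cloud in ego coordinates
    idx = torch.randperm(points.shape[0])[:num_queries]
    anchors = points[idx]                              # clean surface points
    noised = anchors + sigma * torch.randn_like(anchors)
    return noised, anchors                             # init positions, targets

def denoise_loss(init_pos, pred_offset, anchors):
    # the network learns geometry by predicting offsets back to the surface
    return F.l1_loss(init_pos + pred_offset, anchors)
```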
Stage 2 (semantic occupancy). LiDAR is no longer used, and query locations are randomly initialized. The network uses its pretrained geometry prior to reposition queries, decode to Gaussians, and predict semantic occupancy.
Impact of denoising pretraining. Without pretraining, queries remain largely stagnant and Gaussians fail to capture 3D structure. With denoising + rendering, queries move towards occupied regions and Gaussians self-organize to represent the scene.
Pretraining ablation. Direct training underperforms. Initializing queries from noised LiDAR points is important, and combining denoising with RGB+depth rendering objectives yields the best results (row f).
| Setting | Query Init | Depth | RGB | Denoise | mIoU | IoU |
|---|---|---|---|---|---|---|
| (a) | - | ✗ | ✗ | ✗ | 13.02 | 25.73 |
| (b) | Learnable | ✓ | ✓ | ✗ | 12.42 | 26.64 |
| (c) | LiDAR | ✓ | ✓ | ✗ | 13.62 | 27.08 |
| (d) | LiDAR + ε | ✓ | ✓ | ✗ | 20.55 | 32.68 |
| (e) | LiDAR + ε | ✓ | ✗ | ✗ | 20.25 | 32.44 |
| (f) | LiDAR + ε | ✓ | ✓ | ✓ | 21.60 | 33.91 |
S2GO applies dense occupancy supervision by splatting Gaussians into a voxel grid. A key detail is to perform opacity-aware semantic splatting. In prior Gaussian-to-voxel formulations, opacity is only used to weight Gaussians inside the class mixture and has no bearing on whether a location is occupied. This leads to unexpected behavior: Gaussians in unoccupied regions can keep significant opacity while reducing occupancy contribution by shrinking or drifting between voxel centers. We instead weight occupancy probability by opacity, encouraging background Gaussians to simply predict low opacity and making the representation more consistent with rendering.
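A minimal, non-CUDA sketch of the difference is below; the Mahalanobis kernel weights and the exponential occupancy aggregation are our assumptions, and the key line is that occupancy itself is scaled by opacity.

```python
# Reference sketch of opacity-aware Gaussian-to-voxel splatting (assumed
# formulation, not the authors' kernels). Each Gaussian contributes a
# Mahalanobis kernel weight to nearby voxels; occupancy is weighted by
# opacity rather than using opacity only inside the class mixture.
import torch

def splat_gaussians(means, inv_covs, opacities, logits, voxel_centers):
    # means: (G, 3), inv_covs: (G, 3, 3), opacities: (G,), logits: (G, C)
    # voxel_centers: (V, 3)
    d = voxel_centers[:, None, :] - means[None, :, :]        # (V, G, 3)
    maha = torch.einsum('vgi,gij,vgj->vg', d, inv_covs, d)   # (V, G)
    w = torch.exp(-0.5 * maha)                               # kernel weights
    contrib = w * opacities[None, :]                         # opacity-aware
    occupancy = 1.0 - torch.exp(-contrib.sum(dim=1))         # (V,)
    semantics = torch.einsum('vg,gc->vc', contrib, logits.softmax(dim=-1))
    return occupancy, semantics  # background Gaussians vanish via low opacity
```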
We also introduce an efficient voxel splatting implementation with custom CUDA kernels that exploit locality. We block voxels into small grids (e.g., 4×4×4) and collaboratively load nearby Gaussians for better cache behavior, and we carefully structure the backward pass to avoid expensive atomics.
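To illustrate the blocking idea only (the measured gains come from the CUDA kernel's collaborative shared-memory loads and atomics-free backward, which a Python loop cannot reproduce), a CPU reference of the per-block culling might look like:

```python
# Illustration of the blocked traversal: each 4x4x4 voxel block gathers only
# the Gaussians whose influence radius overlaps it (assumed data layout).
import torch

def build_block_worklist(means, radii, grid_shape, voxel_size, block=4):
    # means: (G, 3) in metric coordinates; radii: (G,) influence cutoffs
    worklist = []
    X, Y, Z = grid_shape
    for bx in range(0, X, block):
        for by in range(0, Y, block):
            for bz in range(0, Z, block):
                lo = torch.tensor([bx, by, bz], dtype=torch.float32) * voxel_size
                center = lo + 0.5 * block * voxel_size
                half = torch.full((3,), 0.5 * block * voxel_size)
                # distance from each Gaussian mean to the block's AABB
                dist = ((means - center).abs() - half).clamp(min=0).norm(dim=1)
                hit = (dist <= radii).nonzero(as_tuple=True)[0]
                if hit.numel() > 0:
                    worklist.append(((bx, by, bz), hit))
    return worklist  # each block only ever touches its nearby Gaussians
```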
| Opacity in occupancy | Efficient G2V | mIoU | IoU | Train GPU hours | Infer GPU Mem. | Infer FPS |
|---|---|---|---|---|---|---|
| ✗ | ✗ | 16.97 | 28.75 | 55h | 2436 MB | 25.2 |
| ✓ | ✗ | 20.13 | 32.28 | 129h | 7043 MB | 20.4 |
| ✓ | ✓ | 20.55 | 32.68 | 28h | 2448 MB | 25.3 |
A streaming model requires an explicit propagation rule for query state. S2GO propagates a subset of queries based on confidence and enforces a minimum-distance constraint so that propagated queries do not cluster redundantly in a small region. This simple constraint improves spatial coverage and, in turn, occupancy performance (see the sketch after the table below).
| Propagation | mIoU | IoU |
|---|---|---|
| None | 17.92 | 29.24 |
| top-k opacity | 19.94 | 32.03 |
| δ-dist top-k opacity | 20.51 | 32.51 |
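Read as a greedy, NMS-style selection, the δ-dist top-k rule from the last row might be implemented as follows (our interpretation; names are assumptions):

```python
# Sketch of delta-distance-constrained top-k query propagation: visit
# queries in order of decreasing opacity and keep one only if it lies at
# least `delta` away from every query already kept.
import torch

def propagate_queries(positions, opacities, k, delta):
    # positions: (N, 3); opacities: (N,)
    order = opacities.argsort(descending=True)
    kept = []
    for i in order.tolist():
        if all((positions[i] - positions[j]).norm() >= delta for j in kept):
            kept.append(i)
            if len(kept) == k:
                break
    return torch.tensor(kept, dtype=torch.long)  # indices to carry forward
```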
History length vs performance. Longer history consistently improves occupancy performance, showcasing the advantage of streaming queries.
Query persistence. We visualize the distribution of query lifetimes, highlighting that queries are persistent over time.
We report quantitative 3D occupancy results on the nuScenes-SurroundOcc validation set. S2GO achieves state-of-the-art accuracy while maintaining real-time throughput.
| Method | FPS | IoU | mIoU | barrier | bicycle | bus | car | const. veh. | motorcycle | pedestrian | traffic cone | trailer | truck | drive. surf. | other flat | sidewalk | terrain | manmade | vegetation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MonoScene | - | 24.0 | 7.3 | 4.0 | 0.4 | 8.0 | 8.0 | 2.9 | 0.3 | 1.2 | 0.7 | 4.0 | 4.4 | 27.7 | 5.2 | 15.1 | 11.3 | 9.0 | 14.9 |
| Atlas | - | 28.7 | 15.0 | 10.6 | 5.7 | 19.7 | 24.9 | 8.9 | 8.8 | 6.5 | 3.3 | 10.4 | 16.2 | 34.9 | 15.5 | 21.9 | 21.0 | 11.2 | 20.5 |
| BEVFormer | 3.3 | 30.5 | 16.8 | 14.2 | 6.6 | 23.5 | 28.3 | 8.7 | 10.8 | 6.6 | 4.1 | 11.2 | 17.8 | 37.3 | 18.0 | 22.9 | 22.2 | 13.8 | 22.2 |
| TPVFormer | 2.9 | 30.9 | 17.1 | 16.0 | 5.3 | 23.9 | 27.3 | 9.8 | 8.7 | 7.1 | 5.2 | 11.0 | 19.2 | 38.9 | 21.3 | 24.3 | 23.2 | 11.7 | 20.8 |
| OccFormer | - | 31.4 | 19.0 | 18.7 | 10.4 | 23.9 | 30.3 | 10.3 | 14.2 | 13.6 | 10.1 | 12.5 | 20.8 | 38.8 | 19.8 | 24.2 | 22.2 | 13.5 | 21.4 |
| SurroundOcc | 3.3 | 31.5 | 20.3 | 20.6 | 11.7 | 28.1 | 30.9 | 10.7 | 15.1 | 14.1 | 12.1 | 14.4 | 22.3 | 37.3 | 23.7 | 24.5 | 22.8 | 14.9 | 21.9 |
| GaussianFormer | 2.7 | 29.8 | 19.1 | 19.5 | 11.3 | 26.1 | 29.8 | 10.5 | 13.8 | 12.6 | 8.7 | 12.7 | 21.6 | 39.6 | 23.3 | 24.5 | 23.0 | 9.6 | 19.1 |
| GaussianFormer-2 | 2.8 | 31.7 | 20.8 | 21.4 | 13.4 | 28.5 | 30.8 | 10.9 | 15.8 | 13.6 | 10.5 | 14.0 | 22.9 | 40.6 | 24.4 | 26.1 | 24.3 | 13.8 | 22.0 |
| QuadricFormer | 6.2 | 31.2 | 20.1 | 19.6 | 13.1 | 27.3 | 29.6 | 11.3 | 16.3 | 12.7 | 9.2 | 12.5 | 21.2 | 40.2 | 24.3 | 25.7 | 24.2 | 13.0 | 21.9 |
| GaussianWorld* | 4.4 | 32.8 | 21.8 | 21.6 | 13.3 | 27.3 | 31.2 | 13.9 | 16.9 | 13.3 | 11.8 | 14.8 | 23.7 | 41.9 | 24.3 | 28.4 | 26.3 | 15.7 | 24.5 |
| ALOcc-mini-GF | 5.4 | 34.6 | 23.1 | 22.2 | 16.0 | 27.9 | 32.7 | 12.1 | 18.9 | 16.6 | 15.3 | 14.5 | 23.9 | 46.0 | 28.2 | 29.0 | 26.6 | 15.8 | 23.7 |
| ALOcc-GF | 0.9 | 38.2 | 25.5 | 24.3 | 18.8 | 29.8 | 34.3 | 17.9 | 19.6 | 17.5 | 17.2 | 15.5 | 26.5 | 47.6 | 29.9 | 31.2 | 29.2 | 20.0 | 29.0 |
| S2GO-Small (ours) | 26.1 | 34.3 | 22.1 | 20.8 | 13.1 | 27.5 | 30.3 | 14.5 | 16.5 | 11.7 | 10.9 | 13.5 | 23.3 | 46.3 | 29.2 | 29.7 | 28.4 | 13.0 | 25.1 |
| S2GO-Base (ours) | 19.6 | 35.5 | 22.7 | 21.9 | 13.4 | 27.5 | 32.1 | 14.9 | 15.3 | 12.9 | 11.8 | 13.4 | 24.0 | 46.9 | 29.1 | 30.3 | 29.1 | 14.7 | 26.4 |
nuScenes visualization. Top: RGB images. Bottom-left: S2GO predictions. Bottom-right: ground truth.
```bibtex
@inproceedings{Park2026S2GO,
  title={S2GO: Streaming Sparse Gaussian Occupancy},
  author={Jinhyung Park and Chensheng Peng and Yihan Hu and Wenzhao Zheng and Kris Kitani and Wei Zhan},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}
```