Changes in Real Time: Online Scene Change Detection with Multi-View Fusion

QUT Centre for Robotics · ARIAM Hub · ACFR, University of Sydney · Abyss Solutions

Abstract

Online Scene Change Detection (SCD) is an extremely challenging problem that requires an agent to detect relevant changes on the fly while observing the scene from unconstrained viewpoints. Existing online SCD methods are significantly less accurate than offline approaches. We present the first online SCD approach that is pose-agnostic, label-free, and multi-view consistent, operating at over 10 FPS while achieving new state-of-the-art performance that surpasses even the best offline approaches.

Our method introduces a new self-supervised fusion loss that infers scene changes from multiple cues and observations, fast PnP-based pose estimation against the reference scene, and a change-guided strategy for efficiently updating the 3D Gaussian Splatting representation. Extensive experiments on complex real-world datasets demonstrate that our approach outperforms both online and offline baselines.

Performance Comparison

Our online Scene Change Detection method establishes a new state of the art, detecting changes more reliably than all prior methods, including the strongest offline baselines. It runs at a frame rate comparable to the fastest online approaches while achieving substantially higher F1 scores. These gains are enabled by a self-supervised loss enforcing multi-view consistency and a lightweight PnP-based pose estimation module, sketched below.
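To make the pose-estimation step concrete, the snippet below is a minimal sketch of RANSAC-based PnP using OpenCV. It assumes 2D–3D correspondences have already been obtained by matching keypoints in the incoming image against 3D points of the reference scene; the function name, interface, and all parameter values are illustrative, not the paper's implementation.

```python
import cv2
import numpy as np

def estimate_pose_pnp(pts3d, pts2d, K):
    """Recover a camera pose from 2D-3D correspondences via RANSAC-PnP.

    pts3d: (N, 3) reference-scene points matched to the incoming image
    pts2d: (N, 2) corresponding pixel locations in the incoming image
    K:     (3, 3) camera intrinsic matrix
    Returns a 4x4 world-to-camera transform, or None if the solve fails.
    """
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d.astype(np.float64), pts2d.astype(np.float64), K, None,
        reprojectionError=2.0, iterationsCount=200,
        flags=cv2.SOLVEPNP_EPNP)
    if not ok or inliers is None or len(inliers) < 6:
        return None
    R, _ = cv2.Rodrigues(rvec)            # rotation vector -> 3x3 matrix
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, tvec.ravel()
    return T
```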

Method Overview


We register an incoming inference image $I^{\mathrm{inf}}_k$ to an existing reference representation $\mathcal{R}_{\mathrm{ref}}$ with a lightweight PnP-based pose estimator. Using the estimated pose $P^{\mathrm{inf}}_k$ and $\mathcal{R}_{\mathrm{ref}}$, we render an aligned image $I^{\mathrm{ren}}_k$ and extract change cues $C_k$ as a combination of pixel- and feature-level cues. Our novel self-supervised fusion loss $\mathcal{L}_{\mathrm{SSF}}$ guides the fusion of all observed change cues to build a change representation $\mathcal{R}_{\mathrm{change}}$ that collectively learns change information from multiple viewpoints and infers change masks $M_k$. Finally, we selectively reconstruct changed regions to update the 3D Gaussian Splatting representation to the current state $\mathcal{R}_{\mathrm{inf}}$.
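As an illustration of how the pixel- and feature-level cues in $C_k$ might be computed from the aligned pair ($I^{\mathrm{inf}}_k$, $I^{\mathrm{ren}}_k$), the sketch below combines a photometric difference with a deep-feature cosine distance. This is a hedged reconstruction under assumed interfaces (a frozen feature backbone, images in [0, 1]), not the paper's exact cue definition.

```python
import torch
import torch.nn.functional as F

def change_cues(I_inf, I_ren, feat_inf, feat_ren):
    """Pixel- and feature-level change cues between the inference image
    and the view rendered from the reference representation.

    I_inf, I_ren:       (3, H, W) aligned RGB images in [0, 1]
    feat_inf, feat_ren: (C, h, w) deep features from a frozen backbone
    Returns two (H, W) cue maps in [0, 1].
    """
    # Pixel-level cue: per-pixel photometric difference.
    c_pix = (I_inf - I_ren).abs().mean(dim=0)

    # Feature-level cue: cosine distance, mapped from [-1, 1] to [0, 1]
    # and upsampled to image resolution.
    sim = F.cosine_similarity(feat_inf, feat_ren, dim=0)        # (h, w)
    c_feat = 0.5 * (1.0 - sim)
    c_feat = F.interpolate(c_feat[None, None], size=I_inf.shape[-2:],
                           mode="bilinear", align_corners=False)[0, 0]
    return c_pix, c_feat
```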

Qualitative Results


Qualitative comparison with MV3DCD. MV3DCD's hard thresholding and intersection heuristic lead to missed or spurious detections, especially for subtle appearance changes in semantically similar objects (the red-to-blue T-shaped object in Meeting Room, the blue-to-black bench in Porch). Hard thresholding risks discarding subtle but important changes, while the intersection heuristic captures a true change only if it appears in both masks. Our method jointly learns the complementary change information in pixel- and feature-level cues via our novel self-supervised loss, capturing fine-grained changes and achieving state-of-the-art performance in both online and offline settings.
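To contrast with hard thresholding and mask intersection, below is one plausible form of a self-supervised fusion objective: confident agreement between the two cues provides pseudo-labels, and a multi-view term ties predicted masks of the same region across viewpoints. The paper's actual $\mathcal{L}_{\mathrm{SSF}}$ may differ; the threshold $\tau$ and the loss weighting are assumptions.

```python
import torch

def fusion_loss(m_pred, c_pix, c_feat, m_other, tau=0.3):
    """Illustrative self-supervised fusion objective (not the paper's
    exact L_SSF). All inputs are (H, W) tensors; cues and masks in [0, 1].

    m_pred:  predicted change probability for the current view
    c_pix:   pixel-level change cue
    c_feat:  feature-level change cue
    m_other: predicted mask for the same region from another viewpoint
    """
    eps = 1e-6
    pos = ((c_pix > tau) & (c_feat > tau)).float()   # both cues fire
    neg = ((c_pix < tau) & (c_feat < tau)).float()   # both cues quiet
    # Weighted BCE against cue-derived pseudo-labels; pixels where the
    # cues disagree contribute nothing, unlike a hard intersection.
    l_cue = -(pos * torch.log(m_pred + eps)
              + neg * torch.log(1.0 - m_pred + eps)).sum() \
            / (pos + neg).sum().clamp(min=1.0)
    # Multi-view consistency: the same region seen from another
    # viewpoint should receive the same change probability.
    l_mv = (m_pred - m_other).abs().mean()
    return l_cue + l_mv
```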

Quantitative Results

Quantitative results for SCD on PASLCD averaged over all 20 instances. LF: Label-Free, PA: Pose-Agnostic, MV: Multi-View consistency for change detection. We report total runtime for offline methods and operating frame rate (FPS) for online methods. Our method achieves the best performance in both settings.

Offline Methods

| Method | LF | PA | MV | mIoU ↑ | F1 ↑ | Runtime ↓ |
|---|:-:|:-:|:-:|---|---|---|
| R-SCD | | | | 0.118 | 0.199 | 194s |
| CYWS2D | | | | 0.273 | 0.398 | 189s |
| GeSCD | | | | 0.477 | 0.611 | 298s |
| ZeroSCD | | | | 0.306 | 0.414 | 409s |
| 3DGS-CD | | | | 0.209 | 0.339 | 824s |
| MV3DCD | | | | 0.478 | 0.628 | 479s |
| Ours | ✓ | ✓ | ✓ | 0.552 | 0.694 | 156s |

Online Methods

| Method | LF | PA | MV | mIoU ↑ | F1 ↑ | FPS ↑ |
|---|:-:|:-:|:-:|---|---|---|
| ChangeSim | | | | 0.018 | 0.034 | 11.5 |
| CS+CYWS2D | | | | 0.243 | 0.360 | 8.2 |
| CS+GeSCD | | | | 0.181 | 0.270 | <1 |
| OmniposeAD | | | | 0.168 | 0.262 | <1 |
| SplatPose | | | | 0.173 | 0.281 | <1 |
| SplatPose+ | | | | 0.237 | 0.358 | <1 |
| Ours | ✓ | ✓ | ✓ | 0.486 | 0.638 | 11.2 |

Change-Guided Efficient Gaussian Representation Update

Change-Guided Update Comparison

Qualitative comparison of rendered views from the updated representation with CLNeRF and 3DGS (from scratch). Our method more accurately reconstructs changed regions (red boxes) while reusing primitives from $\mathcal{R}_{\mathrm{ref}}$ to preserve high fidelity in unchanged areas (yellow boxes), compared to naïvely reconstructing the scene from scratch at each time step.
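A simplified sketch of the change-guided selection step: primitives whose projections land inside the change mask are flagged for re-optimization, while the rest keep their parameters from $\mathcal{R}_{\mathrm{ref}}$, so unchanged regions stay untouched. Projection, multi-view aggregation, and densification are omitted, and all names here are illustrative rather than the paper's implementation.

```python
import torch

def select_changed_gaussians(means2d, change_mask):
    """Select Gaussians whose projected centres fall inside the 2D change
    mask M_k of the current view. A real system would aggregate this test
    over several views and account for each Gaussian's full footprint.

    means2d:     (N, 2) projected Gaussian centres in pixel coordinates
    change_mask: (H, W) boolean change mask
    Returns a boolean (N,) selector over the Gaussians.
    """
    H, W = change_mask.shape
    u_f, v_f = means2d[:, 0], means2d[:, 1]
    inside = (u_f >= 0) & (u_f < W) & (v_f >= 0) & (v_f < H)
    u = u_f.round().long().clamp(0, W - 1)
    v = v_f.round().long().clamp(0, H - 1)
    return inside & change_mask[v, u]

# Only the selected primitives are then re-optimized on the new images;
# all other Gaussians keep their parameters from the reference scene.
```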

PASLCD:

| Method | PSNR (dB) ↑ | SSIM ↑ | LPIPS ↓ | Runtime (s) ↓ |
|---|---|---|---|---|
| 3DGS | 22.21 | 0.7558 | 0.2426 | 550 |
| 3DGS-LM | 22.26 | 0.7562 | 0.2422 | 340 |
| SpeedySplats | 22.25 | 0.7603 | 0.2618 | 399 |
| CLNeRF | 22.27 | 0.6239 | 0.3907 | 451 |
| Ours | 23.70 | 0.7868 | 0.2491 | 42 |

CL-Splats:

| Method | PSNR (dB) ↑ | SSIM ↑ | LPIPS ↓ | Runtime (s) ↓ |
|---|---|---|---|---|
| 3DGS | 30.31 | 0.9319 | 0.1178 | 364 |
| 3DGS-LM | 29.95 | 0.9322 | 0.1177 | 275 |
| SpeedySplats | 29.89 | 0.9349 | 0.1290 | 312 |
| CLNeRF | 26.29 | 0.7867 | 0.2235 | 301 |
| Ours | 30.54 | 0.9356 | 0.1256 | 39 |

Quantitative comparison of scene representation update on PASLCD and CL-Splats. Our method achieves comparable or higher reconstruction quality than approaches that fully re-optimize the evolved scene from scratch, while providing updated representations within seconds (<60s), achieving up to 8–9× faster runtimes. Results are averaged over all instances and scenes.

BibTeX

@article{galappaththige2025online,
  title={Changes in Real Time: Online Scene Change Detection with Multi-View Fusion},
  author={Galappaththige, Chamuditha Jayanga and Lai, Jason and Windrim, Lloyd and Dansereau, Donald and S{\"u}nderhauf, Niko and Miller, Dimity},
  journal={arXiv preprint arXiv:2511.12370},
  year={2025}
}

Acknowledgement

This work was supported by the ARC Research Hub in Intelligent Robotic Systems for Real-Time Asset Management (IH210100030) and Abyss Solutions. C.J., N.S., and D.M. also acknowledge ongoing support from the QUT Centre for Robotics.