Changes in Real Time: Online Scene Change Detection with Multi-View Fusion

QUT Centre for Robotics · ARIAM Hub · ACFR, University of Sydney · Abyss Solutions

Abstract

Online Scene Change Detection (SCD) is an extremely challenging problem that requires an agent to detect relevant changes on the fly while observing the scene from unconstrained viewpoints. Existing online SCD methods are significantly less accurate than offline approaches. We present the first online SCD approach that is pose-agnostic, label-free, and multi-view consistent, operating at over 10 FPS while achieving new state-of-the-art performance that surpasses even the best offline approaches.

Our method introduces a new self-supervised fusion loss that infers scene changes from multiple cues and observations, fast PnP-based pose estimation against the reference scene, and a change-guided strategy for efficiently updating the 3D Gaussian Splatting representation. Extensive experiments on complex real-world datasets demonstrate that our approach outperforms both online and offline baselines.

Performance Comparison

Our online Scene Change Detection method establishes a new state of the art, detecting changes more reliably than all prior methods, including the strongest offline baselines. It runs at a frame rate comparable to the fastest online approaches while achieving substantially higher F1 scores. These gains are enabled by a self-supervised loss enforcing multi-view consistency and a lightweight PnP-based pose estimation module, sketched below.
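To make the pose-estimation step concrete, the snippet below is a minimal sketch of RANSAC-based PnP using OpenCV. It assumes 2D–3D correspondences have already been obtained by matching keypoints in the incoming image against 3D points of the reference scene; the function name, interface, and all parameter values are illustrative, not the paper's implementation.

```python
import cv2
import numpy as np

def estimate_pose_pnp(pts3d, pts2d, K):
    """Recover a camera pose from 2D-3D correspondences via RANSAC-PnP.

    pts3d: (N, 3) reference-scene points matched to the incoming image
    pts2d: (N, 2) corresponding pixel locations in the incoming image
    K:     (3, 3) camera intrinsic matrix
    Returns a 4x4 world-to-camera transform, or None if the solve fails.
    """
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d.astype(np.float64), pts2d.astype(np.float64), K, None,
        reprojectionError=2.0, iterationsCount=200,
        flags=cv2.SOLVEPNP_EPNP)
    if not ok or inliers is None or len(inliers) < 6:
        return None
    R, _ = cv2.Rodrigues(rvec)            # rotation vector -> 3x3 matrix
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, tvec.ravel()
    return T
```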

Method Overview


We register an incoming inference image $I^{\mathrm{inf}}_k$ to an existing reference representation $\mathcal{R}_{\mathrm{ref}}$ with a lightweight PnP-based pose estimator. Using the estimated pose $P^{\mathrm{inf}}_k$ and $\mathcal{R}_{\mathrm{ref}}$, we render an aligned image $I^{\mathrm{ren}}_k$ and extract change cues $C_k$ as a combination of pixel- and feature-level cues. Our novel self-supervised fusion loss $\mathcal{L}_{\mathrm{SSF}}$ guides the fusion of all observed change cues to build a change representation $\mathcal{R}_{\mathrm{change}}$ that collectively learns change information from multiple viewpoints and infers change masks $M_k$. Finally, we selectively reconstruct changed regions to update the 3D Gaussian Splatting representation to the current state $\mathcal{R}_{\mathrm{inf}}$.
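As an illustration of how the pixel- and feature-level cues in $C_k$ might be computed from the aligned pair ($I^{\mathrm{inf}}_k$, $I^{\mathrm{ren}}_k$), the sketch below combines a photometric difference with a deep-feature cosine distance. This is a hedged reconstruction under assumed interfaces (a frozen feature backbone, images in [0, 1]), not the paper's exact cue definition.

```python
import torch
import torch.nn.functional as F

def change_cues(I_inf, I_ren, feat_inf, feat_ren):
    """Pixel- and feature-level change cues between the inference image
    and the view rendered from the reference representation.

    I_inf, I_ren:       (3, H, W) aligned RGB images in [0, 1]
    feat_inf, feat_ren: (C, h, w) deep features from a frozen backbone
    Returns two (H, W) cue maps in [0, 1].
    """
    # Pixel-level cue: per-pixel photometric difference.
    c_pix = (I_inf - I_ren).abs().mean(dim=0)

    # Feature-level cue: cosine distance, mapped from [-1, 1] to [0, 1]
    # and upsampled to image resolution.
    sim = F.cosine_similarity(feat_inf, feat_ren, dim=0)        # (h, w)
    c_feat = 0.5 * (1.0 - sim)
    c_feat = F.interpolate(c_feat[None, None], size=I_inf.shape[-2:],
                           mode="bilinear", align_corners=False)[0, 0]
    return c_pix, c_feat
```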

Qualitative Results


Qualitative comparison with MV3DCD. MV3DCD's hard thresholding and intersection heuristic lead to missed or spurious detections, especially for subtle appearance changes in semantically similar objects (the red-to-blue T-shaped object in Meeting Room, the blue-to-black bench in Porch). Hard thresholding risks discarding subtle but important changes, while the intersection heuristic captures a true change only if it appears in both masks. Our method jointly learns the complementary change information in pixel- and feature-level cues via our novel self-supervised loss, capturing fine-grained changes and achieving state-of-the-art performance in both online and offline settings.
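To contrast with hard thresholding and mask intersection, below is one plausible form of a self-supervised fusion objective: confident agreement between the two cues provides pseudo-labels, and a multi-view term ties predicted masks of the same region across viewpoints. The paper's actual $\mathcal{L}_{\mathrm{SSF}}$ may differ; the threshold $\tau$ and the loss weighting are assumptions.

```python
import torch

def fusion_loss(m_pred, c_pix, c_feat, m_other, tau=0.3):
    """Illustrative self-supervised fusion objective (not the paper's
    exact L_SSF). All inputs are (H, W) tensors; cues and masks in [0, 1].

    m_pred:  predicted change probability for the current view
    c_pix:   pixel-level change cue
    c_feat:  feature-level change cue
    m_other: predicted mask for the same region from another viewpoint
    """
    eps = 1e-6
    pos = ((c_pix > tau) & (c_feat > tau)).float()   # both cues fire
    neg = ((c_pix < tau) & (c_feat < tau)).float()   # both cues quiet
    # Weighted BCE against cue-derived pseudo-labels; pixels where the
    # cues disagree contribute nothing, unlike a hard intersection.
    l_cue = -(pos * torch.log(m_pred + eps)
              + neg * torch.log(1.0 - m_pred + eps)).sum() \
            / (pos + neg).sum().clamp(min=1.0)
    # Multi-view consistency: the same region seen from another
    # viewpoint should receive the same change probability.
    l_mv = (m_pred - m_other).abs().mean()
    return l_cue + l_mv
```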

Quantitative Results

Quantitative results for SCD on PASLCD averaged over all 20 instances. LF: Label-Free, PA: Pose-Agnostic, MV: Multi-View consistency for change detection. We report total runtime for offline methods and operating frame rate (FPS) for online methods. Our method achieves the best performance in both settings.

Offline Methods

| Method | LF | PA | MV | mIoU ↑ | F1 ↑ | Runtime ↓ |
|---|:-:|:-:|:-:|---|---|---|
| R-SCD | | | | 0.118 | 0.199 | 194s |
| CYWS2D | | | | 0.273 | 0.398 | 189s |
| GeSCD | | | | 0.477 | 0.611 | 298s |
| ZeroSCD | | | | 0.306 | 0.414 | 409s |
| 3DGS-CD | | | | 0.209 | 0.339 | 824s |
| MV3DCD | | | | 0.478 | 0.628 | 479s |
| Ours | ✓ | ✓ | ✓ | 0.552 | 0.694 | 156s |

Online Methods

| Method | LF | PA | MV | mIoU ↑ | F1 ↑ | FPS ↑ |
|---|:-:|:-:|:-:|---|---|---|
| ChangeSim | | | | 0.018 | 0.034 | 11.5 |
| CS+CYWS2D | | | | 0.243 | 0.360 | 8.2 |
| CS+GeSCD | | | | 0.181 | 0.270 | <1 |
| OmniposeAD | | | | 0.168 | 0.262 | <1 |
| SplatPose | | | | 0.173 | 0.281 | <1 |
| SplatPose+ | | | | 0.237 | 0.358 | <1 |
| Ours | ✓ | ✓ | ✓ | 0.486 | 0.638 | 11.2 |

Change-Guided Efficient Gaussian Representation Update

Change-Guided Update Comparison

Qualitative comparison of rendered views from the updated representation with CLNeRF and 3DGS (from scratch). Our method more accurately reconstructs changed regions (red boxes) while reusing primitives from $\mathcal{R}_{\mathrm{ref}}$ to preserve high fidelity in unchanged areas (yellow boxes), compared to naïvely reconstructing the scene from scratch at each time step.
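A simplified sketch of the change-guided selection step: primitives whose projections land inside the change mask are flagged for re-optimization, while the rest keep their parameters from $\mathcal{R}_{\mathrm{ref}}$, so unchanged regions stay untouched. Projection, multi-view aggregation, and densification are omitted, and all names here are illustrative rather than the paper's implementation.

```python
import torch

def select_changed_gaussians(means2d, change_mask):
    """Select Gaussians whose projected centres fall inside the 2D change
    mask M_k of the current view. A real system would aggregate this test
    over several views and account for each Gaussian's full footprint.

    means2d:     (N, 2) projected Gaussian centres in pixel coordinates
    change_mask: (H, W) boolean change mask
    Returns a boolean (N,) selector over the Gaussians.
    """
    H, W = change_mask.shape
    u_f, v_f = means2d[:, 0], means2d[:, 1]
    inside = (u_f >= 0) & (u_f < W) & (v_f >= 0) & (v_f < H)
    u = u_f.round().long().clamp(0, W - 1)
    v = v_f.round().long().clamp(0, H - 1)
    return inside & change_mask[v, u]

# Only the selected primitives are then re-optimized on the new images;
# all other Gaussians keep their parameters from the reference scene.
```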

PASLCD:

| Method | PSNR (dB) ↑ | SSIM ↑ | LPIPS ↓ | Runtime (s) ↓ |
|---|---|---|---|---|
| 3DGS | 22.21 | 0.7558 | 0.2426 | 550 |
| 3DGS-LM | 22.26 | 0.7562 | 0.2422 | 340 |
| SpeedySplats | 22.25 | 0.7603 | 0.2618 | 399 |
| CLNeRF | 22.27 | 0.6239 | 0.3907 | 451 |
| Ours | 23.70 | 0.7868 | 0.2491 | 42 |

CL-Splats:

| Method | PSNR (dB) ↑ | SSIM ↑ | LPIPS ↓ | Runtime (s) ↓ |
|---|---|---|---|---|
| 3DGS | 30.31 | 0.9319 | 0.1178 | 364 |
| 3DGS-LM | 29.95 | 0.9322 | 0.1177 | 275 |
| SpeedySplats | 29.89 | 0.9349 | 0.1290 | 312 |
| CLNeRF | 26.29 | 0.7867 | 0.2235 | 301 |
| Ours | 30.54 | 0.9356 | 0.1256 | 39 |

Quantitative comparison of scene representation update on PASLCD and CL-Splats. Our method achieves comparable or higher reconstruction quality than approaches that fully re-optimize the evolved scene from scratch, while providing updated representations within seconds (<60s), achieving up to 8–9× faster runtimes. Results are averaged over all instances and scenes.

BibTeX

@article{galappaththige2025online,
  title={Changes in Real Time: Online Scene Change Detection with Multi-View Fusion},
  author={Galappaththige, Chamuditha Jayanga and Lai, Jason and Windrim, Lloyd and Dansereau, Donald and S{\"u}nderhauf, Niko and Miller, Dimity},
  journal={arXiv preprint arXiv:2511.12370},
  year={2025}
}

Acknowledgement

This work was supported by the ARC Research Hub in Intelligent Robotic Systems for Real-Time Asset Management (IH210100030) and Abyss Solutions. C.J., N.S., and D.M. also acknowledge ongoing support from the QUT Centre for Robotics.