Modern SLAM: learned features and Gaussian splatting
SuperPoint, DROID-SLAM, Gaussian splats — the deep-learning wave reshaping the SLAM landscape. What's new, what classical methods still beat, and where the production frontier is in 2026.
For 30 years SLAM was geometry: features + matching + optimization, all hand-crafted. Since 2018 deep learning has been eating piece after piece of the pipeline. Learned features replaced ORB. Learned matchers replaced descriptor distance. Learned end-to-end SLAM systems exist. And Gaussian splatting brought differentiable photorealistic mapping to the field. Here's what's actually in production in 2026.
The four lines of attack
- Replace components: keep the classical pipeline; swap in learned features, matchers, depth estimators.
- End-to-end SLAM: train a neural network that ingests video and outputs trajectory + map.
- Neural fields for mapping: replace point clouds with NeRF or Gaussian splat representations.
- Foundation-model SLAM: leverage VLMs and diffusion models as priors over scene structure.
1. Component-level replacements
SuperPoint (2018) and SuperGlue (2020)
SuperPoint: a CNN that detects keypoints + computes descriptors in one forward pass. Significantly more repeatable than ORB across viewpoint, lighting, and weather changes.
SuperGlue: a graph-neural-network matcher. Given two sets of SuperPoint descriptors, output the optimal one-to-one matching. Robust to occlusion, viewpoint change, low texture.
Practical impact: SuperPoint + SuperGlue beats ORB + brute-force matching by roughly 30 percentage points on hard wide-baseline benchmarks. Used in modern SfM pipelines built on COLMAP (e.g. hloc) and several SLAM systems.
Cost: 5–10× slower than ORB; requires GPU.
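To make the comparison concrete, the baseline SuperGlue displaces is plain mutual-nearest-neighbor descriptor matching. A minimal NumPy sketch (descriptors here are random stand-ins, not real SuperPoint outputs):

```python
import numpy as np

def mutual_nn_match(desc_a, desc_b):
    """Classical baseline: keep (i, j) only if desc_a[i] and desc_b[j]
    are each other's nearest neighbor in descriptor space."""
    d = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=-1)
    nn_ab = d.argmin(axis=1)                  # best b for each a
    nn_ba = d.argmin(axis=0)                  # best a for each b
    return [(i, j) for i, j in enumerate(nn_ab) if nn_ba[j] == i]

rng = np.random.default_rng(0)
desc_b = rng.normal(size=(100, 256))          # 100 keypoints in image B
desc_a = desc_b[:50] + 0.01 * rng.normal(size=(50, 256))  # 50 co-visible points
matches = mutual_nn_match(desc_a, desc_b)     # recovers the 50 true pairs
```

This is exactly what breaks under wide baselines: when descriptors drift, nearest-neighbor distance stops being a reliable signal, and a learned matcher that reasons about both keypoint sets jointly wins.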
LightGlue (2023)
SuperGlue with adaptive computation — fast pairs are matched quickly, hard pairs get more attention. ~3× faster than SuperGlue with similar accuracy. The new "default learned matcher."
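The adaptive-computation idea can be caricatured in a few lines: confidence grows with each layer, and easy pairs cross the exit threshold sooner. This is a toy model of the early-exit behavior, not LightGlue's actual confidence classifier; the growth formula is made up:

```python
def adaptive_match_depth(difficulty, n_layers=9, threshold=0.95):
    """Toy early exit: run attention layers until matching confidence
    clears a threshold. Easy pairs (low difficulty) exit after a few
    layers; hard pairs use the full network."""
    confidence = 0.0
    for layer in range(1, n_layers + 1):
        # Confidence approaches 1.0 each layer; more slowly for harder pairs.
        confidence = 1.0 - (1.0 - confidence) * (0.3 + 0.6 * difficulty)
        if confidence >= threshold:
            return layer
    return n_layers
```

The average-case speedup comes from the fact that most frame pairs in a video are easy: small baseline, similar lighting, early exit.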
Learned depth (MiDaS, Depth Anything, ZoeDepth)
Single-image depth estimation. Useful as a prior for monocular SLAM (which is otherwise scale-ambiguous). Doesn't replace stereo or LiDAR, but metric variants (e.g. ZoeDepth) can anchor scale.
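A sketch of how such a prior anchors scale: take per-pixel ratios between the network's metric depth and the SLAM system's up-to-scale depth, and use the median as a robust scale estimate (synthetic data; assumes the two depth maps cover the same pixels):

```python
import numpy as np

def align_scale(slam_depth, prior_depth):
    """Estimate s such that s * slam_depth ~ prior_depth, using the
    median of per-pixel ratios (robust to outliers in either source)."""
    valid = (slam_depth > 0) & (prior_depth > 0)
    return float(np.median(prior_depth[valid] / slam_depth[valid]))

rng = np.random.default_rng(1)
slam = rng.uniform(0.5, 5.0, size=1000)                # up-to-scale SLAM depths
prior = 2.5 * slam * rng.normal(1.0, 0.05, size=1000)  # noisy metric prior
scale = align_scale(slam, prior)                       # close to the true 2.5
```

The median matters: monocular depth networks produce gross outliers at object boundaries and in the sky, and a mean-based fit would chase them.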
2. End-to-end neural SLAM
DROID-SLAM (2021)
One model takes a video stream, outputs camera poses + dense depth maps. Internally: differentiable bundle adjustment via a recurrent network. End-to-end trained on synthetic data.
Strengths: dense reconstruction; works in textureless scenes; good loop closure.
Weaknesses: heavy compute (an RTX 3090 or better); struggles to sustain real-time rates; not yet robust enough to replace ORB-SLAM3 in production.
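To see why differentiable bundle adjustment is the interesting part, it helps to look at the non-differentiable core it wraps: a Gauss-Newton solve. Here is a toy pose-only version, recovering a 2D camera position from bearing angles to known landmarks (synthetic, noise-free data; real BA jointly optimizes poses and structure):

```python
import numpy as np

def gauss_newton_localize(landmarks, bearings, t0, iters=10):
    """Recover a 2D camera position from bearing angles to known landmarks
    by repeating: linearize residuals -> solve normal equations -> update.
    DROID-SLAM's update operator makes this kind of solve differentiable."""
    t = np.asarray(t0, dtype=float)
    for _ in range(iters):
        dx = landmarks[:, 0] - t[0]
        dy = landmarks[:, 1] - t[1]
        r2 = dx**2 + dy**2
        res = np.arctan2(dy, dx) - bearings
        res = np.arctan2(np.sin(res), np.cos(res))      # wrap to [-pi, pi]
        J = np.stack([dy / r2, -dx / r2], axis=1)       # d(res)/d(t)
        t = t + np.linalg.solve(J.T @ J, -J.T @ res)
    return t

landmarks = np.array([[5.0, 0.0], [0.0, 5.0], [-4.0, 3.0], [3.0, -4.0]])
true_t = np.array([1.0, 2.0])
bearings = np.arctan2(landmarks[:, 1] - true_t[1], landmarks[:, 0] - true_t[0])
t_est = gauss_newton_localize(landmarks, bearings, t0=[0.0, 0.0])
```

Every step here (residuals, Jacobian, linear solve, update) is built from differentiable operations, which is what lets DROID-SLAM backpropagate through the optimizer and train the whole system end-to-end.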
NICE-SLAM, Co-SLAM (2022)
Combine classical tracking with neural-field mapping. Tracking is geometric; the map is a multi-resolution feature grid (NICE-SLAM) or a hash grid (Co-SLAM). Reconstruction is dense; tracking remains real-time.
3. Gaussian splatting for SLAM
3D Gaussian Splatting (Kerbl et al., 2023) represents a scene as millions of 3D Gaussians, each with position, covariance, color, and opacity. Trained by gradient descent on multi-view photo loss. The result is a continuous-density representation that renders extremely fast (real-time at 30+ FPS) and looks photorealistic.
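The rendering model itself is compact: evaluate each projected Gaussian at the pixel, then alpha-composite front to back in depth order. A 1D toy version with made-up parameters (the real pipeline does tile-based EWA splatting of 3D covariances on the GPU):

```python
import numpy as np

def render_pixel(x, centers, sigmas, opacities, colors, depths):
    """One pixel of splat rendering: evaluate each projected 1D Gaussian
    at pixel position x, then alpha-composite nearest-first."""
    color, transmittance = 0.0, 1.0
    for i in np.argsort(depths):                     # nearest Gaussian first
        alpha = opacities[i] * np.exp(-0.5 * ((x - centers[i]) / sigmas[i]) ** 2)
        color += transmittance * alpha * colors[i]
        transmittance *= 1.0 - alpha                 # light left for those behind
    return color, transmittance

# Two Gaussians at the same pixel: the nearer, fully opaque one should
# occlude the farther one entirely.
centers   = np.array([0.0, 0.0])
sigmas    = np.array([1.0, 1.0])
opacities = np.array([1.0, 1.0])
colors    = np.array([0.2, 0.9])
depths    = np.array([1.0, 2.0])   # first Gaussian is nearer
color, remaining = render_pixel(0.0, centers, sigmas, opacities, colors, depths)
```

Because every operation in this loop is differentiable with respect to the Gaussian parameters, gradient descent on a photometric loss can optimize the whole map, which is exactly the property the SLAM systems below exploit.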
SLAM systems using Gaussian splats:
- SplaTAM (2024): tracks the camera by minimizing rendering loss against the current Gaussian map. Updates the map by adding new Gaussians from each frame.
- MonoGS (2024): monocular Gaussian-splat SLAM.
- RTG-SLAM (2024): real-time variant; faster updates but lower quality.
What this buys: dense, photorealistic maps usable for visualization, novel-view synthesis, and downstream perception. The map IS the rendering.
What it doesn't buy yet: speed parity with classical SLAM. Splat maps are memory-heavy, and they don't yet replace LOAM-style LiDAR pipelines at autonomous-driving scale.
4. Foundation models in SLAM
Most exploratory work in 2024–25. Examples:
- VLM-driven place recognition: use a vision-language model to embed images; loop-closure detection via the embedding distance.
- Diffusion priors for mapping: when the scene is partially observed, a diffusion model fills in plausible geometry.
- LLM-driven semantic SLAM: caption regions of the map; query with natural language.
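The embedding-distance loop closure in the first bullet reduces to a nearest-neighbor search over past-frame embeddings. A sketch with random stand-in vectors in place of real VLM embeddings (the threshold value is illustrative, not tuned):

```python
import numpy as np

def detect_loop(query_emb, database, threshold=0.9):
    """Return the index of the most similar past embedding if its cosine
    similarity clears the threshold, else None (no loop closure)."""
    db = np.asarray(database)
    sims = db @ query_emb / (np.linalg.norm(db, axis=1) * np.linalg.norm(query_emb))
    best = int(np.argmax(sims))
    return best if sims[best] >= threshold else None

rng = np.random.default_rng(2)
database = [rng.normal(size=512) for _ in range(20)]   # past-frame embeddings
query = database[7] + 0.05 * rng.normal(size=512)      # revisit of place 7
match = detect_loop(query, database)
```

The structure is identical to classical bag-of-words place recognition (DBoW2 and friends); what changes is that the embedding space comes from a model trained on internet-scale data, so "similar" survives day/night and seasonal changes better.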
Production-ready in 2026? Mostly no. Promising on benchmarks; not yet robust enough to replace classical pipelines.
What classical methods still win on
- Speed: ORB-SLAM3 / LIO-SAM run real-time on a single CPU. Most learned SLAM needs a GPU.
- Memory: classical maps are kilobytes per square meter; Gaussian-splat maps are megabytes.
- Robustness in known regimes: feature-based SLAM has well-characterized failure modes; neural SLAM can fail in surprising ways.
- Auditability: when classical SLAM goes wrong, you can trace which feature mismatched. Neural SLAM gives you a black box.
- Edge deployment: classical SLAM runs on a Jetson Nano; neural SLAM typically needs an Orin or better.
What learned methods win on
- Featureless / hard scenes: white walls, uniform ground, fog. Classical features fail; learned features find subtle patterns.
- Wide-baseline matching: revisits from very different angles. SuperGlue/LightGlue beat ORB+brute-force handily.
- Photorealistic rendering: Gaussian splats produce maps you can fly through visually. Classical SLAM produces a sparse point cloud.
- Long-term changes: learned descriptors generalize better across day/night, summer/winter.
The production hybrid
Most 2026 production SLAM stacks combine:
- Classical front-end: ORB or LiDAR features, fast feature extraction.
- Learned matcher (LightGlue): when the scene is hard.
- Classical bundle adjustment: the math is well-understood; converges fast.
- Optional Gaussian-splat layer: for visualization or downstream tasks needing dense maps.
- Optional learned loop-closure: NetVLAD or recent vision-language embeddings.
Each component is the best tool for its specific role. Pure-classical or pure-neural systems both lose to thoughtful hybrids.
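The "learned matcher when the scene is hard" policy above is just a fallback dispatch. Matcher internals are stubbed out in this sketch; the point is the escalation logic:

```python
def hybrid_match(frame_a, frame_b, fast_matcher, learned_matcher, min_matches=50):
    """Try the cheap classical matcher first; escalate to the (slower)
    learned matcher only when the scene turns out to be hard."""
    matches = fast_matcher(frame_a, frame_b)
    if len(matches) >= min_matches:
        return matches, "classical"    # easy scene: classical result is fine
    return learned_matcher(frame_a, frame_b), "learned"
```

Since most frames in a typical deployment are easy, the expensive matcher runs rarely, and the average-case compute stays close to the pure-classical pipeline.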
The compute trajectory
Learned SLAM's compute requirement keeps dropping (better models, faster GPUs). In 2018, SuperPoint needed a desktop GPU; in 2026, LightGlue runs on a Jetson Orin. The frontier of "what's deployable" expands every year.
By ~2028, end-to-end neural SLAM at edge-deployable rates is plausible. By then "classical SLAM" will be a heritage technology like Kalman filters — still useful, no longer the cutting edge.
Where to start
- Run SuperPoint/LightGlue as a drop-in replacement in your existing visual SLAM pipeline. Compare accuracy on hard scenes.
- Try SplaTAM on TUM RGB-D. Watch it produce photorealistic maps from RGB-D streams.
- Read DROID-SLAM's paper. Understand why differentiable bundle adjustment is interesting (and what it costs).
Exercise
On a YouTube video of an indoor walk, run COLMAP (classical SfM) and compare with a Gaussian-splat reconstruction. The classical version produces a sparse point cloud that you can use for navigation. The splat version produces a 3D model you can fly through. Different outputs, different uses.
Next
GPS, RTK, and outdoor state estimation — when you don't need SLAM because the satellites can tell you where you are.