RobotForge
Published · ~13 min

Modern SLAM: learned features and Gaussian splatting

SuperPoint, DROID-SLAM, Gaussian splats: the deep-learning wave reshaping the SLAM landscape. What's new, where classical methods still win, and where the production frontier sits in 2026.

by RobotForge
#slam #deep-learning #modern

For 30 years SLAM was geometry: features + matching + optimization, all hand-crafted. Since 2018, deep learning has been eating piece after piece of the pipeline. Learned features are displacing ORB. Learned matchers are displacing raw descriptor-distance matching. End-to-end learned SLAM systems exist. And Gaussian splatting brought differentiable, photorealistic mapping to the field. Here's what's actually in production in 2026.

The four lines of attack

  1. Replace components: keep the classical pipeline; swap in learned features, matchers, depth estimators.
  2. End-to-end SLAM: train a neural network that ingests video and outputs trajectory + map.
  3. Neural fields for mapping: replace point clouds with NeRF or Gaussian splat representations.
  4. Foundation-model SLAM: leverage VLMs and diffusion models as priors over scene structure.

1. Component-level replacements

SuperPoint (2018) and SuperGlue (2020)

SuperPoint: a CNN that detects keypoints + computes descriptors in one forward pass. Significantly more repeatable than ORB across viewpoint, lighting, and weather changes.

SuperGlue: a graph-neural-network matcher. Given two sets of SuperPoint keypoints and descriptors, it uses attention plus an optimal-transport layer to produce a partial one-to-one assignment, leaving unmatchable points unassigned. Robust to occlusion, viewpoint change, and low texture.

Practical impact: SuperPoint + SuperGlue beats ORB + brute-force matching by ~30 percentage points on hard wide-baseline benchmarks. Used in modern SfM (e.g., Pixel-Perfect SfM on top of COLMAP) and several SLAM systems.

Cost: 5–10× slower than ORB; requires a GPU.

LightGlue (2023)

SuperGlue with adaptive computation — fast pairs are matched quickly, hard pairs get more attention. ~3× faster than SuperGlue with similar accuracy. The new "default learned matcher."
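
For a concrete feel, here's the pair-matching flow from the open-source cvg/LightGlue repository; the calls follow its README at the time of writing (the image paths are placeholders, and the API may shift between releases):

```python
from lightglue import LightGlue, SuperPoint
from lightglue.utils import load_image, rbd

# SuperPoint extractor + LightGlue matcher, as in the cvg/LightGlue README.
extractor = SuperPoint(max_num_keypoints=2048).eval().cuda()
matcher = LightGlue(features="superpoint").eval().cuda()

image0 = load_image("frame_000.jpg").cuda()
image1 = load_image("frame_050.jpg").cuda()

feats0 = extractor.extract(image0)
feats1 = extractor.extract(image1)
matches01 = matcher({"image0": feats0, "image1": feats1})
feats0, feats1, matches01 = [rbd(x) for x in (feats0, feats1, matches01)]  # drop batch dim

matches = matches01["matches"]                  # (K, 2) index pairs
points0 = feats0["keypoints"][matches[..., 0]]  # matched keypoints in image 0
points1 = feats1["keypoints"][matches[..., 1]]  # matched keypoints in image 1
```

Feed points0/points1 into your usual RANSAC + essential-matrix estimation; the rest of the pipeline doesn't need to know the matcher changed.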

Learned depth (MiDaS, Depth Anything, ZoeDepth)

Single-image depth estimation. MiDaS and Depth Anything predict relative (scale-and-shift-ambiguous) depth; ZoeDepth and metric-tuned variants predict absolute depth. Either way, it's a useful prior for monocular SLAM, which is otherwise scale-ambiguous: it densifies sparse geometry and, in the metric case, anchors scale. It doesn't replace stereo or LiDAR.
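
A minimal sketch of pulling a depth prior from MiDaS via torch.hub (model and transform names follow the intel-isl/MiDaS hub docs; the image path is a placeholder):

```python
import cv2
import torch

# Load the small MiDaS model and its matching preprocessing transform.
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small").eval()
transforms = torch.hub.load("intel-isl/MiDaS", "transforms")

img = cv2.cvtColor(cv2.imread("frame.png"), cv2.COLOR_BGR2RGB)
batch = transforms.small_transform(img)

with torch.no_grad():
    pred = midas(batch)                          # inverse relative depth, (1, h, w)
    depth = torch.nn.functional.interpolate(     # resize back to input resolution
        pred.unsqueeze(1), size=img.shape[:2],
        mode="bicubic", align_corners=False,
    ).squeeze()

# Relative depth only: align scale and shift against your SLAM system's
# sparse points (e.g., a least-squares fit) before treating it as metric.
```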

2. End-to-end neural SLAM

DROID-SLAM (2021)

One model takes a video stream and outputs camera poses + dense depth maps. Internally, a recurrent network predicts flow revisions and confidence weights, and a differentiable bundle-adjustment layer turns them into pose and depth updates. Trained end-to-end on synthetic data.
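
The reason this matters: the bundle-adjustment solver is itself differentiable, so training can shape what the network feeds it. Here's a toy illustration of that idea (my own sketch, not DROID's actual formulation): one weighted Gauss-Newton step on a 2D alignment problem, written in PyTorch so gradients flow back into per-correspondence confidence weights:

```python
import math
import torch

def residuals(pose, pa, pb):
    # pose = (theta, tx, ty); residuals of R(theta) @ pa + t against pb.
    theta, tx, ty = pose
    c, s = torch.cos(theta), torch.sin(theta)
    R = torch.stack([torch.stack([c, -s]), torch.stack([s, c])])
    return (pa @ R.T + torch.stack([tx, ty]) - pb).reshape(-1)

def gauss_newton_step(pose, pa, pb, conf):
    # conf: per-correspondence weights (in DROID-SLAM, a network predicts these).
    J = torch.autograd.functional.jacobian(
        lambda p: residuals(p, pa, pb), pose, create_graph=True)
    r = residuals(pose, pa, pb)
    w = conf.repeat_interleave(2)              # weight x and y residuals alike
    H = J.T @ (w[:, None] * J) + 1e-6 * torch.eye(3)
    g = J.T @ (w * r)
    return pose - torch.linalg.solve(H, g)     # update is differentiable in conf

# Synthetic problem: recover theta = 0.3, t = (1.0, -0.5).
pa = torch.randn(50, 2)
R = torch.tensor([[math.cos(0.3), -math.sin(0.3)],
                  [math.sin(0.3),  math.cos(0.3)]])
pb = pa @ R.T + torch.tensor([1.0, -0.5])

conf = torch.ones(50, requires_grad=True)      # stand-in for learned confidences
pose = torch.zeros(3)
for _ in range(5):
    pose = gauss_newton_step(pose, pa, pb, conf)
print(pose)  # ~ (0.3, 1.0, -0.5); a loss on pose would backprop into conf
```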

Strengths: dense reconstruction; works in textureless scenes; good loop closure.

Weaknesses: heavy compute (RTX 3090-class GPU or better); struggles to hit real-time rates; not yet robust enough to replace ORB-SLAM3 in production.

NICE-SLAM, Co-SLAM (2022)

Combine frame-to-model tracking with neural-field mapping. The camera is tracked by optimizing rendering losses against the map, which is a hierarchical feature grid (NICE-SLAM) or a joint coordinate-and-hash-grid encoding (Co-SLAM). Reconstruction is dense; tracking runs at interactive rates, with Co-SLAM approaching real-time.

3. Gaussian splatting for SLAM

3D Gaussian Splatting (Kerbl et al., 2023) represents a scene as millions of 3D Gaussians, each with position, covariance, color, and opacity, optimized by gradient descent on a multi-view photometric loss. The result is a continuous-density representation that renders in real time (often well above 30 FPS) and looks photorealistic.

SLAM systems using Gaussian splats:

  • SplaTAM (2024): tracks the camera by minimizing rendering loss against the current Gaussian map, and updates the map by adding new Gaussians from each frame (the core idea is sketched in code after this list).
  • MonoGS (2024): monocular Gaussian-splat SLAM.
  • RTG-SLAM (2024): real-time variant; faster updates but lower quality.
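
To make tracking-by-rendering concrete, here's a deliberately tiny 2D stand-in (my own toy, not SplaTAM's pipeline): render isotropic Gaussians in pure PyTorch, then recover a camera shift by gradient descent on the photometric error.

```python
import torch

def render_2d_splats(mu, color, opacity, sigma, H, W):
    # Toy renderer: per-pixel opacity-weighted blend of isotropic 2D Gaussians.
    # Real systems use anisotropic 3D Gaussians and a tiled CUDA rasterizer.
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    px = torch.stack([xs, ys], dim=-1).reshape(-1, 2)           # (H*W, 2)
    d2 = ((px[:, None, :] - mu[None, :, :]) ** 2).sum(-1)       # (H*W, N)
    w = opacity[None, :] * torch.exp(-0.5 * d2 / sigma**2)      # (H*W, N)
    img = (w[..., None] * color[None]).sum(1) / (w.sum(1, keepdim=True) + 1e-8)
    return img.reshape(H, W, 3)

# Fixed "map" of random splats; the observed frame is the same map seen from
# a camera offset by (3, -2) pixels.
N, H, W = 64, 48, 64
mu, color, opacity = torch.rand(N, 2) * torch.tensor([W, H]), torch.rand(N, 3), torch.rand(N)
target = render_2d_splats(mu + torch.tensor([3.0, -2.0]), color, opacity, 4.0, H, W)

shift = torch.zeros(2, requires_grad=True)     # stand-in for the camera pose
opt = torch.optim.Adam([shift], lr=0.3)
for _ in range(200):
    opt.zero_grad()
    loss = (render_2d_splats(mu + shift, color, opacity, 4.0, H, W) - target).abs().mean()
    loss.backward()
    opt.step()
print(shift)  # converges toward (3, -2): the camera was "tracked" by rendering
```

SplaTAM's mapping step then runs the same machinery the other way: freeze the pose and optimize (and add) Gaussians.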

What this buys: dense, photorealistic maps usable for visualization, novel-view synthesis, and downstream perception. The map IS the rendering.

What it doesn't buy yet: speed parity with classical SLAM. It's memory-heavy, and it doesn't yet replace LOAM-style LiDAR pipelines at autonomous-driving scale.

4. Foundation models in SLAM

Mostly exploratory work as of 2024–25. Examples:

  • VLM-driven place recognition: embed images with a vision-language model; detect loop closures by embedding distance (see the sketch after this list).
  • Diffusion priors for mapping: when the scene is partially observed, a diffusion model fills in plausible geometry.
  • LLM-driven semantic SLAM: caption regions of the map; query with natural language.
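
As a flavor of the embedding-distance idea, here's a minimal sketch using CLIP via Hugging Face transformers (the file names and the 0.92 threshold are invented for the example; real systems tune thresholds per model and environment):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(path):
    # One L2-normalized CLIP embedding per image.
    inputs = processor(images=Image.open(path), return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

# Cosine similarity between a stored keyframe and the current frame.
sim = (embed("keyframe_0042.png") * embed("current_frame.png")).sum().item()
is_loop_candidate = sim > 0.92   # pass candidates to geometric verification
```

A high embedding similarity is only a proposal; production systems still verify loop closures geometrically before adding a constraint to the pose graph.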

Production-ready in 2026? Mostly no. Promising on benchmarks; not yet robust enough to replace classical pipelines.

What classical methods still win on

  • Speed: ORB-SLAM3 / LIO-SAM run in real time on a single CPU. Most learned SLAM needs a GPU.
  • Memory: classical maps are kilobytes per square meter; Gaussian-splat maps are megabytes.
  • Robustness in known regimes: feature-based SLAM has well-characterized failure modes; neural SLAM can fail in surprising ways.
  • Auditability: when classical SLAM goes wrong, you can trace which feature mismatched. Neural SLAM gives you a black box.
  • Edge deployment: classical SLAM runs on a Jetson Nano; neural needs an Orin or better.

What learned methods win on

  • Featureless / hard scenes: white walls, uniform ground, fog. Classical features fail; learned features find subtle patterns.
  • Wide-baseline matching: revisits from very different angles. SuperGlue/LightGlue beat ORB+brute-force handily.
  • Photorealistic rendering: Gaussian splats produce maps you can fly through visually. Classical SLAM produces a sparse point cloud.
  • Long-term changes: learned descriptors generalize better across day/night, summer/winter.

The production hybrid

Most 2026 production SLAM stacks combine:

  1. Classical front-end: ORB or LiDAR features, fast feature extraction.
  2. Learned matcher (LightGlue): engaged only when the scene is hard (one fallback pattern is sketched after this list).
  3. Classical bundle adjustment: the math is well-understood; converges fast.
  4. Optional Gaussian-splat layer: for visualization or downstream tasks needing dense maps.
  5. Optional learned loop-closure: NetVLAD or recent vision-language embeddings.
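
One common shape for item 2 is a confidence-gated fallback: take the cheap classical match first and pay for the learned matcher only when the result looks weak. A sketch with OpenCV's ORB (the learned_matcher callable and the MIN_GOOD_MATCHES threshold are placeholders to tune):

```python
import cv2

MIN_GOOD_MATCHES = 80  # assumed threshold; tune on your own data

orb = cv2.ORB_create(nfeatures=1500)
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

def match_pair(img0, img1, learned_matcher=None):
    # Fast path: ORB + brute-force Hamming matching.
    k0, d0 = orb.detectAndCompute(img0, None)
    k1, d1 = orb.detectAndCompute(img1, None)
    matches = bf.match(d0, d1) if d0 is not None and d1 is not None else []
    if len(matches) >= MIN_GOOD_MATCHES or learned_matcher is None:
        return [(k0[m.queryIdx].pt, k1[m.trainIdx].pt) for m in matches]
    # Slow path: hand the hard pair to a learned matcher (e.g., LightGlue).
    return learned_matcher(img0, img1)
```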

Each component is the best tool for its specific role. Pure-classical or pure-neural systems both lose to thoughtful hybrids.

The compute trajectory

Learned SLAM's compute requirement keeps dropping (better models, faster GPUs). In 2018, SuperPoint needed a desktop GPU; in 2026, LightGlue runs on a Jetson Orin. The frontier of "what's deployable" expands every year.

By ~2028, end-to-end neural SLAM at edge-deployable rates is plausible. By then "classical SLAM" will be a heritage technology like Kalman filters — still useful, no longer the cutting edge.

Where to start

  • Run SuperPoint/LightGlue as a drop-in replacement in your existing visual SLAM pipeline. Compare accuracy on hard scenes.
  • Try SplaTAM on TUM RGB-D. Watch it produce photorealistic maps from RGB-D streams.
  • Read DROID-SLAM's paper. Understand why differentiable bundle adjustment is interesting (and what it costs).

Exercise

On a YouTube video of an indoor walk, run COLMAP (classical SfM) and compare with a Gaussian-splat reconstruction. The classical version produces a sparse point cloud that you can use for navigation. The splat version produces a 3D model you can fly through. Different outputs, different uses.

Next

GPS, RTK, and outdoor state estimation — when you don't need SLAM because the satellites can tell you where you are.
