RobotForge
Published · ~16 min read

Visual SLAM: ORB-SLAM3 internals

One of the best open-source SLAM systems. Tracking, local mapping, loop closure, multi-map atlas — the four threads that turn a video stream into a reusable 3D map.

by RobotForge
#slam #visual-slam #orb-slam

ORB-SLAM3 (2021, Campos et al.) is among the most capable open-source visual SLAM systems. Monocular, stereo, RGB-D, and visual-inertial inputs; multi-map "Atlas" handling; map merging on loop closure. Reading its architecture once teaches you what every modern visual SLAM system does — because most of them adopted similar patterns. Here's the working overview.

The four threads

ORB-SLAM3 runs four independent threads, each with a clear responsibility:

  1. Tracking: per-frame, fast (~30 Hz). Estimates camera pose given the current keyframe map.
  2. Local Mapping: per-keyframe, slower. Refines local geometry, adds new map points.
  3. Loop Closure / Map Merging: rare, expensive. Detects revisits; merges maps when found.
  4. Visualization: optional; renders the trajectory + map in a viewer.

This separation lets the front-end stay real-time while the back-end does the heavy optimization.
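
The decoupling is a producer-consumer handoff: tracking publishes keyframes and never waits for the optimizer. A toy sketch of the pattern, with Python threads standing in for the C++ ones and stub functions where the real work goes:

import queue, threading, time

keyframe_queue = queue.Queue()              # tracking -> local mapping

def track_frame(frame):                     # stub for per-frame pose estimation
    return "pose", frame % 10 == 0          # pretend every 10th frame is a keyframe

def process_keyframe(kf):                   # stub for triangulation + local BA
    time.sleep(0.05)                        # back-end work takes longer

def tracking_loop(frames):                  # ~30 Hz; must never block
    for frame in frames:
        pose, is_keyframe = track_frame(frame)
        if is_keyframe:
            keyframe_queue.put(frame)       # hand off and move on

def mapping_loop():                         # consumes keyframes at its own pace
    while True:
        process_keyframe(keyframe_queue.get())

threading.Thread(target=mapping_loop, daemon=True).start()
tracking_loop(range(100))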

The pipeline (per frame)

1. Feature extraction

Extract ORB features from the current frame. ORB = FAST corners + rotated BRIEF descriptor. Fast (~5 ms for 1000 features), invariant to rotation, robust to lighting.
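
ORB-SLAM3 ships its own multi-scale extractor, but OpenCV's ORB shows the same operation in a few lines (the image path below is a placeholder):

import cv2

img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)   # placeholder path
orb = cv2.ORB_create(nfeatures=1000)                  # FAST corners + rotated BRIEF
keypoints, descriptors = orb.detectAndCompute(img, None)
# descriptors: 32 bytes per keypoint, matched by Hamming distance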

2. Pose tracking

Match features to the previous frame. Solve PnP (Perspective-n-Point) for the camera's pose:

R, t = solve_pnp(points_2d_current_frame, points_3d_from_map)

Output: 6-DOF pose of the current camera in the map frame.
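
ORB-SLAM3 itself refines this pose with motion-only bundle adjustment in g2o, but the step is PnP-shaped. Here it is with OpenCV, using synthetic data standing in for the matched 2D-3D pairs:

import numpy as np, cv2

K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
pts_3d = np.random.rand(50, 3) * 4 + [-2., -2., 4.]   # synthetic map points
true_rvec = np.array([0., 0.1, 0.])
true_tvec = np.array([0.2, 0., 0.])
pts_2d, _ = cv2.projectPoints(pts_3d, true_rvec, true_tvec, K, None)

ok, rvec, tvec, inliers = cv2.solvePnPRansac(pts_3d, pts_2d, K, None)
R, _ = cv2.Rodrigues(rvec)              # rotation vector -> 3x3 matrix
# (R, tvec) maps map-frame points into the current camera frame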

3. Track local map

Project nearby 3D map points into the current image; match more features to refine the pose. Adds robustness; helps with motion blur, occlusion.
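
A numpy sketch of the project-and-gate idea, with synthetic stand-ins for the pose, map points, and keypoints:

import numpy as np

K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
R, t = np.eye(3), np.zeros(3)                         # current pose estimate
map_pts = np.random.rand(200, 3) * 4 + [-2., -2., 4.] # nearby map points
kp_xy = np.random.rand(500, 2) * [640., 480.]         # current frame's keypoints (px)

cam = (R @ map_pts.T).T + t                           # map frame -> camera frame
front = cam[cam[:, 2] > 0.1]                          # keep points in front of camera
proj = (K @ front.T).T
proj = proj[:, :2] / proj[:, 2:3]                     # perspective divide -> pixels

# Gate: only keypoints within a few pixels of a projection are candidate
# matches; ORB descriptor distance breaks the remaining ties.
d = np.linalg.norm(proj[:, None, :] - kp_xy[None, :, :], axis=2)
candidates = d < 8.0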

4. Decide: keyframe or not?

Keyframes are camera frames retained as part of the map. Triggers:

  • Sufficient time elapsed since last keyframe.
  • Sufficient parallax (camera has moved enough).
  • Tracking quality drops below threshold.

Most frames don't become keyframes. Memory stays bounded.
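
A toy version of the policy; the thresholds below are illustrative, not ORB-SLAM3's actual values (its real policy also considers local-mapping load and per-level feature counts):

def should_insert_keyframe(frames_since_kf, median_parallax_px, tracked_ratio):
    """Illustrative keyframe policy; all thresholds are made up."""
    if tracked_ratio < 0.25:          # tracking quality dropping
        return True
    if frames_since_kf > 30:          # enough time elapsed (~1 s at 30 Hz)
        return True
    if median_parallax_px > 20:       # enough camera motion (parallax)
        return True
    return False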

Local Mapping (the fast back-end)

When a new keyframe is added:

  • Triangulate new 3D map points from feature matches across the new + neighboring keyframes (sketched after this list).
  • Update existing map points — refine 3D positions using all observations.
  • Cull bad keyframes (redundant ones with too few unique observations).
  • Run local bundle adjustment: optimize the new keyframe pose + map points + neighboring keyframes (a window of ~20 keyframes). Sparse non-linear least squares; converges in tens of milliseconds.
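
The triangulation step reduces to one OpenCV call once you have two projection matrices. The numbers here are synthetic but geometrically consistent:

import numpy as np, cv2

K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])        # keyframe 1 at origin
P2 = K @ np.hstack([np.eye(3), [[-0.2], [0.], [0.]]])    # keyframe 2, 0.2 m to the right

# Matched pixel coordinates in each view (2 x N); synthetic but consistent
x1 = np.array([[320., 400.], [240., 260.]])
x2 = np.array([[295., 375.], [240., 260.]])

X_h = cv2.triangulatePoints(P1, P2, x1, x2)   # 4 x N homogeneous points
X = (X_h[:3] / X_h[3]).T                      # N x 3 map points, Z ≈ 4 m here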

Loop Closure (the slow back-end)

The killer feature. Detect when the camera revisits a previously mapped place; correct accumulated drift across the whole trajectory.

Detection

  • For each new keyframe, query a bag-of-words (DBoW2) database of all past keyframes.
  • Find candidates with high BoW similarity.
  • Verify each candidate geometrically: solve a relative transform; check inlier count (sketched below).
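
ORB-SLAM3 verifies loops by solving a Sim(3) (or SE(3) when scale is known) with its own RANSAC; as a rough stand-in, here is the same accept/reject shape using OpenCV's essential-matrix tools on synthetic correspondences:

import numpy as np, cv2

K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
rng = np.random.default_rng(0)
X = rng.uniform([-2., -2., 4.], [2., 2., 8.], (100, 3))  # shared 3D structure

# Candidate keyframe sees the same structure from a different pose
rvec2 = np.array([0., 0.2, 0.])
tvec2 = np.array([-0.5, 0., 0.])
p1, _ = cv2.projectPoints(X, np.zeros(3), np.zeros(3), K, None)
p2, _ = cv2.projectPoints(X, rvec2, tvec2, K, None)

E, mask = cv2.findEssentialMat(p1, p2, K, method=cv2.RANSAC,
                               prob=0.999, threshold=1.0)
n_inliers, R, t, _ = cv2.recoverPose(E, p1, p2, K, mask=mask)
accept = n_inliers > 50      # illustrative threshold; reject the loop otherwise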

Correction

  • Once a loop is verified, run pose graph optimization: a factor graph over all keyframe poses (ORB-SLAM's "essential graph"), with the loop closure as an extra constraint. Fast; updates poses globally.
  • Then run full bundle adjustment: re-optimize all keyframes + all map points jointly. Slow (seconds for large maps); runs in a separate thread.
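
To see why pose graph optimization is fast and global, here's a toy 2D version: a square trajectory with drifting odometry and one loop edge pulling it closed (scipy stands in for the g2o solver ORB-SLAM3 uses):

import numpy as np
from scipy.optimize import least_squares

def v2m(p):                          # (x, y, th) -> 3x3 homogeneous transform
    x, y, th = p
    c, s = np.cos(th), np.sin(th)
    return np.array([[c, -s, x], [s, c, y], [0., 0., 1.]])

def m2v(T):                          # inverse of v2m
    return np.array([T[0, 2], T[1, 2], np.arctan2(T[1, 0], T[0, 0])])

# Square trajectory: 4 odometry edges (forward 1 m, turn 90 deg),
# plus one loop edge saying pose 4 coincides with pose 0.
edges = [(i, i + 1, np.array([1.0, 0.0, np.pi / 2])) for i in range(4)]
edges.append((4, 0, np.array([0.0, 0.0, 0.0])))

def residuals(flat):
    poses = np.vstack([[0.0, 0.0, 0.0], flat.reshape(-1, 3)])  # pose 0 fixed
    r = []
    for i, j, z in edges:
        pred = m2v(np.linalg.inv(v2m(poses[i])) @ v2m(poses[j]))
        e = pred - z
        e[2] = np.arctan2(np.sin(e[2]), np.cos(e[2]))          # wrap the angle
        r.append(e)
    return np.concatenate(r)

# Initial guess: chain the odometry with a little noise (simulated drift)
rng = np.random.default_rng(1)
guess, T = [], v2m([0.0, 0.0, 0.0])
for i, j, z in edges[:4]:
    T = T @ v2m(z + rng.normal(0, 0.05, 3))
    guess.append(m2v(T))

sol = least_squares(residuals, np.concatenate(guess))
print(sol.x.reshape(-1, 3))          # poses 1..4 pulled back into a clean square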

Watch the trajectory snap into place. The visual is satisfying.

Atlas: multi-map handling

One of ORB-SLAM3's innovations. When tracking is lost (lighting change, motion blur, going dark), the system creates a new map and starts mapping fresh. When it later detects a loop closure with an old map, it merges the two into a unified map.

This handles real-world recovery: occasional tracking loss doesn't restart the session from scratch. After a merge, both maps live in one coordinate system and the trajectory stitches together.

Visual-Inertial mode (VI-ORB-SLAM)

Adding an IMU dramatically improves robustness:

  • Pre-integration: integrate IMU readings between keyframes to produce a single pseudo-measurement (sketched after this list).
  • Add IMU factors to the bundle adjustment objective.
  • The IMU constrains scale (monocular SLAM is otherwise scale-ambiguous).
  • The IMU prevents tracking loss during fast motion or featureless scenes.
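
A heavily simplified sketch of pre-integration: Euler-integrate gyro and accelerometer samples between keyframes i and j into one relative rotation/velocity/position. Real pre-integration also propagates bias Jacobians and noise covariance; this shows only the shape:

import numpy as np

def exp_so3(w):                      # rotation vector -> matrix (Rodrigues)
    th = np.linalg.norm(w)
    if th < 1e-9:
        return np.eye(3)
    k = w / th
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(th) * K + (1 - np.cos(th)) * (K @ K)

def preintegrate(gyro, accel, dt):
    """Accumulate IMU samples between keyframes i and j into one
    pseudo-measurement (dR, dv, dp) in frame i. Bias and noise omitted."""
    dR, dv, dp = np.eye(3), np.zeros(3), np.zeros(3)
    for w, a in zip(gyro, accel):
        dp += dv * dt + 0.5 * (dR @ a) * dt**2
        dv += (dR @ a) * dt
        dR = dR @ exp_so3(w * dt)
    return dR, dv, dp

# 200 samples at 200 Hz between two keyframes (synthetic: slow yaw spin)
gyro = np.tile([0.0, 0.0, 0.1], (200, 1))
accel = np.tile([0.0, 0.0, 9.81], (200, 1))   # gravity not yet subtracted
dR, dv, dp = preintegrate(gyro, accel, dt=1 / 200)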

VI-mode is what most production drone autonomy uses. ORB-SLAM3 was the first open-source system to do it cleanly.

Stereo and RGB-D

Stereo: triangulate 3D points immediately from each frame's left+right images. No scale ambiguity; better depth at close range.
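
Depth from a stereo match is one formula, Z = f·b/d, with focal length f in pixels, baseline b in meters, and disparity d in pixels. Illustrative, roughly EuRoC-like numbers:

f_px, baseline_m = 500.0, 0.11      # illustrative stereo rig values
u_left, u_right = 352.0, 338.2      # matched feature's x-coordinate per image
disparity = u_left - u_right        # 13.8 px
Z = f_px * baseline_m / disparity   # depth ≈ 4 m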

RGB-D: similar but the depth comes from the sensor (Kinect, RealSense). Works indoors only (typical RGB-D range is ~5 m); fast and accurate.

What ORB-SLAM3 isn't great at

  • Dense reconstruction: produces a sparse map of feature points, not a mesh. Use TSDF or Gaussian splatting on top if you need surfaces.
  • Long-term mapping: works well within a session. Re-localizing after months in a changing environment is a different problem.
  • Featureless scenes: white walls, fog, water. ORB features fail; tracking gets lost.
  • Very fast motion: motion blur kills feature extraction. IMU helps but only so much.

Modern alternatives

  • VINS-Fusion: HKUST's visual-inertial SLAM. Less mature multi-map but high-quality VI.
  • OpenVSLAM / Stella-VSLAM: ORB-SLAM-style; community fork with cleaner API.
  • DROID-SLAM: replaces hand-crafted feature matching with learned dense correspondence. Better in low-texture scenes; slower.
  • NICE-SLAM, Gaussian-SLAM: neural-field SLAM. Dense reconstruction; research-grade speed.
  • NVIDIA Isaac VSLAM: GPU-accelerated; production-targeted.

Running it

git clone https://github.com/UZ-SLAMLab/ORB_SLAM3
cd ORB_SLAM3
./build.sh
./Examples/Stereo/stereo_euroc \
    Vocabulary/ORBvoc.txt Examples/Stereo/EuRoC.yaml \
    /path/to/euroc/V1_01_easy \
    Examples/Stereo/EuRoC_TimeStamps/V101.txt

Runs on the EuRoC drone dataset. Watch the trajectory + sparse map build live. The first time the loop-closure correction snaps the trajectory back into place is when the field's progress feels real.

The takeaways

  • SLAM systems are pipelines, not algorithms. The architecture matters as much as any single piece of math.
  • Feature-based methods (ORB) are still production-standard. Deep features help in hard scenes but aren't yet real-time-competitive on commodity hardware.
  • Loop closure is what makes SLAM more than odometry. Every paper's "drift correction" video is loop closure firing.
  • Visual-inertial is dramatically more robust than visual-only. Add an IMU when you can.

Exercise

Run ORB-SLAM3 on the TUM RGB-D dataset (free download). Watch the four threads work simultaneously. Identify the moments when loop closure fires. Then run the same dataset with monocular vs stereo input — observe how stereo's known scale eliminates drift in absolute size.

Next

LiDAR SLAM with LOAM and its descendants — the lineage that powers most modern autonomous vehicles.
