Visual SLAM: ORB-SLAM3 internals
One of the best open-source SLAM systems. Tracking, local mapping, loop closure, multi-map atlas — the four threads that turn a video stream into a reusable 3D map.
ORB-SLAM3 (2021, Campos et al.) is among the most capable open-source visual SLAM systems. Monocular, stereo, RGB-D, and visual-inertial inputs; multi-map "Atlas" handling; map merging on loop closure. Reading its architecture once teaches you what every modern visual SLAM system does — because most of them adopted similar patterns. Here's the working overview.
The four threads
ORB-SLAM3 runs four independent threads, each with a clear responsibility:
- Tracking: per-frame, fast (~30 Hz). Estimates camera pose given the current keyframe map.
- Local Mapping: per-keyframe, slower. Refines local geometry, adds new map points.
- Loop Closure / Map Merging: rare, expensive. Detects revisits; merges maps when found.
- Visualization: optional; renders the trajectory + map in a viewer.
This separation lets the front-end stay real-time while the back-end does the heavy optimization.
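The division of labor can be sketched as a toy producer-consumer pipeline. This is a Python stand-in for the real C++ threads; the frame contents, keyframe policy, and queue names are all invented for illustration:

```python
import queue, threading

# Thread layout sketch: tracking pushes keyframes to local mapping,
# which hands refined keyframes onward to loop closing.
kf_queue, loop_queue = queue.Queue(), queue.Queue()

def tracking(frames):
    for i, frame in enumerate(frames):
        pose = frame                  # placeholder for real pose estimation
        if i % 10 == 0:               # pretend every 10th frame is a keyframe
            kf_queue.put(pose)
    kf_queue.put(None)                # shutdown signal

def local_mapping():
    while (kf := kf_queue.get()) is not None:
        loop_queue.put(kf)            # refined keyframe handed onward
    loop_queue.put(None)

def loop_closing(out):
    while (kf := loop_queue.get()) is not None:
        out.append(kf)                # would query BoW database here

out = []
threads = [threading.Thread(target=tracking, args=(range(30),)),
           threading.Thread(target=local_mapping),
           threading.Thread(target=loop_closing, args=(out,))]
for t in threads: t.start()
for t in threads: t.join()
print(out)   # keyframes 0, 10, 20 flowed through all three stages
```

The queues are the point: tracking never blocks on mapping, and mapping never blocks on loop closure, which is how the front-end holds its frame rate.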
The pipeline (per frame)
1. Feature extraction
Extract ORB features from the current frame. ORB = FAST corners + rotated BRIEF descriptor. Fast (~5 ms for 1000 features), invariant to rotation, robust to lighting.
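The FAST half of ORB is simple enough to sketch directly. This is an illustrative segment test on a synthetic image, not ORB's full pipeline (no pyramid, no orientation, no BRIEF); the threshold and arc length are typical defaults:

```python
# FAST-style corner test: a pixel is a corner if >= n contiguous pixels
# on a 16-pixel circle are all brighter or all darker than the center.
CIRCLE = [(0,-3),(1,-3),(2,-2),(3,-1),(3,0),(3,1),(2,2),(1,3),
          (0,3),(-1,3),(-2,2),(-3,1),(-3,0),(-3,-1),(-2,-2),(-1,-3)]

def is_fast_corner(img, x, y, t=20, n=9):
    c = img[y][x]
    # label each circle pixel: +1 brighter, -1 darker, 0 similar
    s = [1 if img[y+dy][x+dx] > c+t else -1 if img[y+dy][x+dx] < c-t else 0
         for dx, dy in CIRCLE]
    s2 = s + s                        # doubled list handles wrap-around
    for lab in (1, -1):
        run = 0
        for v in s2:
            run = run + 1 if v == lab else 0
            if run >= n:
                return True
    return False

# Synthetic 12x12 image: bright square on a dark background
img = [[0]*12 for _ in range(12)]
for yy in range(4, 12):
    for xx in range(4, 12):
        img[yy][xx] = 255
print(is_fast_corner(img, 4, 4))   # True: the square's corner
print(is_fast_corner(img, 6, 6))   # False: interior of the square
```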
2. Pose tracking
Match features to the previous frame. Solve PnP (Perspective-n-Point) for the camera's pose:
R, t = solve_pnp(features_2d_current, map_points_3d_previous)
Output: 6-DOF pose of the current camera in the map frame.
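PnP inverts the pinhole projection model, so the forward model is worth having in front of you. A minimal sketch with hypothetical intrinsics and pose (PnP recovers R and t from known 2D-3D matches by inverting exactly this):

```python
import numpy as np

# Pinhole projection: u ~ K (R X + t), then divide by depth.
K = np.array([[500., 0., 320.],
              [0., 500., 240.],
              [0., 0., 1.]])         # hypothetical intrinsics
R = np.eye(3)                        # camera aligned with world axes
t = np.array([0., 0., 2.])           # world origin sits 2 m ahead

def project(X):
    x_cam = R @ X + t                # world frame -> camera frame
    u = K @ x_cam                    # camera frame -> homogeneous pixels
    return u[:2] / u[2]              # perspective divide

print(project(np.array([0., 0., 0.])))   # [320. 240.]: the image center
```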
3. Track local map
Project nearby 3D map points into the current image; match more features to refine the pose. Adds robustness; helps with motion blur, occlusion.
4. Decide: keyframe or not?
Keyframes are camera frames retained as part of the map. Triggers:
- Sufficient time elapsed since last keyframe.
- Sufficient parallax (camera has moved enough).
- Tracking quality drops below threshold.
Most frames don't become keyframes. Memory stays bounded.
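The triggers above amount to a small decision function. A hypothetical policy with invented thresholds (the real ones are tuned per sensor and resolution):

```python
# Keyframe decision sketch: insert a keyframe when tracking degrades,
# enough time has passed, or the camera has moved enough.
def need_keyframe(frames_since_kf, parallax_px, tracked_ratio):
    if tracked_ratio < 0.25:      # tracking quality dropped
        return True
    if frames_since_kf > 30:      # enough time elapsed (~1 s at 30 Hz)
        return True
    if parallax_px > 40:          # sufficient parallax
        return True
    return False

print(need_keyframe(5, 10, 0.9))    # False: recent keyframe, little motion
print(need_keyframe(5, 10, 0.1))    # True: tracking degraded
```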
Local Mapping (back-end-fast)
When a new keyframe is added:
- Triangulate new 3D map points from feature matches across the new + neighboring keyframes.
- Update existing map points — refine 3D positions using all observations.
- Cull bad keyframes (redundant ones with too few unique observations).
- Run local bundle adjustment: optimize the new keyframe pose + map points + neighboring keyframes (a window of ~20). Sparse non-linear least squares; converges in tens of milliseconds.
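Bundle adjustment is sparse non-linear least squares under the hood. A one-unknown Gauss-Newton toy shows the shape of each iteration (the measurement model and baselines here are invented; real BA has thousands of coupled unknowns):

```python
import numpy as np

# Gauss-Newton on a scalar unknown d (a landmark's depth):
# minimize sum_i (z_i - f(d))^2.
def gauss_newton(f, jac, z, d0, iters=10):
    d = d0
    for _ in range(iters):
        r = z - f(d)               # residuals
        J = jac(d)                 # Jacobian of f w.r.t. d
        d = d + (J @ r) / (J @ J)  # normal-equations step
    return d

# Hypothetical model: two views observe b_i / d for baselines b_i
b = np.array([0.1, 0.2])
f = lambda d: b / d
jac = lambda d: -b / d**2
z = f(2.0) + np.array([1e-3, -1e-3])     # noisy observations of a 2 m depth
print(round(gauss_newton(f, jac, z, d0=1.0), 2))   # 2.01, near the true 2.0
```

The real solver exploits the sparsity of the Jacobian (each point is seen by only a few keyframes), which is why a ~20-keyframe window converges in tens of milliseconds.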
Loop Closure (back-end-slow)
The killer feature. Detect when the camera revisits a previously-mapped place; correct accumulated drift across the whole trajectory.
Detection
- For each new keyframe, query a bag-of-words (DBoW2) database of all past keyframes.
- Find candidates with high BoW similarity.
- Verify each candidate geometrically: solve a relative transform; check inlier count.
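The retrieval step boils down to comparing word histograms. A toy version with invented word IDs (DBoW2 uses tf-idf-weighted vectors and an inverted index, not raw counts):

```python
from collections import Counter
from math import sqrt

# Bag-of-words place recognition sketch: describe each keyframe as a
# histogram over visual-word IDs, rank candidates by cosine similarity.
def bow(word_ids):
    return Counter(word_ids)

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

query = bow([3, 7, 7, 42, 19])            # current keyframe's words
db = {                                    # past keyframes (hypothetical)
    "kf_10": bow([3, 7, 42, 19, 19]),     # same place, revisited
    "kf_55": bow([1, 2, 88, 91, 5]),      # somewhere else entirely
}
best = max(db, key=lambda k: cosine(query, db[k]))
print(best)   # kf_10 scores highest -> loop candidate for geometric check
```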
Correction
- Once a loop is verified, run pose graph optimization: a factor graph over all keyframes' poses, with the loop closure as a constraint. Fast; updates poses globally.
- Then run full bundle adjustment: re-optimize all keyframes + all map points jointly. Slow (seconds for large maps); runs in a separate thread.
Watch the trajectory snap into place. The visual is satisfying.
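What the snap is doing can be caricatured in one dimension: odometry overshoots, the closure constraint pins the endpoint, and the error is spread over the whole chain. Real pose graphs optimize over SE(3) with proper covariances; this toy only shows the drift-spreading idea:

```python
# Toy pose-graph correction: drifting 1D odometry with a loop-closure
# constraint saying the last pose must coincide with a known revisit point.
poses = [0.0, 1.02, 2.05, 3.09, 4.12]   # drifted odometry; true length 4.0
loop_error = poses[-1] - 4.0            # closure residual
n = len(poses) - 1
corrected = [p - loop_error * i / n for i, p in enumerate(poses)]
print(corrected[-1])   # endpoint snaps to 4.0; earlier poses shift less
```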
Atlas: multi-map handling
One of ORB-SLAM3's innovations. When tracking is lost (lighting change, motion blur, going dark), the system creates a new map and starts mapping fresh. When it later detects a loop closure with an old map, it merges the two into a unified map.
This handles real-world recovery: occasional tracking loss doesn't restart from scratch. Each map shares the same coordinate system after merge; the trajectory stitches together.
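The merge itself is a change of coordinates: the verified loop closure yields a rigid transform from the new map's frame to the old one, and every pose in the new map is re-expressed through it. A 2D sketch with an invented alignment (the real system composes SE(3) transforms):

```python
import numpy as np

# Map-merging sketch: T_ab maps new-map coordinates into old-map coordinates.
def se2(theta, tx, ty):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, tx], [s, c, ty], [0., 0., 1.]])

T_ab = se2(np.pi / 2, 5.0, 0.0)        # hypothetical alignment from the loop
new_map_poses = [se2(0.0, 1.0, 0.0)]   # a pose at (1, 0) in the new map
merged = [T_ab @ T for T in new_map_poses]
print(np.round(merged[0][:2, 2], 6))   # its position becomes (5, 1) old-map
```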
Visual-Inertial mode (VI-ORB-SLAM)
Adding an IMU dramatically improves robustness:
- Pre-integration: integrate IMU readings between keyframes to produce a single pseudo-measurement.
- Add IMU factors to the bundle adjustment objective.
- The IMU constrains scale (monocular SLAM is otherwise scale-ambiguous).
- The IMU prevents tracking loss during fast motion or featureless scenes.
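Pre-integration is the step worth making concrete: sum the high-rate IMU samples between two keyframes into one relative-motion pseudo-measurement so the optimizer never touches raw 200 Hz data. A 1D, gravity-free, bias-free sketch (real pre-integration handles rotation, gravity, and bias Jacobians):

```python
# Sum accelerometer samples between keyframes into (delta-v, delta-p).
def preintegrate(accels, dt):
    dv, dp = 0.0, 0.0
    for a in accels:
        dp += dv * dt + 0.5 * a * dt * dt   # position increment
        dv += a * dt                        # velocity increment
    return dv, dp

accels = [2.0] * 100          # constant 2 m/s^2 for 0.5 s at 200 Hz
dv, dp = preintegrate(accels, dt=0.005)
print(dv, dp)                 # 1.0 m/s gained, 0.25 m traveled
```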
VI-mode is what most production drone autonomy uses. ORB-SLAM3 was the first open-source system to do it cleanly.
Stereo and RGB-D
Stereo: triangulate 3D points immediately from each frame's left+right images. No scale ambiguity; better depth at close range.
RGB-D: similar but the depth comes from the sensor (Kinect, RealSense). Works indoors only (typical RGB-D range is ~5 m); fast and accurate.
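For a rectified stereo pair, depth per pixel follows directly from disparity: Z = f·b/d. A quick check with a hypothetical rig, which also shows why stereo depth degrades with range (disparity shrinks, so a fixed matching error costs more depth):

```python
# Depth from disparity for a rectified stereo pair: Z = f * b / d.
def stereo_depth(f_px, baseline_m, disparity_px):
    return f_px * baseline_m / disparity_px

# Hypothetical rig: 500 px focal length, 12 cm baseline
print(stereo_depth(500.0, 0.12, 60.0))   # ~1.0 m for a 60 px disparity
print(stereo_depth(500.0, 0.12, 6.0))    # ~10.0 m for 6 px: far = noisy
```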
What ORB-SLAM3 isn't great at
- Dense reconstruction: produces a sparse map of feature points, not a mesh. Use TSDF or Gaussian splatting on top if you need surfaces.
- Long-term mapping: works well within a session. Re-localizing after months in a changing environment is a different problem.
- Featureless scenes: white walls, fog, water. ORB features fail; tracking gets lost.
- Very fast motion: motion blur kills feature extraction. IMU helps but only so much.
Modern alternatives
- VINS-Fusion: HKUST's visual-inertial SLAM. Less mature multi-map but high-quality VI.
- OpenVSLAM / Stella-VSLAM: ORB-SLAM-style; community fork with cleaner API.
- DROID-SLAM: deep-learning replacement of feature matching with learned correspondence. Better in low-texture; slower.
- NICE-SLAM, Gaussian-SLAM: neural-field SLAM. Dense reconstruction; research-grade speed.
- NVIDIA Isaac VSLAM: GPU-accelerated; production-targeted.
Running it
git clone https://github.com/UZ-SLAMLab/ORB_SLAM3
cd ORB_SLAM3
./build.sh
./Examples/Stereo/stereo_euroc \
Vocabulary/ORBvoc.txt Examples/Stereo/EuRoC.yaml \
/path/to/euroc/V1_01_easy
Runs on the EuRoC drone dataset. Watch the trajectory + sparse map build live. The first time the loop-closure correction snaps the trajectory back into place is when the field's progress feels real.
The takeaways
- SLAM systems are pipelines, not algorithms. The architecture matters as much as any single piece of math.
- Feature-based methods (ORB) are still production-standard. Deep features are the future for hard scenes; not yet faster on commodity hardware.
- Loop closure is what makes SLAM more than odometry. Every paper's "drift correction" video is loop closure firing.
- Visual-inertial is dramatically more robust than visual-only. Add an IMU when you can.
Exercise
Run ORB-SLAM3 on the TUM RGB-D dataset (free download). Watch the four threads work simultaneously. Identify the moments when loop closure fires. Then run the same dataset with monocular vs stereo input — observe how stereo's known scale eliminates drift in absolute size.
Next
LiDAR SLAM with LOAM and its descendants — the lineage that powers most modern autonomous vehicles.