Visual SLAM: ORB-SLAM3 internals
One of the best open-source SLAM systems. Tracking, local mapping, loop closure, multi-map atlas — the four threads that turn a video stream into a reusable 3D map.
ORB-SLAM3 (2021, Campos et al.) is among the most capable open-source visual SLAM systems. Monocular, stereo, RGB-D, and visual-inertial inputs; multi-map "Atlas" handling; map merging on loop closure. Reading its architecture once teaches you what every modern visual SLAM system does — because most of them adopted similar patterns. Here's the working overview.
The four threads
ORB-SLAM3 runs four independent threads, each with a clear responsibility:
- Tracking: per-frame, fast (~30 Hz). Estimates camera pose given the current keyframe map.
- Local Mapping: per-keyframe, slower. Refines local geometry, adds new map points.
- Loop Closure / Map Merging: rare, expensive. Detects revisits; merges maps when found.
- Visualization: optional; renders the trajectory + map in a viewer.
This separation lets the front-end stay real-time while the back-end does the heavy optimization.
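The division of labor can be sketched as a toy producer-consumer pipeline. This is a Python stand-in for the real C++ threads; the frame contents, keyframe policy, and queue names are all invented for illustration:

```python
import queue, threading

# Thread layout sketch: tracking pushes keyframes to local mapping,
# which hands refined keyframes onward to loop closing.
kf_queue, loop_queue = queue.Queue(), queue.Queue()

def tracking(frames):
    for i, frame in enumerate(frames):
        pose = frame                  # placeholder for real pose estimation
        if i % 10 == 0:               # pretend every 10th frame is a keyframe
            kf_queue.put(pose)
    kf_queue.put(None)                # shutdown signal

def local_mapping():
    while (kf := kf_queue.get()) is not None:
        loop_queue.put(kf)            # refined keyframe handed onward
    loop_queue.put(None)

def loop_closing(out):
    while (kf := loop_queue.get()) is not None:
        out.append(kf)                # would query BoW database here

out = []
threads = [threading.Thread(target=tracking, args=(range(30),)),
           threading.Thread(target=local_mapping),
           threading.Thread(target=loop_closing, args=(out,))]
for t in threads: t.start()
for t in threads: t.join()
print(out)   # keyframes 0, 10, 20 flowed through all three stages
```

The queues are the point: tracking never blocks on mapping, and mapping never blocks on loop closure, which is how the front-end holds its frame rate.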
The pipeline (per frame)
1. Feature extraction
Extract ORB features from the current frame. ORB = FAST corners + rotated BRIEF descriptor. Fast (~5 ms for 1000 features), invariant to rotation, robust to lighting.
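The FAST half of ORB is simple enough to sketch directly. This is an illustrative segment test on a synthetic image, not ORB's full pipeline (no pyramid, no orientation, no BRIEF); the threshold and arc length are typical defaults:

```python
# FAST-style corner test: a pixel is a corner if >= n contiguous pixels
# on a 16-pixel circle are all brighter or all darker than the center.
CIRCLE = [(0,-3),(1,-3),(2,-2),(3,-1),(3,0),(3,1),(2,2),(1,3),
          (0,3),(-1,3),(-2,2),(-3,1),(-3,0),(-3,-1),(-2,-2),(-1,-3)]

def is_fast_corner(img, x, y, t=20, n=9):
    c = img[y][x]
    # label each circle pixel: +1 brighter, -1 darker, 0 similar
    s = [1 if img[y+dy][x+dx] > c+t else -1 if img[y+dy][x+dx] < c-t else 0
         for dx, dy in CIRCLE]
    s2 = s + s                        # doubled list handles wrap-around
    for lab in (1, -1):
        run = 0
        for v in s2:
            run = run + 1 if v == lab else 0
            if run >= n:
                return True
    return False

# Synthetic 12x12 image: bright square on a dark background
img = [[0]*12 for _ in range(12)]
for yy in range(4, 12):
    for xx in range(4, 12):
        img[yy][xx] = 255
print(is_fast_corner(img, 4, 4))   # True: the square's corner
print(is_fast_corner(img, 6, 6))   # False: interior of the square
```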
2. Pose tracking
Match features to the previous frame. Solve PnP (Perspective-n-Point) for the camera's pose:
R, t = solve_pnp(features_2d_current, map_points_3d_previous)
Output: 6-DOF pose of the current camera in the map frame.
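PnP inverts the pinhole projection model, so the forward model is worth having in front of you. A minimal sketch with hypothetical intrinsics and pose (PnP recovers R and t from known 2D-3D matches by inverting exactly this):

```python
import numpy as np

# Pinhole projection: u ~ K (R X + t), then divide by depth.
K = np.array([[500., 0., 320.],
              [0., 500., 240.],
              [0., 0., 1.]])         # hypothetical intrinsics
R = np.eye(3)                        # camera aligned with world axes
t = np.array([0., 0., 2.])           # world origin sits 2 m ahead

def project(X):
    x_cam = R @ X + t                # world frame -> camera frame
    u = K @ x_cam                    # camera frame -> homogeneous pixels
    return u[:2] / u[2]              # perspective divide

print(project(np.array([0., 0., 0.])))   # [320. 240.]: the image center
```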
3. Track local map
Project nearby 3D map points into the current image; match more features to refine the pose. Adds robustness; helps with motion blur, occlusion.
4. Decide: keyframe or not?
Keyframes are camera frames retained as part of the map. Triggers:
- Sufficient time elapsed since last keyframe.
- Sufficient parallax (camera has moved enough).
- Tracking quality drops below threshold.
Most frames don't become keyframes. Memory stays bounded.
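The triggers above amount to a small decision function. A hypothetical policy with invented thresholds (the real ones are tuned per sensor and resolution):

```python
# Keyframe decision sketch: insert a keyframe when tracking degrades,
# enough time has passed, or the camera has moved enough.
def need_keyframe(frames_since_kf, parallax_px, tracked_ratio):
    if tracked_ratio < 0.25:      # tracking quality dropped
        return True
    if frames_since_kf > 30:      # enough time elapsed (~1 s at 30 Hz)
        return True
    if parallax_px > 40:          # sufficient parallax
        return True
    return False

print(need_keyframe(5, 10, 0.9))    # False: recent keyframe, little motion
print(need_keyframe(5, 10, 0.1))    # True: tracking degraded
```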
Local Mapping (back-end-fast)
When a new keyframe is added:
- Triangulate new 3D map points from feature matches across the new + neighboring keyframes.
- Update existing map points — refine 3D positions using all observations.
- Cull bad keyframes (redundant ones with too few unique observations).
- Run local bundle adjustment: optimize the new keyframe pose + map points + neighboring keyframes (a window of ~20). Sparse non-linear least squares; converges in tens of milliseconds.
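Bundle adjustment is sparse non-linear least squares under the hood. A one-unknown Gauss-Newton toy shows the shape of each iteration (the measurement model and baselines here are invented; real BA has thousands of coupled unknowns):

```python
import numpy as np

# Gauss-Newton on a scalar unknown d (a landmark's depth):
# minimize sum_i (z_i - f(d))^2.
def gauss_newton(f, jac, z, d0, iters=10):
    d = d0
    for _ in range(iters):
        r = z - f(d)               # residuals
        J = jac(d)                 # Jacobian of f w.r.t. d
        d = d + (J @ r) / (J @ J)  # normal-equations step
    return d

# Hypothetical model: two views observe b_i / d for baselines b_i
b = np.array([0.1, 0.2])
f = lambda d: b / d
jac = lambda d: -b / d**2
z = f(2.0) + np.array([1e-3, -1e-3])     # noisy observations of a 2 m depth
print(round(gauss_newton(f, jac, z, d0=1.0), 2))   # 2.01, near the true 2.0
```

The real solver exploits the sparsity of the Jacobian (each point is seen by only a few keyframes), which is why a ~20-keyframe window converges in tens of milliseconds.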
Loop Closure (back-end-slow)
The killer feature. Detect when the camera revisits a previously-mapped place; correct accumulated drift across the whole trajectory.
Detection
- For each new keyframe, query a bag-of-words (DBoW2) database of all past keyframes.
- Find candidates with high BoW similarity.
- Verify each candidate geometrically: solve a relative transform; check inlier count.
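The retrieval step boils down to comparing word histograms. A toy version with invented word IDs (DBoW2 uses tf-idf-weighted vectors and an inverted index, not raw counts):

```python
from collections import Counter
from math import sqrt

# Bag-of-words place recognition sketch: describe each keyframe as a
# histogram over visual-word IDs, rank candidates by cosine similarity.
def bow(word_ids):
    return Counter(word_ids)

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

query = bow([3, 7, 7, 42, 19])            # current keyframe's words
db = {                                    # past keyframes (hypothetical)
    "kf_10": bow([3, 7, 42, 19, 19]),     # same place, revisited
    "kf_55": bow([1, 2, 88, 91, 5]),      # somewhere else entirely
}
best = max(db, key=lambda k: cosine(query, db[k]))
print(best)   # kf_10 scores highest -> loop candidate for geometric check
```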
Correction
- Once a loop is verified, run pose graph optimization: a factor graph over all keyframes' poses, with the loop closure as a constraint. Fast; updates poses globally.
- Then run full bundle adjustment: re-optimize all keyframes + all map points jointly. Slow (seconds for large maps); runs in a separate thread.
Watch the trajectory snap into place. The visual is satisfying.
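What the snap is doing can be caricatured in one dimension: odometry overshoots, the closure constraint pins the endpoint, and the error is spread over the whole chain. Real pose graphs optimize over SE(3) with proper covariances; this toy only shows the drift-spreading idea:

```python
# Toy pose-graph correction: drifting 1D odometry with a loop-closure
# constraint saying the last pose must coincide with a known revisit point.
poses = [0.0, 1.02, 2.05, 3.09, 4.12]   # drifted odometry; true length 4.0
loop_error = poses[-1] - 4.0            # closure residual
n = len(poses) - 1
corrected = [p - loop_error * i / n for i, p in enumerate(poses)]
print(corrected[-1])   # endpoint snaps to 4.0; earlier poses shift less
```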
Atlas: multi-map handling
One of ORB-SLAM3's innovations. When tracking is lost (lighting change, motion blur, going dark), the system creates a new map and starts mapping fresh. When it later detects a loop closure with an old map, it merges the two into a unified map.
This handles real-world recovery: occasional tracking loss doesn't restart from scratch. Each map shares the same coordinate system after merge; the trajectory stitches together.
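The merge itself is a change of coordinates: the verified loop closure yields a rigid transform from the new map's frame to the old one, and every pose in the new map is re-expressed through it. A 2D sketch with an invented alignment (the real system composes SE(3) transforms):

```python
import numpy as np

# Map-merging sketch: T_ab maps new-map coordinates into old-map coordinates.
def se2(theta, tx, ty):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, tx], [s, c, ty], [0., 0., 1.]])

T_ab = se2(np.pi / 2, 5.0, 0.0)        # hypothetical alignment from the loop
new_map_poses = [se2(0.0, 1.0, 0.0)]   # a pose at (1, 0) in the new map
merged = [T_ab @ T for T in new_map_poses]
print(np.round(merged[0][:2, 2], 6))   # its position becomes (5, 1) old-map
```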
Visual-Inertial mode (VI-ORB-SLAM)
Adding an IMU dramatically improves robustness:
- Pre-integration: integrate IMU readings between keyframes to produce a single pseudo-measurement.
- Add IMU factors to the bundle adjustment objective.
- The IMU constrains scale (monocular SLAM is otherwise scale-ambiguous).
- The IMU prevents tracking loss during fast motion or featureless scenes.
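Pre-integration is the step worth making concrete: sum the high-rate IMU samples between two keyframes into one relative-motion pseudo-measurement so the optimizer never touches raw 200 Hz data. A 1D, gravity-free, bias-free sketch (real pre-integration handles rotation, gravity, and bias Jacobians):

```python
# Sum accelerometer samples between keyframes into (delta-v, delta-p).
def preintegrate(accels, dt):
    dv, dp = 0.0, 0.0
    for a in accels:
        dp += dv * dt + 0.5 * a * dt * dt   # position increment
        dv += a * dt                        # velocity increment
    return dv, dp

accels = [2.0] * 100          # constant 2 m/s^2 for 0.5 s at 200 Hz
dv, dp = preintegrate(accels, dt=0.005)
print(dv, dp)                 # 1.0 m/s gained, 0.25 m traveled
```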
VI-mode is what most production drone autonomy uses. ORB-SLAM3 was the first open-source system to do it cleanly.
Stereo and RGB-D
Stereo: triangulate 3D points immediately from each frame's left+right images. No scale ambiguity; better depth at close range.
RGB-D: similar but the depth comes from the sensor (Kinect, RealSense). Works indoors only (typical RGB-D range is ~5 m); fast and accurate.
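For a rectified stereo pair, depth per pixel follows directly from disparity: Z = f·b/d. A quick check with a hypothetical rig, which also shows why stereo depth degrades with range (disparity shrinks, so a fixed matching error costs more depth):

```python
# Depth from disparity for a rectified stereo pair: Z = f * b / d.
def stereo_depth(f_px, baseline_m, disparity_px):
    return f_px * baseline_m / disparity_px

# Hypothetical rig: 500 px focal length, 12 cm baseline
print(stereo_depth(500.0, 0.12, 60.0))   # ~1.0 m for a 60 px disparity
print(stereo_depth(500.0, 0.12, 6.0))    # ~10.0 m for 6 px: far = noisy
```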
What ORB-SLAM3 isn't great at
- Dense reconstruction: produces a sparse map of feature points, not a mesh. Use TSDF or Gaussian splatting on top if you need surfaces.
- Long-term mapping: works well within a session. Re-localizing after months in a changing environment is a different problem.
- Featureless scenes: white walls, fog, water. ORB features fail; tracking gets lost.
- Very fast motion: motion blur kills feature extraction. IMU helps but only so much.
Modern alternatives
- VINS-Fusion: HKUST's visual-inertial SLAM. Less mature multi-map but high-quality VI.
- OpenVSLAM / Stella-VSLAM: ORB-SLAM-style; community fork with cleaner API.
- DROID-SLAM: deep-learning replacement of feature matching with learned correspondence. Better in low-texture; slower.
- NICE-SLAM, Gaussian-SLAM: neural-field SLAM. Dense reconstruction; research-grade speed.
- NVIDIA Isaac VSLAM: GPU-accelerated; production-targeted.
Running it
git clone https://github.com/UZ-SLAMLab/ORB_SLAM3
cd ORB_SLAM3
./build.sh
./Examples/Stereo/stereo_euroc \
Vocabulary/ORBvoc.txt Examples/Stereo/EuRoC.yaml \
/path/to/euroc/V1_01_easy
Runs on the EuRoC drone dataset. Watch the trajectory + sparse map build live. The first time the loop-closure correction snaps the trajectory back into place is when the field's progress feels real.
The takeaways
- SLAM systems are pipelines, not algorithms. The architecture matters as much as any single piece of math.
- Feature-based methods (ORB) are still production-standard. Deep features are the future for hard scenes; not yet faster on commodity hardware.
- Loop closure is what makes SLAM more than odometry. Every paper's "drift correction" video is loop closure firing.
- Visual-inertial is dramatically more robust than visual-only. Add an IMU when you can.
Exercise
Run ORB-SLAM3 on the TUM RGB-D dataset (free download). Watch the four threads work simultaneously. Identify the moments when loop closure fires. Then run the same dataset with monocular vs stereo input — observe how stereo's known scale eliminates drift in absolute size.
Next
LiDAR SLAM with LOAM and its descendants — the lineage that powers most modern autonomous vehicles.