
Sensor fusion: visual-inertial odometry

Combining a camera with an IMU for pose estimation that works where GPS fails. Why VIO is its own subfield, and the production-grade systems that fly drones indoors.

by RobotForge
#perception #vio #sensor-fusion

A camera tells you where you are relative to features in the world; an IMU tells you how you're moving. Each has weaknesses the other covers. Combine them tightly and you get visual-inertial odometry — pose estimation that's accurate, GPS-free, and robust enough to fly drones through buildings. VIO is the perception backbone of modern AR headsets and indoor drones, and, fused with LiDAR, of many autonomous vehicles.

Why combining beats either alone

Sensor | Strength                                                   | Weakness
Camera | Absolute pose from features; metrically precise            | Slow (~30 Hz); fails in featureless / blurred scenes; monocular has scale ambiguity
IMU    | Fast (1 kHz); never blanks out; metric motion measurement  | Drifts (gyro and accelerometer biases); only relative motion

The complementary failure modes are why fusion works. Camera anchors absolute position; IMU bridges between camera frames and handles fast motion.

The two fusion architectures

Loosely-coupled

Camera estimates pose; IMU estimates pose; an EKF combines them as two independent measurements. Easy to implement, modest accuracy.

Tightly-coupled

Fuse at the measurement level: camera features and IMU pre-integration go into a single optimization. Significantly more accurate; production-grade.

For VIO specifically, tightly-coupled has won. All modern systems (VINS-Fusion, ORB-SLAM3-VI, OpenVINS) are tightly-coupled.
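
To make the loosely-coupled idea concrete, here is a minimal sketch (not any library's code) of what the fusion step boils down to: two independent position estimates with covariances, combined by an information-weighted average — the per-update core of a loosely-coupled EKF. All names are illustrative.

    import numpy as np

    def fuse_loosely(p_cam, P_cam, p_imu, P_imu):
        """Combine two independent position estimates.

        p_cam, p_imu : 3-vectors from the camera and IMU pipelines
        P_cam, P_imu : 3x3 covariances of those estimates
        Returns the information-weighted mean and its covariance.
        """
        info_cam = np.linalg.inv(P_cam)
        info_imu = np.linalg.inv(P_imu)
        P_fused = np.linalg.inv(info_cam + info_imu)
        p_fused = P_fused @ (info_cam @ p_cam + info_imu @ p_imu)
        return p_fused, P_fused

    # Toy usage: the camera estimate is tight, the IMU estimate has drifted.
    p, P = fuse_loosely(np.array([1.0, 2.0, 0.5]), np.eye(3) * 0.01,
                        np.array([1.1, 2.2, 0.4]), np.eye(3) * 0.25)

Tightly-coupled systems skip this step entirely: raw feature observations and pre-integrated IMU factors enter a single optimization, which is why they hold up better when one sensor degrades.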

IMU pre-integration: the key trick

An IMU at 1 kHz between two camera frames at 30 Hz produces ~33 measurements. Naively, the optimizer needs to integrate IMU readings every iteration — expensive.

Pre-integration (Forster et al., 2017): integrate the IMU readings once, producing a single relative-motion factor. The factor encodes "between camera frame i and frame i+1, the body rotated by ΔR, changed velocity by Δv, and translated by Δp, with covariance Σ." Plug it into the bundle adjustment as a single residual.

Efficiency: the optimizer doesn't see individual IMU readings; just the pre-integrated factor. Repeated optimization is cheap.
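
A minimal sketch of the pre-integration recursion, following the form in Forster et al.; covariance propagation and bias Jacobians are omitted, and the names are illustrative rather than any library's API.

    import numpy as np

    def exp_so3(phi):
        """Rodrigues' formula: axis-angle vector -> rotation matrix."""
        theta = np.linalg.norm(phi)
        if theta < 1e-9:
            return np.eye(3)
        k = phi / theta
        K = np.array([[0, -k[2], k[1]],
                      [k[2], 0, -k[0]],
                      [-k[1], k[0], 0]])
        return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

    def preintegrate(gyro, accel, dt, b_g, b_a):
        """Integrate the IMU samples between two camera frames once.

        gyro, accel : (N, 3) raw IMU readings between frame i and i+1
        dt          : sample period (e.g. 1e-3 s at 1 kHz)
        b_g, b_a    : current bias estimates
        Returns the relative-motion factor (dR, dv, dp) expressed in the
        frame of keyframe i.  Covariance propagation omitted for brevity.
        """
        dR, dv, dp = np.eye(3), np.zeros(3), np.zeros(3)
        for w, a in zip(gyro, accel):
            a_corr = a - b_a
            dp = dp + dv * dt + 0.5 * (dR @ a_corr) * dt**2
            dv = dv + (dR @ a_corr) * dt
            dR = dR @ exp_so3((w - b_g) * dt)
        return dR, dv, dp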

The VIO state vector

For each camera keyframe, estimate:

  • Position p — 3D world frame.
  • Velocity v — 3D world frame.
  • Orientation R — body to world.
  • IMU biases b_g, b_a — gyro and accelerometer biases (slowly time-varying).

Plus 3D positions of tracked features. Optimize all jointly via factor graph.
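
As a sketch, the per-keyframe block of that state vector might be laid out like this (an illustrative structure, not the layout of any particular system):

    import numpy as np
    from dataclasses import dataclass

    @dataclass
    class KeyframeState:
        """Per-keyframe VIO state (illustrative, not any library's API)."""
        p: np.ndarray    # position, world frame, shape (3,)
        v: np.ndarray    # velocity, world frame, shape (3,)
        R: np.ndarray    # orientation, body-to-world rotation matrix (3, 3)
        b_g: np.ndarray  # gyro bias, slowly time-varying
        b_a: np.ndarray  # accelerometer bias, slowly time-varying

    # The full problem also estimates the 3D position of each tracked feature.
    # In the factor graph, reprojection factors tie features to keyframes and
    # pre-integrated IMU factors tie consecutive keyframes together.
    landmarks: dict = {}  # feature id -> np.ndarray (3,), world-frame point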

Initialization: the hard part

VIO requires initialization to set scale, gravity direction, and IMU biases. Two patterns:

  • Known motion: ask the user to move the device in a specific way (figure-8, side-to-side). Common in AR onboarding flows (ARKit / ARCore apps often prompt the user to move the phone).
  • Automatic: detect motion variance; once enough parallax and IMU motion has accumulated, solve a linear system for scale and gravity. Used in VINS-Fusion.

Initialization can take seconds and may fail if motion is too smooth. Most production VIO systems show "calibrating..." for a few seconds at startup.
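
A toy version of the automatic pattern's readiness gate, assuming you already track average feature parallax between the oldest and newest frames in the window. The thresholds here are illustrative, not values from VINS-Fusion or any other system.

    import numpy as np

    def ready_to_initialize(accel_window, parallax_px,
                            min_accel_std=0.25, min_parallax=20.0):
        """Heuristic gate before running the scale / gravity solver.

        accel_window : (N, 3) recent accelerometer samples, m/s^2
        parallax_px  : average feature parallax across the window, pixels
        """
        enough_excitation = np.std(np.linalg.norm(accel_window, axis=1)) > min_accel_std
        enough_parallax = parallax_px > min_parallax
        return enough_excitation and enough_parallax

If either condition fails, the motion was too smooth (or the scene too flat) and the linear solve for scale and gravity is poorly conditioned — which is exactly when automatic initialization stalls.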

Production VIO systems

System                      | Strengths
VINS-Fusion (HKUST)         | Most popular open-source; mono / stereo; loop closure
ORB-SLAM3 VI mode           | Multi-map atlas; mature
OpenVINS                    | EKF-based (lighter compute); modular
ROVIO                       | Older; very fast; still good for fast motion
Apple ARKit / Google ARCore | Closed-source but production-grade; on every modern phone

For ROS-based robotics, VINS-Fusion is the typical entry point. Cleanly integrates with ROS topics; works on commodity hardware (RealSense + Pixhawk's onboard IMU).

Hardware requirements

For ~$300 you can build a working VIO platform:

  • Cameras: stereo pair (RealSense D435i has built-in IMU + stereo). Mono cameras work too but lack inherent scale.
  • IMU: 200+ Hz, well-calibrated. The D435i / T265 have integrated IMUs; consumer phones do too.
  • Synchronization: hardware-triggered exposure for cameras + matching IMU timestamps. Critical for accuracy.
  • Compute: Jetson Nano / Orin Nano runs VINS-Fusion at ~30 Hz.

Common pitfalls

  • Time sync: cameras and IMU must share a clock. Software-only sync introduces ~10 ms offsets that ruin pre-integration. Use hardware sync if possible; a sketch for estimating a residual constant offset follows this list.
  • Camera-IMU calibration: the rigid transform between camera and IMU must be known to ~1 mm / 0.1°. Use Kalibr to estimate.
  • Bias drift: the IMU's biases change over time and temperature. Production systems re-estimate them online.
  • Featureless scenes: long blank walls, fog, sky-only views. VIO degrades to dead reckoning. Add LiDAR or rely on IMU only briefly.
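
Following up on the time-sync pitfall: one common way to estimate a constant camera-IMU offset (similar in spirit to what calibration tools like Kalibr do) is to cross-correlate the rotation rate seen by the camera with the gyro signal. The sketch below assumes both signals have already been resampled to the same uniform rate; the names are illustrative.

    import numpy as np

    def estimate_time_offset(cam_rate, imu_rate, dt):
        """Estimate a constant camera-IMU time offset by cross-correlation.

        cam_rate : angular-rate magnitude derived from frame-to-frame rotations
        imu_rate : gyro magnitude, resampled to the same uniform rate
        dt       : sample period of both signals, seconds
        Returns the lag (seconds) that best aligns the two signals.
        """
        cam = cam_rate - cam_rate.mean()
        imu = imu_rate - imu_rate.mean()
        corr = np.correlate(cam, imu, mode="full")
        lag = np.argmax(corr) - (len(imu) - 1)   # lag in samples
        return lag * dt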

The phone-grade VIO bar

ARKit and ARCore are commercial-grade VIO running on consumer phones. Drift over a 30-second indoor walk is typically at the centimeter level. The combination of well-calibrated consumer MEMS IMUs, hardware-synchronized cameras, and heavily tuned software is hard to match with commodity ROS components.

For applications where you can use a phone or AR headset as the sensor, doing so often gives you better VIO than a custom build at less cost.

What VIO is not for

  • Outdoor large-scale: drift over kilometers becomes meters. Use VIO + GPS fusion.
  • Hard-RT control: VIO's optimization runs at ~30 Hz; not appropriate for 1 kHz control loops. Use IMU integration directly between VIO updates (a sketch follows this list).
  • Fast rotation without translation: pure rotation breaks monocular VIO scale recovery. Stereo VIO handles it; mono VIO drifts.
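
For the hard-real-time case, the usual pattern is to propagate the most recent VIO state forward with raw IMU readings until the next update arrives. A minimal sketch, assuming a z-up world frame and ignoring bias correction; the names are illustrative.

    import numpy as np
    from scipy.spatial.transform import Rotation as Rot

    GRAVITY = np.array([0.0, 0.0, -9.81])   # world-frame gravity, z-up assumed

    def propagate(p, v, R, accel, gyro, dt):
        """One step of IMU dead reckoning on top of the latest VIO estimate.

        Called at IMU rate (e.g. 1 kHz) between ~30 Hz VIO updates; each new
        VIO update overwrites (p, v, R), so this drift never accumulates long.
        """
        a_world = R @ accel + GRAVITY
        p = p + v * dt + 0.5 * a_world * dt**2
        v = v + a_world * dt
        R = R @ Rot.from_rotvec(gyro * dt).as_matrix()
        return p, v, R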

The VIO + LiDAR future

Modern autonomous vehicles fuse LiDAR + VIO into a single estimator. The combination handles every failure case:

  • VIO drifts → LiDAR provides absolute updates.
  • LiDAR fails in fog → VIO continues.
  • Both fail in tunnels → IMU + odometry briefly.

FAST-LIVO and CamLiO are recent open-source examples. Production AVs all do this internally.

Exercise

On a RealSense D435i, run VINS-Fusion in stereo + IMU mode. Walk around a feature-rich room while plotting the trajectory in RViz. After a 30-second loop, return to the start and observe the drift (typically ~1% of distance traveled). Try the same in a hallway with blank white walls and watch VIO degrade. The before/after comparison is what makes "feature dependence" concrete.
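
A small sketch for turning a logged trajectory into that drift figure, assuming you export positions as one "x y z" row per pose (the file name and format are assumptions):

    import numpy as np

    traj = np.loadtxt("trajectory.txt")            # (N, 3) positions, hypothetical file

    path_length = np.sum(np.linalg.norm(np.diff(traj, axis=0), axis=1))
    loop_error = np.linalg.norm(traj[-1] - traj[0])  # start and end should coincide

    print(f"distance traveled: {path_length:.2f} m")
    print(f"loop-closure drift: {loop_error:.3f} m "
          f"({100.0 * loop_error / path_length:.2f}% of distance)")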

Next

Deep learning for robot perception — what's different about training and deploying CV models for robotics, where the cameras don't always cooperate.
