Optical flow and structure from motion
How pixels move when the camera moves — and how to recover 3D from a moving monocular camera. The geometry that powers visual odometry, SLAM, and every drone that reconstructs a map from a single camera.
If you have feature matches between two camera frames, you can recover the camera's motion and the 3D structure of the scene — from one moving camera. This is the geometric heart of visual SLAM, every drone "fly-through" reconstruction, and most AR. Two ideas: optical flow (where pixels go) and structure from motion (the 3D that explains the flow).
Optical flow — pixels in motion
Given two consecutive frames, optical flow is a vector field: for each pixel in frame 1, where did that piece of scene end up in frame 2?
Two regimes:
- Sparse optical flow: track a few hundred features. Fast, robust, used by visual odometry. Lucas-Kanade tracker is the canonical algorithm.
- Dense optical flow: a flow vector for every pixel. Slower, used by video stabilization, segmentation, deep-learning models. Farneback (classical) or RAFT (deep) are common.
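The sparse case gets a full worked example below. The dense case is a single OpenCV call; here is a minimal sketch, with parameter values taken from the standard OpenCV tutorial rather than tuned for any particular footage:
import cv2
prev = cv2.imread('frame1.png', cv2.IMREAD_GRAYSCALE)
curr = cv2.imread('frame2.png', cv2.IMREAD_GRAYSCALE)
# Arguments after the images: pyramid scale, levels, window size, iterations,
# polynomial neighborhood, polynomial sigma, flags
flow = cv2.calcOpticalFlowFarneback(prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0)
# flow[y, x] = (dx, dy): where the pixel at (x, y) in frame 1 moved to in frame 2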
Lucas-Kanade in twenty lines of code
Assume small motion and constant pixel intensity (the "brightness constancy" equation):
I(x + Δx, y + Δy, t + Δt) ≈ I(x, y, t)
Taylor expand:
I_x · Δx + I_y · Δy + I_t · Δt ≈ 0
One equation, two unknowns (Δx, Δy). Solve by assuming pixels in a small window all share the same motion — that gives many equations, two unknowns, solve by least-squares. Done. OpenCV's cv2.calcOpticalFlowPyrLK wraps this with a multi-scale pyramid for handling larger motions.
import cv2
import numpy as np

prev = cv2.imread('frame1.png', cv2.IMREAD_GRAYSCALE)
curr = cv2.imread('frame2.png', cv2.IMREAD_GRAYSCALE)

# Pick strong corners in the first frame, then track them into the second
p0 = cv2.goodFeaturesToTrack(prev, maxCorners=200, qualityLevel=0.01, minDistance=10)
p1, status, err = cv2.calcOpticalFlowPyrLK(prev, curr, p0, None)

# Keep only the points that were tracked successfully
good_old = p0[status == 1]
good_new = p1[status == 1]

# Draw the flow vectors on a color copy of the current frame
vis = cv2.cvtColor(curr, cv2.COLOR_GRAY2BGR)
for o, n in zip(good_old, good_new):
    cv2.line(vis, tuple(map(int, o)), tuple(map(int, n)), (0, 255, 0), 2)
Twenty lines for a working sparse tracker. Most visual odometry pipelines have this kind of code at their core.
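If you want to see the windowed least-squares itself rather than call a library, here is the core step in raw NumPy: a minimal sketch with no pyramid and no iteration, assuming you have already computed the gradient images (for example with cv2.Sobel) and the frame difference.
import numpy as np

def lk_step(Ix, Iy, It, x, y, win=15):
    # Gradients of frame 1 (Ix, Iy) and the frame difference It = frame2 - frame1,
    # all cropped to a win x win window around pixel (x, y)
    r = win // 2
    wx = Ix[y - r:y + r + 1, x - r:x + r + 1].ravel()
    wy = Iy[y - r:y + r + 1, x - r:x + r + 1].ravel()
    wt = It[y - r:y + r + 1, x - r:x + r + 1].ravel()
    # One brightness-constancy equation per window pixel: Ix*dx + Iy*dy = -It
    A = np.stack([wx, wy], axis=1)
    b = -wt
    # Least-squares solution of the over-determined system
    (dx, dy), *_ = np.linalg.lstsq(A, b, rcond=None)
    return dx, dy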
From flow to camera motion
Once you have correspondences (point in frame 1 ↔ point in frame 2), you can recover the camera motion. The math:
- For a calibrated camera, two views are related by the essential matrix E. x_2^T E x_1 = 0 for every correspondence.
- E is a 3×3 matrix with 5 degrees of freedom (rotation R + translation direction t̂ — note: monocular cameras can't recover translation magnitude from two views alone).
- From 5 correspondences, you can solve for E (the "five-point algorithm"). With more correspondences, RANSAC + least-squares.
- From E, decompose into (R, t) — there are four candidate solutions; pick the one with positive depth.
OpenCV does the whole thing:
# Camera intrinsics (fx, fy, cx, cy) come from calibration
K = np.array([[fx, 0, cx], [0, fy, cy], [0, 0, 1]])
# RANSAC so that the remaining bad matches don't wreck the estimate
E, mask = cv2.findEssentialMat(good_old, good_new, K, method=cv2.RANSAC, threshold=1.0)
# Decompose E and keep the (R, t) that puts the points in front of both cameras
_, R, t, _ = cv2.recoverPose(E, good_old, good_new, K)
print('Rotation:', R, 'Translation direction:', t)
You now have the camera's motion between the two frames, up to a scale ambiguity in t: in monocular video, a nearby scene with a small camera motion produces exactly the same images as a scene twice as far away with twice the motion.
Triangulation — recovering 3D points
Given the camera motion (R, t) and two pixel correspondences, recover the 3D point X that produced them. This is triangulation: find X that minimizes reprojection error in both views.
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))]) # first cam at origin
P2 = K @ np.hstack([R, t]) # second cam relative
X_homog = cv2.triangulatePoints(P1, P2, good_old.T, good_new.T)
X = (X_homog[:3] / X_homog[3]).T # convert from homogeneous
That's a 3D point cloud, correct up to the same unknown scale factor as t. To resolve the scale you need extra information: an IMU, a stereo baseline of known length, or a scene constraint such as a known object size.
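For the object-size route, the fix is one multiplication. A minimal sketch, where idx_a and idx_b are hypothetical indices of two triangulated points whose real-world separation known_width_m you have measured:
# Reconstruction-units distance between the two known points
est_width = np.linalg.norm(X[idx_a] - X[idx_b])
scale = known_width_m / est_width
X_metric = X * scale   # point cloud now in metric units
t_metric = t * scale   # the camera translation picks up the same factor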
Bundle adjustment — the polish
Run the above on N frames in sequence and you get a chain of camera poses + 3D points. But errors compound — each new pose has small uncertainty added to the previous. The standard fix: bundle adjustment — joint nonlinear optimization of all camera poses and all 3D points to minimize the total reprojection error.
This is the workhorse of visual SLAM. Hundreds of poses, thousands of points, optimized at every keyframe. Tools: g2o, Ceres, GTSAM. Modern SLAM systems (ORB-SLAM3, COLMAP) are essentially bundle adjustment with smart bookkeeping.
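To make the structure of that optimization concrete, here is a toy dense version with SciPy. This is a sketch only, with made-up array names: observation i says camera cam_idx[i] saw point pt_idx[i] at pixel obs_2d[i]. Real systems exploit the sparsity of the Jacobian (each residual touches exactly one pose and one point), which is why g2o and Ceres exist.
import numpy as np
import cv2
from scipy.optimize import least_squares

def reprojection_residuals(params, n_cams, n_pts, cam_idx, pt_idx, obs_2d, K):
    # params stacks every pose (rotation vector + translation, 6 numbers each)
    # followed by every 3D point (3 numbers each)
    poses = params[:n_cams * 6].reshape(n_cams, 6)
    pts3d = params[n_cams * 6:].reshape(n_pts, 3)
    res = []
    for i in range(len(obs_2d)):
        rvec, tvec = poses[cam_idx[i], :3], poses[cam_idx[i], 3:]
        proj, _ = cv2.projectPoints(pts3d[pt_idx[i]].reshape(1, 1, 3), rvec, tvec, K, None)
        res.append(proj.ravel() - obs_2d[i])  # 2D reprojection error of observation i
    return np.concatenate(res)

# x0 stacks the chained pose estimates and the triangulated points from above.
# result = least_squares(reprojection_residuals, x0, method='trf',
#                        args=(n_cams, n_pts, cam_idx, pt_idx, obs_2d, K))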
Structure from motion: the offline cousin
SfM is what you do when you have a fixed set of images and want a high-quality 3D reconstruction:
- Detect features in every image (SIFT or learned features).
- Match features across all image pairs.
- Estimate pairwise camera motions; chain them into an initial trajectory.
- Triangulate 3D points.
- Bundle-adjust everything together.
- Refine with multi-view stereo for dense reconstruction.
Tools: COLMAP (academic standard), Meshroom (open-source), OpenMVG. Inputs: a few hundred photos around an object. Outputs: a dense 3D mesh.
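If you prefer to script the pipeline instead of driving a GUI, COLMAP ships Python bindings. A minimal sketch, assuming pycolmap is installed; the exact argument names have shifted between pycolmap releases, so treat this as the shape of the calls rather than gospel:
import pycolmap

# Placeholder paths: a folder of photos in, a COLMAP workspace out
image_dir, database_path, output_dir = 'photos/', 'work/database.db', 'work/sparse/'

pycolmap.extract_features(database_path, image_dir)    # SIFT features for every image
pycolmap.match_exhaustive(database_path)               # match all image pairs
maps = pycolmap.incremental_mapping(database_path, image_dir, output_dir)
print(maps[0].summary())                               # camera poses + sparse point count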
The 2024–26 deep-learning shift
- Learned features (SuperPoint, R2D2, DISK) — better matching across viewpoint and lighting changes.
- Learned matchers (SuperGlue, LightGlue) — replace brute-force matching with attention-based networks. Massive improvement in wide-baseline scenes.
- Learned depth (MiDaS, DPT, Depth Anything) — single-image depth estimation. Scaffold initialization for monocular SLAM.
- Gaussian splatting — represents scenes as 3D Gaussians; optimization differentiates through rendering. Used by NeRFstudio, recent SLAM systems (SplaTAM, MonoGS).
Classical SfM still wins on robustness and explainability. Hybrid systems (classical front-end, learned matcher, classical bundle adjustment, learned depth refinement) are now common in production.
Common gotchas
- Pure rotation breaks SfM. Two views with the same camera position can't triangulate — there's no parallax. Most SfM systems detect this and refuse to add the frame.
- Lens distortion. Always undistort first. Even small distortion ruins triangulation.
- Scale drift. Monocular SLAM accumulates scale error over time. Loop closure or stereo or IMU fixes it.
- Featureless scenes. White walls, blue sky — no features to track. Visual odometry fails. Either add IMU fusion or use learned features.
Why this still matters in 2026
Every drone with a camera, every AR headset, every mobile robot doing visual localization runs some version of this pipeline. Deep learning has improved many components but the geometric backbone is intact. Understanding it lets you debug when the modern stack breaks.
Exercise
Walk around an object in your house with your phone, take 30 photos. Run COLMAP on them — outputs a 3D reconstruction. Open the result in MeshLab or COLMAP's viewer. The whole pipeline above ran on your photos. Try sparser captures (10 photos) and watch where it fails.
Next
Stereo and depth — the alternative to SfM when you have two cameras at known offsets, getting depth in real time without needing to solve for motion.