Classical CV: features, descriptors, matching
Before deep learning, computer vision rested on three primitives: corner detectors, descriptors, and matchers. They still power much of modern SLAM and visual odometry.
A surprising fact about 2026 robotics: most production SLAM stacks (ORB-SLAM3, visual localization in ROS Nav2, most commercial AR systems) still rely on hand-engineered features whose lineage runs from Harris (1988) through SIFT (1999) to ORB (2011). Deep features (SuperPoint, R2D2) are gaining ground, but classical features remain the workhorse. Here's why, and the three primitives that hold it all together.
The three primitives
- Detector: find "interesting" points in an image. Corners, blobs, edges.
- Descriptor: encode each point's local neighborhood as a vector that's stable across viewpoint and lighting.
- Matcher: pair points across two images by descriptor similarity.
Pipeline: detect features in image A, detect in image B, describe both, match. Output: a list of correspondences (p_A, p_B). Everything downstream — visual odometry, SfM, SLAM, image stitching — runs on top.
What makes a good feature
- Repeatable: detected at the same physical point under viewpoint change, scale change, illumination change.
- Distinctive: the descriptor for one feature is far from descriptors for others. Avoids ambiguous matches.
- Localized: precisely positioned. Sub-pixel accuracy matters for 3D reconstruction.
Corners satisfy all three. Edges are repeatable but not distinctive (every point along an edge looks the same), and they can't be localized along the edge direction: the aperture problem. Smooth regions fail all three; there's nothing to detect, distinguish, or localize. So the field zeroed in on corner-like detectors.
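To see the idea in code: the Harris response is large only where the gradient varies in two directions. A minimal sketch using OpenCV's cv2.cornerHarris (the filename and the 1% threshold are illustrative choices, not canon):
import cv2
import numpy as np

img = cv2.imread('a.jpg', cv2.IMREAD_GRAYSCALE)
# Harris response: high at corners, near zero on edges and flat regions
response = cv2.cornerHarris(np.float32(img), 2, 3, 0.04)  # blockSize, ksize, k
corners = np.argwhere(response > 0.01 * response.max())   # keep strong responses
print(f'{len(corners)} corner candidates')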
Detectors: what's in OpenCV in 2026
| Detector | Speed | Repeatability | Use case |
|---|---|---|---|
| FAST | very fast (1000s of fps) | low | Real-time tracking (detector only) |
| Harris | moderate | moderate | Educational; legacy code |
| SIFT | slow | excellent | Highest-quality matching |
| ORB | fast | good | SLAM, real-time mapping |
| AKAZE | moderate | excellent | High-quality, modern alternative to SIFT |
ORB is the default in modern robotics: fast, rotation-invariant, scale-invariant via an image pyramid, free of patent encumbrance, and its binary descriptor matches with a single XOR-plus-popcount per comparison, far cheaper than L2 on float vectors.
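For reference, here's how the table rows map to OpenCV constructors (Harris is exposed separately via cv2.cornerHarris rather than through the common detector interface):
import cv2

img = cv2.imread('a.jpg', cv2.IMREAD_GRAYSCALE)
fast = cv2.FastFeatureDetector_create()    # detector only, no descriptor
sift = cv2.SIFT_create()                   # in the main build since OpenCV 4.4
orb = cv2.ORB_create(nfeatures=2000)
akaze = cv2.AKAZE_create()
kp = fast.detect(img, None)                # FAST: keypoints only
kp, des = orb.detectAndCompute(img, None)  # ORB/SIFT/AKAZE: keypoints + descriptors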
The OpenCV pipeline
import cv2

# Load both views as grayscale; feature detectors work on single-channel images
img1 = cv2.imread('a.jpg', cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread('b.jpg', cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=2000)
# Detect keypoints and compute 256-bit binary descriptors in one call
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Brute-force matcher with Hamming distance (binary descriptors);
# crossCheck keeps only mutual nearest neighbors
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = matcher.match(des1, des2)
matches = sorted(matches, key=lambda m: m.distance)[:50]  # keep the 50 strongest

draw = cv2.drawMatches(img1, kp1, img2, kp2, matches, None)
cv2.imwrite('matches.png', draw)
Two detectAndCompute calls and one match call: that's the heart of every classical CV pipeline.
The descriptor: turning a patch into a vector
Different detectors pair with different descriptors:
- SIFT descriptor: 128-dim float vector summarizing gradient orientations in a 16×16 patch. Compared by L2 distance.
- ORB descriptor: 256-bit binary string from BRIEF-style intensity comparisons. Compared by Hamming distance — bitwise XOR + popcount, much faster than L2.
- SURF: a faster SIFT-like descriptor. Patent-encumbered for years (it lives in the non-free opencv-contrib module), so it's less common in open source.
The "magic" of these descriptors is their stability under transformation. Rotate the image 30°? The descriptor barely changes. Add lighting? Mostly invariant. Add a different camera with different gamma? Still mostly invariant. They were designed by hand to have these properties.
Matching: the gotcha
A naive nearest-neighbor matcher creates many false matches — every feature has some closest descriptor in the other image, even if it's not a real correspondence. Two filters:
Lowe's ratio test
For each feature in image A, find the two closest matches in image B. Accept the match only if the closest is significantly closer than the second-closest:
matcher = cv2.BFMatcher(cv2.NORM_HAMMING)  # no crossCheck here: knnMatch needs both neighbors
matches = matcher.knnMatch(des1, des2, k=2)
# Keep a match only if it clearly beats the runner-up
good = [pair[0] for pair in matches
        if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance]
The 0.75 threshold is the common OpenCV default; Lowe's original paper used 0.8. Either way, the value was found empirically and has stuck for two decades.
RANSAC for geometric consistency
Even after the ratio test, some matches are wrong. RANSAC fits a geometric model (homography for planar scenes, fundamental matrix for general 3D) and keeps only the inliers — matches consistent with the model.
import numpy as np

pts1 = np.float32([kp1[m.queryIdx].pt for m in good])  # points in image A
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])  # corresponding points in B
# Fit a homography with RANSAC; 5.0 px reprojection threshold for inliers
H, mask = cv2.findHomography(pts1, pts2, cv2.RANSAC, 5.0)
inlier_matches = [m for m, keep in zip(good, mask.ravel()) if keep]
After RANSAC, what's left is the set of geometrically consistent correspondences. Now you can compute camera motion, build a 3D map, or stitch a panorama.
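If the scene isn't planar, the same RANSAC machinery fits a fundamental matrix instead; a sketch under the same assumptions as the snippet above (the 3.0 px epipolar threshold and 0.99 confidence are typical choices):
# Fundamental matrix for general 3D scenes
F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 3.0, 0.99)
inliers = [m for m, keep in zip(good, mask.ravel()) if keep]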
Where classical features still win in 2026
- Compute-constrained platforms. ORB on a Raspberry Pi 4 runs 30+ fps. SuperPoint needs a GPU.
- Established stacks. ORB-SLAM3 is open source and battle-tested. Replacing it with deep features is a research project, not a swap-in.
- Robustness analysis. Classical features have well-characterized failure modes; deep features can surprise you on out-of-distribution inputs.
- Patents and licensing. ORB is BSD-clean; SIFT's patent expired in 2020; deep alternatives often come with research-only licenses.
Where deep features are taking over
- Wide-baseline matching. Two photos of the same building from very different angles. Classical matching degrades sharply; SuperPoint+SuperGlue recovers far more correspondences.
- Low-texture scenes. White walls, textureless floors. Classical detectors find nothing; learned features find subtle patterns.
- Long-term mapping. Day vs night, summer vs winter. Learned descriptors generalize better than hand-crafted ones.
The 2025–26 trend: hybrid stacks. Use ORB or AKAZE for the real-time front-end (60+ Hz), use SuperPoint or DISK for relocalization and loop closure (1 Hz, GPU). Best of both.
Exercise
Take two photos of the same scene from different positions. Run the pipeline above. Visualize the matches before and after the ratio test, and again after RANSAC. Watch how each filter strips out wrong matches. The intuition you build for "what's a good match" is the foundation for SLAM and visual odometry.
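One way to wire up that visualization, assuming matches is the flat list from the first snippet, good comes from the ratio test, and inlier_matches from the RANSAC step:
for name, ms in [('raw', matches), ('ratio', good), ('ransac', inlier_matches)]:
    vis = cv2.drawMatches(img1, kp1, img2, kp2, ms, None)
    cv2.imwrite(f'matches_{name}.png', vis)
    print(f'{name}: {len(ms)} matches')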
Next
Optical flow and structure from motion — what to do once you have correspondences and want to recover camera motion + 3D points.