Depth: stereo and RGB-D sensors
Three ways to get depth: stereo triangulation, structured light, time-of-flight. What each measures, what each fails on, and when to pick which for your robot.
Monocular cameras tell you direction; stereo and depth cameras tell you distance. For grasping, mapping, collision avoidance, you need depth. Three sensor families dominate in 2026 — each with a different sweet spot, each with a different way to fail. Pick the wrong one and your robot reaches into walls or misses cups it should pick up.
Stereo: triangulation from two cameras
Two cameras separated by a known baseline. The same point appears at slightly different positions in the two images. The horizontal offset (disparity) is inversely proportional to depth.
depth = (focal_length × baseline) / disparity
Compute disparity with a stereo matching algorithm (block matching, semi-global matching, or learned methods like RAFT-Stereo).
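A minimal sketch of that pipeline with OpenCV's semi-global matcher, assuming a rectified image pair and a calibrated focal length and baseline (the file names and numbers below are placeholders, not from any specific camera):

```python
import cv2
import numpy as np

# Minimal stereo depth sketch: assumes rectified left/right images, focal length
# in pixels (fx_px) and baseline in meters (baseline_m) from your calibration.
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)    # hypothetical file names
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

matcher = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=128,          # must be divisible by 16
    blockSize=5,
    P1=8 * 5 * 5,                # standard SGBM smoothness penalties for blockSize=5
    P2=32 * 5 * 5,
)
# SGBM returns fixed-point disparity scaled by 16
disparity = matcher.compute(left, right).astype(np.float32) / 16.0

fx_px, baseline_m = 600.0, 0.05  # placeholder intrinsics; use your calibration values
valid = disparity > 0
depth_m = np.zeros_like(disparity)
depth_m[valid] = fx_px * baseline_m / disparity[valid]   # depth = f * B / d
```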
Strengths: passive (no active illumination); works outdoors and in direct sunlight; range limited only by baseline and image resolution (depth error grows roughly with the square of distance).
Weaknesses: requires texture (uniform surfaces have no features to match); compute cost (per-pixel matching is heavy); calibration sensitive (sub-millimeter baseline drift produces meters of depth error at distance).
Hardware: Intel RealSense D435/D455 (~$300), ZED 2 / X (~$500), or build your own with two synchronized cameras.
Structured light: project a known pattern
Project an IR pattern of dots or lines onto the scene; observe the deformation with an IR camera. Distortions encode depth.
Strengths: works on textureless surfaces (the projector adds texture). High resolution (per-pixel depth at the camera's native resolution). Excellent accuracy indoors.
Weaknesses: indoor only (sunlight overwhelms the projector). Limited range (~5 m typical). Power-hungry projector.
Hardware: Microsoft Kinect v1, Intel RealSense SR300/SR305. Used widely in older indoor robots; less common in 2026 as ToF caught up.
Time-of-flight (ToF)
Pulse an IR beam; measure how long the reflection takes. Speed of light × half round-trip time = distance. Each pixel has its own ToF receiver.
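The timing arithmetic itself is one line; a worked example with illustrative numbers:

```python
C = 299_792_458.0  # speed of light, m/s

def tof_distance_m(round_trip_s):
    # Distance is half the round-trip path the pulse travels.
    return C * round_trip_s / 2.0

print(tof_distance_m(10e-9))   # a 10 ns round trip puts the surface ~1.5 m away
```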
Strengths: works on textureless surfaces. Compact (no baseline needed). Reasonable range (10–40 m for some devices).
Weaknesses: sunlight interferes. Multipath errors (light reflects off two surfaces before returning, confuses the timing). Lower spatial resolution than stereo.
Hardware: Azure Kinect, iPhone Pro LiDAR, RealSense L515 (discontinued; successors fill the same niche). Used in many 2026 robots — Stretch, HSR.
The four-question decision tree
| Question | If yes… |
|---|---|
| Outdoors? | Stereo only |
| Textureless walls? | ToF or structured light |
| Long range (5+ m)? | Stereo or LiDAR |
| Tight compute budget? | ToF (depth comes straight off the sensor) |
Common failure modes
- Glass / mirrors: stereo struggles (each camera sees a different reflection); ToF shoots through glass and reports the surface behind it; LiDAR ghost-reflects off mirrors. No good answer; filter aggressively.
- Black surfaces: low IR reflection. Both structured light and ToF struggle. Stereo works if there's any texture.
- Sun: saturates structured light and ToF outdoors. Use stereo with an outdoor-grade IR filter, or LiDAR.
- Smoke / fog: particles scatter light, so cameras lose contrast and LiDAR returns spurious points off the particles themselves. Robotic firefighters need radar or thermal.
Depth post-processing
Raw depth from any sensor is noisy and has holes. Standard pipeline:
- Filter: median filter, edge-preserving filter, or learned depth refinement (DPT, Depth Anything as a refiner).
- Hole-filling: extend depth across small holes via interpolation.
- Temporal smoothing: average depth over several frames; cuts noise but adds latency.
- Outlier rejection: depth pixels far from their neighbors are usually wrong.
OpenCV's rgbd module (opencv_contrib) covers most of these; the RealSense SDK adds its own proprietary filters. Worth a tuning pass on the filter parameters before shipping.
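A minimal sketch of that pipeline in NumPy/OpenCV, assuming a float32 depth image in meters with zeros where the sensor returned nothing (the function name and thresholds are illustrative, not from any SDK):

```python
import numpy as np
import cv2

def clean_depth(depth_m, max_depth=5.0):
    """Basic depth clean-up: clamp, median filter, reject outliers, fill small holes.

    depth_m: float32 HxW depth image in meters, 0 where the sensor has no return.
    Thresholds are placeholders; tune them per sensor.
    """
    d = depth_m.astype(np.float32)
    d[(d <= 0) | (d > max_depth)] = 0.0          # drop out-of-range readings

    # Small median filter knocks out salt-and-pepper noise
    d_med = cv2.medianBlur(d, 5)

    # Outlier rejection: pixels that disagree strongly with the local median are suspect
    bad = np.abs(d - d_med) > 0.05               # 5 cm tolerance
    d[bad] = d_med[bad]

    # Hole filling for small gaps: take the local median where there was no return
    holes = d == 0.0
    d[holes] = d_med[holes]
    return d
```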
What you do with depth
- Point clouds: convert depth + intrinsics to (x, y, z) per pixel. Standard input for grasp networks, mapping, navigation (see the sketch after this list).
- Surface normals: derived from depth gradients. Useful for grasping (approach perpendicular to the surface) and for finding flat tabletops.
- Object segmentation: depth boundaries often align with object boundaries.
- Collision spheres / boxes: derive bounding volumes for path planning.
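For the point-cloud and surface-normal items above, a minimal back-projection sketch under the standard pinhole model; fx, fy, cx, cy are your calibrated intrinsics, and the function names are illustrative:

```python
import numpy as np

def depth_to_points(depth_m, fx, fy, cx, cy):
    """Back-project a depth image into an organized (H, W, 3) point cloud.

    Pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
    """
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth_m
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1)

def normals_from_points(points):
    """Surface normals from depth gradients: cross product of image-space tangents.

    Pixels with missing depth give garbage normals; mask them out downstream.
    """
    du = np.gradient(points, axis=1)   # tangent along image columns
    dv = np.gradient(points, axis=0)   # tangent along image rows
    n = np.cross(du, dv)
    n /= np.linalg.norm(n, axis=-1, keepdims=True) + 1e-9
    return n
```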
Calibration matters
For accurate depth, you need:
- Camera intrinsics: focal length, principal point, distortion. Calibrate with a checkerboard (sketch at the end of this section).
- Stereo extrinsics: the relative pose between left and right cameras, from a stereo calibration.
- RGB-IR alignment: for sensors with separate color and depth cameras (Kinect, RealSense), the registration matters. Most SDKs provide this; verify it's accurate.
- Hand-eye calibration: where is the camera in the robot frame? Critical for grasping; covered in the visual-servoing lesson.
Recalibrate after the camera has been moved, dropped, or unmounted. The 5-minute checkerboard ritual saves hours of debugging.
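For the intrinsics step, that checkerboard ritual is a handful of OpenCV calls. A sketch, assuming a 9×6 inner-corner board with 25 mm squares and a calib/ folder of captures (all of those are placeholders to adapt):

```python
import glob
import cv2
import numpy as np

pattern = (9, 6)        # inner corners of the checkerboard (assumption)
square_m = 0.025        # square size in meters (assumption)
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square_m

obj_points, img_points = [], []
for path in glob.glob("calib/*.png"):          # hypothetical folder of checkerboard shots
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        obj_points.append(objp)
        img_points.append(corners)

rms, K, dist, _, _ = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("reprojection RMS (px):", rms)
print("intrinsics K:\n", K)
```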
The 2026 production stack
Most indoor mobile manipulators in 2026 use:
- RealSense D435i or L515 successor: integrated RGB-D + IMU.
- Or Azure Kinect: high-quality ToF + RGB.
- Wrist-mounted for grasping; scene-mounted for navigation.
For autonomous vehicles outdoors: stereo cameras + LiDAR + radar. Each sensor covers what the others miss.
Exercise
If you have a RealSense (~$300), stream RGB-D into a point cloud (the SDK provides this). Grab the cloud and visualize it in RViz or Open3D. Now move the camera 10 cm and grab again; watch the points shift. Then put a glass of water in front and observe the failure mode (transparency / multi-reflection). The 30 minutes you spend with raw depth data is the foundation for any later perception work.
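If you want a starting point, here is a rough sketch with pyrealsense2 and Open3D; the stream settings are illustrative and there is no error handling:

```python
import numpy as np
import pyrealsense2 as rs
import open3d as o3d

# Grab one depth frame from a RealSense and view it as a point cloud in Open3D.
pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
pipeline.start(config)

try:
    frames = pipeline.wait_for_frames()
    depth = frames.get_depth_frame()

    pc = rs.pointcloud()
    points = pc.calculate(depth)                        # SDK does depth -> XYZ for us
    xyz = np.asanyarray(points.get_vertices()).view(np.float32).reshape(-1, 3)
    xyz = xyz[xyz[:, 2] > 0]                            # drop pixels with no depth return

    cloud = o3d.geometry.PointCloud()
    cloud.points = o3d.utility.Vector3dVector(xyz.astype(np.float64))
    o3d.visualization.draw_geometries([cloud])
finally:
    pipeline.stop()
```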
Next
LiDAR — the depth sensor that took over autonomous driving. Different physics, different geometry, different software stack.