Depth: stereo and RGB-D sensors
Three ways to get depth: stereo triangulation, structured light, time-of-flight. What each measures, what each fails on, and when to pick which for your robot.
Monocular cameras tell you direction; stereo and depth cameras tell you distance. For grasping, mapping, collision avoidance, you need depth. Three sensor families dominate in 2026 — each with a different sweet spot, each with a different way to fail. Pick the wrong one and your robot reaches into walls or misses cups it should pick up.
Stereo: triangulation from two cameras
Two cameras separated by a known baseline. The same point appears at slightly different positions in the two images. The horizontal offset (disparity) is inversely proportional to depth.
depth = (focal_length × baseline) / disparity
Compute disparity with a stereo matching algorithm (block matching, semi-global matching, or learned methods like RAFT-Stereo).
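A minimal sketch of that pipeline with OpenCV's semi-global matcher, assuming a rectified image pair and a calibrated focal length and baseline (the file names and numbers below are placeholders, not from any specific camera):

```python
import cv2
import numpy as np

# Minimal stereo depth sketch: assumes rectified left/right images, focal length
# in pixels (fx_px) and baseline in meters (baseline_m) from your calibration.
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)    # hypothetical file names
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

matcher = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=128,          # must be divisible by 16
    blockSize=5,
    P1=8 * 5 * 5,                # standard SGBM smoothness penalties for blockSize=5
    P2=32 * 5 * 5,
)
# SGBM returns fixed-point disparity scaled by 16
disparity = matcher.compute(left, right).astype(np.float32) / 16.0

fx_px, baseline_m = 600.0, 0.05  # placeholder intrinsics; use your calibration values
valid = disparity > 0
depth_m = np.zeros_like(disparity)
depth_m[valid] = fx_px * baseline_m / disparity[valid]   # depth = f * B / d
```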
Strengths: passive (no active illumination); works outdoors and in direct sunlight; range limited only by baseline and image resolution (depth error grows roughly with the square of distance).
Weaknesses: requires texture (uniform surfaces have no features to match); compute cost (per-pixel matching is heavy); calibration sensitive (sub-millimeter baseline drift produces meters of depth error at distance).
Hardware: Intel RealSense D435/D455 (~$300), ZED 2 / X (~$500), or build your own with two synchronized cameras.
Structured light: project a known pattern
Project an IR pattern of dots or lines onto the scene; observe the deformation with an IR camera. Distortions encode depth.
Strengths: works on textureless surfaces (the projector adds texture). High resolution (per-pixel depth at the camera's native resolution). Excellent accuracy indoors.
Weaknesses: indoor only (sunlight overwhelms the projector). Limited range (~5 m typical). Power-hungry projector.
Hardware: Microsoft Kinect v1, Intel RealSense SR300/SR305. Used widely in older indoor robots; less common in 2026 as ToF caught up.
Time-of-flight (ToF)
Pulse an IR beam; measure how long the reflection takes. Speed of light × half round-trip time = distance. Each pixel has its own ToF receiver.
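The timing arithmetic itself is one line; a worked example with illustrative numbers:

```python
C = 299_792_458.0  # speed of light, m/s

def tof_distance_m(round_trip_s):
    # Distance is half the round-trip path the pulse travels.
    return C * round_trip_s / 2.0

print(tof_distance_m(10e-9))   # a 10 ns round trip puts the surface ~1.5 m away
```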
Strengths: works on textureless surfaces. Compact (no baseline needed). Reasonable range (10–40 m for some devices).
Weaknesses: sunlight interferes. Multipath errors (light reflects off two surfaces before returning, confuses the timing). Lower spatial resolution than stereo.
Hardware: Azure Kinect, iPhone Pro LiDAR, RealSense L515 (discontinued; successors fill the same niche). Used in many 2026 robots — Stretch, HSR.
The four-question decision tree
| Question | If yes… |
|---|---|
| Outdoors? | Stereo only |
| Textureless walls? | ToF or structured light |
| Long range (5+ m)? | Stereo or LiDAR |
| Tight compute budget? | ToF (depth comes straight off the sensor) |
Common failure modes
- Glass / mirrors: stereo struggles (each camera sees a different reflection); ToF shoots through glass and reports the surface behind it; LiDAR ghost-reflects off mirrors. No good answer; filter aggressively.
- Black surfaces: low IR reflection. Both structured light and ToF struggle. Stereo works if there's any texture.
- Sun: saturates structured light and ToF outdoors. Use stereo with an outdoor-grade IR filter, or LiDAR.
- Smoke / fog: particles scatter light, so cameras lose contrast and LiDAR returns spurious points off the particles themselves. Robotic firefighters need radar or thermal.
Depth post-processing
Raw depth from any sensor is noisy and has holes. Standard pipeline:
- Filter: median filter, edge-preserving filter, or learned depth refinement (DPT, Depth Anything as a refiner).
- Hole-filling: extend depth across small holes via interpolation.
- Temporal smoothing: average depth over several frames; cuts noise but adds latency.
- Outlier rejection: depth pixels far from their neighbors are usually wrong.
OpenCV's rgbd module (opencv_contrib) covers most of these; the RealSense SDK adds its own proprietary filters. Worth a tuning pass on the filter parameters before shipping.
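A minimal sketch of that pipeline in NumPy/OpenCV, assuming a float32 depth image in meters with zeros where the sensor returned nothing (the function name and thresholds are illustrative, not from any SDK):

```python
import numpy as np
import cv2

def clean_depth(depth_m, max_depth=5.0):
    """Basic depth clean-up: clamp, median filter, reject outliers, fill small holes.

    depth_m: float32 HxW depth image in meters, 0 where the sensor has no return.
    Thresholds are placeholders; tune them per sensor.
    """
    d = depth_m.astype(np.float32)
    d[(d <= 0) | (d > max_depth)] = 0.0          # drop out-of-range readings

    # Small median filter knocks out salt-and-pepper noise
    d_med = cv2.medianBlur(d, 5)

    # Outlier rejection: pixels that disagree strongly with the local median are suspect
    bad = np.abs(d - d_med) > 0.05               # 5 cm tolerance
    d[bad] = d_med[bad]

    # Hole filling for small gaps: take the local median where there was no return
    holes = d == 0.0
    d[holes] = d_med[holes]
    return d
```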
What you do with depth
- Point clouds: convert depth + intrinsics to (x, y, z) per pixel. Standard input for grasp networks, mapping, navigation (see the sketch after this list).
- Surface normals: derived from depth gradients. Useful for grasping (approach perpendicular to the surface) and for finding flat tabletops.
- Object segmentation: depth boundaries often align with object boundaries.
- Collision spheres / boxes: derive bounding volumes for path planning.
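For the point-cloud and surface-normal items above, a minimal back-projection sketch under the standard pinhole model; fx, fy, cx, cy are your calibrated intrinsics, and the function names are illustrative:

```python
import numpy as np

def depth_to_points(depth_m, fx, fy, cx, cy):
    """Back-project a depth image into an organized (H, W, 3) point cloud.

    Pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
    """
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth_m
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1)

def normals_from_points(points):
    """Surface normals from depth gradients: cross product of image-space tangents.

    Pixels with missing depth give garbage normals; mask them out downstream.
    """
    du = np.gradient(points, axis=1)   # tangent along image columns
    dv = np.gradient(points, axis=0)   # tangent along image rows
    n = np.cross(du, dv)
    n /= np.linalg.norm(n, axis=-1, keepdims=True) + 1e-9
    return n
```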
Calibration matters
For accurate depth, you need:
- Camera intrinsics: focal length, principal point, distortion. Calibrate with a checkerboard (sketch at the end of this section).
- Stereo extrinsics: the relative pose between left and right cameras, from a stereo calibration.
- RGB-IR alignment: for sensors with separate color and depth cameras (Kinect, RealSense), the registration matters. Most SDKs provide this; verify it's accurate.
- Hand-eye calibration: where is the camera in the robot frame? Critical for grasping; covered in the visual-servoing lesson.
Recalibrate after the camera has been moved, dropped, or unmounted. The 5-minute checkerboard ritual saves hours of debugging.
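For the intrinsics step, that checkerboard ritual is a handful of OpenCV calls. A sketch, assuming a 9×6 inner-corner board with 25 mm squares and a calib/ folder of captures (all of those are placeholders to adapt):

```python
import glob
import cv2
import numpy as np

pattern = (9, 6)        # inner corners of the checkerboard (assumption)
square_m = 0.025        # square size in meters (assumption)
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square_m

obj_points, img_points = [], []
for path in glob.glob("calib/*.png"):          # hypothetical folder of checkerboard shots
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        obj_points.append(objp)
        img_points.append(corners)

rms, K, dist, _, _ = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("reprojection RMS (px):", rms)
print("intrinsics K:\n", K)
```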
The 2026 production stack
Most indoor mobile manipulators in 2026 use:
- RealSense D435i or L515 successor: integrated RGB-D + IMU.
- Or Azure Kinect: high-quality ToF + RGB.
- Wrist-mounted for grasping; scene-mounted for navigation.
For autonomous vehicles outdoors: stereo cameras + LiDAR + radar. Each sensor covers what the others miss.
Exercise
If you have a RealSense (~$300), stream RGB-D into a point cloud (the SDK provides this). Grab the cloud and visualize it in RViz or Open3D. Now move the camera 10 cm and grab again; watch the points shift. Then put a glass of water in front and observe the failure mode (transparency / multi-reflection). The 30 minutes you spend with raw depth data is the foundation for any later perception work.
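If you want a starting point, here is a rough sketch with pyrealsense2 and Open3D; the stream settings are illustrative and there is no error handling:

```python
import numpy as np
import pyrealsense2 as rs
import open3d as o3d

# Grab one depth frame from a RealSense and view it as a point cloud in Open3D.
pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
pipeline.start(config)

try:
    frames = pipeline.wait_for_frames()
    depth = frames.get_depth_frame()

    pc = rs.pointcloud()
    points = pc.calculate(depth)                        # SDK does depth -> XYZ for us
    xyz = np.asanyarray(points.get_vertices()).view(np.float32).reshape(-1, 3)
    xyz = xyz[xyz[:, 2] > 0]                            # drop pixels with no depth return

    cloud = o3d.geometry.PointCloud()
    cloud.points = o3d.utility.Vector3dVector(xyz.astype(np.float64))
    o3d.visualization.draw_geometries([cloud])
finally:
    pipeline.stop()
```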
Next
LiDAR — the depth sensor that took over autonomous driving. Different physics, different geometry, different software stack.