
Deep learning for robot perception: what's different

Latency budgets, domain shift, and the ways robotics CV diverges from ImageNet practice. The reasons your favorite paper's model breaks on a Jetson, and what to do about it.

by RobotForge
#perception #deep-learning #robotics

A computer vision PhD trains a great ImageNet model. They join a robotics lab and try the same approach. It runs at 2 fps on the Jetson, fails on motion-blurred frames, and confuses every object that wasn't in the dataset. Robotics CV looks like web CV but operates under fundamentally different constraints. Here's the working reality.

Five differences from web CV

1. Latency dominates accuracy

An ImageNet model that runs in 100 ms on a desktop is "fast." For a 50 Hz robot control loop the period is 20 ms, so 100 ms of perception latency means the controller acts on a scene that is five cycles stale, and nothing downstream can recover that. Production robotics CV often picks a 60% mAP model that runs in 10 ms over an 80% mAP model that runs in 50 ms.

2. Domain shift is constant

The model trained in your lab meets deployment conditions it never saw:

  • Different lighting (sunset vs noon vs LED strips).
  • Different cameras (RealSense vs phone vs Mako industrial).
  • Different objects (custom parts not in any pre-trained dataset).
  • Different angles and distances than ImageNet's centered, well-framed crops.

The fix isn't "more parameters." It's domain randomization, fine-tuning on robot data, and accepting that test-time generalization is a research problem.

3. Motion blur and rolling shutter

Web CV trains on still images. Robot cameras are on moving robots. Blur, rolling-shutter distortion, and exposure variation are everyday realities. Either the model has to be invariant to them, or you have to engineer them away at the sensor (e.g., global-shutter cameras).

4. Failure modes are unbounded

An ImageNet classifier confidently mislabeling a panda as a gibbon is a paper. A robot misclassifying a pedestrian as a fire hydrant kills someone. The risk function is asymmetric. Production CV needs uncertainty estimates, OOD detection, and conservative defaults.

5. Compute is fixed

Web CV scales to as many TPUs as the bill allows. The robot has a Jetson Orin Nano. The model must fit in 8 GB and run at 30 Hz, end of negotiation. This forces architectural choices (small backbones, quantized weights, edge-friendly operators).

Production patterns

The two-stage pipeline

Slow specialist + fast tracker. Run an expensive model (DETR, Mask2Former) at 1–5 Hz to detect; run a fast tracker (SORT, ByteTrack) at 30+ Hz to associate detections across frames. Each model does what it's best at.
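A minimal sketch of the loop, assuming hypothetical camera, detector, and tracker interfaces rather than any specific library's API (ByteTrack and SORT each have their own update signatures):

    DETECT_EVERY_N = 10  # heavy detector at ~3 Hz when the loop runs at 30 Hz

    def perception_loop(camera, detector, tracker):
        """Slow-specialist / fast-tracker loop. `camera`, `detector`, and
        `tracker` are placeholder objects, not a specific library's API."""
        frame_idx = 0
        while True:
            frame = camera.read()
            if frame_idx % DETECT_EVERY_N == 0:
                # Expensive model refreshes the set of tracked objects.
                detections = detector(frame)  # boxes + scores + classes
                tracks = tracker.update(detections, frame)
            else:
                # Cheap association step keeps track IDs alive between detections.
                tracks = tracker.predict(frame)
            yield frame_idx, tracks
            frame_idx += 1

The detector's latency no longer bounds the loop rate; only the tracker's does.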

Domain randomization

If you train in sim or on a small lab dataset, randomize:

  • Lighting (color, intensity, shadows).
  • Camera intrinsics (focal length within ±10%).
  • Image augmentation (blur, JPEG compression, color jitter).
  • Object pose, color, texture.
  • Background clutter and distractors.

Models trained on randomized data generalize dramatically better to the deployment robot. Cost: maybe 30% more training time.
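A sketch of the image-level part of that list as a training-time augmentation stack, here with albumentations (ranges are illustrative starting points, and parameter names like quality_lower follow its classic API, which newer versions rename):

    import albumentations as A

    # Approximates the randomization list above for real-image training.
    train_aug = A.Compose([
        A.RandomBrightnessContrast(brightness_limit=0.3, contrast_limit=0.3, p=0.7),
        A.ColorJitter(brightness=0, contrast=0, saturation=0.3, hue=0.05, p=0.5),
        A.MotionBlur(blur_limit=9, p=0.3),    # camera motion
        A.GaussNoise(p=0.2),                  # sensor noise
        A.ImageCompression(quality_lower=40, quality_upper=95, p=0.3),
        A.RandomShadow(p=0.2),                # lighting / shadows
    ], bbox_params=A.BboxParams(format="pascal_voc", label_fields=["labels"]))

    # Per sample: out = train_aug(image=img, bboxes=boxes, labels=labels)

Object pose, texture, and background clutter can't be faked by 2D augmentation; those come from the simulator or from collecting more varied real data.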

Knowledge distillation

Train a giant model (ConvNeXt-XL, EVA-02). Use its outputs as soft labels to train a small student (MobileNet-V3, EfficientNet-B0). The student matches 90%+ of the teacher's accuracy at 5–10× the speed.

This is standard practice when deploying modern CV models to production.
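The loss is simple enough to show in full. A minimal version of the classic soft-label distillation objective (Hinton-style); the temperature and mixing weight below are common defaults, not values from this post:

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
        # Hard-label term: ordinary cross-entropy against ground truth.
        hard = F.cross_entropy(student_logits, labels)
        # Soft-label term: match the teacher's tempered distribution.
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)  # rescale so gradient magnitude is independent of T
        return alpha * hard + (1 - alpha) * soft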

Quantization-aware training

INT8 weights make models 4× faster on edge hardware. Post-hoc quantization works but loses ~2% accuracy. Quantization-aware training (training with simulated INT8 forward passes) loses only ~0.5%. Use it for production deployment.
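A sketch of eager-mode QAT in PyTorch on a toy model; real workflows add module fusion and, on a Jetson, an export step so TensorRT supplies the actual INT8 kernels. API details vary across PyTorch versions:

    import torch
    import torch.nn as nn
    from torch.ao.quantization import (
        QuantStub, DeQuantStub, get_default_qat_qconfig, prepare_qat, convert,
    )

    class TinyNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.quant = QuantStub()      # float -> int8 boundary
            self.conv = nn.Conv2d(3, 16, 3)
            self.relu = nn.ReLU()
            self.dequant = DeQuantStub()  # int8 -> float boundary

        def forward(self, x):
            return self.dequant(self.relu(self.conv(self.quant(x))))

    model = TinyNet().train()
    model.qconfig = get_default_qat_qconfig("qnnpack")  # ARM; "fbgemm" on x86
    qat_model = prepare_qat(model)   # inserts fake-quant observers
    # ... normal training loop: forward passes simulate INT8 rounding ...
    qat_model.eval()
    int8_model = convert(qat_model)  # real INT8 weights for deployment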

The latency / accuracy frontier

Production CV is on a Pareto frontier between latency and accuracy. Decision rules:

  • If you can halve latency by giving up <5% accuracy → optimize for latency.
  • If you have spare compute → use a bigger model.
  • If you can offload to a server → use the biggest model that fits in the network round-trip budget.

Rarely is the answer "use the biggest model on Hugging Face."

Uncertainty: the missing dimension

Robotics needs to know when its CV is confident versus guessing. Standard models output softmax confidences, but these are notoriously miscalibrated.

Modern uncertainty methods:

  • Temperature scaling: post-hoc calibration of softmax. Quick, effective.
  • Ensemble: train 5 models; disagreement = uncertainty. Heavy compute but reliable.
  • Monte Carlo dropout: keep dropout on at inference; sample many times; variance = uncertainty. Cheaper than ensembles.
  • Open-set detection: detect "this isn't anything I was trained on." Energy-based, ODIN, OpenMax.

For production: ensemble + temperature scaling for confidence; explicit OOD detector for safety-critical decisions.
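Temperature scaling and MC dropout are small enough to show. A sketch, assuming you have collected held-out validation logits and labels:

    import torch
    import torch.nn.functional as F

    def fit_temperature(logits, labels):
        """Post-hoc calibration: fit one scalar T by minimizing NLL on
        validation data. Argmax is unchanged; only confidences move."""
        T = torch.ones(1, requires_grad=True)
        opt = torch.optim.LBFGS([T], lr=0.01, max_iter=100)

        def closure():
            opt.zero_grad()
            loss = F.cross_entropy(logits / T, labels)
            loss.backward()
            return loss

        opt.step(closure)
        return T.detach()  # at inference: probs = softmax(logits / T)

    def mc_dropout_predict(model, x, n_samples=20):
        """Keep dropout active at inference; the spread across samples is
        the uncertainty signal. (Freeze BatchNorm separately in real use.)"""
        model.train()  # enables dropout
        with torch.no_grad():
            probs = torch.stack([model(x).softmax(-1) for _ in range(n_samples)])
        return probs.mean(0), probs.std(0)  # mean prediction, per-class spread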

Continual learning

The robot encounters new objects in deployment. Periodic re-fine-tuning (weekly for some production fleets) is common practice. The pipeline:

  1. Robot deploys with model V1.
  2. Logs flag low-confidence detections; ship to base.
  3. Operators annotate a sample.
  4. Fine-tune V2 on new + old data.
  5. OTA update.

Continual learning's catastrophic-forgetting problem: V2 forgets things V1 knew. Mitigations: replay buffer (keep some old data), elastic weight consolidation, parameter-efficient fine-tuning (LoRA on the encoder).
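A minimal replay-buffer sketch for step 4, assuming PyTorch datasets; the 30% replay fraction is an illustrative choice, not a recommendation from this post:

    import random
    from torch.utils.data import ConcatDataset, Subset

    def build_finetune_set(old_dataset, new_dataset, replay_fraction=0.3):
        # All new data, plus a random sample of old data so V2 keeps
        # performing on what V1 already handled.
        n_replay = int(len(old_dataset) * replay_fraction)
        replay_idx = random.sample(range(len(old_dataset)), n_replay)
        return ConcatDataset([new_dataset, Subset(old_dataset, replay_idx)])

Train V2 from V1's weights on this mixture, then run the old evaluation suite before the OTA update to catch regressions.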

The model-zoo decisions

  Task                          2026 default
  Detection (real-time)         YOLOv8 / RT-DETR
  Segmentation                  SAM (slow) / Mask2Former / YOLO-Seg
  Pose estimation               FoundationPose (2024)
  Depth (monocular)             Depth Anything v2
  Classification (open-vocab)   CLIP, SigLIP
  Feature matching (SLAM)       SuperPoint + LightGlue

The frontier moves yearly; the architectural pattern is stable. Pick a model from the right column, fine-tune for your robot, optimize for your edge, deploy.

Robustness benchmarks

Don't trust ImageNet accuracy as a proxy for robotics performance. Use:

  • Robotic-domain datasets: YCB-Video, Open X-Embodiment.
  • OOD-stress tests: ImageNet-C (corruptions), motion blur, low light.
  • Real-robot A/B tests: deploy two models, measure task success rate.

The metric that matters is end-to-end task success, not isolated detection mAP.

Where it's heading

  • Foundation models (DINOv2, CLIP, SAM) provide the pretrained backbones for almost every robotics CV task.
  • Vision-language models (covered in the next lesson) replace fixed-class detectors with text-prompted ones.
  • 3D-aware models: NeRF, Gaussian splatting, mesh prediction. Useful for tasks needing dense geometry.
  • End-to-end vision-language-action models (VLAs) are eating the perception pipeline for many manipulation tasks. Less explicit "detect → grasp"; more "see → act."

Exercise

Deploy a YOLOv8m to a Jetson Orin Nano with TensorRT. Measure latency. Then retrain it with random brightness, blur, and JPEG-compression augmentation. Re-deploy. Measure latency (same) and accuracy on a held-out test set with motion-blurred images. The accuracy gap between the two trainings is what production augmentation buys you.
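A hedged starter for the latency half, assuming the ultralytics package and a TensorRT-capable Jetson:

    import time
    import numpy as np
    from ultralytics import YOLO

    YOLO("yolov8m.pt").export(format="engine", half=True)  # writes yolov8m.engine
    model = YOLO("yolov8m.engine")

    frame = np.random.randint(0, 255, (640, 640, 3), dtype=np.uint8)
    for _ in range(20):                   # warm-up: allocators, clocks
        model(frame, verbose=False)

    times = []
    for _ in range(200):
        t0 = time.perf_counter()
        model(frame, verbose=False)
        times.append((time.perf_counter() - t0) * 1000)
    print(f"median latency: {np.median(times):.1f} ms")

Report the median, not the mean; clock governors on embedded boards make the tail noisy.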

Next

Vision-language models for embodied tasks — CLIP, GroundingDINO, SAM, and how modern pipelines glue them together for open-vocabulary robot perception.
