Object detection for robots (YOLO, RT-DETR)
The 2D detection model ecosystem, picked with a robotics lens — latency, small-object accuracy, edge deployment. The runtime constraints that separate paper benchmarks from production.
For most robots, 2D object detection is the first step in seeing the world: "there's a cup at pixel (320, 240) with 95% confidence." The model ecosystem has matured into a few clear winners, but robotics constraints (real-time, edge, small objects) push you to different choices than ImageNet leaderboards. Here's the practical view.
What "detection" is
Given an image, output a list of detected objects: each with a 2D bounding box, a class label, and a confidence score.
[
{bbox: [120, 180, 220, 280], class: 'cup', conf: 0.91},
{bbox: [310, 50, 410, 240], class: 'person', conf: 0.99},
{bbox: [50, 300, 80, 340], class: 'screw', conf: 0.62},
]
Three things every detector does: where, what, how confident. Different architectures balance them differently.
The two model families
1. CNN-based (YOLO family)
Convolutional encoder + dense detection head. Each grid cell predicts a few bounding boxes. Fast, well-optimized, mature.
- YOLOv8 / YOLOv11 (Ultralytics): the production default in 2026. Multiple sizes (nano to extra-large). Pretrained on COCO; fine-tunable on custom data with a one-liner.
- YOLOv7: still good; stable.
- YOLO-NAS: NAS-optimized; better latency at fixed accuracy.
2. Transformer-based (DETR family)
Encoder-decoder transformer. Outputs detections directly via learned object queries. Cleaner formulation; harder to optimize.
- DETR (Carion et al. 2020): the original. Slow but elegant.
- RT-DETR (Baidu, 2023): real-time DETR. Beats YOLO on COCO with similar latency.
- DINO-DETR: state-of-the-art accuracy; production-grade.
For 2026 robotics: YOLO when you want simple + reliable; RT-DETR when you want top accuracy with comparable speed.
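For orientation, here is a minimal inference sketch using the Ultralytics API (the checkpoint name and image path are placeholders; `predict` and the `boxes` fields are the library's standard interface):

```python
from ultralytics import YOLO

# Load a COCO-pretrained nano model (downloads the checkpoint on first use).
model = YOLO("yolov8n.pt")

# Run detection on one image; conf filters out low-confidence boxes.
results = model.predict("frame.jpg", conf=0.25)

# Each result carries boxes in xyxy pixel coordinates, class ids, and scores —
# exactly the where / what / how-confident triple described above.
for box, cls, conf in zip(results[0].boxes.xyxy, results[0].boxes.cls, results[0].boxes.conf):
    x1, y1, x2, y2 = box.tolist()
    label = model.names[int(cls)]
    print(f"{label}: conf={conf:.2f}, bbox=({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")
```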
The latency budget
Robotics has tight latency constraints. A 200-Hz control loop wants visual feedback at 20+ Hz. With perception running on a Jetson Orin or similar:
| Model | Resolution | Latency (Orin Nano) | COCO mAP |
|---|---|---|---|
| YOLOv8n | 640×640 | ~10 ms | 37.3 |
| YOLOv8s | 640×640 | ~25 ms | 44.9 |
| YOLOv8m | 640×640 | ~50 ms | 50.2 |
| RT-DETR-R50 | 640×640 | ~30 ms | 53.1 |
Numbers are approximate; they depend on the TensorRT version and quantization settings.
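Rather than trusting tables, measure on your own device. A rough timing harness, assuming the Ultralytics API and a dummy 640×640 frame:

```python
import time
import numpy as np
from ultralytics import YOLO

model = YOLO("yolov8n.pt")
frame = np.zeros((640, 640, 3), dtype=np.uint8)  # dummy 640x640 RGB frame

# Warm up: the first calls include model setup and kernel compilation.
for _ in range(10):
    model.predict(frame, verbose=False)

# Timed loop: end-to-end wall-clock latency per frame.
n = 100
start = time.perf_counter()
for _ in range(n):
    model.predict(frame, verbose=False)
print(f"mean latency: {(time.perf_counter() - start) / n * 1000:.1f} ms")
```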
Optimization for edge deployment
Pretrained models running in stock PyTorch are too slow on a Jetson. The production pipeline:
- Train in PyTorch, export to ONNX.
- Convert to TensorRT: NVIDIA's optimized inference engine. 3–10× speedup.
- Quantize to INT8: requires calibration data; another 2–4× speedup at some cost in accuracy.
- Batch when possible: process multiple frames per call.
YOLOv8 has TensorRT export built in: yolo export model=best.pt format=engine.
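The Python equivalent, sketched (the `half`, `int8`, and `data` export arguments exist in recent Ultralytics releases, but export flags have shifted between versions, so verify against yours):

```python
from ultralytics import YOLO

model = YOLO("best.pt")

# FP16 TensorRT engine: usually the best speed/accuracy trade-off on Jetson.
model.export(format="engine", half=True)

# INT8 engine: needs calibration images, supplied via a dataset YAML.
model.export(format="engine", int8=True, data="my_data.yaml")
```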
The small-object problem
COCO-trained detectors do best on objects larger than about 32×32 pixels (COCO's own threshold for "small"). Robotics often needs to see screws, push-buttons, tiny labels — sub-10-pixel objects. Tactics:
- Higher input resolution: 1280×1280 vs 640×640. Trades 4× compute for ~10% accuracy on small objects.
- Tile the image: split into patches, run detection on each, merge.
- Smaller anchor sizes: bias the model toward small detections.
- SAHI (Slicing Aided Hyper Inference): a library that automates tile-based detection.
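A sketch of tiled inference with SAHI (API names follow recent SAHI releases; the slice size and overlap are tuning knobs, and the checkpoint path is a placeholder):

```python
from sahi import AutoDetectionModel
from sahi.predict import get_sliced_prediction

# Wrap a fine-tuned YOLOv8 checkpoint in SAHI's model interface.
detection_model = AutoDetectionModel.from_pretrained(
    model_type="yolov8",
    model_path="best.pt",
    confidence_threshold=0.3,
    device="cuda:0",
)

# Slice the image into overlapping 512x512 tiles, detect on each, merge results.
result = get_sliced_prediction(
    "workbench.jpg",
    detection_model,
    slice_height=512,
    slice_width=512,
    overlap_height_ratio=0.2,
    overlap_width_ratio=0.2,
)

for pred in result.object_prediction_list:
    bbox = pred.bbox  # merged box back in full-image coordinates
    print(pred.category.name, pred.score.value, (bbox.minx, bbox.miny, bbox.maxx, bbox.maxy))
```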
For pick-and-place at sub-cm accuracy, none of the standard models is sufficient out of the box. Custom training + tiling + sub-pixel center refinement is the path.
Fine-tuning for your robot
COCO-pretrained models know about cats and pizzas, not robotic parts and tools. Standard recipe:
- Collect ~500 images from the robot's actual cameras.
- Annotate with a tool (CVAT, Roboflow). 30–60 minutes per 100 images for boxes.
- Fine-tune the smallest YOLO that meets accuracy: yolo train model=yolov8n.pt data=my_data.yaml epochs=100.
- Export, deploy, evaluate.
Eight hours from "I want to detect screws" to "robot detects screws on the conveyor."
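The same recipe through the Python API, sketched (paths, the class name, and the YAML layout are illustrative):

```python
from ultralytics import YOLO

# my_data.yaml (Ultralytics dataset format), roughly:
#   path: /data/screws
#   train: images/train
#   val: images/val
#   names:
#     0: screw

model = YOLO("yolov8n.pt")                        # start from COCO-pretrained weights
model.train(data="my_data.yaml", epochs=100, imgsz=640)

metrics = model.val()                             # mAP on the validation split
model.export(format="onnx")                       # first step of the edge pipeline
```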
The synthetic-data option
Real annotations are expensive. Alternatives:
- Synthetic data: render objects in NVIDIA Omniverse / Unity. Free annotation. Domain randomization for sim-to-real.
- Cut-and-paste: foreground objects + random backgrounds = pseudo-real images.
- Open-source detection datasets: Open Images, LVIS, Objects365. Sometimes cover your classes.
- Vision-language pretrained detectors: GroundingDINO can detect "yellow cups" with no fine-tuning. Slower but sometimes good enough.
For specialty objects (custom robotic parts, branded items), synthetic data is increasingly the cheapest path.
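As an illustration of the cut-and-paste idea, a minimal compositing sketch with Pillow; the point is that the paste position gives you the bounding-box annotation for free (paths are placeholders, and the foreground is assumed smaller than the background):

```python
import random
from PIL import Image

def composite(foreground_path: str, background_path: str):
    """Paste a cut-out object onto a background; return image + YOLO-style box."""
    fg = Image.open(foreground_path).convert("RGBA")  # object cut-out with alpha mask
    bg = Image.open(background_path).convert("RGB")

    # Random placement; the paste position defines the ground-truth box.
    # Assumes fg is smaller than bg in both dimensions.
    x = random.randint(0, bg.width - fg.width)
    y = random.randint(0, bg.height - fg.height)
    bg.paste(fg, (x, y), mask=fg)  # alpha channel used as the paste mask

    # Normalized (cx, cy, w, h), the format YOLO annotation files expect.
    cx = (x + fg.width / 2) / bg.width
    cy = (y + fg.height / 2) / bg.height
    return bg, (cx, cy, fg.width / bg.width, fg.height / bg.height)
```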
What detection misses
- Pose: you get a 2D bounding box, not 3D pose. Use depth + pose estimation (FoundationPose) on top.
- Occluded objects: partial visibility drags confidence scores down, so occluded objects often become false negatives.
- Novel objects: unfamiliar items get confidently misclassified. Open-vocabulary detection (next lesson) helps.
- Fine-grained categories: "Phillips screwdriver" vs "flathead" needs targeted data.
Where this fits in a robotics pipeline
For a typical mobile manipulator:
- Camera streams RGB at 30 Hz.
- YOLO detects target objects per frame (~25 ms).
- Tracker associates detections across frames (Kalman filter, ByteTrack).
- Depth from RGB-D gives 3D position of each detection.
- Segmentation refines the mask (next lesson).
- Grasp planner selects from the 3D-localized objects.
Detection alone isn't enough; it's the gateway to everything downstream.
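A compressed sketch of steps 2–4, using Ultralytics' built-in ByteTrack tracker and pinhole back-projection. The intrinsics and get_aligned_depth() are assumptions standing in for your calibrated RGB-D driver:

```python
import numpy as np
from ultralytics import YOLO

FX, FY, CX, CY = 615.0, 615.0, 320.0, 240.0  # placeholder pinhole intrinsics

def detection_to_3d(u: float, v: float, depth_m: float) -> np.ndarray:
    """Back-project a pixel plus metric depth into the camera frame."""
    return np.array([(u - CX) * depth_m / FX, (v - CY) * depth_m / FY, depth_m])

model = YOLO("best.pt")

# track() runs detection per frame and associates boxes across frames (ByteTrack).
for result in model.track(source=0, tracker="bytetrack.yaml", stream=True):
    depth = get_aligned_depth()  # HYPOTHETICAL: depth image aligned to RGB, in meters
    if result.boxes.id is None:  # no confirmed tracks this frame
        continue
    for box, track_id in zip(result.boxes.xywh, result.boxes.id):
        u, v = float(box[0]), float(box[1])  # box center in pixels
        point = detection_to_3d(u, v, float(depth[int(v), int(u)]))
        print(f"track {int(track_id)}: {point} m")
```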
Exercise
Train YOLOv8n on a custom dataset of 200 images of a single object class. Deploy to a Jetson Orin Nano. Measure latency. Run on a live camera feed. The first time the bounding box stays glued to a moving object on cheap hardware is when robotics CV stops feeling like research.
Next
Semantic and instance segmentation — what to do when you need pixel-precise masks, not just bounding boxes.