Object detection for robots (YOLO, RT-DETR)
The 2D detection model ecosystem, picked with a robotics lens — latency, small-object accuracy, edge deployment. The runtime constraints that separate paper benchmarks from production.
For most robots, 2D object detection is the first step in seeing the world: "there's a cup at pixel (320, 240) with 95% confidence." The model ecosystem has matured into a few clear winners, but robotics constraints (real-time, edge, small objects) push you to different choices than ImageNet leaderboards. Here's the practical view.
What "detection" is
Given an image, output a list of detected objects: each with a 2D bounding box, a class label, and a confidence score.
[
{bbox: [120, 180, 220, 280], class: 'cup', conf: 0.91},
{bbox: [310, 50, 410, 240], class: 'person', conf: 0.99},
{bbox: [50, 300, 80, 340], class: 'screw', conf: 0.62},
]
Three things every detector does: where, what, how confident. Different architectures balance them differently.
The two model families
1. CNN-based (YOLO family)
Convolutional encoder + dense detection head. Each grid cell predicts a few bounding boxes. Fast, well-optimized, mature.
- YOLOv8 / YOLOv11 (Ultralytics): the production default in 2026. Multiple sizes (nano to extra-large). Pretrained on COCO; fine-tunable on custom data with a one-liner.
- YOLOv7: still good; stable.
- YOLO-NAS: NAS-optimized; better latency at fixed accuracy.
2. Transformer-based (DETR family)
Encoder-decoder transformer. Outputs detections directly via learned object queries. Cleaner formulation; harder to optimize.
- DETR (Carion et al. 2020): the original. Slow but elegant.
- RT-DETR (Baidu, 2023): real-time DETR. Beats YOLO on COCO with similar latency.
- DINO-DETR: state-of-the-art accuracy; production-grade.
For 2026 robotics: YOLO when you want simple + reliable; RT-DETR when you want top accuracy with comparable speed.
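For orientation, here is a minimal inference sketch using the Ultralytics API (the checkpoint name and image path are placeholders; `predict` and the `boxes` fields are the library's standard interface):

```python
from ultralytics import YOLO

# Load a COCO-pretrained nano model (downloads the checkpoint on first use).
model = YOLO("yolov8n.pt")

# Run detection on one image; conf filters out low-confidence boxes.
results = model.predict("frame.jpg", conf=0.25)

# Each result carries boxes in xyxy pixel coordinates, class ids, and scores —
# exactly the where / what / how-confident triple described above.
for box, cls, conf in zip(results[0].boxes.xyxy, results[0].boxes.cls, results[0].boxes.conf):
    x1, y1, x2, y2 = box.tolist()
    label = model.names[int(cls)]
    print(f"{label}: conf={conf:.2f}, bbox=({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")
```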
The latency budget
Robotics has tight latency constraints. A 200-Hz control loop wants visual feedback at 20+ Hz. With perception running on a Jetson Orin or similar:
| Model | Resolution | Latency (Orin Nano) | COCO mAP |
|---|---|---|---|
| YOLOv8n | 640×640 | ~10 ms | 37.3 |
| YOLOv8s | 640×640 | ~25 ms | 44.9 |
| YOLOv8m | 640×640 | ~50 ms | 50.2 |
| RT-DETR-R50 | 640×640 | ~30 ms | 53.1 |
Numbers are approximate; they depend on the TensorRT version and quantization settings.
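Rather than trusting tables, measure on your own device. A rough timing harness, assuming the Ultralytics API and a dummy 640×640 frame:

```python
import time
import numpy as np
from ultralytics import YOLO

model = YOLO("yolov8n.pt")
frame = np.zeros((640, 640, 3), dtype=np.uint8)  # dummy 640x640 RGB frame

# Warm up: the first calls include model setup and kernel compilation.
for _ in range(10):
    model.predict(frame, verbose=False)

# Timed loop: end-to-end wall-clock latency per frame.
n = 100
start = time.perf_counter()
for _ in range(n):
    model.predict(frame, verbose=False)
print(f"mean latency: {(time.perf_counter() - start) / n * 1000:.1f} ms")
```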
Optimization for edge deployment
Pretrained models running in stock PyTorch are too slow on a Jetson. The production pipeline:
- Train in PyTorch, export to ONNX.
- Convert to TensorRT: NVIDIA's optimized inference engine. 3–10× speedup.
- Quantize to INT8: requires calibration data; another 2–4× speedup at some cost in accuracy.
- Batch when possible: process multiple frames per call.
YOLOv8 has TensorRT export built in: yolo export model=best.pt format=engine.
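The Python equivalent, sketched (the `half`, `int8`, and `data` export arguments exist in recent Ultralytics releases, but export flags have shifted between versions, so verify against yours):

```python
from ultralytics import YOLO

model = YOLO("best.pt")

# FP16 TensorRT engine: usually the best speed/accuracy trade-off on Jetson.
model.export(format="engine", half=True)

# INT8 engine: needs calibration images, supplied via a dataset YAML.
model.export(format="engine", int8=True, data="my_data.yaml")
```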
The small-object problem
COCO-trained detectors do best on objects larger than about 32×32 pixels (COCO's own threshold for "small"). Robotics often needs to see screws, push-buttons, tiny labels — sub-10-pixel objects. Tactics:
- Higher input resolution: 1280×1280 vs 640×640. Trades 4× compute for ~10% accuracy on small objects.
- Tile the image: split into patches, run detection on each, merge.
- Smaller anchor sizes: bias the model toward small detections.
- SAHI (Slicing Aided Hyper Inference): a library that automates tile-based detection.
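A sketch of tiled inference with SAHI (API names follow recent SAHI releases; the slice size and overlap are tuning knobs, and the checkpoint path is a placeholder):

```python
from sahi import AutoDetectionModel
from sahi.predict import get_sliced_prediction

# Wrap a fine-tuned YOLOv8 checkpoint in SAHI's model interface.
detection_model = AutoDetectionModel.from_pretrained(
    model_type="yolov8",
    model_path="best.pt",
    confidence_threshold=0.3,
    device="cuda:0",
)

# Slice the image into overlapping 512x512 tiles, detect on each, merge results.
result = get_sliced_prediction(
    "workbench.jpg",
    detection_model,
    slice_height=512,
    slice_width=512,
    overlap_height_ratio=0.2,
    overlap_width_ratio=0.2,
)

for pred in result.object_prediction_list:
    bbox = pred.bbox  # merged box back in full-image coordinates
    print(pred.category.name, pred.score.value, (bbox.minx, bbox.miny, bbox.maxx, bbox.maxy))
```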
For pick-and-place at sub-cm accuracy, none of the standard models is sufficient out of the box. Custom training + tiling + sub-pixel center refinement is the path.
Fine-tuning for your robot
COCO-pretrained models know about cats and pizzas, not robotic parts and tools. Standard recipe:
- Collect ~500 images from the robot's actual cameras.
- Annotate with a tool (CVAT, Roboflow). 30–60 minutes per 100 images for boxes.
- Fine-tune the smallest YOLO that meets accuracy: yolo train model=yolov8n.pt data=my_data.yaml epochs=100.
- Export, deploy, evaluate.
Eight hours from "I want to detect screws" to "robot detects screws on the conveyor."
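The same recipe through the Python API, sketched (paths, the class name, and the YAML layout are illustrative):

```python
from ultralytics import YOLO

# my_data.yaml (Ultralytics dataset format), roughly:
#   path: /data/screws
#   train: images/train
#   val: images/val
#   names:
#     0: screw

model = YOLO("yolov8n.pt")                        # start from COCO-pretrained weights
model.train(data="my_data.yaml", epochs=100, imgsz=640)

metrics = model.val()                             # mAP on the validation split
model.export(format="onnx")                       # first step of the edge pipeline
```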
The synthetic-data option
Real annotations are expensive. Alternatives:
- Synthetic data: render objects in NVIDIA Omniverse / Unity. Free annotation. Domain randomization for sim-to-real.
- Cut-and-paste: foreground objects + random backgrounds = pseudo-real images.
- Open-source detection datasets: Open Images, LVIS, Objects365. Sometimes cover your classes.
- Vision-language pretrained detectors: GroundingDINO can detect "yellow cups" with no fine-tuning. Slower but sometimes good enough.
For specialty objects (custom robotic parts, branded items), synthetic data is increasingly the cheapest path.
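As an illustration of the cut-and-paste idea, a minimal compositing sketch with Pillow; the point is that the paste position gives you the bounding-box annotation for free (paths are placeholders, and the foreground is assumed smaller than the background):

```python
import random
from PIL import Image

def composite(foreground_path: str, background_path: str):
    """Paste a cut-out object onto a background; return image + YOLO-style box."""
    fg = Image.open(foreground_path).convert("RGBA")  # object cut-out with alpha mask
    bg = Image.open(background_path).convert("RGB")

    # Random placement; the paste position defines the ground-truth box.
    # Assumes fg is smaller than bg in both dimensions.
    x = random.randint(0, bg.width - fg.width)
    y = random.randint(0, bg.height - fg.height)
    bg.paste(fg, (x, y), mask=fg)  # alpha channel used as the paste mask

    # Normalized (cx, cy, w, h), the format YOLO annotation files expect.
    cx = (x + fg.width / 2) / bg.width
    cy = (y + fg.height / 2) / bg.height
    return bg, (cx, cy, fg.width / bg.width, fg.height / bg.height)
```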
What detection misses
- Pose: you get a 2D bounding box, not 3D pose. Use depth + pose estimation (FoundationPose) on top.
- Occluded objects: partial visibility drags confidence scores down, so occluded objects often become false negatives.
- Novel objects: unfamiliar items get confidently misclassified. Open-vocabulary detection (next lesson) helps.
- Fine-grained categories: "Phillips screwdriver" vs "flathead" needs targeted data.
Where this fits in a robotics pipeline
For a typical mobile manipulator:
- Camera streams RGB at 30 Hz.
- YOLO detects target objects per frame (~25 ms).
- Tracker associates detections across frames (Kalman filter, ByteTrack).
- Depth from RGB-D gives 3D position of each detection.
- Segmentation refines the mask (next lesson).
- Grasp planner selects from the 3D-localized objects.
Detection alone isn't enough; it's the gateway to everything downstream.
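A compressed sketch of steps 2–4, using Ultralytics' built-in ByteTrack tracker and pinhole back-projection. The intrinsics and get_aligned_depth() are assumptions standing in for your calibrated RGB-D driver:

```python
import numpy as np
from ultralytics import YOLO

FX, FY, CX, CY = 615.0, 615.0, 320.0, 240.0  # placeholder pinhole intrinsics

def detection_to_3d(u: float, v: float, depth_m: float) -> np.ndarray:
    """Back-project a pixel plus metric depth into the camera frame."""
    return np.array([(u - CX) * depth_m / FX, (v - CY) * depth_m / FY, depth_m])

model = YOLO("best.pt")

# track() runs detection per frame and associates boxes across frames (ByteTrack).
for result in model.track(source=0, tracker="bytetrack.yaml", stream=True):
    depth = get_aligned_depth()  # HYPOTHETICAL: depth image aligned to RGB, in meters
    if result.boxes.id is None:  # no confirmed tracks this frame
        continue
    for box, track_id in zip(result.boxes.xywh, result.boxes.id):
        u, v = float(box[0]), float(box[1])  # box center in pixels
        point = detection_to_3d(u, v, float(depth[int(v), int(u)]))
        print(f"track {int(track_id)}: {point} m")
```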
Exercise
Train YOLOv8n on a custom dataset of 200 images of a single object class. Deploy to a Jetson Orin Nano. Measure latency. Run on a live camera feed. The first time the bounding box stays glued to a moving object on cheap hardware is when robotics CV stops feeling like research.
Next
Semantic and instance segmentation — what to do when you need pixel-precise masks, not just bounding boxes.