Semantic and instance segmentation
Pixel-precise masks instead of bounding boxes. SAM, Mask2Former, YOLOv8-Seg — what each gives you, and how to pick the right tool for grasping, mapping, or scene understanding.
A bounding box says "the cup is here-ish." A mask says "here are the 1,843 pixels that are exactly the cup." For grasping, you need the mask. For free-space mapping, you need the mask. For detecting "the floor I can drive on" pixel by pixel, you need the mask. Segmentation is the next step up in precision beyond detection, and 2023's Segment Anything Model (SAM) made it dramatically more practical.
Three flavors of segmentation
| Type | What it produces | Use case |
|---|---|---|
| Semantic | Class label per pixel ("road", "person") | Free-space, sky, terrain understanding |
| Instance | Mask per individual object instance | Grasping (need to separate touching cups) |
| Panoptic | Both — every pixel labeled, things separated | Autonomous driving (full scene) |
Pick by what you need: semantic for "what kind of surface is this?", instance for "which individual object?", panoptic for "tell me everything."
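The distinction is concrete in the output arrays. A minimal sketch (NumPy, with a hypothetical 4×4 scene) of how the three flavors differ:

```python
import numpy as np

# Hypothetical 4x4 scene: two cups (class 1) on a table (class 0), touching.
# Semantic segmentation: one class ID per pixel -- the two cups blur together.
semantic = np.array([
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
])

# Instance segmentation: one boolean mask per object -- the cups stay separate.
cup_a = np.array([
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
], dtype=bool)
cup_b = np.array([
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
], dtype=bool)

# Panoptic = both: every pixel carries a (class_id, instance_id) pair.
instance_ids = cup_a.astype(int) * 1 + cup_b.astype(int) * 2
panoptic = np.stack([semantic, instance_ids], axis=-1)

print((semantic == 1).sum())      # 4 cup pixels total -- cups merged
print(cup_a.sum(), cup_b.sum())   # 2 2 -- separable, so graspable individually
```

The semantic map cannot tell you which of the two touching cups to grasp; the instance masks can. That is the whole reason grasping pipelines need instance (or panoptic) output.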
The 2026 model landscape
Segment Anything (SAM, Meta 2023)
The breakthrough. A foundation model for segmentation: prompt it with a click, a box, or a text label, get back a precise mask. Trained on 11M images with 1B+ masks.
- Strengths: zero-shot. Works on any object without fine-tuning. Mask quality is consistently good.
- Weaknesses: doesn't classify (you give it the prompt). Slow on edge hardware (~500 ms on Jetson).
SAM 2 (2024) handles videos with mask propagation across frames.
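Prompting SAM with a single click looks roughly like this. A sketch, not a definitive recipe: `segment_anything` is the official package, but the checkpoint filename and the `pick_best_mask` / `segment_at_click` helpers are illustrative names of my own.

```python
import numpy as np

def pick_best_mask(masks, scores):
    """Choose the highest-scoring of SAM's candidate masks.

    With multimask_output=True, SAM returns several masks at different
    granularities (part / whole object / surrounding region); the
    predicted IoU score ranks them.
    """
    return masks[int(np.argmax(scores))]

def segment_at_click(image_rgb, x, y, checkpoint='sam_vit_b_01ec64.pth'):
    # Heavy imports kept local: needs the segment-anything package + weights.
    from segment_anything import sam_model_registry, SamPredictor
    sam = sam_model_registry['vit_b'](checkpoint=checkpoint)
    predictor = SamPredictor(sam)
    predictor.set_image(image_rgb)             # expects RGB, shape (H, W, 3)
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[x, y]]),       # pixel coordinates of the click
        point_labels=np.array([1]),            # 1 = foreground click
        multimask_output=True,
    )
    return pick_best_mask(masks, scores)       # (H, W) boolean mask
```

Note that SAM returns a mask but no class label: the click is the only thing telling it which object you meant.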
Mask2Former (Meta, 2022)
Universal segmentation: same architecture handles semantic, instance, and panoptic. Higher accuracy than SAM on benchmarks; needs fine-tuning per dataset.
- Strengths: state-of-the-art accuracy on supervised benchmarks. Outputs class labels.
- Weaknesses: needs fine-tuning; not zero-shot. Slow.
YOLOv8-Seg / YOLOv11-Seg
YOLO's instance segmentation variant. Fast, integrates with the YOLO pipeline. Good for production where speed matters and class set is known.
- Strengths: fast (similar latency to YOLO detection). One-pass inference.
- Weaknesses: mask quality is rougher than SAM/Mask2Former.
Open-vocabulary (LISA, GroundingSAM)
Combine vision-language models with segmentation. Prompt with text: "the red object on the left." Get back a mask.
- Strengths: language-conditioned. Useful when class set is dynamic.
- Weaknesses: slower than YOLO; less accurate than fine-tuned specialists.
The detect-then-segment pipeline
For most robotics, the cleanest pattern:
- Run YOLO for fast detection — produces bounding boxes.
- For each detection, run SAM with the box as prompt — produces a precise mask.
```python
from ultralytics import YOLO
from segment_anything import sam_model_registry, SamPredictor

yolo = YOLO('yolov8n.pt')
sam = sam_model_registry['vit_b'](checkpoint='sam_vit_b_01ec64.pth')
predictor = SamPredictor(sam)

frame = capture()                      # RGB frame as a NumPy array
detections = yolo(frame)[0]            # first Results object for this frame
predictor.set_image(frame)             # runs SAM's image encoder once

masks = []
for box in detections.boxes:
    xyxy = box.xyxy[0].cpu().numpy()   # [x1, y1, x2, y2] pixel coordinates
    mask, _, _ = predictor.predict(box=xyxy, multimask_output=False)
    masks.append(mask[0])              # (H, W) boolean mask
```
Combines YOLO's speed with SAM's mask precision. Latency is dominated by SAM's image encoder, which runs once per frame (~500 ms on a Jetson Orin Nano); the per-box mask decoder adds only a few milliseconds, so 1–3 boxes per frame still runs at roughly 2 Hz. That is fine for many manipulation pipelines.
SAM 2 for video
SAM 2 propagates masks across frames given an initial click. Useful for:
- Tracking: click an object once; SAM 2 follows it.
- Manipulation: grasp a cup, then track it through the place motion.
- Annotation: click a few frames; SAM 2 fills in the rest. Cuts annotation time by 10×.
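A tracking loop with SAM 2 is a sketch like the following. Heavy hedging here: the function names (`build_sam2_video_predictor`, `init_state`, `add_new_points_or_box`, `propagate_in_video`) and the config/checkpoint filenames follow the `sam2` package from facebookresearch, but check them against the release you install. The `mask_iou` helper is my own addition for sanity-checking propagation.

```python
import numpy as np

def mask_iou(a, b):
    """IoU between two boolean masks -- useful for spotting track drift."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def track_object(video_dir, x, y,
                 checkpoint='sam2_hiera_base_plus.pt',
                 cfg='sam2_hiera_b+.yaml'):
    # Requires the sam2 package and downloaded weights; names are assumptions.
    from sam2.build_sam import build_sam2_video_predictor
    predictor = build_sam2_video_predictor(cfg, checkpoint)
    state = predictor.init_state(video_path=video_dir)
    # One click on frame 0 seeds the whole track.
    predictor.add_new_points_or_box(
        state, frame_idx=0, obj_id=1,
        points=np.array([[x, y]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )
    masks = {}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks[frame_idx] = (mask_logits[0] > 0).cpu().numpy()  # threshold logits
    return masks
```

If `mask_iou` between consecutive frames suddenly drops, the track has likely jumped to the background, and it is time to re-prompt.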
Production gotchas
- SAM eats memory: ViT-H model is 2.5 GB. Use ViT-B (~360 MB) for edge deployments.
- Mask boundaries are jittery in video: temporal smoothing or SAM 2 helps.
- Touching objects merge: instance segmentation can fail when objects are close. SAM with separate prompts per object is the workaround.
- Background as foreground: SAM sometimes returns "the table" instead of "the cup." Provide negative prompt clicks to refine.
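The negative-click refinement in the last bullet is just a second prompt point with label 0. A sketch, assuming `predictor` is a `SamPredictor` with `set_image()` already called; the helper name is mine:

```python
import numpy as np

def refine_with_negative(predictor, pos_xy, neg_xy):
    """Re-prompt SAM with a background click to peel off a merged region.

    pos_xy: (x, y) click on the object you want (label 1 = include).
    neg_xy: (x, y) click on the region SAM wrongly grabbed (label 0 = exclude).
    """
    masks, _, _ = predictor.predict(
        point_coords=np.array([pos_xy, neg_xy]),
        point_labels=np.array([1, 0]),
        multimask_output=False,
    )
    return masks[0]                    # (H, W) boolean mask
```

If SAM returned "the table" instead of "the cup," one negative click on the tabletop usually snaps the mask back to the cup.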
What you do with masks
- Grasping: 3D point cloud filtered to mask pixels → only the object's geometry.
- Pushing / pose estimation: principal-component analysis on the masked point cloud → object orientation.
- Free-space: invert "drivable surface" mask → obstacles for navigation.
- VLA conditioning: provide the mask as an extra input channel to a VLA fine-tune. Improves task focus.
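The pose-estimation bullet above is a few lines of linear algebra: PCA on the masked points recovers the object's long axis. A self-contained sketch with synthetic points standing in for a real masked cloud:

```python
import numpy as np

def principal_axis(points_xyz):
    """Dominant axis of an object's masked point cloud via PCA.

    points_xyz: (N, 3) array of 3D points belonging to one object's mask.
    Returns (centroid, unit direction of the longest axis). The sign of
    the direction is arbitrary -- PCA cannot distinguish +axis from -axis.
    """
    centroid = points_xyz.mean(axis=0)
    centered = points_xyz - centroid
    # Right-singular vectors of the centered cloud = principal axes.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centroid, vt[0]

# Synthetic elongated object along x: the recovered axis should be ~[1, 0, 0].
rng = np.random.default_rng(0)
pts = rng.normal(scale=[0.10, 0.01, 0.01], size=(500, 3))
c, axis = principal_axis(pts)
print(np.round(np.abs(axis), 2))       # ≈ [1. 0. 0.]
```

For a cup or a box this axis, plus the centroid, is often enough to pick a grasp approach direction without a full 6-DoF pose estimator.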
The 2026 stack
For an indoor mobile manipulator:
- YOLOv8 detects objects of interest at 25 fps.
- SAM segments selected detections at 5 fps.
- Mask + depth → 3D point cloud per object.
- Grasp planner takes the cloud as input.
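Step 3 of the stack above (mask + depth → per-object point cloud) is a pinhole back-projection restricted to the masked pixels. A sketch, with assumed camera intrinsics `fx, fy, cx, cy`:

```python
import numpy as np

def masked_point_cloud(depth_m, mask, fx, fy, cx, cy):
    """Back-project only the masked pixels into a 3D point cloud.

    depth_m: (H, W) depth in meters, aligned to the mask's image.
    mask:    (H, W) boolean instance mask.
    Returns (N, 3) points in the camera frame.
    """
    v, u = np.nonzero(mask & (depth_m > 0))   # pixel rows/cols inside the mask
    z = depth_m[v, u]
    x = (u - cx) * z / fx                      # pinhole camera model
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1)

# Toy check: a flat 20x20-pixel object 0.5 m away, near the principal point.
depth = np.full((480, 640), 0.5)
mask = np.zeros((480, 640), dtype=bool)
mask[230:250, 310:330] = True
cloud = masked_point_cloud(depth, mask, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
print(cloud.shape)                             # (400, 3)
print(round(float(cloud[:, 2].mean()), 2))     # 0.5
```

Masking before back-projection is what keeps the grasp planner from seeing the table: only the object's geometry reaches it.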
For an autonomous vehicle:
- BEVFusion handles detection + segmentation jointly on LiDAR + camera fusion.
- Output: BEV occupancy grid with instance labels.
- Planner takes the grid as obstacle layer.
Different platforms, different stacks; same fundamental task.
Exercise
Run SAM on an image of a cluttered table. Click on each object; collect masks. Then run YOLO detection + SAM auto-prompt and compare. The hand-clicked SAM masks are your "ground truth"; the YOLO+SAM pipeline is what you can actually deploy. Tune the YOLO confidence threshold until the detection count matches your click count. It takes five minutes and teaches the production tradeoff.
Next
Sensor fusion with visual-inertial odometry — combining the camera with an IMU for robust pose estimation in places GPS can't reach.