RobotForge
Published · ~13 min read

Semantic and instance segmentation

Pixel-precise masks instead of bounding boxes. SAM, Mask2Former, YOLOv8-Seg — what each gives you, and how to pick the right tool for grasping, mapping, or scene understanding.

by RobotForge
#perception #segmentation #sam

A bounding box says "the cup is here-ish." A mask says "here are the 1843 pixels that are exactly the cup." For grasping, you need the mask. For free-space mapping, you need the mask. For detecting "the floor I can drive on" pixel by pixel, you need the mask. Segmentation is the next step up in precision from detection, and 2023's Segment Anything Model (SAM) made it dramatically more practical.

Three flavors of segmentation

Type     | What it produces                                | Use case
Semantic | Class label per pixel ("road", "person")        | Free-space, sky, terrain understanding
Instance | Mask per individual object instance             | Grasping (need to separate touching cups)
Panoptic | Both: every pixel labeled, instances separated  | Autonomous driving (full scene)

Pick by what you need: semantic for "what kind of surface is this?", instance for "which individual object?", panoptic for "tell me everything."

The 2026 model landscape

Segment Anything (SAM, Meta 2023)

The breakthrough. A foundation model for segmentation: prompt it with a click, a box, or a text label, get back a precise mask. Trained on 11M images with 1B+ masks.

  • Strengths: zero-shot. Works on any object without fine-tuning. Mask quality is consistently good.
  • Weaknesses: doesn't classify (you give it the prompt). Slow on edge hardware (~500 ms on Jetson).
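
Prompting is cheap once the image is encoded. A minimal point-prompt sketch, assuming the official segment_anything package and a ViT-B checkpoint; the click coordinates are illustrative:

import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry['vit_b'](checkpoint='sam_vit_b_01ec64.pth')
predictor = SamPredictor(sam)
predictor.set_image(image)                    # image: RGB numpy array (H, W, 3)

point = np.array([[320, 240]])                # (x, y) of a single foreground click
label = np.array([1])                         # 1 = foreground, 0 = background
masks, scores, _ = predictor.predict(point_coords=point,
                                     point_labels=label,
                                     multimask_output=True)
best = masks[np.argmax(scores)]               # keep the highest-scoring of 3 candidates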

SAM 2 (2024) handles videos with mask propagation across frames.

Mask2Former (Meta, 2022)

Universal segmentation: same architecture handles semantic, instance, and panoptic. Higher accuracy than SAM on benchmarks; needs fine-tuning per dataset.

  • Strengths: state-of-the-art accuracy on supervised benchmarks. Outputs class labels.
  • Weaknesses: needs fine-tuning; not zero-shot. Slow.
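
You can try it without touching the original repo: Hugging Face transformers ships Mask2Former. A sketch, with the checkpoint name as an example; swap in the semantic, instance, or panoptic variant that matches your task:

import torch
from PIL import Image
from transformers import AutoImageProcessor, Mask2FormerForUniversalSegmentation

ckpt = 'facebook/mask2former-swin-tiny-coco-instance'   # example checkpoint
processor = AutoImageProcessor.from_pretrained(ckpt)
model = Mask2FormerForUniversalSegmentation.from_pretrained(ckpt)

image = Image.open('table.jpg')
inputs = processor(images=image, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# One dict per image: a 'segmentation' id map plus per-instance 'segments_info'
result = processor.post_process_instance_segmentation(
    outputs, target_sizes=[image.size[::-1]])[0]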

YOLOv8-Seg / YOLOv11-Seg

YOLO's instance segmentation variant. Fast, integrates with the YOLO pipeline. Good for production where speed matters and the class set is known.

  • Strengths: fast (similar latency to YOLO detection). One-pass inference.
  • Weaknesses: mask quality is rougher than SAM/Mask2Former.
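
Usage mirrors YOLO detection; a minimal sketch with the ultralytics package, using the nano segmentation checkpoint as an example:

from ultralytics import YOLO

model = YOLO('yolov8n-seg.pt')        # '-seg' variants add a mask head
results = model('table.jpg')[0]       # one Results object per image
masks = results.masks.data            # (N, H, W) tensor, one mask per instance
classes = results.boxes.cls           # class index per instance, same order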

Open-vocabulary (LISA, GroundingSAM)

Combine vision-language models with segmentation. Prompt with text: "the red object on the left." Get back a mask.

  • Strengths: language-conditioned. Useful when class set is dynamic.
  • Weaknesses: slower than YOLO; less accurate than fine-tuned specialists.
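
A GroundingSAM-style sketch via Hugging Face transformers: Grounding DINO turns the text prompt into boxes, which then become SAM box prompts. The checkpoint name and prompt are examples:

import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

ckpt = 'IDEA-Research/grounding-dino-tiny'     # example checkpoint
processor = AutoProcessor.from_pretrained(ckpt)
model = AutoModelForZeroShotObjectDetection.from_pretrained(ckpt)

image = Image.open('table.jpg')
inputs = processor(images=image, text='a red cup.', return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

result = processor.post_process_grounded_object_detection(
    outputs, inputs.input_ids, target_sizes=[image.size[::-1]])[0]
# result['boxes'] can now be fed to SAM as box prompts, exactly as in the
# detect-then-segment pipeline below.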

The detect-then-segment pipeline

For most robotics applications, the cleanest pattern is:

  1. Run YOLO for fast detection — produces bounding boxes.
  2. For each detection, run SAM with the box as prompt — produces a precise mask.

from ultralytics import YOLO
from segment_anything import sam_model_registry, SamPredictor

yolo = YOLO('yolov8n.pt')
sam = sam_model_registry['vit_b'](checkpoint='sam_vit_b_01ec64.pth')
predictor = SamPredictor(sam)

frame = capture()                 # your camera grab; RGB numpy array (H, W, 3)
results = yolo(frame)[0]          # ultralytics returns one Results object per image
predictor.set_image(frame)        # SAM encodes the frame once; box prompts are cheap after this

masks = []
for box in results.boxes.xyxy.cpu().numpy():   # (x1, y1, x2, y2) per detection
    mask, _, _ = predictor.predict(box=box, multimask_output=False)
    masks.append(mask[0])         # predict returns (1, H, W); keep the 2-D mask

This combines YOLO's speed with SAM's mask precision. Latency is dominated by SAM's image encoder (~500 ms on a Jetson Orin Nano), which runs once per frame in set_image; each box prompt then decodes in tens of milliseconds, so 1–3 boxes per frame add little. That works out to roughly 2 Hz, fine for many manipulation pipelines.

SAM 2 for video

SAM 2 propagates masks across frames given an initial click. Useful for:

  • Tracking: click an object once; SAM 2 follows it.
  • Manipulation: grasp a cup, then track it through the place motion.
  • Annotation: click a few frames; SAM 2 fills in the rest. Cuts annotation time by 10×.
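
A sketch of Meta's sam2 video predictor; this assumes the reference repo's entry points (build_sam2_video_predictor, init_state on a folder of JPEG frames), which can shift between releases:

import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor('sam2_hiera_b+.yaml', 'sam2_hiera_base_plus.pt')
state = predictor.init_state(video_path='frames/')   # directory of JPEG frames

with torch.inference_mode():
    # One click on the object in frame 0 seeds the track
    predictor.add_new_points_or_box(
        state, frame_idx=0, obj_id=1,
        points=np.array([[320, 240]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32))

    # Propagate that mask through the rest of the clip
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        mask = (mask_logits[0] > 0).cpu().numpy()    # binary mask for obj_id 1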

Production gotchas

  • SAM eats memory: ViT-H model is 2.5 GB. Use ViT-B (~360 MB) for edge deployments.
  • Mask boundaries are jittery in video: temporal smoothing or SAM 2 helps.
  • Touching objects merge: instance segmentation can fail when objects are close. SAM with separate prompts per object is the workaround.
  • Background as foreground: SAM sometimes returns "the table" instead of "the cup." Provide negative prompt clicks to refine.
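
For that last gotcha, refinement is just another predict call with a background point, reusing the predictor from the pipeline above; the coordinates here are illustrative:

import numpy as np

points = np.array([[410, 300],        # positive click on the cup
                   [380, 350]])       # negative click on the table surface
labels = np.array([1, 0])             # 1 = foreground, 0 = background
mask, _, _ = predictor.predict(point_coords=points, point_labels=labels,
                               multimask_output=False)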

What you do with masks

  • Grasping: 3D point cloud filtered to mask pixels → only the object's geometry (see the sketch after this list).
  • Pushing / pose estimation: principal-component analysis on the masked point cloud → object orientation.
  • Free-space: invert "drivable surface" mask → obstacles for navigation.
  • VLA conditioning: provide the mask as an extra input channel to a VLA fine-tune. Improves task focus.
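
The first two bullets in code: back-project the masked depth pixels into a point cloud, then take its principal axis. A sketch assuming a depth image aligned to the RGB frame and pinhole intrinsics (fx, fy, cx, cy); all names are illustrative:

import numpy as np

def masked_point_cloud(depth, mask, fx, fy, cx, cy):
    v, u = np.nonzero(mask)                    # pixel coordinates inside the mask
    z = depth[v, u]
    u, v, z = u[z > 0], v[z > 0], z[z > 0]     # drop missing depth readings
    x = (u - cx) * z / fx                      # pinhole back-projection
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)         # (N, 3) object-only cloud

def principal_axis(cloud):
    centered = cloud - cloud.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]                               # dominant direction = object orientation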

The 2026 stack

For an indoor mobile manipulator:

  1. YOLOv8 detects objects of interest at 25 fps.
  2. SAM segments selected detections at ~2 fps.
  3. Mask + depth → 3D point cloud per object.
  4. Grasp planner takes the cloud as input.
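
Steps 2–4 as glue code, tying together the detect-then-segment snippet and the masked_point_cloud helper above; grasp_planner is a hypothetical stand-in for whatever planner you use:

# masks come from the detect-then-segment pipeline; depth is aligned to the RGB frame
for mask in masks:
    cloud = masked_point_cloud(depth, mask, fx, fy, cx, cy)   # step 3
    grasp = grasp_planner.plan(cloud)   # step 4; hypothetical planner API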

For an autonomous vehicle:

  1. BEVFusion handles detection + segmentation jointly on LiDAR + camera fusion.
  2. Output: BEV occupancy grid with instance labels.
  3. Planner takes the grid as obstacle layer.

Different platforms, different stacks; same fundamental task.

Exercise

Run SAM on an image of a cluttered table. Click on each object; collect masks. Then run YOLO detection + SAM auto-prompt and compare. The hand-clicked SAM masks are your "ground truth"; the YOLO+SAM pipeline is what you can deploy. Tune the YOLO confidence threshold until the detection count matches your click count. Five minutes of work that teaches the production tradeoff.
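
A simple way to score the comparison is per-object IoU between each hand-clicked mask and its closest pipeline mask; a minimal helper:

import numpy as np

def iou(mask_a, mask_b):
    # Intersection-over-union of two boolean masks
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union if union else 0.0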

Next

Sensor fusion with visual-inertial odometry — combining the camera with an IMU for robust pose estimation in places GPS can't reach.
