RobotForge
Published · ~13 min read

Semantic and instance segmentation

Pixel-precise masks instead of bounding boxes. SAM, Mask2Former, YOLOv8-Seg — what each gives you, and how to pick the right tool for grasping, mapping, or scene understanding.

by RobotForge
#perception #segmentation #sam

A bounding box says "the cup is here-ish." A mask says "here are the 1843 pixels that are exactly the cup." For grasping, you need the mask. For free-space mapping, you need the mask. For detecting "the floor I can drive on" pixel by pixel, you need the mask. Segmentation is the next step up in precision from detection, and 2023's Segment Anything Model (SAM) made it dramatically more practical.

Three flavors of segmentation

Type     | What it produces                                | Use case
Semantic | Class label per pixel ("road", "person")        | Free-space, sky, terrain understanding
Instance | Mask per individual object instance             | Grasping (need to separate touching cups)
Panoptic | Both: every pixel labeled, instances separated  | Autonomous driving (full scene)

Pick by what you need: semantic for "what kind of surface is this?", instance for "which individual object?", panoptic for "tell me everything."

The 2026 model landscape

Segment Anything (SAM, Meta 2023)

The breakthrough. A foundation model for segmentation: prompt it with a click, a box, or a text label, get back a precise mask. Trained on 11M images with 1B+ masks.

  • Strengths: zero-shot. Works on any object without fine-tuning. Mask quality is consistently good.
  • Weaknesses: doesn't classify (you give it the prompt). Slow on edge hardware (~500 ms on Jetson).
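
Prompting is cheap once the image is encoded. A minimal point-prompt sketch, assuming the official segment_anything package and a ViT-B checkpoint; the click coordinates are illustrative:

import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry['vit_b'](checkpoint='sam_vit_b_01ec64.pth')
predictor = SamPredictor(sam)
predictor.set_image(image)                    # image: RGB numpy array (H, W, 3)

point = np.array([[320, 240]])                # (x, y) of a single foreground click
label = np.array([1])                         # 1 = foreground, 0 = background
masks, scores, _ = predictor.predict(point_coords=point,
                                     point_labels=label,
                                     multimask_output=True)
best = masks[np.argmax(scores)]               # keep the highest-scoring of 3 candidates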

SAM 2 (2024) handles videos with mask propagation across frames.

Mask2Former (Meta, 2022)

Universal segmentation: same architecture handles semantic, instance, and panoptic. Higher accuracy than SAM on benchmarks; needs fine-tuning per dataset.

  • Strengths: state-of-the-art accuracy on supervised benchmarks. Outputs class labels.
  • Weaknesses: needs fine-tuning; not zero-shot. Slow.
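
You can try it without touching the original repo: Hugging Face transformers ships Mask2Former. A sketch, with the checkpoint name as an example; swap in the semantic, instance, or panoptic variant that matches your task:

import torch
from PIL import Image
from transformers import AutoImageProcessor, Mask2FormerForUniversalSegmentation

ckpt = 'facebook/mask2former-swin-tiny-coco-instance'   # example checkpoint
processor = AutoImageProcessor.from_pretrained(ckpt)
model = Mask2FormerForUniversalSegmentation.from_pretrained(ckpt)

image = Image.open('table.jpg')
inputs = processor(images=image, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# One dict per image: a 'segmentation' id map plus per-instance 'segments_info'
result = processor.post_process_instance_segmentation(
    outputs, target_sizes=[image.size[::-1]])[0]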

YOLOv8-Seg / YOLOv11-Seg

YOLO's instance segmentation variant. Fast, integrates with the YOLO pipeline. Good for production where speed matters and the class set is known.

  • Strengths: fast (similar latency to YOLO detection). One-pass inference.
  • Weaknesses: mask quality is rougher than SAM/Mask2Former.
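
Usage mirrors YOLO detection; a minimal sketch with the ultralytics package, using the nano segmentation checkpoint as an example:

from ultralytics import YOLO

model = YOLO('yolov8n-seg.pt')        # '-seg' variants add a mask head
results = model('table.jpg')[0]       # one Results object per image
masks = results.masks.data            # (N, H, W) tensor, one mask per instance
classes = results.boxes.cls           # class index per instance, same order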

Open-vocabulary (LISA, GroundingSAM)

Combine vision-language models with segmentation. Prompt with text: "the red object on the left." Get back a mask.

  • Strengths: language-conditioned. Useful when class set is dynamic.
  • Weaknesses: slower than YOLO; less accurate than fine-tuned specialists.
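
A GroundingSAM-style sketch via Hugging Face transformers: Grounding DINO turns the text prompt into boxes, which then become SAM box prompts. The checkpoint name and prompt are examples:

import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

ckpt = 'IDEA-Research/grounding-dino-tiny'     # example checkpoint
processor = AutoProcessor.from_pretrained(ckpt)
model = AutoModelForZeroShotObjectDetection.from_pretrained(ckpt)

image = Image.open('table.jpg')
inputs = processor(images=image, text='a red cup.', return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

result = processor.post_process_grounded_object_detection(
    outputs, inputs.input_ids, target_sizes=[image.size[::-1]])[0]
# result['boxes'] can now be fed to SAM as box prompts, exactly as in the
# detect-then-segment pipeline below.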

The detect-then-segment pipeline

For most robotics applications, the cleanest pattern is:

  1. Run YOLO for fast detection — produces bounding boxes.
  2. For each detection, run SAM with the box as prompt — produces a precise mask.

from ultralytics import YOLO
from segment_anything import sam_model_registry, SamPredictor

yolo = YOLO('yolov8n.pt')
sam = sam_model_registry['vit_b'](checkpoint='sam_vit_b_01ec64.pth')
predictor = SamPredictor(sam)

frame = capture()                 # your camera grab; RGB numpy array (H, W, 3)
results = yolo(frame)[0]          # ultralytics returns one Results object per image
predictor.set_image(frame)        # SAM encodes the frame once; box prompts are cheap after this

masks = []
for box in results.boxes.xyxy.cpu().numpy():   # (x1, y1, x2, y2) per detection
    mask, _, _ = predictor.predict(box=box, multimask_output=False)
    masks.append(mask[0])         # predict returns (1, H, W); keep the 2-D mask

This combines YOLO's speed with SAM's mask precision. Latency is dominated by SAM's image encoder (~500 ms on a Jetson Orin Nano), which runs once per frame in set_image; each box prompt then decodes in tens of milliseconds, so 1–3 boxes per frame add little. That works out to roughly 2 Hz, fine for many manipulation pipelines.

SAM 2 for video

SAM 2 propagates masks across frames given an initial click. Useful for:

  • Tracking: click an object once; SAM 2 follows it.
  • Manipulation: grasp a cup, then track it through the place motion.
  • Annotation: click a few frames; SAM 2 fills in the rest. Cuts annotation time by 10×.
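
A sketch of Meta's sam2 video predictor; this assumes the reference repo's entry points (build_sam2_video_predictor, init_state on a folder of JPEG frames), which can shift between releases:

import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor('sam2_hiera_b+.yaml', 'sam2_hiera_base_plus.pt')
state = predictor.init_state(video_path='frames/')   # directory of JPEG frames

with torch.inference_mode():
    # One click on the object in frame 0 seeds the track
    predictor.add_new_points_or_box(
        state, frame_idx=0, obj_id=1,
        points=np.array([[320, 240]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32))

    # Propagate that mask through the rest of the clip
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        mask = (mask_logits[0] > 0).cpu().numpy()    # binary mask for obj_id 1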

Production gotchas

  • SAM eats memory: ViT-H model is 2.5 GB. Use ViT-B (~360 MB) for edge deployments.
  • Mask boundaries are jittery in video: temporal smoothing or SAM 2 helps.
  • Touching objects merge: instance segmentation can fail when objects are close. SAM with separate prompts per object is the workaround.
  • Background as foreground: SAM sometimes returns "the table" instead of "the cup." Provide negative prompt clicks to refine.
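
For that last gotcha, refinement is just another predict call with a background point, reusing the predictor from the pipeline above; the coordinates here are illustrative:

import numpy as np

points = np.array([[410, 300],        # positive click on the cup
                   [380, 350]])       # negative click on the table surface
labels = np.array([1, 0])             # 1 = foreground, 0 = background
mask, _, _ = predictor.predict(point_coords=points, point_labels=labels,
                               multimask_output=False)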

What you do with masks

  • Grasping: 3D point cloud filtered to mask pixels → only the object's geometry (see the sketch after this list).
  • Pushing / pose estimation: principal-component analysis on the masked point cloud → object orientation.
  • Free-space: invert "drivable surface" mask → obstacles for navigation.
  • VLA conditioning: provide the mask as an extra input channel to a VLA fine-tune. Improves task focus.
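
The first two bullets in code: back-project the masked depth pixels into a point cloud, then take its principal axis. A sketch assuming a depth image aligned to the RGB frame and pinhole intrinsics (fx, fy, cx, cy); all names are illustrative:

import numpy as np

def masked_point_cloud(depth, mask, fx, fy, cx, cy):
    v, u = np.nonzero(mask)                    # pixel coordinates inside the mask
    z = depth[v, u]
    u, v, z = u[z > 0], v[z > 0], z[z > 0]     # drop missing depth readings
    x = (u - cx) * z / fx                      # pinhole back-projection
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)         # (N, 3) object-only cloud

def principal_axis(cloud):
    centered = cloud - cloud.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]                               # dominant direction = object orientation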

The 2026 stack

For an indoor mobile manipulator:

  1. YOLOv8 detects objects of interest at 25 fps.
  2. SAM segments selected detections at ~2 fps.
  3. Mask + depth → 3D point cloud per object.
  4. Grasp planner takes the cloud as input.
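
Steps 2–4 as glue code, tying together the detect-then-segment snippet and the masked_point_cloud helper above; grasp_planner is a hypothetical stand-in for whatever planner you use:

# masks come from the detect-then-segment pipeline; depth is aligned to the RGB frame
for mask in masks:
    cloud = masked_point_cloud(depth, mask, fx, fy, cx, cy)   # step 3
    grasp = grasp_planner.plan(cloud)   # step 4; hypothetical planner API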

For an autonomous vehicle:

  1. BEVFusion handles detection + segmentation jointly on LiDAR + camera fusion.
  2. Output: BEV occupancy grid with instance labels.
  3. Planner takes the grid as obstacle layer.

Different platforms, different stacks; same fundamental task.

Exercise

Run SAM on an image of a cluttered table. Click on each object; collect masks. Then run YOLO detection + SAM auto-prompt and compare. The hand-clicked SAM masks are your "ground truth"; the YOLO+SAM pipeline is what you can deploy. Tune the YOLO confidence threshold until the detection count matches your click count. Five minutes of work that teaches the production tradeoff.
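
A simple way to score the comparison is per-object IoU between each hand-clicked mask and its closest pipeline mask; a minimal helper:

import numpy as np

def iou(mask_a, mask_b):
    # Intersection-over-union of two boolean masks
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union if union else 0.0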

Next

Sensor fusion with visual-inertial odometry — combining the camera with an IMU for robust pose estimation in places GPS can't reach.
