RobotForge

Vision-language models for embodied tasks

CLIP, GroundingDINO, SAM, and how modern pipelines glue them together for open-vocabulary robot perception. Detect 'the red mug' with no fine-tuning, segment it, grasp it.

by RobotForge
#perception #vlm #open-vocabulary

For decades, robot perception was a closed-set problem: train YOLO on the 80 classes you'll ever care about. The 2021 CLIP paper changed that. Vision-language models map images and text into a shared embedding space, enabling open-vocabulary perception. Tell the robot "find the red mug" and it does, with no fine-tuning. Here's the modern stack.

What CLIP gave us

CLIP (Radford et al., 2021): a contrastive model trained on 400M image-text pairs. Two encoders — one for images, one for text — projected into the same 512-D space. Similar things end up close together.

The capabilities this unlocks:

  • Zero-shot classification: encode candidate text labels; pick the closest to the image embedding.
  • Image retrieval: search for "a red plate on a wood table" without ever training on plates.
  • Semantic similarity: distance between encodings reflects semantic relatedness.
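
Zero-shot classification takes only a few lines. A minimal sketch using the openai/clip package; the image path and label set are placeholders:

import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model, preprocess = clip.load('ViT-B/32', device=device)

image = preprocess(Image.open('scene.jpg')).unsqueeze(0).to(device)
labels = ['a red mug', 'a blue plate', 'a wooden spoon']
text = clip.tokenize(labels).to(device)

with torch.no_grad():
    # Scaled cosine similarities between the image and each label.
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

print(labels[probs.argmax().item()])  # best-matching label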

For robotics, CLIP didn't directly solve perception, but it provided the prior: knowledge of what things look like, mapped to natural language. Subsequent models stand on this base.

The modern open-vocabulary stack

GroundingDINO (2023)

Open-vocabulary object detection. Input: an image and a text query ("the red cup"). Output: bounding boxes around the queried object.

  • Strengths: detects anything you can describe in language. No class-set restriction.
  • Weaknesses: slower than YOLO (~200 ms per frame on a Jetson Orin); accuracy depends on how the text prompt is phrased.

SAM (covered earlier)

Segment Anything. Combine with GroundingDINO: detect a box from text, then segment within the box. Result: pixel-precise masks for arbitrary text queries.

OWL-ViT, OWLv2 (Google, 2022/2023)

Faster than GroundingDINO; comparable accuracy. Useful for production deployment.

YOLOWorld (2024)

Open-vocabulary YOLO. Single forward pass with text prompts; ~30 fps. Becoming the production sweet spot.

The "detect then segment" pipeline

import torch
from torchvision.ops import box_convert
from groundingdino.util.inference import load_model, load_image, predict
from segment_anything import sam_model_registry, SamPredictor

# load_model takes a config plus a checkpoint; paths here are placeholders.
dino = load_model('GroundingDINO_SwinT_OGC.py', 'groundingdino_swint_ogc.pth')
sam = sam_model_registry['vit_b'](checkpoint='sam_vit_b.pth')
sam_predictor = SamPredictor(sam)

# load_image returns the raw RGB array plus a normalized tensor for DINO;
# 'frame.jpg' stands in for a camera capture.
image_source, image = load_image('frame.jpg')
prompt = 'a blue mug. a coffee mug.'
boxes, logits, phrases = predict(dino, image, prompt,
                                 box_threshold=0.35, text_threshold=0.25)

# DINO boxes are normalized cxcywh; SAM wants xyxy pixel coordinates.
h, w, _ = image_source.shape
boxes_xyxy = box_convert(boxes * torch.tensor([w, h, w, h]),
                         in_fmt='cxcywh', out_fmt='xyxy').numpy()

sam_predictor.set_image(image_source)
for box in boxes_xyxy:
    masks, _, _ = sam_predictor.predict(box=box, multimask_output=False)
    # Use masks[0] for grasping...

Twenty-odd lines for "find blue mugs in any scene and get pixel masks." No task-specific training. The robot just knows what mugs look like, because CLIP and GroundingDINO were trained on enough of them.

What "open-vocabulary" buys you

  • Generalization: deploy without retraining for new objects.
  • Language interface: end-users can describe what they want.
  • Composition: "the red mug to the left of the green plate" works.
  • Long-tail handling: rare objects don't need their own training data.

Where it still fails

  • Domain-specific objects: custom industrial parts CLIP has never seen. Falls back to generic descriptions or misses entirely.
  • Spatial reasoning is shallow: "the third object from the left" works sometimes; "stack these" rarely.
  • Temporal reasoning: "the cup that was just moved." VLMs see frames independently.
  • Latency: 200+ ms per query. Not for real-time control.
  • Hallucination: confidently identifying objects that aren't there.

VLMs as planners

Beyond detection, large multimodal LLMs (GPT-4V, Claude 3.5 Sonnet, Gemini Pro Vision) can:

  • Take an image of a workspace.
  • Generate a step-by-step plan in natural language.
  • Identify objects, estimate spatial relationships.
  • Caption scenes for downstream models.

Used in:

  • VoxPoser: an LLM composes VLM detections into 3D value maps that guide motion planning.
  • SayCan: an LLM grounds high-level commands by scoring candidate skills against learned affordance values.
  • Code as Policies: an LLM writes Python code that calls robot APIs.

These are research-track in 2026, not yet production-stable. The 3–10 second latency per VLM call is prohibitive for closed-loop control; they work as one-shot planners for high-level decisions.
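
A one-shot planning call can be as simple as the sketch below, using the openai Python client; the model name, prompt, and image path are assumptions, and the returned text still has to be parsed and grounded in executable skills:

import base64
from openai import OpenAI

client = OpenAI()
with open('workspace.jpg', 'rb') as f:  # placeholder workspace image
    b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model='gpt-4o',  # any vision-capable chat model
    messages=[{'role': 'user', 'content': [
        {'type': 'text',
         'text': 'List numbered pick-and-place steps to clear this table.'},
        {'type': 'image_url',
         'image_url': {'url': f'data:image/jpeg;base64,{b64}'}},
    ]}],
)
plan = resp.choices[0].message.content  # parse into skill calls downstream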

Foundation models for embeddings

Beyond detection, VLM embeddings are useful as features for downstream tasks:

  • Place recognition for SLAM: NetVLAD-style retrieval using CLIP embeddings.
  • Reward functions: similarity to a goal image as a reward signal for RL.
  • Visual data filtering: rank training data by relevance.
  • Episode summarization: caption robot trajectories with language.

The general pattern: use the VLM as a feature extractor; build classical or learned downstream pipelines on top.
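
As an example of the reward-function case, a goal-image reward is just cosine similarity in CLIP space. A sketch, again assuming the openai/clip package and placeholder image paths:

import torch
import clip
from PIL import Image

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model, preprocess = clip.load('ViT-B/32', device=device)

def embed(img_path):
    # Unit-normalized CLIP image embedding.
    image = preprocess(Image.open(img_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        feat = model.encode_image(image)
    return feat / feat.norm(dim=-1, keepdim=True)

goal = embed('goal.jpg')

def reward(frame_path):
    # Cosine similarity to the goal image; higher means closer to done.
    return (embed(frame_path) @ goal.T).item()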

Production deployment patterns

Slow VLM + fast specialist

VLM detects the object once with text prompt; tracker + fine-tuned specialist follow it across many frames. Combines flexibility with speed.
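
A sketch of the loop, with two hypothetical helpers: detect_with_vlm() wraps the GroundingDINO call above and returns an xyxy pixel box, and frames() yields camera images:

import cv2  # TrackerCSRT requires opencv-contrib-python

# Hypothetical helpers: detect_with_vlm() and frames() stand in for the
# detection pipeline above and the camera driver.
query = 'the blue mug'
stream = frames()

frame = next(stream)
x0, y0, x1, y1 = detect_with_vlm(frame, query)   # slow VLM call, once
tracker = cv2.TrackerCSRT_create()
tracker.init(frame, (x0, y0, x1 - x0, y1 - y0))  # tracker wants (x, y, w, h)

for frame in stream:
    ok, (x, y, w, h) = tracker.update(frame)     # fast, every frame
    if not ok:
        # Lost the target: pay for another slow VLM call to re-seed.
        x0, y0, x1, y1 = detect_with_vlm(frame, query)
        tracker = cv2.TrackerCSRT_create()
        tracker.init(frame, (x0, y0, x1 - x0, y1 - y0))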

VLM as data labeler

VLM generates pseudo-labels for unlabeled images. Train a fast specialist on the pseudo-labels. Specialist deploys; VLM stays in the lab.
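
The labeling side is a short script. A sketch assuming the GroundingDINO demo API, placeholder paths, a single-class label map, and a confidence cutoff you'd tune:

from pathlib import Path
from groundingdino.util.inference import load_model, load_image, predict

dino = load_model('GroundingDINO_SwinT_OGC.py', 'groundingdino_swint_ogc.pth')

for path in Path('unlabeled').glob('*.jpg'):
    _, image = load_image(str(path))
    boxes, scores, _ = predict(dino, image, 'a coffee mug.',
                               box_threshold=0.35, text_threshold=0.25)
    # DINO's normalized cxcywh output is already YOLO label format.
    with open(path.with_suffix('.txt'), 'w') as f:
        for (cx, cy, w, h), s in zip(boxes.tolist(), scores.tolist()):
            if s > 0.5:  # keep only confident pseudo-labels
                f.write(f'0 {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}\n')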

VLM in the loop

For tasks where latency is acceptable (1–10 s), VLM directly outputs grasp targets, navigation goals, etc. SayCan and VoxPoser do this.

The text-prompt engineering layer

VLMs are prompt-sensitive. Detection accuracy varies a lot by phrasing:

  • "the cup" — 70% mAP.
  • "a blue ceramic mug" — 85% mAP.
  • "the cup to the left of the table" — 60% mAP (spatial reasoning weak).

Production usage often has a "prompt template" with hand-tuned phrasings per task. Worth iterating on.
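
In practice this layer can be as simple as a hand-tuned lookup. A hypothetical sketch; the phrasings would come from validation runs on your own scenes:

# Hypothetical prompt templates, tuned per task against held-out scenes.
DETECTION_PROMPTS = {
    'mug':   'a ceramic coffee mug with a handle',
    'block': 'a small colored wooden cube',
    'plate': 'a round dinner plate',
}

def detection_prompt(obj: str) -> str:
    # Fall back to a generic phrase for object names we haven't tuned.
    return DETECTION_PROMPTS.get(obj, f'a {obj}')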

Where this is going

  • Faster VLMs: distillation + quantization producing real-time variants. YOLOWorld is one example; more coming.
  • Spatial reasoning improvements: 3D-aware VLMs (LERF, VLMaps).
  • Tighter integration with VLAs: shared encoders for perception and action. π0 and Gemini Robotics already do this.
  • On-device VLMs: Llama-3.2-Vision, MobileVLM. Deployable on edge.

Exercise

Run GroundingDINO + SAM on an indoor scene. Try prompts: "cup", "blue cup", "the cup on the right", "red coffee cup with handle". Note how detection quality depends on phrasing. Build a small grasping demo where the robot picks the object the user describes via text. The first time you say "the green block" and the robot does it without fine-tuning, the field's progress hits hard.

That's the Perception track done

You've covered the full progression: camera models → classical features → optical flow / SfM → depth sensors → LiDAR → 2D detection → segmentation → VIO → deep CV considerations → vision-language models. Combined with SLAM (just completed), you have the full perception-and-estimation backbone of modern robotics. Apply it across the application tracks (Manipulation, Mobile, Frontiers) for end-to-end robot intelligence.
