RobotForge

What are VLAs (Vision-Language-Action models) and why they matter

π0, RT-2, OpenVLA, Gemini Robotics. A new kind of model that takes a camera image and a natural-language instruction and outputs robot motor commands. Here's what that unlocks — and what it doesn't.

by RobotForge
#learning #vla #foundation-models #imitation-learning

For twenty years, robot learning meant one policy per task. You wanted the robot to pour coffee? Train a pour-coffee policy. Now you want it to open a drawer? Start over. In the last two years, that contract broke. A new kind of model — the Vision-Language-Action (VLA) model — takes a camera image and a natural-language instruction like "pick up the blue cup" and outputs the robot's joint commands directly. One model. Hundreds of tasks.

What a VLA actually is

Strip away the hype and a VLA is a single neural network with three pieces (sketched in code right after this list):

  • Input: one or more camera images (wrist cam, scene cam, whatever you have) + a natural-language instruction + optional joint-state history.
  • Output: a sequence of robot actions — joint velocities, positions, or a delta end-effector pose — typically covering the next 0.5–2 seconds.
  • Internals: a pre-trained vision-language model (the "VL" part — usually a dense LLM with an image encoder) with an action decoder bolted on.
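
To make that contract concrete, here is a minimal sketch of the interface in Python. Everything in it is an assumption for illustration (the class name, shapes, and chunk length); no real library exposes exactly this API, but every VLA implements something shaped like it.

```python
import numpy as np

# Illustrative sketch of the VLA interface described above. The class name,
# shapes, and chunk length are assumptions, not any specific library's API.
class VLAPolicy:
    ACTION_DIM = 7   # e.g., 6-DoF end-effector delta + 1 gripper command
    CHUNK_LEN = 50   # ~1 s of actions at a 50 Hz control rate

    def __call__(self, images: list[np.ndarray], instruction: str,
                 joint_state: np.ndarray | None = None) -> np.ndarray:
        """Map (images, instruction[, joint state]) -> a chunk of actions.

        images:      one or more HxWx3 uint8 camera frames
        instruction: e.g., "pick up the blue cup"
        returns:     (CHUNK_LEN, ACTION_DIM) array of motor commands
        """
        # A real model runs a VLM backbone plus an action decoder here.
        return np.zeros((self.CHUNK_LEN, self.ACTION_DIM))

policy = VLAPolicy()
frame = np.zeros((224, 224, 3), dtype=np.uint8)
chunk = policy([frame], "pick up the blue cup")
print(chunk.shape)  # (50, 7)
```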

The magic isn't the architecture — it's the training data. VLAs are trained on massive, diverse datasets spanning many robots, many tasks, and many environments. The network generalizes because it has seen enough variation that your specific kitchen is just another draw from the training distribution.

Five generations in four years

1. RT-1 (2022) — proof of concept

Google's first real attempt. Trained on 130k teleoperated demos of a mobile manipulator across 700+ tasks. Good at tasks in distribution, poor out of it. Established the training recipe.

2. RT-2 (2023) — co-fine-tune with web data

Google's breakthrough: take a pre-trained vision-language model (PaLI), treat robot actions as just another kind of text token, and fine-tune jointly on web vision-language data plus robot demos. Result: the model could follow natural-language instructions never seen in the robot data (e.g., "move the banana to the sum of two plus one") because the web-data half knew what "banana" and "sum" meant.
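
The actions-as-tokens trick is simple enough to sketch. The snippet below shows the core move: clip each action dimension to a normalized range, discretize it into 256 bins (the count RT-2 used), and emit one integer per dimension, which then gets mapped onto rarely used text tokens in the VLM's vocabulary. The range and helper names here are illustrative, not RT-2's exact code.

```python
import numpy as np

N_BINS = 256            # RT-2 used 256 bins per action dimension
LOW, HIGH = -1.0, 1.0   # assumed normalized action range

def action_to_tokens(action: np.ndarray) -> list[int]:
    """Discretize a continuous action vector into one token id per dimension."""
    clipped = np.clip(action, LOW, HIGH)
    bins = np.round((clipped - LOW) / (HIGH - LOW) * (N_BINS - 1)).astype(int)
    return bins.tolist()  # ids are then mapped onto rarely used text tokens

def tokens_to_action(tokens: list[int]) -> np.ndarray:
    """Invert the discretization (up to quantization error)."""
    bins = np.asarray(tokens, dtype=float)
    return LOW + bins / (N_BINS - 1) * (HIGH - LOW)

a = np.array([0.12, -0.40, 0.05, 0.0, 0.0, 0.31, 1.0])  # 7-DoF action
print(tokens_to_action(action_to_tokens(a)))             # ~= a
```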

3. OpenVLA (2024) — the open version

Stanford, UC Berkeley, and the Toyota Research Institute, among others. A 7B-parameter VLA trained on the Open X-Embodiment dataset (970k trajectories, 22 robots). Weights on Hugging Face. Overnight, every lab could try a paradigm previously locked inside Google's walls. For most hobbyists, this is the first VLA they run.

4. π0 and π0.5 (2024–25) — flow-matching generalists

Physical Intelligence's models. Two tweaks to the VLA formula: flow matching for action decoding (smooth continuous actions instead of discretized tokens) and training on heterogeneous data mixing demos, videos, and synthetic sources. π0.5 (2025) extends to open-world environments: homes and kitchens it had never seen. OpenPI is the open-source release.
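
Flow matching sounds exotic, but the inference loop is tiny: start the action chunk as Gaussian noise and integrate a learned velocity field toward the data with a few Euler steps. Here is a toy sketch under loud assumptions: `velocity_field` stands in for the trained network, and the straight-line field below is hand-picked for illustration, not π0's learned one.

```python
import numpy as np

rng = np.random.default_rng(0)

def velocity_field(a_t: np.ndarray, t: float, obs: np.ndarray) -> np.ndarray:
    """Stand-in for the learned network v_theta(a_t, t | obs). Toy choice:
    point straight at a fixed 'target' chunk derived from the observation.
    For a straight-line path, the velocity at a_t is (target - a_t) / (1 - t)."""
    target = np.tanh(obs)[: a_t.size].reshape(a_t.shape)
    return (target - a_t) / max(1.0 - t, 1e-3)

def sample_action_chunk(obs, chunk_len=50, action_dim=7, steps=10):
    """Flow-matching inference: integrate from noise (t=0) toward data (t=1)."""
    a = rng.standard_normal((chunk_len, action_dim))  # a_0 ~ N(0, I)
    for i in range(steps):                            # Euler steps
        t = i / steps
        a = a + (1.0 / steps) * velocity_field(a, t, obs)
    return a

chunk = sample_action_chunk(obs=rng.standard_normal(50 * 7))
print(chunk.shape)  # (50, 7) -- a smooth chunk, no per-dimension binning
```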

5. Gemini Robotics (2025–26)

Google DeepMind's entry. Gemini 2.0 → Gemini Robotics, adding an action decoder. Gemini Robotics-ER ("Embodied Reasoning") lets the same model plan multi-step tasks in natural language before choosing actions. Gemini Robotics 1.5 does chain-of-thought reasoning before each action — the robot equivalent of "thinking, then doing."
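
The "thinking, then doing" loop is worth sketching because the idea is the control flow, not the model: reason over the goal in language first, then hand each subtask to the action policy. Both functions below are hypothetical stand-ins, not Gemini's API.

```python
# Hypothetical sketch of "plan in language, then act". Neither function is a
# real API; they stand in for an embodied-reasoning VLM and a low-level VLA.

def plan_in_language(goal: str) -> list[str]:
    """Stand-in for an ER-style model: decompose a goal into text subtasks.
    A real system would prompt the VLM; this canned plan is for illustration."""
    return [f"find the objects needed for '{goal}'",
            "grasp the relevant object",
            f"complete '{goal}' and verify"]

def vla_policy(subtask: str) -> str:
    """Stand-in for the low-level VLA turning one subtask into an action chunk."""
    return f"<action chunk for: {subtask}>"

goal = "put the apple in the drawer"
for step in plan_in_language(goal):
    print(step, "->", vla_policy(step))
```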

What VLAs are good at

  • Language grounding. Say "the red one," get the red one. No more hand-coding object detection rules.
  • Novel object generalization. A VLA trained on many cups can often pick up a cup it has never seen.
  • Task composition. "Put all the toys in the box" works because the model has seen "pick up toy" and "put in box" separately.
  • Cross-embodiment (increasingly). A model trained on one robot can sometimes transfer, with fine-tuning, to a different one. Still fragile but improving fast.

What VLAs still struggle with

  • Fine force control. Current VLAs output kinematic commands — joint or end-effector positions. Tasks that need precise force (threading a needle, turning a tight screw) are out of reach for the current generation.
  • Long horizons. A VLA confidently executes 30 seconds of action. Five-minute tasks need a planner on top.
  • Rare or dangerous scenarios. If your task isn't represented in training data, the model doesn't know it doesn't know.
  • Reaction time. Running a 7B-param model at 10 Hz takes a serious GPU, and edge deployment is still a research problem. The usual mitigation is action chunking, sketched below.
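
Action chunking is the standard way to amortize slow inference: pay for one forward pass, then stream the whole chunk to the arm at the control rate. A minimal sketch, with illustrative numbers:

```python
import time

CONTROL_HZ = 50      # servo rate the arm expects
INFERENCE_S = 0.25   # illustrative: one 7B forward pass on a mid-range GPU
CHUNK_LEN = 50       # actions per forward pass, ~1 s of motion

def infer_chunk(observation):
    """Stand-in for one VLA forward pass that returns CHUNK_LEN actions."""
    time.sleep(INFERENCE_S)  # simulate GPU latency
    return [f"action_{i}" for i in range(CHUNK_LEN)]

def control_loop(observe, send_action, seconds=1.5):
    """Pay for inference once per chunk, then stream it at CONTROL_HZ."""
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        chunk = infer_chunk(observe())  # blocking here, for clarity
        for action in chunk:
            send_action(action)
            time.sleep(1.0 / CONTROL_HZ)

control_loop(observe=lambda: "camera frame", send_action=print)
```

A real stack overlaps the next forward pass with execution of the current chunk so the arm never stalls; the blocking version above just makes the budget arithmetic visible: 0.25 s of GPU time buys about 1 s of motion.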

How to actually try one

Three paths from easiest to hardest:

  1. Run OpenVLA in a sim. LeRobot ships a wrapper; you can roll out the policy on LIBERO benchmarks in an afternoon on a good GPU (a minimal loading sketch follows this list).
  2. Fine-tune OpenVLA on your own robot demos. Collect 50–200 teleop demonstrations, use LoRA fine-tuning, see generalization emerge. LeRobot makes this a weekend project.
  3. Run π0 from OpenPI on an SO-100 arm. The Physical Intelligence open-source release. Hugging Face has a tutorial.
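
For path 1, loading the released checkpoint is a few lines. This follows the openvla/openvla-7b model card as of this writing; treat it as a starting point and check the current README, since the API may have changed. The image path is a placeholder for your own camera frame.

```python
# Follows the openvla/openvla-7b model card as of this writing; check the
# current README before copying, since the API may have changed.
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b", torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda:0")

image = Image.open("frame.png")  # placeholder path: your camera frame
prompt = "In: What action should the robot take to pick up the blue cup?\nOut:"

inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
# unnorm_key selects the dataset statistics used to un-normalize the action
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)  # 7-D action: end-effector delta pose + gripper
```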

We'll walk through each of these paths later in this track. For now, the takeaway: you don't need $10M and a fleet of robots to try VLAs anymore. You need a 24 GB GPU, a $500 arm, and a weekend.

The open question

Do VLAs scale? The bullish case: language models scaled predictably with data and compute, and VLAs appear to follow a similar curve (the Open X-Embodiment paper showed monotonic improvements with more robots/tasks). The bearish case: robotics data is expensive to collect, physics constrains what you can do with more parameters, and the bitter lesson may bite robotics harder than NLP.

My take, April 2026: VLAs are not a finished paradigm, but they are the most exciting thing in robot learning since deep RL. Any serious roboticist in 2026 needs working knowledge of them — the way anyone in NLP in 2021 needed to understand transformers.

Next up in this track: imitation learning 101 — the training signal that makes VLAs possible.
