
Production VLA fine-tuning playbook

OpenVLA, π0, GR00T — collect 200 demonstrations, LoRA-fine-tune, deploy. The end-to-end pipeline that goes from teleop rig to working policy on your specific arm in a weekend.

by RobotForge
#frontiers #vla #fine-tuning #imitation-learning

VLAs are no longer closed-source vendor demos. In 2026 a hobbyist with a $500 SO-100 arm, a 24 GB GPU, and a weekend can fine-tune OpenVLA or π0 on their own task and deploy. Here's the production playbook — what to collect, how to train, and the diagnostic checks that catch failures before they hit hardware.

The pipeline at a glance

  1. Set up a teleop rig — phone, GELLO clone, or an ALOHA puppet.
  2. Collect 100–500 demonstrations of one specific task.
  3. Convert to the LeRobot dataset format.
  4. LoRA-fine-tune a base VLA (OpenVLA, π0, SmolVLA).
  5. Evaluate in sim on held-out scenarios.
  6. Deploy on the real robot. Iterate.

Each step has its own gotchas. Below, in order.

1. Pick the right base model

Model        Params   Best for                                            License
OpenVLA      7B       Single-arm tasks; the most-tested choice in 2026    MIT
π0 (OpenPI)  3B       Smoother actions (flow matching), bimanual tasks    Apache-2.0
SmolVLA      450M     Edge deployment; a 12 GB GPU is sufficient          Apache-2.0
GR00T N1.5   1.5B     Humanoid whole-body control                         Apache-2.0

Default for first fine-tune: OpenVLA. The most documented, most-validated model with the biggest community of fine-tunes to learn from. Move to π0 if your task needs smoother trajectories or bimanual coordination.

2. Demonstrations: quality over quantity

The demo set is the single most important determinant of fine-tune quality. Good rules:

  • 100 demos minimum, 200–300 is the sweet spot for a single-task fine-tune.
  • Vary the scene. Different lighting, different starting positions of objects, different distractors in the background. Each demo should look slightly different (a quick spread check is sketched after this list).
  • One task per dataset. Don't mix "pick up the cup" and "stack blocks" in the same fine-tune; performance suffers in both.
  • Include recoveries. Drop the object on purpose, then pick it back up. Your robot will fail in the field; the demo set should show it how to recover.
  • Use natural-language labels. Tag each episode with its instruction in plain English ("pick up the blue cup"); the VLA conditions on these labels.
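
One cheap pre-training check: verify that starting conditions actually vary across episodes. Below is a minimal sketch in plain NumPy; check_demo_diversity and the 0.02 threshold are illustrative, not part of any library, so tune them for your robot and units.

import numpy as np

def check_demo_diversity(initial_states: np.ndarray, min_std: float = 0.02):
    """Flag a demo set whose starting conditions barely vary.

    initial_states: (num_episodes, state_dim) array holding the first
    observation.state of each episode (joint angles or EE pose).
    min_std: minimum per-dimension standard deviation to accept;
    0.02 is an arbitrary placeholder, not a standard value.
    """
    std = initial_states.std(axis=0)
    flat_dims = np.where(std < min_std)[0]
    if len(flat_dims) > 0:
        print(f"Warning: dimensions {flat_dims.tolist()} barely vary "
              f"(std={std[flat_dims].round(4).tolist()}). "
              "Randomize object placement more before training.")
    return std

# Example: 200 episodes of a 6-DoF state, with one joint never varied
rng = np.random.default_rng(0)
states = rng.normal(size=(200, 6))
states[:, 2] = 0.5               # simulate a dimension with no variation
check_demo_diversity(states)     # warns about dimension 2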

3. Teleop rig options

  • Phone teleop — point the camera at the workspace; taps and drags on the screen map to gripper position. Simplest, lowest fidelity. Good for proof-of-concept on simple arms.
  • GELLO (open-source) — a small puppet of your robot's arm, joint-encoded. The user moves the puppet; the real arm mirrors. Cheap (~$200) and fast (you can do 50 demos/hour).
  • ALOHA — same idea but bimanual. Required for any bimanual fine-tune.
  • VR controllers — Quest, Vive. Higher friction to set up; less precise than puppets.

Recommendation: build or buy a GELLO clone for your first fine-tune. The collection rate compounds: 200 demos takes about 4 hours, training about 6, evaluation about 1. The whole loop fits in a weekend.
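
For concreteness, here is roughly what a GELLO-style collection loop looks like. This is a sketch: puppet, follower, and camera are hypothetical stand-ins for your hardware drivers, and the dataset calls are the LeRobot-style add_frame/save_episode shown in the next section.

import time

FPS = 30  # control and recording rate

def record_episode(dataset, puppet, follower, camera, max_seconds=20):
    """One GELLO-style teleop episode: mirror puppet joints onto the
    real arm and log (observation, action) pairs at a fixed rate.
    puppet/follower/camera are hypothetical driver objects."""
    start = time.time()
    while time.time() - start < max_seconds:
        tick = time.time()
        action = puppet.read_joints()        # the human moves the puppet
        follower.command(action)             # the real arm mirrors it
        dataset.add_frame({
            "observation.images.scene": camera.read(),
            "observation.state": follower.read_joints(),
            "action": action,
        })
        # sleep off the remainder of the control period
        time.sleep(max(0.0, 1.0 / FPS - (time.time() - tick)))
    dataset.save_episode(task="pick up the blue cup")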

4. Dataset format

LeRobot is the de facto standard:

from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset.create(
    repo_id="my-username/pick-up-cup",
    fps=30,
    robot_type="so100",
    features={
        "observation.images.scene": {"dtype": "video", "shape": (480, 640, 3)},
        "observation.images.wrist": {"dtype": "video", "shape": (240, 320, 3)},
        "observation.state": {"dtype": "float32", "shape": (6,)},
        "action": {"dtype": "float32", "shape": (6,)},
    },
)

# In the teleop loop:
dataset.add_frame({"observation.images.scene": frame, ...})
dataset.save_episode(task="pick up the blue cup")

Two cameras (one scene, one wrist) and proprioceptive state are the standard observation set. Push to Hugging Face for portability and reuse.
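
Before training, it's worth reloading the dataset and spot-checking shapes. A short sketch, assuming the num_episodes and num_frames accessors in current LeRobot:

from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Reload the dataset and inspect one sample before committing GPU hours.
dataset = LeRobotDataset("my-username/pick-up-cup")
print(dataset.num_episodes, "episodes,", dataset.num_frames, "frames")

sample = dataset[0]  # dict keyed by feature name
for key, value in sample.items():
    if hasattr(value, "shape"):
        print(key, tuple(value.shape))  # verify camera and state shapes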

5. LoRA fine-tuning

Full fine-tuning of a 7B model needs 80+ GB of VRAM. LoRA (Low-Rank Adaptation) trains only ~1% of parameters and fits in 24 GB. Quality is nearly identical for single-task fine-tunes.

from lerobot.scripts.train import train

train(
    policy_class="openvla",
    base_checkpoint="openvla/openvla-7b",
    dataset_repo_id="my-username/pick-up-cup",
    use_lora=True,
    lora_rank=32,
    lora_target_modules=["q_proj", "v_proj"],
    train_batch_size=4,
    learning_rate=5e-4,
    num_steps=20_000,
    save_every=2_000,
    output_dir="./checkpoints/pick-up-cup",
)
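
A back-of-envelope check on why this fits in 24 GB. The arithmetic below assumes a Llama-2-7B-class backbone (hidden size 4096, 32 layers), which is what OpenVLA builds on; targeting only q_proj and v_proj at rank 32 trains well under 1% of the weights, and targeting more projection matrices pushes toward the ~1% figure above.

# Trainable parameters for rank-32 LoRA on q_proj/v_proj.
# Assumes hidden size 4096 and 32 layers; adjust for your base model.
hidden, layers, rank, modules = 4096, 32, 32, 2  # q_proj + v_proj

# Each adapter adds two low-rank matrices: (hidden x rank) + (rank x hidden)
per_module = rank * (hidden + hidden)
trainable = per_module * modules * layers
print(f"{trainable / 1e6:.1f}M trainable params "
      f"({trainable / 7e9:.2%} of 7B)")  # ~16.8M, ~0.24%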

Numbers that work in practice for OpenVLA:

  • LoRA rank: 16–32. Higher = more capacity, more VRAM.
  • Learning rate: 5e-4 for LoRA (vs 1e-5 for full fine-tune).
  • Batch size: as big as fits. 4 on a 24 GB card.
  • Steps: 20–50k for 200 demos. Watch eval loss and stop when it plateaus (a simple plateau check is sketched after this list).
  • Time: ~6 hours on an RTX 4090.
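
"Stop when it plateaus" can be mechanized. A minimal sketch; the window and tol defaults are arbitrary and should be tuned per run.

def plateaued(eval_losses, window=5, tol=1e-3):
    """Return True when the mean of the last `window` eval losses
    improved by less than `tol` over the window before it."""
    if len(eval_losses) < 2 * window:
        return False
    recent = sum(eval_losses[-window:]) / window
    previous = sum(eval_losses[-2 * window:-window]) / window
    return previous - recent < tol

# Called after each eval checkpoint:
losses = [0.205, 0.204, 0.203, 0.203, 0.203,
          0.203, 0.203, 0.203, 0.203, 0.203]
print(plateaued(losses))  # True: < 1e-3 improvement over the last window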

6. Evaluation discipline

Two evaluation regimes, both required:

Sim eval

Drop your fine-tuned model into a held-out sim setup (LIBERO is the standard benchmark). Measure success rate over 50–100 trials with random initial conditions. Look for a success rate above 85%, smooth actions (no high-frequency jitter), and robustness to distractor objects.

Real-world eval

Don't deploy until sim looks good. Then on hardware: 30 trials, with video logged for each. Categorize the failures: never approached the object, dropped it, hit an obstacle, timed out. Each failure mode points at a different fix (more demos, more domain randomization, better collision handling).
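
Tallying those categories is worth scripting so the percentages stay consistent across eval rounds. A sketch with made-up outcomes for illustration:

from collections import Counter

# Outcomes from a 30-trial hardware eval, labeled with this post's
# failure categories while reviewing the videos. Numbers are invented.
trials = (["success"] * 21 + ["never_approached"] * 3 +
          ["dropped_object"] * 4 + ["hit_obstacle"] * 1 + ["timed_out"] * 1)

counts = Counter(trials)
n = len(trials)
print(f"success rate: {counts['success'] / n:.0%}")  # 70%
for failure, k in counts.most_common():
    if failure != "success":
        print(f"{failure}: {k}/{n} ({k / n:.0%})")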

The five most common failure modes

  • Open-loop drift — model executes 5 seconds of correct motion, then meanders. Fix: more demos with longer time horizons; use action chunking with chunk size 16–32.
  • Misses the object — model approaches consistently 1–2 cm off. Fix: collect demos with deliberate variation in object placement; add wrist-camera view if you only had scene-camera.
  • Sim-to-real gap on lighting — works in your studio, fails in a friend's apartment. Fix: collect demos under varied lighting (window light, overhead light, mixed). Color-jitter augmentation in training.
  • Action saturation — model commands joint angles outside the robot's range and the safety stop trips. Fix: clamp predictions during inference (see the clamping sketch after this list); verify your action normalization matches the training data's normalization.
  • Catastrophic forgetting — fine-tuned model forgot tasks the base model could do. Fix: smaller LoRA rank, or use rehearsal data from the base mix.
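
For the action-saturation case, here is the clamping sketch mentioned above. JOINT_LOW/JOINT_HIGH are placeholder limits for a generic 6-DoF arm, and the mean/std de-normalization is an assumption (some pipelines normalize with min/max instead); match whatever your training pipeline actually used.

import numpy as np

# Placeholder joint limits, in the same units as the de-normalized
# policy output -- substitute your robot's real limits.
JOINT_LOW = np.array([-3.0, -1.9, -1.9, -3.0, -1.8, -2.5])
JOINT_HIGH = -JOINT_LOW

def safe_action(raw_action, stats):
    """De-normalize a policy output and clamp it to joint limits.

    stats: the per-dimension mean/std used during *training*;
    de-normalizing with different statistics is the classic mistake.
    """
    action = raw_action * stats["std"] + stats["mean"]
    clipped = np.clip(action, JOINT_LOW, JOINT_HIGH)
    if not np.allclose(action, clipped):
        print("warning: out-of-range command, excess:", action - clipped)
    return clipped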

Inference latency

OpenVLA inference is ~150 ms per frame on a 24 GB GPU. Two tactics for real-time use:

  • Action chunking: predict 16 actions, execute them open-loop for 16 frames, then predict again (a minimal loop is sketched after this list). Effective rate: 6–10 Hz, plenty for most tasks.
  • SmolVLA on edge: 450M model runs at 30+ Hz on a Jetson Orin. Quality drop is real but usable for simpler tasks.
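
The chunking loop mentioned above, as a sketch; policy.predict_chunk, robot.step, robot.joint_state, and camera.read are hypothetical stand-ins for your inference and driver code.

import time

CHUNK = 16   # actions predicted per model call
FPS = 30     # rate at which the chunk is executed

def run_policy(policy, robot, camera, horizon_s=30):
    """Open-loop action chunking: one model call per CHUNK frames."""
    deadline = time.time() + horizon_s
    while time.time() < deadline:
        obs = {"image": camera.read(), "state": robot.joint_state()}
        chunk = policy.predict_chunk(obs)   # the ~150 ms model call
        for action in chunk[:CHUNK]:        # then CHUNK open-loop steps
            robot.step(action)
            time.sleep(1.0 / FPS)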

Honest expectations

Even with everything right, expect:

  • ~85% success on tasks similar to demos.
  • ~60% success on tasks slightly outside the demo distribution (different lighting, novel object).
  • ~10% success on tasks substantially different. The model won't tell you it's outside its distribution; it'll just fail confidently.

This is materially better than what was possible in 2024. It is not yet "deploy and forget." It is "iterate on demos for two weekends and have a working capability."

Where to learn more

  • OpenVLA repo — fine-tuning script + LoRA recipes
  • π0 OpenPI — open release of π0 from Physical Intelligence
  • LeRobot — dataset format, training, deployment glue

Exercise

Build (or buy) a GELLO arm. Pick a single, narrow task — "pick up the orange block and put it in the box." Collect 200 demos in one afternoon. Fine-tune OpenVLA overnight. Run 30 evals on hardware the next morning. Compute success rate. You now know what 2026's frontier feels like — and the diff between "demo" and "production" is mostly more of the above.

Next

Embodied reasoning with LLM agents — what to do when even a fine-tuned VLA can't handle the long-horizon, multi-step plan you need.
