
Fine-tuning a VLA on your own data

The conceptual side of VLA adaptation. LoRA vs full fine-tune, action representations, loss functions, and the tradeoffs that determine whether a 200-demo run produces a working policy.

by RobotForge
#learning #vla #fine-tuning

The Frontiers track has the production playbook (data collection, hyperparameters, evaluation discipline). This lesson covers the conceptual side: what's actually happening when you fine-tune a VLA, why specific design choices matter, and how to debug when training looks fine but deployment fails. The architecture knowledge that makes the playbook make sense.

What's pretrained, what's fine-tuned

A VLA has three logical components, each with different fine-tune characteristics:

  • Vision encoder: a pretrained vision-language backbone (e.g., SigLIP, CLIP). Knows what objects look like. Rarely fine-tuned — fine-tune lower layers and you forget how to see things.
  • Language encoder: same backbone, knows what words mean. Even more rarely fine-tuned.
  • Action decoder: a thinner network that maps joint state plus the fused vision-language features to action chunks. This is what fine-tuning mostly modifies.

The standard recipe: freeze the vision-language backbone, fine-tune only the action decoder + a thin adapter layer. That's where LoRA enters.
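In code, the freezing half of that recipe is a few lines. A minimal PyTorch sketch, assuming a model object that exposes vision_encoder, language_encoder, and action_decoder submodules (hypothetical names; real VLA codebases organize these differently):

# Minimal sketch. Submodule names are hypothetical.
for module in (model.vision_encoder, model.language_encoder):
    module.requires_grad_(False)            # freeze the VLM backbone
model.action_decoder.requires_grad_(True)   # only this part gets gradients

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({trainable / total:.1%})")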

LoRA in one paragraph

Instead of updating every weight in a layer, train two small low-rank matrices that produce a residual update: W' = W + AB^T, where A \in \mathbb{R}^{d \times r} and B \in \mathbb{R}^{k \times r} are skinny (rank r ≈ 16–32) next to the d × k weight W. At training time you learn only A and B; at inference time you can either keep them separate (cheaper, lets you swap fine-tunes) or merge them into W' (zero overhead).

For a 7B-parameter VLA, LoRA fine-tuning trains ~1% of parameters. Memory drops from 80+ GB to <24 GB; convergence speed is similar to full fine-tune for many tasks.
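To make the shapes concrete, here is a minimal LoRA-wrapped linear layer. This is a sketch of the technique itself, not any particular library's API; scaling factors and dropout are omitted.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen nn.Linear plus a trainable low-rank residual A @ B^T."""

    def __init__(self, base: nn.Linear, rank: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)   # frozen pretrained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # A starts at zero so the wrapped layer is initially identical to base.
        self.A = nn.Parameter(torch.zeros(base.out_features, rank))
        self.B = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)

    def forward(self, x):
        # Base projection plus low-rank update; only A and B get gradients.
        return self.base(x) + x @ self.B @ self.A.T

    @torch.no_grad()
    def merge(self):
        # Fold the residual into the base weight: W' = W + A B^T.
        # After this, inference has zero extra cost.
        self.base.weight += self.A @ self.B.T

In practice most recipes use an existing LoRA implementation (e.g., Hugging Face PEFT) rather than hand-rolling one, but the math is exactly this.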

Action representations: the crucial decision

Your 200 demos contain (image, language, action) triples. What's an "action"?

Joint angles

Direct: a_t \in \mathbb{R}^7 for a 7-DOF arm. Compact, deterministic to execute.

  • Pro: matches the robot's interface; zero ambiguity.
  • Con: doesn't transfer across robots with different joint configurations.

End-effector pose deltas

a_t \in \mathbb{R}^6: linear + angular velocity, or position+orientation delta.

  • Pro: transfers across arms with different kinematics.
  • Con: needs an IK layer at execution time; may produce singular configurations.

Discrete action tokens

RT-2's choice: discretize each action dimension into 256 buckets, treat as language tokens. Predict via the LLM head.

  • Pro: shares the language head; simple integration.
  • Con: discretization artifacts; smooth motion is harder to produce.
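A sketch of the bucketing this implies. The bounds and the uniform binning are illustrative; RT-2's exact scheme differs in details:

import numpy as np

N_BINS = 256
LOW, HIGH = -1.0, 1.0   # per-dimension bounds; real recipes compute these from data

def actions_to_tokens(actions):
    # Clip, then map [LOW, HIGH] onto integer bins 0..N_BINS-1.
    clipped = np.clip(actions, LOW, HIGH)
    return np.round((clipped - LOW) / (HIGH - LOW) * (N_BINS - 1)).astype(np.int64)

def tokens_to_actions(tokens):
    # Invert back to bin values. The per-dimension quantization error,
    # (HIGH - LOW) / (N_BINS - 1), is the source of the artifacts above.
    return tokens.astype(np.float64) / (N_BINS - 1) * (HIGH - LOW) + LOW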

Continuous flow / diffusion

π0's approach: a flow-matching head outputs continuous actions. Smoother, fewer artifacts.

  • Pro: smoothest motion; multimodal handling.
  • Con: more inference compute; harder to integrate with frozen LLMs.

For a 200-demo fine-tune on a single robot, joint-angle outputs work fine. For deployment across multiple robots, prefer end-effector deltas or a flow-based head.

The loss function

VLA fine-tuning is supervised learning. Standard MSE on actions:

loss = ((actions_pred - actions_gt) ** 2).mean()

For diffusion-based VLAs (π0, RDT), the loss is denoising MSE on noised actions. Same shape, different sampling.
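To make "denoising MSE" concrete, here is a flow-matching-style sketch in the spirit of π0's head. The model signature, the linear noising schedule, and the velocity target are illustrative assumptions, not the actual π0 or RDT code:

import torch

def flow_matching_loss(model, actions_gt, obs_features):
    # actions_gt: (batch, chunk_len, action_dim) ground-truth chunk.
    # Interpolate between data and noise at a random time t, then train
    # the head to predict the velocity that points from data to noise.
    b = actions_gt.shape[0]
    t = torch.rand(b, 1, 1, device=actions_gt.device)   # one noise level per sample
    noise = torch.randn_like(actions_gt)
    noisy = (1 - t) * actions_gt + t * noise
    target = noise - actions_gt                          # rectified-flow velocity
    pred = model(noisy, t, obs_features)                 # hypothetical signature
    return ((pred - target) ** 2).mean()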

Common variants:

  • Per-dimension weighting: weight wrist joints more than shoulder joints (or vice versa). Useful when one joint matters more for the task; see the sketch after this list.
  • Action chunking loss: average over all 16 predicted actions in the chunk. Standard.
  • Auxiliary losses: add reconstruction or behavior-cloning auxiliary loss to stabilize.
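A sketch combining the first two variants, per-dimension weighting and chunk averaging. The weights and the 8-dimensional layout (7 arm joints plus gripper) are arbitrary examples:

import torch

# 7 arm joints + gripper; upweight wrist joints (indices 3-5) and gripper.
# These values are illustrative, not a recommendation.
dim_weights = torch.tensor([1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 1.0, 3.0])

def weighted_chunk_loss(actions_pred, actions_gt):
    # actions_*: (batch, chunk_len, action_dim), e.g. chunk_len = 16.
    per_dim = (actions_pred - actions_gt) ** 2   # squared error, per dimension
    return (per_dim * dim_weights).mean()        # averaged over batch, chunk, dims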

Why VLAs are imitation learning, not RL

Each VLA training example is (observation, expert action). The loss penalizes deviating from the expert. This is supervised learning, not reinforcement learning — there's no reward, no exploration, no Bellman equation.

Practical implications:

  • VLAs can only do what the demos showed. They can't discover new strategies.
  • The compounding-error problem applies. Action chunking and diffusion- or ACT-style decoders mitigate it but don't eliminate it.
  • Fine-tunes don't need a reward function. The data is the spec.
  • Out-of-distribution → silent failure. The model produces actions confidently but wrongly.

The four-stage workflow

  1. Data audit: inspect demos before training. Reject obvious bad ones (shaky teleop, missed tasks). 200 good demos beat 500 mediocre ones.
  2. Tokenize / preprocess: cameras → image tensors, joint state → vectors, language → tokens. Cache to speed up training.
  3. LoRA fine-tune: 6–12 hours on a 24 GB GPU for typical recipes. Watch training loss and held-out validation, with the split done by episode (see the sketch after this list).
  4. Eval: in sim first (fast), then on hardware (slow). Track success rate, action smoothness, generalization to unseen scenes.
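A minimal sketch of the episode-level validation split referenced in stage 3; it also prevents the data-leakage failure described in the next section. Names are illustrative:

import random

def split_by_episode(episode_ids, val_frac=0.1, seed=0):
    # Split at the episode level, never the frame level: frames within one
    # demo are nearly identical, so a frame-level split leaks training data
    # into validation.
    ids = sorted(set(episode_ids))
    random.Random(seed).shuffle(ids)
    n_val = max(1, int(len(ids) * val_frac))
    val = set(ids[:n_val])
    train = [e for e in ids if e not in val]
    return train, sorted(val)

episode_ids = [f"ep_{i:03d}" for i in range(200)]   # one id per demo
train_eps, val_eps = split_by_episode(episode_ids)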

What can go wrong

  • Data leakage: training and val splits not separated by episode. Train succeeds; deploy fails.
  • Action normalization mismatch: training normalizes with training-set statistics; deployment uses different ones. Robot commands are systematically off (see the sketch after this list).
  • Camera frame mismatch: demos collected with one camera mount, deploy with another. The visual features no longer match.
  • Frequency mismatch: demos at 30 Hz, deploy at 50 Hz. Action chunks are stale; motion is jerky.
  • Hallucination: model confidently outputs wrong actions for out-of-distribution scenes. Not detectable by the model itself.
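A sketch of one way to avoid the normalization mismatch above: compute per-dimension statistics on the training actions, save them next to the checkpoint, and load that same file at deploy time. The file name and helper names are assumptions:

import json
import numpy as np

def save_norm_stats(train_actions, path="norm_stats.json"):
    # train_actions: (num_frames, action_dim), training set only.
    stats = {
        "mean": train_actions.mean(axis=0).tolist(),
        "std": (train_actions.std(axis=0) + 1e-6).tolist(),  # avoid div-by-zero
    }
    with open(path, "w") as f:
        json.dump(stats, f)
    return stats

def denormalize(pred, stats):
    # Deploy time: model output -> robot units, with the *saved* stats.
    return pred * np.asarray(stats["std"]) + np.asarray(stats["mean"])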

The transfer hierarchy

Easiest to hardest fine-tune:

  • Same robot, same task, different conditions: 50–100 demos. Hours.
  • Same robot, different task: 100–300 demos. A day.
  • Different robot (same arm class), same task: 200–500 demos. A few days.
  • Different embodiment (e.g., humanoid): 1000+ demos. Weeks.
  • Multi-task generalist: hundreds of thousands of demos. Months. Mostly done by labs with VLA pretraining budgets.

Plan accordingly. A casual "let's just fine-tune it for our humanoid" can turn into a 1000-demo project.

Which base model to start from

Model           When to pick it
OpenVLA-7B      Most common; biggest community; the default choice
π0 (OpenPI)     Smoother actions; bimanual; frontier results
SmolVLA-450M    Edge deployment; modest GPU; smaller fine-tunes
RDT-1B          Diffusion-based; bimanual; RDT pretraining mix

Exercise

Take a small LeRobot dataset (the public PushT one, or your own 50 demos). Fine-tune OpenVLA-7B with LoRA rank 16. Compare success rate after 5k, 20k, 50k gradient steps. The plateau usually appears around 30k. Then: fine-tune again with rank 32, see if the extra capacity helps. Most projects converge before LoRA capacity is the limit.

Next

Open X-Embodiment and the dataset landscape — what's available, how to use it, and how to contribute.
