Open X-Embodiment and the dataset landscape
What datasets exist, what they contain, and how to actually use them. The base data behind every modern VLA, the benchmarks worth trusting, and how to contribute your own demos.
Modern robot learning is dataset-bound. A great VLA architecture trained on 100 demos is mediocre; an okay architecture trained on 1M is state-of-the-art. The data won. Here's the landscape — what's available, how to use it, and how to contribute back.
The hierarchy of robot data
| Tier | Size | Examples |
|---|---|---|
| Foundation pretraining | ~1M+ trajectories | Open X-Embodiment, RT-X |
| Task-specific | ~10k–100k trajectories | Bridge, RoboNet |
| Project-specific | 10–500 trajectories | Your fine-tune dataset |
| Benchmarks | Curated | LIBERO, MetaWorld, ManiSkill |
Open X-Embodiment
The 2023 collaboration that changed the field: 21 institutions pooled their robot data into ~1M trajectories across 22 robot embodiments, released as a collection of per-robot datasets in a shared format rather than one monolithic file.
OpenVLA, RT-X, and most modern VLAs pretrain on it. Without it, the modern VLA recipe wouldn't be reproducible at academic scale.
Key properties:
- Heterogeneous: many robots, many tasks. Forces the model to generalize.
- Action spaces vary: a unifying representation is needed. The de facto standard is a 7-dimensional end-effector delta action (Δposition, Δorientation, gripper); see the sketch after this list.
- Camera setups vary: typically a scene camera; some have wrist cameras.
- Language labels: each episode has a natural-language instruction.
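What "unifying" looks like in practice: pack each timestep into the shared 7-dim layout, then normalize per dataset. A minimal sketch; the function names are ours, and the quantile normalization mirrors what OpenVLA-style pipelines do rather than any specific library's API.

```python
import numpy as np

def to_unified_action(delta_pos, delta_rpy, gripper):
    """Pack one timestep as [dx, dy, dz, droll, dpitch, dyaw, gripper]."""
    return np.concatenate([delta_pos, delta_rpy, [gripper]]).astype(np.float32)

def normalize(actions, q_low, q_high):
    """Scale each action dimension to [-1, 1] using per-dataset quantile
    statistics (q_low/q_high are e.g. the 1st/99th percentiles,
    precomputed over the whole dataset). Clipping absorbs outliers."""
    actions = 2.0 * (actions - q_low) / (q_high - q_low) - 1.0
    return np.clip(actions, -1.0, 1.0)
```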
To use: download from robotics-transformer-x.github.io. Available on Hugging Face for streaming.
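A hedged sketch of streaming one OXE component dataset via tensorflow_datasets. The RLDS layout (episodes containing a nested dataset of steps) is standard, but the dataset name and GCS path below follow the project's example notebooks and may change:

```python
import tensorflow_datasets as tfds

# Stream one OXE component dataset from the public GCS bucket.
# Dataset name and bucket path are assumptions taken from OXE examples.
ds = tfds.load("bridge", data_dir="gs://gresearch/robotics", split="train[:10]")

for episode in ds.take(1):
    # RLDS layout: each episode holds a nested dataset of steps.
    for step in episode["steps"].take(2):
        print(step["observation"].keys())  # field names vary per dataset
        print(step["action"])
```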
BridgeData and friends
Smaller, more focused datasets:
- BridgeData V2 (60k trajectories): Berkeley's WidowX arm performing diverse tabletop tasks; a common component of pretraining mixes (mixing is sketched after this list).
- RoboNet (~162k trajectories, 15M frames): older but still influential; video data from seven robot platforms.
- CALVIN: long-horizon, language-conditioned manipulation in simulation.
- ALOHA datasets: bimanual demonstrations from the original ALOHA hardware.
- Mobile ALOHA dataset: bimanual arms plus a mobile base; released as per-task demonstration sets.
- DROID (~76k trajectories): a multi-institution effort led by Stanford; large-scale Franka-arm manipulation across diverse real-world scenes.
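When these are "used as a pretrain mix," that usually means per-dataset sampling weights rather than naive concatenation. A minimal sketch of weighted mixture sampling; the weights and the loader interface are illustrative, not published values:

```python
import numpy as np

# Hypothetical mixture: dataset name -> sampling weight (not official values).
MIX = {"oxe_subset": 0.6, "bridge_v2": 0.3, "droid": 0.1}

names = list(MIX)
probs = np.array([MIX[n] for n in names])
probs /= probs.sum()

def sample_batch(loaders, batch_size, rng=np.random.default_rng(0)):
    """Draw each batch element from a dataset chosen by mixture weight.
    `loaders` maps name -> iterator yielding single training examples."""
    choices = rng.choice(names, size=batch_size, p=probs)
    return [next(loaders[name]) for name in choices]
```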
The LeRobotDataset format
The Hugging Face / LeRobot project standardized a dataset format that most modern projects adopt:
```python
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Declare the schema up front: camera streams, proprioceptive state, actions.
dataset = LeRobotDataset.create(
    repo_id="myorg/my-task",
    fps=30,
    robot_type="so100",
    features={
        "observation.images.scene": {"dtype": "video", "shape": (480, 640, 3)},
        "observation.images.wrist": {"dtype": "video", "shape": (240, 320, 3)},
        "observation.state": {"dtype": "float32", "shape": (6,)},
        "action": {"dtype": "float32", "shape": (6,)},
    },
)

# Append frames episode by episode; each save finalizes one episode.
for episode in collected_episodes:
    for frame in episode:
        dataset.add_frame(frame)
    dataset.save_episode(task="pick up the orange cup")

# repo_id was set at create time, so no argument is needed here.
dataset.push_to_hub()
```
Camera streams are encoded as MP4 (efficient streaming); states and actions are stored as Parquet (fast random access). The format integrates with most modern training pipelines.
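The read side is symmetric. A short sketch of loading a public LeRobot dataset and requesting small action chunks via delta_timestamps; exact argument names can shift between lerobot versions:

```python
import torch
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Any LeRobot-format repo works here; 'lerobot/pusht' is a small public
# example recorded at 10 fps, so 0.1 s steps align with its frames.
dataset = LeRobotDataset(
    "lerobot/pusht",
    delta_timestamps={"action": [0.0, 0.1, 0.2, 0.3]},
)

loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)
batch = next(iter(loader))
print(batch["action"].shape)  # (32, 4, action_dim): a 4-step action chunk
```

delta_timestamps is what lets chunked policies (ACT- and diffusion-style) train directly from the dataset without custom collation.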
Benchmarks worth trusting
- LIBERO: 130 simulated tasks across 4 categories. The default for VLA benchmarking.
- MetaWorld: 50 tabletop tasks. Older, mature; used for sample-efficient RL.
- ManiSkill 2/3: built on SAPIEN, with GPU-parallel simulation in ManiSkill3; more diverse objects and tasks than MetaWorld.
- Franka Kitchen: long-horizon kitchen tasks, requires planning.
- SAPIEN / PartNet-Mobility: a simulator plus 2,000+ articulated object models, with photorealistic rendering. Good for object-centric and articulation tasks.
Trust these for paper claims. Be skeptical of "we got 99% on our internal benchmark" — internal benchmarks favor the architecture they were designed alongside.
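For reference, the number these benchmarks report is a success rate over fixed rollouts. A generic, gym-style sketch; make_env, the policy interface, and the info["success"] flag are placeholders, not any specific benchmark's API:

```python
def success_rate(make_env, policy, task_ids, episodes_per_task=20, max_steps=500):
    """Average binary success over fixed rollouts: the number papers report."""
    successes, total = 0, 0
    for task_id in task_ids:
        env = make_env(task_id)  # placeholder constructor
        for _ in range(episodes_per_task):
            obs, _ = env.reset()
            for _ in range(max_steps):
                action = policy(obs)
                obs, reward, terminated, truncated, info = env.step(action)
                if terminated or truncated:
                    break
            successes += int(info.get("success", False))
            total += 1
    return successes / total
```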
Real-robot benchmarks
Hardware benchmarks are harder to standardize but more meaningful:
- SIMPLER (SimplerEnv): bridges sim and real for evaluation by reproducing real-robot setups in simulation; the same task exists in both.
- Open X-Embodiment evaluations: original tasks across the 22 included robots.
- ALOHA tasks: zip-tying, threading, folding — used for bimanual papers.
For your project, evaluate on the actual deployment hardware under realistic conditions. Treat sim numbers as an optimistic upper bound; real-robot numbers are the ones that count.
How to contribute
Public datasets need contributors. Standard recipe:
- Collect demos in the LeRobotDataset format. Diverse conditions, clean labels.
- Document your robot setup: hardware, calibration, control rate.
- Push to Hugging Face. Tag the repo with relevant labels (one way is sketched after this list).
- Optionally: integrate into Open X-Embodiment by following the contributor guide on GitHub.
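On tagging: Hub search keys off the YAML front matter in the dataset card. One way to set it programmatically, sketched with huggingface_hub; the repo id and tag list are hypothetical:

```python
from huggingface_hub import HfApi

# YAML front matter in the README is what powers Hub search tags.
readme = """---
tags:
- robotics
- lerobot
- so100
---
# my-task

500 teleoperated episodes on an SO-100 arm.
Scene camera 480x640 @ 30 fps, wrist camera 240x320 @ 30 fps.
"""

api = HfApi()
api.upload_file(
    path_or_fileobj=readme.encode(),
    path_in_repo="README.md",
    repo_id="myorg/my-task",  # hypothetical repo
    repo_type="dataset",
)
```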
Datasets are durable contributions. Three years from now, the OpenVLA you fine-tune is forgotten; the 500-trajectory dataset you contributed is still in every pretrain mix.
The data quality question
Fewer demos of higher quality beat many more demos of lower quality. Heuristics for "good demo data":
- Clean teleop: no shakes, dropped frames, or operator confusion.
- Diverse: vary lighting, object positions, distractors.
- Successful: the demo achieves the stated task.
- Reasonably labeled: language matches what the demo actually shows.
- Synchronized: cameras and joint states aligned to within a few ms.
Audit before training. Reject the bottom 10% of demos. The improvement in policy quality is worth more than any architecture tweak.
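A sketch of what that audit pass can look like: score each episode by action smoothness and drop the worst decile. The jerk heuristic is one reasonable choice among many, not a standard:

```python
import numpy as np

def jerkiness(actions):
    """Mean squared second difference of the (T, dim) action sequence;
    a rough proxy for teleop shakiness (lower is smoother)."""
    if len(actions) < 3:
        return np.inf
    return float(np.mean(np.diff(actions, n=2, axis=0) ** 2))

def audit(episodes, drop_fraction=0.10):
    """episodes: list of (episode_id, actions-array) pairs.
    Returns episode ids to keep, with the worst `drop_fraction` removed."""
    scored = sorted(episodes, key=lambda e: jerkiness(e[1]))
    keep = scored[: int(len(scored) * (1 - drop_fraction))]
    return [episode_id for episode_id, _ in keep]
```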
The trends
- Quality over quantity: 2024+ research shows that 100 carefully curated demos outperform 1000 average demos for fine-tuning.
- Synthetic augmentation: image-level augmentation (lighting, color, background) applied to real demos stretches the same data further (see the sketch after this list).
- Cross-embodiment data: train on multiple robots; transfers better.
- Scaling continues: the "more data" pillar of pretraining is still going up.
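A sketch of the appearance-only augmentation from the second bullet, using torchvision; the transform parameters are illustrative:

```python
import torch
from torchvision import transforms

# Appearance-only augmentations: action labels stay valid because the
# scene geometry is untouched. Geometric crops/shifts need more care,
# since they change where things appear relative to the camera frame.
augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, hue=0.05),
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 1.5)),
])

frames = torch.rand(8, 3, 480, 640)  # stand-in for one demo's frames
augmented = torch.stack([augment(f) for f in frames])
```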
Exercise
Pick three public datasets relevant to your project. Download them. Inspect 10 random episodes from each. You'll see drastically different data quality — some clean, some noisy. The 30 minutes you spend looking at data is the most underrated investment in any robotics project.
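If the datasets you pick are in LeRobot format, the inspection loop is a few lines. A sketch; the episode_data_index attribute follows recent lerobot versions and may differ in yours:

```python
import random
import torch
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset("lerobot/pusht")  # swap in the dataset you picked

for ep in random.sample(range(dataset.num_episodes), 10):
    # episode_data_index maps episode index -> frame index range.
    start = dataset.episode_data_index["from"][ep].item()
    end = dataset.episode_data_index["to"][ep].item()
    actions = torch.stack([dataset[i]["action"] for i in range(start, end)])
    print(f"episode {ep}: {end - start} frames, "
          f"action std {actions.std(dim=0).mean():.4f}")
```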
That's the Learning track done
You've covered the modern-AI side of robotics: VLAs, imitation learning, diffusion, ACT, RL primer, PPO, SAC, real-world RL, teleop rigs, fine-tuning, and the dataset landscape. With this and the Foundations / Kinematics / Control / Manipulation tracks, you have the spine of 2026 robotics in front of you. The remaining tracks (Perception, SLAM, Planning, Mobile/Legged, Simulators, Embedded, Frontiers) build on this foundation.