Teleoperation rigs: ALOHA, GELLO, phone-teleop
The hardware that generates VLA training data. Build recipes, cost tradeoffs, and open designs from $20 phone-teleop to $20k full-body MoCap.
Modern robot learning is data-bottlenecked. Every VLA, every imitation policy, every fine-tuned diffusion model needs hundreds to thousands of demonstrations. The cheapest, fastest, highest-quality way to collect them is teleoperation — a human moves a "puppet" rig and the real robot mirrors. The 2023+ wave of low-cost teleop designs is what made VLA fine-tuning a hobbyist activity. Here's the landscape.
Why not just program the robot?
Hand-coded behaviors don't scale to long-tail manipulation. A human teleoperator demonstrating "fold this towel" produces a policy in 30 minutes that beats 3 weeks of scripted programming. Better still, the resulting policy generalizes across towels, because the natural variation in the human's demonstrations is baked into the training data.
The teleop rig is the bottleneck on demonstration throughput. Better rig = more data per hour = better fine-tunes.
The rig families
1. Phone-teleop ($20)
Use a phone's IMU + screen as a controller. Tilt the phone → command end-effector pose. Tap to grasp. Cheap, accessible, but coarse — typically 1 demo per minute.
Use for: bootstrapping; very simple tasks; user studies.
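The tilt-to-pose mapping can be sketched in a few lines. This is a hypothetical mapping (pitch drives x, roll drives y, yaw drives gripper rotation), with a deadband to reject hand tremor; the gains and axis assignments are assumptions, not a standard:

```python
import math

def tilt_to_delta_pose(roll, pitch, yaw, dt, gain=0.15, deadband=0.05):
    """Map phone IMU tilt (radians) to an end-effector velocity command.

    Angles inside the deadband are ignored; beyond it, command scales
    linearly. Returns per-timestep deltas (meters, meters, radians).
    """
    def shaped(angle):
        if abs(angle) < deadband:
            return 0.0
        return gain * (angle - math.copysign(deadband, angle))

    dx = shaped(pitch) * dt   # tilt forward/back -> translate along x
    dy = shaped(roll) * dt    # tilt left/right  -> translate along y
    dyaw = shaped(yaw) * dt   # twist the phone  -> rotate the gripper
    return dx, dy, dyaw
```

The deadband is what makes phone teleop tolerable at all: without it, IMU noise turns into constant end-effector drift.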
2. Joystick / gamepad ($30)
Two analog sticks for end-effector translation + rotation; buttons for gripper. Used in early ROS demos. Slightly faster than phone; still coarse.
3. GELLO ($150–300)
A scaled-down puppet of the robot's arm with joint encoders. The user moves the puppet; the real arm mirrors it joint-by-joint. Open-source design from UC Berkeley (2023).
Strengths: very natural for users; high demonstration speed (5+ demos/minute); cheap.
Weaknesses: needs custom mechanical build; requires a matching puppet for each robot.
Use for: most arm fine-tuning. Production default for VLA data collection in 2024–26.
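The core of a GELLO-style rig is a tight mirroring loop: read the puppet's joint encoders, smooth, command the follower. A minimal sketch, where `read_puppet_joints` and `command_robot_joints` are placeholders for your encoder and robot interfaces:

```python
import time

def mirror_loop(read_puppet_joints, command_robot_joints,
                rate_hz=200, alpha=0.3, steps=None):
    """Joint-space mirroring: read puppet encoders, low-pass filter to
    suppress encoder noise, stream position targets to the follower arm.

    `steps=None` runs forever; pass an int to run a fixed number of
    iterations (useful for testing).
    """
    period = 1.0 / rate_hz
    filtered = list(read_puppet_joints())  # seed the filter
    n = 0
    while steps is None or n < steps:
        t0 = time.monotonic()
        raw = read_puppet_joints()
        # exponential smoothing: alpha=1 is raw passthrough, lower is smoother
        filtered = [alpha * r + (1 - alpha) * f
                    for r, f in zip(raw, filtered)]
        command_robot_joints(filtered)
        n += 1
        time.sleep(max(0.0, period - (time.monotonic() - t0)))
```

Because both sides share a kinematic structure, no inverse kinematics is needed; that's why puppet rigs feel so much more natural than pose-mapped controllers.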
4. ALOHA ($25k commercial; $5k DIY)
Bimanual puppet teleop: two leader (puppet) arms drive two follower arms. From Stanford's 2023 ALOHA paper.
Strengths: enables bimanual tasks (folding, threading, cooking); high quality data.
Weaknesses: expensive; large physical footprint; setup time.
Use for: serious bimanual manipulation research / VLA fine-tuning.
5. VR controllers (Quest 3, Vive: $300–600)
Hand-tracking controllers; map controller pose → end-effector pose. Gripper via trigger.
Strengths: 6-DOF hand tracking out of the box; works in any room with the headset.
Weaknesses: less precise than puppet rigs; users get tired holding hands up.
6. Apple Vision Pro ($3500)
The 2024 entrant. Native hand-tracking + room-scale awareness; adopted directly in recent academic humanoid-teleop pipelines.
Strengths: free hands (no controllers); seamless 6-DOF; high resolution; full body via add-on tracking.
Weaknesses: cost; limited fine-grained tactile feedback.
7. Full-body MoCap ($10k–50k)
Vicon, OptiTrack, or wearable Xsens IMU suits. Track every body joint. Used for humanoid teleop.
Strengths: highest fidelity; full body.
Weaknesses: expensive; requires room setup; per-session calibration.
8. Custom haptic devices ($5k–25k)
Force-feedback rigs (3D Systems Touch, Phantom Premium) give the operator haptic feedback. Standard in surgical and hazardous-environment teleop; niche in robot learning.
The choice in 2026
For a hobbyist or small team:
- Single-arm fine-tuning → GELLO clone for that arm. ~$200 in parts; weekend build.
- Bimanual fine-tuning → ALOHA DIY (~$5k) or two GELLOs side-by-side.
- Mobile manipulator → GELLO + base teleop joystick.
- Humanoid → Apple Vision Pro + hand-tracking; open-source HumanPlus pipeline.
For a serious lab:
- Production ALOHA setups, 4–8 cells running in parallel.
- Vicon room for high-fidelity humanoid teleop.
- Custom haptic for delicate tasks (assembly, cooking).
Demonstration throughput
Empirical numbers from published work:
| Rig | Demos/hour |
|---|---|
| Phone teleop | ~30 |
| VR controllers | ~80 |
| GELLO | ~150 |
| ALOHA | ~120 (bimanual) |
| Vision Pro | ~120 |
For a 200-demo dataset: ~1.5 hours with GELLO; ~7 hours with phone. The rig pays for itself in operator time.
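The arithmetic is worth making explicit when planning a collection run. Using the rates from the table above (the dictionary keys here are just shorthand labels):

```python
# Demos/hour from the table above.
RATES = {"phone": 30, "vr": 80, "gello": 150, "aloha": 120, "vision_pro": 120}

def hours_for(n_demos, rig):
    """Idealized collection time, ignoring resets and operator breaks."""
    return n_demos / RATES[rig]

for rig in RATES:
    print(f"{rig}: {hours_for(200, rig):.1f} h for 200 demos")
```

Real sessions run slower than the ideal rate (scene resets, failed demos you discard), so pad these estimates by 30–50%.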
What makes good demonstration data
- Variety in starting conditions: object positions, lighting, distractors.
- Recovery demonstrations: deliberately bobble; recover. Teaches the policy how to handle errors.
- Consistent task definition: same goal every time.
- Clean labels: text annotation matches what's actually happening.
- Synced multi-modal: camera + proprioception + actions all timestamped accurately.
Bad data trains worse policies than no data. The discipline of careful collection is underrated.
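Some of these properties can be checked automatically. A sketch of a sync check, assuming each modality is recorded as an equal-length list of per-frame timestamps in seconds (a hypothetical recording layout, not any specific tool's format):

```python
def check_sync(camera_ts, proprio_ts, action_ts, tol_s=0.010):
    """Return indices of frames where the three modality timestamps
    drift apart by more than `tol_s` seconds.

    A nonempty result means the episode needs re-recording or
    re-alignment before it goes into a training set.
    """
    bad = []
    for i, (c, p, a) in enumerate(zip(camera_ts, proprio_ts, action_ts)):
        if max(c, p, a) - min(c, p, a) > tol_s:
            bad.append(i)
    return bad
```

Running a check like this on every episode at collection time is far cheaper than debugging a policy that silently trained on misaligned observation/action pairs.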
The data-collection pipeline
- Define the task with examples.
- Set up the environment with controlled variation.
- Operator practices for 10–20 demos before recording.
- Record demos with cameras + proprioception + actions at 30 Hz.
- Annotate language descriptions per episode.
- Store as LeRobot dataset.
- Push to Hugging Face for sharing / version control.
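The recording step of the pipeline above reduces to a fixed-rate loop that stamps every modality with a shared clock. A minimal sketch; `get_frame`, `get_state`, and `get_action` are placeholders for your camera, proprioception, and teleop-action sources:

```python
import time

def record_episode(get_frame, get_state, get_action,
                   duration_s=10.0, hz=30):
    """Record one demo episode at a fixed rate.

    Each step stores all modalities under one timestamp so downstream
    tools (e.g. a LeRobot-style dataset converter) can align them.
    """
    steps, period, t0 = [], 1.0 / hz, time.monotonic()
    while time.monotonic() - t0 < duration_s:
        t = time.monotonic() - t0
        steps.append({
            "timestamp": t,
            "image": get_frame(),     # camera observation
            "state": get_state(),     # joint positions, gripper
            "action": get_action(),   # teleop command at this step
        })
        # sleep off the remainder of this control period
        time.sleep(max(0.0, period - ((time.monotonic() - t0) - t)))
    return steps
```

Sampling observation and action in the same iteration matters: if the action is read on a different clock than the image, the policy learns a systematic observation/action lag.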
Latency requirements
For natural teleop, end-to-end latency (puppet motion → real arm response) should be < 50 ms. Common bottlenecks:
- Network (ROS topic over WiFi): 5–50 ms.
- Joint position controller: 1–10 ms.
- Camera stream: 30–100 ms (display latency).
For most arm teleop, wired Ethernet or a local-only ROS graph is enough. WiFi teleop works, but latency jitter makes it feel sluggish; prefer a wired link for serious collection.
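Before blaming any one component, measure the round trip. A sketch of a latency probe, where `send_command` and `wait_for_echo` are stand-ins for your transport layer (e.g. publishing a joint target and blocking until the robot-side echo returns):

```python
import statistics
import time

def measure_latency(send_command, wait_for_echo, n=50):
    """Estimate round-trip teleop latency in milliseconds.

    Timestamps a command, blocks until the robot-side echo arrives,
    repeats n times. Returns (median_ms, p95_ms); the p95 tail is what
    the operator actually feels as sluggishness.
    """
    samples = []
    for _ in range(n):
        t0 = time.monotonic()
        send_command()
        wait_for_echo()
        samples.append((time.monotonic() - t0) * 1000.0)
    samples.sort()
    return statistics.median(samples), samples[int(0.95 * (n - 1))]
```

If the median is under 50 ms but the p95 is triple that, the transport is jittery, and that usually points at WiFi.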
The build path
For a single-arm GELLO clone for an SO-100 arm (~$500 hardware total):
- Print the puppet's links (free STLs from the open-source GELLO repo).
- Buy 6 small servo units to use as absolute joint encoders (~$200).
- Wire to an ESP32; flash GELLO firmware.
- Pair with a controlling computer running LeRobot.
- Calibrate joint zero positions.
- Start collecting demos.
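The calibration step above amounts to averaging raw encoder counts in a reference pose and storing per-joint offsets. A sketch under assumed 12-bit (4096 counts/rev) encoders; `read_raw_counts` is a placeholder returning one count per joint:

```python
import math

def calibrate_zeros(read_raw_counts, counts_per_rev=4096, n_samples=100):
    """Average raw encoder counts while the puppet is held in its
    reference pose; return per-joint zero offsets in counts."""
    sums = None
    for _ in range(n_samples):
        counts = read_raw_counts()
        sums = list(counts) if sums is None else [
            s + c for s, c in zip(sums, counts)]
    return [(s / n_samples) % counts_per_rev for s in sums]

def counts_to_radians(raw, zero, counts_per_rev=4096):
    """Convert a raw count to a joint angle relative to the calibrated
    zero, wrapped to (-pi, pi]."""
    frac = ((raw - zero) % counts_per_rev) / counts_per_rev
    angle = frac * 2 * math.pi
    return angle - 2 * math.pi if angle > math.pi else angle
```

Averaging matters because a single noisy read at calibration time becomes a permanent joint offset in every demonstration you collect afterward.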
A few weekends and you have a working teleop setup. From there, every fine-tune you want to do is data-collection-bottlenecked, not rig-bottlenecked.
The OpenX angle
Open-X-Embodiment (Google DeepMind plus dozens of academic labs, 2023+) standardizes data formats so demonstrations from one robot can train policies for many. Your GELLO-collected SO-100 data can join a cross-embodiment pool spanning dozens of robot platforms.
Export in Hugging Face's LeRobotDataset format and push it; the data joins the OpenX pool. Cross-embodiment training is one of the field's biggest active research bets.
Exercise
Build a phone-teleop interface for a single-arm robot: phone tilt → arm joint commands. Collect 50 demos of a single task. Fine-tune OpenVLA on the data. Compare success rate vs hand-coded policy. Even with the worst rig, fine-tuning beats scripted code on most realistic tasks. Then upgrade to GELLO and watch the success rate jump 20+ points.
Next
Tactile sensing — the modality that complements teleop visual feedback for delicate manipulation.