RobotForge

Teleoperation rigs: ALOHA, GELLO, phone-teleop

The hardware that generates VLA training data. Build recipes, cost tradeoffs, and open designs from $20 phone-teleop to $20k full-body MoCap.

by RobotForge
#frontiers #teleop #data-collection

Modern robot learning is data-bottlenecked. Every VLA, every imitation policy, every fine-tuned diffusion model needs hundreds to thousands of demonstrations. The cheapest, fastest, highest-quality way to collect them is teleoperation — a human moves a "puppet" rig and the real robot mirrors the motion. The 2023+ wave of low-cost teleop designs is what made VLA fine-tuning a hobbyist activity. Here's the landscape.

Why not just program the robot?

Hand-coded behaviors don't scale to long-tail manipulation. Thirty minutes of a human teleoperating "fold this towel" yields demonstrations that train a better policy than 3 weeks of scripted programming. Plus the resulting policy generalizes across towels, because the human's natural variation is encoded in the data.

The teleop rig is the bottleneck on demonstration throughput. Better rig = more data per hour = better fine-tunes.

The rig families

1. Phone-teleop ($20)

Use a phone's IMU + screen as a controller. Tilt the phone → command end-effector pose. Tap to grasp. Cheap and accessible, but coarse — typically ~30 demos per hour.

Use for: bootstrapping; very simple tasks; user studies.
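A minimal sketch of the tilt-to-velocity mapping, assuming a deadzone and full-speed-at-30° thresholds that are purely illustrative (real phone-teleop stacks stream IMU readings to the controller over a socket):

```python
import math

def tilt_to_ee_velocity(roll_rad, pitch_rad, max_speed=0.10, deadzone_rad=0.05):
    """Map phone tilt to an end-effector velocity command (m/s).

    Pitch (forward/back tilt) drives x; roll (left/right tilt) drives y.
    A small deadzone suppresses sensor noise when the phone is held
    roughly level. Thresholds here are illustrative, not from any
    specific app.
    """
    def axis(angle):
        if abs(angle) < deadzone_rad:
            return 0.0
        # Scale tilt beyond the deadzone into [-max_speed, max_speed],
        # saturating at 30 degrees of tilt.
        span = math.radians(30) - deadzone_rad
        frac = min(1.0, (abs(angle) - deadzone_rad) / span)
        return frac * max_speed * (1.0 if angle > 0 else -1.0)

    return axis(pitch_rad), axis(roll_rad)  # (vx, vy)
```

Tap-to-grasp would toggle the gripper in a separate touch handler; the coarseness of this interface is exactly why throughput tops out so low.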

2. Joystick / gamepad ($30)

Two analog sticks for end-effector translation + rotation; buttons for gripper. Used in early ROS demos. Slightly faster than phone; still coarse.
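A sketch of the stick mapping, including the detail that trips people up: the gripper should toggle on the button's rising edge, not its level, so holding the button down doesn't chatter open/closed. Class name and scale values are illustrative.

```python
class GamepadTeleop:
    """Stick axes in [-1, 1] -> end-effector velocity; gripper toggles
    on the grip button's rising edge (press, not hold)."""

    def __init__(self, lin_scale=0.15):
        self.lin_scale = lin_scale      # m/s at full stick deflection
        self.gripper_closed = False
        self._prev_button = False

    def update(self, lx, ly, grip_button):
        # Rising-edge detection: only a fresh press toggles the gripper.
        if grip_button and not self._prev_button:
            self.gripper_closed = not self.gripper_closed
        self._prev_button = grip_button
        # Deadzone kills stick drift around center.
        deadzone = lambda v: 0.0 if abs(v) < 0.1 else v
        return (deadzone(ly) * self.lin_scale,
                deadzone(lx) * self.lin_scale,
                self.gripper_closed)
```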

3. GELLO ($150–300)

A scaled-down puppet of the robot's arm with joint encoders. The user moves the puppet; the real arm mirrors joint-by-joint. Open-source design from UC Berkeley (2023).

Strengths: very natural for users; high demonstration speed (2–3 demos/minute); cheap.

Weaknesses: needs custom mechanical build; requires a matching puppet for each robot.

Use for: most arm fine-tuning. Production default for VLA data collection in 2024–26.
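The core of a GELLO-style rig is a tight read-filter-command loop. A sketch, with `read_puppet_joints` and `command_follower` as placeholders for your encoder and motor-driver interfaces:

```python
import time

def smooth(filtered, target, alpha=0.2):
    """One exponential-smoothing step: pull the filtered joint vector
    a fraction alpha toward the latest encoder reading."""
    return [f + alpha * (t - f) for f, t in zip(filtered, target)]

def mirror(read_puppet_joints, command_follower, steps, rate_hz=200, alpha=0.2):
    """Read the puppet's encoders, low-pass filter to knock down
    jitter, and command the follower at a fixed rate. At ~200 Hz the
    filter lag is imperceptible to the operator."""
    dt = 1.0 / rate_hz
    filtered = list(read_puppet_joints())  # start from the current pose
    for _ in range(steps):
        filtered = smooth(filtered, read_puppet_joints(), alpha)
        command_follower(filtered)
        time.sleep(dt)
```

A real rig would also clamp joint velocities and check limits before commanding; this shows only the mirroring itself.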

4. ALOHA ($25k commercial; $5k DIY)

Bimanual GELLO-style: two puppet (leader) arms driving two follower arms. From Stanford's 2023 ALOHA paper.

Strengths: enables bimanual tasks (folding, threading, cooking); high quality data.

Weaknesses: expensive; large physical footprint; setup time.

Use for: serious bimanual manipulation research / VLA fine-tuning.

5. VR controllers (Quest 3, Vive: $300–600)

Hand-tracking controllers; map controller pose → end-effector pose. Gripper via trigger.

Strengths: 6-DOF hand tracking out of the box; works in any room with the headset.

Weaknesses: less precise than puppet rigs; users get tired holding hands up.
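One mechanism nearly every VR teleop stack shares is clutching: while a clutch button is held, controller displacement is added to the end-effector target; releasing it lets the user reposition their arm without moving the robot. A position-only sketch (real stacks also handle orientation):

```python
class ClutchedTeleop:
    """Relative pose mapping with a clutch. Holding the clutch streams
    controller displacement into the end-effector target; releasing
    freezes the robot while the user moves freely."""

    def __init__(self, ee_position):
        self.ee = list(ee_position)
        self.anchor = None  # controller position at clutch-in

    def update(self, controller_pos, clutch_held):
        if clutch_held:
            if self.anchor is None:
                self.anchor = list(controller_pos)
            # Apply displacement since the last tick, then re-anchor.
            delta = [c - a for c, a in zip(controller_pos, self.anchor)]
            self.ee = [e + d for e, d in zip(self.ee, delta)]
            self.anchor = list(controller_pos)
        else:
            self.anchor = None
        return list(self.ee)
```

Clutching also doubles as a fatigue mitigation: operators can lower their arm mid-episode without perturbing the scene.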

6. Apple Vision Pro ($3500)

The 2024 entrant. Native hand-tracking + room-scale awareness. Direct integration in Stanford / OpenAI work.

Strengths: free hands (no controllers); seamless 6-DOF; high resolution; full body via add-on tracking.

Weaknesses: cost; limited fine-grained tactile feedback.

7. Full-body MoCap ($10k–50k)

Vicon, OptiTrack, or wearable Xsens IMU suits. Track every body joint. Used for humanoid teleop.

Strengths: highest fidelity; full body.

Weaknesses: expensive; requires room setup; per-session calibration.

8. Custom haptic devices ($5k–25k)

Force-feedback rigs (3D Systems Touch, Phantom Premium) provide haptic info to the user. Used in surgery, hazardous teleop. Niche in robotics RL.

The choice in 2026

For a hobbyist or small team:

  • Single-arm fine-tuning → GELLO clone for that arm. ~$200 in parts; weekend build.
  • Bimanual fine-tuning → ALOHA DIY (~$5k) or two GELLOs side-by-side.
  • Mobile manipulator → GELLO + base teleop joystick.
  • Humanoid → Apple Vision Pro + hand-tracking; open-source HumanPlus pipeline.

For a serious lab:

  • Production ALOHA setups, 4–8 cells running in parallel.
  • Vicon room for high-fidelity humanoid teleop.
  • Custom haptic for delicate tasks (assembly, cooking).

Demonstration throughput

Empirical numbers from published work:

Rig              Demos/hour
Phone teleop     ~30
VR controllers   ~80
GELLO            ~150
ALOHA            ~120 (bimanual)
Vision Pro       ~120

For a 200-demo dataset: ~1.5 hours with GELLO; ~7 hours with phone. The rig pays for itself in operator time.
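"Pays for itself" can be put in numbers with a rough breakeven calculation; the operator wage below is an assumed figure, not from any source:

```python
def breakeven_demos(rig_cost, baseline_rate, rig_rate, wage_per_hour=25.0):
    """Number of demos at which a faster rig pays for itself in
    operator time versus a baseline rig.

    Hours saved per demo = 1/baseline_rate - 1/rig_rate; multiply by
    wage to get dollars saved per demo, then divide into the rig cost.
    wage_per_hour is an assumed figure.
    """
    saved_per_demo = (1.0 / baseline_rate - 1.0 / rig_rate) * wage_per_hour
    return rig_cost / saved_per_demo
```

At an assumed $25/hour, a $200 GELLO build breaks even against phone teleop (~30 vs ~150 demos/hour) after roughly 300 demos — about two sessions.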

What makes good demonstration data

  • Variety in starting conditions: object positions, lighting, distractors.
  • Recovery demonstrations: deliberately bobble; recover. Teaches the policy how to handle errors.
  • Consistent task definition: same goal every time.
  • Clean labels: text annotation matches what's actually happening.
  • Synced multi-modal: camera + proprioception + actions all timestamped accurately.
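The last item is mechanically checkable. A sketch that flags steps where the camera, proprioception, and action timestamps drift more than about half a frame apart at 30 Hz (the tolerance is an assumption; tune it to your sensor rates):

```python
def check_sync(cam_ts, proprio_ts, action_ts, tol_s=0.016):
    """Return indices of steps whose three stream timestamps spread
    wider than tol_s. Inputs are equal-length per-step timestamp
    lists in seconds; ~16 ms is half a frame at 30 Hz."""
    bad = []
    for i, (c, p, a) in enumerate(zip(cam_ts, proprio_ts, action_ts)):
        if max(c, p, a) - min(c, p, a) > tol_s:
            bad.append(i)
    return bad
```

Running this per episode before upload catches the silent killer: a camera pipeline that buffers frames and lags the action stream by several ticks.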

Bad data trains worse policies than no data. The discipline of careful collection is underrated.

The data-collection pipeline

  1. Define the task with examples.
  2. Set up the environment with controlled variation.
  3. Operator practices for 10–20 demos before recording.
  4. Record demos with cameras + proprioception + actions at 30 Hz.
  5. Annotate language descriptions per episode.
  6. Store as LeRobot dataset.
  7. Push to Hugging Face for sharing / version control.
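Steps 4–6 reduce to a fixed-rate recording loop. A sketch with `get_obs` and `get_action` as stand-ins for your sensor readers and teleop stream; the dict layout is illustrative, not LeRobot's actual API:

```python
import time

def record_episode(get_obs, get_action, max_steps=300, rate_hz=30):
    """Record one demonstration episode at a fixed rate.

    Sleeps until each tick's absolute deadline (rather than a fixed
    dt after work finishes), so sampling stays uniform even when
    observation reads are slow.
    """
    dt = 1.0 / rate_hz
    episode = {"observations": [], "actions": [], "timestamps": []}
    start = time.monotonic()
    for step in range(max_steps):
        episode["observations"].append(get_obs())
        episode["actions"].append(get_action())
        episode["timestamps"].append(time.monotonic() - start)
        next_tick = start + (step + 1) * dt
        time.sleep(max(0.0, next_tick - time.monotonic()))
    return episode
```

Each returned episode then gets its language annotation and is written into the dataset format in steps 5–6.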

Latency requirements

For natural teleop, end-to-end latency (puppet motion → real arm response) should be < 50 ms. Common bottlenecks:

  • Network (ROS topic over WiFi): 5–50 ms.
  • Joint position controller: 1–10 ms.
  • Camera stream: 30–100 ms (display latency).

For most arm teleop, wired Ethernet or local-only ROS works. WiFi-based teleop is feasible but feels sluggish; the problem is latency and jitter, not bandwidth, so budget extra headroom in the loop.
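Before blaming any one hop, measure the round trip. A sketch where `send_and_wait_ack` is a placeholder for sending a no-op command to the arm and blocking on the controller's acknowledgment:

```python
import statistics
import time

def measure_latency(send_and_wait_ack, n_trials=50):
    """Time n round trips and return (median_ms, approx_p95_ms).

    Compare the p95, not just the median, against the <50 ms target:
    occasional spikes are what make teleop feel sluggish.
    """
    samples = []
    for _ in range(n_trials):
        t0 = time.perf_counter()
        send_and_wait_ack()
        samples.append((time.perf_counter() - t0) * 1000.0)
    samples.sort()
    p95 = samples[int(0.95 * len(samples)) - 1]  # approximate percentile
    return statistics.median(samples), p95
```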

The build path

For a single-arm GELLO clone for an SO-100 arm (~$500 hardware total):

  1. Print the puppet's links (free STLs from Stanford's repo).
  2. Buy 6 small servos + 6 absolute encoders (~$200).
  3. Wire to an ESP32; flash GELLO firmware.
  4. Pair with a controlling computer running LeRobot.
  5. Calibrate joint zero positions.
  6. Start collecting demos.
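Step 5, calibrating joint zeros, is just averaging raw encoder readings while the puppet is held in a known reference pose, then subtracting the stored offsets from every subsequent reading. A sketch, with `read_raw_joints` as a placeholder for your encoder interface:

```python
def calibrate_zeros(read_raw_joints, n_samples=100):
    """Hold the puppet in its zero pose, average n_samples raw encoder
    readings per joint, and return the per-joint offsets."""
    n_joints = len(read_raw_joints())
    sums = [0.0] * n_joints
    for _ in range(n_samples):
        for j, v in enumerate(read_raw_joints()):
            sums[j] += v
    return [s / n_samples for s in sums]

def apply_offsets(raw, offsets):
    """Convert a raw encoder reading into calibrated joint angles."""
    return [r - o for r, o in zip(raw, offsets)]
```

Averaging over a hundred samples matters because cheap absolute encoders jitter by a few counts even when the arm is stationary.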

Three weekends for a working teleop setup. From there, every fine-tune you want to do is data-collection-bottlenecked, not rig-bottlenecked.

The OpenX angle

Open-X-Embodiment (Google DeepMind + 21 partner institutions, 2023+) standardizes data formats so demonstrations from one robot can train policies for many. Your GELLO-collected data on an SO-100 can contribute to a global pool spanning dozens of robot embodiments.

Submit to Hugging Face's LeRobotDataset format; the data joins the OpenX pool. Cross-embodiment training is one of the field's biggest active research bets.

Exercise

Build a phone-teleop interface for a single-arm robot: phone tilt → arm joint commands. Collect 50 demos of a single task. Fine-tune OpenVLA on the data. Compare success rate vs hand-coded policy. Even with the worst rig, fine-tuning beats scripted code on most realistic tasks. Then upgrade to GELLO and watch the success rate jump 20+ points.

Next

Tactile sensing — the modality that complements teleop visual feedback for delicate manipulation.
