Teleoperation rigs: ALOHA, GELLO, phone-teleop
The hardware that generates VLA training data. Build recipes, cost tradeoffs, and open designs from $20 phone-teleop to $20k full-body MoCap.
Modern robot learning is data-bottlenecked. Every VLA, every imitation policy, every fine-tuned diffusion model needs hundreds to thousands of demonstrations. The cheapest, fastest, highest-quality way to collect them is teleoperation — a human moves a "puppet" rig and the real robot mirrors. The 2023+ wave of low-cost teleop designs is what made VLA fine-tuning a hobbyist activity. Here's the landscape.
Why not just program the robot?
Hand-coded behaviors don't scale to long-tail manipulation. A human teleoperator demonstrating "fold this towel" produces a policy in 30 minutes that beats 3 weeks of scripted programming. Better still, the resulting policy generalizes across towels, because the natural variation in the human's demonstrations is baked into the training data.
The teleop rig is the bottleneck on demonstration throughput. Better rig = more data per hour = better fine-tunes.
The rig families
1. Phone-teleop ($20)
Use a phone's IMU + screen as a controller. Tilt the phone → command end-effector pose. Tap to grasp. Cheap, accessible, but coarse — typically 1 demo per minute.
Use for: bootstrapping; very simple tasks; user studies.
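The tilt-to-pose mapping can be sketched in a few lines. This is a hypothetical mapping (pitch drives x, roll drives y, yaw drives gripper rotation), with a deadband to reject hand tremor; the gains and axis assignments are assumptions, not a standard:

```python
import math

def tilt_to_delta_pose(roll, pitch, yaw, dt, gain=0.15, deadband=0.05):
    """Map phone IMU tilt (radians) to an end-effector velocity command.

    Angles inside the deadband are ignored; beyond it, command scales
    linearly. Returns per-timestep deltas (meters, meters, radians).
    """
    def shaped(angle):
        if abs(angle) < deadband:
            return 0.0
        return gain * (angle - math.copysign(deadband, angle))

    dx = shaped(pitch) * dt   # tilt forward/back -> translate along x
    dy = shaped(roll) * dt    # tilt left/right  -> translate along y
    dyaw = shaped(yaw) * dt   # twist the phone  -> rotate the gripper
    return dx, dy, dyaw
```

The deadband is what makes phone teleop tolerable at all: without it, IMU noise turns into constant end-effector drift.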
2. Joystick / gamepad ($30)
Two analog sticks for end-effector translation + rotation; buttons for gripper. Used in early ROS demos. Slightly faster than phone; still coarse.
3. GELLO ($150–300)
A scaled-down puppet of the robot's arm with joint encoders. The user moves the puppet; the real arm mirrors it joint-by-joint. Open-source design from UC Berkeley (2023).
Strengths: very natural for users; high demonstration speed (5+ demos/minute); cheap.
Weaknesses: needs custom mechanical build; requires a matching puppet for each robot.
Use for: most arm fine-tuning. Production default for VLA data collection in 2024–26.
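The core of a GELLO-style rig is a tight mirroring loop: read the puppet's joint encoders, smooth, command the follower. A minimal sketch, where `read_puppet_joints` and `command_robot_joints` are placeholders for your encoder and robot interfaces:

```python
import time

def mirror_loop(read_puppet_joints, command_robot_joints,
                rate_hz=200, alpha=0.3, steps=None):
    """Joint-space mirroring: read puppet encoders, low-pass filter to
    suppress encoder noise, stream position targets to the follower arm.

    `steps=None` runs forever; pass an int to run a fixed number of
    iterations (useful for testing).
    """
    period = 1.0 / rate_hz
    filtered = list(read_puppet_joints())  # seed the filter
    n = 0
    while steps is None or n < steps:
        t0 = time.monotonic()
        raw = read_puppet_joints()
        # exponential smoothing: alpha=1 is raw passthrough, lower is smoother
        filtered = [alpha * r + (1 - alpha) * f
                    for r, f in zip(raw, filtered)]
        command_robot_joints(filtered)
        n += 1
        time.sleep(max(0.0, period - (time.monotonic() - t0)))
```

Because both sides share a kinematic structure, no inverse kinematics is needed; that's why puppet rigs feel so much more natural than pose-mapped controllers.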
4. ALOHA ($25k commercial; $5k DIY)
Bimanual puppet teleop: two leader (puppet) arms drive two follower arms. From Stanford's 2023 ALOHA paper.
Strengths: enables bimanual tasks (folding, threading, cooking); high quality data.
Weaknesses: expensive; large physical footprint; setup time.
Use for: serious bimanual manipulation research / VLA fine-tuning.
5. VR controllers (Quest 3, Vive: $300–600)
Hand-tracking controllers; map controller pose → end-effector pose. Gripper via trigger.
Strengths: 6-DOF hand tracking out of the box; works in any room with the headset.
Weaknesses: less precise than puppet rigs; users get tired holding hands up.
6. Apple Vision Pro ($3500)
The 2024 entrant. Native hand-tracking + room-scale awareness; adopted directly in recent academic humanoid-teleop pipelines.
Strengths: free hands (no controllers); seamless 6-DOF; high resolution; full body via add-on tracking.
Weaknesses: cost; limited fine-grained tactile feedback.
7. Full-body MoCap ($10k–50k)
Vicon, OptiTrack, or wearable Xsens IMU suits. Track every body joint. Used for humanoid teleop.
Strengths: highest fidelity; full body.
Weaknesses: expensive; requires room setup; per-session calibration.
8. Custom haptic devices ($5k–25k)
Force-feedback rigs (3D Systems Touch, Phantom Premium) give the operator haptic feedback. Standard in surgical and hazardous-environment teleop; niche in robot learning.
The choice in 2026
For a hobbyist or small team:
- Single-arm fine-tuning → GELLO clone for that arm. ~$200 in parts; weekend build.
- Bimanual fine-tuning → ALOHA DIY (~$5k) or two GELLOs side-by-side.
- Mobile manipulator → GELLO + base teleop joystick.
- Humanoid → Apple Vision Pro + hand-tracking; open-source HumanPlus pipeline.
For a serious lab:
- Production ALOHA setups, 4–8 cells running in parallel.
- Vicon room for high-fidelity humanoid teleop.
- Custom haptic for delicate tasks (assembly, cooking).
Demonstration throughput
Empirical numbers from published work:
| Rig | Demos/hour |
|---|---|
| Phone teleop | ~30 |
| VR controllers | ~80 |
| GELLO | ~150 |
| ALOHA | ~120 (bimanual) |
| Vision Pro | ~120 |
For a 200-demo dataset: ~1.5 hours with GELLO; ~7 hours with phone. The rig pays for itself in operator time.
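The arithmetic is worth making explicit when planning a collection run. Using the rates from the table above (the dictionary keys here are just shorthand labels):

```python
# Demos/hour from the table above.
RATES = {"phone": 30, "vr": 80, "gello": 150, "aloha": 120, "vision_pro": 120}

def hours_for(n_demos, rig):
    """Idealized collection time, ignoring resets and operator breaks."""
    return n_demos / RATES[rig]

for rig in RATES:
    print(f"{rig}: {hours_for(200, rig):.1f} h for 200 demos")
```

Real sessions run slower than the ideal rate (scene resets, failed demos you discard), so pad these estimates by 30–50%.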
What makes good demonstration data
- Variety in starting conditions: object positions, lighting, distractors.
- Recovery demonstrations: deliberately bobble; recover. Teaches the policy how to handle errors.
- Consistent task definition: same goal every time.
- Clean labels: text annotation matches what's actually happening.
- Synced multi-modal: camera + proprioception + actions all timestamped accurately.
Bad data trains worse policies than no data. The discipline of careful collection is underrated.
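Some of these properties can be checked automatically. A sketch of a sync check, assuming each modality is recorded as an equal-length list of per-frame timestamps in seconds (a hypothetical recording layout, not any specific tool's format):

```python
def check_sync(camera_ts, proprio_ts, action_ts, tol_s=0.010):
    """Return indices of frames where the three modality timestamps
    drift apart by more than `tol_s` seconds.

    A nonempty result means the episode needs re-recording or
    re-alignment before it goes into a training set.
    """
    bad = []
    for i, (c, p, a) in enumerate(zip(camera_ts, proprio_ts, action_ts)):
        if max(c, p, a) - min(c, p, a) > tol_s:
            bad.append(i)
    return bad
```

Running a check like this on every episode at collection time is far cheaper than debugging a policy that silently trained on misaligned observation/action pairs.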
The data-collection pipeline
- Define the task with examples.
- Set up the environment with controlled variation.
- Operator practices for 10–20 demos before recording.
- Record demos with cameras + proprioception + actions at 30 Hz.
- Annotate language descriptions per episode.
- Store as LeRobot dataset.
- Push to Hugging Face for sharing / version control.
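The recording step of the pipeline above reduces to a fixed-rate loop that stamps every modality with a shared clock. A minimal sketch; `get_frame`, `get_state`, and `get_action` are placeholders for your camera, proprioception, and teleop-action sources:

```python
import time

def record_episode(get_frame, get_state, get_action,
                   duration_s=10.0, hz=30):
    """Record one demo episode at a fixed rate.

    Each step stores all modalities under one timestamp so downstream
    tools (e.g. a LeRobot-style dataset converter) can align them.
    """
    steps, period, t0 = [], 1.0 / hz, time.monotonic()
    while time.monotonic() - t0 < duration_s:
        t = time.monotonic() - t0
        steps.append({
            "timestamp": t,
            "image": get_frame(),     # camera observation
            "state": get_state(),     # joint positions, gripper
            "action": get_action(),   # teleop command at this step
        })
        # sleep off the remainder of this control period
        time.sleep(max(0.0, period - ((time.monotonic() - t0) - t)))
    return steps
```

Sampling observation and action in the same iteration matters: if the action is read on a different clock than the image, the policy learns a systematic observation/action lag.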
Latency requirements
For natural teleop, end-to-end latency (puppet motion → real arm response) should be < 50 ms. Common bottlenecks:
- Network (ROS topic over WiFi): 5–50 ms.
- Joint position controller: 1–10 ms.
- Camera stream: 30–100 ms (display latency).
For most arm teleop, wired Ethernet or a local-only ROS graph is enough. WiFi teleop works, but latency jitter makes it feel sluggish; prefer a wired link for serious collection.
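Before blaming any one component, measure the round trip. A sketch of a latency probe, where `send_command` and `wait_for_echo` are stand-ins for your transport layer (e.g. publishing a joint target and blocking until the robot-side echo returns):

```python
import statistics
import time

def measure_latency(send_command, wait_for_echo, n=50):
    """Estimate round-trip teleop latency in milliseconds.

    Timestamps a command, blocks until the robot-side echo arrives,
    repeats n times. Returns (median_ms, p95_ms); the p95 tail is what
    the operator actually feels as sluggishness.
    """
    samples = []
    for _ in range(n):
        t0 = time.monotonic()
        send_command()
        wait_for_echo()
        samples.append((time.monotonic() - t0) * 1000.0)
    samples.sort()
    return statistics.median(samples), samples[int(0.95 * (n - 1))]
```

If the median is under 50 ms but the p95 is triple that, the transport is jittery, and that usually points at WiFi.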
The build path
For a single-arm GELLO clone for an SO-100 arm (~$500 hardware total):
- Print the puppet's links (free STLs from the open-source GELLO repo).
- Buy 6 small servo units to use as absolute joint encoders (~$200).
- Wire to an ESP32; flash GELLO firmware.
- Pair with a controlling computer running LeRobot.
- Calibrate joint zero positions.
- Start collecting demos.
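The calibration step above amounts to averaging raw encoder counts in a reference pose and storing per-joint offsets. A sketch under assumed 12-bit (4096 counts/rev) encoders; `read_raw_counts` is a placeholder returning one count per joint:

```python
import math

def calibrate_zeros(read_raw_counts, counts_per_rev=4096, n_samples=100):
    """Average raw encoder counts while the puppet is held in its
    reference pose; return per-joint zero offsets in counts."""
    sums = None
    for _ in range(n_samples):
        counts = read_raw_counts()
        sums = list(counts) if sums is None else [
            s + c for s, c in zip(sums, counts)]
    return [(s / n_samples) % counts_per_rev for s in sums]

def counts_to_radians(raw, zero, counts_per_rev=4096):
    """Convert a raw count to a joint angle relative to the calibrated
    zero, wrapped to (-pi, pi]."""
    frac = ((raw - zero) % counts_per_rev) / counts_per_rev
    angle = frac * 2 * math.pi
    return angle - 2 * math.pi if angle > math.pi else angle
```

Averaging matters because a single noisy read at calibration time becomes a permanent joint offset in every demonstration you collect afterward.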
A few weekends and you have a working teleop setup. From there, every fine-tune you want to do is data-collection-bottlenecked, not rig-bottlenecked.
The OpenX angle
Open-X-Embodiment (Google DeepMind plus dozens of academic labs, 2023+) standardizes data formats so demonstrations from one robot can train policies for many. Your GELLO-collected SO-100 data can join a cross-embodiment pool spanning dozens of robot platforms.
Export in Hugging Face's LeRobotDataset format and push it; the data joins the OpenX pool. Cross-embodiment training is one of the field's biggest active research bets.
Exercise
Build a phone-teleop interface for a single-arm robot: phone tilt → arm joint commands. Collect 50 demos of a single task. Fine-tune OpenVLA on the data. Compare success rate vs hand-coded policy. Even with the worst rig, fine-tuning beats scripted code on most realistic tasks. Then upgrade to GELLO and watch the success rate jump 20+ points.
Next
Tactile sensing — the modality that complements teleop visual feedback for delicate manipulation.