RobotForge
Published · ~13 min

Real-world RL: HIL-SERL and friends

Skip the simulator entirely. Train on hardware, with humans in the loop, demonstrations seeding the buffer, safety wrappers everywhere. The methods letting hobbyists train policies on actual robots in 2026.

by RobotForge
#learning #real-world-rl #hil-serl

For a decade, sim-to-real was the only way to train RL on robots; the reality gap was the limit. HIL-SERL (Human-in-the-Loop Sample-Efficient Robotic RL, Levine et al., 2024) and similar methods skip the sim entirely — train directly on hardware with an hour or two of real-world experience. Combined with demonstrations and safety wrappers, real-world RL is now production-viable for narrow tasks.

Why "skip the sim" is finally credible

Three things changed:

  • Sample efficiency: SAC and its successors converge in 10–100× fewer samples than PPO. An hour of robot time yields tens of thousands of transitions, which is enough.
  • Imitation bootstrapping: a few human demos seed the replay buffer with good experience, so pure exploration from scratch isn't needed.
  • Safety frameworks: hard constraints (joint limits, force/torque (F/T) thresholds) keep exploration safe. Reset wrappers handle "the robot is in a bad pose" automatically.

HIL-SERL: the recipe

Levine et al.'s HIL-SERL achieves near-100% success on contact-rich tasks (peg-in-hole, plug insertion, screw tightening) with ~1 hour of real-world training. The recipe (a minimal code sketch follows the list):

  1. Collect ~50–100 demos with a teleop interface and store them in the replay buffer.
  2. Initialize SAC with simple BC pretraining: 1k–10k gradient steps on the demos.
  3. Run online RL: SAC explores from the BC-initialized policy. The demos stay in the buffer; new transitions are added alongside them.
  4. A human supervisor flags failures (object dropped, arm hit an obstacle). The supervisor's corrective "rescue" trajectories also go into the buffer.
  5. After ~1 hour of training, the policy reaches near-100% success.
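
Here is a minimal sketch of how those steps fit together, assuming a gymnasium-style env (already wrapped with the safety layers described below) and a hypothetical SACAgent object exposing act / bc_update / sac_update. This is illustrative scaffolding, not the HIL-SERL codebase:

```python
import numpy as np

class ReplayBuffer:
    """Minimal FIFO buffer. Real implementations store arrays, not Python lists."""
    def __init__(self, capacity=200_000):
        self.data, self.capacity = [], capacity

    def add(self, transition):
        if len(self.data) >= self.capacity:
            self.data.pop(0)
        self.data.append(transition)

    def sample(self, batch_size):
        idx = np.random.randint(len(self.data), size=batch_size)
        return [self.data[i] for i in idx]


def train(agent, env, demo_episodes, bc_steps=5_000, online_steps=30_000):
    buffer = ReplayBuffer()

    # Step 1: every (obs, action, reward, next_obs, done) tuple from the
    # teleop demos goes straight into the replay buffer.
    for episode in demo_episodes:
        for transition in episode:
            buffer.add(transition)

    # Step 2: BC pretraining -- supervised updates on the demo transitions only.
    for _ in range(bc_steps):
        agent.bc_update(buffer.sample(256))

    # Steps 3-4: online RL from the BC-initialized policy, with a human watching.
    obs, _ = env.reset()
    for _ in range(online_steps):
        action = agent.act(obs)
        next_obs, reward, terminated, truncated, info = env.step(action)

        # Step 4: if the supervisor took over, store their corrective action
        # instead of the policy's ("human_intervention"/"human_action" are
        # assumed info keys for this sketch, not a standard env interface).
        if info.get("human_intervention", False):
            action = info["human_action"]

        buffer.add((obs, action, reward, next_obs, terminated))
        agent.sac_update(buffer.sample(256))

        obs = next_obs if not (terminated or truncated) else env.reset()[0]
```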

Production deployment: same policy. No sim-to-real bridge to engineer.

The safety stack

Real robots break easily. The safety stack makes online RL practical:

  • Joint-limit wrappers: clip actions before they go to the robot. Hard floor.
  • F/T wrappers: episode terminates with negative reward if force exceeds threshold. Prevents jamming.
  • Velocity limits: never command above safe speed.
  • Workspace boundaries: clip end-effector to a known-safe volume. Prevents the arm from reaching out of its work envelope.
  • Human watchdog: a person nearby with an emergency-stop button. Fast, simple, very effective.

Most of these are framework-agnostic — wrap your environment, run any RL algorithm.
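
As a rough illustration, here are two of these wrappers in gymnasium style; the info["ee_force"] key and the thresholds are assumptions for the sketch, not a standard interface:

```python
import numpy as np
import gymnasium as gym

class ActionClipWrapper(gym.ActionWrapper):
    """Hard floor: clip every commanded action into a known-safe range."""
    def __init__(self, env, low, high):
        super().__init__(env)
        self.low, self.high = np.asarray(low), np.asarray(high)

    def action(self, action):
        return np.clip(action, self.low, self.high)


class ForceLimitWrapper(gym.Wrapper):
    """Terminate with a penalty when the wrist F/T reading exceeds a threshold.
    Assumes the underlying env reports the force vector in info["ee_force"]."""
    def __init__(self, env, max_force=20.0, penalty=-1.0):
        super().__init__(env)
        self.max_force, self.penalty = max_force, penalty

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        if np.linalg.norm(info.get("ee_force", np.zeros(3))) > self.max_force:
            reward += self.penalty
            terminated = True
        return obs, reward, terminated, truncated, info
```

Stack them like any other wrappers, e.g. env = ForceLimitWrapper(ActionClipWrapper(env, low, high)), and the RL algorithm never sees an unsafe command.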

The reset problem

Episodes need to start from a known state. In sim, reset is free. On hardware, reset is expensive — you have to physically restore the scene. Three patterns:

  • Manual reset: human places the parts back. 30 seconds per episode; doesn't scale.
  • Scripted reset: the robot itself moves the parts back. Common for blocks and other rigid objects. The reset script is part of the system.
  • Reset-free RL: design tasks where any state is a valid start. Run continuous learning without explicit resets. Used in HIL-SERL's later work.

Reset-free is the goal but hard to achieve for arbitrary tasks. Most production setups use scripted resets.
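
A scripted reset fits naturally as one more environment wrapper. A sketch, with reset_routine standing in for whatever task-specific motion sequence you write:

```python
import gymnasium as gym

class ScriptedResetWrapper(gym.Wrapper):
    """Run a fixed "put the scene back" routine before every episode.
    `reset_routine` is whatever motion sequence restores your scene,
    e.g. move home, re-grasp the peg, drop it back in the start fixture."""
    def __init__(self, env, reset_routine):
        super().__init__(env)
        self.reset_routine = reset_routine

    def reset(self, **kwargs):
        self.reset_routine(self.env)  # physically restore the scene first
        return self.env.reset(**kwargs)
```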

The "asymmetric actor-critic" trick

At training time, the critic (Q-function) sees ground-truth state (e.g., precise object pose from sim or external tracking). The actor sees only the partial observations the deployed robot will have (e.g., wrist camera image). Train the critic with the rich state; the actor learns to predict good actions from limited observation.

Result: faster training (rich Q-function converges quickly), deployable policy (actor uses what's actually available).

Used in many recent real-world RL papers; combines well with SAC.
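
The structure is easy to see in code. A minimal PyTorch sketch; the dimensions, network sizes, and deterministic actor head are placeholders, not a faithful SAC implementation:

```python
import torch
import torch.nn as nn

class AsymmetricActorCritic(nn.Module):
    """Critic sees privileged state (e.g. object pose from external tracking);
    the actor sees only what the deployed robot will have (here, a wrist-camera
    feature vector). A real SAC actor outputs a distribution and uses twin critics."""
    def __init__(self, image_feat_dim=256, state_dim=13, action_dim=7):
        super().__init__()
        self.actor = nn.Sequential(
            nn.Linear(image_feat_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),
        )
        self.critic = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def act(self, image_features):
        # Deployment path: only the observations the real robot has.
        return self.actor(image_features)

    def q_value(self, privileged_state, action):
        # Training-only path: the critic never runs on the deployed robot.
        return self.critic(torch.cat([privileged_state, action], dim=-1))
```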

Demonstrations: where they fit

Demonstrations don't replace exploration — they bootstrap it. Three useful patterns:

  • Pretraining: behavior-clone the policy on demos before starting RL. Initialization beats random.
  • Mixed batches: every training batch is half demonstration, half online experience. Keeps the policy anchored.
  • Reward shaping: provide intermediate rewards based on similarity to demonstrated behavior. Eases sparse-reward tasks.

The lower the variance in demonstrations, the easier they are to use. Standardize the teleop setup.
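
The mixed-batch pattern in particular is a one-liner once demos and online experience live in separate buffers. A sketch, assuming buffers with a sample() method like the one in the recipe above:

```python
def sample_mixed_batch(demo_buffer, online_buffer, batch_size=256):
    """RLPD-style mixing: half of every gradient batch comes from demonstrations,
    half from online experience, which keeps the policy anchored to the demos."""
    half = batch_size // 2
    return demo_buffer.sample(half) + online_buffer.sample(batch_size - half)
```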

The 2026 production stack

For a real-world RL project today:

  1. Hardware: a torque/impedance-controlled arm (Franka, KUKA) or a position-controlled arm with a wrist F/T sensor (UR + FT300), plus a wrist camera and a scene camera.
  2. Teleop rig: GELLO clone or similar puppet system for demo collection.
  3. Software: LeRobot for dataset management; Stable-Baselines3 or the reference SERL/HIL-SERL codebase for the SAC implementation.
  4. Safety: workspace constraints, F/T limits, scripted reset.
  5. Workflow: 50 demos (1 hour) → BC pretrain (30 minutes) → online RL (~1 hour with supervisor present).

End-to-end: a single afternoon, for what used to be a multi-week sim-to-real project.

What this enables

  • Custom-task fine-tuning: clients can specialize a base policy to their specific parts in hours.
  • Continual learning: deployed robots improve over time from each new task.
  • Demonstration-light tasks: 50 demos vs the 200–500 typical for VLA fine-tuning.
  • Hardware that doesn't have a sim: novel grippers, soft robots, exotic mechanisms.

What it doesn't help with

  • Multi-task generalization: real-world RL specializes to one task at a time. For breadth, train a VLA.
  • Long-horizon plans: sample efficiency is good but not magic. Tasks needing many decisions still benefit from sim training.
  • Safety-critical applications: the safety wrappers help, but RL exploration is fundamentally probabilistic. Use classical control if a single failure is unacceptable.

Recent papers worth reading

  • HIL-SERL (Levine et al., 2024) — the canonical reference.
  • SERL (predecessor) — same idea, no human-in-the-loop.
  • RLPD (Reinforcement Learning with Prior Data) — clean way to mix demos with RL.
  • Diffusion-policy + RL hybrids — diffusion for the demo-conditioned policy, RL for fine-tuning.

Exercise

On a real arm (or a hardware-in-the-loop sim), set up HIL-SERL:

  1. Build a teleop interface (GELLO or VR).
  2. Collect 30 demos of a contact-rich task (peg insert, drawer open).
  3. Pretrain a SAC actor + critic via behavior cloning.
  4. Run online SAC for 1 hour with you supervising; flag failures.
  5. Measure success rate before training (BC only) and after.

The before/after is usually 60% → 99%. That's what real-world RL delivers — and why it's becoming the production-grade fine-tuning approach.

Next

Teleoperation rigs — the demonstration-collection hardware that all of this depends on. ALOHA, GELLO, phone-teleop, VR.
