RobotForge
Published · ~16 min

Reinforcement learning primer for roboticists

MDPs, value functions, policy gradients — the RL minimum for a robotics audience. Skip the chess-playing examples; learn the math you'll use to train a quadruped.

by RobotForge
#learning #rl #fundamentals

RL textbooks teach you to play Atari. RL papers teach you to play chess. Robot RL is a different game: continuous actions, sample inefficiency, sim-to-real, hardware safety. Here's the minimum theory you need to read robot-RL papers without flipping back to Sutton & Barto every page.

The Markov Decision Process

The mathematical setup. At each timestep $t$:

  • Robot observes state $s_t$.
  • Picks action $a_t$ according to policy $\pi(a_t \mid s_t)$.
  • Environment transitions: $s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$.
  • Robot receives reward $r_t = r(s_t, a_t)$.

The Markov assumption: the future depends only on the current state and action, not the full history. For partially-observable problems (robot can't see everything), use POMDPs — same math with hidden state.

Goal: find the optimal policy $\pi^*$ maximizing the expected discounted return:

$$J(\pi) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t\, r_t\right]$$

$\gamma \in [0, 1)$ is the discount factor — it encodes "prefer rewards now over rewards later." Typical: 0.99 for episodic robot tasks.
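To make the loop and the return concrete, here is a minimal sketch of one episode rollout, assuming a Gymnasium-style environment (`reset()`/`step()`) and a hypothetical `policy` callable; none of these names come from the lesson itself:

```python
import numpy as np

def rollout_return(env, policy, gamma=0.99, max_steps=1000):
    """Run one episode and accumulate the discounted return sum_t gamma^t * r_t."""
    obs, _ = env.reset()
    total, discount = 0.0, 1.0
    for _ in range(max_steps):
        action = policy(obs)                                       # a_t ~ pi(. | s_t)
        obs, reward, terminated, truncated, _ = env.step(action)   # s_{t+1}, r_t
        total += discount * reward
        discount *= gamma
        if terminated or truncated:
            break
    return total
```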

Value functions

The expected return from a state under policy $\pi$:

$$V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t\, r_t \;\middle|\; s_0 = s\right]$$

The state-action value (Q-function):

$$Q^\pi(s, a) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t\, r_t \;\middle|\; s_0 = s,\; a_0 = a\right]$$

$Q^\pi(s, a)$ is "what's the expected return if I take action $a$ now and then follow $\pi$ forever after?"

The Bellman equation links them recursively:

$$Q^\pi(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\!\left[V^\pi(s')\right], \qquad V^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\!\left[Q^\pi(s, a)\right]
$$

Most RL algorithms are tractable approximations of this recursion.
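To see the recursion in action, here is a tiny tabular sketch of the Bellman backup for a fixed policy on a discrete MDP; the arrays `Q`, `P`, `R`, and `pi` are made-up inputs for illustration, not part of the lesson:

```python
import numpy as np

def bellman_backup(Q, P, R, pi, gamma=0.99):
    """One application of the Bellman operator for a fixed policy pi.

    Q:  (S, A)    current Q-value estimates
    P:  (S, A, S) transition probabilities P(s' | s, a)
    R:  (S, A)    rewards r(s, a)
    pi: (S, A)    policy probabilities pi(a | s)
    """
    V = (pi * Q).sum(axis=1)      # V^pi(s)    = E_{a ~ pi}[Q^pi(s, a)]
    return R + gamma * P @ V      # Q^pi(s, a) = r(s, a) + gamma * E_{s'}[V^pi(s')]

# Iterating the backup converges to Q^pi (policy evaluation):
# for _ in range(1000): Q = bellman_backup(Q, P, R, pi)
```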

The two policy families

1. Value-based: learn Q, derive π

Train a network $Q_\phi(s, a)$ to satisfy the Bellman equation. At each state, pick the action that maximizes $Q_\phi(s, a)$.

For continuous actions (robotics), the naive $\arg\max_a Q(s, a)$ doesn't work — the action space is infinite. Workarounds:

  • DDPG: a separate "actor" network outputs the approximate $\arg\max$ action directly.
  • TD3, SAC: variants of DDPG with stability tricks.

These are off-policy methods: they can learn from data collected by any policy, including past or random ones. Sample-efficient, hard to stabilize.
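A sketch of how the actor stands in for the argmax inside the Bellman target (DDPG-style). `target_actor`, `target_critic`, and the `batch` layout are hypothetical, and the extra stabilization tricks of TD3/SAC are omitted:

```python
import torch

def ddpg_critic_target(batch, target_actor, target_critic, gamma=0.99):
    """Continuous-action Bellman target: y = r + gamma * Q_target(s', mu_target(s'))."""
    with torch.no_grad():
        # The actor approximates argmax_a Q(s', a) without enumerating actions.
        next_action = target_actor(batch["next_state"])
        next_q = target_critic(batch["next_state"], next_action)
    # batch["done"] is a 0/1 float tensor marking terminal transitions.
    return batch["reward"] + gamma * (1.0 - batch["done"]) * next_q
```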

2. Policy-based: learn π directly

Parametrize the policy $\pi_\theta$ and adjust $\theta$ to maximize expected return. The key tool: the policy gradient theorem:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^{\pi_\theta}(s_t, a_t)\right]$$

"Increase the log-probability of actions that lead to high return; decrease the log-probability of low-return actions." Pure intuition, derived rigorously.

Practical algorithms: REINFORCE, A2C/A3C, PPO. On-policy: each batch of training data must come from the current policy. Stable but sample-inefficient.
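A minimal REINFORCE-style loss from one trajectory, as a sketch: `policy` is a hypothetical module returning a torch distribution with a scalar `log_prob`, and Monte Carlo returns-to-go stand in for $Q^{\pi_\theta}$:

```python
import torch

def reinforce_loss(policy, states, actions, rewards, gamma=0.99):
    """Monte Carlo policy gradient: maximize E[log pi(a|s) * G_t]."""
    # Returns-to-go G_t = sum_{k >= t} gamma^(k-t) r_k, computed backwards.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)

    log_probs = torch.stack([policy(s).log_prob(a) for s, a in zip(states, actions)])
    return -(log_probs * returns).mean()   # minimize the negative objective
```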

Sparse vs dense rewards

The reward function determines what your policy learns. Two extremes:

  • Dense reward: every step provides feedback. "Distance to goal," "speed forward," "energy efficiency." Easy to learn from; risk of reward hacking (policy exploits the reward, ignores the spec).
  • Sparse reward: 0 every step until task succeeds, then +1. "Did the peg enter the hole?" Hard to learn from; aligned with the actual goal.
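To make the contrast concrete, here is a sketch of both styles for a hypothetical peg-insertion task; the state fields, weights, and 5 mm success threshold are all made up for illustration:

```python
import numpy as np

def dense_reward(state):
    """Shaped feedback every step: closer to the hole and less effort is better."""
    dist = np.linalg.norm(state["peg_pos"] - state["hole_pos"])
    effort = np.square(state["joint_torques"]).sum()
    return -1.0 * dist - 0.01 * effort

def sparse_reward(state):
    """+1 only on success, 0 otherwise: aligned with the goal, hard to learn from."""
    dist = np.linalg.norm(state["peg_pos"] - state["hole_pos"])
    return 1.0 if dist < 0.005 else 0.0
```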

Most robot RL uses dense rewards with careful hand-tuning. Modern alternatives:

  • Hindsight experience replay: relabel failed episodes with achieved-state goals; turns sparse into dense.
  • Curriculum learning: start with easy tasks, increase difficulty.
  • Imitation as reward: bootstrap with demonstrations, then RL on top.
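A rough sketch of the hindsight idea: take a failed, goal-conditioned transition and pretend the state you actually reached was the goal all along. The dictionary layout and `compute_reward` helper are assumptions, not an API from the lesson:

```python
def her_relabel(transition, achieved_goal, compute_reward):
    """Relabel a failed transition with the goal the robot actually achieved."""
    relabeled = dict(transition)
    relabeled["goal"] = achieved_goal
    # Recompute the reward against the new goal; a former failure becomes a success.
    relabeled["reward"] = compute_reward(transition["next_state"], achieved_goal)
    return relabeled
```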

Exploration

Without exploration, the policy never tries new things and gets stuck. Strategies:

  • ε-greedy: with probability ε, take a random action. Discrete actions only.
  • Gaussian noise: add noise to the policy output. Standard in robotics.
  • Entropy regularization: push the policy toward higher-entropy distributions. SAC's central trick.
  • Curiosity: reward the policy for visiting unfamiliar states. Useful when reward is very sparse.
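A sketch of the two strategies most common in robotics, Gaussian action noise and an entropy bonus; the function names, noise scale, and temperature `alpha` are illustrative assumptions:

```python
import math
import torch

def explore_gaussian(mean_action, noise_std=0.1, low=-1.0, high=1.0):
    """Add zero-mean Gaussian noise to a deterministic policy output (DDPG/TD3-style)."""
    noisy = mean_action + noise_std * torch.randn_like(mean_action)
    return noisy.clamp(low, high)

def entropy_bonus(log_std, alpha=0.2):
    """SAC-style bonus: reward higher-entropy (more exploratory) diagonal-Gaussian policies."""
    per_dim_entropy = 0.5 * (1.0 + math.log(2.0 * math.pi)) + log_std
    return alpha * per_dim_entropy.sum(-1).mean()
```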

What's different about robotics

  • Continuous everything: states, actions, rewards. Can't enumerate; must function-approximate.
  • Sample inefficiency: a real robot collects ~1 hour of experience per hour, while PPO can need on the order of 10⁸ environment steps. The practical answer: train in sim.
  • Sim-to-real: the bridge that makes sim training useful. Covered in detail in the sim-to-real lesson.
  • Hardware safety: random actions can break expensive equipment. Constrained exploration matters.
  • Reset complexity: real robots can't always be teleported back to "start state." Episode resets are a real engineering problem.

The four-algorithm shortlist

Algorithm   Type           Use for
PPO         On-policy      Sim training, default
SAC         Off-policy     Sample-efficient sim, real-world RL
TD-MPC      Model-based    Long horizons, tasks needing planning
HIL-SERL    Real-world RL  Production-grade hardware training

Master PPO + SAC and you can read 80% of robotics RL papers.

What to read after this lesson

  • Sutton & Barto: the canonical RL textbook. Free online.
  • Lilian Weng's blog: deep but readable summaries of every modern algorithm.
  • OpenAI Spinning Up: clean implementations + walkthroughs.
  • Specific to robotics: Tedrake's Underactuated Robotics (chapters on RL).

Exercise

In MuJoCo Playground, train a PPO policy for the cartpole environment. Then change the reward function from "stay upright" to "stay upright AND move forward." Watch the trajectory shift. Tweak weights between the two reward terms; see how policy behavior changes. This is reward shaping; you'll do it for every project.
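One way to structure that experiment is to express the reward as a weighted sum of terms and sweep the weights. A sketch with made-up state fields and weights; MuJoCo Playground's actual cartpole reward is defined differently:

```python
def shaped_reward(state, w_upright=1.0, w_forward=0.5):
    """Weighted combination of 'stay upright' and 'move forward' terms."""
    upright = 1.0 - abs(state["pole_angle"]) / 3.14159   # 1 when vertical, 0 when horizontal
    forward = state["cart_velocity"]                      # positive when moving forward
    return w_upright * upright + w_forward * forward
```

Changing `w_forward` relative to `w_upright` is exactly the knob the exercise asks you to turn.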

Next

PPO in practice — the actual hyperparameters that determine whether your training run converges or thrashes.
