Writing your own sim environment (Gymnasium API)
The interface every RL framework expects. Wrap any simulator behind reset/step/observation_space, and the entire RL ecosystem can train on it. Four-method protocol; thousands of trainers.
RL training frameworks (Stable-Baselines3, CleanRL, RSL-RL, RLlib) all expect environments to look the same. The lingua franca is Gymnasium (formerly OpenAI Gym, before the maintainership change in 2022). Four methods, two properties, and you're done. Your sim becomes trainable by every algorithm in the ecosystem.
The protocol
import gymnasium
from gymnasium import spaces

class MyEnv(gymnasium.Env):
    observation_space: spaces.Space
    action_space: spaces.Space

    def reset(self, *, seed=None, options=None):
        # ... reset state ...
        return obs, info

    def step(self, action):
        # ... apply action, advance physics ...
        return obs, reward, terminated, truncated, info

    def render(self):
        # ... return RGB image or None ...
        ...

    def close(self):
        # ... cleanup ...
        ...
Four methods (reset, step, render, close). Two properties (observation_space, action_space). That's the whole protocol. Anything implementing it can be trained by any modern RL library.
The two key properties
action_space
The set of actions the agent can take. Common types:
- spaces.Discrete(n): choose one of n discrete actions.
- spaces.Box(low, high, shape): continuous vector with element-wise bounds. Most robotics.
- spaces.MultiDiscrete([n1, n2, ...]): multiple independent discrete actions.
For a 7-DOF arm: Box(low=-1, high=1, shape=(7,)) with policy outputs scaled to joint torques.
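For concreteness, here is one way those spaces get constructed (the specific shapes and bounds below are illustrative, not from any particular robot):

import numpy as np
from gymnasium import spaces

gripper = spaces.Discrete(2)                                          # open / close
arm = spaces.Box(low=-1.0, high=1.0, shape=(7,), dtype=np.float32)   # normalized 7-DOF torques
base = spaces.MultiDiscrete([3, 3])                                   # x and y: back / stay / forward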
observation_space
The set of observations the agent receives. Same types. Often a Dict for multi-modal:
spaces.Dict({
    'joint_pos': spaces.Box(-np.pi, np.pi, (7,)),
    'joint_vel': spaces.Box(-10, 10, (7,)),
    'image': spaces.Box(0, 255, (3, 224, 224), dtype=np.uint8),
})
The reset and step methods
reset
Initialize the environment to a starting state. Returns the initial observation and an info dict. Optionally takes a seed for reproducibility.
step
Apply the action, advance physics, return:
- obs: new observation.
- reward: scalar reward for this step.
- terminated: True if the episode ended naturally (success/failure).
- truncated: True if cut short (time limit, but the task could continue).
- info: dict of diagnostics (whatever you want to log).
The terminated/truncated split (Gymnasium's replacement for the old single done flag) matters for value bootstrapping in TD-learning algorithms: a terminated episode has no future return to account for, a truncated one does.
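Concretely, the distinction shows up in the bootstrap term of a one-step value target. A minimal sketch (value_fn and gamma are placeholder names, not from any specific library):

def td_target(reward, next_obs, terminated, value_fn, gamma=0.99):
    """One-step TD target; value_fn stands in for your critic."""
    if terminated:
        # The episode really ended: there is no future return to bootstrap.
        return reward
    # Includes truncated episodes: the task could have continued,
    # so bootstrap from the value of the next observation.
    return reward + gamma * value_fn(next_obs)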
The 50-line PyBullet wrapper
import gymnasium as gym
from gymnasium import spaces
import numpy as np
import pybullet as p
import pybullet_data

class MyArmEnv(gym.Env):
    def __init__(self):
        super().__init__()
        self.action_space = spaces.Box(-1.0, 1.0, (7,), dtype=np.float32)
        self.observation_space = spaces.Box(-np.inf, np.inf, (14,), dtype=np.float32)
        self.client = p.connect(p.DIRECT)  # headless physics server
        p.setAdditionalSearchPath(pybullet_data.getDataPath())
        self.robot = None

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        p.resetSimulation()
        p.setGravity(0, 0, -9.81)
        p.loadURDF('plane.urdf')
        self.robot = p.loadURDF('kuka_iiwa/model.urdf')
        # Disable the default velocity motors so torque commands take effect.
        for j in range(7):
            p.setJointMotorControl2(self.robot, j, p.VELOCITY_CONTROL, force=0)
        self.steps = 0
        return self._obs(), {}

    def step(self, action):
        for j in range(7):
            p.setJointMotorControl2(
                self.robot, j, p.TORQUE_CONTROL, force=float(action[j]) * 50.0)
        p.stepSimulation()
        self.steps += 1
        obs = self._obs()
        reward = float(-np.linalg.norm(obs[:7]))  # encourage driving joints to zero
        terminated = bool(np.linalg.norm(obs[:7]) < 0.1)
        truncated = self.steps >= 200
        return obs, reward, terminated, truncated, {}

    def _obs(self):
        # First two entries of getJointState are joint position and velocity.
        joint_states = [p.getJointState(self.robot, j)[:2] for j in range(7)]
        positions = [s[0] for s in joint_states]
        velocities = [s[1] for s in joint_states]
        return np.concatenate([positions, velocities]).astype(np.float32)

    def close(self):
        p.disconnect(self.client)
50 lines. Now you can train it with any RL library:
from stable_baselines3 import PPO
env = MyArmEnv()
model = PPO('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=100_000)
That's the entire bridge from "raw simulator" to "any RL algorithm." Your sim is now ecosystem-compatible.
Vectorized environments
Single-env training is slow. Modern RL needs N parallel environments (typically 8–32 on CPU, hundreds to thousands on GPU). The standard routes:
- SubprocVecEnv (Stable-Baselines3; Gymnasium's equivalent is AsyncVectorEnv): each env in its own subprocess. Used for non-batched sims like PyBullet.
- VectorEnv: the native batched API in Gymnasium. Used for already-batched sims like MuJoCo MJX.
For modern PPO training: 32 PyBullet envs in subprocesses, or 1024 MJX envs in a single process.
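A sketch of the subprocess route with Stable-Baselines3, reusing the MyArmEnv class defined above (32 copies is just the example count from this section):

from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import SubprocVecEnv

if __name__ == '__main__':
    # Each factory lambda builds one env in its own subprocess.
    vec_env = SubprocVecEnv([lambda: MyArmEnv() for _ in range(32)])
    model = PPO('MlpPolicy', vec_env, verbose=1)
    model.learn(total_timesteps=1_000_000)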
Common gotchas
- Determinism: same seed must produce same trajectory. Test by running reset(seed=42) twice and asserting equality (sketched after this list).
- Numpy dtypes: actions and observations should match the declared dtype (usually np.float32). Mismatched dtypes are a common silent bug.
- Reward shaping vs sparse reward: dense rewards train faster but bake in assumptions. Sparse (success/fail) generalizes better. Pick by task.
- Auto-reset on terminate: most RL libraries auto-reset; don't reset manually inside step.
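A minimal determinism check for the arm env above, comparing two rollouts from the same seed:

import numpy as np

def rollout(env, n_steps=10):
    obs, _ = env.reset(seed=42)
    action = np.zeros(7, dtype=np.float32)
    trajectory = [obs]
    for _ in range(n_steps):
        obs, *_ = env.step(action)
        trajectory.append(obs)
    return np.stack(trajectory)

env = MyArmEnv()
assert np.allclose(rollout(env), rollout(env))  # same seed, same trajectory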
The wrappers ecosystem
Gymnasium ships standard wrappers for common transforms:
- FrameStack: stack last N observations (for partial-observability tasks).
- NormalizeObservation: zero-mean, unit-variance.
- RecordEpisodeStatistics: track episode rewards and lengths.
- ClipAction: enforce action bounds.
- TimeLimit: truncate after N steps.
Wrap your env with these as needed. Composition stays clean.
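For example, composing a few of them around the arm env (the 200-step limit mirrors the truncation already inside MyArmEnv and is only illustrative):

from gymnasium.wrappers import ClipAction, RecordEpisodeStatistics, TimeLimit

env = MyArmEnv()
env = TimeLimit(env, max_episode_steps=200)   # truncate long episodes
env = ClipAction(env)                         # clamp actions to the declared Box bounds
env = RecordEpisodeStatistics(env)            # episode returns/lengths show up in info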
The 2026 standards
- Use Gymnasium (not the original Gym; that's frozen since 2022).
- Use the modern reset/step API (5-tuple from step, terminated+truncated).
- For batched sims: implement gymnasium.vector.VectorEnv directly; don't pretend to be a single env.
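A rough sketch of what that looks like, assuming Gymnasium 1.x, where a VectorEnv subclass sets num_envs and the single_* spaces itself; the batched dynamics below are placeholders, not a real simulator:

import numpy as np
import gymnasium as gym
from gymnasium import spaces
from gymnasium.vector.utils import batch_space

class BatchedArmEnv(gym.vector.VectorEnv):
    """Toy batched env: N arms stepped together as one NumPy array."""

    def __init__(self, num_envs=1024):
        self.num_envs = num_envs
        self.single_observation_space = spaces.Box(-np.inf, np.inf, (14,), dtype=np.float32)
        self.single_action_space = spaces.Box(-1.0, 1.0, (7,), dtype=np.float32)
        self.observation_space = batch_space(self.single_observation_space, num_envs)
        self.action_space = batch_space(self.single_action_space, num_envs)
        self.state = np.zeros((num_envs, 14), dtype=np.float32)

    def reset(self, *, seed=None, options=None):
        self.state[:] = 0.0
        return self.state.copy(), {}

    def step(self, actions):
        # Placeholder dynamics: velocities track actions, positions integrate them.
        self.state[:, 7:] = actions
        self.state[:, :7] += 0.01 * actions
        obs = self.state.copy()
        rewards = -np.linalg.norm(obs[:, :7], axis=1)
        terminations = np.linalg.norm(obs[:, :7], axis=1) < 0.1
        truncations = np.zeros(self.num_envs, dtype=bool)
        return obs, rewards, terminations, truncations, {}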
Beyond Gymnasium
For massive parallel RL on GPU, frameworks like Isaac Lab and MuJoCo Playground use Gymnasium-inspired but custom APIs. They optimize for batched tensor operations rather than per-env Python overhead. Same protocol; different implementation. Translation between them is a one-day project.
Exercise
Wrap any simulator (PyBullet, MuJoCo, your own) as a Gymnasium environment. Train PPO from Stable-Baselines3 on it. The first time you watch a policy trained in your own custom sim converge in a few minutes, the field's progress becomes tangible.
Next
Co-simulation and hardware-in-the-loop — when the simulator and the real robot run together, blurring the boundary.