Writing your own sim environment (Gymnasium API)
The interface every RL framework expects. Wrap any simulator behind reset/step/observation_space, and the entire RL ecosystem can train on it. Four-method protocol; thousands of trainers.
RL training frameworks (Stable-Baselines3, CleanRL, RSL-RL, RLlib) all expect environments to look the same. The lingua franca is Gymnasium (formerly OpenAI Gym, before the maintainership change in 2022). Four methods, two properties, and you're done. Your sim becomes trainable by every algorithm in the ecosystem.
The protocol
import gymnasium
from gymnasium import spaces

class MyEnv(gymnasium.Env):
    observation_space: spaces.Space
    action_space: spaces.Space

    def reset(self, *, seed=None, options=None):
        # ... reset state ...
        return obs, info

    def step(self, action):
        # ... apply action, advance physics ...
        return obs, reward, terminated, truncated, info

    def render(self):
        # ... return RGB image or None ...
        ...

    def close(self):
        # ... cleanup ...
        ...
Four methods (reset, step, render, close). Two properties (observation_space, action_space). That's the whole protocol. Anything implementing it can be trained by any modern RL library.
The two key properties
action_space
The set of actions the agent can take. Common types:
- spaces.Discrete(n): choose one of n discrete actions.
- spaces.Box(low, high, shape): continuous vector with element-wise bounds. Most robotics.
- spaces.MultiDiscrete([n1, n2, ...]): multiple independent discrete actions.
For a 7-DOF arm: Box(low=-1, high=1, shape=(7,)) with policy outputs scaled to joint torques.
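For concreteness, here is one way those spaces get constructed (the specific shapes and bounds below are illustrative, not from any particular robot):

import numpy as np
from gymnasium import spaces

gripper = spaces.Discrete(2)                                          # open / close
arm = spaces.Box(low=-1.0, high=1.0, shape=(7,), dtype=np.float32)   # normalized 7-DOF torques
base = spaces.MultiDiscrete([3, 3])                                   # x and y: back / stay / forward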
observation_space
The set of observations the agent receives. Same types. Often a Dict for multi-modal:
spaces.Dict({
    'joint_pos': spaces.Box(-np.pi, np.pi, (7,)),
    'joint_vel': spaces.Box(-10, 10, (7,)),
    'image': spaces.Box(0, 255, (3, 224, 224), dtype=np.uint8),
})
The reset and step methods
reset
Initialize the environment to a starting state. Returns the initial observation and an info dict. Optionally takes a seed for reproducibility.
step
Apply the action, advance physics, return:
- obs: new observation.
- reward: scalar reward for this step.
- terminated: True if the episode ended naturally (success/failure).
- truncated: True if cut short (time limit, but the task could continue).
- info: dict of diagnostics (whatever you want to log).
The terminated/truncated split (Gymnasium's replacement for the old single done flag) matters for value bootstrapping in TD-learning algorithms: a terminated episode has no future return to account for, a truncated one does.
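Concretely, the distinction shows up in the bootstrap term of a one-step value target. A minimal sketch (value_fn and gamma are placeholder names, not from any specific library):

def td_target(reward, next_obs, terminated, value_fn, gamma=0.99):
    """One-step TD target; value_fn stands in for your critic."""
    if terminated:
        # The episode really ended: there is no future return to bootstrap.
        return reward
    # Includes truncated episodes: the task could have continued,
    # so bootstrap from the value of the next observation.
    return reward + gamma * value_fn(next_obs)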
The 50-line PyBullet wrapper
import gymnasium as gym
from gymnasium import spaces
import numpy as np
import pybullet as p
import pybullet_data

class MyArmEnv(gym.Env):
    def __init__(self):
        super().__init__()
        self.action_space = spaces.Box(-1.0, 1.0, (7,), dtype=np.float32)
        self.observation_space = spaces.Box(-np.inf, np.inf, (14,), dtype=np.float32)
        self.client = p.connect(p.DIRECT)  # headless physics server
        p.setAdditionalSearchPath(pybullet_data.getDataPath())
        self.robot = None

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        p.resetSimulation()
        p.setGravity(0, 0, -9.81)
        p.loadURDF('plane.urdf')
        self.robot = p.loadURDF('kuka_iiwa/model.urdf')
        # Disable the default velocity motors so torque commands take effect.
        for j in range(7):
            p.setJointMotorControl2(self.robot, j, p.VELOCITY_CONTROL, force=0)
        self.steps = 0
        return self._obs(), {}

    def step(self, action):
        for j in range(7):
            p.setJointMotorControl2(
                self.robot, j, p.TORQUE_CONTROL, force=float(action[j]) * 50.0)
        p.stepSimulation()
        self.steps += 1
        obs = self._obs()
        reward = float(-np.linalg.norm(obs[:7]))  # encourage driving joints to zero
        terminated = bool(np.linalg.norm(obs[:7]) < 0.1)
        truncated = self.steps >= 200
        return obs, reward, terminated, truncated, {}

    def _obs(self):
        # First two entries of getJointState are joint position and velocity.
        joint_states = [p.getJointState(self.robot, j)[:2] for j in range(7)]
        positions = [s[0] for s in joint_states]
        velocities = [s[1] for s in joint_states]
        return np.concatenate([positions, velocities]).astype(np.float32)

    def close(self):
        p.disconnect(self.client)
50 lines. Now you can train it with any RL library:
from stable_baselines3 import PPO
env = MyArmEnv()
model = PPO('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=100_000)
That's the entire bridge from "raw simulator" to "any RL algorithm." Your sim is now ecosystem-compatible.
Vectorized environments
Single-env training is slow. Modern RL needs N parallel environments (typically 8–32 on CPU, hundreds to thousands on GPU). The standard routes:
- SubprocVecEnv (Stable-Baselines3; Gymnasium's equivalent is AsyncVectorEnv): each env in its own subprocess. Used for non-batched sims like PyBullet.
- VectorEnv: the native batched API in Gymnasium. Used for already-batched sims like MuJoCo MJX.
For modern PPO training: 32 PyBullet envs in subprocesses, or 1024 MJX envs in a single process.
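A sketch of the subprocess route with Stable-Baselines3, reusing the MyArmEnv class defined above (32 copies is just the example count from this section):

from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import SubprocVecEnv

if __name__ == '__main__':
    # Each factory lambda builds one env in its own subprocess.
    vec_env = SubprocVecEnv([lambda: MyArmEnv() for _ in range(32)])
    model = PPO('MlpPolicy', vec_env, verbose=1)
    model.learn(total_timesteps=1_000_000)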
Common gotchas
- Determinism: same seed must produce same trajectory. Test by running reset(seed=42) twice and asserting equality (sketched after this list).
- Numpy dtypes: actions and observations should match the declared dtype (usually np.float32). Mismatched dtypes are a common silent bug.
- Reward shaping vs sparse reward: dense rewards train faster but bake in assumptions. Sparse (success/fail) generalizes better. Pick by task.
- Auto-reset on terminate: most RL libraries auto-reset; don't reset manually inside step.
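A minimal determinism check for the arm env above, comparing two rollouts from the same seed:

import numpy as np

def rollout(env, n_steps=10):
    obs, _ = env.reset(seed=42)
    action = np.zeros(7, dtype=np.float32)
    trajectory = [obs]
    for _ in range(n_steps):
        obs, *_ = env.step(action)
        trajectory.append(obs)
    return np.stack(trajectory)

env = MyArmEnv()
assert np.allclose(rollout(env), rollout(env))  # same seed, same trajectory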
The wrappers ecosystem
Gymnasium ships standard wrappers for common transforms:
- FrameStack: stack last N observations (for partial-observability tasks).
- NormalizeObservation: zero-mean, unit-variance.
- RecordEpisodeStatistics: track episode rewards and lengths.
- ClipAction: enforce action bounds.
- TimeLimit: truncate after N steps.
Wrap your env with these as needed. Composition stays clean.
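For example, composing a few of them around the arm env (the 200-step limit mirrors the truncation already inside MyArmEnv and is only illustrative):

from gymnasium.wrappers import ClipAction, RecordEpisodeStatistics, TimeLimit

env = MyArmEnv()
env = TimeLimit(env, max_episode_steps=200)   # truncate long episodes
env = ClipAction(env)                         # clamp actions to the declared Box bounds
env = RecordEpisodeStatistics(env)            # episode returns/lengths show up in info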
The 2026 standards
- Use Gymnasium (not the original Gym; that's frozen since 2022).
- Use the modern reset/step API (5-tuple from step, terminated+truncated).
- For batched sims: implement gymnasium.vector.VectorEnv directly; don't pretend to be a single env.
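A rough sketch of what that looks like, assuming Gymnasium 1.x, where a VectorEnv subclass sets num_envs and the single_* spaces itself; the batched dynamics below are placeholders, not a real simulator:

import numpy as np
import gymnasium as gym
from gymnasium import spaces
from gymnasium.vector.utils import batch_space

class BatchedArmEnv(gym.vector.VectorEnv):
    """Toy batched env: N arms stepped together as one NumPy array."""

    def __init__(self, num_envs=1024):
        self.num_envs = num_envs
        self.single_observation_space = spaces.Box(-np.inf, np.inf, (14,), dtype=np.float32)
        self.single_action_space = spaces.Box(-1.0, 1.0, (7,), dtype=np.float32)
        self.observation_space = batch_space(self.single_observation_space, num_envs)
        self.action_space = batch_space(self.single_action_space, num_envs)
        self.state = np.zeros((num_envs, 14), dtype=np.float32)

    def reset(self, *, seed=None, options=None):
        self.state[:] = 0.0
        return self.state.copy(), {}

    def step(self, actions):
        # Placeholder dynamics: velocities track actions, positions integrate them.
        self.state[:, 7:] = actions
        self.state[:, :7] += 0.01 * actions
        obs = self.state.copy()
        rewards = -np.linalg.norm(obs[:, :7], axis=1)
        terminations = np.linalg.norm(obs[:, :7], axis=1) < 0.1
        truncations = np.zeros(self.num_envs, dtype=bool)
        return obs, rewards, terminations, truncations, {}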
Beyond Gymnasium
For massive parallel RL on GPU, frameworks like Isaac Lab and MuJoCo Playground use Gymnasium-inspired but custom APIs. They optimize for batched tensor operations rather than per-env Python overhead. Same protocol; different implementation. Translation between them is a one-day project.
Exercise
Wrap any simulator (PyBullet, MuJoCo, your own) as a Gymnasium environment. Train PPO from Stable-Baselines3 on it. The first time you watch a policy trained in your own custom sim converge in a few minutes, the field's progress becomes tangible.
Next
Co-simulation and hardware-in-the-loop — when the simulator and the real robot run together, blurring the boundary.