
Autonomous driving stack anatomy

The perception → prediction → planning → control pipeline, mapped one level at a time. The architecture behind every L4 vehicle, with honest notes on what's still hard in 2026.

by RobotForge
#mobile-robots #autonomous-driving #self-driving

An autonomous vehicle is the largest, fastest, most consequential mobile robot most people will ever encounter. The internal architecture has converged on a clean four-stage pipeline: perception → prediction → planning → control. Every L4 system (Waymo, Cruise, Wayve, Mobileye, Pony.ai) implements some variation of it. Here's the working knowledge you need to read any AV paper.

The pipeline

  1. Perception: turn raw sensors (camera, lidar, radar) into a model of what's around. Other cars, pedestrians, lane markings, signs.
  2. Prediction: forecast where dynamic objects will be in 1–8 seconds. Probabilistic; usually multi-modal.
  3. Planning: decide a trajectory for the ego vehicle. Considers other actors' predicted motion.
  4. Control: execute the trajectory. Steering, throttle, brake.

Each stage feeds the next; failures at any stage cascade.
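
Concretely, one tick of the loop looks like this (a minimal sketch; every type and name below is illustrative, not any production interface):

    from dataclasses import dataclass

    # All types and names below are illustrative, not a production interface.

    @dataclass
    class Detections:        # perception output: tracked objects, lanes, free space
        objects: list
        lanes: list

    @dataclass
    class Predictions:       # per-agent future trajectories with probabilities
        trajectories: dict

    @dataclass
    class Trajectory:        # ego plan: (x, y, heading, velocity) samples over time
        states: list

    def drive_tick(sensors, perceive, predict, plan, control):
        """One pass through the pipeline, typically run at ~10 Hz."""
        detections = perceive(sensors)                 # 1. perception
        predictions = predict(detections)              # 2. prediction
        trajectory = plan(detections, predictions)     # 3. planning
        return control(trajectory)                     # 4. control: steer/throttle/brake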

1. Perception

The most heavily developed layer. The state of the art in 2026 is fused multi-modal perception in a bird's-eye-view (BEV) representation:

  • Cameras (6–12, surrounding the vehicle): high-resolution, low cost, color/texture rich.
  • LiDAR (1–4 units): metric depth, weather-resistant, expensive.
  • Radar (4–8 units): long range, weather-immune, low resolution.
  • HD maps: pre-mapped lane geometry, speed limits, traffic-light positions. Treated as a sensor.

Modern AVs fuse all of these into a unified BEV representation. The 2022–24 wave of papers (BEVFusion, BEVFormer, M^2BEV) made this the production standard. Outputs: 3D object bounding boxes, semantic lane maps, free-space layers.
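
The learned camera-to-BEV lifting in those papers is beyond a snippet, but the geometric core is simple: rasterize ego-frame points into a BEV grid. A toy sketch, with arbitrary ranges and cell size:

    import numpy as np

    def lidar_to_bev(points, x_range=(-50.0, 50.0), y_range=(-50.0, 50.0), cell=0.5):
        """Rasterize ego-frame lidar points (N, 3) into a BEV occupancy grid.

        Toy geometric core only: production stacks lift *learned* camera and
        lidar features into BEV, not raw occupancy.
        """
        nx = int((x_range[1] - x_range[0]) / cell)
        ny = int((y_range[1] - y_range[0]) / cell)
        grid = np.zeros((nx, ny), dtype=np.float32)

        # Keep points inside the grid and above the road surface.
        m = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
             (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]) &
             (points[:, 2] > 0.2))
        ix = ((points[m, 0] - x_range[0]) / cell).astype(int)
        iy = ((points[m, 1] - y_range[0]) / cell).astype(int)
        grid[ix, iy] = 1.0
        return grid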

2. Prediction

Given the perception output, predict where each agent will be in the next 1–8 seconds. Hard because:

  • Multi-modal: a car at an intersection might go straight, left, or right.
  • Interactive: agents react to each other; their predictions are coupled.
  • Long horizon: 8-second predictions are highly uncertain.

Modern approaches encode agents and map elements as vectorized features and process them with graph neural networks or transformers (VectorNet, Wayformer). Output: per-agent trajectory distributions (typically 6 modes with probabilities).
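
When consuming predictions downstream, the output format matters more than the model internals. A sketch of a 6-mode forecast for one agent and two ways a planner might query it (shapes mirror typical model outputs; the names are ours):

    import numpy as np

    # Hypothetical forecast for one agent: K trajectories of T future (x, y)
    # steps plus mode probabilities. Shapes mirror the typical model output;
    # the function names are ours.

    def most_likely_mode(modes, probs):
        """modes: (K, T, 2) future positions; probs: (K,), sums to 1."""
        return modes[int(np.argmax(probs))]

    def expected_position(modes, probs, t):
        """Probability-weighted position at step t: a quick risk proxy,
        though planners usually keep all modes rather than averaging."""
        return np.einsum("k,kd->d", probs, modes[:, t, :])

    # 6 modes over 80 steps = 8 s at 10 Hz
    modes = np.zeros((6, 80, 2))
    probs = np.array([0.4, 0.25, 0.15, 0.1, 0.05, 0.05])
    print(most_likely_mode(modes, probs).shape)   # (80, 2)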

3. Planning

Find a trajectory for the ego vehicle that:

  • Reaches the goal.
  • Stays within lane boundaries.
  • Doesn't collide with predicted other-agent trajectories.
  • Respects traffic rules.
  • Is comfortable (limited acceleration, smooth steering).
  • Is feasible (within vehicle dynamics).
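
In practice, these criteria split into hard feasibility checks that reject a candidate outright and soft costs that rank the survivors. A toy scoring function for one candidate trajectory (thresholds and weights are illustrative; lane-keeping and traffic-rule terms omitted):

    import numpy as np

    def score_candidate(traj, agent_futures, v_max=15.0, a_max=3.0, d_min=0.5):
        """Score one candidate ego trajectory against the criteria above.

        traj: (T, 4) rows of (x, y, v, a) at a fixed timestep.
        agent_futures: (N, T, 2) predicted positions of other agents.
        Thresholds and weights are illustrative; lane-keeping and
        traffic-rule terms are omitted for brevity.
        """
        # Hard feasibility: dynamics limits reject outright.
        if np.any(traj[:, 2] > v_max) or np.any(np.abs(traj[:, 3]) > a_max):
            return -np.inf
        # Hard safety: clearance to every predicted agent at every step.
        d = np.linalg.norm(agent_futures - traj[None, :, :2], axis=-1)
        if d.size and d.min() < d_min:
            return -np.inf
        # Soft costs: reward progress, penalize discomfort.
        progress = traj[-1, 0] - traj[0, 0]       # toy goal: distance along x
        comfort = np.mean(np.abs(traj[:, 3]))     # mean |acceleration|
        return progress - 2.0 * comfort           # the 2.0 weight is arbitrary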

Dominant approach: a hierarchical planner.

  • Behavior planner: discrete decisions (lane change, yield, merge). Often a state machine or learned policy.
  • Trajectory planner: continuous path through the next 5–10 seconds. Frenet-frame trajectory optimization or model-predictive control.
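
The Frenet-frame workhorse is the quintic polynomial: it connects the current lateral state to a target lateral state with continuous velocity and acceleration, one polynomial per candidate maneuver. A minimal sketch:

    import numpy as np

    def quintic(d0, dv0, da0, d1, dv1, da1, T):
        """Quintic polynomial matching position, velocity, and acceleration
        at t=0 and t=T: the standard Frenet building block for the lateral
        offset d(t) from the lane centerline (and likewise longitudinal s(t)).
        Coefficients are returned highest power first, for np.polyval."""
        A = np.array([
            [0.0,      0.0,      0.0,     0.0,  0.0, 1.0],
            [0.0,      0.0,      0.0,     0.0,  1.0, 0.0],
            [0.0,      0.0,      0.0,     2.0,  0.0, 0.0],
            [T**5,     T**4,     T**3,    T**2, T,   1.0],
            [5*T**4,   4*T**3,   3*T**2,  2*T,  1.0, 0.0],
            [20*T**3,  12*T**2,  6*T,     2.0,  0.0, 0.0],
        ])
        return np.linalg.solve(A, [d0, dv0, da0, d1, dv1, da1])

    # One lane-change candidate: shift 3.5 m laterally over 4 s, ending settled.
    coeffs = quintic(0.0, 0.0, 0.0, 3.5, 0.0, 0.0, T=4.0)
    t = np.linspace(0.0, 4.0, 41)
    d = np.polyval(coeffs, t)   # lateral-offset profile for this candidate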

Increasingly, neural networks are eating both layers. End-to-end approaches like Wayve's MILE and Tesla's FSD generate trajectories directly from sensor input. Quality is approaching but not yet matching the modular stack on edge cases.

4. Control

Track the planned trajectory. The simplest case (urban driving, < 50 km/h):

  • Lateral controller: pure pursuit or Stanley controller for steering.
  • Longitudinal controller: PID on velocity error for throttle/brake.
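
Both controllers fit in a few lines. A sketch of textbook pure pursuit plus a velocity PID; the wheelbase and gains are illustrative, and a real stack adds anti-windup, rate limiting, and actuator-delay compensation:

    import numpy as np

    def pure_pursuit_steer(ego_xy, ego_yaw, target_xy, wheelbase=2.8):
        """Steering angle that arcs the rear axle through a lookahead point
        (textbook geometric pure pursuit; the wheelbase is illustrative)."""
        dx, dy = target_xy[0] - ego_xy[0], target_xy[1] - ego_xy[1]
        alpha = np.arctan2(dy, dx) - ego_yaw      # bearing to target, vehicle frame
        ld = np.hypot(dx, dy)                     # lookahead distance
        return np.arctan2(2.0 * wheelbase * np.sin(alpha), ld)

    class VelocityPID:
        """PID on velocity error; positive output = throttle, negative = brake."""
        def __init__(self, kp=0.8, ki=0.05, kd=0.1, dt=0.05):
            self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
            self.integral, self.prev_err = 0.0, 0.0

        def step(self, v_target, v_actual):
            err = v_target - v_actual
            self.integral += err * self.dt
            deriv = (err - self.prev_err) / self.dt
            self.prev_err = err
            return self.kp * err + self.ki * self.integral + self.kd * deriv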

For aggressive driving (highways, racing): MPC with vehicle dynamics model. Considers slip, tire forces, suspension.

Control is the simplest layer; perception and prediction dominate development effort.

What's still hard in 2026

  • Edge cases (long tail): the system can fail in any "rare" scenario it has not seen: construction zones, emergency vehicles, erratic pedestrians. Production fleets log millions of miles to enumerate them.
  • Weather: heavy rain and snow degrade every sensor. LiDAR returns spurious points off rain droplets; camera lenses get occluded.
  • Irregular agents: cyclists, motorcycles, scooters, kids. Less data; harder to predict.
  • Negotiation: 4-way stops, merging, gaps in dense traffic. Game-theoretic; requires understanding intent.
  • Map maintenance: HD maps go stale (construction, road changes). Self-updating maps are an open problem.

The end-to-end vs modular debate

In 2024–26, two camps:

  • Modular pipeline: Waymo, Cruise (paused), Mobileye. Hand-designed perception + prediction + planning + control with deep models inside each module. Auditable; verifiable; slower to improve.
  • End-to-end: Tesla FSD, Wayve, comma.ai. A single neural network from sensors to trajectory (or beyond, to direct steering/throttle). Hard to audit; learns from data; rapid improvement.

The frontier is hybrid: modular structure with learned components inside each module, plus end-to-end refinement on top. By 2026, the bias is shifting toward more learning, less hand-coding.

The simulation layer

Every AV stack is tested predominantly in simulation:

  • CARLA: open-source urban-driving sim. Used widely in research.
  • NVIDIA Isaac Sim with DriveSim assets: production-grade.
  • Replay logs: replay recorded sensor data through the perception stack to test changes.
  • Hand-authored scenarios: rare edge cases, written by safety engineers.
  • Adversarial / generative scenarios: AI-generated edge cases that stress-test the system.

For every mile driven on public roads, ~10,000 miles are driven in simulation. Production fleets test new code via simulation before any real-world deploy.

The safety case

L4 deployment requires a documented safety case. The standard structure (ISO 21448, Safety of the Intended Functionality, or SOTIF):

  • Define operational design domain (ODD): where the AV operates.
  • Hazard analysis: what failure modes exist.
  • Mitigation: how each hazard is addressed.
  • Testing evidence: simulation hours, on-road miles, scenario coverage.
  • Continuous monitoring: production fleet data feeds back.

Most regulatory pushback in 2026 is on documenting these processes — not on the underlying technology.

The 2026 production reality

  • Waymo: ~1M paid rides/month, multiple US cities. Modular stack with deep components.
  • Tesla FSD: end-to-end, supervised "L2+", 5+ million users. Improving rapidly; not yet certified L4.
  • Wayve, Cruise (returning), Pony.ai, Baidu Apollo, Motional: various deployment stages.
  • Trucks (Aurora, Kodiak, Embark legacy): highway long-haul; geographically simpler than urban.

The technology works. Deployment is bottlenecked by regulation, public trust, and edge-case completeness.

What to learn first if entering AV

  1. BEV perception (read BEVFusion, BEVFormer papers).
  2. Trajectory prediction (Wayformer, Trajectron++).
  3. Frenet-frame trajectory planning.
  4. MPC for vehicle control.
  5. HD maps and localization.
  6. Simulation: CARLA for prototyping.

Each is a several-week deep dive. Together they form the working knowledge of the modular AV stack.

Exercise

Run CARLA's autonomous-driving examples. Modify the perception module to use only cameras (drop the lidar). Watch performance degrade in heavy rain. Add radar; watch it recover. Feeling multi-modal fusion fail and recover firsthand is the most honest education in why production AVs use all three sensor types.
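
A scaffold for the exercise against the CARLA Python API (a sketch: blueprint IDs, transforms, and weather presets vary by CARLA version, so adjust to your install):

    import carla

    client = carla.Client("localhost", 2000)
    client.set_timeout(5.0)
    world = client.get_world()
    world.set_weather(carla.WeatherParameters.HardRainNoon)   # degrade the cameras

    bp = world.get_blueprint_library()
    vehicle = world.spawn_actor(bp.filter("vehicle.tesla.model3")[0],
                                world.get_map().get_spawn_points()[0])

    cam = world.spawn_actor(bp.find("sensor.camera.rgb"),
                            carla.Transform(carla.Location(x=1.5, z=2.4)),
                            attach_to=vehicle)
    # Comment the lidar out for the camera-only run; keep the radar to recover.
    lidar = world.spawn_actor(bp.find("sensor.lidar.ray_cast"),
                              carla.Transform(carla.Location(z=2.5)),
                              attach_to=vehicle)
    radar = world.spawn_actor(bp.find("sensor.other.radar"),
                              carla.Transform(carla.Location(x=2.0, z=1.0)),
                              attach_to=vehicle)

    cam.listen(lambda image: None)       # feed your perception module here
    lidar.listen(lambda points: None)
    radar.listen(lambda detections: None)
    vehicle.set_autopilot(True)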

Next

Social navigation — the frontier where metric path planning meets human behavior modeling.
