Embodied reasoning with LLM agents
Long-horizon task decomposition, tool use, error recovery — the layer above motion, powered by an LLM. SayCan, Code-as-Policies, VoxPoser, and the modern stacks combining reasoning with classical control.
VLAs handle "execute this motion." LLM agents handle "make breakfast." The difference is task length, abstraction level, and reasoning. By 2026 the architecture for embodied LLM agents has settled into a clear pattern: a multimodal LLM as the slow planner, with classical or VLA controllers as the fast executors. Here's the modern stack and the open questions.
Why combine LLMs with robots
VLAs are great at sub-second motor coordination. They're not great at:
- Multi-step plans: "to make breakfast: get a bowl, pour cereal, pour milk."
- Error recovery from new failures: "the milk is empty; check the fridge."
- Tool use: "use the spatula to flip the egg."
- Open-ended language: "make me something hot, ideally savory."
LLMs are great at all of these — but they don't know how to move a robot. Combine them: LLM at the top, controller at the bottom.
The modern architecture
```
User language goal
        ↓
LLM planner  ←  world state, available skills
        ↓
Plan: skill_1, skill_2, skill_3, ...
        ↓
Skill executor (VLA, motion planner, hard-coded)
        ↓
Robot motion
        ↓  feedback
Perception
        ↓
World state  →  back to the LLM if recovery is needed
```
Three layers (sketched as a loop after this list):
- LLM planner: reasoning, plan generation, error recovery.
- Skill library: pretrained low-level behaviors (grasp, place, open, close, reach).
- World model: perception output the LLM can read.
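These three layers compose into a plan-execute-verify loop. Below is a minimal sketch of that loop; `plan_fn` (the slow LLM call), `get_world_state`, and the placeholder skills are hypothetical stand-ins for a real stack, not any particular library.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Result:
    success: bool
    reason: str = ""

# Placeholder skills: in a real stack each wraps a VLA policy, a motion
# planner, or a scripted behavior. Names and signatures are illustrative.
def grasp(object_id: str) -> Result:
    return Result(success=True)

def place(object_id: str, location: str) -> Result:
    return Result(success=True)

SKILLS: dict[str, Callable[..., Result]] = {"grasp": grasp, "place": place}

def run(goal: str, plan_fn, get_world_state, max_replans: int = 3) -> bool:
    """plan_fn is the slow LLM planner; SKILLS are the fast executors."""
    for _ in range(max_replans):
        state = get_world_state()                  # perception output the LLM can read
        plan = plan_fn(goal, state, list(SKILLS))  # e.g. [("grasp", {"object_id": "cup"}), ...]
        for name, args in plan:
            result = SKILLS[name](**args)          # fast executor runs one step
            if not result.success:                 # post-execution verification
                break                              # re-plan with fresh world state
        else:
            return True                            # every step succeeded
    return False
```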
The seminal papers
SayCan (Google 2022)
The LLM scores how useful each candidate skill is for the instruction; a learned value function scores how feasible that skill is on the actual robot right now. Pick the action with the best combined score.
Pioneered the LLM-as-planner pattern. Successor work refined it heavily but the core architecture stuck.
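As a rough sketch of the selection rule, assuming you can query a language-model likelihood for each candidate skill and an affordance score from the learned value function (both callables below are hypothetical placeholders):

```python
def saycan_step(instruction: str, history: list[str],
                candidate_skills: list[str],
                llm_score, affordance_score) -> str:
    """Pick the next skill: the LLM says 'useful', the value function says 'feasible'.
    llm_score(instruction, history, skill) -> p(skill | instruction, history)
    affordance_score(skill)                -> p(success | current state)
    Both are stand-ins for the real models."""
    scores = {
        skill: llm_score(instruction, history, skill) * affordance_score(skill)
        for skill in candidate_skills
    }
    return max(scores, key=scores.get)
```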
Code-as-Policies (Google 2022)
LLM generates Python code that calls a robot API. The code is the plan; execution is just running it.
```python
# LLM output:
for obj in detect_objects("cup"):
    pose = get_pose(obj)
    move_to(pose)
    grasp()
    move_to(table_corner)
    release()
```
Strengths: leverages LLM's coding ability; loops + conditionals naturally; debuggable.
Weaknesses: LLM can hallucinate API calls; needs careful API design.
VoxPoser (Stanford 2023)
LLM generates 3D constraint maps from language. "Move along this trajectory while staying away from this object." Motion planner takes the constraint map and produces a trajectory.
Strengths: bridges natural language to geometric constraints elegantly.
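The constraint-map idea can be caricatured in a few lines: a voxel cost grid where the target pulls the cost down and obstacles push it up, which a motion planner then minimizes over. This is only a toy sketch of the concept, with invented names; the real system composes these maps from LLM-generated code.

```python
import numpy as np

def build_value_map(grid_shape, target_voxel, obstacle_voxels, margin=3):
    """Lower cost = more desirable. Attraction toward the target voxel,
    padded repulsion around obstacles ('stay away from this object')."""
    cost = np.zeros(grid_shape)
    idx = np.indices(grid_shape)  # shape (3, X, Y, Z)
    # Attraction: cost grows with distance from the target.
    cost += np.linalg.norm(idx - np.array(target_voxel).reshape(3, 1, 1, 1), axis=0)
    # Repulsion: high cost inside a margin around each obstacle.
    for ov in obstacle_voxels:
        dist = np.linalg.norm(idx - np.array(ov).reshape(3, 1, 1, 1), axis=0)
        cost += 50.0 * (dist < margin)
    return cost  # a motion planner minimizes accumulated cost along the path
```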
Inner Monologue (Google 2022)
LLM "thinks out loud" between actions. Each step's output describes what was attempted, what happened, what's next. Produces resilient long-horizon execution.
RT-2 / VLA family
The opposite direction: a VLA does both reasoning AND action in one model. Less modular but trainable end-to-end.
The 2026 production pattern
Most deployed embodied LLM systems use:
- Vision-language model (Claude, Gemini Robotics, GPT-4V) for high-level reasoning. Slow (1–10 s) but flexible.
- Skill library: 20–100 pretrained skills (grasp object, navigate to room, open drawer). Fast, tested.
- Glue code: Python that translates LLM output to skill calls.
- Verification: post-execution checks; LLM re-plans if expected outcome not achieved.
Think Figure's Helix, Boston Dynamics' Spot AI integrations, Google's Gemini Robotics deployments. The pattern is consistent.
Skill library design
The skill API matters more than the LLM itself.
Good skills:
- Composable (navigate then grasp).
- Verifiable (return success / failure cleanly).
- Atomic (a skill does one thing well).
- Parameterized (grasp(object_id, grasp_type)).
Bad skill APIs lead to LLM confusion: hallucinated parameters, missed preconditions.
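One way to get those four properties is a uniform, typed signature that always returns an explicit result the planner can check. The names below are illustrative, not any particular framework:

```python
from dataclasses import dataclass
from enum import Enum

class GraspType(str, Enum):
    TOP = "top"
    SIDE = "side"

@dataclass
class SkillResult:
    success: bool
    reason: str = ""   # machine-readable failure reason the planner can act on

def grasp(object_id: str, grasp_type: GraspType = GraspType.TOP) -> SkillResult:
    """Atomic (grasps one known object, nothing else), parameterized, and
    verifiable: the planner always gets a clean success / failure."""
    # ... call the low-level controller here ...
    return SkillResult(success=True)

def navigate(room: str) -> SkillResult:
    # ... call the navigation stack here ...
    return SkillResult(success=True)

# Composable: "fetch the cup from the kitchen" is navigate("kitchen")
# followed by grasp("cup").
```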
The world model
LLMs need to read the world. Standard representations:
- Object list: "I see [cup, plate, fork] on the table."
- Spatial relations: "cup is on the plate; plate is on the table."
- Visual frames: send raw images to a multimodal LLM.
- Captioned scene: a small VLM describes the scene; LLM consumes the description.
Hybrid approaches (object list + key images) give the LLM enough structure to act on without overwhelming context.
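A sketch of that hybrid representation, assuming a perception stack that already emits labeled detections and pairwise relations (field names here are invented):

```python
def world_state_text(detections, relations):
    """Turn perception output into a compact prompt block: an object list
    plus spatial relations, pasted into the LLM context alongside one or
    two key frames."""
    objects = sorted({d["label"] for d in detections})
    lines = [f"Objects: {', '.join(objects)}"]
    lines += [f"- {a} is {rel} {b}" for a, rel, b in relations]
    return "\n".join(lines)

state = world_state_text(
    detections=[{"label": "cup"}, {"label": "plate"}, {"label": "fork"}],
    relations=[("cup", "on", "plate"), ("plate", "on", "table")],
)
print(state)
# Objects: cup, fork, plate
# - cup is on plate
# - plate is on table
```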
Where it's still hard
- Latency: a 5-second LLM call between every action makes long sequences unusably slow.
- Hallucinated API calls: the LLM invents a function that doesn't exist and the system crashes.
- Losing track of state: the LLM forgets what's already happened in long episodes.
- Negotiating physical reality: the LLM proposes "open the drawer" but the drawer is locked; the system needs to detect the failure and adapt.
- Spatial reasoning: "the object behind the cup" — current LLMs guess wrong often.
The mitigations
- Plan once, execute many: get the full plan upfront; don't call LLM per action.
- Strict APIs: schema-validate LLM outputs; reject + retry on hallucinations (see the sketch after this list).
- Persistent scene memory: track what's been done; provide as context.
- Local recovery skills: "if grasp fails, retry up to 3 times" without re-prompting LLM.
- Hybrid VLM perception: ground spatial language with vision-language alignment (CLIP, GroundingDINO).
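The "strict APIs" mitigation in practice: validate every tool call against a schema before it touches the robot, and feed validation errors back for a retry. A dependency-free sketch below (real stacks typically use JSON Schema or Pydantic); `call_llm` is a hypothetical wrapper around your model.

```python
import json

SKILL_SCHEMAS = {
    "grasp":    {"object_id": str},
    "navigate": {"room": str},
}

def validate_call(raw: str):
    """Return (skill, args) or raise ValueError with a message the LLM can act on."""
    call = json.loads(raw)
    skill, args = call["skill"], call.get("args", {})
    if skill not in SKILL_SCHEMAS:
        raise ValueError(f"unknown skill '{skill}'; available: {list(SKILL_SCHEMAS)}")
    schema = SKILL_SCHEMAS[skill]
    for name, typ in schema.items():
        if name not in args or not isinstance(args[name], typ):
            raise ValueError(f"'{skill}' needs parameter '{name}' of type {typ.__name__}")
    extra = set(args) - set(schema)
    if extra:
        raise ValueError(f"hallucinated parameters for '{skill}': {sorted(extra)}")
    return skill, args

def call_with_retry(call_llm, prompt, max_tries=3):
    """Reject-and-retry: the validation error goes back into the prompt."""
    for _ in range(max_tries):
        raw = call_llm(prompt)
        try:
            return validate_call(raw)
        except (ValueError, KeyError) as e:
            prompt += f"\nYour last call was invalid: {e}. Return corrected JSON."
    raise RuntimeError("LLM failed to produce a valid skill call")
```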
The model choices in 2026
| Model | Use for |
|---|---|
| Claude / GPT-4o | High-level planning; strong reasoning |
| Gemini Robotics-ER | Embodied-tuned reasoning; spatial understanding |
| Llama-3.2-Vision | On-device; cost-sensitive deployments |
| π0.5 / Helix | VLA with reasoning; end-to-end alternative |
The auth and tool-use angle
Many embodied LLM stacks now use the same tool-calling APIs as software agents (function-call schema, JSON-RPC, MCP). Treat the robot as another tool the LLM has access to.
This is the architecture behind RobotForge's own /cad MCP integration: an LLM can call create_box, boolean, take_screenshot the same way it calls get_weather or send_email. Same protocol; different domain.
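Concretely, a robot skill exposed to the model looks like any other tool definition. The shape below is roughly what most function-calling APIs expect, though field names vary by provider, and this grasp tool is an illustrative example rather than part of any real MCP server:

```python
GRASP_TOOL = {
    "name": "grasp",
    "description": "Close the gripper on a detected object and lift it slightly.",
    "input_schema": {                     # some APIs call this "parameters"
        "type": "object",
        "properties": {
            "object_id": {"type": "string",
                          "description": "ID from the latest detection list"},
            "grasp_type": {"type": "string", "enum": ["top", "side"]},
        },
        "required": ["object_id"],
    },
}
# From the model's point of view this is no different from get_weather:
# it emits a JSON call, and the executor routes it to the controller
# instead of an HTTP request.
```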
Where this is heading
- Smaller / faster LLMs: 1–10B parameter VLMs on Jetson. Latency drops to 200 ms; closed-loop becomes feasible.
- Action-tuned LLMs: models specifically trained for robot tool use. Less hallucination.
- Multi-agent robots: multiple LLM agents (planner, perception, execution monitor) coordinating.
- Self-improving from logs: failure logs become training data; LLM gets better over time.
Exercise
Build a small embodied LLM demo: a 6-DOF arm + RGB camera. Define a skill API (grasp, place, navigate) using OpenVLA or hand-coded skills. Use Claude or GPT-4o to plan from a high-level prompt: "tidy the desk." Watch the LLM decompose, execute, recover. The first time the system handles a multi-step task autonomously, the architecture's value clicks.
Next
Safety and certification — the standards that separate a lab demo from a deployed product.