RobotForge

Embodied reasoning with LLM agents

Long-horizon task decomposition, tool use, error recovery — the layer above motion, powered by an LLM. SayCan, Code-as-Policies, VoxPoser, and the modern stacks combining reasoning with classical control.

by RobotForge
#frontiers #llm #embodied-ai

VLAs handle "execute this motion." LLM agents handle "make breakfast." The difference is task length, abstraction level, and reasoning. By 2026 the architecture for embodied LLM agents has settled into a clear pattern: a multimodal LLM as the slow planner, with classical or VLA controllers as the fast executor. Here's the modern stack and the open questions.

Why combine LLMs with robots

VLAs are great at sub-second motor coordination. They're not great at:

  • Multi-step plans: "to make breakfast: get a bowl, pour cereal, pour milk."
  • Error recovery from new failures: "the milk is empty; check the fridge."
  • Tool use: "use the spatula to flip the egg."
  • Open-ended language: "make me something hot, ideally savory."

LLMs are great at all of these — but they don't know how to move a robot. Combine them: LLM at the top, controller at the bottom.

The modern architecture

User language goal
       ↓
    LLM Planner  ← world state, available skills
       ↓
   Plan: skill_1, skill_2, skill_3, ...
       ↓
   Skill executor (VLA, motion planner, hard-coded)
       ↓
   Robot motion
       ↑ feedback
   Perception
       ↑
   World state ← [back to LLM if recovery needed]

Three layers:

  1. LLM planner: reasoning, plan generation, error recovery.
  2. Skill library: pretrained low-level behaviors (grasp, place, open, close, reach).
  3. World model: perception output the LLM can read.
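The three layers above can be sketched as a miniature loop. Everything here is a stand-in (the planner and executor are stubs, the names are hypothetical), but the control flow is the real pattern: plan, execute skill by skill, and fall back to the planner on failure.

```python
# Planner/executor/world-model loop in miniature (all names hypothetical).
def plan(goal, world_state):
    """Stand-in for the LLM planner: map a goal to an ordered skill list."""
    return [("grasp", obj) for obj in world_state] + [("place", "bin")]

def execute(skill, arg, world_state):
    """Stand-in for the skill executor; returns success/failure."""
    if skill == "grasp" and arg in world_state:
        world_state.remove(arg)
        return True
    return skill == "place"

world = ["cup", "plate"]
for skill, arg in plan("clear the table", list(world)):
    if not execute(skill, arg, world):
        break                      # feedback path: would re-plan with the LLM
print(world)  # []
```

In a real system `plan` is the slow multimodal LLM call and `execute` dispatches to a VLA or motion planner; the loop structure is unchanged.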

The seminal papers

SayCan (Google 2022)

The LLM scores how useful each candidate high-level action is for the instruction ("say"); a learned value function scores how feasible that action is on the actual robot right now ("can"). Pick the action with the highest combined score.

Pioneered the LLM-as-planner pattern. Successor work refined it heavily but the core architecture stuck.
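The selection rule reduces to a product of two scores. The numbers below are made up for illustration, but they show the point: a feasibility score can veto an action the LLM likes.

```python
# SayCan-style selection: usefulness (LLM) x feasibility (value function).
# All scores here are illustrative, not from a real model.
llm_score = {"pick up the sponge": 0.5, "go to the sink": 0.3, "pick up the apple": 0.2}
feasibility = {"pick up the sponge": 0.1,   # sponge is out of reach
               "go to the sink": 0.9,
               "pick up the apple": 0.8}

combined = {a: llm_score[a] * feasibility[a] for a in llm_score}
best = max(combined, key=combined.get)
print(best)  # "go to the sink" wins despite a lower language score
```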

Code-as-Policies (Google 2022)

LLM generates Python code that calls a robot API. The code is the plan; execution is just running it.

# LLM output (calls a predefined robot API):
place_pose = get_pose("table corner")
for obj in detect_objects("cup"):
    move_to(get_pose(obj))
    grasp()
    move_to(place_pose)
    release()

Strengths: leverages LLM's coding ability; loops + conditionals naturally; debuggable.

Weaknesses: LLM can hallucinate API calls; needs careful API design.
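One common guard against hallucinated calls is to run the generated code against a namespace containing only the real API, so an invented function raises immediately instead of doing something unpredictable. A sketch with stub skills (the API functions here are stand-ins):

```python
# Run LLM-generated code with only whitelisted API functions in scope.
calls = []

def detect_objects(name): return [f"{name}_0"]
def get_pose(obj): return (0.1, 0.2, 0.3)
def move_to(pose): calls.append(("move_to", pose))
def grasp(): calls.append(("grasp",))
def release(): calls.append(("release",))

API = {"detect_objects": detect_objects, "get_pose": get_pose,
       "move_to": move_to, "grasp": grasp, "release": release}

generated = """
for obj in detect_objects("cup"):
    move_to(get_pose(obj))
    grasp()
"""

try:
    exec(generated, {"__builtins__": {}}, dict(API))
except NameError as e:
    print("rejected hallucinated call:", e)   # fires if the LLM invented a function

print(calls)  # [('move_to', (0.1, 0.2, 0.3)), ('grasp',)]
```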

VoxPoser (Stanford 2023)

LLM generates 3D constraint maps from language. "Move along this trajectory while staying away from this object." Motion planner takes the constraint map and produces a trajectory.

Strengths: bridges natural language to geometric constraints elegantly.
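The constraint-map idea can be sketched as a grid of costs: language like "stay away from the mug" becomes high cost near the mug's cells, and the motion planner minimizes accumulated cost. A toy 2D version with made-up coordinates:

```python
import numpy as np

# Toy 2D "constraint map": cost rises near an obstacle the LLM named.
obstacle = np.array([10, 10])          # hypothetical mug location (cell coords)

ys, xs = np.mgrid[0:20, 0:20]
dist = np.sqrt((ys - obstacle[0])**2 + (xs - obstacle[1])**2)
grid = np.maximum(0.0, 5.0 - dist)     # cost > 0 within 5 cells of the mug

# A planner would now pick a lowest-cost path; here we just inspect the field.
print(grid[10, 10], grid[0, 0])        # high at the mug, zero far away
```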

Inner Monologue (Google 2022)

LLM "thinks out loud" between actions. Each step's output describes what was attempted, what happened, what's next. Produces resilient long-horizon execution.
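Mechanically, this is a growing transcript: each step appends an action line and an observation line, and the whole transcript becomes the LLM's context for the next decision. A sketch with canned observations:

```python
# Inner-Monologue-style transcript: action + observation per step (canned data).
transcript = ["Task: put the apple in the drawer."]

steps = [("open the drawer", "success"),
         ("pick up the apple", "failed: apple slipped"),
         ("pick up the apple", "success"),
         ("place apple in drawer", "success")]

for action, observation in steps:
    transcript.append(f"Robot: {action}")
    transcript.append(f"Scene: {observation}")   # fed back before the next plan call

print("\n".join(transcript))
```

Because the failure ("apple slipped") is in the transcript, the next planning call sees it and can retry, which is where the resilience comes from.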

RT-2 / VLA family

The opposite direction: a VLA does both reasoning AND action in one model. Less modular but trainable end-to-end.

The 2026 production pattern

Most deployed embodied LLM systems use:

  • Vision-language model (Claude, Gemini Robotics, GPT-4V) for high-level reasoning. Slow (1–10 s) but flexible.
  • Skill library: 20–100 pretrained skills (grasp object, navigate to room, open drawer). Fast, tested.
  • Glue code: Python that translates LLM output to skill calls.
  • Verification: post-execution checks; LLM re-plans if expected outcome not achieved.

Think Figure's Helix, Boston Dynamics' Spot AI integrations, Google's Gemini Robotics deployments. The pattern is consistent.
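The verification bullet is the load-bearing one: run the plan, check the postcondition, and only go back to the (slow) LLM when the check fails. A sketch with the planner and checks stubbed out:

```python
# Execute-verify-replan loop (planner and checks are stand-ins).
def llm_plan(goal, attempt):
    # Stand-in for the slow VLM call; a real system would prompt here.
    return ["open_gripper", "grasp_cup"] if attempt == 0 else ["regrasp_cup"]

def run_skill(skill):
    return skill != "grasp_cup"          # simulate one failing grasp

def verify(goal):
    return True                          # stand-in postcondition check

for attempt in range(3):
    ok = all(run_skill(s) for s in llm_plan("pick up the cup", attempt))
    if ok and verify("pick up the cup"):
        print(f"done after {attempt + 1} attempt(s)")
        break
```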

Skill library design

The skill API matters more than the LLM itself.

Good skills:

  • Composable (navigate then grasp).
  • Verifiable (return success / failure cleanly).
  • Atomic (a skill does one thing well).
  • Parameterized (grasp(object_id, grasp_type)).

Bad skill APIs lead to LLM confusion: hallucinated parameters, missed preconditions.
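Those four properties suggest a concrete shape: each skill is a typed function with explicit parameters and an explicit result the planner can branch on. A sketch (names and grasp types are hypothetical):

```python
from dataclasses import dataclass

# One possible skill-API shape: explicit parameters, explicit success/failure.
@dataclass
class SkillResult:
    success: bool
    message: str = ""

def grasp(object_id: str, grasp_type: str = "top") -> SkillResult:
    # A real skill would call a controller; here we only validate inputs.
    if grasp_type not in ("top", "side", "pinch"):
        return SkillResult(False, f"unknown grasp_type {grasp_type!r}")
    return SkillResult(True, f"grasped {object_id} ({grasp_type})")

r = grasp("cup_1", "side")
print(r.success, r.message)
```

Returning a structured result instead of raising makes the failure branch something the LLM can read and react to.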

The world model

LLMs need to read the world. Standard representations:

  • Object list: "I see [cup, plate, fork] on the table."
  • Spatial relations: "cup is on the plate; plate is on the table."
  • Visual frames: send raw images to a multimodal LLM.
  • Captioned scene: a small VLM describes the scene; LLM consumes the description.

Hybrid approaches (object list + key images) give the LLM enough structure to act on without overwhelming context.
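Serializing that hybrid state into the prompt is plain string-building; the point is to keep it short and structured. A sketch:

```python
# Render a structured world state as compact prompt text.
world = {
    "objects": ["cup", "plate", "fork"],
    "relations": [("cup", "on", "plate"), ("plate", "on", "table")],
}

lines = ["Objects: " + ", ".join(world["objects"])]
lines += [f"{a} is {rel} the {b}" for a, rel, b in world["relations"]]
prompt_state = "\n".join(lines)
print(prompt_state)
```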

Where it's still hard

  • Latency: a 5-second LLM call between every action makes long sequences unusably slow.
  • Hallucinated API calls: the LLM invents a function that doesn't exist; the system crashes.
  • State tracking: the LLM forgets what's already happened in long episodes.
  • Negotiating physical reality: the LLM proposes "open the drawer"; the drawer is locked. The system has to detect the failure and adapt.
  • Spatial reasoning: "the object behind the cup" — current LLMs often guess wrong.

The mitigations

  • Plan once, execute many: get the full plan upfront; don't call LLM per action.
  • Strict APIs: schema-validate LLM outputs; reject + retry on hallucinations.
  • Persistent scene memory: track what's been done; provide as context.
  • Local recovery skills: "if grasp fails, retry up to 3 times" without re-prompting LLM.
  • Hybrid VLM perception: ground spatial language with vision-language alignment (CLIP, GroundingDINO).
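The "strict APIs" mitigation amounts to validating every LLM tool call against the skill registry before executing anything, and sending rejects back for a retry. A sketch without a real LLM in the loop (registry contents are illustrative):

```python
# Validate LLM tool calls against a skill registry before executing anything.
REGISTRY = {"grasp": {"object_id"}, "place": {"object_id", "location"}}

def validate(call):
    name, params = call["name"], set(call.get("params", {}))
    if name not in REGISTRY:
        return f"unknown skill {name!r}"
    if params != REGISTRY[name]:
        return f"bad params for {name}: {sorted(params)}"
    return None  # valid

good = {"name": "grasp", "params": {"object_id": "cup_1"}}
bad = {"name": "teleport", "params": {"object_id": "cup_1"}}   # hallucinated

print(validate(good))   # None
print(validate(bad))    # error string -> feed back to the LLM and retry
```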

The model choices in 2026

Model                Use for
Claude / GPT-4o      High-level planning; strong reasoning
Gemini Robotics-ER   Embodied-tuned reasoning; spatial understanding
Llama-3.2-Vision     On-device; cost-sensitive deployments
π0.5 / Helix         VLA with reasoning; end-to-end alternative

The auth and tool-use angle

Many embodied LLM stacks now use the same tool-calling APIs as software agents (function-call schema, JSON-RPC, MCP). Treat the robot as another tool the LLM has access to.

This is the architecture behind RobotForge's own /cad MCP integration: an LLM can call create_box, boolean, take_screenshot the same way it calls get_weather or send_email. Same protocol; different domain.
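In tool-calling terms, a robot skill is just another schema entry, the same shape a software agent would use for get_weather. A sketch of what one might look like (the parameter names are assumptions, not a real RobotForge API):

```python
import json

# A robot skill expressed as a generic tool-calling schema entry.
# Parameter names here are illustrative, not a documented API.
grasp_tool = {
    "name": "grasp",
    "description": "Close the gripper on a detected object.",
    "parameters": {
        "type": "object",
        "properties": {
            "object_id": {"type": "string"},
            "grasp_type": {"type": "string", "enum": ["top", "side"]},
        },
        "required": ["object_id"],
    },
}

print(json.dumps(grasp_tool, indent=2))
```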

Where this is heading

  • Smaller / faster LLMs: 1–10B parameter VLMs on Jetson. Latency drops to 200 ms; closed-loop becomes feasible.
  • Action-tuned LLMs: models specifically trained for robot tool use. Less hallucination.
  • Multi-agent robots: multiple LLM agents (planner, perception, execution monitor) coordinating.
  • Self-improving from logs: failure logs become training data; LLM gets better over time.

Exercise

Build a small embodied LLM demo: a 6-DOF arm + RGB camera. Define a skill API (grasp, place, navigate) using OpenVLA or hand-coded skills. Use Claude or GPT-4o to plan from a high-level prompt: "tidy the desk." Watch the LLM decompose, execute, recover. The first time the system handles a multi-step task autonomously, the architecture's value clicks.

Next

Safety and certification — the standards that separate a lab demo from a deployed product.
