What 'pick-and-place' actually involves
Moving one object from A to B sounds trivial. It's the unsolved problem most of robotics research is still chipping at. Here are the five steps that hide under the phrase.
"Pick up the block and put it in the box." A 3-year-old child executes this in half a second. A robot, with all of deep learning's 2026 tricks, often still fails. Before you can appreciate why — or start building one — you need to see the five distinct problems hiding inside the phrase.
The five steps
- Find the object. Where is the block?
- Choose a grasp. Where on the block do I pinch, and at what angle?
- Plan a motion. What trajectory gets the gripper to that grasp without hitting anything?
- Execute it reliably. Did the grasp actually work, or did I miss?
- Place it. All of the above repeats in reverse, except now the object is in my hand.
Each step is a research field.
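Glued together, the five steps form a loop. A minimal sketch, where every function is a hypothetical stub standing in for the components discussed in the sections below:

```python
from dataclasses import dataclass

@dataclass
class Pose:
    xyz: tuple  # position in the robot base frame (m)
    rpy: tuple  # orientation as roll/pitch/yaw (rad)

# Hypothetical stage stubs -- each one hides a research field.
def find_object(rgb, depth) -> Pose: ...   # 1. perception
def choose_grasp(pose: Pose) -> Pose: ...  # 2. grasp selection
def plan_motion(grasp: Pose) -> list: ...  # 3. collision-free trajectory
def execute(trajectory) -> None: ...       # send to the controller
def grasp_succeeded() -> bool: ...         # 4. verification
def place(target: Pose) -> None: ...       # 5. placement

def pick_and_place(rgb, depth, target: Pose, retries: int = 3) -> bool:
    for _ in range(retries):
        pose = find_object(rgb, depth)
        grasp = choose_grasp(pose)
        execute(plan_motion(grasp))
        if grasp_succeeded():  # don't place with an empty gripper
            place(target)
            return True
    return False
```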
1. Finding the object
Inputs: RGB image, depth image, or both. Output: the pose of the object — position and orientation in the camera frame, then converted to the robot base frame.
Tools:
- 2D detection + depth lookup: YOLO/RT-DETR gives you a bounding box; a depth camera tells you how far away it is. Fast, approximate.
- 6-DOF pose estimation: FoundationPose, MegaPose, DOPE. Slower, but gives you full orientation — necessary if the object has a distinct top and bottom.
- Segmentation + principal component analysis: SAM masks the object; PCA on the depth points gives you approximate orientation. Works surprisingly well for industrial parts.
Failure modes: object partially occluded, strong reflections on metallic surfaces, objects smaller than your depth camera's resolution.
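As a concrete sketch of the "2D detection + depth lookup" route: once a detector hands you a bounding box, back-project its center pixel through the pinhole camera model to get a camera-frame point. The intrinsics below are made-up values; a real depth camera publishes its own.

```python
import numpy as np

def pixel_to_camera_point(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) with measured depth into the camera frame."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

# Assumed intrinsics for a 640x480 camera (illustrative, not a real calibration).
fx = fy = 600.0        # focal lengths in pixels
cx, cy = 320.0, 240.0  # principal point

# Object detected at the image center, 0.5 m away:
p_cam = pixel_to_camera_point(320, 240, 0.5, fx, fy, cx, cy)
# p_cam lies on the optical axis: [0, 0, 0.5]
```

From there, a fixed extrinsic transform (camera-to-base) turns `p_cam` into a point the planner can use.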
2. Choosing a grasp
You have a pose. Now, where on the object do you actually grab? For a rectangular block, obvious — pinch the middle. For a mug, less obvious (handle, rim, body?). For a wrench, depends on what you'll do next.
Approaches:
- Grasp from geometry: pick antipodal contact points, check force closure analytically. Classical, still used for well-known objects.
- Learned grasp proposal: Dex-Net, GraspNet, Contact-GraspNet. Networks trained on simulated grasps that output a distribution of good grasps for any point cloud.
- Task-conditioned grasping: for "pick up the cup to drink," pinch the body; for "put it in the dishwasher," grab the handle.
Failure modes: grasps that collide with clutter in the last centimeter of the approach; grasps that close successfully but drop the object mid-motion due to slippage.
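The classical geometric check fits in a few lines: two contacts are antipodal if the line between them lies inside both friction cones. The friction coefficient and contact values below are illustrative assumptions.

```python
import numpy as np

def is_antipodal(p1, n1, p2, n2, mu=0.5):
    """Check whether two contacts form an antipodal grasp.

    p1, p2: contact positions; n1, n2: inward-pointing surface normals.
    The grasp axis must lie within both friction cones (half-angle atan(mu)).
    """
    axis = p2 - p1
    axis = axis / np.linalg.norm(axis)
    half_angle = np.arctan(mu)
    # Angle between the grasp axis and each inward normal.
    a1 = np.arccos(np.clip(np.dot(axis, n1 / np.linalg.norm(n1)), -1, 1))
    a2 = np.arccos(np.clip(np.dot(-axis, n2 / np.linalg.norm(n2)), -1, 1))
    return a1 <= half_angle and a2 <= half_angle

# Pinching a 5 cm block across two parallel faces: normals oppose, grasp holds.
p1, n1 = np.array([0.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0])
p2, n2 = np.array([0.05, 0.0, 0.0]), np.array([-1.0, 0.0, 0.0])
```

Learned grasp proposers effectively amortize this check (plus collision and quality scoring) over whole point clouds.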
3. Planning the motion
Start: current arm configuration. Goal: configuration where the gripper reaches the grasp pose. Constraints: no collisions with the table, the object, the box, the second arm, the cat.
Tools covered in the Planning track: RRT, RRT*, trajectory optimization, MoveIt. For many pick-and-place setups, MoveIt just works; for contact-rich or cluttered scenes, you'll tune it.
Failure modes: the planner times out on hard scenes; the planned path looks fine but violates a joint-velocity limit during execution; the arm collides with the object you're grasping because collision-checking treated it as a fixed obstacle instead of part of the robot after pick-up.
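The joint-velocity failure mode above is cheap to guard against before execution: finite-difference the planned waypoints and compare against per-joint limits. A minimal sketch (the timestep and limit values are assumptions):

```python
import numpy as np

def violates_velocity_limits(waypoints, dt, limits):
    """Finite-difference check on a planned path.

    waypoints: (N, J) joint positions; dt: seconds between waypoints;
    limits: (J,) max absolute joint velocities (rad/s).
    """
    q = np.asarray(waypoints, dtype=float)
    vel = np.abs(np.diff(q, axis=0)) / dt   # per-segment joint velocities
    return bool(np.any(vel > np.asarray(limits)))

# Two-joint arm, waypoints 0.1 s apart, limits of 2 rad/s per joint (assumed).
path = [[0.0, 0.0], [0.1, 0.5]]   # joint 2 moves 0.5 rad in 0.1 s -> 5 rad/s
```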
4. Executing reliably
Two things can go wrong during the grasp itself:
- The gripper reaches the grasp but the object isn't where you thought it was (perception error).
- The gripper closes, but the object slips out (contact uncertainty).
Verification:
- Post-grasp force sensors — is there something in the gripper?
- Post-grasp vision — is the object gone from the table?
- Tactile sensors — do contact patterns match the expected geometry?
Smart pick-and-place pipelines know when a grasp failed and retry. Dumb ones calmly proceed to the placement with an empty gripper.
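The simplest of these checks, gripper width, can be sketched as follows: a fully closed gripper means the fingers met each other rather than the object. The widths and tolerance here are illustrative.

```python
def grasp_verified(gripper_width_m, expected_object_width_m, tol=0.005):
    """Verify a grasp from the gripper opening alone.

    A width near zero means the fingers closed on nothing; a width far from
    the expected object size means we pinched the wrong thing.
    """
    if gripper_width_m < tol:
        return False  # closed on empty air
    return abs(gripper_width_m - expected_object_width_m) <= tol
```

A pipeline would combine this with a vision recheck before committing to the place motion.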
5. Placing
Easier than picking, usually — the release doesn't need a precise grip, just a clear trajectory to the drop location. Complications arise when:
- The box is narrow and the object has to be oriented correctly to fit.
- You need a stable placement (can't just drop it).
- The box has other objects in it and you need to place without disturbing them.
Task-and-motion planning (TAMP) for stacking, sorting, and tight-fitting placements is an active research area.
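A taste of why narrow-box placement needs orientation reasoning: even a conservative axis-aligned footprint test rules out most yaw angles. The object and box dimensions below are made up.

```python
import numpy as np

def fits_opening(obj_l, obj_w, yaw, box_l, box_w):
    """Does a rectangular object's footprint, rotated by yaw, fit through a
    rectangular box opening? Uses the axis-aligned bounding box of the rotated
    footprint, so it is conservative (may reject tight diagonal fits)."""
    c, s = abs(np.cos(yaw)), abs(np.sin(yaw))
    aabb_l = obj_l * c + obj_w * s
    aabb_w = obj_l * s + obj_w * c
    return aabb_l <= box_l and aabb_w <= box_w

# A 10 cm x 2 cm part over an 11 cm x 5 cm opening: only near-aligned yaws fit.
```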
Why this is still hard
Each step has a ~95% success rate on easy problems. Chain five 95%s and you get 0.95⁵ ≈ 77%. That's fine for a demo; not for a warehouse that needs 99.9%.
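The arithmetic, plus the reverse question (what per-stage rate a 99.9% overall target demands), in a few lines:

```python
# Five independent stages, each ~95% reliable on easy problems.
p_stage = 0.95
p_total = p_stage ** 5
print(f"overall: {p_total:.1%}")       # roughly 77%

# Per-stage rate needed for 99.9% end-to-end success:
p_required = 0.999 ** (1 / 5)
print(f"per stage: {p_required:.4%}")  # roughly 99.98%
```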
Progress in 2024–26 has come from two directions:
- End-to-end policies (VLAs). Skip the explicit pipeline; train a neural net to go from pixels to motor commands directly. π0, RT-2, Gemini Robotics.
- Better components. SAM for segmentation, FoundationPose for pose, diffusion policies for execution — each of which improves the classical pipeline's individual success rates.
Both approaches are valid. Both are areas where hobbyists can contribute — the gear and the models are cheap enough that the bottleneck is experimentation, not budget.
A starting project
Pick a simulated scene with blocks on a table and a box. Use a classical pipeline: a scripted 6-DOF pose estimator (or cheat with ground truth), a fixed top-down grasp, MoveIt for motion, a scripted drop. Get it to 90% success. Now add real-world complications one at a time: pose noise, slippery blocks, occluding clutter. Watch the 90% become 60% become 30%. You now know why this field exists.
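The degradation in that experiment can be previewed with a toy Monte Carlo: assume every stage is perfect except pose estimation, and a grasp succeeds only when the pose error stays inside the gripper's tolerance. Both numbers below are assumptions, not measurements.

```python
import random

def trial(noise_std_m, grasp_tolerance_m=0.01):
    """One simulated pick: succeeds iff the perceived-pose error is within
    the gripper's 1 cm tolerance (assumed). All other stages are perfect."""
    error = random.gauss(0.0, noise_std_m)
    return abs(error) < grasp_tolerance_m

def success_rate(noise_std_m, n=10_000):
    random.seed(0)  # reproducible runs
    return sum(trial(noise_std_m) for _ in range(n)) / n

for sigma in (0.002, 0.005, 0.010):
    print(f"sigma = {sigma * 1000:.0f} mm -> {success_rate(sigma):.0%}")
# Success collapses as pose noise approaches the grasp tolerance.
```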