What 'pick-and-place' actually involves
Moving one object from A to B sounds trivial. It's the unsolved problem most of robotics research is still chipping at. Here are the five steps that hide under the phrase.
"Pick up the block and put it in the box." A 3-year-old child executes this in half a second. A robot, with all of deep learning's 2026 tricks, often still fails. Before you can appreciate why — or start building one — you need to see the five distinct problems hiding inside the phrase.
The five steps
- Find the object. Where is the block?
- Choose a grasp. Where on the block do I pinch, and at what angle?
- Plan a motion. What trajectory gets the gripper to that grasp without hitting anything?
- Execute it reliably. Did the grasp actually work, or did I miss?
- Place it. All of the above repeats in reverse, except now the object is in my hand.
Each step is a research field.
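Glued together, the five steps form a loop. A minimal sketch, where every function is a hypothetical stub standing in for the components discussed in the sections below:

```python
from dataclasses import dataclass

@dataclass
class Pose:
    xyz: tuple  # position in the robot base frame (m)
    rpy: tuple  # orientation as roll/pitch/yaw (rad)

# Hypothetical stage stubs -- each one hides a research field.
def find_object(rgb, depth) -> Pose: ...   # 1. perception
def choose_grasp(pose: Pose) -> Pose: ...  # 2. grasp selection
def plan_motion(grasp: Pose) -> list: ...  # 3. collision-free trajectory
def execute(trajectory) -> None: ...       # send to the controller
def grasp_succeeded() -> bool: ...         # 4. verification
def place(target: Pose) -> None: ...       # 5. placement

def pick_and_place(rgb, depth, target: Pose, retries: int = 3) -> bool:
    for _ in range(retries):
        pose = find_object(rgb, depth)
        grasp = choose_grasp(pose)
        execute(plan_motion(grasp))
        if grasp_succeeded():  # don't place with an empty gripper
            place(target)
            return True
    return False
```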
1. Finding the object
Inputs: RGB image, depth image, or both. Output: the pose of the object — position and orientation in the camera frame, then converted to the robot base frame.
Tools:
- 2D detection + depth lookup: YOLO/RT-DETR gives you a bounding box; a depth camera tells you how far away it is. Fast, approximate.
- 6-DOF pose estimation: FoundationPose, MegaPose, DOPE. Slower, but gives you full orientation — necessary if the object has a distinct top and bottom.
- Segmentation + principal component analysis: SAM masks the object; PCA on the depth points gives you approximate orientation. Works surprisingly well for industrial parts.
Failure modes: object partially occluded, strong reflections on metallic surfaces, objects smaller than your depth camera's resolution.
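As a concrete sketch of the "2D detection + depth lookup" route: once a detector hands you a bounding box, back-project its center pixel through the pinhole camera model to get a camera-frame point. The intrinsics below are made-up values; a real depth camera publishes its own.

```python
import numpy as np

def pixel_to_camera_point(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) with measured depth into the camera frame."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

# Assumed intrinsics for a 640x480 camera (illustrative, not a real calibration).
fx = fy = 600.0        # focal lengths in pixels
cx, cy = 320.0, 240.0  # principal point

# Object detected at the image center, 0.5 m away:
p_cam = pixel_to_camera_point(320, 240, 0.5, fx, fy, cx, cy)
# p_cam lies on the optical axis: [0, 0, 0.5]
```

From there, a fixed extrinsic transform (camera-to-base) turns `p_cam` into a point the planner can use.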
2. Choosing a grasp
You have a pose. Now, where on the object do you actually grab? For a rectangular block, obvious — pinch the middle. For a mug, less obvious (handle, rim, body?). For a wrench, depends on what you'll do next.
Approaches:
- Grasp from geometry: pick antipodal contact points, check force closure analytically. Classical, still used for well-known objects.
- Learned grasp proposal: Dex-Net, GraspNet, Contact-GraspNet. Networks trained on simulated grasps that output a distribution of good grasps for any point cloud.
- Task-conditioned grasping: for "pick up the cup to drink," pinch the body; for "put it in the dishwasher," grab the handle.
Failure modes: grasps that collide with clutter in the last centimeter of the approach; grasps that close successfully but drop the object mid-motion due to slippage.
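The classical geometric check fits in a few lines: two contacts are antipodal if the line between them lies inside both friction cones. The friction coefficient and contact values below are illustrative assumptions.

```python
import numpy as np

def is_antipodal(p1, n1, p2, n2, mu=0.5):
    """Check whether two contacts form an antipodal grasp.

    p1, p2: contact positions; n1, n2: inward-pointing surface normals.
    The grasp axis must lie within both friction cones (half-angle atan(mu)).
    """
    axis = p2 - p1
    axis = axis / np.linalg.norm(axis)
    half_angle = np.arctan(mu)
    # Angle between the grasp axis and each inward normal.
    a1 = np.arccos(np.clip(np.dot(axis, n1 / np.linalg.norm(n1)), -1, 1))
    a2 = np.arccos(np.clip(np.dot(-axis, n2 / np.linalg.norm(n2)), -1, 1))
    return a1 <= half_angle and a2 <= half_angle

# Pinching a 5 cm block across two parallel faces: normals oppose, grasp holds.
p1, n1 = np.array([0.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0])
p2, n2 = np.array([0.05, 0.0, 0.0]), np.array([-1.0, 0.0, 0.0])
```

Learned grasp proposers effectively amortize this check (plus collision and quality scoring) over whole point clouds.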
3. Planning the motion
Start: current arm configuration. Goal: configuration where the gripper reaches the grasp pose. Constraints: no collisions with the table, the object, the box, the second arm, the cat.
Tools covered in the Planning track: RRT, RRT*, trajectory optimization, MoveIt. For many pick-and-place setups, MoveIt just works; for contact-rich or cluttered scenes, you'll tune it.
Failure modes: the planner times out on hard scenes; the planned path looks fine but violates a joint-velocity limit during execution; the arm collides with the object you're grasping because collision-checking treated it as a fixed obstacle instead of part of the robot after pick-up.
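The joint-velocity failure mode above is cheap to guard against before execution: finite-difference the planned waypoints and compare against per-joint limits. A minimal sketch (the timestep and limit values are assumptions):

```python
import numpy as np

def violates_velocity_limits(waypoints, dt, limits):
    """Finite-difference check on a planned path.

    waypoints: (N, J) joint positions; dt: seconds between waypoints;
    limits: (J,) max absolute joint velocities (rad/s).
    """
    q = np.asarray(waypoints, dtype=float)
    vel = np.abs(np.diff(q, axis=0)) / dt   # per-segment joint velocities
    return bool(np.any(vel > np.asarray(limits)))

# Two-joint arm, waypoints 0.1 s apart, limits of 2 rad/s per joint (assumed).
path = [[0.0, 0.0], [0.1, 0.5]]   # joint 2 moves 0.5 rad in 0.1 s -> 5 rad/s
```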
4. Executing reliably
Two things can go wrong during the grasp itself:
- The gripper reaches the grasp but the object isn't where you thought it was (perception error).
- The gripper closes, but the object slips out (contact uncertainty).
Verification:
- Post-grasp force sensors — is there something in the gripper?
- Post-grasp vision — is the object gone from the table?
- Tactile sensors — do contact patterns match the expected geometry?
Smart pick-and-place pipelines know when a grasp failed and retry. Dumb ones calmly proceed to the placement with an empty gripper.
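The simplest of these checks, gripper width, can be sketched as follows: a fully closed gripper means the fingers met each other rather than the object. The widths and tolerance here are illustrative.

```python
def grasp_verified(gripper_width_m, expected_object_width_m, tol=0.005):
    """Verify a grasp from the gripper opening alone.

    A width near zero means the fingers closed on nothing; a width far from
    the expected object size means we pinched the wrong thing.
    """
    if gripper_width_m < tol:
        return False  # closed on empty air
    return abs(gripper_width_m - expected_object_width_m) <= tol
```

A pipeline would combine this with a vision recheck before committing to the place motion.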
5. Placing
Easier than picking, usually — the release doesn't need a precise grip, just a clear trajectory to the drop location. Complications arise when:
- The box is narrow and the object has to be oriented correctly to fit.
- You need a stable placement (can't just drop it).
- The box has other objects in it and you need to place without disturbing them.
Task-and-motion planning (TAMP) for stacking, sorting, and tight-fitting placements is an active research area.
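A taste of why narrow-box placement needs orientation reasoning: even a conservative axis-aligned footprint test rules out most yaw angles. The object and box dimensions below are made up.

```python
import numpy as np

def fits_opening(obj_l, obj_w, yaw, box_l, box_w):
    """Does a rectangular object's footprint, rotated by yaw, fit through a
    rectangular box opening? Uses the axis-aligned bounding box of the rotated
    footprint, so it is conservative (may reject tight diagonal fits)."""
    c, s = abs(np.cos(yaw)), abs(np.sin(yaw))
    aabb_l = obj_l * c + obj_w * s
    aabb_w = obj_l * s + obj_w * c
    return aabb_l <= box_l and aabb_w <= box_w

# A 10 cm x 2 cm part over an 11 cm x 5 cm opening: only near-aligned yaws fit.
```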
Why this is still hard
Each step has a ~95% success rate on easy problems. Chain five 95%s and you get 0.95⁵ ≈ 77%. That's fine for a demo; not for a warehouse that needs 99.9%.
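The arithmetic, plus the reverse question (what per-stage rate a 99.9% overall target demands), in a few lines:

```python
# Five independent stages, each ~95% reliable on easy problems.
p_stage = 0.95
p_total = p_stage ** 5
print(f"overall: {p_total:.1%}")       # roughly 77%

# Per-stage rate needed for 99.9% end-to-end success:
p_required = 0.999 ** (1 / 5)
print(f"per stage: {p_required:.4%}")  # roughly 99.98%
```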
Progress in 2024–26 has come from two directions:
- End-to-end policies (VLAs). Skip the explicit pipeline; train a neural net to go from pixels to motor commands directly. π0, RT-2, Gemini Robotics.
- Better components. SAM for segmentation, FoundationPose for pose, diffusion policies for execution — each of which improves the classical pipeline's individual success rates.
Both approaches are valid. Both are areas where hobbyists can contribute — the gear and the models are cheap enough that the bottleneck is experimentation, not budget.
A starting project
Pick a simulated scene with blocks on a table and a box. Use a classical pipeline: a scripted 6-DOF pose estimator (or cheat with ground truth), a fixed top-down grasp, MoveIt for motion, a scripted drop. Get it to 90% success. Now add real-world complications one at a time: pose noise, slippery blocks, occluding clutter. Watch the 90% become 60% become 30%. You now know why this field exists.
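The degradation in that experiment can be previewed with a toy Monte Carlo: assume every stage is perfect except pose estimation, and a grasp succeeds only when the pose error stays inside the gripper's tolerance. Both numbers below are assumptions, not measurements.

```python
import random

def trial(noise_std_m, grasp_tolerance_m=0.01):
    """One simulated pick: succeeds iff the perceived-pose error is within
    the gripper's 1 cm tolerance (assumed). All other stages are perfect."""
    error = random.gauss(0.0, noise_std_m)
    return abs(error) < grasp_tolerance_m

def success_rate(noise_std_m, n=10_000):
    random.seed(0)  # reproducible runs
    return sum(trial(noise_std_m) for _ in range(n)) / n

for sigma in (0.002, 0.005, 0.010):
    print(f"sigma = {sigma * 1000:.0f} mm -> {success_rate(sigma):.0%}")
# Success collapses as pose noise approaches the grasp tolerance.
```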