Grasp planning with deep learning
Dex-Net, Contact-GraspNet, and the model families that ate the grasp-planning world. How they work, what they need, and the workflow that gets you a working grasper in a weekend.
For 30 years, grasp planning was a hand-engineered pipeline: detect the object, fit a primitive, sample antipodal grasps, score each with force-closure metrics, pick the best. Between 2017 and 2023 that workflow got eaten by deep learning: a network looks at a depth image or point cloud and outputs a ranked list of grasp candidates. The classical theory is still the training signal, but the inference is fast, robust, and orders of magnitude better at messy real-world clutter.
The architecture pattern
Deep grasp models almost all share this template:
- Input: a depth image, RGB-D image, or point cloud of the scene.
- Network: convnet (for images) or PointNet/PointNet++/sparse-3D-conv (for point clouds).
- Output: per-pixel or per-point grasp predictions — for each candidate location, a 6-DOF grasp pose plus a quality score.
- Post-processing: filter, sort, take top-k. Optionally collision-check.
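To make the output side concrete: a minimal per-point prediction head in PyTorch. The class name, feature dimension, and pose parameterization below are illustrative, not any published model's API; the point is the shape of the output, one score and one pose per input point.

import torch
import torch.nn as nn

class GraspHead(nn.Module):
    """Illustrative per-point grasp head, not any specific published model."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.quality = nn.Linear(feat_dim, 1)    # grasp success logit per point
        self.offset = nn.Linear(feat_dim, 3)     # translation offset from the point
        self.rotation = nn.Linear(feat_dim, 3)   # rotation as an axis-angle vector
        self.width = nn.Linear(feat_dim, 1)      # gripper opening width

    def forward(self, point_feats):              # (B, N, feat_dim) from a point-cloud encoder
        return {
            "score": torch.sigmoid(self.quality(point_feats)).squeeze(-1),  # (B, N)
            "trans": self.offset(point_feats),                              # (B, N, 3)
            "rot": self.rotation(point_feats),                              # (B, N, 3)
            "width": self.width(point_feats).squeeze(-1),                   # (B, N)
        }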
Training data is generated in simulation: synthesize random object piles, sample candidate grasps, evaluate each with classical force-closure analysis, label "good" vs "bad." Train the network to predict the label. At test time, drop the classical analysis; the network has internalized it.
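A schematic version of that loop, assuming hypothetical stand-ins (sim.spawn_pile, sample_antipodal_grasp, force_closure_quality) for whatever simulator and classical grasp analysis your stack provides:

import numpy as np

def generate_labels(sim, n_scenes=1000, grasps_per_scene=200):
    """Hypothetical sim-labeling loop; every helper name here is a stand-in."""
    dataset = []
    for _ in range(n_scenes):
        sim.spawn_pile(np.random.randint(3, 10))        # random object pile
        cloud = sim.render_depth_cloud()                # synthetic depth -> point cloud
        for _ in range(grasps_per_scene):
            g = sample_antipodal_grasp(cloud)           # classical candidate sampler
            q = force_closure_quality(sim, g)           # classical wrench-space analysis
            dataset.append((cloud, g, float(q > 0.5)))  # binary good/bad label
    return dataset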
The four families that matter in 2026
Dex-Net (Berkeley, 2017–2020)
The pioneering family. Trained on millions of point-cloud examples with force-closure labels. Top-down parallel-jaw grasps. Strengths: well-validated, mature, runs on a single GPU. Limits: top-down only; struggles with cluttered scenes and side-grasps.
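The top-down restriction has an upside: a grasp is just a pixel location, a depth, and an in-plane angle, and standard pinhole back-projection recovers the 3D grasp center. The function below is generic camera geometry, not Dex-Net's actual interface:

import numpy as np

def planar_grasp_to_point(u, v, z, K):
    """Back-project a grasp pixel (u, v) at depth z into the camera frame.
    The remaining degree of freedom, the in-plane angle, rotates the jaws
    about the camera's z-axis."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])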
GraspNet-1Billion (SJTU, 2020+)
Massive public dataset (over 1 billion grasp labels across ~97k RGB-D images of 190 cluttered scenes), full 6-DOF grasps. Models like AnyGrasp build on it. Strengths: the dataset is the gold standard for benchmarking; covers many objects and clutter levels. Limits: trained mostly on small bin-picking-style scenes.
Contact-GraspNet (NVIDIA, 2021)
Direct point-cloud → 6-DOF grasps. Each input point is treated as a candidate contact; the network predicts the grasp aligned with that contact. Very fast inference; widely used in research and Isaac Sim demos. Strengths: full 6-DOF, works on cluttered scenes, real-time. Limits: depth data quality matters a lot.
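The pose construction is worth seeing: one contact point plus two predicted unit vectors pin down a full 6-DOF frame. The sketch below shows the general construction; the exact frame convention and standoff distance vary by model, so treat both as assumptions:

import numpy as np

def grasp_from_contact(contact, approach, baseline, width, d_offset=0.10):
    """Build a 6-DOF grasp from one contact point and two predicted vectors.
    Frame convention and d_offset are illustrative; check your model's docs."""
    a = approach / np.linalg.norm(approach)        # gripper approach axis
    b = baseline / np.linalg.norm(baseline)        # closing direction (contact to contact)
    b = b - a * np.dot(a, b)                       # re-orthogonalize against approach
    b /= np.linalg.norm(b)
    R = np.stack([b, np.cross(a, b), a], axis=1)   # columns: x=closing, z=approach
    t = contact + 0.5 * width * b - d_offset * a   # gripper center from the contact
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T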
Diffusion-based grasp samplers (2024+)
Newer line of work using diffusion models to sample grasp poses, conditioned on the scene point cloud. Strengths: produces diverse grasps (multimodal grasp distributions); easier to handle "either pinch the handle or pinch the rim" scenarios. Limits: slower inference (multiple diffusion steps).
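Schematically, inference becomes a denoising loop over pose space instead of a single forward pass, which is where the latency goes. A heavily simplified sketch; denoise_step is a hypothetical name, and a real sampler uses a proper DDPM/DDIM update rather than a bare subtraction:

import torch

@torch.no_grad()
def sample_grasps(model, scene_feats, n_grasps=64, n_steps=50):
    """Schematic reverse-diffusion sampler over grasp poses."""
    poses = torch.randn(n_grasps, 7)   # start from noise: xyz + quaternion
    for t in reversed(range(n_steps)):
        eps = model.denoise_step(poses, t, scene_feats)  # hypothetical noise net
        poses = poses - eps            # simplified update; real samplers follow a schedule
    poses[:, 3:] /= poses[:, 3:].norm(dim=1, keepdim=True)  # renormalize quaternions
    return poses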
What you need to deploy one
- A depth camera — Intel RealSense D435/D455, Azure Kinect, or a stereo camera. Quality directly affects grasp success. Budget $300–500.
- A GPU — RTX 3060 minimum for real-time inference. Some smaller models run on Jetson Orin.
- Camera calibration — intrinsics, plus extrinsics relative to the robot base. Off by a centimeter on extrinsics → grasps consistently miss. (See the composition sketch after this list.)
- A motion planner — given a target grasp pose, plan an arm trajectory to reach it without collision. MoveIt is the standard.
- Pretrained weights — most models have public checkpoints. Don't retrain unless you must.
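For a wrist-mounted camera, the base-frame extrinsic is recomposed from forward kinematics at every capture, while hand-eye calibration supplies the fixed flange-to-camera transform. A minimal composition sketch, with transform names that are mine rather than any library's:

import numpy as np

def camera_extrinsics(T_base_flange, T_flange_camera):
    """Compose the camera pose in the robot base frame.
    T_base_flange comes from forward kinematics at capture time;
    T_flange_camera is the fixed hand-eye calibration result."""
    return T_base_flange @ T_flange_camera

def to_base_frame(points_camera, T_base_camera):
    """Transform an (N, 3) point cloud from camera frame to base frame."""
    pts_h = np.hstack([points_camera, np.ones((len(points_camera), 1))])
    return (T_base_camera @ pts_h.T).T[:, :3]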
The deployment pipeline
def grasp_once(camera, robot, planner, model):
    rgb, depth = camera.capture()
    cloud = depth_to_pointcloud(depth, K=camera.intrinsics)
    grasps, scores = model.predict(cloud)  # network inference
    top = sorted(zip(grasps, scores), key=lambda x: -x[1])[:50]
    for grasp_pose, score in top:
        # Transform to base frame
        grasp_in_base = T_base_camera @ grasp_pose
        # Check collisions, joint limits, reachability
        if planner.is_reachable(grasp_in_base):
            traj = planner.plan_to(grasp_in_base)
            if traj is not None:
                robot.execute(traj)
                robot.close_gripper()
                return True
    return False  # no feasible grasp
That's the entire pipeline. Most projects flesh out collision filtering, retry logic, and pre/post motions (one example follows), but the core fits in a few dozen lines.
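One of those pre-motions is worth spelling out: back off along the grasp's approach axis, plan to that pre-grasp, then drive in on a straight line. Assuming the grasp frame's z-axis is the approach direction (a common convention, but check your model's):

import numpy as np

def pregrasp_pose(T_grasp, standoff=0.10):
    """Offset a grasp pose backwards along its approach axis (assumed +z here)."""
    T_pre = T_grasp.copy()
    T_pre[:3, 3] -= standoff * T_grasp[:3, 2]   # retreat 10 cm along approach
    return T_pre

Planning to the pre-grasp and finishing with a straight Cartesian move keeps the final approach predictable and easy to collision-check.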
What goes wrong in practice
- Depth holes: shiny, transparent, or very thin objects produce missing depth pixels. The network's input is incomplete, so its predictions are unreliable. Mitigations: fall back to stereo, RGB-only models, or depth-completion networks; at minimum, mask invalid pixels before inference (see the sketch after this list).
- Reflections: metallic objects fool depth sensors. Same fix.
- Calibration drift: gradual misalignment between camera and arm. Re-calibrate weekly.
- Failure compounding: a 90% grasp predictor + 90% reach planner + 90% gripper actuation = ~73% end-to-end success. Each component matters.
- Out-of-distribution objects: networks trained on YCB-style objects may struggle with very thin, very deformable, or very large items. Only honest evaluation on your own object set reveals these failure modes.
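The cheapest of those mitigations is masking: drop invalid depth pixels before building the point cloud, and reject grasps whose neighborhood is full of holes. Thresholds below are illustrative:

import numpy as np

def clean_depth(depth, z_min=0.2, z_max=1.5):
    """Mask invalid depth pixels: zeros, NaNs, and out-of-range readings."""
    valid = np.isfinite(depth) & (depth > z_min) & (depth < z_max)
    return np.where(valid, depth, 0.0), valid

def hole_fraction_near(valid, u, v, r=10):
    """Fraction of missing pixels in a window around a candidate grasp pixel."""
    patch = valid[max(v - r, 0):v + r, max(u - r, 0):u + r]
    return 1.0 - patch.mean()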
Hybrid pipelines (state of the art in 2026)
Pure end-to-end is rarely best. Most modern grasp pipelines combine:
- Detection / segmentation — instance-level masks (SAM, Mask2Former) for "which object am I grasping."
- Grasp prediction — Contact-GraspNet, AnyGrasp, or diffusion-based.
- Grasp scoring + filtering — domain rules ("not too close to the table edge") on top of network scores.
- Tactile verification — after the gripper closes, GelSight or DIGIT confirms the grasp is solid before lifting.
This stack outperforms any single component.
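The glue between those stages is mostly filtering. A hedged sketch, where mask_lookup is a hypothetical helper that tests whether a 3D point projects into the target object's segmentation mask:

import numpy as np

def filter_grasps(grasps, scores, mask_lookup, table_edge_x, margin=0.05):
    """Combine network scores with an instance mask and a domain rule.
    mask_lookup(p) -> bool is a hypothetical helper; thresholds are illustrative."""
    kept = []
    for T, s in zip(grasps, scores):
        p = T[:3, 3]
        if not mask_lookup(p):             # grasp must land on the target object
            continue
        if p[0] > table_edge_x - margin:   # domain rule: stay off the table edge
            continue
        kept.append((T, s))
    return sorted(kept, key=lambda x: -x[1])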
What VLAs change
The 2024+ wave of vision-language-action models (π0, OpenVLA, Gemini Robotics) treats grasping as part of a higher-level task. Instead of "predict grasp pose," they predict "execute the next 16 joint commands." The grasp emerges as part of executing the task, not as a separate planning stage.
Implications:
- For language-conditioned tasks ("pick up the red mug"), VLAs are more flexible than dedicated grasp networks.
- For pure bin-picking with high precision requirements, dedicated grasp networks still win on speed and reliability.
- Hybrid stacks: VLA for high-level planning, dedicated grasp net for the actual grasp execution. Best of both.
Datasets you should know
- YCB Object Set — 77 standardized objects with meshes; the canonical grasping benchmark.
- EGAD! — procedurally generated objects spanning shape complexity; tests generalization.
- GraspNet-1Billion — 1B labeled 6-DOF grasps, the dataset most modern models train on.
- ACRONYM — 17M grasps across 8000+ objects, fully simulated.
Exercise
Set up Contact-GraspNet (or AnyGrasp) with the public weights. Use a RealSense or simulated point cloud of YCB objects. Run inference; visualize the top 10 grasps. Pick a feasible one in MoveIt. Execute. Most people build this from scratch in two days. The first time the arm reliably picks an object you've never written code for, you'll understand why deep grasping ate the field.
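For the visualization step, drawing a small coordinate frame at each predicted pose over the cloud is usually enough. An Open3D sketch, assuming grasp_poses is a list of 4x4 numpy arrays:

import open3d as o3d

def show_grasps(cloud_xyz, grasp_poses, k=10):
    """Render the point cloud with a coordinate frame at each top-k grasp pose."""
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(cloud_xyz)   # (N, 3) numpy array
    frames = []
    for T in grasp_poses[:k]:
        frame = o3d.geometry.TriangleMesh.create_coordinate_frame(size=0.05)
        frames.append(frame.transform(T))                # move frame to the grasp pose
    o3d.visualization.draw_geometries([pcd] + frames)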
Next
MoveIt 2 in practice — the planning and execution layer that actually drives the arm to the predicted grasp.