Visual servoing: IBVS vs PBVS
Close the control loop on the camera, not on joints. Image-based servoing controls in pixel space; position-based servoing controls in 3D pose. Same goal, different math, different failure modes.
Visual servoing closes the control loop on what the camera sees, not on where joint encoders say the arm is. The classical formulations split into two camps: IBVS (image-based, drive the pixel error) and PBVS (position-based, drive the 3D pose error). Both have been around for decades and still work; both are competing with end-to-end learned policies that read pixels and output torques. Here's the math, the trade-offs, and why this matters in 2026.
The setup
The camera is mounted on the arm (eye-in-hand) or fixed in the world (eye-to-hand). The robot tracks a target — a marker, a learned feature, or an object detection. Goal: drive the arm to a desired pose relative to the target.
Two natural ways to express the error:
- IBVS: error is in the image. Pixels-of-the-target-now vs pixels-of-the-target-when-we're-done. Servo the image directly to the goal image.
- PBVS: estimate the target's 3D pose from the image, compute the 3D pose error, drive that to zero with a Cartesian controller.
IBVS — control in image space
For each tracked feature point in the image, the rate at which it moves on the image plane depends on the camera's twist (linear + angular velocity) and the point's depth Z. The relationship is a 2×6 image Jacobian (interaction matrix):
$$\dot{\mathbf{s}} = \mathbf{L}_s \, \mathbf{v}_c$$
where $\mathbf{v}_c = (\mathbf{v}, \boldsymbol{\omega})$ is the camera twist and $\mathbf{L}_s$ is the pre-derived interaction matrix for a point feature.
For multiple feature points, stack the Jacobians and solve for the camera velocity that drives the image error to zero:
$$\mathbf{v}_c = -\lambda \, \mathbf{L}_s^{+} (\mathbf{s} - \mathbf{s}^*)$$
where $\mathbf{s}^*$ is the desired image feature vector, $\mathbf{L}_s^{+}$ is the pseudoinverse of the stacked interaction matrix, and $\lambda$ is a gain. The arm controller receives $\mathbf{v}_c$ (mapped from the camera frame to the base frame) and executes it.
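A minimal NumPy sketch of this law, assuming the features are normalized image coordinates and each point has a rough depth estimate (interaction_matrix and ibvs_step are illustrative names, not from the original post):

import numpy as np

def interaction_matrix(x, y, Z):
    # 2x6 interaction matrix for one point feature at normalized image
    # coordinates (x, y) with estimated depth Z (camera frame, metres)
    return np.array([
        [-1.0 / Z, 0.0, x / Z, x * y, -(1.0 + x * x), y],
        [0.0, -1.0 / Z, y / Z, 1.0 + y * y, -x * y, -x],
    ])

def ibvs_step(points, desired_points, depths, lambda_=0.5):
    points = np.asarray(points, dtype=float)            # (N, 2) current normalized coords
    desired = np.asarray(desired_points, dtype=float)   # (N, 2) goal normalized coords
    L = np.vstack([interaction_matrix(x, y, Z)
                   for (x, y), Z in zip(points, np.asarray(depths, dtype=float))])
    error = (points - desired).reshape(-1)              # stacked s - s*
    # Camera twist (vx, vy, vz, wx, wy, wz) that drives the image error to zero
    return -lambda_ * np.linalg.pinv(L) @ error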
Strengths: doesn't need to recover 3D pose; robust to calibration error; final image always matches the goal.
Weaknesses: trajectory in 3D is unpredictable (the robot might take strange paths to satisfy the image error); needs an estimate of the depth Z for each feature; can lose features off-screen.
PBVS — control in pose space
Estimate the 3D pose of the target relative to the camera (PnP, fiducial markers, learned 6-DOF pose estimators). Compute the pose error between current and desired. Apply a proportional Cartesian controller:
$$\mathbf{v}_c = \lambda\,\mathbf{e}, \qquad \mathbf{e} = \begin{pmatrix} \mathbf{t} \\ \theta\mathbf{u} \end{pmatrix}$$
The pose error $\mathbf{e}$ is a 6-vector: the 3D translation $\mathbf{t}$ and the 3D axis-angle rotation $\theta\mathbf{u}$ of the desired camera pose expressed in the current camera frame. Commanding a camera twist proportional to it drives the error to zero.
Strengths: trajectory in 3D is straight (predictable); image features can leave the field of view temporarily without breaking the controller; easy to add 6-DOF terms.
Weaknesses: depends on accurate pose estimation; calibration errors propagate to the final pose; the image at the goal won't match the goal image perfectly if pose estimates are biased.
Hybrid 2.5-D approaches
Best of both: control translation in 3D (PBVS-like) and rotation in image-feature space (IBVS-like). Avoids both the unpredictable trajectories of pure IBVS and the calibration-sensitivity of pure PBVS. Malis's 2.5-D visual servoing is the canonical implementation.
Common feature choices
- Fiducial markers (ArUco, AprilTag): cheap, robust, give 6-DOF pose. Most "first visual-servoing demo" projects use these (see the detection sketch after this list).
- Learned features (SuperPoint, ORB): don't need physical markers; harder to track reliably.
- Object pose estimators (FoundationPose, MegaPose): 6-DOF pose for known objects. Powers PBVS without markers.
- Image moments / shape descriptors: classical IBVS uses these for non-point features; less common in modern work.
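For the fiducial-marker route, a minimal detection-plus-PnP sketch, assuming OpenCV ≥ 4.7's cv2.aruco.ArucoDetector API; MARKER_SIZE and marker_pose are illustrative names, and K/dist are your calibrated intrinsics:

import cv2
import numpy as np

MARKER_SIZE = 0.05  # marker edge length in metres (assumption for this sketch)

# Marker corners in the marker's own frame, in ArUco's corner order
# (top-left, top-right, bottom-right, bottom-left)
_corners_3d = 0.5 * MARKER_SIZE * np.array(
    [[-1, 1, 0], [1, 1, 0], [1, -1, 0], [-1, -1, 0]], dtype=np.float64)

_detector = cv2.aruco.ArucoDetector(
    cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50),
    cv2.aruco.DetectorParameters())

def marker_pose(gray, K, dist):
    # Returns the 4x4 pose of the first detected marker in the camera frame, or None
    corners, ids, _ = _detector.detectMarkers(gray)
    if ids is None:
        return None
    ok, rvec, tvec = cv2.solvePnP(_corners_3d, corners[0].reshape(-1, 2), K, dist,
                                  flags=cv2.SOLVEPNP_IPPE_SQUARE)
    if not ok:
        return None
    T = np.eye(4)
    T[:3, :3], _ = cv2.Rodrigues(rvec)
    T[:3, 3] = tvec.flatten()
    return T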
Calibration: the sensitivity
Both IBVS and PBVS need camera intrinsics (well-handled by OpenCV calibration) and the eye-in-hand transform from camera to end-effector (a hand-eye calibration step). Errors in either propagate:
- Intrinsic error → biased depth estimates → IBVS instability when objects are close.
- Hand-eye error → systematic offset in the goal pose for PBVS.
Hand-eye calibration is its own subroutine: collect ~20 arm poses while the wrist camera views a fixed calibration target (or, for eye-to-hand, while the arm holds the marker), then solve the AX=XB problem. cv2.calibrateHandEye wraps it.
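A hedged sketch of that call (calibrate_eye_in_hand is an illustrative wrapper; the pose lists come from the ~20 collected samples):

import cv2

def calibrate_eye_in_hand(R_gripper2base, t_gripper2base, R_target2cam, t_target2cam):
    # Each argument: a list of rotations (3x3) / translations (3,), one per sample.
    # Gripper poses come from forward kinematics; target poses come from the
    # camera's marker/board detection at the same instants.
    R_cam2gripper, t_cam2gripper = cv2.calibrateHandEye(
        R_gripper2base, t_gripper2base, R_target2cam, t_target2cam,
        method=cv2.CALIB_HAND_EYE_TSAI)
    return R_cam2gripper, t_cam2gripper  # the eye-in-hand transform the controllers need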
The 30-line PBVS implementation
import cv2
import numpy as np

def pbvs_step(camera_pose_in_target, desired_camera_pose_in_target, R_base_camera, lambda_=0.5):
    # Both pose inputs are 4x4 homogeneous transforms expressed in the target frame.
    # error_T is the desired camera frame expressed in the current camera frame.
    error_T = np.linalg.inv(camera_pose_in_target) @ desired_camera_pose_in_target
    t_error = error_T[:3, 3]                          # translation still to cover
    rotvec_error, _ = cv2.Rodrigues(error_T[:3, :3])  # axis-angle rotation still to cover
    # Proportional control law: camera-frame twist toward the goal
    v_camera = lambda_ * np.concatenate([t_error, rotvec_error.flatten()])
    # Rotate the linear and angular parts into the base frame and feed to the arm
    # (R_base_camera comes from hand-eye calibration + forward kinematics)
    v_base = np.concatenate([R_base_camera @ v_camera[:3], R_base_camera @ v_camera[3:]])
    return v_base
That's the entire control law. Run it at 30+ Hz and the arm tracks the target smoothly.
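To make the plumbing explicit, a sketch of the outer loop, assuming the marker_pose sketch above and the pbvs_step function; get_frame, send_twist, K, dist, and R_base_camera are placeholders for whatever your camera and arm APIs provide:

import time
import numpy as np

def pbvs_loop(get_frame, send_twist, K, dist, R_base_camera, rate_hz=30):
    # get_frame() -> grayscale image; send_twist(v) -> Cartesian velocity command.
    T_desired = np.eye(4)                            # desired camera pose in the marker frame
    T_desired[:3, :3] = np.diag([1.0, -1.0, -1.0])   # camera z looks back at the marker
    T_desired[2, 3] = 0.30                           # hold 30 cm in front of the marker
    while True:
        T_cam_target = marker_pose(get_frame(), K, dist)
        if T_cam_target is not None:
            T_target_cam = np.linalg.inv(T_cam_target)   # camera pose in the marker frame
            send_twist(pbvs_step(T_target_cam, T_desired, R_base_camera))
        time.sleep(1.0 / rate_hz)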
When neither classical approach is the right tool
- End-to-end learned policies (VLAs, diffusion policies). They take pixels as input and output joint commands; the visual servoing is implicit. They beat classical methods on tasks with clutter, varying object instances, or vague goals (e.g., "place the cup somewhere safe").
- Tasks with no visible goal — assembly inside a hole, in occluded scenes. F/T-based control is more reliable.
- Tasks needing absolute position — visual servoing tracks a relative target; if you need the end-effector at world coordinate (x=1.234, y=0.567), use global localization first.
Where visual servoing still wins
- Marker-based tracking with strict tolerances: surgery, high-precision assembly, bin-picking with markers.
- Calibration-rich industrial cells: when the camera, arm, and parts are all calibrated to sub-millimeter, classical visual servoing outperforms any learned policy.
- Real-time control loops on edge hardware: a 100 Hz IBVS loop on a Jetson is feasible; running a VLA at 30 Hz isn't.
Modern hybrid: classical front-end, learned back-end
2026 production stacks often combine:
- Learned 6-DOF object pose estimator (e.g. FoundationPose) outputs the target pose at 10 Hz.
- Classical PBVS controller runs at 100 Hz, smoothing the pose track.
- Fall-back to F/T-based contact control once the gripper engages.
Each stage uses the right tool: deep learning where it has the data, classical control where it has the structure.
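A sketch of that rate split, reusing the pbvs_step function from above; estimate_pose, send_twist, gripper_engaged, and contact_control are placeholders for the learned front-end, the arm interface, and the contact hand-off:

import time

def hybrid_servo(estimate_pose, send_twist, gripper_engaged, contact_control,
                 R_base_camera, T_desired, estimator_hz=10, control_hz=100):
    # estimate_pose(): slow learned 6-DOF pose front-end (camera pose in the target frame).
    # send_twist(v): fast Cartesian velocity interface.
    T_track = None
    next_estimate = 0.0
    while not gripper_engaged():
        now = time.monotonic()
        if now >= next_estimate:                 # ~10 Hz: refresh the pose track
            T_track = estimate_pose()
            next_estimate = now + 1.0 / estimator_hz
        if T_track is not None:                  # ~100 Hz: classical PBVS on the cached pose
            send_twist(pbvs_step(T_track, T_desired, R_base_camera))
        time.sleep(1.0 / control_hz)
    contact_control()                            # fall back to F/T-based contact control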
Exercise
In a sim with a 6-DOF arm and a marker, implement PBVS using ArUco. Goal: hold a fixed pose 30 cm in front of the marker as the marker moves. The implementation is ~50 lines including marker detection. Then try IBVS on the four marker corners. Watch the trajectories: PBVS goes straight; IBVS curves. Both converge.
Next
Lyapunov stability — the energy-function tool that lets you prove your controller actually converges, instead of hoping.