Visual servoing: IBVS vs PBVS

Close the control loop on the camera, not on joints. Image-based servoing controls in pixel space; position-based servoing controls in 3D pose. Same goal, different math, different failure modes.

by RobotForge
#control #visual-servoing #perception

Visual servoing closes the control loop on what the camera sees, not on where joint encoders say the arm is. The classical formulations split into two camps: IBVS (image-based, drive the pixel error) and PBVS (position-based, drive the 3D pose error). Both have been around for decades and still work; both are competing with end-to-end learned policies that read pixels and output torques. Here's the math, the trade-offs, and why this matters in 2026.

The setup

The camera is mounted on the arm (eye-in-hand) or fixed in the world (eye-to-hand). The robot tracks a target — a marker, a learned feature, or an object detection. Goal: drive the arm to a desired pose relative to the target.

Two natural ways to express the error:

  • IBVS: error is in the image. Pixels-of-the-target-now vs pixels-of-the-target-when-we're-done. Servo the image directly to the goal image.
  • PBVS: estimate the target's 3D pose from the image, compute the 3D pose error, drive that to zero with a Cartesian controller.

IBVS — control in image space

For each tracked feature point in the image, the rate at which it moves on the image plane depends on the camera's twist (linear + angular velocity) and the point's depth Z. The relationship is a 2×6 image Jacobian (interaction matrix):

$$\dot{s} \;=\; L_s\, v_c, \qquad L_s \;=\; \begin{bmatrix} -1/Z & 0 & x/Z & xy & -(1+x^2) & y \\ 0 & -1/Z & y/Z & 1+y^2 & -xy & -x \end{bmatrix}$$

where $v_c = (v, \omega) \in \mathbb{R}^6$ is the camera twist and $L_s$ is the pre-derived interaction matrix for a point feature with normalized image coordinates $(x, y)$ and depth $Z$.

For multiple feature points, stack the Jacobians and solve for the camera velocity that drives the image error to zero:

$$v_c \;=\; -\lambda\, L_s^{+}\,(s - s^{*})$$

where $s^*$ is the desired image feature vector, $L_s^{+}$ is the Moore-Penrose pseudoinverse of the stacked interaction matrix, and $\lambda > 0$ is a gain. The arm controller receives $v_c$ (mapped from camera frame to base frame) and executes.
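
A minimal sketch of that stacked solve, assuming normalized image coordinates and known per-feature depths (the function name and argument layout are illustrative):

import numpy as np

def ibvs_step(points, desired_points, depths, lambda_=0.5):
    # One IBVS iteration for N point features in normalized image coordinates.
    rows = []
    for (x, y), Z in zip(points, depths):
        # 2x6 interaction matrix for a point feature at depth Z
        rows.append([-1 / Z, 0, x / Z, x * y, -(1 + x**2), y])
        rows.append([0, -1 / Z, y / Z, 1 + y**2, -x * y, -x])
    L = np.array(rows)                                   # (2N, 6) stacked Jacobian
    error = (np.asarray(points) - np.asarray(desired_points)).ravel()  # s - s*
    # Least-squares camera twist that drives the image error to zero
    return -lambda_ * np.linalg.pinv(L) @ error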

Strengths: doesn't need to recover 3D pose; robust to calibration error; final image always matches the goal.

Weaknesses: trajectory in 3D is unpredictable (the robot might take strange paths to satisfy the image error); needs an estimate of the depth Z for each feature; can lose features off-screen.

PBVS — control in pose space

Estimate the 3D pose of the target relative to the camera (PnP, fiducial markers, learned 6-DOF pose estimators). Compute the pose error between current and desired. Apply a proportional Cartesian controller:

$$v_c \;=\; \lambda\, e, \qquad e \;=\; \begin{bmatrix} t \\ \theta u \end{bmatrix}$$

The pose error $e$ is a 6-vector: the 3D translation $t$ and the 3D axis-angle rotation $\theta u$ of the desired camera pose expressed in the current camera frame.

Strengths: trajectory in 3D is straight (predictable); image features can leave the field of view temporarily without breaking the controller; easy to add 6-DOF terms.

Weaknesses: depends on accurate pose estimation; calibration errors propagate to the final pose; the image at the goal won't match the goal image perfectly if pose estimates are biased.

Hybrid 2.5-D approaches

Best of both: control translation in 3D (PBVS-like) and rotation in image-feature space (IBVS-like). Avoids both the unpredictable trajectories of pure IBVS and the calibration-sensitivity of pure PBVS. Malis's 2.5-D visual servoing is the canonical implementation.

Common feature choices

  • Fiducial markers (ArUco, AprilTag): cheap, robust, give 6-DOF pose. Most "first visual-servoing demo" projects use these.
  • Sparse keypoint features (SuperPoint, which is learned; ORB, which is classical): no physical markers needed; harder to track reliably.
  • Object pose estimators (FoundationPose, MegaPose): 6-DOF pose for known objects. Powers PBVS without markers.
  • Image moments / shape descriptors: classical IBVS uses these for non-point features; less common in modern work.

Calibration: the sensitive part

Both IBVS and PBVS need camera intrinsics (well-handled by OpenCV calibration) and the eye-in-hand transform from camera to end-effector (a hand-eye calibration step). Errors in either propagate:

  • Intrinsic error → biased depth estimates → IBVS instability when objects are close.
  • Hand-eye error → systematic offset in the goal pose for PBVS.

Hand-eye calibration is its own subroutine: collect ~20 poses with the arm holding a marker, solve the AX=XB problem. cv2.calibrateHandEye wraps it.
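
A minimal sketch of that step, wrapping OpenCV's solver (the argument names and the Tsai method choice are illustrative):

import cv2

def hand_eye_calibrate(R_g2b, t_g2b, R_t2c, t_t2c):
    # R_g2b/t_g2b: ~20 gripper poses in the base frame (forward kinematics).
    # R_t2c/t_t2c: the matching marker poses in the camera frame (detection).
    # Returns the camera-to-gripper rotation and translation (the X in AX=XB).
    return cv2.calibrateHandEye(R_g2b, t_g2b, R_t2c, t_t2c,
                                method=cv2.CALIB_HAND_EYE_TSAI)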

The PBVS implementation in ~20 lines

import cv2
import numpy as np

def pbvs_step(camera_pose_in_target, desired_camera_pose_in_target,
              R_base_camera, lambda_=0.5):
    # Pose error: the desired camera pose expressed in the current camera frame
    error_T = np.linalg.inv(camera_pose_in_target) @ desired_camera_pose_in_target
    t_error = error_T[:3, 3]
    rotvec_error, _ = cv2.Rodrigues(error_T[:3, :3])

    # Proportional control law: camera twist (linear + angular velocity)
    # that drives the pose error to zero
    v_camera = lambda_ * np.concatenate([t_error, rotvec_error.ravel()])

    # Rotate both halves of the twist from camera frame into base frame
    # (camera-to-end-effector lever-arm term omitted for brevity)
    return np.concatenate([R_base_camera @ v_camera[:3],
                           R_base_camera @ v_camera[3:]])

That's the entire control law. Run it at 30+ Hz with fresh pose estimates and the arm tracks the target smoothly.
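
A hypothetical one-tick usage, with the detector output and hand-eye rotation replaced by stand-in values (goal: hold a pose 30 cm in front of the marker, looking at it):

import numpy as np

T_target_in_cam = np.eye(4); T_target_in_cam[2, 3] = 0.5   # stand-in detector output
R_base_camera = np.eye(3)                                  # stand-in hand-eye rotation

goal = np.eye(4)                           # desired camera pose in the target frame
goal[:3, :3] = np.diag([1.0, -1.0, -1.0])  # camera z-axis looks back at the marker
goal[2, 3] = 0.30                          # 30 cm in front of the marker

v = pbvs_step(np.linalg.inv(T_target_in_cam), goal, R_base_camera)
# v is a 6-vector twist (vx, vy, vz, wx, wy, wz) for the arm's velocity interface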

When neither classical approach is the right tool

  • End-to-end learned policies such as vision-language-action models (VLAs) and diffusion policies. They take pixels as input and output joint commands; the visual servoing is implicit. They beat classical methods on tasks with clutter, varying object instances, or vague goals (e.g., "place the cup somewhere safe").
  • Tasks with no visible goal, such as peg-in-hole assembly or occluded scenes. Force/torque (F/T) based control is more reliable there.
  • Tasks needing absolute position. Visual servoing tracks a relative target; if you need to be at world coordinate (x=1.234, y=0.567), use global localization first.

Where visual servoing still wins

  • Marker-based tracking with strict tolerances: surgery, high-precision assembly, bin-picking with markers.
  • Calibration-rich industrial cells: when the camera, arm, and parts are all calibrated to sub-millimeter, classical visual servoing outperforms any learned policy.
  • Real-time control loops on edge hardware: a 100 Hz IBVS loop on a Jetson is feasible; running a VLA at 30 Hz isn't.

Modern hybrid: classical front-end, learned back-end

2026 production stacks often combine:

  • Learned 6-DOF object pose estimator (e.g. FoundationPose) outputs the target pose at 10 Hz.
  • Classical PBVS controller runs at 100 Hz, smoothing the pose track.
  • Fall back to F/T-based contact control once the gripper engages.

Each stage uses the right tool: deep learning where it has the data, classical control where it has the structure.
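
A minimal sketch of that multi-rate loop, with every interface (estimate_pose, camera, arm) as a hypothetical stand-in and pbvs_step as defined above:

import time
import numpy as np

def hybrid_servo_loop(camera, arm, estimate_pose, goal_pose, R_base_camera):
    # Hypothetical multi-rate loop: slow learned pose estimates, fast PBVS.
    latest_pose = None          # most recent target-in-camera estimate (4x4)
    last_estimate = 0.0
    while not arm.in_contact():              # hand off to F/T control on contact
        now = time.monotonic()
        if now - last_estimate >= 0.1:       # ~10 Hz: learned pose estimator
            latest_pose = estimate_pose(camera.frame())
            last_estimate = now
        if latest_pose is not None:          # ~100 Hz: classical PBVS smoothing
            v = pbvs_step(np.linalg.inv(latest_pose), goal_pose, R_base_camera)
            arm.command_cartesian_velocity(v)
        time.sleep(0.01)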

Exercise

In a sim with a 6-DOF arm and a marker, implement PBVS using ArUco. Goal: hold a fixed pose 30 cm in front of the marker as the marker moves. The implementation is ~50 lines including marker detection. Then try IBVS on the four marker corners. Watch the trajectories: PBVS goes straight; IBVS curves. Both converge.
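
A sketch of the marker-detection front-end for that exercise, assuming OpenCV's newer ArucoDetector API; the dictionary choice and marker size are illustrative:

import cv2
import numpy as np

MARKER_SIZE = 0.05  # meters, edge length of the printed marker (adjust to yours)
# Marker corners in the target frame, in ArUco's corner order
# (top-left, top-right, bottom-right, bottom-left)
OBJ_PTS = MARKER_SIZE / 2 * np.array(
    [[-1, 1, 0], [1, 1, 0], [1, -1, 0], [-1, -1, 0]], dtype=np.float64)

detector = cv2.aruco.ArucoDetector(
    cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50))

def marker_pose(gray, K, dist):
    # Return the 4x4 target-in-camera transform, or None if no marker is seen.
    corners, ids, _ = detector.detectMarkers(gray)
    if ids is None:
        return None
    ok, rvec, tvec = cv2.solvePnP(OBJ_PTS, corners[0].reshape(4, 2), K, dist)
    if not ok:
        return None
    T = np.eye(4)
    T[:3, :3], _ = cv2.Rodrigues(rvec)
    T[:3, 3] = tvec.ravel()
    return T  # feed np.linalg.inv(T) (camera-in-target) into pbvs_step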

Next

Lyapunov stability — the energy-function tool that lets you prove your controller actually converges, instead of hoping.
