3D Vision · #13 of 16

Stereo Vision & Depth

Seeing in 3D

Diagram of a Wheatstone stereogram: two eyes (LE, RE) viewing a pair of drawings, with rays converging on a fixation point F and a more distant point X, marking the convergence and accommodation distances. — Charles Wheatstone's stereogram: two flat drawings, one per eye, fuse into a single solid scene with real depth. — Jodi.Krol, CC BY-SA 4.0

Hold a finger up at arm's length. Close your left eye, then your right, back and forth. The finger jumps sideways against the wall behind it. Now do the same with a tree across the street: it barely moves at all.

You just measured depth. Not with a tape measure, not with sonar, but with two slightly different views of the same world and a brain that knows how to read the difference between them. Each eye sees from a vantage point a few centimeters from the other, and that tiny offset is enough.

Two views of one scene, plus a little geometry, is all it takes to turn flatness back into depth.

A single photograph is a one-way street. The world has three dimensions, the sensor has two, and the act of projection throws the third one away: every point along a ray from the lens lands on the same pixel, so the camera can never tell which one you meant. The cure is parallax, the apparent shift of a point when you move your viewpoint. Give a system two eyes, separated by a known distance, and the near things shift more than the far things. Measuring that shift, and reasoning backward through the geometry of the two cameras, lets you recover the depth the single image lost. This is stereo vision, and biology and computers solve it the same way.

Drag the baseline slider (how far apart the two cameras sit) and the disparity slider in the sim below, and watch the recovered depth Z respond. Look for the inverse relationship: as the same point's left-right shift grows, the depth it implies shrinks, and vice versa.

Baseline (b) = 80px Focal Length (f) = 60px

Actual Depth (Z) = 150px

What the sim is quietly demonstrating is the single equation that powers all of stereo: depth is not measured directly, it is inferred from how far a point appears to move between the two views. A big jump means close. A whisper of a shift means far. Get the cameras' separation and focal length right, and that shift becomes a ruler.

Disparity is a ruler in disguise

Set up two cameras side by side, looking forward, separated by a baseline $b$ . A 3D point in the world projects to a horizontal position $x_L$ in the left image and $x_R$ in the right.

Triangulation: the baseline b and the two viewing rays pin point P at depth Z.

The disparity is simply the difference:

d = x_L - x_R

Here $x_L$ is the point's column in the left image, $x_R$ is its column in the right image, and $d$ is the horizontal offset between them, measured in pixels. The governing relationship between that offset and the true depth is:

Z = \frac{f \, b}{d}

where $Z$ is the depth (distance from the cameras to the point), $f$ is the focal length in pixels, $b$ is the baseline (the physical distance between the two camera centers), and $d$ is the disparity. The shape of this equation is the whole story: $Z$ and $d$ are inversely proportional. Double the disparity and you halve the depth. A point at infinity has zero disparity (it sits at the same pixel in both views); a point right in front of your nose has enormous disparity. That is exactly the finger-versus-tree experiment, written in symbols.

Where disparity comes from: your two eyes

Biology got here first. Your visual system fixates both eyes on a single point, and because the eyes look from slightly different directions, every other point in the scene lands on subtly different spots on the two retinas. That mismatch is binocular disparity, and the brain reads it as depth. The sensation it produces, the felt solidity of the world, has its own name: stereopsis, from the Greek for "solid sight."

Geometric diagram of two eyes (left eye and right eye) fixating, each with a principal direction and a visual direction toward a point P, with the angular disparity marked between them. — Binocular geometry: both eyes aim at a fixation point, but a second point P falls along different visual directions in each eye. That angular difference is disparity. — Jodi.Krol, CC BY-SA 4.0

Crucially, disparity is not the only depth cue your brain uses. Close one eye and the world does not collapse into a flat sheet: you still read depth from monocular cues like relative size (far things look smaller), occlusion (near things hide far things), linear perspective (parallel lines converge), and texture gradients (texture gets finer with distance). Stereo gives you precise depth up close; the monocular cues carry you the rest of the way out to the horizon.

A suburban street scene with cars receding into the distance, trees lining the road, the road narrowing toward a vanishing point. — Monocular depth cues at work in a single flat photo: the road's converging edges (linear perspective), nearer cars looming larger (relative size), and trees overlapping (occlusion). — Brazzit, CC BY-SA 3.0

An animated cityscape where the foreground buildings scroll past quickly while the background skyline drifts slowly, illustrating motion parallax. — Motion parallax: near layers slide fast, far layers slide slow. One moving eye recovers depth the same way two still eyes do. — NuclearDuckie, CC0

From two photos to a depth map

For a computer, the geometry above is the easy part. The hard part is correspondence: for each pixel in the left image, which pixel in the right image is the same physical point? Get that wrong and your disparity is garbage. The trick that makes the search tractable is rectification: you warp both images so that corresponding points always land on the same horizontal row. Then matching a pixel is a one-dimensional hunt along a single line instead of a search over the whole frame.

A geometric triangulation diagram: a scene point A at the top, two camera centers B and D on a horizontal baseline, and the point's projection landing at different offsets E-F and G-H on the two image planes below. — Computer stereo, geometrically: point A projects to different offsets in the left and right image planes (E–F versus G–H). That displacement difference is the disparity the depth equation needs. — Thepigdog, CC BY-SA 3.0

Once images are rectified, the matcher slides a small window along each row and scores how well the patches agree. Block matching does exactly this, fast but noisy. Semi-global matching (SGM) adds a smoothness penalty so neighboring pixels prefer similar depths, killing speckle. Modern systems train convolutional networks on stereo pairs and learn the matching directly. Whatever the method, the output is a disparity map, one disparity value per pixel, which the depth equation converts into a depth map: an image where brightness encodes distance, bright for near and dark for far.

That depth map is the payload. It feeds 3D reconstruction, lets a robot avoid an obstacle, drives the soft background blur of a portrait photo, and tells an AR headset which virtual objects should be hidden behind real ones.

Why Z = fb/d falls straight out of similar triangles

Put the two cameras in a rectified, fronto-parallel rig: optical axes parallel, image planes coplanar, separated by baseline $b$ along the $x$ -axis, each with focal length $f$ . Place the origin at the left camera center. A world point sits at depth $Z$ and at horizontal offset $X$ from the left camera.

By the pinhole projection law, the left image coordinate is the focal length scaled by the angle to the point:

x_L = f \, \frac{X}{Z}

The right camera is shifted by $b$ , so it sees the same point at horizontal offset $X - b$ :

x_R = f \, \frac{X - b}{Z}

Subtract to get the disparity, and the unknown $X$ cancels cleanly:

d = x_L - x_R = f \, \frac{X}{Z} - f \, \frac{X - b}{Z} = \frac{f \, b}{Z}

Invert and you have the depth equation in its working form:

Z = \frac{f \, b}{d}

Two consequences fall out immediately. First, the precision of depth degrades with the square of distance: differentiating gives $\,|\partial Z| = \dfrac{f b}{d^2}\,|\partial d| = \dfrac{Z^2}{f b}\,|\partial d|$ , so a one-pixel matching error costs you proportionally far more depth accuracy on distant objects than on near ones. Second, the only knobs you control are $f$ and $b$ : a longer focal length or a wider baseline both buy you sharper depth, which is exactly why long-range stereo rigs splay their cameras far apart.

Key takeaways

Depth is inferred, not measured. A camera cannot read distance off a single image; stereo recovers it from how far a point shifts between two views.
Disparity is inversely proportional to depth: $Z = fb/d$ . Big shift means near, tiny shift means far, zero shift means infinitely far away.
You tune precision with $f$ and $b$ . A wider baseline or longer focal length sharpens far-distance accuracy, at the cost of bigger occlusion zones.
Correspondence is the real work. Rectification reduces matching to a 1D search per row; block matching, SGM, and CNNs trade speed for smoothness and accuracy.
Texture is required. Blank walls, reflections, and occlusions are where stereo produces holes, which is why real systems fuse stereo with monocular and motion cues.

Wheatstone fooled a brain with two drawings and a pair of mirrors, and in doing so revealed that depth was never out there in the world: it was a calculation, run between two slightly disagreeing eyes. Every stereo camera since, from a Mars rover to the lenses on your phone, is running that same calculation. Close one eye, then the other, and watch your finger jump. You are doing geometry, and you have been since before you could count.