3D Vision · #14 of 16

Epipolar Geometry

The Geometry of Two Views

Two cameras viewing the same 3D point: the point, the two camera centers, and the two image planes, with epipolar lines and epipoles marked. — Two cameras, one point. The 3D point, the two optical centers, and the line joining them carve a triangle into space, and that triangle is what makes stereo possible. — Arne Nordmann (norro), CC BY-SA 3.0

Close one eye. Hold a finger at arm's length and look at it against a far wall. Now switch eyes. The finger jumps. The wall barely moves.

Your brain just performed a calculation that took computer vision researchers most of the twentieth century to write down. Two views of the same world are not independent: they are locked together by the rigid geometry of the cameras that took them.

Given a point in one image, you do not have to search the whole second image for its twin. You only have to search a single line.

When two cameras (or two eyes) look at the same scene, the world they share leaves a fingerprint on the pair of pictures. That fingerprint is epipolar geometry: a set of constraints that say exactly where a feature seen in one view is allowed to appear in the other. It is the difference between hunting for a match across a million pixels and gliding along a single ruled line. Everything in this chapter (stereo depth, structure from motion, the way SLAM stitches a camera's path together) leans on this one idea.

Drag the point in the left image below and watch its epipolar line sweep across the right view. Look for the way the line pivots: notice that no matter where you put the point, the family of lines on the right all fan out from one fixed spot. That spot has a name, and finding it is half the battle.

[Interactive demo: drag point p in the left image and the right image draws its epipolar line l′, with a live readout of the constraint residual p′ᵀ·F·p.]

What you just saw is the epipolar constraint made visible. The match for your left-image point can be anywhere along that right-image line, but it cannot be off it. A two-dimensional search just collapsed into a one-dimensional one. The fan-out point is the epipole: the image of the other camera, seen from this one.

The whole relationship is captured by a single small matrix and a single tidy equation. If $\mathbf{x}$ is a point in the left image and $\mathbf{x}'$ is its match in the right (both written in homogeneous coordinates, so a pixel $(u, v)$ becomes the column $[u, v, 1]^\top$ ), then

$\mathbf{x}'^{\top} \, \mathbf{F} \, \mathbf{x} = 0$

Here $\mathbf{F}$ is the fundamental matrix, a $3 \times 3$ array of numbers. The symbol $\mathbf{x}$ is the homogeneous pixel in the first image; $\mathbf{x}'$ is the candidate match in the second; the superscript $\top$ means "transpose" (turn the column into a row so the multiplication lines up); and the right-hand side being exactly $0$ is the constraint itself. The product $\mathbf{F}\mathbf{x}$ is not a number, it is a line: the epipolar line in the right image that $\mathbf{x}'$ must lie on. The equation simply says "the matching point sits on its line."
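To make "the product is a line" concrete, here is a minimal NumPy sketch. The matrix `F_toy` is an assumption chosen for readability: it corresponds to two cameras that differ only by a sideways shift, with identity intrinsics. The helper names `epipolar_line`, `residual`, and `distance_to_line` are illustrative, not from any library.

```python
import numpy as np

def epipolar_line(F, pixel):
    """Line l' = F x in the second image for a pixel (u, v) in the first."""
    x = np.array([pixel[0], pixel[1], 1.0])   # homogeneous coordinates
    return F @ x                               # (a, b, c) with a*u' + b*v' + c = 0

def residual(F, pixel1, pixel2):
    """Value of x'^T F x; zero means pixel2 sits exactly on the epipolar line."""
    x = np.array([pixel1[0], pixel1[1], 1.0])
    x_prime = np.array([pixel2[0], pixel2[1], 1.0])
    return float(x_prime @ F @ x)

def distance_to_line(line, pixel):
    """Perpendicular pixel distance from a point to the line (a, b, c)."""
    a, b, c = line
    return abs(a * pixel[0] + b * pixel[1] + c) / np.hypot(a, b)

# Toy fundamental matrix (assumed): pure sideways camera shift, identity intrinsics.
F_toy = np.array([[0.0, 0.0, 0.0],
                  [0.0, 0.0, -1.0],
                  [0.0, 1.0, 0.0]])

line = epipolar_line(F_toy, (120.0, 80.0))
print("epipolar line:", line)
print("on the line :", residual(F_toy, (120.0, 80.0), (95.0, 80.0)))
print("off the line:", distance_to_line(line, (95.0, 83.0)), "pixels away")
```

For this toy matrix the line for pixel (120, 80) is (0, −1, 80), which is the row v′ = 80. A candidate on that row gives a zero residual, while one three rows lower sits three pixels off the line.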

The epipole, the baseline, and why a line appears

Picture the two optical centers, $C$ and $C'$ . The straight segment between them is the baseline. Now take any 3D point $P$ that both cameras see. The three points $P$ , $C$ , and $C'$ are not collinear (unless $P$ sits on the baseline itself), so they define a plane: the epipolar plane. That plane slices through each image, and the cut it leaves behind is the epipolar line.

The match for p must lie on its epipolar line l′ — searching collapses from the whole image to one line.

The epipole is where the baseline pierces each image plane. In the left image, the epipole $e$ is literally where you would see the right camera if it were a glowing dot in your scene; in the right image, $e'$ is where you would see the left camera. Every epipolar line in an image passes through that image's epipole, which is exactly the fan-out you saw in the sim. For a stereo rig with the two cameras side by side and perfectly parallel, the epipoles slide off to infinity and the epipolar lines become horizontal scanlines, which is why rectified stereo pairs can be matched row by row.

The fundamental matrix: seven numbers that hold two cameras together

The fundamental matrix $\mathbf{F}$ is the algebraic heart of all this. It is the $3 \times 3$ matrix, unique up to an overall scale factor, such that $\mathbf{x}'^{\top} \mathbf{F} \mathbf{x} = 0$ for every pair of corresponding points. Two facts about it are worth tattooing somewhere:

It has rank 2, so its determinant is zero, $\det(\mathbf{F}) = 0$ . This is not a numerical accident. It is what forces all the epipolar lines to meet at the epipole instead of scattering. The epipole in the first image is precisely the null vector of $\mathbf{F}$ : the direction $\mathbf{e}$ for which $\mathbf{F}\mathbf{e} = \mathbf{0}$ . The epipole in the second image plays the same role for the transpose: $\mathbf{F}^{\top}\mathbf{e}' = \mathbf{0}$ .
It has only 7 degrees of freedom. A $3 \times 3$ matrix has nine entries, but $\mathbf{F}$ is defined only up to an overall scale (multiply it by 5 and the equation $\mathbf{x}'^{\top}\mathbf{F}\mathbf{x} = 0$ is unchanged), which removes one, and the rank-2 constraint removes another. Seven numbers. As Wikipedia puts it bluntly, those seven parameters "represent the only geometric information about cameras that can be obtained through point correspondences alone." You cannot squeeze more out of raw matches than that.
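Both facts are easy to poke at numerically. The sketch below is built on a synthetic matrix, not real cameras: it takes a random $3 \times 3$ matrix and zeroes its smallest singular value, which yields a rank-2 matrix that serves as a stand-in fundamental matrix. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for F: a random matrix forced to rank 2 (an assumption for the demo).
U, S, Vt = np.linalg.svd(rng.normal(size=(3, 3)))
F = U @ np.diag([S[0], S[1], 0.0]) @ Vt

print("rank:", np.linalg.matrix_rank(F, tol=1e-9))
print("det :", np.linalg.det(F))

# The epipoles are the null vectors, read off the SVD of F.
U, S, Vt = np.linalg.svd(F)
e = Vt[-1]          # F e = 0    (epipole in the first image, homogeneous)
e_prime = U[:, -1]  # F^T e' = 0 (epipole in the second image, homogeneous)

# Fan-out: the epipolar line of every first-image pixel passes through e'.
pixels = np.column_stack([rng.uniform(0, 640, 5),
                          rng.uniform(0, 480, 5),
                          np.ones(5)])
lines = pixels @ F.T            # row i is F @ pixels[i]
incidence = lines @ e_prime     # zero when e' lies on the line
print("e' on each line:", incidence)

# Scale freedom: a point on the line satisfies the constraint for F and for 5 F alike.
x = pixels[0]
x_prime = np.cross(F @ x, [1.0, 0.0, 0.0])   # some homogeneous point on the line F x
print("residual with F  :", x_prime @ F @ x)
print("residual with 5 F:", x_prime @ (5.0 * F) @ x)
```

The printed determinant and incidences should sit at floating-point noise level, which is the numerical face of "all lines meet at the epipole."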

A diagram of epipolar geometry showing two camera centers, two image planes, a world point, the epipolar plane, the two epipoles, and the epipolar lines connecting them. — The full cast: two optical centers, the world point P, the epipolar plane they span, and the two epipolar lines it cuts into the image planes. The lines meet the baseline at the epipoles. — ZooFari, Public domain

Computing F from matches: the eight-point algorithm

So how do you actually find these seven numbers from a real pair of photographs? You collect point correspondences and let the constraint do the work. Each matching pair $(\mathbf{x}, \mathbf{x}')$ gives you one linear equation in the entries of $\mathbf{F}$ . Stack up enough of them and you can solve.

The classic recipe is the eight-point algorithm:

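Find eight or more corresponding point pairs across the two images.
Normalize each image's points: shift their centroid to the origin and scale them so the mean distance from it is $\sqrt{2}$ (Hartley, 1997). Raw pixel coordinates in the hundreds make the linear system badly conditioned, and this step is what makes the estimate accurate.
Turn each pair into one row of a linear system built from $\mathbf{x}'^{\top}\mathbf{F}\mathbf{x} = 0$ .
Solve that system with the singular value decomposition (SVD) to get a least-squares estimate of $\mathbf{F}$ .
The raw estimate will not be exactly rank 2, so project it back onto the nearest rank-2 matrix (zero out the smallest singular value). This step is what guarantees a clean, single epipole. Finally, undo the normalization so that $\mathbf{F}$ applies to the original pixel coordinates.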
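The recipe above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production estimator: it includes Hartley's coordinate normalization (the step the takeaways credit for accuracy), and it is exercised on synthetic, noise-free matches generated from an assumed camera pair. The function names and the values of `K`, `R`, and `t` are hypothetical.

```python
import numpy as np

def normalize_points(pts):
    """Shift the centroid to the origin and scale the mean distance to sqrt(2)."""
    centroid = pts.mean(axis=0)
    scale = np.sqrt(2.0) / np.linalg.norm(pts - centroid, axis=1).mean()
    T = np.array([[scale, 0.0, -scale * centroid[0]],
                  [0.0, scale, -scale * centroid[1]],
                  [0.0, 0.0, 1.0]])
    homogeneous = np.column_stack([pts, np.ones(len(pts))])
    return homogeneous @ T.T, T

def eight_point(pts1, pts2):
    """Estimate F (x2^T F x1 = 0) from N >= 8 matches given as N x 2 pixel arrays."""
    x1, T1 = normalize_points(pts1)
    x2, T2 = normalize_points(pts2)
    # One row per match, ordered to line up with F flattened row by row.
    A = np.column_stack([x2[:, [i]] * x1 for i in range(3)])
    # Least-squares solution: the right singular vector of the smallest singular value.
    F = np.linalg.svd(A)[2][-1].reshape(3, 3)
    # Enforce rank 2 by zeroing the smallest singular value.
    U, S, Vt = np.linalg.svd(F)
    F = U @ np.diag([S[0], S[1], 0.0]) @ Vt
    # Undo the normalization so F applies to the original pixel coordinates.
    F = T2.T @ F @ T1
    return F / np.linalg.norm(F)

# Synthetic, noise-free matches from an assumed camera pair (values are made up).
rng = np.random.default_rng(1)
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
a = 0.1
R = np.array([[np.cos(a), 0.0, np.sin(a)],
              [0.0, 1.0, 0.0],
              [-np.sin(a), 0.0, np.cos(a)]])
t = np.array([-1.0, 0.1, 0.2])
X = np.column_stack([rng.uniform(-1.5, 1.5, 20),
                     rng.uniform(-1.0, 1.0, 20),
                     rng.uniform(4.0, 8.0, 20)])

def project(points_cam):
    pix = points_cam @ K.T
    return pix[:, :2] / pix[:, 2:]

pts1 = project(X)             # first camera at the origin
pts2 = project(X @ R.T + t)   # second camera at pose (R, t)

F = eight_point(pts1, pts2)
h1 = np.column_stack([pts1, np.ones(len(pts1))])
h2 = np.column_stack([pts2, np.ones(len(pts2))])
residuals = np.abs(np.sum(h2 * (h1 @ F.T), axis=1))
print("largest |x'^T F x| over the matches:", residuals.max())
```

With perfect matches the residuals collapse to rounding error. With real, noisy matches they will not, which is where the robust wrapper comes in.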

In the real world, some of your "matches" are wrong, so you wrap the whole thing in RANSAC, which repeatedly fits $\mathbf{F}$ on small random subsets and keeps the fit that the most matches agree with. The mismatches, the outliers, are points that badly violate the epipolar constraint, so epipolar geometry doubles as an outlier detector: anything far off its predicted line is a bad correspondence.
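A bare-bones version of that RANSAC loop might look like the following. It is a sketch under stated assumptions: the inlier test is the pixel distance from each second-image point to its predicted epipolar line, the threshold and iteration count are arbitrary defaults, and `fit_fundamental` is a compact normalized eight-point solver included only so the block runs on its own. The synthetic data corrupts the last ten of forty matches on purpose.

```python
import numpy as np

def to_homogeneous(pts):
    return np.column_stack([pts, np.ones(len(pts))])

def fit_fundamental(pts1, pts2):
    """Compact normalized eight-point fit (x2^T F x1 = 0), rank 2 enforced."""
    def normalize(pts):
        c = pts.mean(axis=0)
        s = np.sqrt(2.0) / np.linalg.norm(pts - c, axis=1).mean()
        T = np.array([[s, 0.0, -s * c[0]], [0.0, s, -s * c[1]], [0.0, 0.0, 1.0]])
        return to_homogeneous(pts) @ T.T, T
    x1, T1 = normalize(pts1)
    x2, T2 = normalize(pts2)
    A = np.column_stack([x2[:, [i]] * x1 for i in range(3)])
    U, S, Vt = np.linalg.svd(np.linalg.svd(A)[2][-1].reshape(3, 3))
    return T2.T @ (U @ np.diag([S[0], S[1], 0.0]) @ Vt) @ T1

def line_distances(F, pts1, pts2):
    """Pixel distance from each image-2 point to the epipolar line F x of its partner."""
    lines = to_homogeneous(pts1) @ F.T
    numerators = np.abs(np.sum(to_homogeneous(pts2) * lines, axis=1))
    return numerators / np.hypot(lines[:, 0], lines[:, 1])

def ransac_fundamental(pts1, pts2, threshold=0.5, iterations=500, seed=0):
    rng = np.random.default_rng(seed)
    best = np.zeros(len(pts1), dtype=bool)
    for _ in range(iterations):
        sample = rng.choice(len(pts1), size=8, replace=False)   # small random subset
        F = fit_fundamental(pts1[sample], pts2[sample])
        inliers = line_distances(F, pts1, pts2) < threshold      # who agrees with this fit?
        if inliers.sum() > best.sum():
            best = inliers
    if best.sum() < 8:
        raise ValueError("RANSAC found no consensus set of at least 8 matches")
    return fit_fundamental(pts1[best], pts2[best]), best         # refit on the consensus set

# Synthetic matches from an assumed camera pair; the last 10 are replaced by random pixels.
rng = np.random.default_rng(2)
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
a = 0.1
R = np.array([[np.cos(a), 0.0, np.sin(a)], [0.0, 1.0, 0.0], [-np.sin(a), 0.0, np.cos(a)]])
t = np.array([-1.0, 0.1, 0.2])
X = np.column_stack([rng.uniform(-1.5, 1.5, 40), rng.uniform(-1.0, 1.0, 40),
                     rng.uniform(4.0, 8.0, 40)])
project = lambda P: (P @ K.T)[:, :2] / (P @ K.T)[:, 2:]
pts1, pts2 = project(X), project(X @ R.T + t)
pts2[30:] = rng.uniform([0.0, 0.0], [640.0, 480.0], size=(10, 2))

F, inliers = ransac_fundamental(pts1, pts2)
print("kept", inliers[:30].sum(), "of 30 good matches and",
      inliers[30:].sum(), "of 10 corrupted ones")
```

A corrupted match can still land near its predicted line by chance, so the epipolar test is a strong filter but not a perfect one.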

Much of the modern treatment of all this traces back to the British computer scientist Andrew Zisserman (b. 1957), at Oxford and later DeepMind. His textbook with Richard Hartley, Multiple View Geometry in Computer Vision, became the canonical reference for epipolar geometry, the fundamental matrix, and the algorithms in this chapter.

The essential matrix: F's calibrated cousin

If you happen to know the cameras' internal calibration (the focal lengths and principal points packed into the intrinsic matrices $\mathbf{K}$ and $\mathbf{K}'$ ), you can strip the lens out of the picture and work in clean, normalized coordinates. What is left is the essential matrix $\mathbf{E}$ :

$\mathbf{E} = (\mathbf{K}')^{\top} \, \mathbf{F} \, \mathbf{K}$

where $\mathbf{F}$ is the fundamental matrix you just computed and $\mathbf{K}, \mathbf{K}'$ are the two intrinsic calibration matrices. The essential matrix is the purer object: it encodes only the rotation $\mathbf{R}$ and translation $\mathbf{t}$ between the two camera poses, with the focal lengths, principal points, and pixel scaling already divided out. (Lens distortion is not part of $\mathbf{K}$ and has to be corrected separately.) It factors as

$\mathbf{E} = [\mathbf{t}]_\times \, \mathbf{R}$

Here $\mathbf{R}$ is the $3 \times 3$ rotation from one camera frame to the other, $\mathbf{t}$ is the translation between their centers, and $[\mathbf{t}]_\times$ is the skew-symmetric matrix that turns a cross product into a matrix multiply (so that $[\mathbf{t}]_\times \mathbf{v} = \mathbf{t} \times \mathbf{v}$ for any vector $\mathbf{v}$ ). Because $\mathbf{E}$ knows about $\mathbf{R}$ and $\mathbf{t}$ directly, you can decompose it to recover the camera's motion, which is the seed of structure from motion and visual SLAM. The translation comes back only up to an unknown scale, because enlarging the scene and the baseline together leaves both images unchanged.
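One common way to carry out that decomposition is through the SVD of $\mathbf{E}$, which yields four candidate motions. The sketch below assumes made-up values for `K`, `R`, and `t`, uses one shared intrinsic matrix for both cameras for brevity, and builds $\mathbf{E}$ through the relation $\mathbf{E} = (\mathbf{K}')^{\top}\mathbf{F}\mathbf{K}$ from above. The names `skew` and `decompose_essential` are illustrative.

```python
import numpy as np

def skew(t):
    """Matrix [t]_x such that skew(t) @ v equals np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def decompose_essential(E):
    """Four (R, t) candidates with E proportional to [t]_x R; t has unit length."""
    U, _, Vt = np.linalg.svd(E)
    # Flip signs where needed so both factors are proper rotations (det = +1).
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0]])
    R_a = U @ W @ Vt
    R_b = U @ W.T @ Vt
    t_hat = U[:, 2]   # translation direction; its length cannot be recovered
    return [(R_a, t_hat), (R_a, -t_hat), (R_b, t_hat), (R_b, -t_hat)]

# Assumed ground truth, used to build F and then checked against the candidates.
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
a = 0.1
R = np.array([[np.cos(a), 0.0, np.sin(a)],
              [0.0, 1.0, 0.0],
              [-np.sin(a), 0.0, np.cos(a)]])
t = np.array([-1.0, 0.1, 0.2])

F = np.linalg.inv(K).T @ skew(t) @ R @ np.linalg.inv(K)
E = K.T @ F @ K                     # E = K'^T F K with K' = K here

candidates = decompose_essential(E)
t_unit = t / np.linalg.norm(t)
for R_c, t_c in candidates:
    match = np.allclose(R_c, R, atol=1e-8) and np.allclose(t_c, t_unit, atol=1e-8)
    print("candidate matches the true motion:", match)
```

Only one of the four candidates places the scene in front of both cameras. Picking it in practice means triangulating at least one match and checking that its depth is positive in both views, a step left out of this sketch.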

The two matrices divide the labor neatly:

| | Fundamental $\mathbf{F}$ | Essential $\mathbf{E}$ |
|---|---|---|
| Calibration needed | No | Yes ( $\mathbf{K}$ , $\mathbf{K}'$ known) |
| Encodes | $\mathbf{K}$ , $\mathbf{R}$ , $\mathbf{t}$ combined | Just $\mathbf{R}$ , $\mathbf{t}$ |
| Degrees of freedom | 7 | 5 |
| Works in | Pixel coordinates | Normalized coordinates |

For the advanced reader → Why x'ᵀFx = 0 falls straight out of the projection equations

Start from the calibrated case. Put the first camera at the origin so its projection is $\mathbf{x} \sim \mathbf{K}[\,\mathbf{I} \mid \mathbf{0}\,]\mathbf{X}$ , and the second at pose $(\mathbf{R}, \mathbf{t})$ so $\mathbf{x}' \sim \mathbf{K}'[\,\mathbf{R} \mid \mathbf{t}\,]\mathbf{X}$ . Work in normalized coordinates $\hat{\mathbf{x}} = \mathbf{K}^{-1}\mathbf{x}$ and $\hat{\mathbf{x}}' = \mathbf{K}'^{-1}\mathbf{x}'$ , so the intrinsics drop away. Expressed in the second camera's frame, the ray to the point from the first center has direction $\mathbf{R}\hat{\mathbf{x}}$ , the ray from the second center has direction $\hat{\mathbf{x}}'$ , and the first center itself sits at $\mathbf{t}$ .

The coplanarity condition is the geometric core: the ray to the point from the first camera, the ray from the second, and the baseline $\mathbf{t}$ all lie in one plane (the epipolar plane). Three coplanar vectors have a vanishing scalar triple product:

$\hat{\mathbf{x}}'^{\top} \, \big( \mathbf{t} \times (\mathbf{R}\,\hat{\mathbf{x}}) \big) = 0$

Rewrite the cross product as a matrix: $\mathbf{t} \times \mathbf{v} = [\mathbf{t}]_\times \mathbf{v}$ , where

$[\mathbf{t}]_\times = \begin{bmatrix} 0 & -t_3 & t_2 \\ t_3 & 0 & -t_1 \\ -t_2 & t_1 & 0 \end{bmatrix}$

Substituting gives $\hat{\mathbf{x}}'^{\top} [\mathbf{t}]_\times \mathbf{R}\, \hat{\mathbf{x}} = 0$ , and defining the essential matrix as $\mathbf{E} = [\mathbf{t}]_\times \mathbf{R}$ collapses this to

$\hat{\mathbf{x}}'^{\top} \mathbf{E}\, \hat{\mathbf{x}} = 0$

Now undo the normalization with $\hat{\mathbf{x}} = \mathbf{K}^{-1}\mathbf{x}$ and $\hat{\mathbf{x}}' = \mathbf{K}'^{-1}\mathbf{x}'$ :

$\mathbf{x}'^{\top} \mathbf{K}'^{-\top} \mathbf{E}\, \mathbf{K}^{-1} \mathbf{x} = 0 \quad\Longrightarrow\quad \mathbf{F} = \mathbf{K}'^{-\top} \mathbf{E}\, \mathbf{K}^{-1}$

which rearranges to the relation $\mathbf{E} = \mathbf{K}'^{\top} \mathbf{F}\, \mathbf{K}$ . The rank-2 property is now obvious: $[\mathbf{t}]_\times$ is a skew-symmetric $3 \times 3$ matrix and is therefore singular (any odd-sized skew matrix has determinant zero), with rank exactly 2 whenever $\mathbf{t} \neq \mathbf{0}$ . Multiplying by the invertible matrices $\mathbf{R}$ , $\mathbf{K}^{-1}$ , and $\mathbf{K}'^{-\top}$ does not change the rank, so $\mathbf{E}$ , and hence $\mathbf{F}$ , has rank 2. The epipoles are the null vectors. In the first image, $\mathbf{e} \sim \mathbf{K}\mathbf{R}^{\top}\mathbf{t}$ gives $\mathbf{F}\mathbf{e} = \mathbf{K}'^{-\top}[\mathbf{t}]_\times \mathbf{t} = \mathbf{0}$ because $\mathbf{t}\times\mathbf{t} = \mathbf{0}$ , and $\mathbf{K}\mathbf{R}^{\top}\mathbf{t}$ is exactly the image of the second camera center $-\mathbf{R}^{\top}\mathbf{t}$ . In the second image, $\mathbf{e}' \sim \mathbf{K}'\mathbf{t}$ is the image of the first camera center and satisfies $\mathbf{F}^{\top}\mathbf{e}' = \mathbf{0}$ .
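The whole chain can be checked numerically in the convention of the main text, $\mathbf{x}'^{\top}\mathbf{F}\mathbf{x} = 0$ with $\mathbf{E} = (\mathbf{K}')^{\top}\mathbf{F}\mathbf{K}$. The sketch below projects random 3D points through two assumed cameras and confirms that the constraint and the epipole relations hold. The intrinsics `K1` and `K2` (standing in for $\mathbf{K}$ and $\mathbf{K}'$), the pose, and the points are all made-up values.

```python
import numpy as np

def skew(t):
    """Matrix [t]_x such that skew(t) @ v equals np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

# Assumed calibration and relative pose (hypothetical numbers).
K1 = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
K2 = np.array([[750.0, 0.0, 300.0], [0.0, 760.0, 250.0], [0.0, 0.0, 1.0]])
a = 0.15
R = np.array([[np.cos(a), 0.0, np.sin(a)],
              [0.0, 1.0, 0.0],
              [-np.sin(a), 0.0, np.cos(a)]])
t = np.array([-1.0, 0.1, 0.2])

E = skew(t) @ R
F = np.linalg.inv(K2).T @ E @ np.linalg.inv(K1)

# Random points in front of both cameras, projected with K [I | 0] and K' [R | t].
rng = np.random.default_rng(3)
X = np.column_stack([rng.uniform(-1.5, 1.5, 10),
                     rng.uniform(-1.0, 1.0, 10),
                     rng.uniform(4.0, 8.0, 10)])
x1 = X @ K1.T
x1 = x1 / x1[:, 2:]                 # homogeneous pixels [u, v, 1] in the first image
x2 = (X @ R.T + t) @ K2.T
x2 = x2 / x2[:, 2:]                 # homogeneous pixels in the second image

residuals = np.sum(x2 * (x1 @ F.T), axis=1)   # x'^T F x for each pair
print("largest |x'^T F x|:", np.abs(residuals).max())

# Epipoles: images of the other camera's center, and null vectors of F and F^T.
e = K1 @ R.T @ t        # second center (-R^T t) seen in the first image, up to scale
e_prime = K2 @ t        # first center (the origin) seen in the second image
print("F e    :", F @ e)
print("F^T e' :", F.T @ e_prime)
print("E recovered from F:", np.allclose(K2.T @ F @ K1, E))
```

Swapping the roles of `K1` and `K2` in the formula for `F` makes the residuals visibly nonzero, which is a handy guard against mixing up which image carries the prime.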

Key takeaways

The epipolar constraint turns a 2D search into a 1D one. The match for a point in one image must lie on a single line in the other, the epipolar line, and that line is read off directly from $\mathbf{F}\mathbf{x}$ .
The fundamental matrix $\mathbf{F}$ packs two cameras into 7 numbers. It is $3 \times 3$ , rank 2 ( $\det \mathbf{F} = 0$ ), and scale-free, and those seven parameters are everything point matches can ever tell you about the camera pair.
The epipole is the pivot and the null vector. Every epipolar line passes through it, and it satisfies $\mathbf{F}\mathbf{e} = \mathbf{0}$ ; geometrically it is where the baseline pierces the image.
The essential matrix $\mathbf{E} = (\mathbf{K}')^{\top}\mathbf{F}\mathbf{K}$ is the calibrated, 5-DOF version that factors as $[\mathbf{t}]_\times \mathbf{R}$ , so it hands you the camera's rotation and translation directly.
Estimate, then enforce, then robustify. The eight-point algorithm solves for $\mathbf{F}$ with SVD; normalization (Hartley, 1997) makes it accurate; rank-2 projection makes it geometrically valid; RANSAC throws out the bad matches.

Two pictures of the same world are never strangers. The rigid geometry of the cameras ties every point in one frame to a single ruled line in the other, and that line is the quiet promise underneath stereo depth, motion estimation, and every robot that finds its way by looking.

Switch eyes one more time and watch your finger jump against the still wall. You are not seeing two images. You are seeing the line that joins them.