Geometry & Transforms · #11 of 16

Homography

Plane-to-Plane Mapping

A 19th-century engraved perspective construction from Hermann Hauck's treatise on perspective and photogrammetry, showing a planar figure projected through a center toward an image plane.
Figure 1a from Hermann Hauck's Neue Constructionen der Perspective und Photogrammetrie. The geometry of a plane viewed through a point is centuries older than the camera. — Hermann Guido Hauck, zoom by Michelbailly, CC0

Tilt a photograph of a chessboard and the squares stop being square. The far rank shrinks, the lines that should run parallel lean toward each other, and somewhere off the edge of the frame they would meet at a single vanishing point. Your eye is not fooled. You still read it as a flat, regular grid seen at an angle.

How does the brain undo that distortion so effortlessly? It is quietly solving a piece of geometry that took mathematicians three hundred years to formalize. Every flat surface, photographed from any viewpoint, is tied to its true shape by a single fixed rule.

That rule is a homography: one 3×3 matrix that maps any plane to any other view of it.

When two photographs show the same flat thing, a wall, a painting, a sheet of paper, a soccer pitch, there is exactly one transformation that turns one view into the other. It is not a stretch or a rotation alone. It is a projective transformation, the kind that lets the far edge of a table shrink while the near edge stays large. Find that transformation and you can flatten a skewed photo of a document into a crisp scan, stitch two overlapping snapshots into one panorama, or paste a virtual poster onto a real wall so convincingly that it leans with the wall as you move.

See the warp before you name it

source · drag corners or the middle
rectified view
Homography Matrix H (3×3)
1.00 0.00 0.00
0.00 1.00 0.00
0.00 0.00 1.00

Drag the four corners of the quadrilateral in the demo above, or grab the middle to slide the whole quad, and watch two things move together: the rectified preview on one side, and the nine numbers of the H matrix on the other. What to look for is the moment the rectangle becomes a trapezoid. The top edge and bottom edge are still straight lines, but they are no longer parallel, and the H matrix grows non-zero entries in its bottom row. Then tap the camera button and point at any flat surface in the room to warp your own scene live.

That is the whole idea in one gesture. Four points on the source, four points on the target, and the machine solves for the single matrix that carries every other point along with them. Straight lines stay straight. Parallel lines do not. The bottom row of H is exactly where the perspective lives.

The governing equation

A homography acts on points written in homogeneous coordinates: a 2D point (x,y)(x, y) becomes the triple (x,y,1)(x, y, 1), with an extra dummy coordinate that lets perspective become plain matrix multiplication.

[xyw]=[h11h12h13h21h22h23h31h32h33][xy1],(ximg,yimg)=(xw,yw)\begin{bmatrix} x' \\ y' \\ w' \end{bmatrix} = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}, \qquad (x_{\text{img}},\, y_{\text{img}}) = \left( \tfrac{x'}{w'},\, \tfrac{y'}{w'} \right)

Reading the symbols in plain words: (x,y)(x, y) is your starting pixel and (x,y,w)(x', y', w') is where it lands before normalizing. The matrix entries h11h_{11} through h22h_{22} handle the familiar rotation, scale, and shear. The right column h13,h23h_{13}, h_{23} is translation, the slide. The crucial part is the bottom row h31,h32h_{31}, h_{32}: those two numbers create perspective by making ww' depend on where the point is. The final division by ww' is the step that makes far things small. Set the bottom row to (0,0)(0, 0) and ww' stays at 1, perspective vanishes, and the homography collapses into an ordinary affine map.

Counting the freedom: why four points

The matrix has nine entries, but it has only eight degrees of freedom. Scaling all nine numbers by the same factor produces the same map, because the final division by ww' cancels any common multiple. So one degree is pure illusion, leaving eight real knobs to set: two for translation, two for scale, one for rotation, one for shear, and two for the perspective foreshortening in the bottom row.

Each point correspondence, one pixel in the source matched to its partner in the target, pins down two of those eight numbers, one equation for xx' and one for yy'. Eight unknowns, two equations apiece, means four correspondences are exactly enough to solve for H. That is why the demo gives you precisely four corners to drag. No more, no fewer, are needed to lock a plane.

Diagram of two cameras photographing the same flat surface from different positions, with the planar scene mapped into each image plane and the two views related by a homography.
Two images of one planar surface are always related by a single homography (under the pinhole camera model). Find H and you can hop between the views. — Per Rosengren, derivative by Appoose, CC BY 3.0

This is the deep claim from computer vision: under a pinhole camera, any two images of the same flat surface are related by a homography. There is a second, surprising case where it holds even for a non-flat scene: if the camera only rotates about its own center and never moves sideways, the whole world projects through H, depth and all. That is the secret behind panorama apps. Pan your phone in place and every frame is one homography away from its neighbor. Take a step sideways and the spell breaks, because translation reveals depth and a single matrix can no longer cover it.

Solving for H: the Direct Linear Transform

You have four matched points and you want the matrix. The standard recipe is the Direct Linear Transform (DLT), an algorithm that solves a set of relations xkAyk\mathbf{x}_k \propto \mathbf{A}\,\mathbf{y}_k, where \propto means "equal up to an unknown scale." The proportionality is the catch: because the scale ww' is unknown for every point, you cannot just write a tidy linear system and invert it. The DLT clears the scale algebraically.

For each correspondence (x,y)(x,y)(x, y) \leftrightarrow (x', y'), the proportional relation expands into two linear equations in the nine entries of H. Stack the equations from all your points into one tall matrix A\mathbf{A}, and you are left with the homogeneous problem Ah=0\mathbf{A}\,\mathbf{h} = \mathbf{0}, where h\mathbf{h} is the nine entries unrolled into a vector. The solution is the right singular vector of A\mathbf{A} with the smallest singular value, read straight off a singular value decomposition (SVD). In OpenCV the entire procedure is one call, findHomography(srcPoints, dstPoints).

For the advanced reader → The full DLT derivation, from proportionality to SVD

Start from the homogeneous mapping for a single point, written with the rows of H as h1,h2,h3\mathbf{h}_1^\top, \mathbf{h}_2^\top, \mathbf{h}_3^\top and the source point as p=(x,y,1)\mathbf{p} = (x, y, 1)^\top:

p=[xyw][h1ph2ph3p]\mathbf{p}' = \begin{bmatrix} x' \\ y' \\ w' \end{bmatrix} \propto \begin{bmatrix} \mathbf{h}_1^\top \mathbf{p} \\ \mathbf{h}_2^\top \mathbf{p} \\ \mathbf{h}_3^\top \mathbf{p} \end{bmatrix}

The proportionality means p\mathbf{p}' and HpH\mathbf{p} are parallel vectors, so their cross product is zero: p×(Hp)=0\mathbf{p}' \times (H\mathbf{p}) = \mathbf{0}. Writing out the cross product and using the measured image coordinates (x,y)(x', y') (so w=1w' = 1) gives two independent rows per correspondence:

[xy1000xxyxx000xy1xyyyy][h11h12h33]=0\begin{bmatrix} -x & -y & -1 & 0 & 0 & 0 & x x' & y x' & x' \\ 0 & 0 & 0 & -x & -y & -1 & x y' & y y' & y' \end{bmatrix} \begin{bmatrix} h_{11} \\ h_{12} \\ \vdots \\ h_{33} \end{bmatrix} = \mathbf{0}

Four points produce eight rows, enough to fix the eight degrees of freedom; with more points, A\mathbf{A} becomes overdetermined. In either case the answer is the same: form AA\mathbf{A}^\top \mathbf{A}, take its SVD A=UΣV\mathbf{A} = U \Sigma V^\top, and read h\mathbf{h} off the column of VV corresponding to the smallest singular value. That vector minimizes Ah\lVert \mathbf{A}\mathbf{h} \rVert subject to h=1\lVert \mathbf{h} \rVert = 1, which is exactly the scale-free constraint we needed. Reshape the nine numbers back into a 3×3 matrix and normalize so h33=1h_{33} = 1. In practice you first normalize the input coordinates (center them and scale so the average distance to the origin is 2\sqrt{2}); skipping that step makes the SVD numerically fragile, a subtlety Hartley made famous.

When the matches lie: RANSAC

In the lab you hand-pick four perfect corners. In the wild, correspondences come from an automatic feature matcher, and a stubborn fraction of them are simply wrong: a window matched to a different window, a brick to the wrong brick. These outliers poison any least-squares fit, because a single bad pair can drag the whole matrix off true.

The cure is Random Sample Consensus (RANSAC), published by Martin Fischler and Robert Bolles at SRI International in 1981 to solve the location determination problem: figure out where a camera stood given landmarks of known position. Their insight was almost rude in its simplicity. Do not trust all the data. Pick a tiny random subset, just four matches, fit an H from them, then count how many of the remaining points that H explains well. Those agreeing points are the inliers. Repeat the gamble many times and keep the H with the largest inlier set. It is non-deterministic, returning a good answer only with some probability, but that probability climbs toward certainty as you allow more iterations.

A scatter plot of points with a clear linear trend plus scattered outliers, with a line fit by RANSAC passing through the dense band of inliers and ignoring the stray points.
RANSAC fitting a line to data laced with outliers. The fit follows the dense band of inliers and simply ignores the strays. The same logic estimates a homography from noisy feature matches. — AlexanderLHayes, CC BY-SA 4.0

Homography versus affine

It helps to place the homography against its simpler cousin, the affine transformation. Affine is what you get when you forbid perspective: a 2×3 matrix, six degrees of freedom, needing only three point correspondences. Crucially, affine maps preserve parallelism, parallel lines stay parallel, which makes them right for gentle rotations, scaling, and shear but wrong for strong perspective.

| Property | Affine | Homography | |---|---|---| | Matrix size | 2×3 (6 DOF) | 3×3 (8 DOF) | | Parallel lines | Preserved | Not preserved | | Best for | Small rotations, modest skew | Strong perspective | | Points needed | 3 | 4 |

The homography is the strictly more powerful map. The moment your scene shows real foreshortening, a road narrowing to the horizon, a document photographed from a low angle, you need those two extra degrees of freedom in the bottom row.

A projective-geometry diagram: a pencil of straight lines radiating from a single point, crossed by two transversal lines, illustrating how a projection maps points along the rays.
Projection in the plane: rays through a center cut transversals at corresponding points. This single picture is the seed of every perspective transformation. — Krishnavedala, CC0

These applications all reduce to "find H, then warp." Document scanners detect paper corners and rectify them to a clean rectangle. Panorama stitchers match overlapping frames and align them. Augmented-reality apps track a flat marker and overlay virtual objects in correct perspective. Sports broadcasts map the camera view onto an overhead pitch diagram to draw offside lines that hug the grass. Different goals, one matrix.

Key takeaways

Three centuries ago, draftsmen learned to make a flat drawing look like a room receding into depth, and called the rule that governed it perspective. Mathematicians later gave it points at infinity and named it a projective transformation. Engineers wrapped it in nine numbers and called it H. It is all the same gesture: the chessboard that tilts but never lies, the panorama that swings true so long as you pivot in place, the document that springs back into a clean rectangle. Drag four corners and you hold, in the palm of one matrix, the oldest trick in the art of seeing.