Feature Detection · #07 of 16

Corner Detection

Finding Interest Points

A camera moves through a scene while a tracker follows bright dots on the corners of objects, estimating the camera's own motion frame to frame.
Egomotion: corner points tracked across frames let a robot estimate how its own camera moved through the world. — Benohamid, CC BY-SA 3.0

Slide a tiny window across a blank wall and nothing changes. Slide it along the long edge of a doorframe and almost nothing changes: you are drifting parallel to the line, blind to your own motion. But place that window over the spot where the door's top rail meets its side rail, nudge it any direction at all, and the view lurches.

That lurch is information. It is the difference between a feature a machine can lock onto and a smear it will lose in the next frame. A stitched panorama, a self-driving car estimating how far it rolled, a phone re-projecting a sticker onto your moving face: all of them are quietly hunting for these unmistakable spots.

A corner is the rare place in an image that looks different no matter which way you peek, and that uniqueness is what makes it trackable.

The whole field calls these spots interest points or feature points, and corner detection is the craft of finding them automatically. The intuition is almost embarrassingly simple: a good feature is one you could not have found anywhere else. A patch of blue sky could be any patch of blue sky. A stretch of brick wall slides into the next stretch. But the precise junction where two lines cross is a fingerprint. Track that fingerprint across two photos and you have tied them together.

Drag the corner threshold in the sandbox below and flip the output between the Harris response heatmap and the detected Corners, then tap 📷 Camera and aim it at a checkerboard or a window frame. Look for where the heatmap glows brightest: not along the clean edges, but at the intersections where edges collide.

input
corners

What the heatmap reveals is that "edge" and "corner" are not the same animal. Edges light up faintly and stretch out in lines; corners flare into bright isolated dots. The detector is scoring every pixel by a single question: if I shifted my little window away from here, in every direction, how violently would the picture underneath change? Flat regions score near zero. Edges score high in one direction and low in the other. Corners score high all around. That triage, flat versus edge versus corner, is the engine of everything that follows.

The aperture problem

Look at the world through a soda straw and try to judge which way a tilted line is moving. You cannot. You only ever see the line cross your tiny circle of view, so motion along the line is invisible. This is the aperture problem, and it is the reason edges make terrible features. Through a small window, an edge can slide endlessly without announcing it.

We can make this precise. Take a small image window WW and ask how much its contents change if we shift it by (u,v)(u, v). The sum of squared differences is:

E(u,v)=(x,y)W[I(x+u,y+v)I(x,y)]2E(u, v) = \sum_{(x,y) \in W} \bigl[\, I(x+u,\, y+v) - I(x, y) \,\bigr]^2

Here I(x,y)I(x,y) is the image brightness at pixel (x,y)(x,y), the pair (u,v)(u,v) is the small shift we test, and WW is the window of pixels we sum over. EE measures the total change the shift causes. A flat patch gives E0E \approx 0 for every shift. An edge gives E0E \approx 0 for shifts along it. Only a corner gives a large EE in all directions, and that is exactly the "lurch" you felt in the Hook.

The Harris corner detector

In 1988 Chris Harris and Mike Stephens, working on motion analysis in Plessey's research labs in Britain, took Moravec's idea and made it smooth. Instead of physically shifting the window in a handful of clumsy directions, they used calculus. A first-order Taylor expansion of E(u,v)E(u,v) turns that messy sum into clean linear algebra, and out drops a small matrix that encodes the local image structure in every direction at once:

M=W[Ix2IxIyIxIyIy2]M = \sum_{W} \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}

where IxI_x and IyI_y are the image gradients (how fast brightness changes left-to-right and top-to-bottom, usually computed with a Sobel filter), and the sum runs over the window WW. This 2×22 \times 2 matrix MM is called the structure tensor. Its two eigenvalues, λ1\lambda_1 and λ2\lambda_2, tell you how strongly the image varies along the two principal directions of the window. Two small eigenvalues mean a flat region; one large and one small mean an edge; two large eigenvalues mean a corner.

Computing eigenvalues at every pixel was expensive in 1988, so Harris and Stephens used a clever shortcut. They scored each pixel with:

R=det(M)ktrace(M)2R = \det(M) - k \cdot \operatorname{trace}(M)^2

The trick is that det(M)=λ1λ2\det(M) = \lambda_1 \lambda_2 and trace(M)=λ1+λ2\operatorname{trace}(M) = \lambda_1 + \lambda_2, so RR can be read straight off the matrix without ever solving for the eigenvalues. The constant kk is an empirical fudge factor, usually between 0.040.04 and 0.060.06. A large positive RR means a corner, a large negative RR means an edge, and RR near zero means a flat region. That is precisely the heatmap you scrubbed through above.

A grayscale test image with small colored markers placed at the sharp interior and exterior corners of the depicted shapes, showing which points a corner detector flags.
A classic corner-detection test image: markers fall on the junctions where edges meet, not along the straight edges themselves. — Retardo, Public domain

Shi-Tomasi and the honest minimum

Six years later, in 1994, Jianbo Shi and Carlo Tomasi asked a sharper question: not "what is a corner?" but "what is a feature good to track?" Their answer dropped Harris's clever determinant shortcut and went back to the eigenvalues themselves. They scored each pixel by:

R=min(λ1,λ2)R = \min(\lambda_1, \lambda_2)

The reasoning is plainly stated once you see it. A point is only trackable if it changes strongly in every direction, and the weakest direction is the one set by the smaller eigenvalue. So score the point by its weakest link. If even the minimum eigenvalue is large, the point is solid in all directions. This Shi-Tomasi measure proved more stable in practice and became the default behind OpenCV's goodFeaturesToTrack().

A photograph processed by the Foerstner corner operator, with detected interest points marked across the structure of the scene.
The same hunt, a different scorer: corners flagged by the Foerstner operator, one of several refinements that followed Harris. — Nicoht, CC BY-SA 3.0

There is one more idea you need before the corners are usable. A strong corner does not light up a single pixel; it lights up a small bright blob of high response. Left alone, the detector would report the same corner a dozen times over. Non-maximum suppression fixes this: scan for local maxima of RR, and within each neighborhood keep only the single strongest response, discarding its dimmer neighbors. The result is the clean scatter of well-separated points you want to feed into a tracker.

A diagram of the FAST corner detector showing a candidate pixel at the center and the ring of 16 surrounding pixels it compares against to decide whether the center is a corner.
The FAST test: a candidate pixel is a corner if a long enough arc of the 16-pixel ring around it is uniformly brighter or darker than the center. — Jingjin Huang, Guoqing Zhou, Xiang Zhou and Rongting Zhang, CC BY 4.0
For the advanced reader → Where the structure tensor M actually comes from

Start from the change function and Taylor-expand the shifted image to first order:

I(x+u,y+v)I(x,y)+Ixu+IyvI(x+u,\, y+v) \approx I(x,y) + I_x\, u + I_y\, v

Substitute that into E(u,v)E(u,v) and the I(x,y)I(x,y) terms cancel, leaving a sum of squared linear terms:

E(u,v)W(Ixu+Iyv)2E(u, v) \approx \sum_W \bigl( I_x\, u + I_y\, v \bigr)^2

Expand the square and group the uu and vv coefficients, and the whole expression collapses into a quadratic form:

E(u,v)[uv]M[uv],M=W[Ix2IxIyIxIyIy2]E(u, v) \approx \begin{bmatrix} u & v \end{bmatrix} \, M \, \begin{bmatrix} u \\ v \end{bmatrix}, \qquad M = \sum_W \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}

This is the punchline of the Harris derivation. The change in any direction is governed entirely by the structure tensor MM. The eigenvectors of MM point along the directions of greatest and least change, and the eigenvalues λ1,λ2\lambda_1, \lambda_2 give their magnitudes. The surface EE is a bowl whose steepness is set by those eigenvalues:

  • Both small: a shallow, near-flat bowl. Flat region.
  • One large, one small: a long trough. Edge (free to slide along the trough).
  • Both large: a steep, isotropic bowl. Corner.

Harris's R=λ1λ2k(λ1+λ2)2R = \lambda_1 \lambda_2 - k(\lambda_1 + \lambda_2)^2 and Shi-Tomasi's R=min(λ1,λ2)R = \min(\lambda_1, \lambda_2) are just two different ways of squeezing this two-eigenvalue situation into a single trackability score.

Key takeaways

Every panorama your phone stitches, every AR sticker that clings to your moving face, every robot that knows how far it just rolled, all of it begins with the same humble act: a tiny window, slid across the image, asking did anything change? The wall says no. The doorframe's edge says barely. But the corner, that stubborn junction where two lines insist on crossing, says yes, in every direction at once, and in that small, unambiguous yes, a machine finds something it can hold onto.