Foundations · #02 of 16

The Pinhole Camera Model

How Cameras See

A large inverted image of Prague Castle's palace projected onto a dim attic wall by a single small hole in the tile roof. — A natural camera obscura: a single hole in a tile roof paints a four-metre inverted image of Prague Castle on the attic wall inside. — Gampe, CC BY 3.0

On a bright afternoon, close every curtain in a room until it is dark, then poke a single small hole in the blind. Wait for your eyes to adjust. Slowly, the wall opposite the hole begins to glow with a faint, full-color picture of the street outside: cars, clouds, a passing dog. It is moving. It is in color. And it is completely upside down.

No glass. No electronics. No lens. Just a hole and a wall, and somehow the entire outside world has been painted onto your plaster, inverted and shrunk to fit.

Every camera ever built, from a Roman keyhole to the sensor in your phone, is a refinement of that hole in the blind.

The trick has a name that is almost a thousand years old: the camera obscura, Latin for "dark chamber." The physics is brutally simple. Light travels in straight lines. A tiny hole only lets through the rays that happen to line up with it, so each point on the far wall can be lit by light coming from exactly one direction. Rays from the top of the scene angle down through the hole and land low; rays from the bottom angle up and land high. That is why the image flips. The hole does not bend light; it just selects it.

This is the whole idea behind the pinhole camera model, the foundation that the rest of computer vision is built on. Before we can detect an edge, recognize a face, or reconstruct a room in 3D, we need one honest equation for how a point floating in space ends up at a particular pixel. The pinhole model is that equation, and it is gorgeous in how little it asks of you.

Drag the focal length slider and the 3D point in the simulator below. Watch where the projected dot lands on the image plane. Look for two things: pushing the point farther away (larger depth) makes its image slide toward the center and shrink, and stretching the focal length zooms the whole scene in like a telephoto lens.

Focal Length (f) = 80 Object Distance (Z) = 150

What you just felt with your fingers is perspective projection: the rule that an object's apparent size is its real size divided by its distance. A friend at arm's length and a skyscraper a mile away can both fit behind your thumb. The pinhole does not measure distance; it trades distance away for size. That single trade is responsible for railroad tracks meeting at the horizon, for the moon looking the same size as a coin, and for why a 2D photo can never, on its own, tell you how far away anything truly is.

One hole, one equation

Set up a coordinate frame with the pinhole at the origin, looking down the $Z$ axis into the world. A point in front of the camera is $P = (X, Y, Z)$ . The image forms on a plane a distance $f$ behind the hole. By similar triangles, the point lands at image coordinates $(x, y)$ where

x = f\,\frac{X}{Z}, \qquad y = f\,\frac{Y}{Z}.

Reading every symbol in plain words:

$X, Y$ are how far the point sits sideways and up/down from the optical axis, in real-world units (say, meters).
$Z$ is the depth: how far the point is in front of the camera, measured along the direction the camera looks.
$f$ is the focal length: the distance from the pinhole to the image plane. Bigger $f$ magnifies.
$x, y$ are where the point's image lands on the plane.

The only interesting thing happening here is that division by $Z$ . Double an object's distance and its image halves in every dimension, so it covers a quarter of the area. That is the mathematical signature of perspective, and it falls straight out of light going in straight lines through one point.

A pinhole inverts the world — the object projects through the aperture to an upside-down image at focal length f.

Ray diagram of a pinhole camera: light from a tree passes through a small aperture in a box and forms an inverted image on the back wall. — The textbook pinhole: rays from the top of the tree cross through the hole and land at the bottom of the image plane, so the picture arrives upside down. — DrBob / Pbroks13, Public domain

The man who explained it

The leap from "neat effect" to "law of optics" belongs to one person. Ibn al-Haytham (Latinized as Alhazen), working in Cairo around 1011–1021, wrote the Kitab al-Manazir, the Book of Optics. In it he did something radical: he tested the camera obscura with controlled experiments using multiple candles and a screen, and proved that each point of light travels to the screen along its own straight ray, independent of the others. That is the pinhole model, stated correctly, five hundred years before the telescope.

Al-Haytham also settled an argument that had run since the Greeks. Many believed vision was extramissive, that the eye shot out rays to feel the world like invisible fingers. He showed it was intromissive: light comes from objects into the eye. He even argued that what we call "seeing" is finished in the brain, not the eye, because perception is subjective and shaped by experience. For insisting that a hypothesis must survive a confirmable experiment, he is often called the first true scientist.

Engraved portrait of Ibn al-Haytham (Alhazen) used as a frontispiece in a 17th-century European astronomy book. — Ibn al-Haytham (Alhazen) · c. 965 – c. 1040 Wrote the **Book of Optics** (1011–1021), gave the first correct experimental account of the *camera obscura* and pinhole imaging, proved vision is intromissive, and championed experiment as the test of any hypothesis.

There is a poignant origin story to his optics. According to medieval accounts, al-Haytham had boasted to the Fatimid caliph al-Hakim that he could regulate the flooding of the Nile with a dam. Realizing on site that the engineering was impossible and that failure under a famously volatile ruler could be fatal, he is said to have feigned madness to escape punishment, living for years under house arrest. It was in that confinement, the story goes, that he turned to light, lenses, and the dark chamber, and changed how humanity understands seeing.

An engraving of a portable box-style camera obscura, labeled with its lens and mirror, used by an artist to trace a projected image onto paper. — The camera obscura as a portable drawing aid: a lens and mirror inside the box project the scene downward so an artist can trace the live image and capture perfect perspective. — unknown illustrator, Public domain

Focal length is just zoom

In the equation, $f$ is the lone knob that sets the trade between field of view and magnification. A short focal length puts the image plane close to the hole, so a wide cone of the world squeezes onto the sensor: a broad, slightly exaggerated wide-angle view. A long focal length pushes the plane far back, so only a narrow cone is captured but everything in it is enlarged: a telephoto reach. When you "zoom in" on a phone, you are emulating a longer $f$ .

A pure pinhole has one charming property a lens can never match: it is in focus at every distance simultaneously, because each scene point still maps to a single ray. The catch is light. A tiny hole is faint, so pinhole photographs need long exposures; widen the hole for brightness and the image goes soft, since each point now spreads into a little blur disk. Lenses exist purely to solve that brightness-versus-sharpness conflict, gathering a wide cone of light and bending it back to a point. The geometry, though, stays exactly the pinhole geometry.

A pinhole photograph of a fire hydrant shown as a paper negative on top and the inverted positive below, soft and with corner flaring. — A real pinhole photograph made with a shoebox camera (paper negative on top, digital positive below). No lens was used; the softness and corner flaring are the honest signature of imaging through a bare hole. — Matthew Clemente, CC BY-SA 2.5

The picture is off-center: the principal point

The clean equations assume the optical axis pierces the image plane exactly at coordinate $(0,0)$ . But pixels are counted from a corner of the sensor, not its center. The point where the optical axis actually hits the image, expressed in pixel coordinates, is the principal point $(c_x, c_y)$ , and it usually sits near the middle of the frame. Add it as an offset:

x = f\,\frac{X}{Z} + c_x, \qquad y = f\,\frac{Y}{Z} + c_y.

Here $c_x, c_y$ simply shift the origin from the image center to the sensor's corner so the result is a real pixel address. These four numbers, $f$ , $f$ again (one per axis in practice), $c_x$ , and $c_y$ , are the camera's intrinsics. In the next lesson we will pack them into a single matrix and let matrix multiplication do all of this at once.

For the advanced reader → Why the division by Z forces us into homogeneous coordinates

The pinhole map is not linear. You cannot write $(x, y) = M \cdot (X, Y, Z)$ with a fixed matrix $M$ , because dividing by $Z$ is a nonlinear operation, and $Z$ is itself one of the inputs. That is annoying: linear algebra is the only tool that scales to real pipelines.

The fix is to embed the problem in homogeneous coordinates. Tack on an extra coordinate and treat the result as defined only up to an overall scale, so that $(x, y, w)$ and $(\lambda x, \lambda y, \lambda w)$ denote the same image point for any $\lambda \neq 0$ . Now write the projection as a linear map onto a 3-vector:

\begin{pmatrix} x' \\ y' \\ w' \end{pmatrix} = \begin{pmatrix} f & 0 & c_x \\ 0 & f & c_y \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \end{pmatrix} = \begin{pmatrix} fX + c_x Z \\ fY + c_y Z \\ Z \end{pmatrix}.

The matrix on the left is the intrinsic matrix $K$ . The division by depth has not vanished; it has been deferred. To recover an actual pixel you "dehomogenize" by dividing through by the last coordinate $w' = Z$ :

x = \frac{x'}{w'} = f\,\frac{X}{Z} + c_x, \qquad y = \frac{y'}{w'} = f\,\frac{Y}{Z} + c_y,

which is exactly the offset pinhole equation from above. The payoff is enormous: rotation, translation, and projection all become matrix multiplications that you can chain, and the single nonlinear step (the divide) is isolated at the very end. Every calibration routine, every 3D reconstruction, and the camera matrix of the next lesson rides on this one trick.

Key takeaways

A camera is a hole. The pinhole / camera obscura selects rays that pass through one point, painting an inverted image of the world on a plane behind it.
Perspective is division by depth. $x = fX/Z$ and $y = fY/Z$ mean apparent size shrinks in proportion to distance, which is why a flat photo cannot recover absolute depth on its own.
Focal length is zoom. A short $f$ gives a wide field of view; a long $f$ magnifies a narrow one. A bare pinhole is sharp at all depths but starved for light, which is why real cameras add lenses.
The principal point recenters the math. Adding $(c_x, c_y)$ converts ideal centered coordinates into honest pixel addresses; together with $f$ these are the camera's intrinsics.
Homogeneous coordinates linearize it. Embedding the projection in a 3-vector turns the nonlinear divide-by- $Z$ into one matrix $K$ plus a final dehomogenization, the gateway to the camera matrix.

A thousand years ago a man under house arrest in Cairo lit a row of candles in a dark room and watched their light cross cleanly through a pinhole, each ray keeping to its own straight path. He could not have imagined the sensor in your pocket, yet he wrote down its physics exactly. Every photograph you have ever taken is still that image on the wall of the dark chamber, just measured more carefully. From here, all we do is give the hole an address and the math a matrix.