Foundations · #03 of 16

Camera Matrix & Projection

The Math of Seeing

Animation of a simple house wireframe seen from a camera, with the focal length sliding longer and shorter, stretching and compressing the projected image — One scene, one model, one knob. Sliding the focal length changes the whole feel of the picture without moving a single point in the world. — SharkD, CC BY-SA 4.0

A camera is a machine that destroys a dimension. The world has three: width, height, depth. The photograph has two. Somewhere between the lens and the sensor, depth is thrown away, and a glowing point of light gets pinned to a single pixel.

Yet we still recover so much. We read a face, judge a distance, reach for the right cup. The astonishing thing is not that information is lost. It is that the loss is lawful: a clean, repeatable rule maps every point in the room to exactly one cell on the chip.

That rule is a matrix, and learning to read it is learning to read the camera's mind.

Every camera, from the one on your phone to the rover squinting at Martian rock, carries a hidden personality: how long its focal length is, where the center of its image actually sits, whether its pixels are perfectly square. We bundle that personality into a small grid of numbers called the camera matrix. Feed it a 3D point and it tells you, with no guessing, which pixel that point lands on.

Interactive simulator: Intrinsic Matrix K

K = \begin{bmatrix} 150 & 0 & 160 \\ 0 & 150 & 100 \\ 0 & 0 & 1 \end{bmatrix}

2D pixel projection:

x = f_x \cdot (X/Z) + c_x, \qquad y = f_y \cdot (Y/Z) + c_y

Default slider values: $f_x = 150$ and $f_y = 150$ (focal lengths), $c_x = 160$ and $c_y = 100$ (principal point).

Drag the fx, fy focal lengths and the cx, cy principal point in the simulator and watch where the 3D cube lands on the image plane. Then tap 📷 Camera and deliberately mis-set the calibration: look for the moment the wireframe slides right off the scene, the way a poorly fit pair of glasses pushes the whole world sideways. What you are feeling is the difference between a camera that knows itself and one that does not.

Notice what changes and what does not. Moving cx, cy slides the entire projected cube as a rigid block, because you are just relabeling where "the middle" is. Changing fx, fy instead scales the cube about that center, zooming without anything in the world moving. Those are the two distinct jobs the camera matrix does, and seeing them split apart on screen is the whole lesson in miniature.

From geometry to a single equation

In the previous chapter the pinhole gave us perspective by similar triangles: a point at depth $Z$ projects with $x = f\,X/Z$. That divide-by-$Z$ is what shrinks distant things. The trouble is that division is awkward to chain through the long pipeline of a real imaging system, where the world is rotated, translated, projected, then converted into pixel rows and columns.

The fix is a trick of bookkeeping called homogeneous coordinates. We pad each point with one extra coordinate: a 2D point $(x, y)$ becomes $[x, y, 1]^\mathsf{T}$ and a 3D point $(X, Y, Z)$ becomes $[X, Y, Z, 1]^\mathsf{T}$. The payoff is that perspective projection, which is fundamentally nonlinear because of that division, turns into plain matrix multiplication. The division is deferred to the very end, where we "dehomogenize" by dividing through by the last coordinate.
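The padding and the final divide are small enough to write out. The sketch below is only an illustration, assuming NumPy; the helper names `to_homogeneous` and `dehomogenize` are made up for this example and are not part of any library.

```python
import numpy as np

def to_homogeneous(p):
    """Pad a Euclidean point with a trailing 1: (x, y) -> [x, y, 1]."""
    return np.append(np.asarray(p, dtype=float), 1.0)

def dehomogenize(h):
    """Divide through by the last coordinate, then drop it."""
    h = np.asarray(h, dtype=float)
    return h[:-1] / h[-1]

# A 3D point becomes a 4-vector.
point_3d = to_homogeneous([2.0, 4.0, 8.0])       # [2, 4, 8, 1]

# Any nonzero multiple of [5, 7, 1] names the same 2D point.
scaled = 3.0 * np.array([5.0, 7.0, 1.0])         # [15, 21, 3]
pixel = dehomogenize(scaled)                     # back to (5, 7)
```

The round trip only works when the last coordinate is nonzero, which is why the division has to wait until the end of the pipeline.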

With that in hand, the mapping a camera performs collapses to one line:

\mathbf{y} \;\sim\; \mathbf{C}\,\mathbf{x}

Here $\mathbf{x}$ is the world point in homogeneous coordinates (a 4-vector), $\mathbf{C}$ is the $3 \times 4$ camera matrix, and $\mathbf{y}$ is the resulting image point (a 3-vector). The symbol $\sim$ means "equal up to scale": both sides describe the same pixel even if one is multiplied by some nonzero number $k$, because $[x, y, w]^\mathsf{T}$ and $[kx, ky, kw]^\mathsf{T}$ point at the identical location once you divide by the last coordinate. That little squiggle is doing real work; it is why a camera matrix has only 11 true degrees of freedom rather than 12, since scaling the whole matrix changes nothing you can see.

The intrinsic matrix K: the camera's inner life

Split the camera matrix into two factors and its personality and its pose come cleanly apart. The first factor, the intrinsic matrix $K$, captures everything internal to the camera: the bits that stay fixed no matter where you carry it.

K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}

Reading it symbol by symbol: $f_x$ and $f_y$ are the focal length measured in pixels along the horizontal and vertical directions. They can differ if the sensor's pixels are not perfectly square, which is why there are two. $c_x$ and $c_y$ are the principal point: the pixel where the optical axis pierces the sensor, almost always near the image center but rarely exactly at it. The lone $1$ in the corner is the scaffolding of homogeneous coordinates, the placeholder that makes the multiplication close.

A camera-coordinate point $[X, Y, Z]^\mathsf{T}$ projects through $K$ to

\begin{bmatrix} x \\ y \\ w \end{bmatrix} = K \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = \begin{bmatrix} f_x X + c_x Z \\ f_y Y + c_y Z \\ Z \end{bmatrix},

and then the final pixel is $(x/w,\; y/w) = (f_x X/Z + c_x,\; f_y Y/Z + c_y)$. The divide-by-$Z$ you saw in the pinhole chapter is still there; it just waited politely until the last step. This is exactly the slider behavior in the simulator: $f_x, f_y$ scale, $c_x, c_y$ shift.
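To see the scale-versus-shift split in numbers, here is a minimal sketch assuming NumPy and the values from the simulator panel ($f_x = f_y = 150$, $c_x = 160$, $c_y = 100$). The function name `project_camera_point` and the test point are illustrative choices, not something the lesson prescribes.

```python
import numpy as np

K = np.array([[150.0,   0.0, 160.0],
              [  0.0, 150.0, 100.0],
              [  0.0,   0.0,   1.0]])

def project_camera_point(K, point_cam):
    """Multiply by K, then divide by the last coordinate (which equals Z)."""
    x, y, w = K @ np.asarray(point_cam, dtype=float)
    return np.array([x / w, y / w])

P = [1.0, 2.0, 5.0]                       # a point 5 units in front of the camera
pixel = project_camera_point(K, P)        # (150*1/5 + 160, 150*2/5 + 100) = (190, 160)

# Moving the principal point shifts the pixel by the same amount.
K_shift = K.copy()
K_shift[0, 2] += 10.0
pixel_shift = project_camera_point(K_shift, P)   # (200, 160)

# Doubling the focal lengths doubles the offset from the principal point.
K_zoom = K.copy()
K_zoom[0, 0] = K_zoom[1, 1] = 300.0
pixel_zoom = project_camera_point(K_zoom, P)     # (220, 220)
```

The offset from the principal point goes from (30, 60) to (60, 120) when the focal length doubles, while the principal point itself stays at (160, 100).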

Geometric diagram of the pinhole camera model showing the optical axis, the principal point where it meets the image plane, and the focal length f as the distance from the camera center to that plane — The pinhole geometry that K encodes: the optical axis meets the image plane at the principal point, a focal length f away from the camera center. — en:User:KYN, CC0

Extrinsics [R | t]: where the camera stands

The world does not arrive in camera coordinates for free. Before $K$ can do its job, the scene has to be expressed from the camera's point of view, and that depends entirely on where the camera is sitting and which way it is looking. Those facts are the extrinsic parameters: a $3 \times 3$ rotation matrix $R$ for orientation and a $3 \times 1$ translation vector $t$ for position. Stacked side by side as $[R \mid t]$, they form the rigid transform that carries a world point into the camera's own frame.

These are the parameters that change the instant the camera moves. Walk across the room and $[R \mid t]$ updates; the intrinsics $K$ do not care. That clean separation, internals that are calibrated once versus a pose that is re-estimated every frame, is the conceptual spine of structure from motion, visual odometry, and SLAM.
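A short sketch of that rigid transform, assuming NumPy and the common convention $P_{cam} = R\,P_w + t$. The pose values and the helper name `world_to_camera` are hypothetical. One detail worth noticing under this convention: $t$ is the world origin expressed in camera coordinates, and the camera's own position in the world works out to $-R^\mathsf{T} t$.

```python
import numpy as np

def world_to_camera(R, t, point_world):
    """Apply the rigid transform [R | t]: P_cam = R @ P_world + t."""
    return R @ np.asarray(point_world, dtype=float) + t

# Hypothetical pose: R rotates world coordinates 90 degrees about the Z axis,
# and t places the world origin 5 units along the camera's Z axis.
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
t = np.array([0.0, 0.0, 5.0])

p_cam = world_to_camera(R, t, [1.0, 0.0, 0.0])   # [0, 1, 5]

# The camera center is the world point that maps to the camera-frame origin.
camera_center = -R.T @ t                          # [0, 0, -5]
```

Feeding `camera_center` back through `world_to_camera` returns the zero vector, which is a quick sanity check on any pose you estimate.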

Diagram showing a single cube above a ground plane rendered under several different projection types, each producing a different 2D image of the same 3D object — One cube, many projections. The same 3D object yields wildly different images depending on the viewpoint and projection rule. Extrinsics choose the viewpoint; intrinsics choose the lens. — SharkD, CC BY-SA 4.0

Stir intrinsics and extrinsics together and you get the full mapping from a world point $P_w$ to a pixel $p$:

p \;\sim\; \underbrace{K}_{\text{internals}} \, \underbrace{[\,R \mid t\,]}_{\text{pose}} \, P_w

The product $\mathbf{C} = K[R \mid t]$ is the complete $3 \times 4$ projection matrix: the single object that takes any point in the world and tells you its pixel, no further questions. Everything we have built in three chapters lives inside those twelve numbers.
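Assembling the twelve numbers takes one line. The sketch below assumes NumPy, reuses the simulator's $K$, and picks an identity rotation with the camera 5 units back purely for illustration.

```python
import numpy as np

K = np.array([[150.0,   0.0, 160.0],
              [  0.0, 150.0, 100.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)                      # illustrative pose: no rotation
t = np.array([0.0, 0.0, 5.0])      # world origin 5 units in front of the camera

# C = K [R | t] is the full 3x4 projection matrix.
C = K @ np.hstack([R, t.reshape(3, 1)])

# Project a homogeneous world point, then dehomogenize.
P_w = np.array([1.0, 2.0, 0.0, 1.0])
y = C @ P_w                        # [950, 800, 5]
pixel = y[:2] / y[2]               # (190, 160)
```

The world point (1, 2, 0) sits at (1, 2, 5) in the camera frame, so it lands on the same pixel that $K$ alone assigns to that camera-frame point.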

Calibration: teaching a camera its own matrix

Where do these numbers come from? You measure them. Camera calibration is the act of estimating $K$ (and lens distortion) by showing the camera something whose geometry you already know, classically a printed checkerboard. The recipe is short:

Photograph a known pattern from several angles.
Detect the pattern's corners in each image.
Solve for the $K$, $R$, and $t$ that make the projected corners match the detected ones as closely as possible.

In OpenCV the solve step is one call, calibrateCamera(), which returns the intrinsics, the distortion coefficients, and a pose for each photo. The checkerboard works because its corners are unambiguous and lie on a perfect grid, giving the solver clean, plentiful constraints. Skip calibration and every downstream measurement, every distance and every reconstructed point, inherits a silent bias.
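"Match as closely as possible" has a concrete meaning: the solver drives down the reprojection error, the pixel distance between where the model says a corner should land and where it was detected. The sketch below does not call OpenCV. It assumes NumPy, an ideal pinhole with no lens distortion, and a synthetic board, and the helper names `project` and `reprojection_rms` are made up for this example. It only shows how a mis-set principal point appears in that error.

```python
import numpy as np

def project(K, R, t, points_world):
    """Project Nx3 world points to Nx2 pixels (ideal pinhole, no distortion)."""
    cam = points_world @ R.T + t           # world frame -> camera frame
    hom = cam @ K.T                        # camera frame -> homogeneous pixels
    return hom[:, :2] / hom[:, 2:3]        # divide by the last coordinate

def reprojection_rms(K, R, t, points_world, detected):
    """Root-mean-square pixel distance between projected and detected points."""
    residual = project(K, R, t, points_world) - detected
    return np.sqrt(np.mean(np.sum(residual ** 2, axis=1)))

# Synthetic 4x3 grid of board corners, one unit apart, on the plane Z = 0.
xs, ys = np.meshgrid(np.arange(4.0), np.arange(3.0))
board = np.column_stack([xs.ravel(), ys.ravel(), np.zeros(xs.size)])

K_true = np.array([[150.0,   0.0, 160.0],
                   [  0.0, 150.0, 100.0],
                   [  0.0,   0.0,   1.0]])
R = np.eye(3)
t = np.array([-1.5, -1.0, 6.0])            # board roughly centered, 6 units away

# Stand-in for detected corners: perfect projections under the true camera.
detected = project(K_true, R, t, board)

# A camera that "does not know itself": principal point off by 5 pixels in x.
K_wrong = K_true.copy()
K_wrong[0, 2] += 5.0

err_true = reprojection_rms(K_true, R, t, board, detected)    # essentially 0
err_wrong = reprojection_rms(K_wrong, R, t, board, detected)  # 5 pixels
```

A real calibration searches over $K$, the distortion coefficients, and every per-photo pose at once to make this number small.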

Chart comparing the major families of graphical projection, from perspective to orthographic and axonometric, showing how each renders a 3D object onto a 2D plane — The family tree of graphical projection. A real camera lives in the perspective branch, where parallel lines converge and distant objects shrink. — SharkD, CC BY-SA 3.0

For the advanced reader → why the matrix has exactly 11 degrees of freedom

The full projection matrix $\mathbf{C}$ is $3 \times 4$, so it has 12 entries. But recall the relation only holds up to scale:

\mathbf{y} = k\,\mathbf{C}\,\mathbf{x}, \qquad k \neq 0.

Multiplying the entire matrix by any nonzero constant produces an equivalent camera, because the final dehomogenization divides that constant right back out. So one global scale is unobservable, and $12 - 1 = 11$ free parameters remain.
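The unobservable scale is easy to check numerically. This is a small sketch assuming NumPy; the matrix entries and the scale factors are arbitrary illustrative values.

```python
import numpy as np

def project(C, X):
    """Apply a 3x4 projection matrix to a homogeneous 4-vector and dehomogenize."""
    y = C @ X
    return y[:2] / y[2]

C = np.array([[150.0,   0.0, 160.0, 800.0],
              [  0.0, 150.0, 100.0, 500.0],
              [  0.0,   0.0,   1.0,   5.0]])
X = np.array([1.0, 2.0, 0.0, 1.0])

# Scale the whole matrix by several nonzero constants, including a negative one.
pixels = [project(k * C, X) for k in (1.0, -2.5, 1000.0)]
```

All three entries of `pixels` are the same point, (190, 160), because the constant multiplies numerator and denominator alike.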

You can also count them by source. The intrinsics contribute $f_x, f_y, c_x, c_y$, plus a fifth: a skew term $s$ that sits in the first row, second column of $K$ (the zero between $f_x$ and $c_x$) and is set to zero for modern sensors whose pixel axes are perpendicular. That makes 4 in practice and 5 in the general model. The extrinsics contribute 3 for rotation (a rotation matrix lives on a 3-dimensional manifold despite having 9 entries) and 3 for translation. That is $5 + 6 = 11$. The two counts agree, which is the kind of quiet consistency that tells you the model is right.

This is why a minimal direct estimate of $\mathbf{C}$ needs at least 6 point correspondences, not all on one plane: each gives 2 equations (a pixel $x$ and $y$), and $11$ unknowns need $\lceil 11/2 \rceil = 6$ points. In practice you use dozens, and least-squares averages out the noise.
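One standard way to turn those equations into an estimate is the direct linear transform (DLT): stack two linear equations per correspondence and take the null vector of the stacked system. The lesson does not name a method, so treat this as one possible sketch, assuming NumPy, noise-free synthetic data, and eight cube corners (two more than the minimum of six). The function name `estimate_projection_dlt` is made up for this example.

```python
import numpy as np

def estimate_projection_dlt(points_world, pixels):
    """Estimate a 3x4 projection matrix, up to scale, from 3D-2D correspondences."""
    rows = []
    for Xw, (u, v) in zip(points_world, pixels):
        Xh = np.append(Xw, 1.0)                              # homogeneous world point
        zeros = np.zeros(4)
        rows.append(np.concatenate([Xh, zeros, -u * Xh]))    # equation from pixel x
        rows.append(np.concatenate([zeros, Xh, -v * Xh]))    # equation from pixel y
    A = np.array(rows)                                       # shape (2N, 12)
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 4)      # right singular vector of the smallest singular value

# Ground-truth camera used only to synthesize the correspondences.
K = np.array([[150.0,   0.0, 160.0],
              [  0.0, 150.0, 100.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)
t = np.array([-0.3, -0.7, 4.0])
C_true = K @ np.hstack([R, t.reshape(3, 1)])

# Eight corners of a unit cube: non-coplanar, so the system is well posed.
cube = np.array([[x, y, z] for x in (0.0, 1.0) for y in (0.0, 1.0) for z in (0.0, 1.0)])
hom = np.hstack([cube, np.ones((8, 1))]) @ C_true.T
pixels = hom[:, :2] / hom[:, 2:3]

C_est = estimate_projection_dlt(cube, pixels)

# The estimate is only defined up to scale, so fix the scale before comparing.
C_est_scaled = C_est * (C_true[2, 3] / C_est[2, 3])
```

With real, noisy detections the smallest singular value is no longer zero, and the same singular vector becomes the least-squares answer that the extra points help stabilize.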

Key takeaways

A camera is a lawful projection, not a magic eye. The map from 3D world to 2D pixel is a single $3 \times 4$ matrix $\mathbf{C}$, and once you know it, every projection is deterministic.
Homogeneous coordinates turn perspective into multiplication. Padding points with a $1$ lets the awkward divide-by-$Z$ be deferred to a final dehomogenization step, so the whole pipeline becomes linear algebra.
K is who the camera is; [R | t] is where it stands. Intrinsics ($f_x, f_y, c_x, c_y$) are calibrated once and stay put; extrinsics (rotation and translation) update every time the camera moves.
The matrix has 11 degrees of freedom, not 12, because it is defined only up to scale: the $\sim$ in $\mathbf{y} \sim \mathbf{C}\mathbf{x}$ is doing load-bearing work.
Calibration is how a camera learns its own K, typically by photographing a checkerboard and solving for the matrix that makes projections match, the foundation under stereo depth, AR, and SLAM.

A camera throws away a dimension, and we spend the rest of computer vision earning it back. The camera matrix is the receipt for that loss: twelve numbers, eleven of them real, that turn the violent flattening of the world into something we can invert, predict, and trust.

Get those numbers right and the wireframe snaps onto the scene like it belongs there. Get them wrong and it slides off the edge of the world. Either way, the lesson is the same one the surveyors knew, sighting steeples through a brass scope: to find where you are, first know exactly how you see.