Foundations · #00 of 16

What is Computer Vision?

Teaching Machines to See

Artist's rendering of NASA's Curiosity rover on the surface of Mars
NASA's Curiosity rover navigates Mars by its own eyes: light radioed from 225 million kilometres away is still, in the end, just a grid of numbers a machine must learn to read. — NASA on The Commons, No restrictions

In 1966 Seymour Papert at MIT assigned a summer project to a handful of students: connect a camera to a computer and have it describe what it saw. It was meant to take one summer.

That summer never really ended. Catching a thrown ball, recognising your mother's face across a crowded room, reading a stop sign through rain on a windshield: every one of these is, for a human, effortless and instant. For a machine, each is a research career.

A computer does not see a scene. It sees a grid of numbers, and the entire field of computer vision is the art of pulling meaning back out.

Here is the uncomfortable truth at the heart of it all. When a camera captures a stop sign, the computer does not receive "a red octagon." It receives a few million integers between 0 and 255, arranged in a rectangle. Computer vision is the discipline of turning that wall of numbers into descriptions a thinking process can act on: this is a sign, it says STOP, it is twelve metres ahead. The numbers are the easy part. The meaning is everything.

[Interactive demo: "input" pane on the left, "corners" pane on the right]

Do this: tap Camera (or upload a photo) and point it at something with sharp structure: a window frame, a checkerboard, the corner of a book. Look for the bright dots the right pane sprinkles onto the scene. Now slowly rotate or tilt the view. Notice that the dots cling to the same physical corners even as everything moves.

What you just watched is the first leap from numbers to meaning. The detector never knew "book" or "window." It scanned the grid of pixels for places where brightness changes sharply in two directions at once, the mathematical signature of a corner, and marked them. Those marked points are features: small, repeatable, computable anchors. Almost every higher trick in this course (matching two photos, reconstructing 3D, tracking an object across frames) is built on features like these. We turned a flat sea of intensities into a sparse set of things worth paying attention to.

A picture is a function of numbers

Strip away the romance and an image is a function. A grayscale picture assigns one brightness value to each location:

$$I(x, y) \in [0, 255]$$

Reading that aloud: $I$ is the image, a rule that takes a horizontal coordinate $x$ and a vertical coordinate $y$ and hands back a single number, the brightness at that spot. The square brackets mean the value lives between $0$ (pure black) and $255$ (pure white). A colour image is just three such functions stacked, one for red, one for green, one for blue, written $I(x, y) = (R, G, B)$.
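In code, that function is usually just an array lookup. The sketch below assumes NumPy and a made-up 4x6 image; the helper names `I` and `rgb_at` are illustrative, not part of any library. The one trap worth seeing early: arrays are indexed row first, so $I(x, y)$ becomes `gray[y, x]`.

```python
import numpy as np

# A tiny 4x6 grayscale image: 8-bit integers, 0 = black, 255 = white.
gray = np.array([
    [0, 0, 0, 255, 255, 255],
    [0, 0, 0, 255, 255, 255],
    [0, 0, 0, 255, 255, 255],
    [0, 0, 0, 255, 255, 255],
], dtype=np.uint8)

def I(x, y):
    """Brightness at column x, row y (arrays are indexed [row, column])."""
    return int(gray[y, x])

# A colour image stacks three such grids along a last axis: (rows, cols, 3).
# Here the red channel copies `gray`; green and blue are left at zero.
colour = np.stack([gray, np.zeros_like(gray), np.zeros_like(gray)], axis=-1)

def rgb_at(x, y):
    """The (R, G, B) triple at column x, row y."""
    return tuple(int(v) for v in colour[y, x])

print(gray.shape)        # (4, 6): 4 rows, 6 columns
print(I(0, 0), I(5, 0))  # 0 255
print(colour.shape)      # (4, 6, 3)
print(rgb_at(5, 0))      # (255, 0, 0)
```

Nothing in `gray` says "edge" or "object"; it is twenty-four integers and nothing more.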

That single line is why seeing is hard. The function $I$ has no idea what it depicts. There is no "cat" stored anywhere in it, only the brightness of light that happened to land on a sensor. Every algorithm in this book is, deep down, a clever way of asking questions of $I$: where does it change fastest? where does it repeat? which patches of it move together between two frames?
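The first of those questions, "where does it change fastest?", can be asked in a few lines. This is a minimal sketch assuming NumPy and a synthetic image with one vertical edge; `np.gradient` estimates the derivative with finite differences, which is one reasonable choice among several.

```python
import numpy as np

# Synthetic image: dark left half, bright right half (one vertical edge).
img = np.zeros((5, 8), dtype=float)
img[:, 4:] = 255.0

# np.gradient returns the derivative along axis 0 (rows, y) first,
# then along axis 1 (columns, x).
Iy, Ix = np.gradient(img)
magnitude = np.hypot(Ix, Iy)

# Column of strongest change along the middle row. Columns 3 and 4
# straddle the edge and tie; argmax reports the first of them.
fastest_x = int(np.argmax(magnitude[2]))
print(fastest_x)          # 3
print(float(Iy.max()))    # 0.0: nothing changes in the vertical direction
```

The answer comes back as a column index, not as the word "edge". Turning such answers into meaning is the rest of the course.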

From pixels to a pipeline

Most vision systems, classical or modern, march through the same broad stages. The names change, the shape rarely does.

  1. Acquisition. A lens focuses light onto a sensor; the sensor quantises it into the grid $I(x, y)$.
  2. Pre-processing. Tame the raw signal: reduce sensor noise, correct exposure, normalise contrast so later steps are not fooled by lighting.
  3. Feature extraction. Find the edges, corners, blobs, and textures that carry information, exactly what the sim above just did.
  4. Analysis. Group features into structure: this cluster is an object, those lines form a lane, that region is skin.
  5. Decision. Emit something the world can use: a label, a bounding box, a steering angle, an alarm.
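The five stages above can be sketched as five small functions. Everything here is a toy stand-in, assuming NumPy: the function names, the synthetic 8x8 frame, and the `0.25` threshold are illustrative choices, not a real system's design.

```python
import numpy as np

def acquire():
    """Stand-in for a camera: a dim 8x8 frame with a brighter 4x4 square."""
    frame = np.full((8, 8), 40, dtype=np.uint8)
    frame[2:6, 2:6] = 120
    return frame

def preprocess(frame):
    """Stretch contrast to [0, 1] so later steps ignore overall exposure."""
    f = frame.astype(float)
    span = f.max() - f.min()
    return (f - f.min()) / span if span > 0 else np.zeros_like(f)

def extract_features(img, thresh=0.25):
    """Mark pixels with a strong gradient magnitude (a crude edge map)."""
    gy, gx = np.gradient(img)
    return np.hypot(gx, gy) > thresh

def analyse(edges):
    """Group all edge pixels into one bounding box (x0, y0, x1, y1)."""
    ys, xs = np.nonzero(edges)
    if xs.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

def decide(box):
    """Emit something the world can use: here, a one-word label."""
    return "object" if box is not None else "empty"

box = analyse(extract_features(preprocess(acquire())))
print(box)          # (1, 1, 6, 6): one pixel wider than the square on each
                    # side, because central differences smear the edge
print(decide(box))  # object
```

Real pipelines replace each function with something far more capable, but the hand-off from one stage to the next looks much like this.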
A stop sign with two overlapping rectangles, a green ground-truth box and a red predicted box, illustrating Intersection over Union
Object detection scored by Intersection over Union: the green box is the human-labelled ground truth, the red box is the model's prediction. How well they overlap becomes a single number you can optimise. — Adrian Rosebrock, CC BY-SA 4.0
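Intersection over Union is simple enough to compute by hand. The sketch below assumes axis-aligned boxes written as `(x0, y0, x1, y1)` corner coordinates; that convention and the example boxes are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b

    # Width and height of the overlap, clamped to zero if the boxes miss.
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih

    # Union = both areas minus the overlap, which was counted twice.
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0

ground_truth = (0, 0, 10, 10)   # the green box
prediction = (5, 0, 15, 10)     # the red box, shifted half a width right

print(iou(ground_truth, prediction))    # 50 / 150 = 0.333...
print(iou(ground_truth, ground_truth))  # 1.0 for a perfect prediction
```

A score of 1 means the boxes coincide and 0 means they do not touch; everything in between is a matter of degree.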

Two cultures: computer vision and machine vision

The lab and the factory floor want different things from the same idea. Computer vision is the science: extract general descriptions of the world from images, robust to whatever the world throws at you. Machine vision is the engineering cousin: a controlled camera, controlled lighting, one job done fast and reliably, again and again, on an assembly line.

The Autovision II machine-vision system on display at a 1983 trade show, with a camera on a tripod over a light table and an image on a screen
Machine vision in 1983: an Automatix Autovision II at a trade show. A tripod camera points down at a backlit light table, and the resulting silhouette is processed on-screen by blob extraction, the same operation used in this course to isolate objects from their background. — ArnoldReinhold, CC BY-SA 3.0

Both lean on the same toolbox. The hand-detection demo below is pure machine-vision spirit: a backlit silhouette, a brightness threshold, and a blob detector that reports a single object with its outline and centre. It is the 1983 light table reborn in software.

Four-pane OpenCV demo performing background subtraction and blob detection on a hand silhouette, ending with a contour and bounding box
Background subtraction plus blob detection in OpenCV: a hand pressed against frosted glass is isolated from the lit background, thresholded to a clean white silhouette, then wrapped in a contour and bounding box. One object found. — Unknown, CC BY-SA 4.0
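The same light-table recipe fits in a short script. Note the swap: the figure uses OpenCV, whereas this sketch uses plain NumPy and a hand-written flood fill, so that every step is visible. The function name `find_blobs`, the synthetic frame, and the threshold of 50 are illustrative assumptions.

```python
from collections import deque
import numpy as np

def find_blobs(mask):
    """Group 4-connected True pixels; report area, bounding box, centre."""
    rows, cols = mask.shape
    seen = np.zeros_like(mask, dtype=bool)
    blobs = []
    for r0 in range(rows):
        for c0 in range(cols):
            if not mask[r0, c0] or seen[r0, c0]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            queue = deque([(r0, c0)])
            seen[r0, c0] = True
            pixels = []
            while queue:
                r, c = queue.popleft()
                pixels.append((r, c))
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and mask[nr, nc] and not seen[nr, nc]):
                        seen[nr, nc] = True
                        queue.append((nr, nc))
            rs = [p[0] for p in pixels]
            cs = [p[1] for p in pixels]
            blobs.append({
                "area": len(pixels),
                "bbox": (min(cs), min(rs), max(cs), max(rs)),  # x0, y0, x1, y1
                "centre": (sum(cs) / len(cs), sum(rs) / len(rs)),  # x, y
            })
    return blobs

# Backlit scene: bright background (200) with a dark 3x4 object on it.
background = np.full((6, 8), 200, dtype=np.uint8)
frame = background.copy()
frame[1:4, 2:6] = 30

# Background subtraction, then a brightness threshold.
diff = np.abs(frame.astype(int) - background.astype(int))
mask = diff > 50

blobs = find_blobs(mask)
print(len(blobs))   # 1
print(blobs[0])     # {'area': 12, 'bbox': (2, 1, 5, 3), 'centre': (3.5, 2.0)}
```

One object found, with its extent and centre, and still no notion of what the object is.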

Features: the idea that ties it together

Look again at the writing desk below. On the left, a perfectly ordinary photograph. On the right, the same photo with hundreds of small markers dropped onto it by a corner detector.

A grayscale photo of a writing desk on the left; on the right the same desk with yellow dots marking corner features found by the Harris detector
The Harris corner detector at work: the left pane is the raw grayscale photo, the right pane marks every point where image intensity bends sharply in two directions. Those points are stable enough to re-find in another photo of the same desk. — Dvortygirl, derivative by Kleliboux, CC BY-SA 3.0

A feature is any piece of an image worth singling out: a corner, an edge, a blob, a patch of distinctive texture. The reason features matter so much is repeatability. Photograph the desk from a slightly different angle, and a good detector lands on nearly the same physical corners. That stability is what lets a self-driving car know it is looking at the same lane line two frames later, or lets a phone stitch a panorama from overlapping shots. We will spend whole chapters of this course on better and better feature detectors, but the desk above is the whole idea in one image: find the few points that survive being looked at differently.

Why a corner is mathematically special

A corner is where image brightness changes in every direction you push. Slide a small window around a point and measure how much the patch underneath changes; this is captured by the second-moment matrix, built from the image gradients $I_x$ and $I_y$:

$$M = \sum_{(x,y) \in W} w(x,y) \begin{bmatrix} I_x^2 & I_x I_y \\[2pt] I_x I_y & I_y^2 \end{bmatrix}$$

Here $I_x$ and $I_y$ are how fast brightness changes left-right and up-down, $w(x,y)$ is a weighting window, and the sum runs over a small patch $W$. The two eigenvalues $\lambda_1, \lambda_2$ of $M$ tell the whole story: both small means a flat region (nothing happening), one large means an edge (change in one direction only), both large means a corner (change in two directions). Rather than compute eigenvalues directly, the Harris detector uses a cheap proxy:

$$R = \det(M) - k\,\big(\operatorname{trace}(M)\big)^2 = \lambda_1\lambda_2 - k\,(\lambda_1 + \lambda_2)^2$$

A large positive $R$ flags a corner, with $k$ a small constant (typically $0.04$ to $0.06$). The bright dots in the sim and the yellow markers on the writing desk are exactly the pixels where $R$ crosses a threshold. We derive this from scratch in the corner-detection chapter; for now, the takeaway is that "interesting" can be made into arithmetic.
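To see the three cases come out of the arithmetic, here is a minimal sketch of the Harris response on a synthetic white square, assuming NumPy. Several choices are simplifying assumptions rather than the detector's canonical form: the window $w(x,y)$ is a uniform 3x3 box (a Gaussian is more common), gradients come from `np.gradient`, $k$ is set to `0.05`, and the helper names are made up for this example.

```python
import numpy as np

def box_sum(a, radius=1):
    """Sum over each pixel's (2*radius+1)^2 window: w(x, y) = 1 on W."""
    size = 2 * radius + 1
    padded = np.pad(a, radius, mode="constant")
    out = np.zeros_like(a, dtype=float)
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + a.shape[0], dx:dx + a.shape[1]]
    return out

def harris_response(img, k=0.05, radius=1):
    """R = det(M) - k * trace(M)^2 at every pixel."""
    Iy, Ix = np.gradient(img.astype(float))
    # Entries of the second-moment matrix M, summed over the window.
    Sxx = box_sum(Ix * Ix, radius)
    Syy = box_sum(Iy * Iy, radius)
    Sxy = box_sum(Ix * Iy, radius)
    det = Sxx * Syy - Sxy ** 2
    trace = Sxx + Syy
    return det - k * trace ** 2

# White 8x8 square on black: flat interior, straight edges, four corners.
img = np.zeros((16, 16))
img[4:12, 4:12] = 1.0
R = harris_response(img)

print(R[4, 4] > 0)    # True: corner, both eigenvalues large
print(R[8, 4] < 0)    # True: middle of an edge, one eigenvalue large
print(R[8, 8] == 0)   # True: flat interior, both eigenvalues zero

# Keep only the strongest responses, as a detector's threshold would.
corners = np.argwhere(R > 0.9 * R.max())
print(len(corners))   # 4
```

The sign of $R$ does the sorting: positive at corners, negative along edges, zero where nothing happens.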

Key takeaways

Papert's students thought a single summer would be enough to teach a machine to look. Sixty years later, a rover on Mars steers itself across a dead red plain, a phone unlocks at a glance, and a car reads a stop sign through the rain, all of them quietly performing the impossible homework.

And underneath every one of them is the same humble move you made a moment ago, pointing a camera at a corner and watching a dot appear. A grid of numbers, and the long, beautiful labour of pulling meaning back out.