
Color Tracking

Following Objects

A grid of video frames showing a Formula 1 car, a snake, a panda, a surfer, and a tiger, each with its silhouette traced in red across successive frames — Object co-segmentation: the same moving subject located frame after frame across very different scenes. — Bikingdog, CC BY-SA 4.0

A child watching a red balloon does not solve a system of equations. The balloon drifts behind a tree, reappears, tumbles, catches the light from a new angle, and the small head turns to follow it the whole time. No template. No memory of the balloon's exact shape. Just one stubborn rule: keep your eyes on the red thing.

That rule is older than computers and almost embarrassingly cheap to run. Strip away texture, edges, gradients, and deep nets, and a surprising amount of "follow that object" survives on a single cue: color.

Before you track anything clever, learn to track the simplest thing reliably: a patch of a color that nothing else in the frame happens to share.

The trick is not "look for red pixels." It is choosing a way to describe color that survives the real world: a passing cloud, a flickering bulb, a shadow sliding across the object. We do that by switching the language we use to talk about color, then drawing a fence around the region of that language our object lives in. The fence is called a color threshold, and everything else in this chapter is just cleaning up after it.

Drag the hue range and the saturation/value thresholds below until the mask isolates one color and nothing else. Watch what happens to the white blob when you widen the hue too far: the background leaks in. Then tap 📷 Camera, point it at a brightly colored object, and walk it into shadow. Notice the mask holding (or not).

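[Interactive demo: a color scene with its tracked-hue mask. Controls: target hue (120°), hue range (±20°), min saturation (0.30), min value (0.20), min blob size (25 px), and an option to clean up the mask with a morphological open (erode → dilate). The input can be an uploaded image or the camera.]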

What you just felt is the whole game. A loose fence catches the object but also the carpet; a tight fence drops the object the instant the light shifts. Color tracking is the art of placing that fence in a color space where lighting moves the object as little as possible. That space, almost always, is HSV.

The mask itself is a per-pixel test. A pixel survives if all three of its HSV channels fall inside the chosen bounds:

M(x,y) = \begin{cases} 1 & \text{if } H_{\text{lo}} \le H(x,y) \le H_{\text{hi}} \;\wedge\; S_{\text{lo}} \le S(x,y) \le S_{\text{hi}} \;\wedge\; V_{\text{lo}} \le V(x,y) \le V_{\text{hi}} \\[2pt] 0 & \text{otherwise} \end{cases}

Here $M(x,y)$ is the binary mask (white where the object is, black elsewhere). $H$, $S$, $V$ are the hue, saturation, and value of the pixel at $(x,y)$. The six bounds with subscripts $\text{lo}$ and $\text{hi}$ are the sliders you just dragged: the lower and upper walls of the fence on each channel. The symbol $\wedge$ means logical AND, so a pixel must clear all three tests at once to count.
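To make the test concrete, here is a minimal Python sketch of that per-pixel AND on a tiny hand-made image. The `Fence` type, the `make_mask` name, and the units (hue in degrees, saturation and value from 0 to 1, as on the demo sliders) are assumptions for illustration, not part of any library. A routine such as OpenCV's `cv2.inRange` performs the same kind of range test on whole arrays.

```python
from typing import NamedTuple


class Fence(NamedTuple):
    """The six walls of the threshold: a (lo, hi) pair per HSV channel."""
    h_lo: float
    h_hi: float
    s_lo: float
    s_hi: float
    v_lo: float
    v_hi: float


def make_mask(hsv_image, fence):
    """Return a binary mask: 1 where all three channel tests pass, else 0."""
    return [
        [
            int(
                fence.h_lo <= h <= fence.h_hi
                and fence.s_lo <= s <= fence.s_hi
                and fence.v_lo <= v <= fence.v_hi
            )
            for (h, s, v) in row
        ]
        for row in hsv_image
    ]


# Demo defaults: hue 120 +/- 20 degrees, min saturation 0.30, min value 0.20.
fence = Fence(100, 140, 0.30, 1.0, 0.20, 1.0)

# A 2x3 image of (hue in degrees, saturation, value) triples.
hsv_image = [
    [(120, 0.9, 0.8), (120, 0.1, 0.8), (30, 0.9, 0.8)],
    [(135, 0.5, 0.1), (101, 0.4, 0.3), (120, 0.9, 0.25)],
]
mask = make_mask(hsv_image, fence)
```

Only three pixels survive here: the washed-out green fails on saturation, the orange fails on hue, and the dark green fails on value. A pixel that misses any one wall is out.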

Why HSV beats RGB for color

In RGB, a single real-world color is smeared across all three channels at once. A red apple in bright sun and the same apple in shade are both "red" to your eye, yet their RGB triples can be wildly different, because dimming the light scales red, green, and blue together. You cannot draw a tidy box around "red" in RGB without it being either too greedy or too timid.

HSV untangles this. It splits color into three intuitive axes: Hue is the color itself (red, orange, cyan), Saturation is how pure or washed-out it is, and Value is how bright it is. The payoff is that turning the lights down mostly moves a pixel along the Value axis while leaving Hue almost untouched. So you can fence "red" with a narrow band of Hue and let Value roam, and your fence survives shadows.
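A quick way to see this is with Python's standard `colorsys` module. The sketch below assumes an idealized shadow that simply scales all three RGB channels by the same factor; the apple's RGB values and the 0.4 dimming factor are made up for illustration. Real shadows also shift color slightly and add noise, which is why hue is only "almost" untouched in practice.

```python
import colorsys


def hsv_degrees(r, g, b):
    """Convert RGB in [0, 1] to (hue in degrees, saturation, value)."""
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    return h * 360.0, s, v


sunlit = (0.8, 0.2, 0.1)                   # hypothetical red apple in sun
shaded = tuple(0.4 * c for c in sunlit)    # same apple with 40% of the light

h_sun, s_sun, v_sun = hsv_degrees(*sunlit)
h_shade, s_shade, v_shade = hsv_degrees(*shaded)

print(f"sun:   H={h_sun:.1f}  S={s_sun:.3f}  V={v_sun:.3f}")
print(f"shade: H={h_shade:.1f}  S={s_shade:.3f}  V={v_shade:.3f}")
```

Under this idealized dimming every RGB channel changes, yet hue and saturation come out the same and only value drops.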

In OpenCV the ranges are quirky for a practical reason: Hue is stored in $0$–$179$ (degrees halved, so the full $360°$ color wheel fits in a single byte), while Saturation and Value run $0$–$255$.
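Translating slider values into those units is a small piece of arithmetic, sketched below. The helper names are hypothetical, and the code assumes integer degrees and a half-width under 180°. Because hue is circular, a band centered near red crosses the seam where 179 rolls over to 0, so the helper returns two ranges in that case, to be OR-ed together.

```python
def hue_ranges_opencv(center_deg, half_width_deg):
    """Return inclusive (lo, hi) hue ranges in OpenCV units (0-179).

    Assumes integer degrees and half_width_deg < 180. A band that crosses
    the 0/360 seam comes back as two ranges to be OR-ed together.
    """
    lo = ((center_deg - half_width_deg) % 360) // 2
    hi = ((center_deg + half_width_deg) % 360) // 2
    if lo <= hi:
        return [(lo, hi)]
    return [(0, hi), (lo, 179)]


def unit_to_byte(x):
    """Scale a saturation or value in [0, 1] to the 0-255 byte range."""
    return round(x * 255)


def hue_in_ranges(h, ranges):
    """True if an OpenCV-scale hue falls in any of the ranges (logical OR)."""
    return any(lo <= h <= hi for lo, hi in ranges)


green = hue_ranges_opencv(120, 20)   # the demo's default fence
red = hue_ranges_opencv(0, 10)       # a band centered on pure red
print(green, red)
```

The green fence is a single band, while the red one splits into a piece near 0 and a piece near 179.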

From mask to motion: the tracking pipeline

A mask is a snapshot. Tracking is what turns a sequence of masks into a moving point. The classic color pipeline is five steps that repeat every frame:

Convert the frame from RGB to HSV (OpenCV delivers frames in BGR order, so use the BGR-to-HSV conversion there).
Threshold with the HSV fence to make a binary mask.
Clean the mask with morphology (next section).
Find contours in the mask, the closed outlines of the white blobs.
Pick the largest contour as the object and report its center.

The object's position is the centroid of that winning blob, computed from image moments:

\bar{x} = \frac{M_{10}}{M_{00}}, \qquad \bar{y} = \frac{M_{01}}{M_{00}}

$M_{00}$ is the zeroth moment, literally the count of white pixels (the area). $M_{10}$ and $M_{01}$ are the first moments, the sums of the $x$ and $y$ coordinates of those pixels. Dividing the coordinate-sum by the pixel-count gives the average position: the blob's balance point. Draw a crosshair there and you have a tracker.
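The last two pipeline steps and the centroid formula can be sketched in plain Python on a tiny mask. One substitution to note: instead of tracing contours, this sketch groups white pixels into 4-connected blobs with a flood fill, which is a simpler stand-in that picks the same "largest blob" on a clean mask. The function names are illustrative only.

```python
from collections import deque


def find_blobs(mask):
    """Group white pixels into 4-connected blobs; return lists of (x, y)."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    blobs = []
    for y0 in range(height):
        for x0 in range(width):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            seen[y0][x0] = True
            queue, blob = deque([(x0, y0)]), []
            while queue:
                x, y = queue.popleft()
                blob.append((x, y))
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    inside = 0 <= nx < width and 0 <= ny < height
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            blobs.append(blob)
    return blobs


def centroid(blob):
    """Centroid from moments: (M10 / M00, M01 / M00) of a non-empty blob."""
    m00 = len(blob)                    # zeroth moment: pixel count
    m10 = sum(x for x, _ in blob)      # first moment in x
    m01 = sum(y for _, y in blob)      # first moment in y
    return m10 / m00, m01 / m00


mask = [
    [1, 0, 0, 0, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0],
]
blobs = find_blobs(mask)
largest = max(blobs, key=len)          # step 5: the biggest blob wins
cx, cy = centroid(largest)
print(len(blobs), len(largest), (cx, cy))
```

The stray pixel in the corner forms its own one-pixel blob and loses to the six-pixel block, whose balance point sits at x = 3.0, y = 1.5.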

The image above this chapter is a vivid illustration of the harder cousin of this problem. Each row is a single subject (a Formula 1 car, a snake, a panda, a surfer, a tiger) located across six frames, its silhouette traced in red. Color alone could lock onto the panda's white or the car's scarlet, but the snake against foliage and the tiger against tall grass show exactly where pure color cues break down.

Eight rows of video frames, each tracking one subject (race car, snake, panda, lobster-red figure, surfer, tiger) with a red outline across the sequence — Object co-segmentation across video: easy when the subject's color is unique to the scene, hard when foreground and background share a palette. — Bikingdog, CC BY-SA 4.0

Cleaning the mask with morphology

Thresholding is never tidy. A real mask is freckled with stray white specks (sensor noise, glints) and pocked with black holes (a specular highlight on the object that read as "too bright"). Morphological operations scrub this up using a small sliding window called a structuring element.

The two primitives are erosion and dilation. Erosion shrinks white regions, eating away their borders and so deleting tiny specks. Dilation grows white regions, swelling them to fill small pinholes. Chaining them yields the two workhorses:

Opening is erosion then dilation: it removes small noise but restores the object to roughly its original size.
Closing is dilation then erosion: it fills holes inside the object without bloating its outline.

A typical color tracker does an open-then-close: kill the speckle, then heal the holes, and the blob you finally hand to the contour finder is one clean island instead of an archipelago.
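Here is a minimal pure-Python sketch of the four operations with a 3×3 square structuring element. Two choices are assumptions of this sketch rather than anything the chapter specifies: the element size, and treating pixels outside the image as black. Libraries such as OpenCV offer their own versions with configurable elements and border handling.

```python
def _neighbourhood(mask, x, y):
    """The 3x3 window around (x, y); pixels outside the image count as 0."""
    height, width = len(mask), len(mask[0])
    return [
        mask[ny][nx] if 0 <= nx < width and 0 <= ny < height else 0
        for ny in (y - 1, y, y + 1)
        for nx in (x - 1, x, x + 1)
    ]


def erode(mask):
    """A pixel stays white only if its whole 3x3 window is white."""
    return [
        [int(all(_neighbourhood(mask, x, y))) for x in range(len(mask[0]))]
        for y in range(len(mask))
    ]


def dilate(mask):
    """A pixel turns white if any pixel in its 3x3 window is white."""
    return [
        [int(any(_neighbourhood(mask, x, y))) for x in range(len(mask[0]))]
        for y in range(len(mask))
    ]


def opening(mask):
    return dilate(erode(mask))    # erosion then dilation: removes specks


def closing(mask):
    return erode(dilate(mask))    # dilation then erosion: fills pinholes


# An 8x8 mask holding a solid 4x4 blob, then two damaged copies of it.
solid = [[int(2 <= x <= 5 and 2 <= y <= 5) for x in range(8)] for y in range(8)]
speckled = [row[:] for row in solid]
speckled[0][7] = 1                # a stray white speck in the corner
holed = [row[:] for row in solid]
holed[3][3] = 0                   # a one-pixel hole inside the blob

print(opening(speckled) == solid, closing(holed) == solid)
```

The two repairs are shown on separate masks because the grid is so small: opening the holed 4×4 blob with this element would erase it entirely, since no pixel would keep a fully white window. The structuring element has to be small relative to the object.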

Where color tracking breaks

Color is cheap, and cheap has a price. Four failure modes haunt every color tracker:

Two of the same color. A second red object in frame and the "largest blob" rule flips between them, or worse, they merge into one phantom centroid halfway between.
Lighting drift. A cloud passes, the auto-exposure kicks in, and the hue you fenced is now just outside the wall. The mask blinks out.
Occlusion. The object slips behind something. The blob vanishes, the area drops to zero, and the centroid is undefined until it reappears.
A matching background. Point a green tracker at a lawn and everything is "the object."

The professional answer is not to abandon color but to fuse it with other cues (motion, shape, a learned appearance model) so that no single failure sinks the track.
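Short of full fusion, even a bare color tracker should not crash or jump when the blob disappears. The sketch below shows one simple guard, which is an assumption here rather than a method the chapter prescribes: if the blob is empty or smaller than a minimum area (25 px, borrowed from the demo's default), keep the last known position and flag the object as not visible. The `update_track` name and its return shape are hypothetical.

```python
def update_track(blob, last_position, min_area=25):
    """Return (position, visible) for one frame.

    blob is a list of (x, y) white pixels for the chosen blob. When it is
    empty or too small, the centroid is undefined or untrustworthy, so the
    previous position is held instead.
    """
    area = len(blob)
    if area == 0 or area < min_area:
        return last_position, False
    cx = sum(x for x, _ in blob) / area
    cy = sum(y for _, y in blob) / area
    return (cx, cy), True


visible_blob = [(x, y) for y in range(5) for x in range(6)]   # 30 pixels
speck = [(9, 9), (9, 10)]                                     # noise only

position = None
history = []
for blob in (visible_blob, [], speck, visible_blob):
    position, visible = update_track(blob, position)
    history.append((position, visible))
print(history)
```

The track holds at (2.5, 2.0) through the empty frame and the noise-only frame, then resumes when the object returns. Holding position is only sensible for brief dropouts; it does nothing for the duplicate-color or matching-background cases.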

For the advanced reader → The full moment math behind the centroid and the object's tilt

The raw image moment of order $(p, q)$ over the binary mask $M$ is a weighted sum over every pixel:

M_{pq} = \sum_{x}\sum_{y} x^{p}\, y^{q}\, M(x,y)

Setting $(p,q) = (0,0)$ gives $M_{00}$, the total white area. The centroid follows from the first moments as shown earlier, $(\bar{x}, \bar{y}) = (M_{10}/M_{00},\; M_{01}/M_{00})$.

To recover the object's orientation, shift to central moments, which measure spread about the centroid and so are translation invariant:

\mu_{pq} = \sum_{x}\sum_{y} (x - \bar{x})^{p}\, (y - \bar{y})^{q}\, M(x,y)

The blob's tilt angle is then read off the second-order central moments:

\theta = \frac{1}{2}\,\arctan\!\left( \frac{2\,\mu_{11}}{\mu_{20} - \mu_{02}} \right)

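where $\mu_{20}$ and $\mu_{02}$ are proportional to the variances along $x$ and $y$, and $\mu_{11}$ to the covariance (dividing each by $M_{00}$ would normalize them, but the ratio inside the arctangent is unchanged). In practice, evaluate the arctangent with the two-argument form $\operatorname{atan2}(2\mu_{11},\, \mu_{20} - \mu_{02})$ so that the quadrant is resolved and the case $\mu_{20} = \mu_{02}$ does not divide by zero. With $(\bar{x}, \bar{y})$, the area $M_{00}$, and $\theta$, you have not just where the object is but how big it appears (a crude depth cue, since area shrinks with distance) and which way it is turned, all from summing over the very mask the sliders produced.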
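The whole derivation fits in a few lines of Python. This is a sketch under stated assumptions: the mask is a list of rows holding 0 or 1, the function names are illustrative, and `math.atan2` stands in for the plain arctangent so that a vertical blob and the equal-variance case both come out right. Angles are in image coordinates, where $y$ grows downward.

```python
import math


def raw_moment(mask, p, q):
    """M_pq: sum of x^p * y^q over the mask, weighted by pixel value."""
    return sum(
        (x ** p) * (y ** q) * v
        for y, row in enumerate(mask)
        for x, v in enumerate(row)
    )


def central_moment(mask, p, q, cx, cy):
    """mu_pq: the same sum with coordinates measured from the centroid."""
    return sum(
        ((x - cx) ** p) * ((y - cy) ** q) * v
        for y, row in enumerate(mask)
        for x, v in enumerate(row)
    )


def describe_blob(mask):
    """Return (cx, cy, area, theta) for a binary mask, or None if empty."""
    m00 = raw_moment(mask, 0, 0)
    if m00 == 0:
        return None
    cx = raw_moment(mask, 1, 0) / m00
    cy = raw_moment(mask, 0, 1) / m00
    mu20 = central_moment(mask, 2, 0, cx, cy)
    mu02 = central_moment(mask, 0, 2, cx, cy)
    mu11 = central_moment(mask, 1, 1, cx, cy)
    theta = 0.5 * math.atan2(2 * mu11, mu20 - mu02)
    return cx, cy, m00, theta


horizontal = [[0] * 5, [1] * 5, [0] * 5]                      # flat bar
diagonal = [[int(x == y) for x in range(4)] for y in range(4)]
vertical = [[0, 1, 0] for _ in range(5)]                      # upright bar

for name, mask in (("horizontal", horizontal),
                   ("diagonal", diagonal),
                   ("vertical", vertical)):
    cx, cy, area, theta = describe_blob(mask)
    print(name, (cx, cy), area, round(math.degrees(theta), 1))
```

The flat bar reports a tilt of 0, the diagonal about 45°, and the upright bar about 90°. With a single-argument arctangent the upright bar would also come out as 0, and the diagonal would divide by zero.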

Key takeaways

Switch color spaces first. RGB tangles lighting into every channel; HSV isolates the color itself in Hue so your fence survives shadows and bright spots.
The mask is a per-pixel AND of three range tests on Hue, Saturation, and Value. The sliders are literally the six walls of that fence.
Red wraps the hue seam, so it needs two ranges (near $0$ and near $179$) OR-ed together. Forgetting this is a classic color-tracking bug: a single range around red silently misses half of the reds.
Morphology cleans the mask: opening kills speckle, closing fills holes, and the result is one clean blob whose centroid $(M_{10}/M_{00},\, M_{01}/M_{00})$ is the tracked point.
Color alone is brittle against duplicate colors, lighting drift, occlusion, and matching backgrounds. Real systems fuse it with motion, shape, or histogram methods like mean shift.

The child never stops watching the red balloon, and neither does the machine, as long as nothing else in the room is red. We started with a single stubborn rule, keep your eyes on the colored thing, and found it hiding inside three comparisons per pixel, a fence in a kinder color space, and a little arithmetic on where the white blob balances.

It is the humblest tracker there is, and that is exactly why it matters: every grand vision system, somewhere near the bottom, is still just following a color it decided to care about.