Image captioning

We want computers to look at a picture and tell us what’s in it. That’s image captioning: a model takes an image and hands us back a sentence describing it. Think of it as auto-alt text, except smarter and less embarrassing.

Encoder–decoder models

The usual setup is called an encoder–decoder. The encoder looks at the image, chops it into numbers, and builds a kind of summary. The decoder takes that summary and spits out words, one at a time. We can imagine her whispering to herself: “cat… sitting… couch…” until a caption comes out.

Do this so users can skip guessing what a blurry thumbnail means.
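Here is a minimal sketch of that setup in PyTorch. The name TinyCaptioner, the toy CNN encoder, the LSTM decoder, and all the dimensions are illustrative assumptions, not a production model.

```python
# A minimal encoder–decoder captioner sketch (assumed names and sizes).
import torch
import torch.nn as nn

class TinyCaptioner(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Encoder: a small CNN that squeezes the image into one summary vector.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, hidden_dim),
        )
        # Decoder: an LSTM that spits out one word at a time.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.to_vocab = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # Summarize the image, then use that summary as the decoder's starting state.
        summary = self.encoder(images)             # (batch, hidden_dim)
        h0 = summary.unsqueeze(0)                  # (1, batch, hidden_dim)
        c0 = torch.zeros_like(h0)
        words = self.embed(captions)               # (batch, seq, embed_dim)
        out, _ = self.lstm(words, (h0, c0))
        return self.to_vocab(out)                  # word logits at each position

model = TinyCaptioner(vocab_size=10_000)
images = torch.randn(2, 3, 224, 224)
captions = torch.randint(0, 10_000, (2, 12))
logits = model(images, captions)                   # (2, 12, 10_000)
```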

Why attention helps

Plain encoder–decoder is fine, but it acts like she’s squinting at the whole image at once. Attention changes that. It lets her focus on one region at a time while choosing the next word. She can stare at the couch when saying “couch” and swing over to the cat for “cat.” The captions sound less like bad fortune cookies and more like something we’d actually write.

Rule of thumb: let her look where the action is.
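Concretely, attention means the encoder keeps a grid of region features instead of one summary, and at each decoding step the model weighs those regions against its current state. A tiny dot-product version, with assumed shapes and names, looks like this.

```python
# Dot-product attention over image regions at one decoding step (a sketch).
import torch
import torch.nn.functional as F

def attend(regions, decoder_state):
    # regions:       (batch, num_regions, dim)  e.g. a 7x7 grid flattened to 49
    # decoder_state: (batch, dim)               the decoder's current hidden state
    scores = torch.bmm(regions, decoder_state.unsqueeze(2)).squeeze(2)  # (batch, num_regions)
    weights = F.softmax(scores, dim=1)           # how hard to stare at each region
    context = torch.bmm(weights.unsqueeze(1), regions).squeeze(1)       # (batch, dim)
    return context, weights

regions = torch.randn(2, 49, 512)                # 49 regions per image
state = torch.randn(2, 512)                      # decoder about to pick the next word
context, weights = attend(regions, state)        # context feeds the next word prediction
```

The weights are exactly the "where is she looking" map: big weight on the couch regions when the next word is "couch".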

Training her to talk

She doesn’t come knowing language. We train her on huge piles of photos with human-written captions. Each time she guesses wrong, we nudge her weights. Over time, her guesses turn from “dog airplane tree” into “a brown dog playing in the park.”

Do this so she starts speaking like us, not like a random word generator.
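A toy training step, reusing the TinyCaptioner sketch and tensors from above, shows what "nudging her weights" means in practice: predict each word from the ones before it, compare against the human caption, and backpropagate. The optimizer choice and learning rate are assumptions.

```python
# One toy training step with teacher forcing (assumes the sketch above).
import torch
import torch.nn as nn

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

def train_step(images, captions):
    logits = model(images, captions[:, :-1])      # her guess at each next word
    targets = captions[:, 1:]                     # the word a human actually wrote
    loss = loss_fn(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()                               # the nudge
    optimizer.step()
    return loss.item()

loss = train_step(images, captions)               # repeat over huge piles of captioned photos
```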

Why we care

For us coders, it’s a clean example of vision meeting language. For users, it’s the difference between clicking a picture blind and knowing what’s inside.

We can’t help thinking: if she can describe a cat on a couch, how long before she critiques our messy desks?