Document clustering

We can’t read thousands of documents by hand. Artificial intelligence (AI) is happy to group them for us, and we’d give up long before she does. She spots patterns, then drops each document into the bucket that fits it best.

Why grouping matters

Think of a messy downloads folder. We sort it once in a while so we can find stuff. Same idea, except she does it in a fraction of the time, even with millions of files. The result isn’t perfect, but it’s way better than staring at a haystack.

K-means

K-means is the “pick a number and divide” method. We tell her how many groups we want (that number is the k). She guesses where the k centers are, checks every document, and assigns it to the nearest center. Then she moves each center to the average of the documents assigned to it and repeats both steps until the assignments stop changing. It’s fast, but if we pick the wrong number of groups, we’re stuck with lopsided buckets.
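To make the loop concrete, here is a minimal sketch of k-means in plain Python. It assumes each document has already been turned into a list of numbers. The function name kmeans, the toy 2-D "document vectors", and the choice to start from k randomly picked documents are all our own illustration, not a reference implementation.

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    """Plain k-means on a list of equal-length vectors."""
    rng = random.Random(seed)
    # Guess the starting centers by picking k distinct points at random.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # nothing moved, so the centers have settled
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its old center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Toy "document vectors" (made up): two obvious groups in 2-D.
docs = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centers = kmeans(docs, k=2)
print(labels)
```

The first three documents land in one group and the last three in the other. Real embeddings have hundreds of dimensions, but the loop is the same.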

Hierarchical clustering

Hierarchical clustering builds a family tree. She starts with every document alone in its own group, then keeps merging the closest pair of groups until only one is left. The tree shows which documents are tight-knit and which are distant cousins. We can cut the tree at any level to decide how many groups we want. It’s slower than k-means, but more flexible, because we don’t have to pick the number of groups up front.
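The merge-the-closest-pair idea can be sketched in a few lines of plain Python. This is only an illustration under our own assumptions: the function name agglomerate is made up, "closest" is measured with average linkage (the mean distance between members of two groups), and stopping at num_clusters stands in for cutting the tree at that level. Other linkage rules would work too.

```python
import math

def agglomerate(points, num_clusters):
    """Bottom-up clustering; returns groups of point indices and the merge history."""
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    merges = []

    def linkage(a, b):
        # Average linkage: mean distance over all cross-group pairs.
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > num_clusters:
        # Find the closest pair of groups.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))  # record one branch of the tree
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return clusters, merges

# Toy "document vectors" (made up): two obvious groups in 2-D.
docs = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
clusters, merges = agglomerate(docs, num_clusters=2)
print(sorted(sorted(c) for c in clusters))
```

Calling agglomerate with a different num_clusters is the same as cutting the tree higher or lower. The pairwise search on every merge is also why this approach gets slow on big collections.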

Where this helps

Email archives, research papers, customer feedback. If we feed her the text, she will cluster it. The payoff is finding themes without digging by hand. We can scan summaries instead of scanning everything.

One coder’s thought

We like when code cleans up after us. Clustering feels like that. We don’t mind if the groups are a little off—better than doing it ourselves with sticky notes.

Funny quotations

Here’s a tiny console app to see how clustering works in practice. Suggest a topic (such as “quotes about work”) and the app shows a group of quotations that sit closest together in meaning.

Behind the scenes, we embedded every quote with MiniLM, ran k-means once, and saved the resulting clusters. At runtime, the AI turns our query into a vector with the same model, finds the nearest centroid, and shows us the quotations in that cluster.
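The runtime lookup might look roughly like the sketch below. Everything in it is a stand-in: the three-number vectors, the placeholder quote labels, and the names saved_clusters and nearest_cluster are hypothetical, and the query vector is hard-coded where the real app would call the embedding model. We also assume "nearest" means highest cosine similarity, which the app may or may not use.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical saved output of the one-off k-means run.
saved_clusters = [
    {"centroid": [0.9, 0.1, 0.0], "quotes": ["work quote A", "work quote B"]},
    {"centroid": [0.1, 0.9, 0.1], "quotes": ["cat quote A", "cat quote B"]},
]

def nearest_cluster(query_vector, clusters):
    """Return the saved cluster whose centroid is most similar to the query."""
    return max(clusters, key=lambda c: cosine(query_vector, c["centroid"]))

# Hypothetical embedding of a query such as "quotes about work".
query_vector = [0.8, 0.2, 0.1]
for quote in nearest_cluster(query_vector, saved_clusters)["quotes"]:
    print(quote)
```

Since the clusters are computed ahead of time, the only work at runtime is one embedding call and one comparison per centroid, which keeps the console app snappy.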