KMeans text clustering

Given a set of text documents, we can group them automatically; this is called text clustering. We’ll use KMeans, an unsupervised machine learning algorithm.

I’ve collected some articles about cats and Google. You’ve guessed it: the algorithm will sort them into clusters. The articles could be about anything; the clustering algorithm groups them automatically, without any labels. Even cooler: once the clusters exist, the model can predict which cluster a new document belongs to.

The data

We create the documents using a Python list. In our example, documents are simply short text strings. In a real-world situation, they may be large files.
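
As a minimal sketch, such a list might look like the one below. The sentences are illustrative stand-ins that I’m assuming for the example, not the original articles.

```python
# Illustrative documents (assumed for this sketch): three about cats,
# three about Google products.
documents = [
    "The cat sat on the warm windowsill.",
    "My cat chased the kitten.",
    "A sleepy kitten and a sleepy cat.",
    "Google Translate is impressive.",
    "Google Maps added a feedback button.",
    "A Chrome extension from Google.",
]

print(len(documents), "documents")
```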

Feature extraction

KMeans works on numeric vectors, not raw text, so we first need to turn each document into numbers. This common step is known as feature extraction.

The feature we’ll use is TF-IDF, a numerical statistic that combines term frequency (how often a word occurs in a document) with inverse document frequency (how rare the word is across all documents). In short: we use statistics to get numerical features. Because I’m lazy, we’ll use the existing implementation of TF-IDF in sklearn.
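
To make the statistic concrete, here is a rough by-hand sketch of the textbook form, tf × log(N / df). The helper name tf_idf and the whitespace tokenizer are assumptions for illustration. sklearn’s TfidfVectorizer uses a smoothed IDF and normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Textbook TF-IDF: (count / doc length) * log(N / document frequency)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = ["cat purrs", "cat sleeps", "google search"]
weights = tf_idf(docs)

# "cat" occurs in two of the three documents, "purrs" in only one,
# so "purrs" gets the higher weight in the first document.
print(weights[0])
```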

The class TfidfVectorizer implements TF-IDF. Briefly, it converts a collection of raw documents to a matrix of TF-IDF features, with one row per document and one column per term.

Text clustering

After we have numerical features, we initialize the KMeans algorithm with K=2. If you want to determine K automatically, see the previous article. We’ll then print the top words per cluster.

Then we get to the cool part: we give a new document to the clustering algorithm and let it predict which cluster the document belongs to. In the code below I’ve done that twice.
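
A hedged sketch of prediction, again on assumed example documents. The important detail is that new text goes through the already fitted vectorizer with transform(), not fit_transform(), so it lands in the same feature space as the training data.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative documents (assumed for this sketch).
documents = [
    "The cat sat on the warm windowsill.",
    "My cat chased the kitten.",
    "A sleepy kitten and a sleepy cat.",
    "Google Translate is impressive.",
    "Google Maps added a feedback button.",
    "A Chrome extension from Google.",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Two new documents the model has not seen before.
new_docs = ["My cat is hungry.", "Google Chrome keeps crashing."]
Y = vectorizer.transform(new_docs)  # reuse the fitted vocabulary
predictions = model.predict(Y)

for doc, cluster in zip(new_docs, predictions):
    print(f"{doc!r} -> cluster {cluster}")
```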


  • How do I find the distance of the test data from all of the centroids? Any kind of a distance matrix? Also, how do I identify the documents that are neither cat related nor Google related? (The distance is more than a certain distance from either of the clusters)

    • ninja says:

      The fitted KMeans model exposes the centroids (cluster_centers_), so you can compare new documents against them. To get more categories, increase k: true_k is set to 2 (cats and Google); set it to the number of categories you want.
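
One way to get the distance matrix asked about above, as a rough sketch: calling transform() on a fitted KMeans model returns the Euclidean distance from each document to each centroid. The documents and the MAX_DISTANCE cutoff are hypothetical and would need tuning on real data. A document with no known words becomes an all-zero vector, which can sit deceptively close to the centroids, so the sketch flags that case separately.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative documents (assumed for this sketch).
documents = [
    "The cat sat on the warm windowsill.",
    "My cat chased the kitten.",
    "A sleepy kitten and a sleepy cat.",
    "Google Translate is impressive.",
    "Google Maps added a feedback button.",
    "A Chrome extension from Google.",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

new_docs = [
    "My cat is hungry.",
    "Google Chrome keeps crashing.",
    "Stock markets fell sharply today.",
]
Y = vectorizer.transform(new_docs)

# Distance matrix: one row per new document, one column per centroid.
distances = model.transform(Y)
nearest = distances.argmin(axis=1)

# Rows that are all zero share no vocabulary with the training documents.
has_known_terms = Y.toarray().any(axis=1)

MAX_DISTANCE = 1.0  # hypothetical cutoff, not a recommended value
is_outlier = ~has_known_terms | (distances.min(axis=1) > MAX_DISTANCE)

for doc, row, cluster, outlier in zip(new_docs, distances, nearest, is_outlier):
    label = "neither" if outlier else f"cluster {cluster}"
    print(f"{doc!r}: distances={row.round(2)} -> {label}")
```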
