KMeans text clustering

Given a set of text documents, we can group them automatically; this is called text clustering. We’ll use KMeans, an unsupervised machine learning algorithm.

I’ve collected some short texts about cats and Google. You’ve guessed it: the algorithm will group them into clusters. The texts can be about anything; the clustering algorithm finds the groups automatically, without labels. Even cooler: once the model is fitted, it can predict the cluster of a new document.


KMeans

We create the documents using a Python list. In our example, documents are simply text strings that fit on the screen. In a real world situation, they may be big files.

 
documents = ["This little kitty came to play when I was eating at a restaurant.",
             "Merley has the best squooshy kitten belly.",
             "Google Translate app is incredible.",
             "If you open 100 tab in google you get a smiley face.",
             "Best cat photo I've ever taken.",
             "Climbing ninja cat.",
             "Impressed with google map feedback.",
             "Key promoter extension for Google Chrome."]

Feature extraction

KMeans works on numeric vectors, so the text first has to be turned into numbers. This common step is known as feature extraction.

The feature we’ll use is TF-IDF, a numerical statistic that combines term frequency (how often a word occurs in a document) with inverse document frequency (how rare the word is across all documents). Because I’m lazy, we’ll use the existing TF-IDF implementation in sklearn.

The class TfidfVectorizer implements the TF-IDF algorithm. Briefly, its fit_transform() method converts a collection of raw documents to a matrix of TF-IDF features.

Text clustering

After we have numerical features, we initialize the KMeans algorithm with K=2. If you want to determine K automatically, see the previous article. We’ll then print the top words per cluster.

Then we get to the cool part: we give a new document to the fitted model and let it predict which cluster the document belongs to. In the code below I’ve done that twice.

 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

documents = ["This little kitty came to play when I was eating at a restaurant.",
             "Merley has the best squooshy kitten belly.",
             "Google Translate app is incredible.",
             "If you open 100 tab in google you get a smiley face.",
             "Best cat photo I've ever taken.",
             "Climbing ninja cat.",
             "Impressed with google map feedback.",
             "Key promoter extension for Google Chrome."]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

true_k = 2
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)

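print("Top terms per cluster:")
# argsort() sorts ascending; [:, ::-1] reverses each row so the heaviest terms come first
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
terms = vectorizer.get_feature_names_out()
for i in range(true_k):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()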

print("\n")
print("Prediction")

Y = vectorizer.transform(["chrome browser to open."])
prediction = model.predict(Y)
print(prediction)

Y = vectorizer.transform(["My cat is hungry."])
prediction = model.predict(Y)
print(prediction)


terms ['100', 'app', 'belly', 'best', 'came', 'cat', 'chrome', 'climbing', 'eating', 'extension', 'face', 'feedback', 'google', 'impressed', 'incredible', 'key', 'kitten', 'kitty', 'little', 'map', 'merley', 'ninja', 'open', 'photo', 'play', 'promoter', 'restaurant', 'smiley', 'squooshy', 'tab', 'taken', 'translate', 've']

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

documents = ["This little kitty came to play when I was eating at a restaurant.",
             "Merley has the best squooshy kitten belly.",
             "Google Translate app is incredible.",
             "If you open 100 tab in google you get a smiley face.",
             "Best cat photo I've ever taken.",
             "Climbing ninja cat.",
             "Impressed with google map feedback.",
             "Key promoter extension for Google Chrome."]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

true_k = 2
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)

print("Prediction")
Y = vectorizer.transform(["chrome browser to open."])
prediction = model.predict(Y)
print(prediction)

Leave a Reply:

Ayyavu M • Wed, 18 Mar 2020

Please tell me how to proceed with a lot of text documents and how to import them. Thank you.

dev • Wed, 18 Mar 2020

Load every text document into the list documents. You can use pathlib to get every file in the directory, then read each file as one element of the list documents.

So remove the documents list, and replace it with this:


from pathlib import Path

documents = []
path = Path("./files/")
for x in path.iterdir():
    if not x.is_file():
        continue  # skip subdirectories, open() would fail on them
    with open(x, 'r') as myfile:
        data = myfile.read()
    documents.append(data)

where all your files are in the directory `files` (a subdirectory of the directory you run the program from).

Vibhor Sharma • Thu, 07 May 2020

I have a few doubts:
1) What is the meaning of [:, ::-1]?
2) On changing the second query to "My cat is hungry and wants to eat apple.", why do the words in the 2 clusters change, given that the clusters are formed before the query is processed?
3) On setting the number of clusters to 3, why does the same word 'climbing' show up in two different clusters? That should not be possible.
Please reply asap.

dev • Sat, 09 May 2020

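1. It is NumPy array slicing: the first : keeps every row (cluster) and ::-1 reverses the columns, so the ascending order from argsort() becomes descending.
2. The query doesn't change the clusters. KMeans starts from random centroids, so every run of the script can produce different clusters; pass random_state to KMeans to keep them fixed.
3. The output lists the 10 highest-weighted terms per cluster. With so few documents a cluster can have fewer than 10 terms with a non-zero weight, so the list is padded with zero-weight terms, and the same word can show up in more than one list.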
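The slicing from question 1 is easier to follow on a tiny array. The sketch below uses a made-up 2 x 4 array (toy_centers, standing in for model.cluster_centers_ with two clusters and four terms); the numbers are invented for illustration only.

```python
import numpy as np

# Hypothetical centroid weights: 2 clusters (rows) x 4 terms (columns)
toy_centers = np.array([[0.1, 0.7, 0.0, 0.3],
                        [0.5, 0.0, 0.2, 0.9]])

ascending = toy_centers.argsort()   # per row: column indices from smallest to largest weight
descending = ascending[:, ::-1]     # keep all rows, reverse the columns: largest weight first
top2 = descending[:, :2]            # keep only the first 2 indices of every row

print(ascending)    # [[2 0 3 1], [1 2 0 3]]
print(descending)   # [[1 3 0 2], [3 0 2 1]]
print(top2)         # [[1 3], [3 0]]
```

So order_centroids[i, :10] is simply the indices of the 10 heaviest terms of cluster i.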

Vibhor Sharma • Sun, 10 May 2020

Thank you
And can you please explain what these statements are actually doing?
1) order_centroids = model.cluster_centers_.argsort()[:, ::-1]
2) for ind in order_centroids[i, :10] (I am unable to understand what this :10 is doing here)
Thank you

dev • Wed, 13 May 2020

1. cluster_centers_ holds the coordinates of the cluster centers. For a visual example see https://pythonprogrammingla...
That example prints the cluster centers with

print(f"Cluster centers: \n{kmeans_model.cluster_centers_}")

and there each center is two-dimensional. That's not the case for this example: each center has one coordinate per word in terms, so 33 coordinates. argsort() returns the indices that would sort an array; to understand why we need order_centroids, see the answer below.

2. This is only text output. What we are showing here is which words weigh most in each cluster. The variable terms contains all the words in the vocabulary (the list is printed after the article's code above).

So ind is a word index: a position in the array terms. order_centroids[i] lists those indices from the heaviest to the lightest word for cluster i, so the order is different for each cluster. The slice :10 keeps only the first 10 of them.


for ind in order_centroids[i, :10]:
    print(' %s' % terms[ind], end='')

where i is the current cluster. We loop over every cluster (true_k is the number of clusters) and show the words that belong to that cluster

  
for i in range(true_k):
    print("Cluster %d:" % i, end='')

Now we only output 10 words per cluster. Try changing it to 20 and see what happens.


for ind in order_centroids[i, :20]:
    print(' %s' % terms[ind], end='')

There is a limit to the number of words, which is the size of the array terms.

Note that the code for both questions 1 and 2 is for console output only; I've added it to explain the algorithm. If you just want to make predictions, all you need is the shorter script shown after the article's code above, which goes straight from model.fit(X) to model.predict(Y).

Vibhor Sharma • Thu, 21 May 2020

Can I in some way get the frequency of occurrences of the top terms of each cluster and then store it in a pandas dataframe, with columns for the top terms and their corresponding frequency, for all the clusters created?

sudha rani • Mon, 25 May 2020

Hi,
I need to cluster addresses of different users, i.e. I have the complete address in one cell and 100 rows like that. How can I do that?

dev • Wed, 10 Jun 2020

Yes, terms[ind] contains the top terms for each cluster. Inside the loop `for i in range(true_k):` you can add them to a pandas dataframe.

For the addresses: you could use K-means clustering, but with K-means you'd have to specify K (the number of clusters you want). You'll likely have to parse each address into a format the algorithm can use; that could be vectorization or simply string formatting, depending on the data.

Saranya Gupta • Thu, 06 Aug 2020

Hi,
Instead of printing the top terms per cluster, is there a way to print all the terms per cluster?

Rushil • Wed, 19 Aug 2020

Hi, thanks a lot for the post. I have a couple of questions:
1. When I run the code multiple times, I get different results. How does this work?
2. Is there a way to force the algorithm to make particular clusters? For example, would there be a way to indicate that one cluster should be about cats and the other about chrome?

Thanks!

Amran Hossain • Sun, 13 Sep 2020

Can you help me? After the clustering is complete, how do I summarize the data?


terms = vectorizer.get_feature_names_out()
for i in range(true_k):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()
print('\n')

After writing this, what can I do to summarize the text of a single document? I would be grateful for your help.

dev • Sun, 03 Jan 2021

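Yes, change order_centroids[i, :10] to order_centroids[i, :]. That prints every term in the vocabulary for each cluster, sorted by weight.

1. That's because KMeans picks its starting centroids at random (k-means++ is a randomized seeding, and n_init=1 means only one attempt is made), so each run can end in a different clustering. Pass random_state to KMeans to get the same result every time.
2. Not by naming topics, but KMeans accepts an array of starting centroids through init, which lets you nudge it. If you have labeled examples, a supervised classifier such as kNN is the better fit.

You need another algorithm for summarizing. KMeans is for clustering.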
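For the two questions about repeatability and steering the clusters, here is a small sketch. The seed phrases in seed_texts and the value 42 are hypothetical choices of mine; the rest reuses the article's documents. It only shows one possible way to do this, not the only one.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["This little kitty came to play when I was eating at a restaurant.",
             "Merley has the best squooshy kitten belly.",
             "Google Translate app is incredible.",
             "If you open 100 tab in google you get a smiley face.",
             "Best cat photo I've ever taken.",
             "Climbing ninja cat.",
             "Impressed with google map feedback.",
             "Key promoter extension for Google Chrome."]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

# 1) The same random_state gives the same clustering on every run
labels_a = KMeans(n_clusters=2, init='k-means++', max_iter=100, n_init=1,
                  random_state=42).fit(X).labels_
labels_b = KMeans(n_clusters=2, init='k-means++', max_iter=100, n_init=1,
                  random_state=42).fit(X).labels_
print("identical runs:", np.array_equal(labels_a, labels_b))

# 2) Start from hand-picked centroids: cluster 0 seeded with cat words,
#    cluster 1 seeded with Google words (seed phrases are made up)
seed_texts = ["cat kitten kitty", "google chrome"]
seed_centers = vectorizer.transform(seed_texts).toarray()  # init must be a dense array
seeded = KMeans(n_clusters=2, init=seed_centers, max_iter=100, n_init=1).fit(X)

print("labels:", seeded.labels_)
print("cat query:", seeded.predict(vectorizer.transform(["My cat is hungry."])))
```

The seeds only set the starting point; KMeans still moves the centroids during fitting, so on messier data the final clusters may drift away from the topics you had in mind.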