Given text documents, we can group them automatically: text clustering. We'll use KMeans, an unsupervised machine learning algorithm.
I've collected some articles about cats and Google. You've guessed it: the algorithm will create clusters. The articles can be about anything; the clustering algorithm will group them automatically. Even cooler: prediction.
Related course: Complete Machine Learning Course with Python
KMeans
We create the documents using a Python list. In our example, documents are simply text strings that fit on the screen. In a real-world situation, they may be large files.
documents = ["This little kitty came to play when I was eating at a restaurant.",
"Merley has the best squooshy kitten belly.",
"Google Translate app is incredible.",
"If you open 100 tab in google you get a smiley face.",
"Best cat photo I've ever taken.",
"Climbing ninja cat.",
"Impressed with google map feedback.",
"Key promoter extension for Google Chrome."]
Feature extraction
KMeans works with numbers only, so we need to turn the documents into numbers. This common step is known as feature extraction.
The feature we'll use is TF-IDF, a numerical statistic that combines term frequency and inverse document frequency. In short: we use statistics to get numerical features. Because I'm lazy, we'll use the existing implementation of the TF-IDF algorithm in sklearn.
The TfidfVectorizer class implements the TF-IDF algorithm: it converts a collection of raw documents to a matrix of TF-IDF features.
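A minimal sketch of that step (it assumes the documents list from above; the stop_words='english' option drops common English words so they don't dominate the clusters):

from sklearn.feature_extraction.text import TfidfVectorizer

# Convert the raw documents to a matrix of TF-IDF features
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)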
Text clustering
After we have numerical features, we initialize the KMeans algorithm with K=2. If you want to determine K automatically, see the previous article. We’ll then print the top words per cluster.
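A sketch of that step, reusing X and vectorizer from the feature-extraction step (get_feature_names_out requires a recent scikit-learn; older versions call it get_feature_names):

from sklearn.cluster import KMeans

true_k = 2  # K, the number of clusters
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)

# Sort each cluster center's term weights from highest to lowest
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()

# Show the ten highest-weighted terms per cluster
for i in range(true_k):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])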
Then we get to the cool part: we give a new document to the clustering algorithm and let it predict its cluster. In the code below I've done that twice.
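A sketch of the prediction step (the two example sentences are placeholders; any short text will do):

# Vectorize new documents with the same vectorizer, then predict their cluster
Y = vectorizer.transform(["chrome browser to open."])
print(model.predict(Y))

Y = vectorizer.transform(["My cat is hungry."])
print(model.predict(Y))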
If you are new to Machine Learning, I highly recommend this book
Please tell me how to proceed with a lot of text documents and how to import them. Thank you
Load every text document into the list documents. You can use pathlib to get every text file in the directory, then read the files one by one, each becoming an element in the list documents.
So remove the documents list and replace it with this:
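For example (a minimal sketch; adjust the *.txt pattern to match your file names):

from pathlib import Path

# Read every .txt file in the files/ subdirectory into the documents list
documents = [path.read_text() for path in Path("files").glob("*.txt")]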
where all your files are in the directory `files` (a subdirectory in your program)
I have a few doubts:
1) What is the meaning of [:, ::-1]?
2) On changing the second query to "My cat is hungry and wants to eat apple.", why are the words in the 2 clusters changing, given that the clusters are formed before the query is processed?
3) On making the number of clusters 3, why is it showing the same word 'climbing' in two different clusters, which should not be possible?
Please reply asap.
1. It is NumPy array slicing; ::-1 means in reverse order.
2. Words can appear in multiple clusters if you work with raw data.
3. That is just how the algorithm's implementation works.
Thank you
And can you please explain what these statements are actually doing?
1) order_centroids = model.cluster_centers_.argsort()[:, ::-1]
2) for ind in order_centroids[i, :10] (I am unable to understand what this :10 is doing here)
Thank you
1. cluster_centers_ holds the coordinates of the cluster centers. For a visual example see https://pythonprogrammingla...
That page outputs the cluster centers, and you'll see they are two dimensional in that example. That's not the case for this example: here each center has one dimension per term in the vocabulary. argsort() returns the indices that would sort an array; to understand why we need order_centroids, see the answer below.
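A small NumPy example of what argsort()[:, ::-1] does:

import numpy as np

weights = np.array([[0.1, 0.7, 0.2],
                    [0.5, 0.0, 0.3]])

# argsort gives the indices that would sort each row in ascending order;
# [:, ::-1] reverses each row, so indices run from highest to lowest weight
print(weights.argsort()[:, ::-1])
# [[1 2 0]
#  [0 2 1]]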
2. This is only text output: what we are showing here is which words belong to which cluster. In this case the variable terms contains all the words in the vocabulary.
So ind is the word index for that cluster. The variable terms has all the words, and ind is an index into that array. Depending on the cluster, that order is different.
i is the current cluster: we loop over every cluster (true_k is the number of clusters) and show the words that belong to that cluster.
The :10 means we only output 10 words per cluster. Try changing it to 20 and see what happens.
There is a limit to the number of words, which is the size of the array terms.
Note that the code for both question 1 and 2 is for the purpose of console output only. As in, I've added that to explain the algorithm. If you just want to make predictions, all you need is:
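Something along these lines (a sketch of the pipeline with the console output removed):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

model = KMeans(n_clusters=2, init='k-means++', max_iter=100, n_init=1)
model.fit(X)

# Predict the cluster of a new document
Y = vectorizer.transform(["chrome browser to open."])
print(model.predict(Y))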
Can I in some way get the frequency of occurrences of the top terms of each cluster, and then store them in a pandas dataframe with columns for the top terms and their corresponding frequencies, for all the clusters created?
Hi,
I need to cluster addresses of different users, i.e. I have the complete address in one cell and 100 rows like that. How can I do that?
Yes, terms[ind] contains the top terms for each cluster. Inside the loop `for i in range(true_k):` you can add them to a pandas dataframe.
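A sketch of one way to do that (it reuses vectorizer, terms, order_centroids and true_k from the tutorial code; the column names are my own, and 'frequency' here means the term's raw count over the whole corpus):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Raw occurrence counts per term, aligned with the TF-IDF vocabulary
counts = CountVectorizer(vocabulary=vectorizer.vocabulary_).fit_transform(documents)
total_counts = counts.sum(axis=0).A1  # flatten the 1 x n_terms matrix

rows = []
for i in range(true_k):
    for ind in order_centroids[i, :10]:
        rows.append({'cluster': i, 'top term': terms[ind], 'frequency': total_counts[ind]})

df = pd.DataFrame(rows, columns=['cluster', 'top term', 'frequency'])
print(df)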
You could use K-means clustering for that, but with K-means you'd have to specify K (the number of clusters you want). You'll likely have to parse the addresses into another format that's usable by the algorithm; that could be vectorization or simply string formatting, depending on the data.
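For instance, a rough sketch with character n-grams, which tend to tolerate the abbreviations and typos common in addresses (the file name, column name and K are placeholders):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

df = pd.read_csv('addresses.csv')  # hypothetical file with an 'address' column

# Character n-grams instead of whole words
vec = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 4))
X = vec.fit_transform(df['address'])

model = KMeans(n_clusters=5)  # you still have to choose K yourself
df['cluster'] = model.fit_predict(X)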
Hi,
Instead of printing the top terms per cluster, is there a way to print all the terms per cluster?
Hi, thanks a lot for the post. I have a couple of questions:
1. When I run the code multiple times, I get different results. How does this work?
2. Is there a way to force the algorithm to make particular clusters? For example, would there be a way to indicate that one cluster should be about cats and the other about chrome?
Thanks!
@disqus_djn7ClpXiL Can you help me? After clustering is complete, how do I summarize the data? After writing this, what can I do to summarize single-document text data? I would be glad.
Yes, change order_centroids[i, :10] to order_centroids[i, :] to print all the terms per cluster.
1. I think that's because of the random initialization: it uses k-means++, which picks its starting centroids at random, so different runs can converge to different clusterings.
2. Not with this implementation. If you already know the classes you want (cats vs. chrome), that's classification rather than clustering; you could use a supervised algorithm such as kNN with labeled examples to do that (see the sketch below).
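A minimal sketch of that supervised route (the training sentences and labels are placeholders you'd supply yourself):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

train_docs = ["Best cat photo I've ever taken.",
              "Key promoter extension for Google Chrome."]
labels = ["cats", "chrome"]  # hypothetical labels

vec = TfidfVectorizer(stop_words='english')
X = vec.fit_transform(train_docs)

# 1-nearest-neighbour classifier on the TF-IDF vectors
clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X, labels)

print(clf.predict(vec.transform(["My cat is hungry."])))  # likely ['cats']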
You need another algorithm for summarization; k-means only does clustering.