If we represent text documents as feature vectors using the bag-of-words method, we can calculate the Euclidean distance between them.
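Any two vectors have a distance between them. Consider the vectors (2,2) and (4,2): they differ by 2 in the first component and by 0 in the second, so they are a distance of 2 apart. The Euclidean distance lets us calculate this automatically for vectors of any length.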
Because each text document is represented as a vector, the distance between two vectors tells us how similar the documents are: the smaller the distance, the more alike their word counts.
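As a quick check of the (2,2) and (4,2) example, here is a minimal sketch using only the Python standard library. The variable names `a`, `b`, `manual` and `builtin` are chosen here for illustration, and `math.dist` assumes Python 3.8 or newer.

```python
import math

a = (2, 2)
b = (4, 2)

# Euclidean distance: square root of the summed squared differences
manual = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# math.dist (Python 3.8+) computes the same quantity
builtin = math.dist(a, b)

print(manual, builtin)  # 2.0 2.0
```

The same calculation works for vectors with any number of components, which is what makes it usable on bag-of-words vectors.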
We start with the corpus, then calculate the feature vectors from the corpus, and finally calculate the Euclidean distance. In this example we compare everything to the first document.
# Feature extraction from text
# Method: bag of words
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances
corpus = [
    'All my cats in a row',
    'When my cat sits down, she looks like a Furby toy!',
    'The cat from outer space',
    'Sunshine loves to sit like this for some reason.'
]
vectorizer = CountVectorizer()
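# toarray() returns a plain NumPy array, which euclidean_distances accepts
features = vectorizer.fit_transform(corpus).toarray()
print(vectorizer.vocabulary_)
# Each pass prints the distance from every document to one document;
# the first pass compares everything to the first document
for f in features:
    print(euclidean_distances(features, f.reshape(1, -1)))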
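To see what the library calls are doing, the same pipeline (corpus, then feature vectors, then distances) can be sketched in plain Python. This is an illustrative approximation, not scikit-learn's implementation: the `tokenize` helper is hypothetical and only mimics CountVectorizer's default behaviour of lowercasing and keeping tokens of two or more word characters.

```python
import math
import re
from collections import Counter

corpus = [
    'All my cats in a row',
    'When my cat sits down, she looks like a Furby toy!',
    'The cat from outer space',
    'Sunshine loves to sit like this for some reason.',
]

def tokenize(text):
    # Hypothetical helper: lowercase, then keep runs of two or more
    # word characters (an approximation of CountVectorizer's default)
    return re.findall(r"\b\w\w+\b", text.lower())

# One word-count table per document
counts = [Counter(tokenize(doc)) for doc in corpus]

# Shared, sorted vocabulary across the whole corpus
vocabulary = sorted(set().union(*counts))

# Turn each document into a count vector over the vocabulary
vectors = [[c[word] for word in vocabulary] for c in counts]

# Distance from the first document to every document (including itself)
distances = [math.dist(vectors[0], v) for v in vectors]
for doc, d in zip(corpus, distances):
    print(f"{d:.3f}  {doc}")
```

In this sketch 'cats' and 'cat' are separate vocabulary entries, so the first document shares only the word 'my' with the second and nothing with the third or fourth. The distances are driven mostly by how many words each document contains.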