If we represent text documents as feature vectors using the bag-of-words method, we can calculate the Euclidean distance between them.

Any two vectors of the same length have a distance between them. Consider the vectors (2,2) and (4,2): the Euclidean distance gives us a single number for how far apart they are, in this case 2.
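As a small worked example, the distance between the two vectors above can be computed by hand. This is only a sketch using the Python standard library; the variable names `a`, `b` and `distance` are made up for illustration.

```python
import math

# The two example vectors from the text
a = (2, 2)
b = (4, 2)

# Euclidean distance: square root of the sum of squared differences
distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
print(distance)  # 2.0
```

The vectors differ only in their first component (by 2), so the distance is 2.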

Text similarity

Because we represent the text documents as vectors, the distance between two vectors tells us how similar the documents are: the smaller the distance, the more similar the documents.

We start with the corpus, then calculate the feature vectors from the corpus, and finally calculate the Euclidean distance. In this example we compare every document to the first document.

# Feature extraction from text
# Method: bag of words
# https://pythonprogramminglanguage.com

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances

corpus = [
    'All my cats in a row',
    'When my cat sits down, she looks like a Furby toy!',
    'The cat from outer space',
    'Sunshine loves to sit like this for some reason.'
]

# Build the bag-of-words count matrix: one row per document
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(corpus).toarray()
print(vectorizer.vocabulary_)

# Compare every document to the first one.
# euclidean_distances expects 2D input, so each vector is passed as a single row.
for f in features:
    print(euclidean_distances(features[0:1], f.reshape(1, -1)))
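To make it clearer what happens under the hood, here is a rough sketch of the same steps in plain Python, without scikit-learn. The helpers `tokenize` and `bag_of_words` are hypothetical names made up for this example. The tokenizer assumes CountVectorizer's default behaviour (lowercasing, and keeping only tokens of two or more word characters), so this is an approximation of the library, not its actual implementation.

```python
import math
import re
from collections import Counter

corpus = [
    'All my cats in a row',
    'When my cat sits down, she looks like a Furby toy!',
    'The cat from outer space',
    'Sunshine loves to sit like this for some reason.'
]

def tokenize(text):
    # Assumption: lowercase, keep tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def bag_of_words(docs):
    # Count the words per document, then line the counts up
    # against one shared, sorted vocabulary
    counts = [Counter(tokenize(d)) for d in docs]
    vocabulary = sorted(set().union(*counts))
    vectors = [[c[w] for w in vocabulary] for c in counts]
    return vocabulary, vectors

vocabulary, vectors = bag_of_words(corpus)

# Distance from the first document to every document (including itself)
distances = [math.dist(vectors[0], v) for v in vectors]
for doc, d in zip(corpus, distances):
    print(f"{d:.3f}  {doc}")
```

The first document has distance 0 to itself. Note that on raw word counts the document length matters too: the third document shares no words with the first, yet it ends up closer than the second one (which shares "my"), simply because it is shorter.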
