The bag-of-words model is a text representation used in natural language processing (NLP) and information retrieval. It converts a text into the words it contains together with their frequencies, ignoring word order, hence the name “bag of words”.

If we represent text documents as feature vectors using the bag-of-words method, we can calculate the Euclidean distance between them.

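Any two vectors have a distance between them. Consider the vectors (2,2) and (4,2): they differ by 2 in the first coordinate and by 0 in the second, so the Euclidean distance between them is 2. The same calculation works for vectors with any number of dimensions, which lets us compute the distance automatically.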
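As a quick check of the two vectors above, here is a minimal sketch that computes the distance by hand and with the standard library. The helper name `euclidean` is ours, chosen for illustration, and `math.dist` assumes Python 3.8 or newer.

```python
import math

def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

p = (2, 2)
q = (4, 2)

print(euclidean(p, q))  # 2.0
print(math.dist(p, q))  # 2.0, same result from the standard library
```

The same function works unchanged for longer vectors, which is what bag-of-words feature vectors usually are.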


Introduction

Each text is represented as a vector holding the frequency of each word. This means that if you have two texts, you can measure how similar they are by comparing their bag-of-words vectors.
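To make the idea concrete, here is a sketch of a bag of words built by hand with `collections.Counter`. The tokenizer (lowercase, words of two or more characters) is an assumption picked to resemble common defaults, and the helper name `bag_of_words` and the sample sentence are ours.

```python
import re
from collections import Counter

def bag_of_words(text):
    # Lowercase the text and keep runs of two or more word characters.
    # This tokenization rule is an assumption made for illustration.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(tokens)

bag = bag_of_words("The cat sat on the mat. The mat was warm.")
print(bag["the"])  # 3
print(bag["mat"])  # 2
print(bag["dog"])  # 0, words that never occur count as zero
```

Only the counts survive: the order in which the words appeared is discarded.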

You’ll want to use the bag-of-words model because, for large amounts of data, a computer processes numeric vectors much faster than large files of raw text.

A text can be anything from a single string to a book.

corpus = [
    'All my cats in a row',
    'When my cat sits down, she looks like a Furby toy!',
    'The cat from outer space',
    'Sunshine loves to sit like this for some reason.'
]

Because each text is represented as a vector, the distance between two vectors tells us how similar the corresponding documents are. Every text is converted to a feature vector:

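from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
features = vectorizer.fit_transform(corpus).toarray()

Then you can compare the distance from the other feature vectors to a given feature vector (imagine every vector as a point in an n-dimensional plot). Note that euclidean_distances expects two-dimensional input, so each row is wrapped in a list.

from sklearn.metrics.pairwise import euclidean_distances

for f in features:
    print(euclidean_distances([features[0]], [f]))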

Text similarity

We start with the corpus, then calculate the feature vectors from the corpus, and finally calculate the Euclidean distance. In this example we compare every document to the first one.

# Feature extraction from text
# Method: bag of words
# https://pythonprogramminglanguage.com

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances

corpus = [
    'All my cats in a row',
    'When my cat sits down, she looks like a Furby toy!',
    'The cat from outer space',
    'Sunshine loves to sit like this for some reason.'
]

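vectorizer = CountVectorizer()
# toarray() returns a plain NumPy array; recent scikit-learn
# versions reject the np.matrix returned by todense()
features = vectorizer.fit_transform(corpus).toarray()
print(vectorizer.vocabulary_)

# euclidean_distances expects 2-D input, so wrap each row in a list
for f in features:
    print(euclidean_distances([features[0]], [f]))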

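This outputs the vocabulary (each word mapped to its column index in the feature vectors), followed by the distance from each document to the first one. Of course, the distance from the first document to itself is zero. Note that the one-letter word “a” is missing from the vocabulary: by default, CountVectorizer only keeps tokens of two or more characters.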

{'all': 0, 'my': 11, 'cats': 2, 'in': 7, 'row': 14, 'when': 25, 'cat': 1, 'sits': 17, 'down': 3, 'she': 15, 'looks': 9, 'like': 8, 'furby': 6, 'toy': 24, 'the': 21, 'from': 5, 'outer': 12, 'space': 19, 'sunshine': 20, 'loves': 10, 'to': 23, 'sit': 16, 'this': 22, 'for': 4, 'some': 18, 'reason': 13}
[[0.]]
[[3.60555128]]
[[3.16227766]]
[[3.74165739]]
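The loop is not strictly required. As a sketch, euclidean_distances can also take the whole feature matrix at once and return every pairwise distance. This reuses the same corpus; the variable names `distances` and `nearest` are ours.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances

corpus = [
    'All my cats in a row',
    'When my cat sits down, she looks like a Furby toy!',
    'The cat from outer space',
    'Sunshine loves to sit like this for some reason.'
]

# The sparse matrix from fit_transform can be passed in directly
features = CountVectorizer().fit_transform(corpus)

# distances[i, j] is the distance between document i and document j
distances = euclidean_distances(features)
print(np.round(distances[0], 3))

# Closest other document to the first one (index 0 is the document itself)
nearest = 1 + int(np.argmin(distances[0, 1:]))
print(nearest, corpus[nearest])
```

The first row of `distances` holds the same four values as the loop output above, and the smallest non-zero entry points at the third document, 'The cat from outer space'.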
