Let's build a spam filter using logistic regression. We will classify SMS messages as either ham (legitimate) or spam, using the SMSSpamCollection dataset, in which every message is labeled as one or the other.

Related course: Complete Machine Learning Course with Python

what is logistic regression?

Logistic regression is a simple classification algorithm. Given an example, it predicts the probability that the example belongs to class “1” rather than class “0”.

Remember that with linear regression, we tried to predict the value of y(i) for x(i). Such continuous, unbounded output is not suited to a classification task.

Given an example, the logistic function always returns a value between zero and one.

The logistic function, also known as the sigmoid function, is defined as:

$$\sigma(t) = \frac{1}{1 + e^{-t}}$$

Let's plot that function over the range [-6, 6]:

[Figure: plot of the logistic function over the range -6 to 6]
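The shape of the curve can also be checked numerically. The sketch below uses only the standard library and evaluates the logistic function at the integer points from -6 to 6. The function name `sigmoid` is our own choice for this illustration and does not come from the spam filter code.

```python
import math


def sigmoid(z):
    """Logistic (sigmoid) function: maps a real number z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Evaluate the function at the integer points of the range [-6, 6].
for z in range(-6, 7):
    print(f"{z:+d} -> {sigmoid(z):.4f}")
```

The values start close to 0 on the left, pass through exactly 0.5 at z = 0, and approach 1 on the right, which is the S shape seen in the plot.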

The plot shows an S-shaped curve. The inverse of the logistic function is called the logit function. To make the relationship between the predictor and the dependent variable linear, we apply the logit transformation to the probability p of the positive class:

$$\operatorname{logit}(p) = \log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta x$$
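To see that the logit really is the inverse of the logistic function, here is a small standard-library sketch. The helper names `sigmoid` and `logit` are our own and are used only for this illustration.

```python
import math


def sigmoid(z):
    """Logistic function: real number -> probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Log-odds of a probability p, defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))


# Applying the logistic function and then the logit recovers the input,
# up to floating-point rounding.
for z in (-2.0, 0.0, 3.5):
    p = sigmoid(z)
    print(z, round(p, 4), round(logit(p), 4))
```

In other words, a linear model on the logit scale corresponds to a probability between zero and one once it is passed back through the logistic function.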

We can now apply it to the binary classification task.
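As a rough illustration of what that means, the sketch below fits β0 and β for a single feature with plain batch gradient descent on the log loss and then classifies by thresholding the predicted probability at 0.5. The toy data, the function names and the learning-rate settings are all made up for this example. scikit-learn's `LogisticRegression`, used later, relies on a different solver and applies regularization by default, so treat this as a sketch of the idea and not of that implementation.

```python
import math


def sigmoid(z):
    """Logistic function: real number -> probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit p(y=1 | x) = sigmoid(b0 + b1 * x) by batch gradient descent on the log loss."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad0 = grad1 = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(b0 + b1 * x) - y  # predicted probability minus label
            grad0 += error
            grad1 += error * x
        b0 -= learning_rate * grad0 / n
        b1 -= learning_rate * grad1 / n
    return b0, b1


def predict(x, b0, b1, threshold=0.5):
    """Return class 1 if the predicted probability reaches the threshold, else 0."""
    return 1 if sigmoid(b0 + b1 * x) >= threshold else 0


# Made-up toy data: one numeric feature per example and a 0/1 class label.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(xs, ys)
print(predict(1.0, b0, b1), predict(6.0, b0, b1))
```

The two classes overlap slightly in the toy data (the label at x = 3 is 1 and at x = 4 is 0), which keeps the fitted coefficients finite.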


spam filter code

We load the dataset using pandas, then split it into a training set and a test set. Because the classifier needs numeric vectors, we extract TF-IDF (term frequency-inverse document frequency) features from the text.
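To give a feel for what TF-IDF features are, here is a minimal standard-library sketch. It uses raw term counts, a smoothed inverse document frequency and L2 normalization, which is our understanding of the default behaviour of `TfidfVectorizer`. The tokenizer, the function names and the three example messages are assumptions made for illustration, and the real vectorizer returns a sparse matrix rather than Python lists.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep word tokens of at least two characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf_vectors(documents):
    """Return (vocabulary, vectors): one L2-normalized TF-IDF vector per document."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(documents)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))

    # Smoothed inverse document frequency: rarer terms get larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0  # avoid dividing by zero
        vectors.append([v / norm for v in row])
    return vocabulary, vectors


# Made-up example messages, for illustration only.
messages = ["win a free prize now", "are you free tonight", "call now to win"]
vocabulary, vectors = tfidf_vectors(messages)
for term, weight in zip(vocabulary, vectors[0]):
    if weight > 0:
        print(f"{term}: {weight:.3f}")
```

In the first message, "prize" occurs in no other message, so it receives a larger weight than "free", "now" or "win", which each appear in two messages.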

Then we create the logistic regression object and train it on the training data. Finally, we vectorize a couple of new messages and predict their labels.

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

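# The file is tab-separated with no header: column 0 is the label, column 1 the message.
df = pd.read_csv('SMSSpamCollection', delimiter='\t', header=None)

# Hold out part of the data (25% by default) as a test set.
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df[1], df[0])

# Learn the vocabulary and IDF weights from the training messages only.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)

# Train the classifier on the TF-IDF vectors.
classifier = LogisticRegression()
classifier.fit(X_train, y_train)

# Vectorize two new messages with the fitted vectorizer and predict their labels.
X_new = vectorizer.transform(['URGENT! Your Mobile No 1234 was awarded a Prize', 'Hey honey, whats up?'])
predictions = classifier.predict(X_new)
print(predictions)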
