ŷhat Core API
Building a Text Message Spam Filter
This tutorial explains how to classify text messages as spam / not spam using scikit-learn.
For documentation on deploying models to ŷhat, click here.
Make sure you have the following Python libraries installed before we begin: NumPy, pandas, and scikit-learn.
Load Libraries
We'll be using some convenient data structures in NumPy and pandas to manage and reshape our data, so let's load those along with scikit-learn, which we'll use for feature extraction and text classification.
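As a rough sketch, the imports for this tutorial might look like the following. The exact list is an assumption based on the libraries and classes named in the text (`TfidfVectorizer` for feature extraction, `MultinomialNB` for classification).

```python
# NumPy and pandas for data handling
import numpy as np
import pandas as pd

# scikit-learn pieces used later in the tutorial (assumed from the text)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
```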
Download Dataset
The SMS Spam Collection is open source and available at the UCI Machine Learning Repository.
To save time, we'll train our classifier on a version that has been modified to exclude non-ASCII characters, which don't work well with scikit-learn.
Download original dataset
Download cleaned dataset
Reading the Dataset into Memory
If you're following along using the cleaned version of the dataset, you can read the data from the tab-delimited file using pandas' read_table utility function. pandas has numerous utility functions for reading and writing files. To read about read_table and other pandas IO tools click here.
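Here is a minimal sketch of that step. It assumes the file has no header row and two tab-separated fields; the in-memory `sample` stands in for the downloaded file, and its second message is made up for illustration. With the real file, you would pass its path in place of `sample`.

```python
import io

import pandas as pd

# Stand-in for the cleaned dataset: one message per line, "label<TAB>text".
# The spam line below is a made-up example, not a row from the real data.
sample = io.StringIO(
    "ham\tOk lar... Joking wif u oni...\n"
    "spam\tWIN a free prize now, reply YES\n"
)

# header=None because the file is assumed to have no header row;
# names= supplies the column names used in this tutorial.
df = pd.read_table(sample, header=None, names=["cat", "message"])
print(df.head())
```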
| | cat | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
The data has two columns: cat, or category, and message. We're going to train a classifier to assess whether a text message is spam or not spam (ham).
From Text Messages to Feature Vectors
We need to transform our text data into feature vectors, numerical representations which are suitable for performing statistical analysis. The most common way to do this is to apply a bag-of-words approach, where the number of times each word occurs in a message becomes a feature for our classifier.
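To make the bag-of-words idea concrete, here is a small sketch using scikit-learn's `CountVectorizer` on three made-up messages. The tutorial itself goes straight to tf-idf weighting, so this raw-count step is only an illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three toy messages (hypothetical, not from the dataset)
docs = ["free prize now", "call me now", "free free entry"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse matrix: one row per message

# vocabulary_ maps each term to its column index in the matrix
vocab = vectorizer.vocabulary_
dense = counts.toarray()

# "free" appears twice in the third message and not at all in the second
print(dense[2, vocab["free"]], dense[1, vocab["free"]])
```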
Term Frequency-Inverse Document Frequency
We want to consider the relative importance of particular words, so we'll use term frequency–inverse document frequency (tf-idf) as a weighting factor. It down-weights words that appear in many messages and so carry little information, which lets the more distinctive, "spammy" words stand out.
Here, we create a variable, tfidf, which is a vectorizer responsible for performing three important steps:
- First, it will build a dictionary of features where keys are terms and values are indices of the term in the feature matrix (that's the fit part in fit_transform).
- Second, it will transform our documents into numerical feature vectors according to the frequency of words appearing in each text message. Since any one text message is short, each feature vector will be made up of mostly zeros, each of which indicates that a given word appeared zero times in that message.
- Lastly, it will compute the tf-idf weights for our term frequency matrix.
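The three steps above can be sketched as follows. The training messages are made up for illustration, and the vectorizer uses scikit-learn's default settings, which may differ from the options the original tutorial chose.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy training messages (hypothetical)
train_messages = [
    "free entry win cash prize now",
    "are you coming to lunch",
    "claim your free prize today",
    "see you at home tonight",
]

tfidf = TfidfVectorizer()

# fit: build the term -> column index dictionary
# transform: turn each message into a tf-idf weighted sparse row
X_train = tfidf.fit_transform(train_messages)

print(X_train.shape)               # (number of messages, vocabulary size)
print(tfidf.vocabulary_["prize"])  # column index assigned to "prize"

# Fraction of entries that are non-zero; short messages give sparse rows
density = X_train.nnz / np.prod(X_train.shape)
```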
For a deeper dive into feature extraction, Christian Perone wrote a wonderful blog post covering both the theory and practice which includes illustrations using scikit-learn.
Naive Bayes
We'll be using scikit-learn's MultinomialNB, a Naive Bayes classifier that is effective for catching spam, scales well, and trains quickly.
We create a variable, y_train, which is simply a list of target classes which our classifier will be trained to identify. 1 indicates spam while 0 indicates ham, or non-spam.
Then we fit the model by passing the X_train sparse matrix and y_train to our MultinomialNB classifier's fit function.
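A minimal sketch of building `y_train` and fitting the classifier, again on made-up messages. The names `messages`, `cats`, and `clf` are illustrative, and the list comprehension is just one way to encode the labels as 1 and 0.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy labelled data (hypothetical); "cat" values follow the dataset's ham/spam labels
messages = [
    "free entry win cash prize now",
    "are you coming to lunch",
    "claim your free prize today",
    "see you at home tonight",
]
cats = ["spam", "ham", "spam", "ham"]

tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(messages)

# 1 indicates spam, 0 indicates ham
y_train = [1 if cat == "spam" else 0 for cat in cats]

clf = MultinomialNB()
clf.fit(X_train, y_train)
```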
Classifying New Observations
Now let's classify the test documents as spam or not spam and see how we did.
The predict function yields an array of predicted labels in the same encoding as y_train (1 or True for spam, 0 or False for not spam). You can read about these functions in the scikit-learn MultinomialNB documentation.
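Here is a sketch of classifying held-out messages, assuming toy training and test data. The key point is that the test messages go through `transform` on the already-fitted vectorizer rather than `fit_transform`, so they are mapped onto the same feature columns the model was trained on.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data (hypothetical)
train_messages = [
    "free entry win cash prize now",
    "are you coming to lunch",
    "claim your free prize today",
    "see you at home tonight",
]
y_train = [1, 0, 1, 0]  # 1 = spam, 0 = ham

tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(train_messages)
clf = MultinomialNB().fit(X_train, y_train)

# Held-out messages (hypothetical)
test_messages = ["free prize now", "see you at lunch"]

# transform only: reuse the vocabulary and idf weights learned in training
X_test = tfidf.transform(test_messages)
predictions = clf.predict(X_test)

for message, label in zip(test_messages, predictions):
    print("spam" if label == 1 else "ham", "|", message)
```

On a real hold-out sample you would compare `predictions` against the known labels to see how well the classifier did.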
Conclusion
In this tutorial we covered:
- Reading a dataset using Pandas
- Transforming text data into numerical feature vector representations
- Applying tf-idf as a weighting factor to account for the relative importance of words
- Training a Naive Bayes classifier to perform text classification against our data in tf-idf vector representation
- Extracting features from a held-out sample using the same vectorizer used to train our model
- Using our model to classify the held-out documents
A common stumbling block for analytics teams is declaring the job done before their work has been operationalized. ŷhat helps analytics teams carry projects further, to the point at which their work is integrated into their companies' software systems and actively involved in operations.
To learn more, deploy this model in just a few minutes.