Introduction

I understand that NLP itself is not a classifier, but I include it here since it is used in tandem with one. The first data set came from the Udemy course I was taking at the time: a list of restaurant reviews to be sorted into positive or negative. It was a fun learning experience, but it felt a bit trite, so I wanted to try my hand at another application: sorting emails into spam or legitimate. That data set was acquired from Kaggle.

Code

Restaurant Review

The methods I used to organize the words are stemming and tokenization. Since it would be unreasonable to ask every user to write perfectly and in a single tense, stemming lets a word typed in any form be reduced to its stem. Tokenization is then used to take each word in a review individually so it can be associated with the review's score.
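To make that concrete, here is a tiny illustration, separate from the notebook below, of what the Porter stemmer does to different forms of the same word (the example words are my own, not from the data set):

from nltk.stem.porter import PorterStemmer

stem = PorterStemmer()
print([stem.stem(w) for w in ['loved', 'loving', 'loves']]) #All three forms reduce to the stem 'love'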

#Import libraries
import pandas as pd
import nltk
import re
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
data = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3) #\t because the data is tab-separated; quoting = 3 ignores double quotes

corpus = [] #Collection of the individual reviews
for i in range(0, 1000):
    review = re.sub('[^a-zA-Z]', ' ', data['Review'][i]) #Keep letters only, replacing everything else with a space
    review = review.lower() #Lowercase all words
    review = review.split() #Split the review into individual words
    stem = PorterStemmer() #Stemmer that reduces each word to its stem
    review = [stem.stem(word) for word in review if not word in set(stopwords.words('english'))] #Drop common stopwords such as "this", then stem what remains
    review = ' '.join(review) #Join the cleaned words back into a single string
    corpus.append(review) #Create the collection of review words

#Bag of Words Model via tokenization
from sklearn.feature_extraction.text import CountVectorizer
vector = CountVectorizer(max_features = 1500) #Keep only the 1500 most frequent words as features
X = vector.fit_transform(corpus).toarray()
y = data.iloc[:, 1].values #Second column holds the positive/negative label
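It may help to picture what the bag-of-words matrix actually looks like. The toy corpus below is my own illustration, not part of the review data:

from sklearn.feature_extraction.text import CountVectorizer

toy = ['good food good service', 'bad food']
cv = CountVectorizer()
counts = cv.fit_transform(toy).toarray()
print(cv.get_feature_names_out()) #['bad' 'food' 'good' 'service']
print(counts) #One row per document, one column per word: [[0 1 2 1], [1 1 0 0]]

In the pipeline above, each review becomes one row, each of the 1500 kept words becomes one column, and each cell holds the count of that word in that review.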

Spam

There are a couple of small differences between the two processes. Instead of stemming, I tried a lemmatizer, hoping to see whether keeping whole words, rather than the truncated stems produced by stemming, would have an effect. I also used a TF-IDF vectorizer instead of bag of words, since it weights each term's frequency against how common that term is across all the messages.
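To show the difference on actual words, here is a quick side-by-side of the stemmer and the lemmatizer (my own illustration, not part of the spam notebook):

from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

stem = PorterStemmer()
lem = WordNetLemmatizer()
words = ['studies', 'leaves']
print([stem.stem(w) for w in words]) #['studi', 'leav'] - truncated stems
print([lem.lemmatize(w) for w in words]) #['study', 'leaf'] - real dictionary words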

import pandas as pd
import nltk
import re
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
data = pd.read_csv('spam.csv', encoding = 'latin-1') #latin-1 encoding so special characters in the messages can be read

#Creating a loop that cleans each message, then collects the cleaned text
corpus = [] #Collection of the individual messages
for i in range(0, 5572):
    review = re.sub('[^a-zA-Z]', ' ', data['v2'][i]) #v2 is the column holding the message text; keep letters only
    review = review.lower() #Lowercase all words
    review = review.split() #Split the message into individual words
    lem = WordNetLemmatizer() #Lemmatizer that reduces each word to its dictionary form
    review = [lem.lemmatize(word) for word in review if not word in set(stopwords.words('english'))] #Drop common stopwords such as "this", then lemmatize what remains
    review = ' '.join(review) #Join the cleaned words back into a single string
    corpus.append(review) #Create the collection of review words

#TF-IDF model to weight words by how often they appear
from sklearn.feature_extraction.text import TfidfVectorizer
vector = TfidfVectorizer() #No feature cap here; every unique word in the corpus becomes a feature
X = vector.fit_transform(corpus).toarray()
y = data.iloc[:, 0].values #v1 holds the ham/spam labels

Results

Both pipelines used the Gaussian classifier, since the data is likely a bit more complex than simply being linearly separable.
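Neither listing above shows the classification step itself. Below is a minimal sketch of how that step would look, assuming the Gaussian classifier refers to scikit-learn's GaussianNB; the test_size and random_state values are illustrative, not the original settings.

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, accuracy_score

#X and y come from either vectorizer above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

classifier = GaussianNB()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))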

Review and spam accuracies, respectively

The spam prediction model came out very successful, while the restaurant review model came out only average. I expected this, since the spam data set is much larger than the review set.
