Sentiment Analysis of Reviews from Amazon (acrylic markers/pens)
I am using logistic regression on an imbalanced two-label dataset of Amazon reviews to predict the sentiment of each review (Positive/Negative).
I add features, tune the model, and compare the results at the end.
import pandas as pd
import numpy as np
import nltk, re, pprint
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, RepeatedStratifiedKFold, StratifiedKFold
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix, roc_curve, roc_auc_score,
                             precision_score, recall_score, precision_recall_curve, f1_score)
Removing neutral 3-star ratings:
# Drop neutral 3-star reviews and derive a binary label (1 = rating above 3)
df = resul[resul['Rating'] != 3].copy()
df['Positively Rated'] = np.where(df['Rating'] > 3, 1, 0)
Our dataset is imbalanced: the majority label "1" accounts for around 93% of the reviews.
df['Positively Rated'].value_counts()/df.shape[0]
Label | Ratio |
---|---|
1 | 0.925511 |
0 | 0.074489 |
Splitting the data into training and test sets:
X_train, X_test, y_train, y_test = train_test_split(df['Body'],
df['Positively Rated'],
random_state=2)
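Given the imbalance, a trivial majority-class model already scores close to the overall share of positive reviews (~93%), which is a useful reference point before training anything. A minimal sketch using the DummyClassifier imported above (note that train_test_split also accepts stratify=df['Positively Rated'] to preserve the class ratio in both splits):
# Majority-class baseline: always predicts label 1; the raw review text is ignored
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train, y_train)
print(f'Majority-class baseline accuracy: {baseline.score(X_test, y_test)}')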
Default Logistic Regression model:
Fitting a CountVectorizer to the training data and transforming the documents to a document-term matrix:
# Fit a plain (unigram) CountVectorizer and build the document-term matrix
vect = CountVectorizer().fit(X_train)
X_train_vectorized = vect.transform(X_train)
Training the model:
model_simple = LogisticRegression()
model_simple.fit(X_train_vectorized, y_train)
Predicting the transformed test documents and printing the results:
predictions_simple = model_simple.predict(vect.transform(X_test))
print(f'Accuracy Score: {accuracy_score(y_test,predictions_simple)}')
print(f'Confusion Matrix: \n{confusion_matrix(y_test, predictions_simple)}')
print(f'Area Under Curve: {roc_auc_score(y_test, predictions_simple)}')
print(f'Recall score: {recall_score(y_test,predictions_simple)}')
print(f'Precision score: {precision_score(y_test,predictions_simple)}')
Accuracy Score: 0.9503404084901882
Confusion Matrix:
[[ 106   91]
 [  33 2267]]
Area Under Curve: 0.7618616199514456
Recall score: 0.9856521739130435
Precision score: 0.9614079728583546
Although the Accuracy Score is high, almost half of the negative reviews (label 0) are classified incorrectly (91 false positives out of 197 negatives).
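Note that roc_auc_score above is given the hard 0/1 predictions, so it effectively reports a balanced-accuracy-like number; passing the predicted probability of the positive class gives the usual threshold-independent AUC. A short sketch reusing the fitted model_simple and vect:
# AUC computed from class-1 probabilities instead of hard labels
probas_simple = model_simple.predict_proba(vect.transform(X_test))[:, 1]
print(f'Area Under Curve (from probabilities): {roc_auc_score(y_test, probas_simple)}')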
Which features have the smallest coefficients (predicting negative reviews) and the largest coefficients (predicting positive reviews)?
# get the feature names as numpy array
feature_names = np.array(vect.get_feature_names())
# Sort the coefficients from the model
sorted_coef_index = model_simple.coef_[0].argsort()
# Find the 10 smallest and 10 largest coefficients
# The 10 largest coefficients are being indexed using [:-11:-1]
# so the list returned is in order of largest to smallest
print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))
Smallest Coefs:
['return' 'half' 'not' 'never' 'terrible' 'splatter' 'disappointed'
'cheap' 'say' 'dried']
Largest Coefs:
['easy' 'love' 'perfect' 'great' 'excellent' 'best' 'fun' 'perfectly'
'nice' 'beautiful']
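Since matplotlib and seaborn are already imported, the same coefficients can also be visualized; a small sketch (reusing feature_names, sorted_coef_index, and model_simple from above) plotting the 10 most negative and 10 most positive weights:
# Bar plot of the 10 most negative and 10 most positive coefficients
top_neg = sorted_coef_index[:10]
top_pos = sorted_coef_index[:-11:-1]
idx = np.concatenate([top_neg, top_pos[::-1]])
plt.figure(figsize=(10, 4))
sns.barplot(x=feature_names[idx], y=model_simple.coef_[0][idx])
plt.xticks(rotation=45, ha='right')
plt.title('Most negative and most positive coefficients')
plt.tight_layout()
plt.show()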
Let's test the model on the generic inputs "markers are good" and "markers are not good":
print(model_simple.predict(vect.transform(['markers are good',
'markers are not good'])))
Output: [1 1] - both strings were predicted as Positive
Adding n-gram features so the model can compare word pairs (e.g. "good" vs. "not good"). Also getting rid of the most frequent words that do not add to the quality of prediction (e.g. "you"):
# Fit the CountVectorizer and TfidfVectorizer to the training data, specifying a
# minimum document frequency of 7 and extracting 1-grams and 2-grams
vect_count = CountVectorizer(min_df=7, ngram_range=(1,2),
stop_words = frozenset(["you", "your", "zu"])).fit(X_train)
vect_tfidf = TfidfVectorizer(min_df=7, ngram_range=(1,2),
stop_words = frozenset(["you", "your", "zu"])).fit(X_train)
X_train_vectorized = vect_count.transform(X_train)
X_train_vectorized_tfidf = vect_tfidf.transform(X_train)
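A quick sanity check that negation bigrams such as "not good" survived the min_df=7 cutoff and are part of the new feature space (vocabulary_ maps each term to its column index; whether a given bigram appears depends on the corpus):
# Check whether selected unigrams/bigrams made it into the vocabulary
for term in ['good', 'not good']:
    print(term, '->', vect_count.vocabulary_.get(term, 'not in vocabulary'))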
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)
X_test_vectorized = vect_count.transform(X_test)
predictions = model.predict(X_test_vectorized)
# performance
print(f'Accuracy Score: {accuracy_score(y_test,predictions)}')
print(f'Confusion Matrix: \n{confusion_matrix(y_test, predictions)}')
print(f'Area Under Curve: {roc_auc_score(y_test, predictions)}')
print(f'Recall score: {recall_score(y_test,predictions)}')
print(f'Precision score: {precision_score(y_test,predictions)}')
Accuracy Score: 0.9583500200240288
Confusion Matrix:
[[ 116   81]
 [  23 2277]]
Area Under Curve: 0.7894162436548222
Recall score: 0.99
Precision score: 0.9656488549618321
model = LogisticRegression()
model.fit(X_train_vectorized_tfidf, y_train)
X_test_vectorized_tfidf = vect_tfidf.transform(X_test)
predictions_tfidf = model.predict(X_test_vectorized_tfidf)
# performance
print(f'Accuracy Score: {accuracy_score(y_test,predictions_tfidf)}')
print(f'Confusion Matrix: \n{confusion_matrix(y_test, predictions_tfidf)}')
print(f'Area Under Curve: {roc_auc_score(y_test, predictions_tfidf)}')
print(f'Recall score: {recall_score(y_test,predictions_tfidf)}')
print(f'Precision score: {precision_score(y_test,predictions_tfidf)}')
Accuracy Score: 0.9427312775330396
Confusion Matrix:
[[  59  138]
 [   5 2295]]
Area Under Curve: 0.6486592363716619
Recall score: 0.9978260869565218
Precision score: 0.9432799013563502
Although we have added n-gram features to our models, they still do not work well on the generic reviews "markers are good" and "markers are not good", and neither model has a very strong AUC. We will add class weights to the LR model in order to increase the impact of the minority class (negative reviews, "0").
# define class weights
model_w = LogisticRegression(random_state=13,C=680,fit_intercept=True,
penalty='l2',class_weight={0: 99.991, 1: 0.009} )
model_w.fit(X_train_vectorized_tfidf, y_train)
predictions_w_idf = model_w.predict(X_test_vectorized_tfidf)
# performance
print(f'Accuracy Score: {accuracy_score(y_test,predictions_w_idf)}')
print(f'Confusion Matrix: \n{confusion_matrix(y_test, predictions_w_idf)}')
print(f'Area Under Curve: {roc_auc_score(y_test, predictions_w_idf)}')
print(f'Recall score: {recall_score(y_test,predictions_w_idf)}')
print(f'Precision score: {precision_score(y_test,predictions_w_idf)}')
Accuracy Score: 0.8470164197036444
Confusion Matrix:
[[ 190    7]
 [ 375 1925]]
Area Under Curve: 0.9007117634076363
Recall score: 0.8369565217391305
Precision score: 0.9963768115942029
The above model has the strongest AUC so far and predicts the generic test strings correctly:
print(model_w.predict(vect_tfidf.transform(['markers are good',
'markers are not good'])))
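The class weights above were hand-picked; as an alternative, scikit-learn can derive them from the training label frequencies with class_weight='balanced' (weights proportional to n_samples / (n_classes * np.bincount(y))). A minimal sketch reusing the fitted vect_tfidf (the C value is carried over from above and may need retuning):
# Let scikit-learn infer class weights from the label frequencies
model_bal = LogisticRegression(random_state=13, C=680, class_weight='balanced', max_iter=1000)
model_bal.fit(X_train_vectorized_tfidf, y_train)
predictions_bal = model_bal.predict(X_test_vectorized_tfidf)
print(f'Area Under Curve: {roc_auc_score(y_test, predictions_bal)}')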
Next, trying the weighted LR with CountVectorizer features, which, unlike TF-IDF, does not account for how a word is distributed across documents:
model_w = LogisticRegression(random_state=13,C=80,fit_intercept=True,
penalty='l2',class_weight={0: 99.991, 1: 0.009} )
model_w.fit(X_train_vectorized, y_train)
predictions_w = model_w.predict(X_test_vectorized)
# performance
print(f'Accuracy Score: {accuracy_score(y_test,predictions_w)}')
print(f'Confusion Matrix: \n{confusion_matrix(y_test, predictions_w)}')
print(f'Area Under Curve: {roc_auc_score(y_test, predictions_w)}')
print(f'Recall score: {recall_score(y_test,predictions_w)}')
print(f'Precision score: {precision_score(y_test,predictions_w)}')
Accuracy Score: 0.8446135362434922
Confusion Matrix:
[[ 174   23]
 [ 365 1935]]
Area Under Curve: 0.8622765393952769
Recall score: 0.841304347826087
Precision score: 0.9882533197139939
Although this model has a slightly lower AUC score (0.86 compared to 0.90 with TF-IDF), it also does well on the generic reviews "markers are good" and "markers are not good":
print(model_w.predict(vect_count.transform(['markers are good',
'markers are not good'])))
Output: [1 0] - predicted as Positive and Negative respectively
For handling the imbalanced dataset, class weights were added to the LR model, and further tuning (the regularization parameter C and the choice between CountVectorizer and TF-IDF) was needed to get the best results in AUC as well as in the tests on generic strings.
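The regularization strength C and the class weights above were set by hand; since GridSearchCV and StratifiedKFold are already imported, they could instead be tuned with cross-validation scored on ROC AUC. A hedged sketch of that search (the grid values are illustrative, not the ones used above):
# Illustrative grid search over C and class weights, scored on ROC AUC
param_grid = {'C': [1, 10, 100, 680],
              'class_weight': ['balanced', {0: 99.991, 1: 0.009}]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=13)
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                    scoring='roc_auc', cv=cv)
grid.fit(X_train_vectorized_tfidf, y_train)
print('Best params:', grid.best_params_)
print('Best CV AUC:', grid.best_score_)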
Comparing the results:
Statistics/Model | Default Model | CountVectorizer with n-grams | Tfidf with n-grams | CountVectorizer with regularization and weights | Tfidf with regularization and weights |
---|---|---|---|---|---|
Accuracy Score | 0.9503 | 0.9584 | 0.9427 | 0.8446 | 0.8470 |
Area Under Curve | 0.7619 | 0.7894 | 0.6487 | 0.8623 | 0.9007 |
Recall score | 0.9857 | 0.9900 | 0.9978 | 0.8413 | 0.8370 |
Precision score | 0.9614 | 0.9656 | 0.9433 | 0.9883 | 0.9964 |
Generic test ("markers are good"; "markers are not good") | Both predicted as Positive | Both predicted as Positive | Both predicted as Positive | Predicted correctly (Positive, Negative) | Predicted correctly (Positive, Negative) |
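Finally, the AUC values in the table come from hard 0/1 predictions; plotting ROC curves from predicted probabilities gives a threshold-independent comparison. A sketch (note that model and model_w were reused and overwritten above, so here they refer to the TF-IDF n-gram model and the weighted CountVectorizer model respectively):
# ROC curves from predicted probabilities for three of the fitted models
plt.figure(figsize=(6, 5))
for name, mdl, vec in [('Default CountVectorizer', model_simple, vect),
                       ('Tfidf with n-grams', model, vect_tfidf),
                       ('CountVectorizer with weights', model_w, vect_count)]:
    probas = mdl.predict_proba(vec.transform(X_test))[:, 1]
    fpr, tpr, _ = roc_curve(y_test, probas)
    plt.plot(fpr, tpr, label=f'{name} (AUC={roc_auc_score(y_test, probas):.3f})')
plt.plot([0, 1], [0, 1], 'k--', label='Chance')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()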