# Naive Bayes for Sentiment Classification

In this notebook, we implement the Naive Bayes algorithm for text classification and apply it to sentiment classification data.


## Data

We will use the sentiment analysis corpus [polarity dataset v1.0](http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.README.1.0.txt) from [Movie Review Data](http://www.cs.cornell.edu/people/pabo/movie-review-data/), created by Bo Pang and Lillian Lee. The task is to classify reviews as having positive or negative polarity.

The dataset contains 10662 movie reviews, 50% with positive sentiment and 50% with negative sentiment. The data is stored in the file `sentiment.txt`, where each line is a review preceded by its label (+1 or -1). All reviews are tokenized. For instance:

```
+1 if you sometimes like to go to the movies to have fun , wasabi is a good place to start .
-1 enigma is well-made , but it's just too dry and too placid .
```

We need to download data first.

In [None]:
!rm -f sentiment.txt
!wget https://raw.githubusercontent.com/minhpqn/nlp_100_drill_exercises/master/data/sentiment.txt

--2021-12-04 02:54:36--  https://raw.githubusercontent.com/minhpqn/nlp_100_drill_exercises/master/data/sentiment.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1270444 (1.2M) [text/plain]
Saving to: ‘sentiment.txt’


2021-12-04 02:54:37 (17.6 MB/s) - ‘sentiment.txt’ saved [1270444/1270444]



In [None]:
!head sentiment.txt

+1 the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal . 
+1 the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth . 
+1 effective but too-tepid biopic
+1 if you sometimes like to go to the movies to have fun , wasabi is a good place to start . 
+1 emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one . 
+1 the film provides some great insight into the neurotic mindset of all comics -- even those who have reached the absolute top of the game . 
+1 offers that rare combination of entertainment and education . 
+1 perhaps no picture ever made has more literally showed that the road to hell is paved with good intentions . 
+1 steers turns in a 

### Loading data

We will load data into a list of sentences with their labels.

In [None]:
import re


def load_data(file_path):
    data = []
    # Regular expression to get the label and the text
    regx = re.compile(r'^(\+1|-1)\s+(.+)$')
    with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
        for line in f:
            line = line.strip()
            if line == '':
                continue
            match = regx.match(line)
            if match:
                lb = match.group(1)
                text = match.group(2)
                data.append((text, lb))
    return data

In [None]:
data = load_data('./sentiment.txt')

In [None]:
print(data[0])
print(data[-1])

('the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', '+1')
("enigma is well-made , but it's just too dry and too placid .", '-1')


## Train/test split

We split the data into training and test sets with an 80/20 ratio. Because the split is random and the dataset is balanced, the label distributions of the two sets come out similar, though not identical.

We use [scikit-learn](https://scikit-learn.org) library to do train/test split.

In [None]:
from sklearn.model_selection import train_test_split

texts, labels = zip(*data)
train_texts, test_texts, train_labels, test_labels = train_test_split(texts, labels, test_size=0.2, random_state=42)

Let's check labels on the training data and test data.


In [None]:
from collections import Counter

print(Counter(train_labels))
print(Counter(test_labels))

Counter({'+1': 4269, '-1': 4260})
Counter({'-1': 1071, '+1': 1062})
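
The counts above are close to balanced but not exactly equal, because the split is random. If an exact class ratio is wanted in both sets, `train_test_split` accepts a `stratify` argument. The sketch below illustrates this on a small made-up dataset; the `toy_*` names and the data are hypothetical and are not used elsewhere in this notebook.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 positive and 10 negative "reviews"
toy_texts = ['review %d' % i for i in range(20)]
toy_labels = ['+1'] * 10 + ['-1'] * 10

# stratify=toy_labels keeps the class ratio the same in both splits
tr_x, te_x, tr_y, te_y = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels)

print(Counter(tr_y))
print(Counter(te_y))
```

Applying the same argument to the sentiment data would produce a different split, so the numbers reported in the rest of the notebook would likely change.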


## Multinomial Naive Bayes Model

In this section, we will implement the Multinomial Naive Bayes (MNB) model. The implementation follows the pseudocode in Figure 4.2 of Chapter 4, "Naive Bayes and Sentiment Analysis" (SLP Book).

### Training Multinomial Naive Bayes Model

We first extract a vocabulary from the training dataset, which is a list of sentences. For the sake of simplicity, we keep all words except punctuation marks.

In [None]:
import string

def build_vocab(texts):
    """Build vocabulary from dataset

    Args:
        texts (list): list of tokenized sentences

    Returns:
        vocab (dict): map from word to index
    """
    vocab = {}
    for s in texts:
        for word in s.split():
            # Skip tokens found in string.punctuation (single punctuation marks)
            if word in string.punctuation:
                continue
            if word not in vocab:
                idx = len(vocab)
                vocab[word] = idx
    return vocab
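
Note that `word in string.punctuation` is a substring test against the string of ASCII punctuation characters. It catches single marks such as `,` but not repeated marks such as `--`, which occurs as a token in this corpus. A stricter check tests every character of the token. The helper `is_punct` below is a hypothetical illustration and is not used by `build_vocab`.

```python
import string


def is_punct(word):
    """Return True if every character of word is ASCII punctuation."""
    return len(word) > 0 and all(ch in string.punctuation for ch in word)


# Compare the substring test with the per-character test
print(',' in string.punctuation, is_punct(','))
print('--' in string.punctuation, is_punct('--'))
print(is_punct("it's"))
```

Swapping this check into `build_vocab` would change the vocabulary, so the scores reported later in the notebook could differ.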

Let's check how the function `build_vocab` works.

In [None]:
vocab = build_vocab(train_texts)

In [None]:
print(vocab)



In [None]:
from collections import defaultdict
import math


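def train_naive_bayes(texts, labels, target_classes, alpha=1):
    """Train a multinomial Naive Bayes model

    Args:
        texts (list): list of tokenized sentences
        labels (list): class label of each sentence
        target_classes (list): the classes to model
        alpha (float): add-alpha smoothing constant

    Returns:
        logprior (dict): map from class to log prior
        loglikelihood (dict): map from (word, class) to smoothed log likelihood
        vocab (dict): map from word to index
    """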
    ndoc = 0
    nc = defaultdict(int)   # map from a class label to number of documents in the class
    logprior = dict()
    loglikelihood = dict()
    count = defaultdict(int)  # count the occurrences of w in documents of class c

    vocab = build_vocab(texts)
    # Training
    for s, c in zip(texts, labels):
        ndoc += 1
        nc[c] += 1
        for w in s.split():
            if w in vocab:
                count[(w,c)] += 1

    vocab_size = len(vocab)
    for c in target_classes:
        logprior[c] = math.log(nc[c]/ndoc)
        sum_ = 0
        for w in vocab.keys():
            if (w,c) not in count: count[(w,c)] = 0
            sum_ += count[(w,c)]

        for w in vocab.keys():
            loglikelihood[(w,c)] = math.log( (count[(w,c)] + alpha) / (sum_ + alpha * vocab_size) )

    return logprior, loglikelihood, vocab
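
Written as equations, the function above computes the following quantities. This is a restatement of the code, with $N_c$ standing for `nc[c]`, $N_{doc}$ for `ndoc`, $V$ for the vocabulary, and $\alpha$ for `alpha`:

$$
\log \hat{P}(c) = \log \frac{N_c}{N_{doc}}
$$

$$
\log \hat{P}(w \mid c) = \log \frac{\mathrm{count}(w, c) + \alpha}{\sum_{w' \in V} \mathrm{count}(w', c) + \alpha \, |V|}
$$

With the default $\alpha = 1$, the second estimate is add-one (Laplace) smoothing, which keeps every vocabulary word at a non-zero probability in every class.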

Let's test the training function on a toy example.

In [None]:
data = [
    ("Chinese Beijing Chinese", "c"),
    ("Chinese Chinese Shanghai", "c"),
    ("Chinese Macao", "c"),
    ("Tokyo Japan Chinese", "j")
]
texts, labels = zip(*data)
target_classes = ["c", "j"]

logprior, loglikelihood, vocab = train_naive_bayes(texts, labels, target_classes)

Let's confirm our implementation works correctly.

In [None]:
assert logprior['c'] == math.log(0.75)
assert logprior['j'] == math.log(0.25)
assert loglikelihood[('Chinese', 'c')] == math.log(3/7)
assert loglikelihood[('Tokyo', 'c')] == math.log(1/14)
assert loglikelihood[('Japan', 'c')] == math.log(1/14)
assert loglikelihood[('Tokyo', 'j')] == math.log(2/9)

No assertion fails, so the training step reproduces the expected values on this toy example.
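
To see where the expected values in the assertions come from, the smoothed estimates for the toy example can be recomputed with exact fractions. This is a minimal sketch that re-derives the counts independently of `train_naive_bayes`; the helper `smoothed` is hypothetical.

```python
from collections import Counter
from fractions import Fraction

docs_c = ["Chinese Beijing Chinese", "Chinese Chinese Shanghai", "Chinese Macao"]
docs_j = ["Tokyo Japan Chinese"]

# Word counts per class
count_c = Counter(w for d in docs_c for w in d.split())
count_j = Counter(w for d in docs_j for w in d.split())
vocab_size = len(set(count_c) | set(count_j))   # 6 distinct words
alpha = 1


def smoothed(count, word):
    """Add-alpha estimate of P(word | class) as an exact fraction."""
    return Fraction(count[word] + alpha, sum(count.values()) + alpha * vocab_size)


print(smoothed(count_c, "Chinese"))  # (5 + 1) / (8 + 6) = 3/7
print(smoothed(count_c, "Tokyo"))    # (0 + 1) / (8 + 6) = 1/14
print(smoothed(count_j, "Tokyo"))    # (1 + 1) / (3 + 6) = 2/9
```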

#### Prediction Function

In [None]:
def test_naive_bayes(testdoc, logprior, loglikelihood, target_classes, vocab):
    """Return the most probable class for a tokenized document."""
    sum_ = {}
    for c in target_classes:
        sum_[c] = logprior[c]
        for w in testdoc.split():
            # Words outside the training vocabulary are ignored
            if w in vocab:
                sum_[c] += loglikelihood[(w, c)]
    # Pick the class with the highest log score
    return max(sum_, key=sum_.get)

Let's try to predict the label for a test document.

In [None]:
print('Predicted class: %s' % test_naive_bayes('Chinese Chinese Tokyo Japan', logprior, loglikelihood, target_classes, vocab))

Predicted class: c
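
The prediction function returns only the winning class. To inspect how close the decision is, the per-class scores for the same test document can be computed directly. The sketch below hard-codes the smoothed probabilities of the toy example (assuming `alpha=1`), so it runs on its own; the names `prior`, `likelihood`, and `scores` are illustrative.

```python
import math

# Smoothed estimates for the toy example, with alpha = 1
prior = {"c": 3 / 4, "j": 1 / 4}
likelihood = {
    ("Chinese", "c"): 3 / 7, ("Tokyo", "c"): 1 / 14, ("Japan", "c"): 1 / 14,
    ("Chinese", "j"): 2 / 9, ("Tokyo", "j"): 2 / 9, ("Japan", "j"): 2 / 9,
}

doc = "Chinese Chinese Tokyo Japan"

# Log score of each class: log prior plus the sum of word log likelihoods
scores = {
    c: math.log(prior[c]) + sum(math.log(likelihood[(w, c)]) for w in doc.split())
    for c in prior
}

for c, s in scores.items():
    print(c, s, math.exp(s))
```

Class `c` wins with an unnormalized probability of about 7.0e-4 against about 6.1e-4 for class `j`, so the decision on this document is a narrow one.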


Now, it is time to train our Naive Bayes model on the sentiment data.

In [None]:
target_classes = ['+1', '-1']    # we can construct a fixed set of classes from train_labels
logprior, loglikelihood, vocab = train_naive_bayes(train_texts, train_labels, target_classes)

In [None]:
test_naive_bayes("enigma is well-made , but it's just too dry and too placid .", logprior, loglikelihood, target_classes, vocab)

'-1'

### Evaluation

We will calculate evaluation measures on the test data. You could implement these measures yourself, but in this notebook we use scikit-learn.

Let's get predicted classes of test documents.

In [None]:
predicted_labels = [test_naive_bayes(s, logprior, loglikelihood, target_classes, vocab)
                    for s in test_texts]

In [None]:
from sklearn import metrics

print('Accuracy score: %f' % metrics.accuracy_score(test_labels, predicted_labels))

Accuracy score: 0.759962


We can calculate precision, recall, and F1 score for each class.

In [None]:
for c in target_classes:
    print('Evaluation measures for class %s' % c)
    print('  Precision: %f' % metrics.precision_score(test_labels, predicted_labels, pos_label=c))
    print('  Recall: %f' % metrics.recall_score(test_labels, predicted_labels, pos_label=c))
    print('  F1: %f' % metrics.f1_score(test_labels, predicted_labels, pos_label=c))

Evaluation measures for class +1
  Precision: 0.766925
  Recall: 0.746704
  F1: 0.756679
Evaluation measures for class -1
  Precision: 0.755232
  Recall: 0.774977
  F1: 0.764977


We can also compute the macro-averaged and micro-averaged F1 scores.

In [None]:
print('Macro-averaged f1: %f' % metrics.f1_score(test_labels, predicted_labels, average='macro'))
print('Micro-averaged f1: %f' % metrics.f1_score(test_labels, predicted_labels, average='micro'))

Macro-averaged f1: 0.759890
Micro-averaged f1: 0.759962


We can report all classification results at once.

In [None]:
print(metrics.classification_report(test_labels, predicted_labels))

              precision    recall  f1-score   support

          +1       0.77      0.75      0.76      1062
          -1       0.75      0.77      0.76      1071

    accuracy                           0.76      2133
   macro avg       0.76      0.76      0.76      2133
weighted avg       0.76      0.76      0.76      2133

