# Perceptron Algorithm for Text Classification

January 6, 2023

Contents of the notebook:

- Implement the perceptron algorithm in Python from scratch
- Train the model on the labeled data
- Evaluate the model on the test dataset

## Task description

- We will train a binary classification model to determine whether a title is about a person. We will use the [training dataset](https://raw.githubusercontent.com/neubig/nlptutorial/master/data/titles-en-train.labeled).
- We will evaluate the model on a [test dataset](https://raw.githubusercontent.com/neubig/nlptutorial/master/data/titles-en-test.labeled). We use accuracy as the evaluation measure.



## Downloading dataset

In [1]:
%%capture
!rm -f titles-en-train.labeled
!rm -f titles-en-test.labeled

!wget https://raw.githubusercontent.com/neubig/nlptutorial/master/data/titles-en-train.labeled
!wget https://raw.githubusercontent.com/neubig/nlptutorial/master/data/titles-en-test.labeled

Each line holds one sample: a label, a tab character, and the text. The label is either 1 or -1. In the samples below, 1 marks a sentence about a person.

```
1	FUJIWARA no Chikamori ( year of birth and death unknown ) was a samurai and poet who lived at the end of the Heian period .
-1	Yomi is the world of the dead .
```

## Loading Data

We will load the data into a list of `(text, label)` tuples.

In [2]:
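def load_data(file_path):
    """Read a tab-separated file of `label<TAB>text` lines into (text, label) tuples."""
    data = []
    with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
        for line in f:
            line = line.strip()
            if line == '':
                continue  # skip blank lines
            # Split on the first tab only, so a tab inside the text cannot break unpacking.
            lb, text = line.split('\t', 1)
            data.append((text, int(lb)))

    return data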

Loading data from files

In [3]:
train_data = load_data('./titles-en-train.labeled')
test_data = load_data('./titles-en-test.labeled')

In [4]:
train_data[0]

('FUJIWARA no Chikamori ( year of birth and death unknown ) was a samurai and poet who lived at the end of the Heian period .',
 1)
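Since accuracy is the evaluation measure, it can be useful to check how the two labels are distributed before training. The following is a minimal sketch: `label_counts` and `toy_data` are hypothetical names introduced here for illustration, and the toy sentences are shortened stand-ins, not the real dataset.

```python
from collections import Counter

def label_counts(data):
    """Count how many samples carry each label in a list of (text, label) pairs."""
    return Counter(label for _, label in data)

# Toy data in the same (text, label) shape that load_data returns.
toy_data = [
    ("FUJIWARA no Chikamori was a samurai and poet .", 1),
    ("Yomi is the world of the dead .", -1),
    ("Bojo family were kuge .", -1),
]

counts = label_counts(toy_data)
print("label  1:", counts[1])
print("label -1:", counts[-1])
```

In the notebook, the same function could be called as `label_counts(train_data)`.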

## Building Perceptron Model

We will implement a `Perceptron` class with the following methods:

- `create_features`: extract features from a sentence. For simplicity, we use binary unigram and bigram features
- `train`: train the perceptron model on the training data
- `update_weights`: adjust the weights after a wrong prediction
- `predict_one`: predict the label for one sample, given its features
- `classify`: extract features from one sentence and predict its label
- `predict_all`: predict labels for all sentences in the test data
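To make the feature representation concrete, here is a small standalone sketch of the kind of feature dictionary the notebook's `create_features` builds, using the same `UNI:` and `BI:` key prefixes. The function name `extract_features` is a hypothetical stand-in used only for this illustration.

```python
def extract_features(sentence):
    """Return binary unigram and bigram indicator features for a sentence."""
    words = sentence.split()
    phi = {}
    # One indicator per distinct word.
    for word in words:
        phi["UNI:" + word] = 1
    # One indicator per distinct pair of adjacent words.
    for first, second in zip(words, words[1:]):
        phi["BI:" + first + " " + second] = 1
    return phi

features = extract_features("Yomi is the world")
for name in features:
    print(name)
```

Each feature has the value 1 no matter how often it occurs, so a repeated word does not increase its weight in the score.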

In [15]:
"""
Implementation of Perceptron model
"""
from collections import defaultdict

class Perceptron:
    """Perceptron classifier

    Parameters
    ----------
    eta: learning rate
    n_iter: number of passes (epochs) over the training data
    """
    def __init__(self, eta=0.001, n_iter=10):
        self.eta = eta
        self.n_iter = n_iter

    def train(self, data):
        """Training the model

        Parameters
        ----------
        data: list of tuples (x, y) where x is a sentence and y is its label (1 or -1)

        Returns
        -------
        None. The learned weights are stored in `self.w`.
        """
        self.w = defaultdict(int)
        for _ in range(self.n_iter):
            for x, y in data:
                phi = self.create_features(x)
                y_pred = self.predict_one(self.w, phi)
                if y != y_pred:
                    self.update_weights(self.w, phi, y)

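    def predict_one(self, w, phi):
        """Return 1 if the weighted feature sum is non-negative, otherwise -1."""
        score = 0
        for name, value in phi.items():
            # Check membership first so lookups do not add keys to the defaultdict.
            if name in w:
                score += value * w[name]
        return 1 if score >= 0 else -1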

    def create_features(self, x):
        phi = {}
        words = x.split()
        for word in words:
            phi["UNI:" + word] = 1
        for i in range(len(words)-1):
            phi["BI:" + words[i] + " " + words[i+1]] = 1
        return phi

    def update_weights(self, w, phi, y):
        # Move each active feature's weight toward the true label, scaled by the learning rate.
        for name, value in phi.items():
            w[name] += self.eta * value * y

    def classify(self, x):
        phi = self.create_features(x)
        return self.predict_one(self.w, phi)

    def predict_all(self, test_samples):
        y_preds = []
        for x in test_samples:
            y_pred = self.classify(x)
            y_preds.append(y_pred)
        return y_preds
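The update rule can be traced by hand on a single sample. The sketch below is a standalone illustration and does not reuse the class above: `score`, `predict`, and `update` are hypothetical helper names, and the learning rate is assumed to be 1.

```python
from collections import defaultdict

def score(w, phi):
    """Weighted sum of the active features."""
    return sum(value * w.get(name, 0) for name, value in phi.items())

def predict(w, phi):
    """Predict 1 when the score is non-negative, otherwise -1."""
    return 1 if score(w, phi) >= 0 else -1

def update(w, phi, y, eta=1):
    """Shift each active feature's weight toward the true label y."""
    for name, value in phi.items():
        w[name] += eta * value * y

w = defaultdict(int)                 # all weights start at zero
phi = {"UNI:Yomi": 1, "UNI:is": 1}   # toy feature dictionary
y = -1                               # true label

before = predict(w, phi)             # score is 0, so the prediction is 1
if before != y:
    update(w, phi, y)                # wrong prediction: adjust the weights
after = predict(w, phi)              # score is now -2, so the prediction is -1

print(before, after, dict(w))
```

A correct prediction leaves the weights untouched, so the model only changes on mistakes.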

## Training the model

In [16]:
model = Perceptron(eta=1)
model.train(train_data)

## Prediction

In [12]:
test_data[0]

('Bojo family were kuge ( court nobles ) with kakaku ( family status ) of meike ( the fourth highest status for court nobles ) .',
 -1)

In [13]:
model.classify(test_data[0][0])

-1

In [9]:
test_data[1]

('Kotaifujin ( also called Sumemioya ) means a person who was the biological mother of an Emperor and consort of the previous Emperor .',
 1)

In [10]:
model.classify(test_data[1][0])

1

## Evaluation

In [17]:
from sklearn import metrics

X_test, y_true = zip(*test_data)
y_preds = model.predict_all(X_test)

print("Accuracy: ", metrics.accuracy_score(y_true, y_preds))

Accuracy:  0.9376549769748495
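Accuracy is the fraction of predictions that match the true labels. As a rough sketch of what that metric computes, the following standalone function reproduces it on toy labels. The name `accuracy` and the toy lists are illustrative assumptions, not part of the notebook.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Toy labels: three of the four predictions are correct.
toy_true = [1, -1, -1, 1]
toy_pred = [1, -1, 1, 1]
toy_accuracy = accuracy(toy_true, toy_pred)
print(toy_accuracy)
```

Comparing the model's accuracy with the share of the most frequent label in the test set gives a simple baseline for judging the result.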
