{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"provenance":[]},"kernelspec":{"name":"python3","display_name":"Python 3"},"language_info":{"name":"python"}},"cells":[{"cell_type":"markdown","source":["# Perceptron Algorithm for Text Classification\n","\n","January 6, 2023\n","\n","Contents of the notebook:\n","\n","- Implement Perceptron algorithm in Python from scratch\n","- Train the model on the labeled data\n","- Evaluate the model on the test dataset"],"metadata":{"id":"jXVdWWeSnMV2"}},{"cell_type":"markdown","source":["## Task description\n","\n","- We will train a binary classification model to determine that a title is about a person. We will use the train dataset [here](https://raw.githubusercontent.com/neubig/nlptutorial/master/data/titles-en-train.labeled)\n","- We will evaluate the model on a [test dataset](https://raw.githubusercontent.com/neubig/nlptutorial/master/data/titles-en-test.labeled). We use accuracy as the evaluation measure.\n","\n"],"metadata":{"id":"8DgcYjuppIOy"}},{"cell_type":"markdown","source":["## Downloading dataset"],"metadata":{"id":"s1Omg1Q3qvIR"}},{"cell_type":"code","source":["%%capture\n","!rm -f titles-en-train.labeled\n","!rm -f titles-en-test.labeled\n","\n","!wget https://raw.githubusercontent.com/neubig/nlptutorial/master/data/titles-en-train.labeled\n","!wget https://raw.githubusercontent.com/neubig/nlptutorial/master/data/titles-en-test.labeled"],"metadata":{"id":"02B3tmGFqzO9","executionInfo":{"status":"ok","timestamp":1706339161421,"user_tz":-420,"elapsed":943,"user":{"displayName":"Minh Pham","userId":"01293297774691882951"}}},"execution_count":1,"outputs":[]},{"cell_type":"markdown","source":["Each sample is written in a line. There are two labels {1, -1} in the data.\n","\n","```\n","1\tFUJIWARA no Chikamori ( year of birth and death unknown ) was a samurai and poet who lived at the end of the Heian period .\n","-1\tYomi is the world of the dead .\n","```"],"metadata":{"id":"K-KDKP3krEVG"}},{"cell_type":"markdown","source":["## Loading Data\n","\n","We will load data into a list of sentences with their labels."],"metadata":{"id":"sJx6VFrDrnet"}},{"cell_type":"code","source":["def load_data(file_path):\n"," data = []\n"," with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:\n"," for line in f:\n"," line = line.strip()\n"," if line == '':\n"," continue\n"," lb, text = line.split('\\t')\n"," data.append((text,int(lb)))\n","\n"," return data"],"metadata":{"id":"ImPi5e1ssUQd","executionInfo":{"status":"ok","timestamp":1706339214799,"user_tz":-420,"elapsed":440,"user":{"displayName":"Minh Pham","userId":"01293297774691882951"}}},"execution_count":2,"outputs":[]},{"cell_type":"markdown","source":["Loading data from files"],"metadata":{"id":"vf7xFh3ltCx2"}},{"cell_type":"code","source":["train_data = load_data('./titles-en-train.labeled')\n","test_data = load_data('./titles-en-test.labeled')"],"metadata":{"id":"DGHFrGwzs-L7","executionInfo":{"status":"ok","timestamp":1706339232560,"user_tz":-420,"elapsed":408,"user":{"displayName":"Minh Pham","userId":"01293297774691882951"}}},"execution_count":3,"outputs":[]},{"cell_type":"code","source":["train_data[0]"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"J3ji8asktL2h","outputId":"5f2efb27-32f7-4586-9df8-e7290dc97052","executionInfo":{"status":"ok","timestamp":1706339237089,"user_tz":-420,"elapsed":432,"user":{"displayName":"Minh Pham","userId":"01293297774691882951"}}},"execution_count":4,"outputs":[{"output_type":"execute_result","data":{"text/plain":["('FUJIWARA no Chikamori ( year of birth and death unknown ) was a samurai and poet who lived at the end of the Heian period .',\n"," 1)"]},"metadata":{},"execution_count":4}]},{"cell_type":"markdown","source":["## Building Perceptron Model\n","\n","We will implement the class Perceptron with following methods:\n","\n","- `create_features`: to extract features from a sentence. For the sake of the simplicity, we will use unigram features\n","- `train`: train the Perceptron model on the training data\n","- `predict_one`: Predict the label for on sample\n","- `predict_all`: Predict labels for all sentences in the test data"],"metadata":{"id":"hvL6sL1Auvps"}},{"cell_type":"code","source":["\"\"\"\n","Implementation of Perceptron model\n","\"\"\"\n","from collections import defaultdict\n","\n","class Perceptron:\n"," \"\"\"Perceptron classifier\n"," \"\"\"\n"," def __init__(self, eta=0.001, n_iter=10):\n"," self.eta = eta\n"," self.n_iter = n_iter\n","\n"," def train(self, data):\n"," \"\"\"Training the model\n","\n"," Parameters\n"," ----------\n"," data: list of tuples (x,y) where x is a sentence and y is the label\n","\n"," Returns\n"," -------\n"," self : object\n"," \"\"\"\n"," self.w = defaultdict(int)\n"," for _ in range(self.n_iter):\n"," for x, y in data:\n"," phi = self.create_features(x)\n"," y_pred = self.predict_one(self.w, phi)\n"," if y != y_pred:\n"," self.update_weights(self.w, phi, y)\n","\n"," def predict_one(self, w, phi):\n"," score = 0\n"," for name, value in phi.items():\n"," if name in w:\n"," score += value * w[name]\n"," if score >= 0:\n"," return 1\n"," else:\n"," return -1\n","\n"," def create_features(self, x):\n"," phi = defaultdict()\n"," words = x.split()\n"," for word in words:\n"," phi[\"UNI:\" + word] = 1\n"," for i in range(len(words)-1):\n"," phi[\"BI:\" + words[i] + \" \" + words[i+1]] = 1\n"," return phi\n","\n"," def update_weights(self, w, phi, y):\n"," for name, value in phi.items():\n"," w[name] += value * y\n","\n"," def classify(self, x):\n"," phi = self.create_features(x)\n"," return self.predict_one(self.w, phi)\n","\n"," def predict_all(self, test_samples):\n"," y_preds = []\n"," for x in test_samples:\n"," y_pred = self.classify(x)\n"," y_preds.append(y_pred)\n"," return y_preds"],"metadata":{"id":"GxeJ6hE_wCCd","executionInfo":{"status":"ok","timestamp":1706339806009,"user_tz":-420,"elapsed":438,"user":{"displayName":"Minh Pham","userId":"01293297774691882951"}}},"execution_count":15,"outputs":[]},{"cell_type":"markdown","source":["## Training the model"],"metadata":{"id":"Tt16CnHsKy-u"}},{"cell_type":"code","source":["model = Perceptron(eta=1)\n","model.train(train_data)"],"metadata":{"id":"CU2f85Y1K6a-","executionInfo":{"status":"ok","timestamp":1706339814402,"user_tz":-420,"elapsed":4329,"user":{"displayName":"Minh Pham","userId":"01293297774691882951"}}},"execution_count":16,"outputs":[]},{"cell_type":"markdown","source":["## Prediction"],"metadata":{"id":"Iun_BU4GLB7m"}},{"cell_type":"code","source":["test_data[0]"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"uGPvnuvjLz2u","outputId":"70567e0e-af0e-4f00-95e9-e4634f20da73","executionInfo":{"status":"ok","timestamp":1706339613270,"user_tz":-420,"elapsed":596,"user":{"displayName":"Minh Pham","userId":"01293297774691882951"}}},"execution_count":12,"outputs":[{"output_type":"execute_result","data":{"text/plain":["('Bojo family were kuge ( court nobles ) with kakaku ( family status ) of meike ( the fourth highest status for court nobles ) .',\n"," -1)"]},"metadata":{},"execution_count":12}]},{"cell_type":"code","source":["model.classify(test_data[0][0])"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"cEHSvd52MIlj","outputId":"a2e87dc0-41b7-4572-e419-be5d4c892e5c","executionInfo":{"status":"ok","timestamp":1706339616458,"user_tz":-420,"elapsed":481,"user":{"displayName":"Minh Pham","userId":"01293297774691882951"}}},"execution_count":13,"outputs":[{"output_type":"execute_result","data":{"text/plain":["-1"]},"metadata":{},"execution_count":13}]},{"cell_type":"code","source":["test_data[1]"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"BnFK91LpMNO6","outputId":"b9475a89-2ffe-4d48-abe1-b5e7f34c0a1f","executionInfo":{"status":"ok","timestamp":1706339585253,"user_tz":-420,"elapsed":414,"user":{"displayName":"Minh Pham","userId":"01293297774691882951"}}},"execution_count":9,"outputs":[{"output_type":"execute_result","data":{"text/plain":["('Kotaifujin ( also called Sumemioya ) means a person who was the biological mother of an Emperor and consort of the previous Emperor .',\n"," 1)"]},"metadata":{},"execution_count":9}]},{"cell_type":"code","source":["model.classify(test_data[1][0])"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"_FJMB1GKMO6x","outputId":"bba66edf-f63a-4a75-e8b9-5f8f24ce027d","executionInfo":{"status":"ok","timestamp":1706339588570,"user_tz":-420,"elapsed":3,"user":{"displayName":"Minh Pham","userId":"01293297774691882951"}}},"execution_count":10,"outputs":[{"output_type":"execute_result","data":{"text/plain":["1"]},"metadata":{},"execution_count":10}]},{"cell_type":"markdown","source":["## Evaluation"],"metadata":{"id":"IFNRjX6WMRDK"}},{"cell_type":"code","source":["from sklearn import metrics\n","\n","X_test, y_true = zip(*test_data)\n","y_preds = model.predict_all(X_test)\n","\n","print(\"Accuracy: \", metrics.accuracy_score(y_true, y_preds))"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"EHu9U_EDM5Fn","outputId":"58fb5fee-0145-4304-b9c1-ee9eb04abdb7","executionInfo":{"status":"ok","timestamp":1706339820275,"user_tz":-420,"elapsed":469,"user":{"displayName":"Minh Pham","userId":"01293297774691882951"}}},"execution_count":17,"outputs":[{"output_type":"stream","name":"stdout","text":["Accuracy: 0.9376549769748495\n"]}]},{"cell_type":"code","source":[],"metadata":{"id":"i5Qkxw2oNFJG"},"execution_count":null,"outputs":[]}]}