{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [] }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "source": [ "# Perceptron Algorithm for Text Classification\n" ], "metadata": { "id": "WXSfxa6jQvFU" } }, { "cell_type": "markdown", "source": [ "## Task Description\n", "\n", "- We will train a binary classification model to determine whether a title is about a person, using this [training dataset](https://raw.githubusercontent.com/neubig/nlptutorial/master/data/titles-en-train.labeled).\n", "- We will evaluate the model on a [test dataset](https://raw.githubusercontent.com/neubig/nlptutorial/master/data/titles-en-test.labeled), with accuracy as the evaluation measure." ], "metadata": { "id": "CONyRgxLSPbL" } }, { "cell_type": "markdown", "source": [ "## Downloading the dataset" ], "metadata": { "id": "P8f5N_znSTzv" } }, { "cell_type": "code", "source": [ "%%capture\n", "!rm -f titles-en-train.labeled\n", "!rm -f titles-en-test.labeled\n", "\n", "!wget https://raw.githubusercontent.com/neubig/nlptutorial/master/data/titles-en-train.labeled\n", "!wget https://raw.githubusercontent.com/neubig/nlptutorial/master/data/titles-en-test.labeled" ], "metadata": { "id": "v5aks5DnSyal" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Each sample occupies one line: a label and the text, separated by a tab. There are two labels in the data: 1 (the title is about a person) and -1 (it is not).\n", "\n", "```\n", "1\tFUJIWARA no Chikamori ( year of birth and death unknown ) was a samurai and poet who lived at the end of the Heian period .\n", "-1\tYomi is the world of the dead .\n", "```" ], "metadata": { "id": "8yIjkHPySzVy" } }, { "cell_type": "markdown", "source": [ "## Loading Data\n", "\n", "We will load the data into a list of (sentence, label) pairs." 
], "metadata": { "id": "6tWeuXH5S2Vq" } }, { "cell_type": "code", "source": [ "def load_data(file_path):\n", "    \"\"\"Read tab-separated label/text lines into a list of (text, label) tuples.\"\"\"\n", "    data = []\n", "    with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:\n", "        for line in f:\n", "            line = line.strip()\n", "            if line == '':\n", "                continue\n", "            # Split on the first tab only, so a tab inside the text cannot break unpacking\n", "            lb, text = line.split('\\t', 1)\n", "            data.append((text, int(lb)))\n", "\n", "    return data" ], "metadata": { "id": "hJ4kiQMVS5TR" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Load the training and test data from the downloaded files." ], "metadata": { "id": "xAcpv2SbS7EN" } }, { "cell_type": "code", "source": [ "train_data = load_data('./titles-en-train.labeled')\n", "test_data = load_data('./titles-en-test.labeled')" ], "metadata": { "id": "2chTBZFzS-Iz" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "train_data[0]" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Rl8pw2csS_ca", "outputId": "b95ec9c3-5a0e-4b1d-a75a-9bafe3cdebe8" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "('FUJIWARA no Chikamori ( year of birth and death unknown ) was a samurai and poet who lived at the end of the Heian period .',\n", " 1)" ] }, "metadata": {}, "execution_count": 4 } ] }, { "cell_type": "markdown", "source": [ "## Building the Perceptron Model\n", "\n", "The Perceptron class below implements training and prediction. The three methods marked TODO (`predict_one`, `create_features`, `update_weights`) are the parts you need to complete." 
], "metadata": { "id": "UHpcwG5pTBOC" } }, { "cell_type": "code", "source": [ "\"\"\"\n", "Implementation of the Perceptron model\n", "\"\"\"\n", "from collections import defaultdict\n", "\n", "class Perceptron:\n", "    \"\"\"Perceptron classifier\n", "    \"\"\"\n", "    def __init__(self, n_iter=10):\n", "        self.n_iter = n_iter\n", "\n", "    def train(self, data):\n", "        \"\"\"Train the model\n", "\n", "        Parameters\n", "        ----------\n", "        data: list of tuples (x, y) where x is a sentence and y is the label\n", "\n", "        Returns\n", "        -------\n", "        self : object\n", "        \"\"\"\n", "        self.w = defaultdict(float)\n", "        for _ in range(self.n_iter):\n", "            for x, y in data:\n", "                phi = self.create_features(x)\n", "                y_pred = self.predict_one(self.w, phi)\n", "                # Update the weights only when the prediction is wrong\n", "                if y != y_pred:\n", "                    self.update_weights(self.w, phi, y)\n", "        return self\n", "\n", "    def predict_one(self, w, phi):\n", "        \"\"\"Predict the label for one feature vector\n", "\n", "        Parameters\n", "        ------------\n", "        w (dict): weights of features\n", "        phi (dict): features extracted from input\n", "\n", "        Returns\n", "        ------------\n", "        label for the input sentence (1 or -1)\n", "        \"\"\"\n", "        # TODO: Write your code here. The lines below are one possible\n", "        # reference solution (an assumption, not the official one):\n", "        # take the sign of the weighted feature sum, treating 0 as positive.\n", "        score = sum(w.get(name, 0.0) * value for name, value in phi.items())\n", "        return 1 if score >= 0 else -1\n", "\n", "    def create_features(self, x):\n", "        \"\"\"\n", "        Parameters\n", "        -----------------\n", "        x (str): Input sentence\n", "\n", "        Returns\n", "        -----------------\n", "        phi: dictionary, feature vector\n", "        \"\"\"\n", "        # TODO: Write your code here. Reference sketch (an assumption):\n", "        # unigram counts over whitespace-separated tokens.\n", "        phi = defaultdict(int)\n", "        for word in x.split():\n", "            phi['UNI:' + word] += 1\n", "        return phi\n", "\n", "    def update_weights(self, w, phi, y):\n", "        \"\"\"\n", "        Parameters\n", "        -----------------\n", "        w (dict): weights of features\n", "        phi (dict): features extracted from input\n", "        y (int): Gold label (1 or -1)\n", "\n", "        Returns\n", "        -----------------\n", "        None\n", "        \"\"\"\n", "        # TODO: Write your code here. Reference sketch (an assumption):\n", "        # the standard perceptron update, moving w toward y * phi.\n", "        for name, value in phi.items():\n", "            w[name] += value * y\n", "\n", "    def classify(self, x):\n", "        \"\"\"Predict the label (1 or -1) for one raw sentence.\"\"\"\n", "        phi = self.create_features(x)\n", "        return self.predict_one(self.w, phi)\n", "\n", "    def predict_all(self, test_samples):\n", "        \"\"\"Predict labels for a sequence of raw sentences.\"\"\"\n", "        y_preds = []\n", "        for x in test_samples:\n", "            y_pred = self.classify(x)\n", "            y_preds.append(y_pred)\n", "        return y_preds" ], "metadata": { "id": "sIq3HDzLTbst" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "## Training the model" ], "metadata": { "id": "sqqTLXwPUOXq" } }, { "cell_type": "code", "source": [ "model = Perceptron()\n", "model.train(train_data)" ], "metadata": { "id": "n2VTLTtmURRE" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "## Evaluation" ], "metadata": { "id": "KrJx6LJ4UTXH" } }, { "cell_type": "markdown", "source": [ "Evaluate the model on the test data and report its accuracy here." ], "metadata": { "id": "6VeO44KuUVuN" } }, { "cell_type": "code", "source": [ "from sklearn import metrics\n", "\n", "# Separate the test sentences from their gold labels\n", "X_test, y_true = zip(*test_data)\n", "y_preds = model.predict_all(X_test)\n", "\n", "print(\"Accuracy:\", metrics.accuracy_score(y_true, y_preds))" ], "metadata": { "id": "zHacEpheUcbh" }, "execution_count": null, "outputs": [] } ] }