{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# Perceptron Algorithm for Text Classification\n"
      ],
      "metadata": {
        "id": "WXSfxa6jQvFU"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Task Description\n",
        "\n",
        "- We will train a binary classification model to determine that a title is about a person. We will use the train dataset [here](https://raw.githubusercontent.com/neubig/nlptutorial/master/data/titles-en-train.labeled)\n",
        "- We will evaluate the model on a [test dataset](https://raw.githubusercontent.com/neubig/nlptutorial/master/data/titles-en-test.labeled). We use accuracy as the evaluation measure."
      ],
      "metadata": {
        "id": "CONyRgxLSPbL"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Downloading dataset"
      ],
      "metadata": {
        "id": "P8f5N_znSTzv"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "%%capture\n",
        "!rm -f titles-en-train.labeled\n",
        "!rm -f titles-en-test.labeled\n",
        "\n",
        "!wget https://raw.githubusercontent.com/neubig/nlptutorial/master/data/titles-en-train.labeled\n",
        "!wget https://raw.githubusercontent.com/neubig/nlptutorial/master/data/titles-en-test.labeled"
      ],
      "metadata": {
        "id": "v5aks5DnSyal"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Each sample is written in a line. There are two labels {1, -1} in the data.\n",
        "\n",
        "```\n",
        "1\tFUJIWARA no Chikamori ( year of birth and death unknown ) was a samurai and poet who lived at the end of the Heian period .\n",
        "-1\tYomi is the world of the dead .\n",
        "```"
      ],
      "metadata": {
        "id": "8yIjkHPySzVy"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Loading Data\n",
        "\n",
        "We will load data into a list of sentences with their labels."
      ],
      "metadata": {
        "id": "6tWeuXH5S2Vq"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "def load_data(file_path):\n",
        "    data = []\n",
        "    with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:\n",
        "        for line in f:\n",
        "            line = line.strip()\n",
        "            if line == '':\n",
        "                continue\n",
        "            lb, text = line.split('\\t')\n",
        "            data.append((text,int(lb)))\n",
        "\n",
        "    return data"
      ],
      "metadata": {
        "id": "hJ4kiQMVS5TR"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Loading data from files"
      ],
      "metadata": {
        "id": "xAcpv2SbS7EN"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "train_data = load_data('./titles-en-train.labeled')\n",
        "test_data = load_data('./titles-en-test.labeled')"
      ],
      "metadata": {
        "id": "2chTBZFzS-Iz"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "train_data[0]"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "Rl8pw2csS_ca",
        "outputId": "b95ec9c3-5a0e-4b1d-a75a-9bafe3cdebe8"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "('FUJIWARA no Chikamori ( year of birth and death unknown ) was a samurai and poet who lived at the end of the Heian period .',\n",
              " 1)"
            ]
          },
          "metadata": {},
          "execution_count": 4
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Building Perceptron Model\n",
        "\n",
        "You will need to complete the implementation of the Perceptron class as follows."
      ],
      "metadata": {
        "id": "UHpcwG5pTBOC"
      }
    },
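    {
      "cell_type": "markdown",
      "source": [
        "As a reference for the three `TODO` methods, the standard perceptron can be written as follows. This is a sketch rather than the required solution, and it assumes unigram count features, which this notebook does not prescribe.\n",
        "\n",
        "$$\\phi_k(x) = \\text{number of times word } k \\text{ occurs in } x$$\n",
        "\n",
        "$$\\hat{y} = \\begin{cases} 1 & \\text{if } \\sum_k w_k \\, \\phi_k(x) \\ge 0 \\\\ -1 & \\text{otherwise} \\end{cases}$$\n",
        "\n",
        "$$w_k \\leftarrow w_k + y \\, \\phi_k(x) \\quad \\text{for every feature } k \\text{ in } \\phi(x), \\text{ applied only when } \\hat{y} \\ne y$$\n",
        "\n",
        "Treating a score of exactly 0 as the label 1 is a convention assumed here; with all weights initialised to 0, it decides the very first prediction."
      ],
      "metadata": {}
    },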
    {
      "cell_type": "code",
      "source": [
        "\"\"\"\n",
        "Implementation of Perceptron model\n",
        "\"\"\"\n",
        "from collections import defaultdict\n",
        "\n",
        "class Perceptron:\n",
        "    \"\"\"Perceptron classifier\n",
        "    \"\"\"\n",
        "    def __init__(self, n_iter=10):\n",
        "        self.n_iter = n_iter\n",
        "\n",
        "    def train(self, data):\n",
        "        \"\"\"Training the model\n",
        "\n",
        "        Parameters\n",
        "        ----------\n",
        "        data: list of tuples (x,y) where x is a sentence and y is the label\n",
        "\n",
        "        Returns\n",
        "        -------\n",
        "        self : object\n",
        "        \"\"\"\n",
        "        self.w = defaultdict(float)\n",
        "        for _ in range(self.n_iter):\n",
        "            for x, y in data:\n",
        "                phi = self.create_features(x)\n",
        "                y_pred = self.predict_one(self.w, phi)\n",
        "                if y != y_pred:\n",
        "                        self.update_weights(self.w, phi, y)\n",
        "\n",
        "    def predict_one(self, w, phi):\n",
        "        \"\"\"\n",
        "\n",
        "        Parameters\n",
        "        ------------\n",
        "        w (dict): weights of features\n",
        "        phi (dict): features extracted from input\n",
        "\n",
        "        Returns\n",
        "        ------------\n",
        "        label for the input sentence (1 or -1)\n",
        "        \"\"\"\n",
        "        #TODO: Write your code here\n",
        "        pass\n",
        "\n",
        "    def create_features(self, x):\n",
        "        \"\"\"\n",
        "        Parameters\n",
        "        -----------------\n",
        "        x (str): Input sentence\n",
        "\n",
        "        Returns\n",
        "        -----------------\n",
        "        phi: dictionary, feature vector\n",
        "        \"\"\"\n",
        "        #TODO: Write your code here\n",
        "        pass\n",
        "\n",
        "    def update_weights(self, w, phi, y):\n",
        "        \"\"\"\n",
        "        Parameters\n",
        "        -----------------\n",
        "        w (dict): weights of features\n",
        "        phi (dict): features extracted from input\n",
        "        y (int): Gold label (1 or -1)\n",
        "\n",
        "        Returns\n",
        "        -----------------\n",
        "        None\n",
        "        \"\"\"\n",
        "        #TODO: Write your code here\n",
        "        pass\n",
        "\n",
        "    def classify(self, x):\n",
        "        phi = self.create_features(x)\n",
        "        return self.predict_one(self.w, phi)\n",
        "\n",
        "    def predict_all(self, test_samples):\n",
        "        y_preds = []\n",
        "        for x in test_samples:\n",
        "            y_pred = self.classify(x)\n",
        "            y_preds.append(y_pred)\n",
        "        return y_preds"
      ],
      "metadata": {
        "id": "sIq3HDzLTbst"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Training the model"
      ],
      "metadata": {
        "id": "sqqTLXwPUOXq"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "model = Perceptron()\n",
        "model.train(train_data)"
      ],
      "metadata": {
        "id": "n2VTLTtmURRE"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Evaluation"
      ],
      "metadata": {
        "id": "KrJx6LJ4UTXH"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "You need to evaluate the model on the test data and report the accuracy here."
      ],
      "metadata": {
        "id": "6VeO44KuUVuN"
      }
    },
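    {
      "cell_type": "markdown",
      "source": [
        "For reference, accuracy is the fraction of test samples whose predicted label matches the gold label. With $N$ test samples, gold labels $y_i$, and predictions $\\hat{y}_i$ (symbols introduced here for illustration):\n",
        "\n",
        "$$\\text{accuracy} = \\frac{1}{N} \\sum_{i=1}^{N} \\mathbb{1}[\\hat{y}_i = y_i]$$\n",
        "\n",
        "For example, 3 correct predictions out of 4 samples give an accuracy of $3/4 = 0.75$."
      ],
      "metadata": {}
    },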
    {
      "cell_type": "code",
      "source": [
        "from sklearn import metrics\n",
        "\n",
        "X_test, y_true = zip(*test_data)\n",
        "y_preds = model.predict_all(X_test)\n",
        "\n",
        "print(\"Accuracy: \", metrics.accuracy_score(y_true, y_preds))"
      ],
      "metadata": {
        "id": "zHacEpheUcbh"
      },
      "execution_count": null,
      "outputs": []
    }
  ]
}