{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "nBXYLRO4owGI" }, "source": [ "# Text Classification Using Feed-forward Neural Networks\n", "\n", "In this tutorial, we are going to implement a Feed-forward Neural Network model for the text classification problem.\n", "\n", "Our FFNN contains the following layers:\n", "- Input layer\n", "- Hidden layer\n", "- Output layer\n", "\n", "The pipeline to implement the model is as follows:\n", "- We need feature representations for the input. The common approach is to represent each document as a vector.\n", "- Implement the model using the `torch.nn` module in the PyTorch framework\n", "- Train the model on the training dataset\n", "- Evaluate the model on the test dataset" ] }, { "cell_type": "markdown", "metadata": { "id": "QHhmj1KlK4rj" }, "source": [ "## Dataset\n", "\n", "We are going to use [BBC Full Text Document Classification](https://www.kaggle.com/datasets/alfathterry/bbc-full-text-document-classification) for this tutorial." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "6ahwUP3rNy8J" }, "outputs": [], "source": [ "%%capture\n", "!pip install kaggle" ] }, { "cell_type": "markdown", "metadata": { "id": "sVRUxWBROTkE" }, "source": [ "Download the dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "UhP902PtOf6v", "outputId": "e93d178a-0f17-47e4-c1cd-f51d2acfb2f8" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Dataset URL: https://www.kaggle.com/datasets/alfathterry/bbc-full-text-document-classification\n", "License(s): MIT\n", "Downloading bbc-full-text-document-classification.zip to /content\n", "100% 1.84M/1.84M [00:01<00:00, 1.97MB/s]\n", "100% 1.84M/1.84M [00:01<00:00, 1.74MB/s]\n" ] } ], "source": [ "!rm -f bbc-full-text-document-classification.zip\n", "!kaggle datasets download alfathterry/bbc-full-text-document-classification" ] }, { "cell_type": "code", "execution_count": 
null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "4lpfgM-iOkr5", "outputId": "df3ff95b-47c9-4f45-912a-5b3e2c3415e2" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Archive: bbc-full-text-document-classification.zip\n", " inflating: bbc_data.csv \n" ] } ], "source": [ "!rm -f bbc_data.csv\n", "!unzip bbc-full-text-document-classification.zip" ] }, { "cell_type": "markdown", "metadata": { "id": "T3eCC14AOtO1" }, "source": [ "The data file includes two columns: `data` and `labels`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "tDW_WHkhQ_yW", "outputId": "7ef90c35-dffd-43a6-f680-f91ea377f0e0" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "\n", "RangeIndex: 2225 entries, 0 to 2224\n", "Data columns (total 2 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 data 2225 non-null object\n", " 1 labels 2225 non-null object\n", "dtypes: object(2)\n", "memory usage: 34.9+ KB\n", "None\n", "\n", "Description of numerical features:\n", " data labels\n", "count 2225 2225\n", "unique 2126 5\n", "top Web radio takes Spanish rap global Spin the r... sport\n", "freq 2 511\n", "\n", "Number of samples per class:\n", "labels\n", "sport 511\n", "business 510\n", "politics 417\n", "tech 401\n", "entertainment 386\n", "Name: count, dtype: int64\n" ] } ], "source": [ "# prompt: Give me statistics about the dataset\n", "\n", "import pandas as pd\n", "\n", "# Load the dataset\n", "try:\n", " df = pd.read_csv('bbc_data.csv')\n", " # Display basic statistics\n", " print(df.info())\n", " print(\"\\nDescription of numerical features:\")\n", " print(df.describe())\n", " print(\"\\nNumber of samples per class:\")\n", " print(df['labels'].value_counts())\n", "except FileNotFoundError:\n", " print(\"Error: 'bbc_data.csv' not found. 
Please make sure the dataset is downloaded and unzipped correctly.\")\n", "except Exception as e:\n", " print(f\"An unexpected error occurred: {e}\")" ] }, { "cell_type": "markdown", "metadata": { "id": "AyPPzTZ121R1" }, "source": [ "## Data Preprocessing\n", "\n", "We are going to perform data preprocessing. The preprocessing step includes:\n", "- Tokenization\n", "- Remove punctuation, stopwords, and special characters\n", "- Save the cleaned data into the column \"clean_data\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "oNzjoiF33pnx", "outputId": "a3854397-680f-4e93-c302-10298ee36292" }, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "[nltk_data] Downloading package stopwords to /root/nltk_data...\n", "[nltk_data] Unzipping corpora/stopwords.zip.\n", "[nltk_data] Downloading package punkt_tab to /root/nltk_data...\n", "[nltk_data] Unzipping tokenizers/punkt_tab.zip.\n" ] } ], "source": [ "# prompt: We are going to perform data preprocessing for the text data in the data frame (column data).\n", "# 
The preprocessing step includes:\n", "# - Tokenization\n", "# - Remove punctuation, stopwords, special characters\n", "# - Save the clean data into the column \"clean_data\"\n", "\n", "import nltk\n", "import re\n", "from nltk.corpus import stopwords\n", "nltk.download('stopwords')\n", "nltk.download('punkt_tab')\n", "\n", "stop_words = set(stopwords.words('english'))\n", "\n", "def clean_text(text):\n", " text = re.sub(r'[^\\w\\s]', '', text) # Remove punctuation\n", " text = re.sub(r'\\d+', '', text) # Remove numbers\n", " text = text.lower() # Lowercase\n", " tokens = nltk.word_tokenize(text) # Tokenize\n", " tokens = [w for w in tokens if w not in stop_words] # Remove stopwords\n", " text = ' '.join(tokens)\n", " return text\n", "\n", "df['clean_data'] = df['data'].apply(clean_text)" ] }, { "cell_type": "markdown", "metadata": { "id": "jg6iHbUp4C32" }, "source": [ "## Train/Test Split" ] }, { "cell_type": "code", "source": [ "# prompt: Split data into train/valid/test with ratio 70/10/20\n", "\n", "from sklearn.model_selection import train_test_split\n", "\n", "# Assuming 'df' is your DataFrame and 'clean_data' and 'labels' are your features and target variable\n", "X = df['clean_data']\n", "y = df['labels']\n", "\n", "# Split data into training and temporary sets (70/30 split)\n", "X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42) # random_state for reproducibility\n", "\n", "# Split the temporary set into test (~20%) and validation (~10%) of the full data\n", "X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp, test_size=0.33, random_state=42)\n", "\n", "# Now you have X_train, y_train, X_val, y_val, X_test, and y_test\n", "\n", "print(f\"Training data size: {len(X_train)}\")\n", "print(f\"Validation data size: {len(X_val)}\")\n", "print(f\"Testing data size: {len(X_test)}\")" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "ZzriciJ-wdhA", "outputId": "e325337d-bb68-44d3-8ede-d85491860a9f" }, "execution_count": null, "outputs": [ { "output_type": 
"stream", "name": "stdout", "text": [ "Training data size: 1557\n", "Validation data size: 221\n", "Testing data size: 447\n" ] } ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 311 }, "id": "c0Ntehcm2qYz", "outputId": "8487f98e-1603-4c43-f028-62e5370ff661" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "'musicians tackle us red tape musicians groups tackle us visa regulations blamed hindering british acts chances succeeding across atlantic singer hoping perform us expect pay xcxa simply obtaining visa groups including musicians union calling end raw deal faced british performers us acts faced comparable expense bureaucracy visiting uk promotional purposes nigel mccune musicians union said british musicians disadvantaged compared us counterparts sponsor make petition behalf form amounting nearly pages musicians face tougher regulations athletes journalists make mistake form risk fiveyear ban thus ability career says mr mccune us worlds biggest music market means something done creaky bureaucracy says mr mccune current situation preventing british acts maintaining momentum developing us added musicians union stance endorsed music managers forum mmf say british artists face uphill struggle succeed us thanks tough visa requirements also seen impractical mmfs general secretary james seller said imagine orchestra orkneys every member would travel london visas processed us market seen holy grail one benchmarks success still going fight get still important markets like europe india china added mr seller department media culture sport spokeswoman said aware people experiencing problems working us embassy record industry see us embassy spokesman said aware entertainers require visas timespecific visas everything process applications speedily aware importance cultural exchange best facilitate added'" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": 
"string" } }, "metadata": {}, "execution_count": 7 } ], "source": [ "X_train[0]" ] }, { "cell_type": "markdown", "metadata": { "id": "or-daHne1YrS" }, "source": [ "## Feature Representation\n", "\n", "We convert documents into feature vectors using the following method:\n", "- Get vectors of the words in the document using the pre-trained GloVe model glove-wiki-gigaword-300. Use gensim to load the model\n", "- Average the word vectors of all words to get the feature representation of the document" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "GyXs91qF5jWo" }, "outputs": [], "source": [ "%%capture\n", "# prompt: We convert documents into feature vectors using the following method:\n", "# - Get vectors of the words in the document using the pre-trained GloVe model glove-wiki-gigaword-300. Use gensim to load the model\n", "# - Average the word vectors of all words to get the feature representation of the document\n", "\n", "!pip install gensim\n", "\n", "import gensim.downloader as api\n", "from gensim.models import KeyedVectors\n", "\n", "# Load the pre-trained GloVe model (kept under a separate name so the classifier defined later does not overwrite it)\n", "try:\n", " glove_model = api.load(\"glove-wiki-gigaword-300\")\n", "except Exception as e:\n", " print(f\"Error loading the model: {e}\")\n", " # Handle the error appropriately, e.g., exit or use a different model" ] }, { "cell_type": "markdown", "metadata": { "id": "6K_i697s6cZF" }, "source": [ "Calculate the feature representations of documents" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "UWP6HvVo6Z8O" }, "outputs": [], "source": [ "import numpy as np\n", "\n", "def document_vector(doc):\n", " words = doc.split()\n", " vectors = [glove_model[word] for word in words if word in glove_model]\n", " if vectors:\n", " return np.mean(vectors, axis=0)\n", " else:\n", " return np.zeros(300) # Return a zero vector if no words are found in the model\n", "\n", "# Example usage\n", "try:\n", " X_train_vectors = np.array([document_vector(doc) for doc in X_train])\n", " X_test_vectors = 
np.array([document_vector(doc) for doc in X_test])\n", " X_val_vectors = np.array([document_vector(doc) for doc in X_val])\n", "except NameError:\n", " print(\"X_train, X_test or X_val not defined. Please make sure to run the code that defines them first.\")\n", "except Exception as e:\n", " print(f\"An error occurred during vectorization: {e}\")" ] }, { "cell_type": "markdown", "metadata": { "id": "BmG5GB7t6kwI" }, "source": [ "Let's check the result" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "K1REMbtV6pd2", "outputId": "a0529ed2-63f7-4ad6-bc09-98a421762720" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "(1557, 300)" ] }, "metadata": {}, "execution_count": 11 } ], "source": [ "X_train_vectors.shape" ] }, { "cell_type": "markdown", "metadata": { "id": "tAGuWxrF6tC2" }, "source": [ "## Implement the model\n", "\n", "We will implement a feed-forward neural network model with one hidden layer. In the output layer, we apply the softmax function.\n", "\n", "Use `torch.nn` to implement the model" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "WZJU6jj87T69", "outputId": "377ca955-aad1-4e5b-d759-4fd1274b0570" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "FFNN(\n", " (fc1): Linear(in_features=300, out_features=128, bias=True)\n", " (relu): ReLU()\n", " (fc2): Linear(in_features=128, out_features=5, bias=True)\n", ")\n", "tensor([[-0.0403, -0.0123, 0.0227, -0.1573, -0.1787]],\n", " grad_fn=<AddmmBackward0>)\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ "tensor([[0.2060, 0.2119, 0.2194, 0.1833, 0.1794]], grad_fn=<SoftmaxBackward0>)" ] }, "metadata": {}, "execution_count": 13 } ], "source": [ "# prompt: We will implement a feed-forward neural network model with one hidden layer. 
In the output layer, we apply the softmax function.\n", "# Use `torch.nn` to implement the model\n", "\n", "import torch\n", "import torch.nn as nn\n", "import torch.nn.functional as F\n", "\n", "class FFNN(nn.Module):\n", " def __init__(self, input_size, hidden_size, num_classes):\n", " super(FFNN, self).__init__()\n", " self.fc1 = nn.Linear(input_size, hidden_size)\n", " self.relu = nn.ReLU()\n", " self.fc2 = nn.Linear(hidden_size, num_classes)\n", "\n", " def forward(self, x):\n", " out = self.fc1(x)\n", " out = self.relu(out)\n", " out = self.fc2(out)\n", " return out\n", "\n", "# Example usage\n", "input_size = 300 # Size of input vectors (from GloVe)\n", "hidden_size = 128 # Number of neurons in the hidden layer\n", "num_classes = 5 # Number of classes in the dataset (adjust as needed)\n", "\n", "model = FFNN(input_size, hidden_size, num_classes)\n", "print(model)\n", "# Example input\n", "input_tensor = torch.randn(1, input_size) # Example input\n", "output = model(input_tensor)\n", "print(output)\n", "# Apply softmax to get probabilities (note: training below uses CrossEntropyLoss, which expects the raw logits)\n", "softmax_output = F.softmax(output, dim=1)\n", "softmax_output\n" ] }, { "cell_type": "markdown", "metadata": { "id": "jng-A98a7g1j" }, "source": [ "## Training the model\n", "\n", "To train the model, we will write the training loop in PyTorch. Before that, we need to transform the data into tensor datasets." ] }, { "cell_type": "code", "source": [ "# prompt: Write the function train() that implements the training loop. We will train the above model on the training data. 
During training, calculate the accuracy on the validation data.\n", "# Note that\n", "# - We need shuffle training data but do not shuffle validation data\n", "# - Use RandomSampler, SequentialSampler object\n", "# After that:\n", "# - Create an object model of FFNN class\n", "# - Convert data into tensor dataset\n", "# - Train the model using the function train()\n", "\n", "from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler\n", "\n", "def train(model, train_loader, val_loader, criterion, optimizer, device, num_epochs=10):\n", " model.to(device)\n", " for epoch in range(num_epochs):\n", " model.train()\n", " running_loss = 0.0\n", " for inputs, labels in train_loader:\n", " inputs = inputs.to(device)\n", " labels = labels.to(device)\n", " optimizer.zero_grad()\n", " outputs = model(inputs)\n", " loss = criterion(outputs, labels)\n", " loss.backward()\n", " optimizer.step()\n", " running_loss += loss.item()\n", "\n", " epoch_loss = running_loss / len(train_loader)\n", " print(f\"Epoch [{epoch+1}/{num_epochs}], Loss: {epoch_loss:.4f}\")\n", " # Validation\n", " model.eval()\n", " correct = 0\n", " total = 0\n", " with torch.no_grad():\n", " for inputs, labels in val_loader:\n", " inputs = inputs.to(device)\n", " labels = labels.to(device)\n", " outputs = model(inputs)\n", " _, predicted = torch.max(outputs.data, 1)\n", " total += labels.size(0)\n", " correct += (predicted == labels).sum().item()\n", "\n", " accuracy = 100 * correct / total\n", " print(f\"Validation Accuracy: {accuracy:.2f}%\")\n", "\n", "# Assuming you have X_train_vectors, y_train, X_val_vectors, and y_val\n", "\n", "# Convert data to tensors\n", "X_train_tensor = torch.tensor(X_train_vectors, dtype=torch.float32)\n", "y_train_tensor = torch.tensor(y_train.astype('category').cat.codes.values, dtype=torch.long) # Convert labels to numerical and ensure LongTensor type\n", "X_val_tensor = torch.tensor(X_val_vectors, dtype=torch.float32)\n", "y_val_tensor = 
torch.tensor(y_val.astype('category').cat.codes.values, dtype=torch.long) # Convert labels to numerical and ensure LongTensor type\n", "\n", "# Create TensorDatasets\n", "train_data = TensorDataset(X_train_tensor, y_train_tensor)\n", "val_data = TensorDataset(X_val_tensor,y_val_tensor)\n", "\n", "\n", "# Create DataLoaders\n", "train_sampler = RandomSampler(train_data)\n", "train_loader = DataLoader(train_data, sampler=train_sampler, batch_size=32)\n", "\n", "val_sampler = SequentialSampler(val_data)\n", "val_loader = DataLoader(val_data, sampler=val_sampler, batch_size=32)\n", "\n", "# Device configuration\n", "device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n", "\n", "# Model, loss function, and optimizer\n", "input_size = 300\n", "hidden_size = 128\n", "num_classes = 5\n", "model = FFNN(input_size, hidden_size, num_classes)\n", "criterion = nn.CrossEntropyLoss()\n", "optimizer = torch.optim.Adam(model.parameters(), lr=0.001)\n", "\n", "# Train the model\n", "train(model, train_loader, val_loader, criterion, optimizer, device, num_epochs=20)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "-5Et7l80zf0M", "outputId": "826cf23c-7acc-4baf-f0b4-c91704f0da02" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Epoch [1/20], Loss: 1.1833\n", "Validation Accuracy: 94.57%\n", "Epoch [2/20], Loss: 0.3814\n", "Validation Accuracy: 95.48%\n", "Epoch [3/20], Loss: 0.1856\n", "Validation Accuracy: 96.83%\n", "Epoch [4/20], Loss: 0.1350\n", "Validation Accuracy: 95.93%\n", "Epoch [5/20], Loss: 0.1126\n", "Validation Accuracy: 96.83%\n", "Epoch [6/20], Loss: 0.0963\n", "Validation Accuracy: 96.83%\n", "Epoch [7/20], Loss: 0.0835\n", "Validation Accuracy: 96.83%\n", "Epoch [8/20], Loss: 0.0748\n", "Validation Accuracy: 96.38%\n", "Epoch [9/20], Loss: 0.0695\n", "Validation Accuracy: 97.29%\n", "Epoch [10/20], Loss: 0.0643\n", "Validation Accuracy: 97.29%\n", "Epoch [11/20], Loss: 
0.0571\n", "Validation Accuracy: 96.83%\n", "Epoch [12/20], Loss: 0.0533\n", "Validation Accuracy: 96.83%\n", "Epoch [13/20], Loss: 0.0473\n", "Validation Accuracy: 96.83%\n", "Epoch [14/20], Loss: 0.0452\n", "Validation Accuracy: 97.29%\n", "Epoch [15/20], Loss: 0.0409\n", "Validation Accuracy: 97.29%\n", "Epoch [16/20], Loss: 0.0379\n", "Validation Accuracy: 96.83%\n", "Epoch [17/20], Loss: 0.0359\n", "Validation Accuracy: 97.29%\n", "Epoch [18/20], Loss: 0.0323\n", "Validation Accuracy: 97.29%\n", "Epoch [19/20], Loss: 0.0313\n", "Validation Accuracy: 97.74%\n", "Epoch [20/20], Loss: 0.0294\n", "Validation Accuracy: 97.29%\n" ] } ] }, { "cell_type": "markdown", "source": [ "Save the model into a file" ], "metadata": { "id": "NPxixKXS1eQ9" } }, { "cell_type": "code", "source": [ "# prompt: Save the model into a file\n", "\n", "torch.save(model.state_dict(), 'model.pth')\n" ], "metadata": { "id": "gf7D8qGn1i_K" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "## Evaluation on the test data\n", "\n", "We will use the model above to predict on the test data and calculate evaluation metrics including accuracy, precision, recall, and F1 scores." 
], "metadata": { "id": "8tMfP20n1nCQ" } }, { "cell_type": "code", "source": [ "# prompt: Use the model above to predict on the test data and calculate evaluation metrics including accuracy, precision, recall, F1 scores.\n", "\n", "import torch\n", "from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score\n", "\n", "# Load the saved model (weights_only=True restricts unpickling to tensors, as recommended by PyTorch)\n", "model = FFNN(input_size, hidden_size, num_classes)\n", "model.load_state_dict(torch.load('model.pth', weights_only=True))\n", "model.to(device)\n", "model.eval()\n", "\n", "\n", "# Assuming X_test_vectors and y_test are defined from the previous steps\n", "\n", "# Convert the test data to tensors\n", "X_test_tensor = torch.tensor(X_test_vectors, dtype=torch.float32)\n", "y_test_tensor = torch.tensor(y_test.astype('category').cat.codes.values, dtype=torch.long)\n", "\n", "# Create a DataLoader for the test data\n", "test_data = TensorDataset(X_test_tensor, y_test_tensor)\n", "test_sampler = SequentialSampler(test_data)\n", "test_loader = DataLoader(test_data, sampler=test_sampler, batch_size=32)\n", "\n", "\n", "# Make predictions on the test set\n", "y_pred_list = []\n", "y_true_list = []\n", "with torch.no_grad():\n", " for inputs, labels in test_loader:\n", " inputs = inputs.to(device)\n", " outputs = model(inputs)\n", " _, predicted = torch.max(outputs.data, 1)\n", " y_pred_list.extend(predicted.cpu().numpy())\n", " y_true_list.extend(labels.cpu().numpy())\n", "\n", "\n", "# Calculate evaluation metrics\n", "accuracy = accuracy_score(y_true_list, y_pred_list)\n", "precision = precision_score(y_true_list, y_pred_list, average='weighted') # Use 'weighted' for multi-class\n", "recall = recall_score(y_true_list, y_pred_list, average='weighted')\n", "f1 = f1_score(y_true_list, y_pred_list, average='weighted')\n", "\n", "print(f\"Accuracy: {accuracy:.4f}\")\n", "print(f\"Precision: {precision:.4f}\")\n", "print(f\"Recall: {recall:.4f}\")\n", "print(f\"F1 Score: {f1:.4f}\")" ], "metadata": { "colab": { "base_uri": 
"https://localhost:8080/" }, "id": "zKhfdiBc18WY", "outputId": "1e5c6263-e5c5-4358-828d-82235e6d5883" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Accuracy: 0.9709\n", "Precision: 0.9715\n", "Recall: 0.9709\n", "F1 Score: 0.9708\n" ] }, { "output_type": "stream", "name": "stderr", "text": [ ":8: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. 
Please open an issue on GitHub for any issues related to this experimental feature.\n", " model.load_state_dict(torch.load('model.pth'))\n" ] } ] }, { "cell_type": "markdown", "source": [ "## Model Deployment\n", "\n", "Create an API endpoint that takes a document as input, preprocesses the input, and uses the model to predict the category for that document" ], "metadata": { "id": "ClVr8ns02IDb" } }, { "cell_type": "code", "source": [ "# prompt: Create an API endpoint that takes a document as input, preprocesses the input, and uses the model to predict the category for that document\n", "\n", "from flask import Flask, request, jsonify\n", "import torch\n", "import torch.nn as nn\n", "import torch.nn.functional as F\n", "import nltk\n", "import re\n", "from nltk.corpus import stopwords\n", "import numpy as np\n", "import gensim.downloader as api\n", "\n", "# Load the pre-trained GloVe model (make sure it's downloaded); keep it under a separate name\n", "# so it is not overwritten by the FFNN classifier loaded below\n", "try:\n", " glove_model = api.load(\"glove-wiki-gigaword-300\")\n", "except Exception as e:\n", " print(f\"Error loading the model: {e}\")\n", "\n", "# Define the FFNN model (same as in the training notebook)\n", "class FFNN(nn.Module):\n", " def __init__(self, input_size, hidden_size, num_classes):\n", " super(FFNN, self).__init__()\n", " self.fc1 = nn.Linear(input_size, hidden_size)\n", " self.relu = nn.ReLU()\n", " self.fc2 = nn.Linear(hidden_size, num_classes)\n", "\n", " def forward(self, x):\n", " out = self.fc1(x)\n", " out = self.relu(out)\n", " out = self.fc2(out)\n", " return out\n", "\n", "# Load the trained model\n", "input_size = 300\n", "hidden_size = 128\n", "num_classes = 5\n", "device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n", "model = FFNN(input_size, hidden_size, num_classes)\n", "model.load_state_dict(torch.load('model.pth', map_location=device, weights_only=True))\n", "model.to(device)\n", "model.eval()\n", "\n", "\n", "# Data preprocessing function\n", "nltk.download('stopwords')\n", "nltk.download('punkt_tab')\n", "stop_words = set(stopwords.words('english'))\n", "\n", "def clean_text(text):\n", " text = re.sub(r'[^\\w\\s]', '', text) # Remove punctuation\n", " text = re.sub(r'\\d+', '', text) # Remove numbers\n", " text = text.lower() # Lowercase\n", " tokens = nltk.word_tokenize(text) # Tokenize\n", " tokens = [w for w in tokens if w not in stop_words] # Remove stopwords\n", " text = ' '.join(tokens)\n", " return text\n", "\n", "\n", "def document_vector(doc):\n", " words = doc.split()\n", " vectors = [glove_model[word] for word in words if word in glove_model]\n", " if vectors:\n", " return np.mean(vectors, axis=0)\n", " else:\n", " return np.zeros(300) # Return a zero vector if no words are found in the model\n", "\n", "app = Flask(__name__)\n", "\n", "@app.route('/predict', methods=['POST'])\n", "def predict():\n", " try:\n", " data = request.get_json(force=True)\n", " document = data['document']\n", " cleaned_doc = clean_text(document)\n", " doc_vector = document_vector(cleaned_doc)\n", " input_tensor = torch.tensor(doc_vector, dtype=torch.float32).unsqueeze(0).to(device)\n", " with torch.no_grad():\n", " output = model(input_tensor)\n", " _, predicted_class = torch.max(output, 1)\n", " predicted_class = predicted_class.item()\n", " return jsonify({'predicted_category': predicted_class})\n", " except Exception as e:\n", " return jsonify({'error': str(e)})\n", "\n", "if __name__ == '__main__':\n", " app.run(debug=True)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "rBhJSkae2ajI", "outputId": "b5dd69f3-2fdd-4a6c-f1f6-dd6229010946" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ " * Serving Flask app '__main__'\n", " * Debug mode: on\n" ] }, { "output_type": "stream", "name": "stderr", "text": [ ":39: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. 
It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.\n", " model.load_state_dict(torch.load('model.pth', map_location=device))\n", "[nltk_data] Downloading package stopwords to /root/nltk_data...\n", "[nltk_data] Package stopwords is already up-to-date!\n", "[nltk_data] Downloading package punkt_tab to /root/nltk_data...\n", "[nltk_data] Package punkt_tab is already up-to-date!\n", "INFO:werkzeug:\u001b[31m\u001b[1mWARNING: This is a development server. Do not use it in a production deployment. 
Use a production WSGI server instead.\u001b[0m\n", " * Running on http://127.0.0.1:5000\n", "INFO:werkzeug:\u001b[33mPress CTRL+C to quit\u001b[0m\n", "INFO:werkzeug: * Restarting with stat\n" ] } ] }, { "cell_type": "markdown", "source": [ "Write Python code that uses the above endpoint to predict the label for a sample text" ], "metadata": { "id": "AIkCkpF82-cO" } }, { "cell_type": "code", "source": [ "# prompt: Write Python code that uses the above endpoint to predict the label for a sample text\n", "\n", "import requests\n", "import json\n", "\n", "def predict_category(text):\n", " url = 'http://127.0.0.1:5000/predict' # Replace with your actual endpoint URL\n", " headers = {'Content-type': 'application/json'}\n", " data = {'document': text}\n", " try:\n", " response = requests.post(url, data=json.dumps(data), headers=headers)\n", " response.raise_for_status() # Raise an exception for bad status codes (4xx or 5xx)\n", " prediction = response.json()\n", " return prediction['predicted_category']\n", " except requests.exceptions.RequestException as e:\n", " print(f\"Error making prediction: {e}\")\n", " return None\n", " except KeyError as e:\n", " print(f\"Unexpected response format: {e}\")\n", " return None\n", "\n", "\n", "# Example usage\n", "sample_text = \"This is a sample news article about business and finance.\"\n", "predicted_label = predict_category(sample_text)\n", "\n", "if predicted_label is not None:\n", " print(f\"Predicted Category: {predicted_label}\")" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "glUC6yHi3Lfk", "outputId": "7bddf26a-a3d6-47a9-e96a-9195a7f78824" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Error making prediction: HTTPConnectionPool(host='127.0.0.1', port=5000): Max retries exceeded with url: /predict (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 111] Connection refused'))\n" ] } ] } ], "metadata": { 
"accelerator": "GPU", "colab": { "gpuType": "T4", "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 0 }