# Implementation of Logistic Regression in PyTorch

In this notebook, we implement logistic regression for text classification using the PyTorch framework.

## Download the data

In [1]:
%%capture
!rm -f titles-en-train.labeled
!rm -f titles-en-test.labeled

!wget https://raw.githubusercontent.com/neubig/nlptutorial/master/data/titles-en-train.labeled
!wget https://raw.githubusercontent.com/neubig/nlptutorial/master/data/titles-en-test.labeled

Each line holds one sample: a label, a tab character, and the text. The data has two labels, 1 and -1.

```
1	FUJIWARA no Chikamori ( year of birth and death unknown ) was a samurai and poet who lived at the end of the Heian period .
-1	Yomi is the world of the dead .
```

### Load data

We load the data into a list of (text, label) pairs.

In [2]:
def load_data(file_path):
    data = []
    with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
        for line in f:
            line = line.strip()
            if line == '':
                continue
            # Split on the first tab only, in case the text itself contains tabs
            lb, text = line.split('\t', 1)
            data.append((text,int(lb)))

    return data

Load the data from the files, then separate the documents from their labels.

In [3]:
train_data = load_data('./titles-en-train.labeled')
test_data = load_data('./titles-en-test.labeled')

train_docs, train_labels = zip(*train_data)
test_docs, test_labels = zip(*test_data)

## Data Processing

We need to convert the text into tensors before training. We use bag-of-words (BoW) features here; note that deep learning models more often represent inputs differently, for example with word embeddings.
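
To make the BoW idea concrete, here is a minimal pure-Python sketch that counts tokens per document. The helper names `build_vocab` and `bow_vector` and the toy sentences are hypothetical, and the whitespace tokenizer is a simplification: `CountVectorizer` uses its own token pattern, which by default drops single-character tokens.

```python
from collections import Counter

def build_vocab(docs, max_features=None):
    """Map the most frequent lowercase tokens to column indices (hypothetical helper)."""
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    kept = [tok for tok, _ in counts.most_common(max_features)]
    # Sort alphabetically so the column order is deterministic
    return {tok: i for i, tok in enumerate(sorted(kept))}

def bow_vector(doc, vocab):
    """Count how often each vocabulary token occurs in one document."""
    vec = [0] * len(vocab)
    for tok in doc.lower().split():
        if tok in vocab:
            vec[vocab[tok]] += 1
    return vec

toy_docs = ["the poet was a samurai", "the world of the dead"]
toy_vocab = build_vocab(toy_docs)
toy_vectors = [bow_vector(d, toy_vocab) for d in toy_docs]

print(sorted(toy_vocab))  # column order of the feature matrix
print(toy_vectors)        # one count vector per document
```

Each document becomes a fixed-length vector of counts, so word order is lost but the result can be fed directly into a linear model.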

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(
                             max_features=10000
                            )
vectorizer

In [5]:
X_train = vectorizer.fit_transform(train_docs)
X_train.shape

(11288, 10000)

In [6]:
X_test = vectorizer.transform(test_docs)

`X_train` and `X_test` are SciPy sparse matrices, which `torch.from_numpy` cannot consume directly. We need to convert them into dense NumPy arrays first.

In [7]:
X_train = X_train.toarray()
X_test = X_test.toarray()
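
The dense conversion trades memory for convenience. The arithmetic below is a rough sketch of the footprint for the training matrix, assuming 8-byte integer counts (which I believe is the `CountVectorizer` default dtype) before the cast to `float32`; `dense_bytes` is a hypothetical helper.

```python
def dense_bytes(n_rows, n_cols, bytes_per_value):
    """Memory needed to store every cell of a dense matrix."""
    return n_rows * n_cols * bytes_per_value

# Shape reported for X_train in this notebook
n_rows, n_cols = 11288, 10000

int64_bytes = dense_bytes(n_rows, n_cols, 8)    # assumed dtype after toarray()
float32_bytes = dense_bytes(n_rows, n_cols, 4)  # dtype after casting to float32

print(f"int64:   {int64_bytes / 1e6:.0f} MB")
print(f"float32: {float32_bytes / 1e6:.0f} MB")
```

Under these assumptions the dense training matrix takes several hundred megabytes, which is workable here but would not scale to much larger vocabularies or corpora.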

In [8]:
X_train.shape

(11288, 10000)

Convert the labels from {1, -1} to {1, 0}, since `BCELoss` expects targets between 0 and 1.

In [9]:
train_labels = [0 if lb == -1 else lb for lb in train_labels]

In [10]:
test_labels = [0 if lb == -1 else lb for lb in test_labels]

### Converting data into PyTorch Tensors

In [11]:
import torch
from torch.utils.data import TensorDataset, DataLoader

device = 'cuda' if torch.cuda.is_available() else 'cpu'

X_train_t = torch.from_numpy(X_train).to(torch.float32).to(device)
y_train_t = torch.tensor(train_labels, dtype=torch.float32).to(device)

X_test_t = torch.from_numpy(X_test).to(torch.float32).to(device)
y_test_t = torch.tensor(test_labels, dtype=torch.float32).to(device)

In [12]:
print("X_train_t.size()=", X_train_t.size())
print("y_train_t.size()=", y_train_t.size())

X_train_t.size()= torch.Size([11288, 10000])
y_train_t.size()= torch.Size([11288])


### Creating datasets

In [13]:
train_dataset = TensorDataset(X_train_t, y_train_t)
val_dataset = TensorDataset(X_test_t, y_test_t)

## Logistic Regression Model

In [14]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class LogisticRegression(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.linear = nn.Linear(input_dim, output_dim)

    def forward(self, x):
        # Binary classification: sigmoid maps the linear output to P(label = 1)
        # (a multinomial model would use softmax instead)
        return torch.sigmoid(self.linear(x))
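
As an illustration of what the forward pass computes for a single sample, the pure-Python sketch below evaluates `sigmoid(w · x + b)` by hand. The weights, bias, and input are made-up toy values, not parameters learned in this notebook.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(x, w, b):
    """Probability of label 1 for one BoW vector x under weights w and bias b."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

# Toy values chosen for illustration only
toy_x = [1, 0, 2]
toy_w = [0.5, -1.0, 0.25]
toy_b = -1.0

p = predict_proba(toy_x, toy_w, toy_b)
print(p)  # the linear output is 0 here, so the probability is 0.5
```

With `output_dim = 1`, `nn.Linear` holds one such weight vector and one bias, and applies the same computation to every row of a batch.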

## Training Logistic Regression


### Building the model

In [15]:
import time

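input_dim = 10000  # number of BoW features (max_features of the vectorizer)
output_dim = 1  # a single output: the probability of label 1
epochs = 200  # number of passes over the training set
learning_rate = 1e-3  # step size for the Adam optimizer
batch_size = 32  # samples per mini-batch (used for training and evaluation)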

model = LogisticRegression(input_dim, output_dim)
model = model.to(device)

criterion = torch.nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
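
For reference, the quantity that `BCELoss` averages over a batch can be sketched in plain Python as below. The function name `bce_loss` and the `eps` clamp are assumptions of this sketch; PyTorch guards against `log(0)` in its own way.

```python
import math

def bce_loss(probs, targets, eps=1e-12):
    """Mean binary cross-entropy between predicted probabilities and 0/1 targets."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite (sketch-only choice)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)

# An uninformative prediction of 0.5 costs log(2) per sample
print(bce_loss([0.5, 0.5], [1, 0]))

# Confident and correct is cheap; confident and wrong is expensive
print(bce_loss([0.9], [1]), bce_loss([0.1], [1]))
```

The loss grows without bound as a prediction approaches the wrong extreme, which is what pushes the model toward calibrated probabilities.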

### Training Loop

In [16]:
from tqdm.notebook import trange, tqdm
from torch.utils.data import RandomSampler, SequentialSampler
from sklearn import metrics

def train():
    train_sampler = RandomSampler(train_dataset)
    train_dataloader = DataLoader(
        train_dataset,
        sampler=train_sampler,
        batch_size=batch_size,
    )
    model.train()  # put the model in training mode

    train_iterator = trange(int(epochs), desc="Epoch")

    for _ in train_iterator:
        for batch in train_dataloader:
            optimizer.zero_grad()  # reset gradients from the previous step
            pred = model(batch[0]).squeeze(1)  # predicted probabilities, shape (batch,)
            loss = criterion(pred, batch[1])  # binary cross-entropy against the labels
            loss.backward()  # compute gradients
            optimizer.step()  # update weights

def evaluate():
    model.eval()
    test_sampler = SequentialSampler(val_dataset)
    test_dataloader = DataLoader(
        val_dataset,
        sampler=test_sampler,
        batch_size=batch_size,
    )

    preds = []
    true_labels = []
    with torch.no_grad():
        for batch in tqdm(test_dataloader, desc="Evaluating"):
            probs = model(batch[0])  # the model already applies sigmoid
            _preds = (probs > 0.5).type(torch.long).squeeze(1)  # threshold at 0.5
            preds += _preds.detach().cpu().numpy().tolist()
            true_labels += batch[1].detach().cpu().numpy().tolist()

    print(metrics.classification_report(true_labels, preds))
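
The evaluation step thresholds each probability at 0.5 and then summarizes the predictions. The sketch below reproduces that logic on toy data for the positive class only; `binary_metrics` is a hypothetical helper, not part of scikit-learn, and the probabilities are invented.

```python
def binary_metrics(true_labels, probs, threshold=0.5):
    """Threshold probabilities, then compute metrics for the positive class."""
    preds = [1 if p > threshold else 0 for p in probs]
    pairs = list(zip(true_labels, preds))
    tp = sum(1 for y, yh in pairs if y == 1 and yh == 1)
    fp = sum(1 for y, yh in pairs if y == 0 and yh == 1)
    fn = sum(1 for y, yh in pairs if y == 1 and yh == 0)
    tn = sum(1 for y, yh in pairs if y == 0 and yh == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}

# Toy labels and probabilities for illustration only
toy_true = [1, 0, 1, 1, 0]
toy_probs = [0.9, 0.6, 0.4, 0.8, 0.1]
toy_report = binary_metrics(toy_true, toy_probs)
print(toy_report)
```

`classification_report` computes the same quantities for every class and adds the macro and weighted averages.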

In [17]:
train()

Epoch:   0%|          | 0/200 [00:00<?, ?it/s]

In [None]:
evaluate()

Evaluating:   0%|          | 0/89 [00:00<?, ?it/s]

              precision    recall  f1-score   support

         0.0       0.93      0.93      0.93      1477
         1.0       0.92      0.92      0.92      1346

    accuracy                           0.93      2823
   macro avg       0.93      0.93      0.93      2823
weighted avg       0.93      0.93      0.93      2823



## References

- https://www.kaggle.com/code/glebbuzin/solving-sklearn-datasets-with-pytorch