{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "UXnI7KKa5AVF"
},
"source": [
"# Bigram Language Models\n",
"\n",
"In this notebook, we are going to implement a bigram language model and use interpolation for smoothing.\n",
"\n",
"The bigram probabilities will be estimated by using MLE as follows:\n",
"\n",
"$$\n",
"P_{ML}(w_i|w_{i-1})=\\frac{c(w_{i-1}w_{i})}{c(w_{i-1})}\n",
"$$\n",
"\n",
"Recall that, the smoothed bigram probabilities using interpolation technique is calculated as follows.\n",
"\n",
"Bigrams:\n",
"\n",
"$$\n",
"P(w_i|w_{i-1})=\\lambda_2 \\times P_{ML}(w_i|w_{i-1})+(1-\\lambda_2)\\times P(w_i)\n",
"$$\n",
"\n",
"where $P(w_i)$ is the smoothed unigram probability and calculated as follows.\n",
"\n",
"$$\n",
"P(w_i)=\\lambda_1 \\times P_{ML}(w_i) + (1-\\lambda_1) \\times \\frac{1}{N}\n",
"$$\n",
"\n",
"where $N$ is a large number, e.g., $N=1,000,000$.\n",
"\n",
"There are two parts in this programming assignment.\n",
"\n",
"- Training: you will estimate unigram and bigram probabilities from a raw text data file and save parameters of the language model into a file.\n",
"- Test: you use the trained language model to calculate calculates entropy, perplexity on the test data. You will need to use interpolation to calculate smoothed probabilities in testing.\n",
"\n",
"Please refer the pseudo-code in slide 17, 18 of the lecture [NLP Programming Tutorial 2 -Bigram Language Models](http://www.phontron.com/slides/nlp-programming-en-02-bigramlm.pdf) to complete the assignment."
]
},
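{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick numeric illustration of the two formulas above (all counts and values below are made up, purely for illustration): if the bigram `a b` occurred 2 times and its context `a` occurred 4 times, then $P_{ML}(b|a)=2/4=0.5$. The sketch below also plugs made-up unigram statistics into the interpolation formulas with $\\lambda_1=\\lambda_2=0.95$ and $N=1,000,000$."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Made-up counts and probabilities, only to illustrate the formulas above.\n",
"lambda1, lambda2, N = 0.95, 0.95, 1000000\n",
"\n",
"# MLE estimates\n",
"p_ml_bigram = 2 / 4    # P_ML(b|a), assuming c(a b) = 2 and c(a) = 4\n",
"p_ml_unigram = 3 / 10  # P_ML(b), assuming c(b) = 3 out of 10 words\n",
"\n",
"# Interpolated (smoothed) probabilities\n",
"p_unigram = lambda1 * p_ml_unigram + (1 - lambda1) / N\n",
"p_bigram = lambda2 * p_ml_bigram + (1 - lambda2) * p_unigram\n",
"print(p_unigram, p_bigram)"
],
"execution_count": null,
"outputs": []
},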
{
"cell_type": "markdown",
"metadata": {
"id": "hZxKE4cMUcH3"
},
"source": [
"## Data\n",
"\n",
"We will use the file [wiki-en-train.word](https://raw.githubusercontent.com/neubig/nlptutorial/master/data/wiki-en-train.word) as the training data, and [wiki-en-test.\n",
"word](https://raw.githubusercontent.com/neubig/nlptutorial/master/data/wiki-en-test.word) as the test data. To test our implementation quickly, we will use small data file [02-train-input.txt](https://github.com/neubig/nlptutorial/blob/master/test/02-train-input.txt). All data files are from the [nlptutorial](https://github.com/neubig/nlptutorial) by Graham Neubig.\n",
"\n",
"As the first step, we will download all necessary data files using `wget` command line."
]
},
{
"cell_type": "code",
"metadata": {
"id": "Q5JRqLFIVnI-",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "1c6f8506-0756-429b-db7b-9378e40ceae0"
},
"source": [
"!rm -f wiki-en-train.word\n",
"!wget https://raw.githubusercontent.com/neubig/nlptutorial/master/data/wiki-en-train.word\n",
"\n",
"!rm -f wiki-en-test.word\n",
"!wget https://raw.githubusercontent.com/neubig/nlptutorial/master/data/wiki-en-test.word\n",
"\n",
"!rm -f 02-train-input.txt\n",
"!wget https://raw.githubusercontent.com/neubig/nlptutorial/master/test/02-train-input.txt"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"--2021-11-27 03:31:18-- https://raw.githubusercontent.com/neubig/nlptutorial/master/data/wiki-en-train.word\n",
"Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...\n",
"Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 203886 (199K) [text/plain]\n",
"Saving to: ‘wiki-en-train.word’\n",
"\n",
"wiki-en-train.word 100%[===================>] 199.11K --.-KB/s in 0.03s \n",
"\n",
"2021-11-27 03:31:18 (6.98 MB/s) - ‘wiki-en-train.word’ saved [203886/203886]\n",
"\n",
"--2021-11-27 03:31:19-- https://raw.githubusercontent.com/neubig/nlptutorial/master/data/wiki-en-test.word\n",
"Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...\n",
"Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 26989 (26K) [text/plain]\n",
"Saving to: ‘wiki-en-test.word’\n",
"\n",
"wiki-en-test.word 100%[===================>] 26.36K --.-KB/s in 0.001s \n",
"\n",
"2021-11-27 03:31:19 (20.1 MB/s) - ‘wiki-en-test.word’ saved [26989/26989]\n",
"\n",
"--2021-11-27 03:31:19-- https://raw.githubusercontent.com/neubig/nlptutorial/master/test/02-train-input.txt\n",
"Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...\n",
"Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 12 [text/plain]\n",
"Saving to: ‘02-train-input.txt’\n",
"\n",
"02-train-input.txt 100%[===================>] 12 --.-KB/s in 0s \n",
"\n",
"2021-11-27 03:31:19 (792 KB/s) - ‘02-train-input.txt’ saved [12/12]\n",
"\n"
]
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "X0WLgNXeVt0Z"
},
"source": [
"## Part 1: Estimating probabilities\n",
"\n",
"What you need to do in this part is complete the function `train_bigram` to estimate unigram, bigram probabilities (using MLE method) from a text file and save probabilities to a model file.\n",
"\n",
"The format of the model file is as follows. Each line includes an n-gram (unigram or bigram) and its probability estimated by MLE method.\n",
"\n",
"```\n",
" a\t1.000000\n",
"a\t0.250000\n",
"a b\t1.000000\n",
"b\t0.250000\n",
"b c\t0.500000\n",
"...\n",
"```"
]
},
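{
"cell_type": "markdown",
"metadata": {},
"source": [
"For instance, judging from the expected model shown later in this notebook, the small test file contains the two sentences `a b c` and `a b d`; there the bigram `b c` occurs once while its context `b` occurs twice, which produces the line `b c\\t0.500000`. A minimal sketch of that arithmetic (the counts are assumed, not read from the file):"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Worked example of one line of the model file, using counts assumed from\n",
"# the small training data ('a b c' and 'a b d').\n",
"count_bc = 1  # c('b c')\n",
"count_b = 2   # c('b'), the context count\n",
"probability = count_bc / count_b\n",
"print('b c\\t%f' % probability)  # -> 'b c\\t0.500000'"
],
"execution_count": null,
"outputs": []
},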
{
"cell_type": "markdown",
"metadata": {
"id": "KP2Tj-kZjowV"
},
"source": [
"### Part 1.1: Function train_bigram()"
]
},
{
"cell_type": "code",
"metadata": {
"id": "XX2a0Yetfzta"
},
"source": [
"from collections import defaultdict\n",
"\n",
"\n",
"def train_bigram(train_file, model_file):\n",
" \"\"\"Train trigram language model and save to model file\n",
" \"\"\"\n",
" counts = defaultdict(int) # count the n-gram\n",
" context_counts = defaultdict(int) # count the context\n",
" with open(train_file) as f:\n",
" for line in f:\n",
" line = line.strip()\n",
" if line == '':\n",
" continue\n",
" words = line.split()\n",
" words.append('')\n",
" words.insert(0, '')\n",
"\n",
" for i in range(1, len(words)): # Note: starting at 1, after \n",
" # TODO: Write code to count bigrams and their contexts\n",
" # YOUR CODE HERE\n",
"\n",
" counts[words[i-1] + ' ' + words[i]] += 1 # Add bigram and bigram context\n",
" context_counts[words[i-1]] += 1\n",
" counts[words[i]] += 1 # Add unigram and unigram context\n",
" context_counts[\"\"] += 1\n",
"\n",
" # Save probabilities to the model file\n",
" with open(model_file, 'w') as fo:\n",
" for ngram, count in counts.items():\n",
" # TODO: Write code to calculate probabilities of n-grams\n",
" # (unigrams and bigrams)\n",
" # Hint: probabilities of n-grams will be calculated by their counts\n",
" # divided by their context's counts.\n",
" # probability = counts[ngram]/context_counts[context]\n",
" # After calculating probabilities, we will save ngram and probability\n",
" # to the file in the format:\n",
" # ngramprobability\n",
"\n",
" # YOUR CODE HERE\n",
"\n",
" words = ngram.split(' ')\n",
" words.pop()\n",
" context = ' '.join(words)\n",
" probability = counts[ngram]/context_counts[context]\n",
" fo.write('%s\\t%f\\n' % (ngram, probability))\n",
" pass"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "LpbTBe8xi1rH"
},
"source": [
"Let's try to train bigram model on the small data."
]
},
{
"cell_type": "code",
"metadata": {
"id": "4lmBGnZWi5_R"
},
"source": [
"train_bigram('02-train-input.txt', '02-bigram_model.txt')"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "xofAVqrNjEHA"
},
"source": [
"Let's see the content of the model. After completing the function `train_bigram`, you should see. The order of lines may be different.\n",
"\n",
"```\n",
"\t0.250000\n",
" a\t1.000000\n",
"a\t0.250000\n",
"a b\t1.000000\n",
"b\t0.250000\n",
"b c\t0.500000\n",
"b d\t0.500000\n",
"c\t0.125000\n",
"c \t1.000000\n",
"d\t0.125000\n",
"d \t1.000000\n",
"```"
]
},
{
"cell_type": "code",
"metadata": {
"id": "Z6Wf9fXyjLwO",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "322b2750-63c8-4643-886d-43e995daa50a"
},
"source": [
"!cat 02-bigram_model.txt"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
" a\t1.000000\n",
"a\t0.250000\n",
"a b\t1.000000\n",
"b\t0.250000\n",
"b c\t0.500000\n",
"c\t0.125000\n",
"c \t1.000000\n",
"\t0.250000\n",
"b d\t0.500000\n",
"d\t0.125000\n",
"d \t1.000000\n"
]
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "lzqpiLpxjM5z"
},
"source": [
"### Part 1.2: load the model file\n",
"\n",
"We are going to implement the function `load_bigram_model` to load the model file."
]
},
{
"cell_type": "code",
"metadata": {
"id": "s0lc9hiMkAJk"
},
"source": [
"def load_bigram_model(model_file):\n",
" \"\"\"Load the model file\n",
"\n",
" Args:\n",
" model_file (str): Path to the model file\n",
"\n",
" Returns:\n",
" probs (dict): Dictionary object that map from ngrams to their probabilities\n",
" \"\"\"\n",
" probs = {}\n",
" with open(model_file, 'r') as f:\n",
" for line in f:\n",
" # TODO: From each line split ngram, probability\n",
" # and then update probs\n",
"\n",
" # YOUR CODE HERE\n",
" line = line.strip()\n",
" if line == '':\n",
" continue\n",
" ngram, p = line.split('\\t')\n",
" probs[ngram] = float(p)\n",
" pass\n",
" return probs"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "czcC1vOIq3YV"
},
"source": [
"Let's test the function"
]
},
{
"cell_type": "code",
"metadata": {
"id": "X3pTilwdq5lP",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "9420bcbc-4389-46ed-e5e0-87f82bd75412"
},
"source": [
"probs = load_bigram_model('02-bigram_model.txt')\n",
"probs"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"{'': 0.25,\n",
" ' a': 1.0,\n",
" 'a': 0.25,\n",
" 'a b': 1.0,\n",
" 'b': 0.25,\n",
" 'b c': 0.5,\n",
" 'b d': 0.5,\n",
" 'c': 0.125,\n",
" 'c ': 1.0,\n",
" 'd': 0.125,\n",
" 'd ': 1.0}"
]
},
"metadata": {},
"execution_count": 8
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "rBpjGSG8ky7u"
},
"source": [
"## Part 2: Evaluating Bigram Language Model\n",
"\n",
"In this part, we will evaluate the bigram language model on the test set. We will use linear interpolation as the smoothing technique.\n",
"\n",
"What we need to do is to complete the function `test_bigram` as follows. The function will return perplexity on the test data.\n",
"\n",
"Recall that, the smoothed bigram probabilities using interpolation technique is calculated as follows.\n",
"\n",
"Bigrams:\n",
"\n",
"$$\n",
"P(w_i|w_{i-1})=\\lambda_2 \\times P_{ML}(w_i|w_{i-1})+(1-\\lambda_2)\\times P(w_i)\n",
"$$\n",
"\n",
"where $P(w_i)$ is the smoothed unigram probability and calculated as follows.\n",
"\n",
"$$\n",
"P(w_i)=\\lambda_1 \\times P_{ML}(w_i) + (1-\\lambda_1) \\times \\frac{1}{N}\n",
"$$\n",
"\n",
"where $N$ is a large number, e.g., $N=1,000,000$."
]
},
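{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note how interpolation handles n-grams that were never seen in training: if the bigram is not in the model, only the $(1-\\lambda_2)\\times P(w_i)$ term remains, and if even the word itself is unknown, the unigram probability falls back to $(1-\\lambda_1)\\times \\frac{1}{N}$, so no probability is ever zero. Below is a small sketch of this fallback behaviour; the `toy_probs` dictionary and the helper `smoothed_bigram_prob` are made up for illustration and are not part of the assignment."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Made-up model to illustrate the fallback behaviour of interpolation.\n",
"toy_probs = {'new': 0.02, 'york': 0.01, 'new york': 0.3}\n",
"lambda1, lambda2, N = 0.95, 0.95, 1000000\n",
"\n",
"\n",
"def smoothed_bigram_prob(prev_word, word, probs):\n",
"    # Smoothed unigram probability P(w_i)\n",
"    p1 = (1 - lambda1) / N\n",
"    if word in probs:\n",
"        p1 += lambda1 * probs[word]\n",
"    # Smoothed bigram probability P(w_i | w_{i-1})\n",
"    p2 = (1 - lambda2) * p1\n",
"    if prev_word + ' ' + word in probs:\n",
"        p2 += lambda2 * probs[prev_word + ' ' + word]\n",
"    return p2\n",
"\n",
"\n",
"print(smoothed_bigram_prob('new', 'york', toy_probs))    # seen bigram\n",
"print(smoothed_bigram_prob('old', 'york', toy_probs))    # unseen bigram, known word\n",
"print(smoothed_bigram_prob('new', 'qwerty', toy_probs))  # unknown word: still non-zero"
],
"execution_count": null,
"outputs": []
},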
{
"cell_type": "code",
"metadata": {
"id": "LLb5kGJilgrG"
},
"source": [
"import math\n",
"\n",
"\n",
"def test_bigram(test_file, model_file, lambda2=0.95, lambda1=0.95, N=1000000):\n",
" W = 0 # Total word count\n",
" H = 0\n",
" probs = load_bigram_model(model_file)\n",
" with open(test_file, 'r') as f:\n",
" for line in f:\n",
" line = line.strip()\n",
" if line == '':\n",
" continue\n",
" words = line.split()\n",
" words.append('')\n",
" words.insert(0, '')\n",
" for i in range(1, len(words)): # Note: starting at 1, after \n",
" # TODO: Write code to calculate smoothed unigram probabilties\n",
" # and smoothed bigram probabilities\n",
" # You should use calculate p1 as smoothed unigram probability\n",
" # and p2 as smoothed bigram probability\n",
" p1 = None # Comment out this line\n",
" p2 = None # Comment out this line\n",
"\n",
" # YOUR CODE HERE\n",
"\n",
" p1 = (1-lambda1)/N # Smoothed unigram probability\n",
" if words[i] in probs:\n",
" p1 += lambda1 * probs[words[i]]\n",
" p2 = (1-lambda2) * p1 # Smoothed bigram probability\n",
" if words[i-1] + ' ' + words[i] in probs:\n",
" p2 += lambda2 * probs[words[i-1] + ' ' + words[i]]\n",
"\n",
" # END OF YOUR CODE\n",
" W += 1 # Count the words\n",
" H += -math.log2(p2) # We use logarithm to avoid underflow\n",
" H = H/W\n",
" P = 2**H\n",
"\n",
" print(\"Entropy: {}\".format(H))\n",
" print(\"Perplexity: {}\".format(P))\n",
"\n",
" return P"
],
"execution_count": null,
"outputs": []
},
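{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (an extra step, not part of the original assignment), we can evaluate the small model trained earlier on its own training file; every n-gram in that file is known to the model, so the perplexity should be low."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"test_bigram('02-train-input.txt', '02-bigram_model.txt')"
],
"execution_count": null,
"outputs": []
},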
{
"cell_type": "markdown",
"metadata": {
"id": "To39tEQ0pAH6"
},
"source": [
"Now let's calculate on the Wikipedia data."
]
},
{
"cell_type": "code",
"metadata": {
"id": "ZHAE96uvpQPR",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "c7cfaa83-c6a0-4111-f96f-64c7aefdc8e5"
},
"source": [
"train_bigram('wiki-en-train.word', 'bigram_model.txt')\n",
"test_bigram('wiki-en-test.word', 'bigram_model.txt')"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Entropy: 11.284333504778214\n",
"Perplexity: 2494.151707389251\n"
]
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
"2494.151707389251"
]
},
"metadata": {},
"execution_count": 9
}
]
},
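{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, we can explore how the interpolation weight $\\lambda_2$ affects the perplexity with a simple grid search. This is only a sketch using the functions defined above, and the candidate values are arbitrary; in practice the weights would be tuned on held-out development data rather than on the test set."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Sketch: try a few arbitrary values of lambda2 and keep the best one.\n",
"best = None\n",
"for lam2 in [0.05, 0.25, 0.50, 0.75, 0.95]:\n",
"    print('lambda2 =', lam2)\n",
"    perplexity = test_bigram('wiki-en-test.word', 'bigram_model.txt', lambda2=lam2)\n",
"    if best is None or perplexity < best[1]:\n",
"        best = (lam2, perplexity)\n",
"    print()\n",
"print('Best lambda2: {} (perplexity {:.2f})'.format(best[0], best[1]))"
],
"execution_count": null,
"outputs": []
},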
{
"cell_type": "markdown",
"metadata": {
"id": "NiyicE5hpUyb"
},
"source": []
}
]
}