{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"provenance":[]},"kernelspec":{"name":"python3","display_name":"Python 3"}},"cells":[{"cell_type":"markdown","metadata":{"id":"UXnI7KKa5AVF"},"source":["# Bigram Language Models\n","\n","In this notebook, we are going to implement a bigram language model and use interpolation for smoothing.\n","\n","The bigram probabilities will be estimated by using MLE as follows:\n","\n","$$\n","P_{ML}(w_i|w_{i-1})=\\frac{c(w_{i-1}w_{i})}{c(w_{i-1})}\n","$$\n","\n","Recall that, the smoothed bigram probabilities using interpolation technique is calculated as follows.\n","\n","Bigrams:\n","\n","$$\n","P(w_i|w_{i-1})=\\lambda_2 \\times P_{ML}(w_i|w_{i-1})+(1-\\lambda_2)\\times P(w_i)\n","$$\n","\n","where $P(w_i)$ is the smoothed unigram probability and calculated as follows.\n","\n","$$\n","P(w_i)=\\lambda_1 \\times P_{ML}(w_i) + (1-\\lambda_1) \\times \\frac{1}{N}\n","$$\n","\n","where $N$ is a large number, e.g., $N=1,000,000$.\n","\n","There are two parts in this programming assignment.\n","\n","- Training: you will estimate unigram and bigram probabilities from a raw text data file and save parameters of the language model into a file.\n","- Test: you use the trained language model to calculate calculates entropy, perplexity on the test data. You will need to use interpolation to calculate smoothed probabilities in testing.\n","\n","Please refer the pseudo-code in slide 17, 18 of the lecture [NLP Programming Tutorial 2 -Bigram Language Models](http://www.phontron.com/slides/nlp-programming-en-02-bigramlm.pdf) to complete the assignment."]},{"cell_type":"markdown","metadata":{"id":"hZxKE4cMUcH3"},"source":["## Data\n","\n","We will use the file [wiki-en-train.word](https://raw.githubusercontent.com/neubig/nlptutorial/master/data/wiki-en-train.word) as the training data, and [wiki-en-test.\n","word](https://raw.githubusercontent.com/neubig/nlptutorial/master/data/wiki-en-test.word) as the test data. To test our implementation quickly, we will use small data file [02-train-input.txt](https://github.com/neubig/nlptutorial/blob/master/test/02-train-input.txt). All data files are from the [nlptutorial](https://github.com/neubig/nlptutorial) by Graham Neubig.\n","\n","As the first step, we will download all necessary data files using `wget` command line."]},{"cell_type":"code","metadata":{"id":"Q5JRqLFIVnI-","colab":{"base_uri":"https://localhost:8080/"},"outputId":"5e62ecaa-d425-45bf-aacb-9d8718adc7e0","executionInfo":{"status":"ok","timestamp":1766807887609,"user_tz":-420,"elapsed":1155,"user":{"displayName":"Minh Pham","userId":"01293297774691882951"}}},"source":["!rm -f wiki-en-train.word\n","!wget https://raw.githubusercontent.com/neubig/nlptutorial/master/data/wiki-en-train.word\n","\n","!rm -f wiki-en-test.word\n","!wget https://raw.githubusercontent.com/neubig/nlptutorial/master/data/wiki-en-test.word\n","\n","!rm -f 02-train-input.txt\n","!wget https://raw.githubusercontent.com/neubig/nlptutorial/master/test/02-train-input.txt"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["--2025-12-27 03:58:06-- https://raw.githubusercontent.com/neubig/nlptutorial/master/data/wiki-en-train.word\n","Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...\n","Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n","HTTP request sent, awaiting response... 
200 OK\n","Length: 203886 (199K) [text/plain]\n","Saving to: ‘wiki-en-train.word’\n","\n","\rwiki-en-train.word 0%[ ] 0 --.-KB/s \rwiki-en-train.word 100%[===================>] 199.11K --.-KB/s in 0.02s \n","\n","2025-12-27 03:58:06 (11.7 MB/s) - ‘wiki-en-train.word’ saved [203886/203886]\n","\n","--2025-12-27 03:58:06-- https://raw.githubusercontent.com/neubig/nlptutorial/master/data/wiki-en-test.word\n","Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.108.133, ...\n","Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.\n","HTTP request sent, awaiting response... 200 OK\n","Length: 26989 (26K) [text/plain]\n","Saving to: ‘wiki-en-test.word’\n","\n","wiki-en-test.word 100%[===================>] 26.36K --.-KB/s in 0.002s \n","\n","2025-12-27 03:58:07 (14.3 MB/s) - ‘wiki-en-test.word’ saved [26989/26989]\n","\n","--2025-12-27 03:58:07-- https://raw.githubusercontent.com/neubig/nlptutorial/master/test/02-train-input.txt\n","Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...\n","Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n","HTTP request sent, awaiting response... 200 OK\n","Length: 12 [text/plain]\n","Saving to: ‘02-train-input.txt’\n","\n","02-train-input.txt 100%[===================>] 12 --.-KB/s in 0s \n","\n","2025-12-27 03:58:07 (633 KB/s) - ‘02-train-input.txt’ saved [12/12]\n","\n"]}]},{"cell_type":"markdown","metadata":{"id":"X0WLgNXeVt0Z"},"source":["## Part 1: Estimating probabilities\n","\n","What you need to do in this part is complete the function `train_bigram` to estimate unigram, bigram probabilities (using MLE method) from a text file and save probabilities to a model file.\n","\n","The format of the model file is as follows. 
"\n","```\n","<s> a\t1.000000\n","a\t0.250000\n","a b\t1.000000\n","b\t0.250000\n","b c\t0.500000\n","...\n","```"]},
{"cell_type":"markdown","metadata":{"id":"KP2Tj-kZjowV"},"source":["### Part 1.1: Function train_bigram()"]},
{"cell_type":"code","metadata":{"id":"XX2a0Yetfzta"},"source":["from collections import defaultdict\n","\n","\n","def train_bigram(train_file, model_file):\n","    \"\"\"Train a bigram language model and save it to a model file\n","    \"\"\"\n","    counts = defaultdict(int)          # count the n-grams\n","    context_counts = defaultdict(int)  # count the contexts\n","    with open(train_file) as f:\n","        for line in f:\n","            line = line.strip()\n","            if line == '':\n","                continue\n","            words = line.split()\n","            words.append('</s>')\n","            words.insert(0, '<s>')\n","\n","            for i in range(1, len(words)):  # Note: starting at 1, after <s>\n","                # TODO: Write code to count bigrams and their contexts\n","                # YOUR CODE HERE\n","\n","                counts[words[i-1] + ' ' + words[i]] += 1  # Add bigram and bigram context\n","                context_counts[words[i-1]] += 1\n","                counts[words[i]] += 1  # Add unigram and unigram context\n","                context_counts[\"\"] += 1\n","\n","    # Save probabilities to the model file\n","    with open(model_file, 'w') as fo:\n","        for ngram, count in counts.items():\n","            # TODO: Write code to calculate probabilities of n-grams\n","            # (unigrams and bigrams)\n","            # Hint: probabilities of n-grams will be calculated by their counts\n","            # divided by their context's counts:\n","            #   probability = counts[ngram]/context_counts[context]\n","            # After calculating probabilities, we will save ngram and probability\n","            # to the file in the format: ngram[TAB]probability\n","\n","            # YOUR CODE HERE\n","\n","            words = ngram.split(' ')\n","            words.pop()                # drop the last word to get the context\n","            context = ' '.join(words)  # empty string for unigrams\n","            probability = counts[ngram]/context_counts[context]\n","            fo.write('%s\\t%f\\n' % (ngram, probability))"],"execution_count":null,"outputs":[]},
{"cell_type":"markdown","metadata":{"id":"LpbTBe8xi1rH"},"source":["Let's try to train a bigram model on the small data."]},
{"cell_type":"code","metadata":{"id":"4lmBGnZWi5_R"},"source":["train_bigram('02-train-input.txt', '02-bigram_model.txt')"],"execution_count":null,"outputs":[]},
{"cell_type":"markdown","metadata":{"id":"xofAVqrNjEHA"},"source":["Let's see the content of the model. After completing the function `train_bigram`, you should see the following output. The order of lines may be different.\n",
"\n","```\n","</s>\t0.250000\n","<s> a\t1.000000\n","a\t0.250000\n","a b\t1.000000\n","b\t0.250000\n","b c\t0.500000\n","b d\t0.500000\n","c\t0.125000\n","c </s>\t1.000000\n","d\t0.125000\n","d </s>\t1.000000\n","```"]},
{"cell_type":"code","metadata":{"id":"Z6Wf9fXyjLwO","colab":{"base_uri":"https://localhost:8080/"},"outputId":"5927f1bc-75c9-46b5-bce5-674d0aa76809","executionInfo":{"status":"ok","timestamp":1766807895496,"user_tz":-420,"elapsed":133,"user":{"displayName":"Minh Pham","userId":"01293297774691882951"}}},"source":["!cat 02-bigram_model.txt"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["<s> a\t1.000000\n","a\t0.250000\n","a b\t1.000000\n","b\t0.250000\n","b c\t0.500000\n","c\t0.125000\n","c </s>\t1.000000\n","</s>\t0.250000\n","b d\t0.500000\n","d\t0.125000\n","d </s>\t1.000000\n"]}]},
{"cell_type":"markdown","metadata":{"id":"lzqpiLpxjM5z"},"source":["### Part 1.2: Load the model file\n","\n","We are going to implement the function `load_bigram_model` to load the model file."]},
{"cell_type":"code","metadata":{"id":"s0lc9hiMkAJk"},"source":["def load_bigram_model(model_file):\n","    \"\"\"Load the model file\n","\n","    Args:\n","        model_file (str): Path to the model file\n","\n","    Returns:\n","        probs (dict): Dictionary that maps n-grams to their probabilities\n","    \"\"\"\n","    probs = {}\n","    with open(model_file, 'r') as f:\n","        for line in f:\n","            # TODO: From each line, split the n-gram and its probability,\n","            # and then update probs\n","\n","            # YOUR CODE HERE\n","            line = line.strip()\n","            if line == '':\n","                continue\n","            ngram, p = line.split('\\t')\n","            probs[ngram] = float(p)\n","    return probs"],"execution_count":null,"outputs":[]},
{"cell_type":"markdown","metadata":{"id":"czcC1vOIq3YV"},"source":["Let's test the function."]},
{"cell_type":"code","metadata":{"id":"X3pTilwdq5lP","colab":{"base_uri":"https://localhost:8080/"},"outputId":"27193b1d-7b36-4c24-8e1c-3124714d03a0","executionInfo":{"status":"ok","timestamp":1766807899329,"user_tz":-420,"elapsed":8,"user":{"displayName":"Minh Pham","userId":"01293297774691882951"}}},"source":["probs = load_bigram_model('02-bigram_model.txt')\n","probs"],"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/plain":["{'<s> a': 1.0,\n"," 'a': 0.25,\n"," 'a b': 1.0,\n"," 'b': 0.25,\n"," 'b c': 0.5,\n"," 'c': 0.125,\n"," 'c </s>': 1.0,\n"," '</s>': 0.25,\n"," 'b d': 0.5,\n"," 'd': 0.125,\n"," 'd </s>': 1.0}"]},"metadata":{},"execution_count":7}]},
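{"cell_type":"markdown","metadata":{},"source":["As a quick sanity check, let's compute one smoothed probability by hand from the toy model we just loaded, using the interpolation formulas from the introduction. The cell below is a minimal sketch; `lambda1`, `lambda2`, and `N` match the defaults that `test_bigram` uses in Part 2."]},
{"cell_type":"code","metadata":{},"source":["# Hand computation of the smoothed probability P(b | a) on the toy model.\n","# lambda1, lambda2 and N are the same defaults used by test_bigram in Part 2.\n","lambda1 = 0.95\n","lambda2 = 0.95\n","N = 1000000\n","\n","p1 = lambda1 * probs.get('b', 0) + (1 - lambda1) / N     # smoothed unigram P(b)\n","p2 = lambda2 * probs.get('a b', 0) + (1 - lambda2) * p1  # smoothed bigram P(b | a)\n","print(p1, p2)  # roughly 0.2375 and 0.9619"],"execution_count":null,"outputs":[]},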
{"cell_type":"markdown","metadata":{"id":"rBpjGSG8ky7u"},"source":["## Part 2: Evaluating Bigram Language Model\n","\n","In this part, we will evaluate the bigram language model on the test set. We will use linear interpolation as the smoothing technique.\n","\n","What we need to do is to complete the function `test_bigram` below. The function will return the perplexity on the test data.\n","\n","Recall that the smoothed bigram probability using the interpolation technique is calculated as follows.\n","\n","Bigrams:\n","\n","$$\n","P(w_i|w_{i-1})=\\lambda_2 \\times P_{ML}(w_i|w_{i-1})+(1-\\lambda_2)\\times P(w_i)\n","$$\n","\n","where $P(w_i)$ is the smoothed unigram probability, calculated as follows.\n","\n","$$\n","P(w_i)=\\lambda_1 \\times P_{ML}(w_i) + (1-\\lambda_1) \\times \\frac{1}{N}\n","$$\n","\n","where $N$ is a large number, e.g., $N=1,000,000$."]},
{"cell_type":"code","metadata":{"id":"LLb5kGJilgrG"},"source":["import math\n","\n","\n","def test_bigram(test_file, model_file, lambda2=0.95, lambda1=0.95, N=1000000):\n","    W = 0  # Total word count\n","    H = 0  # Accumulated negative log2 probability\n","    probs = load_bigram_model(model_file)\n","    with open(test_file, 'r') as f:\n","        for line in f:\n","            line = line.strip()\n","            if line == '':\n","                continue\n","            words = line.split()\n","            words.append('</s>')\n","            words.insert(0, '<s>')\n","            for i in range(1, len(words)):  # Note: starting at 1, after <s>\n","                # TODO: Write code to calculate smoothed unigram probabilities\n","                # and smoothed bigram probabilities.\n","                # You should calculate p1 as the smoothed unigram probability\n","                # and p2 as the smoothed bigram probability.\n","                p1 = None  # Comment out this line\n","                p2 = None  # Comment out this line\n","\n","                # YOUR CODE HERE\n","\n","                p1 = (1-lambda1)/N  # Smoothed unigram probability\n","                if words[i] in probs:\n","                    p1 += lambda1 * probs[words[i]]\n","                p2 = (1-lambda2) * p1  # Smoothed bigram probability\n","                if words[i-1] + ' ' + words[i] in probs:\n","                    p2 += lambda2 * probs[words[i-1] + ' ' + words[i]]\n","\n","                # END OF YOUR CODE\n","                W += 1  # Count the words\n","                H += -math.log2(p2)  # We use logarithms to avoid underflow\n","    H = H/W\n","    P = 2**H\n","\n","    print(\"Entropy: {}\".format(H))\n","    print(\"Perplexity: {}\".format(P))\n","\n","    return P"],"execution_count":null,"outputs":[]},
{"cell_type":"markdown","metadata":{"id":"To39tEQ0pAH6"},"source":["Now let's train and evaluate on the Wikipedia data."]},
{"cell_type":"code","metadata":{"id":"ZHAE96uvpQPR","colab":{"base_uri":"https://localhost:8080/"},"outputId":"858ebfaa-2405-4520-bb1f-6c727605872b","executionInfo":{"status":"ok","timestamp":1766807903676,"user_tz":-420,"elapsed":180,"user":{"displayName":"Minh Pham","userId":"01293297774691882951"}}},"source":["train_bigram('wiki-en-train.word', 'bigram_model.txt')\n","test_bigram('wiki-en-test.word', 'bigram_model.txt')"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["Entropy: 11.284333504778214\n","Perplexity: 2494.151707389251\n"]},{"output_type":"execute_result","data":{"text/plain":["2494.151707389251"]},"metadata":{},"execution_count":9}]},
{"cell_type":"markdown","metadata":{"id":"NiyicE5hpUyb"},"source":["## Text Generation"]},
{"cell_type":"code","source":["from collections import defaultdict\n","\n","def build_bigram_distribution(probs):\n","    bigram = defaultdict(dict)\n","\n","    for key, p in probs.items():\n","        parts = key.split()\n","        if len(parts) == 2:  # keep only bigrams\n","            w_prev, w_next = parts\n","            bigram[w_prev][w_next] = p\n","\n","    return bigram"],"metadata":{"id":"wOKJfX6v5u4F"},"execution_count":null,"outputs":[]},
{"cell_type":"code","source":["probs = load_bigram_model('bigram_model.txt')\n","bigram = build_bigram_distribution(probs)"],"metadata":{"id":"raFM9yVK9L-Y"},"execution_count":null,"outputs":[]},
{"cell_type":"code","source":["import random\n","\n","def random_choice(distribution):\n","    # Sample one item from a {item: probability} dict by walking the cumulative distribution\n","    r = random.random()\n","    cumulative = 0.0\n","    last_item = None\n","\n","    for item, prob in distribution.items():\n","        cumulative += prob\n","        last_item = item\n","        if r <= cumulative:\n","            return item\n","\n","    return last_item"],"metadata":{"id":"BIezAgBf9hig"},"execution_count":null,"outputs":[]},
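{"cell_type":"markdown","metadata":{},"source":["`random_choice` above draws a word by walking the cumulative distribution. As an alternative sketch (not used by the generation code below), Python's built-in `random.choices` can perform the same weighted draw in one call; `random_choice_alt` is only an illustrative name."]},
{"cell_type":"code","metadata":{},"source":["import random\n","\n","def random_choice_alt(distribution):\n","    # Alternative to random_choice: weighted sampling via random.choices\n","    items = list(distribution.keys())\n","    weights = list(distribution.values())\n","    return random.choices(items, weights=weights, k=1)[0]"],"execution_count":null,"outputs":[]},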
{"cell_type":"code","source":["def generate_text(bigram, max_length=20):\n","    current = '<s>'\n","    result = []\n","\n","    for _ in range(max_length):\n","        if current not in bigram:\n","            break\n","\n","        next_word = random_choice(bigram[current])\n","\n","        if next_word == '</s>':\n","            break\n","\n","        result.append(next_word)\n","        current = next_word\n","\n","    return ' '.join(result)\n"],"metadata":{"id":"L-bs48pK9o3Q"},"execution_count":null,"outputs":[]},
{"cell_type":"code","source":["for _ in range(10):\n","    print(generate_text(bigram))"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"vjXiSyQ69yBl","executionInfo":{"status":"ok","timestamp":1766817263978,"user_tz":-420,"elapsed":10,"user":{"displayName":"Minh Pham","userId":"01293297774691882951"}},"outputId":"59462865-3ac3-4210-bc3b-61a725e1c5fa"},"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["The examples .\n","A machine translation , which are capable of a team led to texts by domain .\n","This has improved their answers the language to Australia .\n","Back-End or classified into machine-encoded text is prone to be the European group of these databases .\n","... .\n","In English sentences , to discover these words in the POS tagger on unsupervised and the question answering systems such\n","However , since 1965 it is very limited in the harmonic mean word , there are 9 parts of the\n","Our brain is not the top T vertices\\/unigrams are the second edition published at least five years development has been\n","Both QA systems are spoken text , it -RRB- securely ; and computational linguistics is the final letter to be\n","This is usually can benefit from section requires an inseparable part of the social psychology Response based on the keyphrases\n"]}]},
{"cell_type":"code","source":[],"metadata":{"id":"EJp2ntcL97VS"},"execution_count":null,"outputs":[]}]}