{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "yscsqVusJ07P" }, "source": [ "# Getting started with Text Analysis\n", "\n", "NLP Course 2022\n", "\n", "## Objectives\n", "\n", "In this documents, we will learn following text analysis techniques:\n", "\n", "- How to clean text data with regular expression\n", "- Tokenisation\n", "- POS Tagging\n", "\n", "We will see how to use NLP Toolkits such as NLTK, SpaCY, Underthesea, PhoNLP to do NLP pipeline." ] }, { "cell_type": "markdown", "metadata": { "id": "7TohQGX21Luo" }, "source": [ "## Cleaning Texts\n", "\n", "In NLP tasks, we may want to remove irrelevant contents in input data before using that for training or prediction. For instance, when we will delete URLs, punctuations in Twitter Posts in sentiment analysis task. We can do that with [Regular Expressions](https://www.tutorialspoint.com/python/python_reg_expressions.htm).\n", "\n", "In Python we can use `re` module for regular expressions.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "3U5H216N3H5y" }, "source": [ "### Demo data" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "executionInfo": { "elapsed": 4, "status": "ok", "timestamp": 1704527014833, "user": { "displayName": "Minh Pham", "userId": "01293297774691882951" }, "user_tz": -420 }, "id": "NkX8DVKn3OCd" }, "outputs": [], "source": [ "sent1 = \"Stories of Pittsburghers trapped in #Houston flooding!!!! @@ - https://t.co/j5igfpvLJu https://t.co/8gsUpD8jsa\"" ] }, { "cell_type": "markdown", "metadata": { "id": "Wpx-B9Xf3P7o" }, "source": [ "### Remove URL" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 36 }, "executionInfo": { "elapsed": 521, "status": "ok", "timestamp": 1704527021167, "user": { "displayName": "Minh Pham", "userId": "01293297774691882951" }, "user_tz": -420 }, "id": "hnFbggJZ3ZhK", "outputId": "d5b87242-bc3b-48e1-e488-b06137cdadea" }, "outputs": [ { "data": { "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" }, "text/plain": [ "'Stories of Pittsburghers trapped in #Houston flooding!!!! 
@@ - '" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import re\n", "\n", "# When the UNICODE flag is not specified, matches any non-whitespace character\n", "result = re.sub(r\"[(http(s)?):\\/\\/(www\\.)?a-zA-Z0-9@:%._\\+~#=]{2,256}\\.[a-z]{2,6}\\b([-a-zA-Z0-9@:%_\\+.~#?&//=]*)\", \"\", sent1)\n", "result" ] }, { "cell_type": "markdown", "metadata": { "id": "Qsl6K6Bf3uez" }, "source": [ "### Remove punctuations in text" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "executionInfo": { "elapsed": 546, "status": "ok", "timestamp": 1704527778254, "user": { "displayName": "Minh Pham", "userId": "01293297774691882951" }, "user_tz": -420 }, "id": "xcnKBU5j34k2" }, "outputs": [], "source": [ "import string" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 36 }, "executionInfo": { "elapsed": 528, "status": "ok", "timestamp": 1704527782689, "user": { "displayName": "Minh Pham", "userId": "01293297774691882951" }, "user_tz": -420 }, "id": "oFTtImLc4Phf", "outputId": "569b5959-b59d-4102-f2eb-a24204735143" }, "outputs": [ { "data": { "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" }, "text/plain": [ "'Stories of Pittsburghers trapped in Houston flooding '" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "translator = str.maketrans('', '', string.punctuation)\n", "result2 = result.translate(translator)\n", "result2" ] }, { "cell_type": "markdown", "metadata": { "id": "YN_lo5I74nfJ" }, "source": [ "## NTLK (Natural Language Toolkit)\n", "\n", "[NLTK](https://www.nltk.org/) is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.\n", "\n", "We will see how to use nltk to do basic text analysis in NLP pipeline.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "i0OvmkVY8D7R" }, "source": [ "### Sentence tokenization\n", "\n", "We split a long paragraph into sentences." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "executionInfo": { "elapsed": 505, "status": "ok", "timestamp": 1704527957643, "user": { "displayName": "Minh Pham", "userId": "01293297774691882951" }, "user_tz": -420 }, "id": "lSzHbdqV8NuB" }, "outputs": [], "source": [ "para = \"Hello World Dr. John. It's good to see you. Thanks for buying this book.\"" ] }, { "cell_type": "markdown", "metadata": { "id": "_daqy3TC8ZLZ" }, "source": [ "We need to download appropriate models before using nltk to do some tasks." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 3498, "status": "ok", "timestamp": 1704527985702, "user": { "displayName": "Minh Pham", "userId": "01293297774691882951" }, "user_tz": -420 }, "id": "LJJuV9uN8lOu", "outputId": "1a46c590-5047-4b0d-b7ee-9f2446fcc31c" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[nltk_data] Downloading package punkt to /root/nltk_data...\n", "[nltk_data] Unzipping tokenizers/punkt.zip.\n" ] }, { "data": { "text/plain": [ "True" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import nltk\n", "nltk.download('punkt')" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 3, "status": "ok", "timestamp": 1704528004990, "user": { "displayName": "Minh Pham", "userId": "01293297774691882951" }, "user_tz": -420 }, "id": "vuxa9LFA9cwi", "outputId": "3a6c4ebd-024d-4644-d10e-15b1404642ff" }, "outputs": [ { "data": { "text/plain": [ "['Hello World Dr. John.',\n", " \"It's good to see you.\",\n", " 'Thanks for buying this book.']" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from nltk.tokenize import sent_tokenize\n", "sent_tokenize(para)" ] }, { "cell_type": "markdown", "metadata": { "id": "FiMAjqaV9g2N" }, "source": [ "### Word tokenization\n", "\n", "Word tokenization is to split a sentence into tokens." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 517, "status": "ok", "timestamp": 1704528276178, "user": { "displayName": "Minh Pham", "userId": "01293297774691882951" }, "user_tz": -420 }, "id": "Z9ZEk8TRPvfk", "outputId": "8506b2b4-42cc-472d-cada-271bf57e1c5b" }, "outputs": [ { "data": { "text/plain": [ "['Hello', 'World', 'Dr.', 'John', '.']" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from nltk.tokenize import word_tokenize\n", "\n", "# sent = 'The history of NLP generally starts in the 1950s, although work can be found from earlier periods.'\n", "sent = 'Hello World Dr. John.'\n", "word_tokenize(sent)" ] }, { "cell_type": "markdown", "metadata": { "id": "FGtoMVgMPxfW" }, "source": [ "### POS Tagging\n", "\n", "POS Tagging is the process of assigning Part-of-speech to each token in a sentence. We need to tokenize the sentence first before performing POS Tagging.\n", "\n", "In NLTK, we need to download the tagging model first." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 668, "status": "ok", "timestamp": 1704528359753, "user": { "displayName": "Minh Pham", "userId": "01293297774691882951" }, "user_tz": -420 }, "id": "0FWKGy8eReR-", "outputId": "9b90ade1-635f-4841-b8bd-b68a24e19098" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[nltk_data] Downloading package averaged_perceptron_tagger to\n", "[nltk_data] /root/nltk_data...\n", "[nltk_data] Unzipping taggers/averaged_perceptron_tagger.zip.\n" ] }, { "data": { "text/plain": [ "True" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nltk.download('averaged_perceptron_tagger')" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 730, "status": "ok", "timestamp": 1704528364684, "user": { "displayName": "Minh Pham", "userId": "01293297774691882951" }, "user_tz": -420 }, "id": "Tmu6kLiUQWOh", "outputId": "f22e3245-2ef2-43da-ba5e-42e88f9c2fe4" }, "outputs": [ { "data": { "text/plain": [ "[('The', 'DT'),\n", " ('history', 'NN'),\n", " ('of', 'IN'),\n", " ('NLP', 'NNP'),\n", " ('generally', 'RB'),\n", " ('starts', 'VBZ'),\n", " ('in', 'IN'),\n", " ('the', 'DT'),\n", " ('1950s', 'CD'),\n", " (',', ','),\n", " ('although', 'IN'),\n", " ('work', 'NN'),\n", " ('can', 'MD'),\n", " ('be', 'VB'),\n", " ('found', 'VBN'),\n", " ('from', 'IN'),\n", " ('earlier', 'JJR'),\n", " ('periods', 'NNS'),\n", " ('.', '.')]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from nltk.tag import pos_tag\n", "from nltk.tokenize import word_tokenize\n", "\n", "sent = 'The history of NLP generally starts in the 1950s, although work can be found from earlier periods.'\n", "pos_tag(word_tokenize(sent))" ] }, { "cell_type": "markdown", "metadata": { "id": "YFcdFb__Q_z_" }, "source": [ "### Filtering stop words\n", "\n", "What are stopwords?\n", "\n", "Stop words are the words in a stop list (or stoplist or negative dictionary) which are filtered out (i.e. stopped) before or after processing of natural language data (text) because they are insignificant. (Wikipedia).\n", "\n", "We are going to filter out stop words in a sentence.\n", "\n", "NLTK includes English stop words." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 1332, "status": "ok", "timestamp": 1704528579584, "user": { "displayName": "Minh Pham", "userId": "01293297774691882951" }, "user_tz": -420 }, "id": "vqX8OJXfS2VZ", "outputId": "7272aa4c-222e-4d7e-e257-11e5590c1340" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[nltk_data] Downloading package stopwords to /root/nltk_data...\n", "[nltk_data] Unzipping corpora/stopwords.zip.\n" ] } ], "source": [ "%%capture\n", "nltk.download('stopwords')" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 652, "status": "ok", "timestamp": 1704528587455, "user": { "displayName": "Minh Pham", "userId": "01293297774691882951" }, "user_tz": -420 }, "id": "NRG4SSw5Tsq8", "outputId": "3ee18826-79ef-4f05-9a2f-2516ad579156" }, "outputs": [ { "data": { "text/plain": [ "['The',\n", " 'history',\n", " 'NLP',\n", " 'generally',\n", " 'starts',\n", " '1950s',\n", " ',',\n", " 'although',\n", " 'work',\n", " 'found',\n", " 'earlier',\n", " 'periods',\n", " '.']" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from nltk.corpus import stopwords\n", "english_stops = set(stopwords.words('english'))\n", "\n", "sent = 'The history of NLP generally starts in the 1950s, although work can be found from earlier periods.'\n", "\n", "words = nltk.word_tokenize(sent)\n", "words_without_stopwords = [word for word in words if word not in english_stops]\n", "words_without_stopwords" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "9GPissth0ntC" }, "outputs": [], "source": [ "# prompt: Print stop words in English\n", "\n", "stopwords.words('english')\n" ] }, { "cell_type": "markdown", "metadata": { "id": "egBk-N64T4Z0" }, "source": [ "## spaCy\n", "\n", "[spaCy](https://spacy.io/) is an open-source software library for advanced natural language processing, written in Python and Cython. Let's see how we can do NLP Pipeline with spaCy.\n", "\n", "We need to download English model before using spaCy.\n" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "executionInfo": { "elapsed": 36967, "status": "ok", "timestamp": 1704528902125, "user": { "displayName": "Minh Pham", "userId": "01293297774691882951" }, "user_tz": -420 }, "id": "jouE8yhnXGt4" }, "outputs": [], "source": [ "%%capture\n", "!pip install -U spacy\n", "!python -m spacy download en_core_web_sm" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "executionInfo": { "elapsed": 21564, "status": "ok", "timestamp": 1704528928369, "user": { "displayName": "Minh Pham", "userId": "01293297774691882951" }, "user_tz": -420 }, "id": "MWrMwbLccobM" }, "outputs": [], "source": [ "import spacy\n", "nlp = spacy.load(\"en_core_web_sm\")" ] }, { "cell_type": "markdown", "metadata": { "id": "mLfKCmubWuy-" }, "source": [ "### Sentence tokenization" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 6, "status": "ok", "timestamp": 1704528969817, "user": { "displayName": "Minh Pham", "userId": "01293297774691882951" }, "user_tz": -420 }, "id": "ymVhbDsrciFD", "outputId": "f0e8930f-73a5-40c8-9aff-dc91ee932dcc" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Hello World Dr. 
John.', \"It's good to see you.\", 'Thanks for buying this book.']\n" ] } ], "source": [ "para = \"Hello World Dr. John. It's good to see you. Thanks for buying this book.\"\n", "doc = nlp(para)\n", "sents = [sent.text for sent in doc.sents]\n", "print(sents)" ] }, { "cell_type": "markdown", "metadata": { "id": "brypQEi1W1jx" }, "source": [ "### Word tokenization" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 523, "status": "ok", "timestamp": 1704528979346, "user": { "displayName": "Minh Pham", "userId": "01293297774691882951" }, "user_tz": -420 }, "id": "YCeH_cJXcszT", "outputId": "65543395-d97d-4c00-d8f1-f16c80b0ddae" }, "outputs": [ { "data": { "text/plain": [ "['The',\n", " 'history',\n", " 'of',\n", " 'NLP',\n", " 'generally',\n", " 'starts',\n", " 'in',\n", " 'the',\n", " '1950s',\n", " ',',\n", " 'although',\n", " 'work',\n", " 'can',\n", " 'be',\n", " 'found',\n", " 'from',\n", " 'earlier',\n", " 'periods',\n", " '.']" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sent = 'The history of NLP generally starts in the 1950s, although work can be found from earlier periods.'\n", "doc = nlp(sent)\n", "tokens = [x.text for x in doc]\n", "tokens" ] }, { "cell_type": "markdown", "metadata": { "id": "Q862p1aWW6X2" }, "source": [ "### POS Tagging" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 549, "status": "ok", "timestamp": 1704528984484, "user": { "displayName": "Minh Pham", "userId": "01293297774691882951" }, "user_tz": -420 }, "id": "V_Hu6FcUdChy", "outputId": "8c1b7730-411c-4fed-9968-b0d36f13376d" }, "outputs": [ { "data": { "text/plain": [ "[('The', 'DET'),\n", " ('history', 'NOUN'),\n", " ('of', 'ADP'),\n", " ('NLP', 'PROPN'),\n", " ('generally', 'ADV'),\n", " ('starts', 'VERB'),\n", " ('in', 'ADP'),\n", " ('the', 'DET'),\n", " ('1950s', 'NOUN'),\n", " (',', 'PUNCT'),\n", " ('although', 'SCONJ'),\n", " ('work', 'NOUN'),\n", " ('can', 'AUX'),\n", " ('be', 'AUX'),\n", " ('found', 'VERB'),\n", " ('from', 'ADP'),\n", " ('earlier', 'ADJ'),\n", " ('periods', 'NOUN'),\n", " ('.', 'PUNCT')]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[(x.text, x.pos_) for x in doc]" ] }, { "cell_type": "markdown", "metadata": { "id": "ylEbEtZJd2hx" }, "source": [ "### Filtering stop words" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 4, "status": "ok", "timestamp": 1671869707057, "user": { "displayName": "Minh Pham", "userId": "01293297774691882951" }, "user_tz": -420 }, "id": "w6IQWLrreWet", "outputId": "42eb7d4c-2358-4185-a036-3a7937e967dc" }, "outputs": [ { "data": { "text/plain": [ "['The',\n", " 'history',\n", " 'NLP',\n", " 'generally',\n", " 'starts',\n", " '1950s',\n", " ',',\n", " 'work',\n", " 'found',\n", " 'earlier',\n", " 'periods',\n", " '.']" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "spacy_stopwords = set(spacy.lang.en.stop_words.STOP_WORDS)\n", "sent = 'The history of NLP generally starts in the 1950s, although work can be found from earlier periods.'\n", "doc = nlp(sent)\n", "words = [x.text for x in doc]\n", "[word for word in words if word not in spacy_stopwords]" ] }, { "cell_type": "markdown", "metadata": { "id": "ovnyLt17W85N" }, "source": 
[ "### Named-Entity Recognition" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 644, "status": "ok", "timestamp": 1704528989880, "user": { "displayName": "Minh Pham", "userId": "01293297774691882951" }, "user_tz": -420 }, "id": "fUnrQs43W_fe", "outputId": "7ff07453-a386-484e-a046-128656095bad" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Donald Trump 0 12 PERSON\n", "Washington 48 58 GPE\n", "the White House 96 111 FAC\n" ] } ], "source": [ "text = (\"Donald Trump has brought his drama show back to Washington early, \"\n", " \"perhaps realizing his time in the White House is down to days and counting\")\n", "doc = nlp(text)\n", "for ent in doc.ents:\n", " print(ent.text, ent.start_char, ent.end_char, ent.label_)" ] }, { "cell_type": "markdown", "metadata": { "id": "2qb2uhZQe-00" }, "source": [ "## Other NLP Toolkits for English\n", "\n", "- [Stanza](https://github.com/stanfordnlp/stanza/) (Python)\n", "- [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/) (Java)\n", "- [Apache OpenNLP](https://github.com/apache/opennlp) (Java)\n", "- [DKPro Core](https://dkpro.github.io/dkpro-core/) (Java)" ] }, { "cell_type": "markdown", "metadata": { "id": "KC7qd9Vfdk7u" }, "source": [ "## References\n", "\n", "1. [spaCy 101: Everything you need to know](https://spacy.io/usage/spacy-101)\n", "2. [Advanced NLP with spacy](https://course.spacy.io/)\n", "3. Bird, Steven; Klein, Ewan; Loper, Edward (2009). *Natural Language Processing with Python*. [http://www.nltk.org/book/](http://www.nltk.org/book/)\n", "4. [NLTK in 20 minutes](http://www.slideshare.net/japerk/nltk-in-20-minutes), by Jacob Perkins\n" ] } ], "metadata": { "colab": { "authorship_tag": "ABX9TyOrLXjowac0LjfO/KU9H09X", "provenance": [] }, "kernelspec": { "display_name": "Python [conda env:base] *", "language": "python", "name": "conda-base-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.7" } }, "nbformat": 4, "nbformat_minor": 4 }