{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"provenance":[],"authorship_tag":"ABX9TyNIMreHB2sK0dvFgUNBjrAG"},"kernelspec":{"name":"python3","display_name":"Python 3"},"language_info":{"name":"python"}},"cells":[{"cell_type":"markdown","source":["# Getting started with Text Analysis\n","\n","NLP Course 2024\n","\n","## Objectives\n","\n","In this document, we will learn the following text analysis techniques:\n","\n","- How to clean text data with regular expressions\n","- Tokenization\n","- POS Tagging\n","\n","We will see how to use NLP toolkits such as NLTK, spaCy, Underthesea, and PhoNLP to build an NLP pipeline."],"metadata":{"id":"yscsqVusJ07P"}},{"cell_type":"markdown","source":["## Cleaning Texts\n","\n","In NLP tasks, we may want to remove irrelevant content from input data before using it for training or prediction. For instance, we may delete URLs and punctuation from Twitter posts in a sentiment analysis task. We can do that with [Regular Expressions](https://www.tutorialspoint.com/python/python_reg_expressions.htm).\n","\n","In Python, we can use the `re` module for regular expressions.\n"],"metadata":{"id":"7TohQGX21Luo"}},{"cell_type":"markdown","source":["### Demo data"],"metadata":{"id":"3U5H216N3H5y"}},{"cell_type":"code","source":["sent1 = \"Stories of Pittsburghers trapped in #Houston flooding!!!! 
@@ - https://t.co/j5igfpvLJu https://t.co/8gsUpD8jsa\""],"metadata":{"id":"NkX8DVKn3OCd","executionInfo":{"status":"ok","timestamp":1765581313707,"user_tz":-420,"elapsed":4,"user":{"displayName":"Minh Pham","userId":"01293297774691882951"}}},"execution_count":1,"outputs":[]},{"cell_type":"markdown","source":["### Remove URLs"],"metadata":{"id":"Wpx-B9Xf3P7o"}},{"cell_type":"code","source":["import re\n","\n","# A rough URL pattern: match http(s) links and bare domains, then delete each match\n","result = re.sub(r\"[(http(s)?):\\/\\/(www\\.)?a-zA-Z0-9@:%._\\+~#=]{2,256}\\.[a-z]{2,6}\\b([-a-zA-Z0-9@:%_\\+.~#?&//=]*)\", \"\", sent1)\n","result"],"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":36},"id":"hnFbggJZ3ZhK","executionInfo":{"status":"ok","timestamp":1765581315717,"user_tz":-420,"elapsed":25,"user":{"displayName":"Minh Pham","userId":"01293297774691882951"}},"outputId":"1a0d2a3c-aba3-4ba5-ef95-67a15de62813"},"execution_count":2,"outputs":[{"output_type":"execute_result","data":{"text/plain":["'Stories of Pittsburghers trapped in #Houston flooding!!!! 
@@ - '"],"application/vnd.google.colaboratory.intrinsic+json":{"type":"string"}},"metadata":{},"execution_count":2}]},{"cell_type":"markdown","source":["### Remove punctuation in text"],"metadata":{"id":"Qsl6K6Bf3uez"}},{"cell_type":"code","source":["import string"],"metadata":{"id":"xcnKBU5j34k2","executionInfo":{"status":"ok","timestamp":1765581318612,"user_tz":-420,"elapsed":19,"user":{"displayName":"Minh Pham","userId":"01293297774691882951"}}},"execution_count":3,"outputs":[]},{"cell_type":"code","source":["# Build a translation table that maps every punctuation character to None\n","translator = str.maketrans('', '', string.punctuation)\n","result2 = result.translate(translator)\n","result2"],"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":36},"id":"oFTtImLc4Phf","executionInfo":{"status":"ok","timestamp":1765581319816,"user_tz":-420,"elapsed":19,"user":{"displayName":"Minh Pham","userId":"01293297774691882951"}},"outputId":"62f1a40c-64ab-4613-acb6-1ec539ba0ca6"},"execution_count":4,"outputs":[{"output_type":"execute_result","data":{"text/plain":["'Stories of Pittsburghers trapped in Houston flooding '"],"application/vnd.google.colaboratory.intrinsic+json":{"type":"string"}},"metadata":{},"execution_count":4}]},{"cell_type":"markdown","source":["## NLTK (Natural Language Toolkit)\n","\n","[NLTK](https://www.nltk.org/) is a leading platform for building Python programs to work with human language data. 
It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.\n","\n","We will see how to use NLTK to do basic text analysis in an NLP pipeline.\n"],"metadata":{"id":"YN_lo5I74nfJ"}},{"cell_type":"markdown","source":["### Sentence tokenization\n","\n","We split a long paragraph into sentences."],"metadata":{"id":"i0OvmkVY8D7R"}},{"cell_type":"code","source":["para = \"Hello World Dr. John. It's good to see you. Thanks for buying this book.\""],"metadata":{"id":"lSzHbdqV8NuB","executionInfo":{"status":"ok","timestamp":1765581322052,"user_tz":-420,"elapsed":4,"user":{"displayName":"Minh Pham","userId":"01293297774691882951"}}},"execution_count":5,"outputs":[]},{"cell_type":"markdown","source":["We need to download the appropriate models before using NLTK for some tasks."],"metadata":{"id":"_daqy3TC8ZLZ"}},{"cell_type":"code","source":["import nltk\n","nltk.download('punkt')\n","nltk.download('punkt_tab')"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"LJJuV9uN8lOu","executionInfo":{"status":"ok","timestamp":1765581327161,"user_tz":-420,"elapsed":3437,"user":{"displayName":"Minh Pham","userId":"01293297774691882951"}},"outputId":"b8bbb6fa-4a21-4f25-b2b5-a0366f7e95f1"},"execution_count":6,"outputs":[{"output_type":"stream","name":"stderr","text":["[nltk_data] Downloading package punkt to /root/nltk_data...\n","[nltk_data] Unzipping tokenizers/punkt.zip.\n","[nltk_data] Downloading package punkt_tab to /root/nltk_data...\n","[nltk_data] Unzipping tokenizers/punkt_tab.zip.\n"]},{"output_type":"execute_result","data":{"text/plain":["True"]},"metadata":{},"execution_count":6}]},{"cell_type":"code","source":["from nltk.tokenize import 
sent_tokenize\n","sent_tokenize(para)"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"vuxa9LFA9cwi","executionInfo":{"status":"ok","timestamp":1765581328396,"user_tz":-420,"elapsed":183,"user":{"displayName":"Minh Pham","userId":"01293297774691882951"}},"outputId":"e259536e-ba2d-4d82-d450-81d1cb444ec4"},"execution_count":7,"outputs":[{"output_type":"execute_result","data":{"text/plain":["['Hello World Dr. John.',\n"," \"It's good to see you.\",\n"," 'Thanks for buying this book.']"]},"metadata":{},"execution_count":7}]},{"cell_type":"markdown","source":["### Word tokenization\n","\n","Word tokenization splits a sentence into tokens."],"metadata":{"id":"FiMAjqaV9g2N"}},{"cell_type":"code","source":["from nltk.tokenize import word_tokenize\n","\n","# sent = 'The history of NLP generally starts in the 1950s, although work can be found from earlier periods.'\n","sent = 'Hello World Dr. John.'\n","word_tokenize(sent)"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"Z9ZEk8TRPvfk","executionInfo":{"status":"ok","timestamp":1765581329937,"user_tz":-420,"elapsed":9,"user":{"displayName":"Minh Pham","userId":"01293297774691882951"}},"outputId":"1f7a5964-ad63-44e5-bd81-69a4724e1b29"},"execution_count":8,"outputs":[{"output_type":"execute_result","data":{"text/plain":["['Hello', 'World', 'Dr.', 'John', '.']"]},"metadata":{},"execution_count":8}]},{"cell_type":"markdown","source":["### POS Tagging\n","\n","POS Tagging is the process of assigning a part-of-speech tag to each token in a sentence. 
We need to tokenize the sentence first before performing POS Tagging.\n","\n","In NLTK, we need to download the tagging model first."],"metadata":{"id":"FGtoMVgMPxfW"}},{"cell_type":"code","source":["nltk.download('averaged_perceptron_tagger')\n","nltk.download('averaged_perceptron_tagger_eng')"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"0FWKGy8eReR-","executionInfo":{"status":"ok","timestamp":1765581331726,"user_tz":-420,"elapsed":277,"user":{"displayName":"Minh Pham","userId":"01293297774691882951"}},"outputId":"c6a8acc7-dc0e-4f59-835c-2aa2760b9db4"},"execution_count":9,"outputs":[{"output_type":"stream","name":"stderr","text":["[nltk_data] Downloading package averaged_perceptron_tagger to\n","[nltk_data] /root/nltk_data...\n","[nltk_data] Unzipping taggers/averaged_perceptron_tagger.zip.\n","[nltk_data] Downloading package averaged_perceptron_tagger_eng to\n","[nltk_data] /root/nltk_data...\n","[nltk_data] Unzipping taggers/averaged_perceptron_tagger_eng.zip.\n"]},{"output_type":"execute_result","data":{"text/plain":["True"]},"metadata":{},"execution_count":9}]},{"cell_type":"code","source":["from nltk.tag import pos_tag\n","from nltk.tokenize import word_tokenize\n","\n","sent = 'The history of NLP generally starts in the 1950s, although work can be found from earlier periods.'\n","pos_tag(word_tokenize(sent))"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"Tmu6kLiUQWOh","executionInfo":{"status":"ok","timestamp":1765581334109,"user_tz":-420,"elapsed":157,"user":{"displayName":"Minh Pham","userId":"01293297774691882951"}},"outputId":"cf941b31-dfea-40f2-ca01-a79826ca2d49"},"execution_count":10,"outputs":[{"output_type":"execute_result","data":{"text/plain":["[('The', 'DT'),\n"," ('history', 'NN'),\n"," ('of', 'IN'),\n"," ('NLP', 'NNP'),\n"," ('generally', 'RB'),\n"," ('starts', 'VBZ'),\n"," ('in', 'IN'),\n"," ('the', 'DT'),\n"," ('1950s', 'CD'),\n"," (',', ','),\n"," ('although', 'IN'),\n"," ('work', 'NN'),\n"," ('can', 
'MD'),\n"," ('be', 'VB'),\n"," ('found', 'VBN'),\n"," ('from', 'IN'),\n"," ('earlier', 'JJR'),\n"," ('periods', 'NNS'),\n"," ('.', '.')]"]},"metadata":{},"execution_count":10}]},{"cell_type":"markdown","source":["### Filtering stop words\n","\n","What are stopwords?\n","\n","Stop words are the words in a stop list (or stoplist or negative dictionary) which are filtered out (i.e. stopped) before or after processing of natural language data (text) because they are insignificant. (Wikipedia).\n","\n","We are going to filter out stop words in a sentence.\n","\n","NLTK includes English stop words."],"metadata":{"id":"YFcdFb__Q_z_"}},{"cell_type":"code","source":["%%capture\n","nltk.download('stopwords')"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"vqX8OJXfS2VZ","executionInfo":{"status":"ok","timestamp":1765581335905,"user_tz":-420,"elapsed":42,"user":{"displayName":"Minh Pham","userId":"01293297774691882951"}},"outputId":"8311ad77-ed0a-4b38-c18d-15a7e7a3742a"},"execution_count":11,"outputs":[{"output_type":"stream","name":"stderr","text":["[nltk_data] Downloading package stopwords to /root/nltk_data...\n","[nltk_data] Unzipping corpora/stopwords.zip.\n"]}]},{"cell_type":"code","source":["from nltk.corpus import stopwords\n","english_stops = set(stopwords.words('english'))\n","\n","sent = 'The history of NLP generally starts in the 1950s, although work can be found from earlier periods.'\n","\n","words = nltk.word_tokenize(sent)\n","words_without_stopwords = [word for word in words if word not in english_stops]\n","words_without_stopwords"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"NRG4SSw5Tsq8","executionInfo":{"status":"ok","timestamp":1765581336956,"user_tz":-420,"elapsed":7,"user":{"displayName":"Minh Pham","userId":"01293297774691882951"}},"outputId":"ff94ca3b-608a-43a4-fc34-ed6cc87ecc13"},"execution_count":12,"outputs":[{"output_type":"execute_result","data":{"text/plain":["['The',\n"," 'history',\n"," 'NLP',\n"," 
'generally',\n"," 'starts',\n"," '1950s',\n"," ',',\n"," 'although',\n"," 'work',\n"," 'found',\n"," 'earlier',\n"," 'periods',\n"," '.']"]},"metadata":{},"execution_count":12}]},{"cell_type":"code","source":["# prompt: Print stop words in English\n","\n","stopwords.words('english')\n"],"metadata":{"id":"9GPissth0ntC","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1765581337909,"user_tz":-420,"elapsed":6,"user":{"displayName":"Minh Pham","userId":"01293297774691882951"}},"outputId":"0f12f8ec-c440-4a98-d263-400af81b9a3a"},"execution_count":13,"outputs":[{"output_type":"execute_result","data":{"text/plain":["['a',\n"," 'about',\n"," 'above',\n"," 'after',\n"," 'again',\n"," 'against',\n"," 'ain',\n"," 'all',\n"," 'am',\n"," 'an',\n"," 'and',\n"," 'any',\n"," 'are',\n"," 'aren',\n"," \"aren't\",\n"," 'as',\n"," 'at',\n"," 'be',\n"," 'because',\n"," 'been',\n"," 'before',\n"," 'being',\n"," 'below',\n"," 'between',\n"," 'both',\n"," 'but',\n"," 'by',\n"," 'can',\n"," 'couldn',\n"," \"couldn't\",\n"," 'd',\n"," 'did',\n"," 'didn',\n"," \"didn't\",\n"," 'do',\n"," 'does',\n"," 'doesn',\n"," \"doesn't\",\n"," 'doing',\n"," 'don',\n"," \"don't\",\n"," 'down',\n"," 'during',\n"," 'each',\n"," 'few',\n"," 'for',\n"," 'from',\n"," 'further',\n"," 'had',\n"," 'hadn',\n"," \"hadn't\",\n"," 'has',\n"," 'hasn',\n"," \"hasn't\",\n"," 'have',\n"," 'haven',\n"," \"haven't\",\n"," 'having',\n"," 'he',\n"," \"he'd\",\n"," \"he'll\",\n"," 'her',\n"," 'here',\n"," 'hers',\n"," 'herself',\n"," \"he's\",\n"," 'him',\n"," 'himself',\n"," 'his',\n"," 'how',\n"," 'i',\n"," \"i'd\",\n"," 'if',\n"," \"i'll\",\n"," \"i'm\",\n"," 'in',\n"," 'into',\n"," 'is',\n"," 'isn',\n"," \"isn't\",\n"," 'it',\n"," \"it'd\",\n"," \"it'll\",\n"," \"it's\",\n"," 'its',\n"," 'itself',\n"," \"i've\",\n"," 'just',\n"," 'll',\n"," 'm',\n"," 'ma',\n"," 'me',\n"," 'mightn',\n"," \"mightn't\",\n"," 'more',\n"," 'most',\n"," 'mustn',\n"," \"mustn't\",\n"," 'my',\n"," 
'myself',\n"," 'needn',\n"," \"needn't\",\n"," 'no',\n"," 'nor',\n"," 'not',\n"," 'now',\n"," 'o',\n"," 'of',\n"," 'off',\n"," 'on',\n"," 'once',\n"," 'only',\n"," 'or',\n"," 'other',\n"," 'our',\n"," 'ours',\n"," 'ourselves',\n"," 'out',\n"," 'over',\n"," 'own',\n"," 're',\n"," 's',\n"," 'same',\n"," 'shan',\n"," \"shan't\",\n"," 'she',\n"," \"she'd\",\n"," \"she'll\",\n"," \"she's\",\n"," 'should',\n"," 'shouldn',\n"," \"shouldn't\",\n"," \"should've\",\n"," 'so',\n"," 'some',\n"," 'such',\n"," 't',\n"," 'than',\n"," 'that',\n"," \"that'll\",\n"," 'the',\n"," 'their',\n"," 'theirs',\n"," 'them',\n"," 'themselves',\n"," 'then',\n"," 'there',\n"," 'these',\n"," 'they',\n"," \"they'd\",\n"," \"they'll\",\n"," \"they're\",\n"," \"they've\",\n"," 'this',\n"," 'those',\n"," 'through',\n"," 'to',\n"," 'too',\n"," 'under',\n"," 'until',\n"," 'up',\n"," 've',\n"," 'very',\n"," 'was',\n"," 'wasn',\n"," \"wasn't\",\n"," 'we',\n"," \"we'd\",\n"," \"we'll\",\n"," \"we're\",\n"," 'were',\n"," 'weren',\n"," \"weren't\",\n"," \"we've\",\n"," 'what',\n"," 'when',\n"," 'where',\n"," 'which',\n"," 'while',\n"," 'who',\n"," 'whom',\n"," 'why',\n"," 'will',\n"," 'with',\n"," 'won',\n"," \"won't\",\n"," 'wouldn',\n"," \"wouldn't\",\n"," 'y',\n"," 'you',\n"," \"you'd\",\n"," \"you'll\",\n"," 'your',\n"," \"you're\",\n"," 'yours',\n"," 'yourself',\n"," 'yourselves',\n"," \"you've\"]"]},"metadata":{},"execution_count":13}]},{"cell_type":"markdown","source":["## spaCy\n","\n","[spaCy](https://spacy.io/) is an open-source software library for advanced natural language processing, written in Python and Cython. 
Let's see how we can build an NLP pipeline with spaCy.\n","\n","We need to download an English model before using spaCy.\n"],"metadata":{"id":"egBk-N64T4Z0"}},{"cell_type":"code","source":["%%capture\n","!pip install -U spacy\n","!python -m spacy download en_core_web_sm"],"metadata":{"id":"jouE8yhnXGt4","executionInfo":{"status":"ok","timestamp":1765581354178,"user_tz":-420,"elapsed":14370,"user":{"displayName":"Minh Pham","userId":"01293297774691882951"}}},"execution_count":14,"outputs":[]},{"cell_type":"code","source":["import spacy\n","nlp = spacy.load(\"en_core_web_sm\")"],"metadata":{"id":"MWrMwbLccobM","executionInfo":{"status":"ok","timestamp":1765581364284,"user_tz":-420,"elapsed":8049,"user":{"displayName":"Minh Pham","userId":"01293297774691882951"}}},"execution_count":15,"outputs":[]},{"cell_type":"markdown","source":["### Sentence tokenization"],"metadata":{"id":"mLfKCmubWuy-"}},{"cell_type":"code","source":["para = \"Hello World Dr. John. It's good to see you. Thanks for buying this book.\"\n","doc = nlp(para)\n","sents = [sent.text for sent in doc.sents]\n","print(sents)"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"ymVhbDsrciFD","executionInfo":{"status":"ok","timestamp":1765581366311,"user_tz":-420,"elapsed":22,"user":{"displayName":"Minh Pham","userId":"01293297774691882951"}},"outputId":"ace9039a-fc0c-42e6-8575-b3c97bff2d74"},"execution_count":16,"outputs":[{"output_type":"stream","name":"stdout","text":["['Hello World Dr. 
John.', \"It's good to see you.\", 'Thanks for buying this book.']\n"]}]},{"cell_type":"markdown","source":["### Word tokenization"],"metadata":{"id":"brypQEi1W1jx"}},{"cell_type":"code","source":["sent = 'The history of NLP generally starts in the 1950s, although work can be found from earlier periods.'\n","doc = nlp(sent)\n","tokens = [x.text for x in doc]\n","tokens"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"YCeH_cJXcszT","executionInfo":{"status":"ok","timestamp":1765581368050,"user_tz":-420,"elapsed":22,"user":{"displayName":"Minh Pham","userId":"01293297774691882951"}},"outputId":"f436e47e-5789-430d-f27b-b8a7a30ab64b"},"execution_count":17,"outputs":[{"output_type":"execute_result","data":{"text/plain":["['The',\n"," 'history',\n"," 'of',\n"," 'NLP',\n"," 'generally',\n"," 'starts',\n"," 'in',\n"," 'the',\n"," '1950s',\n"," ',',\n"," 'although',\n"," 'work',\n"," 'can',\n"," 'be',\n"," 'found',\n"," 'from',\n"," 'earlier',\n"," 'periods',\n"," '.']"]},"metadata":{},"execution_count":17}]},{"cell_type":"markdown","source":["### POS Tagging"],"metadata":{"id":"Q862p1aWW6X2"}},{"cell_type":"code","source":["[(x.text, x.pos_) for x in doc]"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"V_Hu6FcUdChy","executionInfo":{"status":"ok","timestamp":1765581370237,"user_tz":-420,"elapsed":7,"user":{"displayName":"Minh Pham","userId":"01293297774691882951"}},"outputId":"ccab6a52-cadb-42ab-dc5c-52d0a2035a15"},"execution_count":18,"outputs":[{"output_type":"execute_result","data":{"text/plain":["[('The', 'DET'),\n"," ('history', 'NOUN'),\n"," ('of', 'ADP'),\n"," ('NLP', 'PROPN'),\n"," ('generally', 'ADV'),\n"," ('starts', 'VERB'),\n"," ('in', 'ADP'),\n"," ('the', 'DET'),\n"," ('1950s', 'NOUN'),\n"," (',', 'PUNCT'),\n"," ('although', 'SCONJ'),\n"," ('work', 'NOUN'),\n"," ('can', 'AUX'),\n"," ('be', 'AUX'),\n"," ('found', 'VERB'),\n"," ('from', 'ADP'),\n"," ('earlier', 'ADJ'),\n"," ('periods', 'NOUN'),\n"," ('.', 
'PUNCT')]"]},"metadata":{},"execution_count":18}]},{"cell_type":"markdown","source":["### Filtering stop words"],"metadata":{"id":"ylEbEtZJd2hx"}},{"cell_type":"code","source":["spacy_stopwords = set(spacy.lang.en.stop_words.STOP_WORDS)\n","sent = 'The history of NLP generally starts in the 1950s, although work can be found from earlier periods.'\n","doc = nlp(sent)\n","words = [x.text for x in doc]\n","[word for word in words if word not in spacy_stopwords]"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"w6IQWLrreWet","executionInfo":{"status":"ok","timestamp":1765581372138,"user_tz":-420,"elapsed":11,"user":{"displayName":"Minh Pham","userId":"01293297774691882951"}},"outputId":"1abaf785-a47d-4d32-c448-67ced656859c"},"execution_count":19,"outputs":[{"output_type":"execute_result","data":{"text/plain":["['The',\n"," 'history',\n"," 'NLP',\n"," 'generally',\n"," 'starts',\n"," '1950s',\n"," ',',\n"," 'work',\n"," 'found',\n"," 'earlier',\n"," 'periods',\n"," '.']"]},"metadata":{},"execution_count":19}]},{"cell_type":"markdown","source":["### Named-Entity Recognition"],"metadata":{"id":"ovnyLt17W85N"}},{"cell_type":"code","source":["text = (\"Donald Trump has brought his drama show back to Washington early, \"\n"," \"perhaps realizing his time in the White House is down to days and counting\")\n","doc = nlp(text)\n","for ent in doc.ents:\n"," print(ent.text, ent.start_char, ent.end_char, ent.label_)"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"fUnrQs43W_fe","executionInfo":{"status":"ok","timestamp":1765581374444,"user_tz":-420,"elapsed":15,"user":{"displayName":"Minh Pham","userId":"01293297774691882951"}},"outputId":"7fa89a32-b6c2-40cf-e4c3-26fbe2356f19"},"execution_count":20,"outputs":[{"output_type":"stream","name":"stdout","text":["Donald Trump 0 12 PERSON\n","Washington 48 58 GPE\n","the White House 96 111 FAC\n"]}]},{"cell_type":"markdown","source":["## Other NLP Toolkits for English\n","\n","- 
[Stanza](https://github.com/stanfordnlp/stanza/) (Python)\n","- [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/) (Java)\n","- [Apache OpenNLP](https://github.com/apache/opennlp) (Java)\n","- [DKPro Core](https://dkpro.github.io/dkpro-core/) (Java)"],"metadata":{"id":"2qb2uhZQe-00"}},{"cell_type":"markdown","source":["## References\n","\n","1. [spaCy 101: Everything you need to know](https://spacy.io/usage/spacy-101)\n","2. [Advanced NLP with spaCy](https://course.spacy.io/)\n","3. Bird, Steven; Klein, Ewan; Loper, Edward (2009). *Natural Language Processing with Python*. [http://www.nltk.org/book/](http://www.nltk.org/book/)\n","4. [NLTK in 20 minutes](http://www.slideshare.net/japerk/nltk-in-20-minutes), by Jacob Perkins\n"],"metadata":{"id":"KC7qd9Vfdk7u"}}]}