# Getting started with Text Analysis

NLP Course 2022

## Objectives

In this document, we will learn the following text analysis techniques:

- How to clean text data with regular expressions
- Tokenisation
- POS Tagging

We will see how to use NLP toolkits such as NLTK, spaCy, Underthesea, and PhoNLP to build an NLP pipeline.

## Cleaning Texts

In NLP tasks, we often want to remove irrelevant content from the input data before using it for training or prediction. For instance, in a sentiment analysis task on Twitter posts, we may delete URLs and punctuation. We can do that with [Regular Expressions](https://www.tutorialspoint.com/python/python_reg_expressions.htm).

In Python, we can use the `re` module for regular expressions.


### Demo data

In [1]:
sent1 = "Stories of Pittsburghers trapped in #Houston flooding!!!! @@ - https://t.co/j5igfpvLJu https://t.co/8gsUpD8jsa"

### Remove URLs

In [2]:
import re

# Replace every substring that looks like a URL (something ending in ".tld", plus an optional path) with an empty string
result = re.sub(r"[(http(s)?):\/\/(www\.)?a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)", "", sent1)
result

'Stories of Pittsburghers trapped in #Houston flooding!!!! @@ - '
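The pattern above is fairly permissive: it keys on anything shaped like `name.tld`, so it can also match ordinary text such as a file name. As a minimal sketch of a stricter alternative, assuming every URL in the data starts with `http://` or `https://`, the snippet below removes URLs and then collapses the whitespace they leave behind. The names `URL_PATTERN` and `remove_urls` are illustrative and do not come from the original notebook.

```python
import re

# Assumption: URLs always start with an explicit http:// or https:// scheme.
URL_PATTERN = re.compile(r"https?://\S+")


def remove_urls(text: str) -> str:
    """Delete scheme-prefixed URLs and collapse leftover whitespace."""
    without_urls = URL_PATTERN.sub("", text)
    # Removing a URL can leave double or trailing spaces, so normalise them.
    return re.sub(r"\s+", " ", without_urls).strip()


sent1 = ("Stories of Pittsburghers trapped in #Houston flooding!!!! @@ - "
         "https://t.co/j5igfpvLJu https://t.co/8gsUpD8jsa")
print(remove_urls(sent1))
```

Requiring the scheme trades recall for precision: bare links such as `t.co/abc` are left in place, which may or may not be acceptable for a given dataset.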

### Remove punctuation from text

In [3]:
import string

In [4]:
translator = str.maketrans('', '', string.punctuation)
result2 = result.translate(translator)
result2

'Stories of Pittsburghers trapped in Houston flooding '
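Note that `string.punctuation` only contains ASCII punctuation, so characters such as curly quotes or an ellipsis character pass through the translation table untouched. A minimal sketch of a regex-based alternative follows; the helper name `strip_punctuation` is hypothetical, and the approach assumes that underscores may stay in the text (they count as word characters for `\w`).

```python
import re


def strip_punctuation(text: str) -> str:
    """Remove ASCII and non-ASCII punctuation, then normalise whitespace."""
    # [^\w\s] matches any character that is neither a word character
    # (letters, digits, underscore) nor whitespace.
    no_punct = re.sub(r"[^\w\s]", "", text)
    # Deleting punctuation can leave runs of spaces behind, so collapse them.
    return " ".join(no_punct.split())


print(strip_punctuation("Stories of Pittsburghers trapped in #Houston flooding!!!! @@ - "))
print(strip_punctuation("“Hello”, world… it’s fine!"))
```

The second call shows the difference from the `str.translate` approach: the curly quotes and the ellipsis are removed as well.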

## NLTK (Natural Language Toolkit)

[NLTK](https://www.nltk.org/) is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

We will see how to use NLTK for basic text analysis in an NLP pipeline.


### Sentence tokenization

Sentence tokenization splits a long paragraph into sentences.

In [5]:
para = "Hello World Dr. John. It's good to see you. Thanks for buying this book."

We need to download the appropriate models before using NLTK for some tasks. Here we download the `punkt` tokenizer models.

In [6]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.


True

In [7]:
from nltk.tokenize import sent_tokenize
sent_tokenize(para)

['Hello World Dr. John.',
 "It's good to see you.",
 'Thanks for buying this book.']
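To see why a trained sentence tokenizer helps, compare the result above with a naive baseline. The sketch below is not how NLTK works internally; it simply splits on whitespace that follows `.`, `!`, or `?`, and the name `naive_sent_split` is made up for this illustration.

```python
import re


def naive_sent_split(paragraph: str) -> list[str]:
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", paragraph.strip())


para = "Hello World Dr. John. It's good to see you. Thanks for buying this book."
for sentence in naive_sent_split(para):
    print(repr(sentence))
```

The baseline breaks the first sentence after the abbreviation `Dr.`, whereas `sent_tokenize` keeps `'Hello World Dr. John.'` together.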

### Word tokenization

Word tokenization splits a sentence into tokens.

In [8]:
from nltk.tokenize import word_tokenize

# sent = 'The history of NLP generally starts in the 1950s, although work can be found from earlier periods.'
sent = 'Hello World Dr. John.'
word_tokenize(sent)

['Hello', 'World', 'Dr.', 'John', '.']
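For contrast, here is a small sketch of two simpler strategies applied to the same sentence: splitting on whitespace and a basic regular expression. Neither is part of NLTK; they are shown only to illustrate what `word_tokenize` does differently.

```python
import re

sent = "Hello World Dr. John."

# Strategy 1: split on whitespace only; punctuation stays glued to words.
whitespace_tokens = sent.split()

# Strategy 2: runs of word characters, or single punctuation characters.
regex_tokens = re.findall(r"\w+|[^\w\s]", sent)

print(whitespace_tokens)
print(regex_tokens)
```

Whitespace splitting leaves the final period attached to `John.`, and the regex detaches the period from the abbreviation `Dr.` as well. `word_tokenize` handles both cases, keeping `'Dr.'` intact while separating the sentence-final `'.'`.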

### POS Tagging

POS tagging is the process of assigning a part-of-speech tag to each token in a sentence. We need to tokenize the sentence before performing POS tagging.

In NLTK, we need to download the tagging model first.

In [9]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /root/nltk_data...
[nltk_data] Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [10]:
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize

sent = 'The history of NLP generally starts in the 1950s, although work can be found from earlier periods.'
pos_tag(word_tokenize(sent))

[('The', 'DT'),
 ('history', 'NN'),
 ('of', 'IN'),
 ('NLP', 'NNP'),
 ('generally', 'RB'),
 ('starts', 'VBZ'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('1950s', 'CD'),
 (',', ','),
 ('although', 'IN'),
 ('work', 'NN'),
 ('can', 'MD'),
 ('be', 'VB'),
 ('found', 'VBN'),
 ('from', 'IN'),
 ('earlier', 'JJR'),
 ('periods', 'NNS'),
 ('.', '.')]
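The tags above follow the Penn Treebank convention, in which related categories share a prefix (for example `NN`, `NNS`, and `NNP` are all nouns). As a minimal sketch of how the tagged output can be used, the snippet below selects words by tag prefix. The `tagged` list is copied from the output above so the example runs without NLTK, and `words_with_tag_prefix` is a hypothetical helper.

```python
# Tagged tokens copied from the pos_tag output shown above.
tagged = [
    ("The", "DT"), ("history", "NN"), ("of", "IN"), ("NLP", "NNP"),
    ("generally", "RB"), ("starts", "VBZ"), ("in", "IN"), ("the", "DT"),
    ("1950s", "CD"), (",", ","), ("although", "IN"), ("work", "NN"),
    ("can", "MD"), ("be", "VB"), ("found", "VBN"), ("from", "IN"),
    ("earlier", "JJR"), ("periods", "NNS"), (".", "."),
]


def words_with_tag_prefix(tagged_tokens, prefix):
    """Return the words whose POS tag starts with the given prefix."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]


nouns = words_with_tag_prefix(tagged, "NN")
verbs = words_with_tag_prefix(tagged, "VB")
print(nouns)
print(verbs)
```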

### Filtering stop words

What are stop words?

Stop words are the words in a stop list (or stoplist or negative dictionary) which are filtered out (i.e. stopped) before or after processing of natural language data (text) because they are insignificant (Wikipedia).

NLTK includes a list of English stop words. We are going to use it to filter the stop words out of a sentence.

In [11]:
%%capture
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.


In [12]:
from nltk.corpus import stopwords
english_stops = set(stopwords.words('english'))

sent = 'The history of NLP generally starts in the 1950s, although work can be found from earlier periods.'

words = nltk.word_tokenize(sent)
words_without_stopwords = [word for word in words if word not in english_stops]
words_without_stopwords

['The',
 'history',
 'NLP',
 'generally',
 'starts',
 '1950s',
 ',',
 'although',
 'work',
 'found',
 'earlier',
 'periods',
 '.']
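Notice that `'The'` survives the filter: the NLTK stop list is lowercase and the membership test is case-sensitive. A minimal sketch of a case-insensitive filter follows. To keep it runnable without downloading any data, it uses a small hand-written stop set (`demo_stops`, an assumption made for this example) instead of the full NLTK list; `remove_stopwords` is likewise a hypothetical helper.

```python
def remove_stopwords(tokens, stop_words):
    """Drop tokens whose lowercase form appears in stop_words."""
    lowered_stops = {word.lower() for word in stop_words}
    return [token for token in tokens if token.lower() not in lowered_stops]


# Assumption: a tiny illustrative subset, not the real NLTK stop list.
demo_stops = {"the", "of", "in", "can", "be", "from"}

tokens = ["The", "history", "of", "NLP", "generally", "starts", "in", "the",
          "1950s", ",", "although", "work", "can", "be", "found", "from",
          "earlier", "periods", "."]

filtered = remove_stopwords(tokens, demo_stops)
print(filtered)
```

With the real list, the same function could be called as `remove_stopwords(words, english_stops)`.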

In [None]:
# Print the English stop words

stopwords.words('english')


## spaCy

[spaCy](https://spacy.io/) is an open-source software library for advanced natural language processing, written in Python and Cython. Let's see how to build an NLP pipeline with spaCy.

We need to download the English model before using spaCy.


In [17]:
%%capture
!pip install -U spacy
!python -m spacy download en_core_web_sm

In [18]:
import spacy
nlp = spacy.load("en_core_web_sm")

### Sentence tokenization

In [19]:
para = "Hello World Dr. John. It's good to see you. Thanks for buying this book."
doc = nlp(para)
sents = [sent.text for sent in doc.sents]
print(sents)

['Hello World Dr. John.', "It's good to see you.", 'Thanks for buying this book.']


### Word tokenization

In [20]:
sent = 'The history of NLP generally starts in the 1950s, although work can be found from earlier periods.'
doc = nlp(sent)
tokens = [x.text for x in doc]
tokens

['The',
 'history',
 'of',
 'NLP',
 'generally',
 'starts',
 'in',
 'the',
 '1950s',
 ',',
 'although',
 'work',
 'can',
 'be',
 'found',
 'from',
 'earlier',
 'periods',
 '.']

### POS Tagging

In [21]:
[(x.text, x.pos_) for x in doc]

[('The', 'DET'),
 ('history', 'NOUN'),
 ('of', 'ADP'),
 ('NLP', 'PROPN'),
 ('generally', 'ADV'),
 ('starts', 'VERB'),
 ('in', 'ADP'),
 ('the', 'DET'),
 ('1950s', 'NOUN'),
 (',', 'PUNCT'),
 ('although', 'SCONJ'),
 ('work', 'NOUN'),
 ('can', 'AUX'),
 ('be', 'AUX'),
 ('found', 'VERB'),
 ('from', 'ADP'),
 ('earlier', 'ADJ'),
 ('periods', 'NOUN'),
 ('.', 'PUNCT')]

### Filtering stop words

In [None]:
from spacy.lang.en.stop_words import STOP_WORDS

spacy_stopwords = set(STOP_WORDS)
sent = 'The history of NLP generally starts in the 1950s, although work can be found from earlier periods.'
doc = nlp(sent)
words = [x.text for x in doc]
[word for word in words if word not in spacy_stopwords]

['The',
 'history',
 'NLP',
 'generally',
 'starts',
 '1950s',
 ',',
 'work',
 'found',
 'earlier',
 'periods',
 '.']
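Stop lists differ between toolkits. As a quick sketch, the snippet below compares the two filtered results shown in this document (copied in as plain lists so it runs without NLTK or spaCy) and reports which words only one of the toolkits kept.

```python
# Filtered outputs copied from the NLTK and spaCy examples in this document.
nltk_kept = ["The", "history", "NLP", "generally", "starts", "1950s", ",",
             "although", "work", "found", "earlier", "periods", "."]
spacy_kept = ["The", "history", "NLP", "generally", "starts", "1950s", ",",
              "work", "found", "earlier", "periods", "."]

# Words kept by one toolkit but removed by the other.
only_nltk_kept = [word for word in nltk_kept if word not in spacy_kept]
only_spacy_kept = [word for word in spacy_kept if word not in nltk_kept]

print(only_nltk_kept)
print(only_spacy_kept)
```

For this sentence the only difference is `'although'`, which is a stop word for spaCy but not for NLTK.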

### Named-Entity Recognition

In [22]:
text = ("Donald Trump has brought his drama show back to Washington early, "
        "perhaps realizing his time in the White House is down to days and counting")
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Donald Trump 0 12 PERSON
Washington 48 58 GPE
the White House 96 111 FAC
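The two numbers printed for each entity are character offsets into the original string: `start_char` is inclusive and `end_char` is exclusive, so slicing the text with them recovers the entity. The sketch below checks this and groups the entities by label. The `entities` list is copied from the output above so the example runs without spaCy.

```python
from collections import defaultdict

text = ("Donald Trump has brought his drama show back to Washington early, "
        "perhaps realizing his time in the White House is down to days and counting")

# (entity text, start_char, end_char, label), copied from the output above.
entities = [
    ("Donald Trump", 0, 12, "PERSON"),
    ("Washington", 48, 58, "GPE"),
    ("the White House", 96, 111, "FAC"),
]

by_label = defaultdict(list)
for ent_text, start, end, label in entities:
    # The offsets index directly into the source string.
    assert text[start:end] == ent_text
    by_label[label].append(ent_text)

print(dict(by_label))
```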


## Other NLP Toolkits for English

- [Stanza](https://github.com/stanfordnlp/stanza/) (Python)
- [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/) (Java)
- [Apache OpenNLP](https://github.com/apache/opennlp) (Java)
- [DKPro Core](https://dkpro.github.io/dkpro-core/) (Java)

## References

1. [spaCy 101: Everything you need to know](https://spacy.io/usage/spacy-101)
2. [Advanced NLP with spaCy](https://course.spacy.io/)
3. Bird, Steven; Klein, Ewan; Loper, Edward (2009). *Natural Language Processing with Python*. [http://www.nltk.org/book/](http://www.nltk.org/book/)
4. [NLTK in 20 minutes](http://www.slideshare.net/japerk/nltk-in-20-minutes), by Jacob Perkins
