3_ISCAP - NLP - Data Collection

Text Mining Pipeline
Types of Data
Depending on the data and its use case, we can store it in different formats.
● Structured Data
● Unstructured Data
● Semi-Structured Data
Structured Data
Structured data consists of clearly defined data types with patterns that make them easily searchable.
Storage: Usually in relational databases
Examples of adequate data:
● Zip codes
● Phone numbers
● Email addresses
Main advantages:
● Easy to use
● Convenient Storage
● Instant Usability
Disadvantages:
● Limitations on use
● Limited storage
● High Overhead
Unstructured Data
Unstructured data has an internal structure but is not structured via predefined data models or schemas. It may be textual or non-textual.
Storage: Usually in non-relational (NoSQL) databases
Examples of adequate data:
● Text Files
● Websites
● Media
● Sensor data
Main advantages:
● Limitless Use
● Greater Insights
● Low Overhead
Disadvantages:
● Hard to Analyse
● Requires specialized data analytics tools
● Numerous Formats
Semi-Structured Data
Semi-structured data is the “bridge” between structured and unstructured data. It does not have a predefined data model 
and is more complex than structured data, yet easier to store than unstructured data.
Storage: Usually using formats like JSON or XML
● Example of metadata usage: An online article displays a headline, a snippet, a featured image, image alt-text, slug, 
etc., which helps differentiate one piece of web content from similar pieces.
● Example of semi-structured data vs. structured data: A tab-delimited file containing customer data versus a 
database containing CRM tables.
● Example of semi-structured data vs. unstructured data: A tab-delimited file versus a list of comments from a 
customer’s Instagram.
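As a small illustration (a sketch, not part of the original slides; the names and values below are made up), a pair of customer records stored as JSON carries self-describing keys, but no rigid table schema is enforced across records:
import json
# Two records with partly different fields: each record describes itself,
# yet no fixed schema is imposed across them (semi-structured).
records = [
    {"id": 1, "name": "Ana", "email": "ana@example.com"},
    {"id": 2, "name": "Rui", "phones": ["+351 912 345 678"]},
]
print(json.dumps(records, indent=2))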
Language Syntax
When we talk about a language, we have a hierarchical structure, as in the following image. In each phase of the modelling we have to decide which level of the hierarchy we want to use.
NLTK
Natural Language Toolkit (NLTK)
● A collection of Python programs, modules, and datasets to support the development of text mining applications
● Written by Steven Bird, Edward Loper and Ewan Klein
● Completely free and open-source
● Lots of documentation available
https://www.nltk.org/
NLTK - Install
You can install NLTK just like any other Python library; it is also suggested to install NumPy.
Using command line:
> pip install numpy
> pip install nltk
NLTK - Install Data
NLTK datasets are not installed by default. Before trying to use any dataset you should install it.
The easiest way to install them is by running Python code:
> import nltk
> nltk.download()
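If you prefer a non-interactive install, nltk.download() also accepts the identifier of a specific package; a minimal sketch using standard NLTK package names:
>>> import nltk
>>> nltk.download("brown")      # Brown Corpus
>>> nltk.download("stopwords")  # Stopwords Corpus
>>> nltk.download("punkt")      # Punkt sentence tokenizer models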
Text Corpus
A text corpus is a large body of text, usually divided into groups (books, chapters, or any other categorization).
In NLTK we can import the corpus module to download texts and use lots of different useful functions to manipulate the 
data.
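For instance, a minimal sketch of the usual pattern with the Gutenberg selections (assuming the gutenberg dataset has already been downloaded, e.g. with nltk.download("gutenberg")):
>>> from nltk.corpus import gutenberg
>>> gutenberg.fileids()                           # texts available in this corpus
>>> gutenberg.words(gutenberg.fileids()[0])       # the words of the first text
>>> len(gutenberg.words(gutenberg.fileids()[0]))  # how many words it contains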
NLTK Corpus
Some of the best-known corpora available in NLTK (1/2):
Corpus Contents
Brown Corpus 15 genres, 1.15M words, tagged, categorized
SentiWordNet sentiment scores for 145k WordNet synonym sets
Floresta Treebank 9k sentences, tagged and parsed (Portuguese)
MacMorpho Corpus 1M words, tagged (Portuguese - Brazilian)
Gutenberg (selections) 18 texts, 2M words
Inaugural Address Corpus US Presidential Inaugural Addresses (1789-present)
Movie Reviews 2k movie reviews with sentiment polarity classification
NLTK Corpus
Some of the best-known corpora available in NLTK (2/2):
Corpus Contents
Names Corpus 8k male and female names
NPS Chat Corpus 10k IM chat posts, POS-tagged and dialogue-act tagged
Reuters Corpus 1.3M words, 10k news documents, categorized
Shakespeare texts (selections) 8 books
Stopwords Corpus 2,400 stopwords for 11 languages
Gazetteer Lists Lists of cities and countries
SEMCOR 880k words, part-of-speech and sense tagged
Brown Corpus
The Brown Corpus is one of the most popular in the world. It was the first million-word electronic corpus of English,
created in 1961 at Brown University.
It contains text from more than 500 sources, all categorized by genre, such as news, editorial, reviews, humor, romance, …
Let's explore the dataset!!
Brown Corpus - Code Example
The first step is to load the Brown Corpus:
>>> from nltk.corpus import brown
From now on we have a "brown" variable that we can use to analyse the texts. We can start by checking which categories of texts are available in the Brown Corpus:
>>> print(f"Number of categories: {len(brown.categories())}\nList of categories: {brown.categories()}")
Number of categories: 15
List of categories: ['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 
'news', 'religion', 'reviews', 'romance', 'science_fiction']
Brown Corpus - Code Example
Some options to get the paragraphs:
>>> brown.paras()
>>> brown.paras(fileids=["ca01"]) # filtered by fileid
>>> brown.paras(categories=["news"]) # filtered by categories
Brown Corpus - Code Example
Some options to get the sentences:
>>> brown.sents()
>>> brown.sents(fileids=["ca01"]) # filtered by fileid
>>> brown.sents(categories=["news"]) # filtered by categories
Brown Corpus - Code Example
Some options to get the words:
>>> brown.words()
>>> brown.words(fileids=["ca01"]) # filtered by fileid
>>> brown.words(categories=["news"]) # filtered by categories
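A short sketch tying the three access levels together for a single category (the exact counts depend on the installed corpus, so they are not shown here):
>>> news_paras = brown.paras(categories=["news"])
>>> news_sents = brown.sents(categories=["news"])
>>> news_words = brown.words(categories=["news"])
>>> print(len(news_paras), len(news_sents), len(news_words))  # paragraphs, sentences, words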
Brown Corpus - Code Example
Let’s explore the use of the modal verbs (must, shall, will, should, would, can, could, may, and might) in Brown Corpus.
Use of modal verbs in news:
>>> import nltk
>>> modal_verbs = ["can", "could", "may", "might", "must", "will"]
>>> fdist = nltk.FreqDist([w.lower() for w in brown.words(categories="news")])
>>> for m in modal_verbs:
...     print("{} : {}".format(m, fdist[m]))
Brown Corpus - Code Example
Let’s explore the use of the modal verbs (must, shall, will, should, would, can, could, may, and might) in Brown Corpus.
Trying to find different behaviour by category:
>>> modal_verbs = ["can", "could", "may", "might", "must", "will"]
>>> cfd = nltk.ConditionalFreqDist(
...     (genre, word)
...     for genre in brown.categories()
...     for word in brown.words(categories=genre))
>>> cfd.tabulate(samples=modal_verbs)
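To keep the output readable, tabulate() also accepts a conditions argument, so the comparison can be restricted to a few genres; a small sketch using categories listed earlier:
>>> cfd.tabulate(conditions=["news", "religion", "romance"], samples=modal_verbs)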
Main corpus methods (1/3)
Method Description
fileids() The files of the corpus
fileids([categories]) The files of the corpus corresponding to these categories
categories() The categories of the corpus
categories([fileids]) The categories of the corpus corresponding to these files
raw() The raw content of the corpus
raw(fileids=[f1,f2,f3]) The raw content of the specified files
raw(categories=[c1,c2]) The raw content of the specified categories
Main corpus methods (2/3)
Method Description
words() The words of the whole corpus
words(fileids=[f1,f2,f3]) The words of the specified files
words(categories=[c1,c2]) The words of the specified categories
sents() The sentences of the corpus
sents(fileids=[f1,f2,f3]) The sentences of the specified files
sents(categories=[c1,c2]) The sentences of the specified categories
Main corpus methods (3/3)
Method Description
abspath(fileid) The location of the given file on disk
encoding(fileid) The encoding of the file (if known)
open(fileid) Open a stream for reading the given corpus file
root() The path to the root of the locally installed corpus
readme() The contents of the README file of the corpus
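A short sketch exercising a few of these methods on the Brown Corpus (the exact output depends on the local installation):
>>> from nltk.corpus import brown
>>> brown.fileids()[:3]                  # first few file identifiers
>>> brown.categories(fileids=["ca01"])   # categories of a given file
>>> brown.raw(fileids=["ca01"])[:200]    # beginning of the raw text of that file
>>> brown.abspath("ca01")                # where the file lives on disk
>>> print(brown.readme()[:300])          # beginning of the corpus README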