Text Mining Pipeline

Types of Data
Depending on the data and its use case, we can store it in different formats.
● Structured Data
● Unstructured Data
● Semi-Structured Data

Structured Data
Structured data consists of clearly defined data types with patterns that make them easily searchable.
Storage: Usually in relational databases
Examples of adequate data:
● Zip codes
● Phone numbers
● Email addresses
Main advantages:
● Easy to use
● Convenient storage
● Instant usability
Disadvantages:
● Limitations on use
● Limited storage
● High overhead

Unstructured Data
Unstructured data may have an internal structure, but it is not structured via predefined data models or schemas. It may be textual or non-textual.
Storage: Usually in non-relational databases like NoSQL
Examples of adequate data:
● Text files
● Websites
● Media
● Sensor data
Main advantages:
● Limitless use
● Greater insights
● Low overhead
Disadvantages:
● Hard to analyse
● Requires specialized data analytics tools
● Numerous formats

Semi-Structured Data
Semi-structured data is the "bridge" between structured and unstructured data. It does not have a predefined data model and is more complex than structured data, yet easier to store than unstructured data.
Storage: Usually in formats like JSON or XML
● Example of metadata usage: An online article displays a headline, a snippet, a featured image, image alt-text, a slug, etc., which helps differentiate one piece of web content from similar pieces.
● Example of semi-structured data vs. structured data: A tab-delimited file containing customer data versus a database containing CRM tables.
● Example of semi-structured data vs. unstructured data: A tab-delimited file versus a list of comments from a customer's Instagram.

Language Syntax
When we talk about a language, we have a hierarchical structure, as shown in the accompanying image. In each phase of the modelling we have to decide which level of the hierarchy we want to use.

NLTK
Natural Language Toolkit (NLTK)
● A collection of Python programs, modules, and datasets to support the development of text mining
● Written by Steven Bird, Edward Loper and Ewan Klein
● Completely free and open source
● Lots of documentation available: https://www.nltk.org/

NLTK - Install
You can install NLTK just like any normal Python library; we suggest installing numpy as well.
Using the command line:
> pip install numpy
> pip install nltk

NLTK - Install Data
NLTK datasets are not installed by default. Before trying to use any dataset you should install it.
The easiest way to install one is by running Python code:
>>> import nltk
>>> nltk.download()

Text Corpus
A text corpus is a large body of text, usually divided into groups (books, chapters, any categorization).
In NLTK we can import the corpus module to download texts and use lots of different useful functions to manipulate the data.
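As a minimal sketch of that workflow (assuming a network connection for the download; the Gutenberg selections are just one example dataset), we can fetch a single dataset by name instead of opening the GUI, and start exploring it:

>>> import nltk
>>> nltk.download("gutenberg")          # fetch one specific dataset by name
>>> from nltk.corpus import gutenberg
>>> gutenberg.fileids()[:3]             # some of the texts in this corpus
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt']
>>> len(gutenberg.words("austen-emma.txt"))   # token count of one text

The same pattern (download once, then import from nltk.corpus) works for the corpora listed below.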
NLTK Corpus
Some of the best-known corpora available in NLTK (1/2):

Corpus                          Contents
Brown Corpus                    15 genres, 1.15M words, tagged, categorized
SentiWordNet                    sentiment scores for 145k WordNet synonym sets
Floresta Treebank               9k sentences, tagged and parsed (Portuguese)
MacMorpho Corpus                1M words, tagged (Brazilian Portuguese)
Gutenberg (selections)          18 texts, 2M words
Inaugural Address Corpus        US Presidential Inaugural Addresses (1789-present)
Movie Reviews                   2k movie reviews with sentiment polarity classification

NLTK Corpus
Some of the best-known corpora available in NLTK (2/2):

Corpus                          Contents
Names Corpus                    8k male and female names
NPS Chat Corpus                 10k IM chat posts, POS-tagged and dialogue-act tagged
Reuters Corpus                  1.3M words, 10k news documents, categorized
Shakespeare texts (selections)  8 books
Stopwords Corpus                2,400 stopwords for 11 languages
Gazetteer Lists                 lists of cities and countries
SEMCOR                          880k words, part-of-speech and sense tagged

Brown Corpus
The Brown Corpus is one of the most popular corpora in the world. It was the first million-word electronic corpus of English, created in 1961 at Brown University. It contains text from 500 sources, all categorized by genre: news, editorial, reviews, humor, romance, and so on.
Let's explore the dataset!

Brown Corpus - Code Example
The first step is to load the Brown Corpus:

>>> from nltk.corpus import brown

From now on we have a brown variable that we can use to analyse the texts. For example, we can start by checking the available categories:

>>> print(f"Number of categories: {len(brown.categories())}\nList of categories: {brown.categories()}")
Number of categories: 15
List of categories: ['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']

Brown Corpus - Code Example
Some options to get the paragraphs:

>>> brown.paras()
>>> brown.paras(fileids=["ca01"])       # filtered by fileid
>>> brown.paras(categories=["news"])    # filtered by categories

Brown Corpus - Code Example
Some options to get the sentences:

>>> brown.sents()
>>> brown.sents(fileids=["ca01"])       # filtered by fileid
>>> brown.sents(categories=["news"])    # filtered by categories

Brown Corpus - Code Example
Some options to get the words:

>>> brown.words()
>>> brown.words(fileids=["ca01"])       # filtered by fileid
>>> brown.words(categories=["news"])    # filtered by categories

Brown Corpus - Code Example
Let's explore the use of the modal verbs (must, shall, will, should, would, can, could, may, and might) in the Brown Corpus.
Use of modal verbs in news:

>>> import nltk
>>> modal_verbs = ["can", "could", "may", "might", "must", "will"]
>>> fdist = nltk.FreqDist([w.lower() for w in brown.words(categories="news")])
>>> for m in modal_verbs:
...     print("{} : {}".format(m, fdist[m]))
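As a quick sketch beyond single-word lookups (reusing the fdist built above), FreqDist also supports a few other handy queries:

>>> fdist.most_common(5)    # the five most frequent tokens in the news category
>>> fdist["will"]           # raw count of a single word
>>> fdist.freq("will")      # relative frequency: count divided by total tokens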
Brown Corpus - Code Example
Trying to find different behaviour by category:

>>> modal_verbs = ["can", "could", "may", "might", "must", "will"]
>>> cfd = nltk.ConditionalFreqDist(
...     (genre, word)
...     for genre in brown.categories()
...     for word in brown.words(categories=genre))
>>> cfd.tabulate(samples=modal_verbs)

Main corpus methods (1/3)

Method                      Description
fileids()                   The files of the corpus
fileids([categories])       The files of the corpus corresponding to these categories
categories()                The categories of the corpus
categories([fileids])       The categories of the corpus corresponding to these files
raw()                       The raw content of the corpus
raw(fileids=[f1,f2,f3])     The raw content of the specified files
raw(categories=[c1,c2])     The raw content of the specified categories

Main corpus methods (2/3)

Method                      Description
words()                     The words of the whole corpus
words(fileids=[f1,f2,f3])   The words of the specified files
words(categories=[c1,c2])   The words of the specified categories
sents()                     The sentences of the corpus
sents(fileids=[f1,f2,f3])   The sentences of the specified files
sents(categories=[c1,c2])   The sentences of the specified categories

Main corpus methods (3/3)

Method                      Description
abspath(fileid)             The location of the given file on disk
encoding(fileid)            The encoding of the file (if known)
open(fileid)                Open a stream for reading the given corpus file
root                        The path to the root of the corpus
readme()                    The contents of the README file of the corpus
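As a short sketch tying a few of these methods together on the Brown Corpus (the slices are only there to keep the output small):

>>> from nltk.corpus import brown
>>> brown.fileids()[:2]                 # first two files of the corpus
['ca01', 'ca02']
>>> brown.raw(fileids=["ca01"])[:50]    # beginning of the raw (tagged) text of one file
>>> brown.abspath("ca01")               # where that file lives on disk
>>> print(brown.readme()[:200])         # start of the corpus README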