data mining and data warehousing henryk maciejewski text

Data Mining and Data Warehousing

Henryk Maciejewski

Text Mining

Schedule

• Text Mining vs Natural Language Processing (NLP)

• Classification of text documents – idea, methods

Example – Automatic classification of emails… Currently not

interested …

Contact me after

… the offer seems

interesting

Ready for

interview…

… offer Interesting,

relocation

required…

Example – Automatic classification of emails… Currently not

interested …

Contact me after

… the offer seems

interesting …

Ready for

interview…

… the offer seems

interesting, however

relocation

required…

Subject /

category of

message

Text mining• Classification of text documents

– Based on terms/topics in documents

– Predictive TM

• Clustering documents– Associate clusters with

characteristic terms-topics

– Descriptive TM

Text mining vs NLP• Classification of text documents

– Based on terms/topics in documents

– Predictive TM

• Clustering documents– Associate clusters with

characteristic terms-topics

– Descriptive TM

• Text preprocessing(tokenization, lemmatization, word sense disambiguation)

• Information extraction

• Machine translation

• Q-A Systems

• Dialog systems

• …

Text mining vs NLP• TM – preprocessing of documents (or collections of documents)

in order to derive features for classification/clustering tasks, using NLP methods

• NLP (Natural Language Processing)– Methods of processing of „raw” text / natural language (tokenization,

normalization, stemming, lemmatization)– Part of speech tagging (POS tagging)– Classification of documents– Information extraction, chunking, named entity recognition, understanding

structure of sentences,…)– Automatic understanding of text (responding system; machine translation)

NLP resources: text corpora• Brown Corpus

collection of tagged documents (news, hobbies, religion, adventure, …)

• Reuters Corpusshort news, tagged with ca. 90 categories (e.g., coffee, crude, export, …)

• Annotated sources – Movie reviews (sentiment polarity class) – Names Corpus, …

• Wordnet database– synonyms– semantic relationships between words.

Text corpora• Brown Corpus Francis, Kucera 15 genres, 1.15M words, tagged, categorized

• CESS Treebanks CLiC-UB 1M words, tagged and parsed (Catalan, Spanish)

• Chat-80 Data Files Pereira & Warren World Geographic Database

• CMU Pronouncing Dictionary CMU 127k entries

• CoNLL 2000 Chunking Data CoNLL 270k words, tagged and chunked

• CoNLL 2002 Named Entity CoNLL 700k words, POS and named entity tagged (Dutch, Spanish)

• CoNLL 2007 Dependency Parsed Treebanks

• Floresta Treebank Diana Santos et al. 9k sentences, tagged and parsed (Portuguese)

• Gazetteer Lists Various Lists of cities and countries

• Genesis Corpus Misc web sources 6 texts, 200k words, 6 languages

• Gutenberg (selections) Hart, Newby, et al. 18 texts, 2M words

• Inaugural Address Corpus CSpan U.S. Presidential Inaugural Addresses (1789–present)

• Indian POS Tagged Corpus Kumaran et al. 60k words, tagged (Bangla, Hindi, Marathi, Telugu)

• MacMorpho Corpus NILC, USP, Brazil 1M words, tagged (Brazilian Portuguese)

• Movie Reviews Pang, Lee 2k movie reviews with sentiment polarity classification

• Names Corpus Kantrowitz, Ross 8k male and female names

Text corpora• NPS Chat Corpus Forsyth, Martell 10k IM chat posts, POS and dialogue-act tagged

• Penn Treebank (selections) LDC 40k words, tagged and parsed

• PP Attachment Corpus Ratnaparkhi 28k prepositional phrases, tagged as noun or verb modifiers

• Proposition Bank Palmer 113k propositions, 3,300 verb frames

• Question Classification Li, Roth 6k questions, categorized

• Reuters Corpus Reuters 1.3M words, 10k news documents, categorized

• Roget’s Thesaurus Project Gutenberg 200k words, formatted text

• RTE Textual Entailment Dagan et al. 8k sentence pairs, categorized

• SEMCOR Rus, Mihalcea 880k words, POS and sense tagged

• Senseval 2 Corpus Pedersen 600k words, POS and sense tagged

• Shakespeare texts (selections) Bosak 8 books in XML format

• Stopwords Corpus Porter et al. 2,400 stopwords for 11 languages

• Swadesh Corpus Wiktionary Comparative wordlists in 24 languages

• VerbNet 2.1 Palmer et al. 5k verbs, hierarchically organized, linked to WordNet

• WordNet 3.0 (English) Miller, Fellbaum 145k synonym sets

Wordnet DB: relationships of words• Hyponyms – hypernyms

hypernym – more general wordhyponyms – more specific terms

E.g., metal -- iron, silver, copper, gold,…

• Holonyms – meronyms

meronyms for tree – e.g., branch, trunk -- parts of treeholonyms for tree – e.g., forest -- bigger set, with trees as elements

• Synonyms (synsets)

NLP tools for preprocessing of text• Text obtained from text documents/ Web crawlers, HTML / search engines / binary docs

• Tokenization– Extract words, punctuation– Categories of words (e.g., alfa, num, mixed, words with digits, proper names,

named entities)

• Normalization– Lowercase– Stemming (extract stem – (useful for indexing of docs – handling different forms of

word))– Lemmatization – resulting stem is in the dictionary (e.g. plural singular, base

form of verb…)

• POS Tagging

NLP tools – taggers• Tagging – determine part of speech (POS tagging, lexical

categories)– based on tagged corpora– using regexps– look-up taggers (look-up table with e.g., 100 most frequent words)– bigram, n-gram taggers– Hidden Markov Models– CRF (conditional random fields)– …

Schedule

• Text Mining vs Natural Language Processing (NLP)

• Classification of text documents – idea, methods

Classification of documents –applications

• Subject classification of e-mail messages (head-hunting campaigns; („interested in offer”, „uninterested”, …)

• Automation of search for candidate profiles in social media (LinkedIn,…)

• Classification of user requests to the tech-support• Subject (topic) classification of news articles / scientific papers• Opinion mining (sentiment analysis) – posts, reviews, etc.• Authorship attribution• …

Subject classification of text documents

• Rule-based systems

• Systems based on supervised learninginput: (d1,c1),…,(dm,cm) – training set, m docs d, with subject category ciC

output:classifier f(d) cCestimation of prediction error for new data

Generating features from text documents

• „Bag of words”|D| = m documents|W| = n different words in collection D

• Frequency matrix – occurence of words in docsM = [fik], i=1,…,m, k=1,…,n

fik = frequency of word wk in doc di (term frequency, tf)

Row of matrix M fi – will be used as a feature vector for doc di

Modifications of frequency matrix

• Binary featuresfik=1 if wk occurs in doc di, fik=0 – otherwise

• Weighted features: (vk – weight / significance of word wk)

vkfikvk = log2(P(wk)

-1)+1 - (inverse document frequency, idf)P(wk) – proportion of docs with word wk

(other versions: vk – entropy, mutual information criterion, …)

Training classifiers based on matrix M

• Std classifiers: NB, SVM, decision trees, logistic regression, …

• Estimate predictive performance (based on the test partition)

........................... ..................

w1,…,wn

Training classifiers based on matrix M

........................... ..................

w1,…,wn

• n»m problem

• Try to reduce the number of words for which M is built

• Use algorithms of dimensionality reduction

• Use classifiers based on regularization techniques (ridge regression, Lasso, ElasticNet, SVM)

Performance of classifiers in TM

Classifier decision

(a=TP, b=FN, c=FP)

Performance of classifiers in TM

R=0.88

F=0.94

How to select words (terms) Wk?

• n»m problem

• Morphosyntactic tagging to restrict number of words in Mmxn

– Use one base form rather than several original (orth) forms

– Filter by POS (e.g., use only N, ADJ, V)

– Use synset in place of several synset members

........................... ..................

w1,…,wn

How to select words wk - problems

• Tokenization

• Many forms of one word (lemmatization)

• Word sense disambiguation

• Part of speech (POS) tagging

20.07.2012r.

β-blokerów,

(α2-mimetyki)-

żyła

żyły.

żywności

żywność

żywiołowej.

żywieniowych

żywieniowego.

żywieniowe

żywienia

żywe,

żywa,

How to select words wk - problems

• Tokenization

• Many forms of one word (lemmatization)

• Word sense disambiguation

• Part of speech (POS) tagging

Psa mamy już od ponad roku…

Psa mamy widziano ostatnio w parku.

The boy leapt from the bank into the cold water.

The van pulled up outside the bank and three masked men got out.

SAS Text Miner – Example • Classification of short messages

(News)

Text parsing• Preprocessing of text

– Remove non-words, – POS tagging, filtering by parts of speech– Normalization, stemming, synonyms

detection– …

Text parsing

Text filter• Create terms x docs matrix

wifij – weighed frequency fij of term i in doc j(weights wi – based on frequency and distribution of term in corpus)

• Reduce dimensionality of frequency matrix (remove irrelevant terms)

Text cluster• SVD transformation of the terms x docs –

matrix – dimensionality reduction

• Cluster documents (in the principal components space)

• Find descriptive terms which represent clusters

Text cluster

Text topic• Generate topics (groups of terms) which

often come together

• Topics: user defined or generated automaticly

Text topic

Classification of documents • Vectors of features –

– based on topics or– based on SVD components of the frequency

matrix

• Classification with std methods of data mining/machine learning

• In example Decision Tree based on text topics Text Topics (next slide)

Klasyfikacja• W oparciu o wykryte tematy (topics) lub wg macierzy częstości (komponentów SVD)

• Standardowe metody data mining

Klasyfikacja

Tools / references

• Steven Bird, Ewan Klein, and Edward Loper, Natural Language Processing with Python – Analyzing Text with the Natural Language Toolkit. http://www.nltk.org/book/

• NLTK 3.0 documentationhttp://www.nltk.org/

• Jacob Perkins, Python 3 Text Processing with NLTK 3 Cookbook

Tools / references

• Dan Jurafsky, Christopher Manning, Natural Language Processing - wykładyhttps://class.coursera.org/nlp/lecture

• Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning, Second Edition. Springerhttp://statweb.stanford.edu/~tibs/ElemStatLearn/

Tools for Polish

• NLP tools for Polish: nlp.pwr.edu.plhttp://nlp.pwr.edu.pl/narzedzia-i-zasoby http://nlp.pwr.edu.pl/pl/narzedzia-i-zasoby/narzedzia-przetwarzania-morfosyntaktycznegohttp://nlp.pwr.edu.pl/pl/narzedzia-i-zasoby/wcrft-tagger

• Narodowy Korpus Języka Polskiego, http://nkjp.pl/

data mining and data warehousing henryk maciejewski text

Documents

photographer henryk lobaczewski

Łukasz machaj, marek maciejewski: ku europejskiej jedności

henryk sienkiewicz

jeanine maciejewski cardenas 1 k

data mining and data warehousing henryk maciejewski data

major henryk sucharski

hania henryk sienkiewicz

janusz maciejewski norwid a pozytywizm : rekonesans

komisarz maciejewski

henryk samsonowicz

henryk konrad

henryk grynberg • pamiętnik

henryk atemborski

h.j. siegel tony maciejewski purdue university

dariusz maciejewski. wentylacja szumowa

henryk sienkiewicz krizari

r_kocyba henryk karol

henryk grossman

i kredyt 466(), 2015, 523-564 asymmetric shocks and...

henryk dietel