data mining and data warehousing henryk maciejewski text
Post on 22-Mar-2022
2 Views
Preview:
TRANSCRIPT
Schedule
• Text Mining vs Natural Language Processing (NLP)
• Classification of text documents – idea, methods
Example – Automatic classification of emails… Currently not
interested …
Contact me after
…
… the offer seems
interesting
Ready for
interview…
… offer Interesting,
relocation
required…
Example – Automatic classification of emails… Currently not
interested …
Contact me after
…
… the offer seems
interesting …
Ready for
interview…
… the offer seems
interesting, however
relocation
required…
Subject /
category of
message
Text mining• Classification of text documents
– Based on terms/topics in documents
– Predictive TM
• Clustering documents– Associate clusters with
characteristic terms-topics
– Descriptive TM
Text mining vs NLP• Classification of text documents
– Based on terms/topics in documents
– Predictive TM
• Clustering documents– Associate clusters with
characteristic terms-topics
– Descriptive TM
• Text preprocessing(tokenization, lemmatization, word sense disambiguation)
• Information extraction
• Machine translation
• Q-A Systems
• Dialog systems
• …
7
Text mining vs NLP• TM – preprocessing of documents (or collections of documents)
in order to derive features for classification/clustering tasks, using NLP methods
• NLP (Natural Language Processing)– Methods of processing of „raw” text / natural language (tokenization,
normalization, stemming, lemmatization)– Part of speech tagging (POS tagging)– Classification of documents– Information extraction, chunking, named entity recognition, understanding
structure of sentences,…)– Automatic understanding of text (responding system; machine translation)
8
NLP resources: text corpora• Brown Corpus
collection of tagged documents (news, hobbies, religion, adventure, …)
• Reuters Corpusshort news, tagged with ca. 90 categories (e.g., coffee, crude, export, …)
• Annotated sources – Movie reviews (sentiment polarity class) – Names Corpus, …
• Wordnet database– synonyms– semantic relationships between words.
9
Text corpora• Brown Corpus Francis, Kucera 15 genres, 1.15M words, tagged, categorized
• CESS Treebanks CLiC-UB 1M words, tagged and parsed (Catalan, Spanish)
• Chat-80 Data Files Pereira & Warren World Geographic Database
• CMU Pronouncing Dictionary CMU 127k entries
• CoNLL 2000 Chunking Data CoNLL 270k words, tagged and chunked
• CoNLL 2002 Named Entity CoNLL 700k words, POS and named entity tagged (Dutch, Spanish)
• CoNLL 2007 Dependency Parsed Treebanks
• Floresta Treebank Diana Santos et al. 9k sentences, tagged and parsed (Portuguese)
• Gazetteer Lists Various Lists of cities and countries
• Genesis Corpus Misc web sources 6 texts, 200k words, 6 languages
• Gutenberg (selections) Hart, Newby, et al. 18 texts, 2M words
• Inaugural Address Corpus CSpan U.S. Presidential Inaugural Addresses (1789–present)
• Indian POS Tagged Corpus Kumaran et al. 60k words, tagged (Bangla, Hindi, Marathi, Telugu)
• MacMorpho Corpus NILC, USP, Brazil 1M words, tagged (Brazilian Portuguese)
• Movie Reviews Pang, Lee 2k movie reviews with sentiment polarity classification
• Names Corpus Kantrowitz, Ross 8k male and female names
10
Text corpora• NPS Chat Corpus Forsyth, Martell 10k IM chat posts, POS and dialogue-act tagged
• Penn Treebank (selections) LDC 40k words, tagged and parsed
• PP Attachment Corpus Ratnaparkhi 28k prepositional phrases, tagged as noun or verb modifiers
• Proposition Bank Palmer 113k propositions, 3,300 verb frames
• Question Classification Li, Roth 6k questions, categorized
• Reuters Corpus Reuters 1.3M words, 10k news documents, categorized
• Roget’s Thesaurus Project Gutenberg 200k words, formatted text
• RTE Textual Entailment Dagan et al. 8k sentence pairs, categorized
• SEMCOR Rus, Mihalcea 880k words, POS and sense tagged
• Senseval 2 Corpus Pedersen 600k words, POS and sense tagged
• Shakespeare texts (selections) Bosak 8 books in XML format
• Stopwords Corpus Porter et al. 2,400 stopwords for 11 languages
• Swadesh Corpus Wiktionary Comparative wordlists in 24 languages
• VerbNet 2.1 Palmer et al. 5k verbs, hierarchically organized, linked to WordNet
• WordNet 3.0 (English) Miller, Fellbaum 145k synonym sets
11
Wordnet DB: relationships of words• Hyponyms – hypernyms
hypernym – more general wordhyponyms – more specific terms
E.g., metal -- iron, silver, copper, gold,…
• Holonyms – meronyms
meronyms for tree – e.g., branch, trunk -- parts of treeholonyms for tree – e.g., forest -- bigger set, with trees as elements
• Synonyms (synsets)
12
NLP tools for preprocessing of text• Text obtained from text documents/ Web crawlers, HTML / search engines / binary docs
/ …
• Tokenization– Extract words, punctuation– Categories of words (e.g., alfa, num, mixed, words with digits, proper names,
named entities)
• Normalization– Lowercase– Stemming (extract stem – (useful for indexing of docs – handling different forms of
word))– Lemmatization – resulting stem is in the dictionary (e.g. plural singular, base
form of verb…)
• POS Tagging
13
NLP tools – taggers• Tagging – determine part of speech (POS tagging, lexical
categories)– based on tagged corpora– using regexps– look-up taggers (look-up table with e.g., 100 most frequent words)– bigram, n-gram taggers– Hidden Markov Models– CRF (conditional random fields)– …
Schedule
• Text Mining vs Natural Language Processing (NLP)
• Classification of text documents – idea, methods
Classification of documents –applications
• Subject classification of e-mail messages (head-hunting campaigns; („interested in offer”, „uninterested”, …)
• Automation of search for candidate profiles in social media (LinkedIn,…)
• Classification of user requests to the tech-support• Subject (topic) classification of news articles / scientific papers• Opinion mining (sentiment analysis) – posts, reviews, etc.• Authorship attribution• …
Subject classification of text documents
• Rule-based systems
• Systems based on supervised learninginput: (d1,c1),…,(dm,cm) – training set, m docs d, with subject category ciC
output:classifier f(d) cCestimation of prediction error for new data
Generating features from text documents
• „Bag of words”|D| = m documents|W| = n different words in collection D
• Frequency matrix – occurence of words in docsM = [fik], i=1,…,m, k=1,…,n
fik = frequency of word wk in doc di (term frequency, tf)
Row of matrix M fi – will be used as a feature vector for doc di
Modifications of frequency matrix
• Binary featuresfik=1 if wk occurs in doc di, fik=0 – otherwise
• Weighted features: (vk – weight / significance of word wk)
vkfikvk = log2(P(wk)
-1)+1 - (inverse document frequency, idf)P(wk) – proportion of docs with word wk
(other versions: vk – entropy, mutual information criterion, …)
Training classifiers based on matrix M
• Std classifiers: NB, SVM, decision trees, logistic regression, …
• Estimate predictive performance (based on the test partition)
...
...
...
...
...
...
...
........................... ..................
...
w1,…,wn
m d
ocum
ents
Y
Training classifiers based on matrix M
...
...
...
...
...
...
...
........................... ..................
...
w1,…,wn
m d
ocum
ents
Y
• n»m problem
• Try to reduce the number of words for which M is built
• Use algorithms of dimensionality reduction
• Use classifiers based on regularization techniques (ridge regression, Lasso, ElasticNet, SVM)
How to select words (terms) Wk?
• n»m problem
• Morphosyntactic tagging to restrict number of words in Mmxn
– Use one base form rather than several original (orth) forms
– Filter by POS (e.g., use only N, ADJ, V)
– Use synset in place of several synset members
...
...
...
...
...
...
...
........................... ..................
...
w1,…,wn
m d
ocum
ents
Y
How to select words wk - problems
• Tokenization
• Many forms of one word (lemmatization)
• Word sense disambiguation
• Part of speech (POS) tagging
20.07.2012r.
β-blokerów,
(α2-mimetyki)-
żyła
żyły.
żywności
żywność
żywiołowej.
żywieniowych
żywieniowego.
żywieniowe
żywienia
żywe,
żywe
żywa,
How to select words wk - problems
• Tokenization
• Many forms of one word (lemmatization)
• Word sense disambiguation
• Part of speech (POS) tagging
Psa mamy już od ponad roku…
Psa mamy widziano ostatnio w parku.
The boy leapt from the bank into the cold water.
The van pulled up outside the bank and three masked men got out.
29
Text parsing• Preprocessing of text
– Remove non-words, – POS tagging, filtering by parts of speech– Normalization, stemming, synonyms
detection– …
31
Text filter• Create terms x docs matrix
wifij – weighed frequency fij of term i in doc j(weights wi – based on frequency and distribution of term in corpus)
• Reduce dimensionality of frequency matrix (remove irrelevant terms)
32
Text cluster• SVD transformation of the terms x docs –
matrix – dimensionality reduction
• Cluster documents (in the principal components space)
• Find descriptive terms which represent clusters
34
Text topic• Generate topics (groups of terms) which
often come together
• Topics: user defined or generated automaticly
36
Classification of documents • Vectors of features –
– based on topics or– based on SVD components of the frequency
matrix
• Classification with std methods of data mining/machine learning
• In example Decision Tree based on text topics Text Topics (next slide)
37
Klasyfikacja• W oparciu o wykryte tematy (topics) lub wg macierzy częstości (komponentów SVD)
• Standardowe metody data mining
Tools / references
• Steven Bird, Ewan Klein, and Edward Loper, Natural Language Processing with Python – Analyzing Text with the Natural Language Toolkit. http://www.nltk.org/book/
• NLTK 3.0 documentationhttp://www.nltk.org/
• Jacob Perkins, Python 3 Text Processing with NLTK 3 Cookbook
Tools / references
• Dan Jurafsky, Christopher Manning, Natural Language Processing - wykładyhttps://class.coursera.org/nlp/lecture
• Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning, Second Edition. Springerhttp://statweb.stanford.edu/~tibs/ElemStatLearn/
top related