Text Mining & Retrieval
TEXT INFORMATION
THOUGHTS
OPINIONS
FEELINGS
STORIES
DOCUMENTARIES
NEWS
LANGUAGES
EVERYDAY LIFE
SCIENCE
DISCUSSIONS
POLITICS
PERSONALITIES
SOCIAL CONNECTIONS
DISASTERS
COMMERCE
Text Mining / Retrieval
• RETRIEVAL: discovery of text relevant to an information need
• MINING: discovery of new information in text (or reformulating information already there)
• natural language processing
• computational linguistics
• data mining, statistics
Text: Peculiarities
• Unstructured
• Word dependencies (context, grammars)
• Different languages, styles
• Noisy (misspellings, typos, scanning errors…)
• Burdensome formatting (HTML, XML…)
• Humor, sarcasm, ambiguity, etc.
Representing Text
• “Bag of words”, i.e. Vector Space Model
break the document into its constituent words and put them in a table
Indexing for Retrieval
Document Collection → Forward Index → Inverted Index

Forward index (document → terms):
Doc   Terms
D1    Apple, Pear, Pear
D2    Cat, Dog
D3    Cat, Cat, Tiger
…     …

Inverted index (term → documents, with counts):
Term   Doc (count)
Apple  D1 (1)
Pear   D1 (2)
Cat    D2 (1), D3 (2)
…      …

Conceptually, a document is a vector of terms, e.g. D1:
Apple  Pear  Cat  Tiger  …
1      2     0    0      …
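A minimal sketch of how the forward index above could be turned into an inverted index and a term vector. The function names `build_inverted_index` and `doc_vector` are illustrative, not from the slides.

```python
from collections import Counter, defaultdict

# Toy collection from the slide (forward index: document -> terms).
forward_index = {
    "D1": ["Apple", "Pear", "Pear"],
    "D2": ["Cat", "Dog"],
    "D3": ["Cat", "Cat", "Tiger"],
}

def build_inverted_index(forward):
    """Map each term to {doc_id: number of occurrences in that document}."""
    inverted = defaultdict(dict)
    for doc_id, terms in forward.items():
        for term, count in Counter(terms).items():
            inverted[term][doc_id] = count
    return dict(inverted)

def doc_vector(doc_id, vocabulary, inverted):
    """Term-count vector of one document over a fixed vocabulary order."""
    return [inverted.get(term, {}).get(doc_id, 0) for term in vocabulary]

inverted_index = build_inverted_index(forward_index)
print(inverted_index["Cat"])  # {'D2': 1, 'D3': 2}
print(doc_vector("D1", ["Apple", "Pear", "Cat", "Tiger"], inverted_index))  # [1, 2, 0, 0]
```

Looking up a query term in the inverted index touches only the documents that contain it, which is why retrieval systems index this way.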
Representing Text
• Preprocessing
  – Clean-up
    • remove formatting, tables, HTML…
  – Remove stopwords
    • the, of, to, a, in, and, that, for, is
  – Stem words
    • Porter Stemmer – heuristic
    • statistical, brute-force (lookup tables)
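A rough sketch of the three preprocessing steps. The stopword list is the one from the slide. The suffix rules in `crude_stem` are a made-up toy and are not the Porter Stemmer, which applies a much longer set of ordered rules.

```python
import re
from html import unescape

STOPWORDS = {"the", "of", "to", "a", "in", "and", "that", "for", "is"}  # from the slide
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical toy rules, NOT the Porter Stemmer

def clean(text):
    """Strip HTML/XML tags and entities, lowercase, and split into word tokens."""
    no_tags = re.sub(r"<[^>]+>", " ", text)
    return re.findall(r"[a-z]+", unescape(no_tags).lower())

def crude_stem(word):
    """Remove the first matching suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Clean-up, stopword removal, then stemming."""
    return [crude_stem(w) for w in clean(text) if w not in STOPWORDS]

print(preprocess("<p>The cats &amp; dogs jumped in the garden</p>"))
# ['cat', 'dog', 'jump', 'garden']
```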
Representing Text
• Preserving some meaning of the words:
  – Part of Speech tagging
  – Word Sense Disambiguation
  – Semantic annotation
EYE DROPS OFF SHELF
PROSTITUTES APPEAL TO POPE
KIDS MAKE NUTRITIOUS SNACKS
STOLEN PAINTING FOUND BY TREE
LUNG CANCER IN WOMEN MUSHROOMS
QUEEN MARY HAVING BOTTOM SCRAPED
DEALERS WILL HEAR CAR TALK AT NOON
MINERS REFUSE TO WORK AFTER DEATH
MILK DRINKERS ARE TURNING TO POWDER
DRUNK GETS NINE MONTHS IN VIOLIN CASE
JUVENILE COURT TO TRY SHOOTING DEFENDANT
COMPLAINTS ABOUT NBA REFEREES GROWING UGLY
PANDA MATING FAILS; VETERINARIAN TAKES OVER
MAN EATING PIRANHA MISTAKENLY SOLD AS PET FISH
ASTRONAUT TAKES BLAME FOR GAS IN SPACECRAFT
QUARTER OF A MILLION CHINESE LIVE ON WATER
INCLUDE YOUR CHILDREN WHEN BAKING COOKIES
OLD SCHOOL PILLARS ARE REPLACED BY ALUMNI
GRANDMOTHER OF EIGHT MAKES HOLE IN ONE
HOSPITALS ARE SUED BY 7 FOOT DOCTORS
LAWMEN FROM MEXICO BARBECUE GUESTS
TWO SOVIET SHIPS COLLIDE, ONE DIES
ENRAGED COW INJURES FARMER WITH AX
LACK OF BRAINS HINDERS RESEARCH
RED TAPE HOLDS UP NEW BRIDGE
SQUAD HELPS DOG BITE VICTIM
IRAQI HEAD SEEKS ARMS
HERSHEY BARS PROTEST
Representing Text
• Vector Space Model:
D = (t1, wd1; t2, wd2; …; tv, wdv)
w: term weight (binary, count, or TF-IDF)
Apple Pear Cat Tiger …
1 2 0 0 …
TFIDF
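The slide's formula did not survive in this transcript. One common formulation, assumed here, weights a term by its count in the document times the log of the inverse document frequency: w(t,d) = tf(t,d) · log(N / df(t)), where N is the number of documents and df(t) the number of documents containing t. Other variants (log-scaled tf, smoothed idf) are also in use. A sketch on the toy collection:

```python
import math
from collections import Counter

docs = {
    "D1": ["apple", "pear", "pear"],
    "D2": ["cat", "dog"],
    "D3": ["cat", "cat", "tiger"],
}

def tfidf(docs):
    """Return {doc_id: {term: tf * log(N / df)}} for the terms of each document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for terms in docs.values() for term in set(terms))
    weights = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        weights[doc_id] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return weights

weights = tfidf(docs)
for doc_id, w in weights.items():
    print(doc_id, {t: round(x, 3) for t, x in w.items()})
```

"pear" appears twice in one document only, so it gets a higher weight than "cat", which appears twice in D3 but also occurs in D2.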
Problems
• Synonymy
  – multiple words that have similar meanings
• Polysemy
  – words that have more than one meaning
Latent Semantic Indexing
• Index by the hidden “meaning” of text
“words that are used in the same contexts tend to have similar meanings”
• using Singular Value Decomposition
  – a linear algebra technique for factorization of matrices
Latent Semantic Indexing
X = R S T^T
The three factors:
• distribution of terms for a concept (concept language model)
• distribution of concepts in a document
• importance of each concept (S, the diagonal matrix of singular values)
Latent Semantic Indexing
1. Index using concepts instead of terms
2. Query represented like another document
3. Retrieve documents “closest” to query
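The three steps can be sketched with NumPy's SVD on the toy collection. This is an illustration under assumptions: the term-document orientation of X, the choice k = 2, and cosine similarity as the notion of "closest" are not specified on the slides.

```python
import numpy as np

terms = ["apple", "pear", "cat", "dog", "tiger"]
# Term-document count matrix: one row per term, one column per document (D1, D2, D3).
X = np.array([
    [1, 0, 0],  # apple
    [2, 0, 0],  # pear
    [0, 1, 2],  # cat
    [0, 1, 0],  # dog
    [0, 0, 1],  # tiger
], dtype=float)

k = 2  # number of concepts kept (arbitrary for this toy example)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k = U[:, :k]

def to_concepts(term_vector):
    """Step 1: project a term-count vector into the k-dimensional concept space."""
    return U_k.T @ term_vector

def cosine(a, b):
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / norm) if norm > 0 else 0.0

def search(query_terms):
    """Steps 2 and 3: treat the query as a document, rank documents by similarity."""
    q = np.array([float(query_terms.count(t)) for t in terms])
    q_k = to_concepts(q)
    sims = [cosine(q_k, to_concepts(X[:, j])) for j in range(X.shape[1])]
    return sorted(enumerate(sims), key=lambda pair: -pair[1])

print(search(["tiger"]))
```

With two concepts, the query "tiger" matches D2 (Cat, Dog) about as well as D3, even though D2 never mentions a tiger: both documents load on the same concept through the shared term "cat".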
Latent Semantic Analysis
• Document categorization (plagiarism)
• Comparing terms (synonymy)
• Works with any language
• Tolerant of noise (misspellings)
• Faults:
  – requires lots of memory
  – how many concepts should we use?
Probabilistic Text Retrieval
[http://nlp.stanford.edu]
Language Model
Generative Model
Probabilistic Text Retrieval
using chain rule:
P(t1t2t3t4) = P(t1)P(t2|t1)P(t3|t2t1)P(t4|t3t2t1)
unigram language model:
P(t1t2t3t4) = P(t1)P(t2)P(t3)P(t4)
bigram language model:
P(t1t2t3t4) = P(t1)P(t2|t1)P(t3|t2)P(t4|t3)
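To make the difference concrete, here is a small sketch that estimates both models from counts (maximum likelihood, no smoothing). The three training sentences and the `<s>` start marker are made up for illustration.

```python
from collections import Counter

# Tiny hypothetical training text; <s> marks the start of each sentence.
sentences = [
    ["<s>", "the", "cat", "sat"],
    ["<s>", "the", "dog", "sat"],
    ["<s>", "the", "cat", "ran"],
]

unigrams = Counter(tok for sent in sentences for tok in sent)
bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
total_tokens = sum(unigrams.values())

def p_unigram(tokens):
    """P(t1..tn) = product of P(ti): word order is ignored."""
    p = 1.0
    for tok in tokens:
        p *= unigrams[tok] / total_tokens
    return p

def p_bigram(tokens):
    """P(t1..tn) = P(t1) * product of P(ti | previous token)."""
    p = unigrams[tokens[0]] / total_tokens
    for prev, tok in zip(tokens, tokens[1:]):
        p *= (bigrams[(prev, tok)] / unigrams[prev]) if unigrams[prev] else 0.0
    return p

seq = ["<s>", "the", "cat", "sat"]
print(p_unigram(seq), p_unigram(seq[::-1]))  # same value: order does not matter
print(p_bigram(seq), p_bigram(seq[::-1]))    # the reversed sequence gets probability 0
```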
Probabilistic Text Retrieval
• Query likelihood model
Each document d has language model Md
P(d|q) = P(q|d)P(d)/P(q)
Naïve Bayes with each document as a class
P(q|Md) ≈ ∏_{t∈V} P(t|Md)^tf(t,q), where tf(t,q) is the number of times term t occurs in the query q
Probabilistic Text Retrieval
• Estimating P(t|Md):
P(t|Md) = tf(t,d) / Length(d)
• Prior for terms not appearing in the document (smoothing):
P’(t|Md) = collectionFreq(t) / collectionSize
Probabilistic Text Retrieval
• In practice: mixture between document language model Md and collection language model Mc
P(t|d) = λP(t|Md) + (1 – λ) P(t|Mc)
Probabilistic Text Retrieval
• In summary:
P(d|q) ∝ P(d) ∏_{t∈q} (λP(t|Md) + (1 – λ) P(t|Mc))
(P(q) is the same for every document, so it can be dropped for ranking)
• Rank the documents by P(d|q)
• Return the top few results
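A sketch of this ranking on the toy collection, assuming a uniform prior P(d) and λ = 0.5 (both arbitrary choices for illustration). Scores are summed in log space to avoid underflow on longer queries.

```python
import math
from collections import Counter

docs = {
    "D1": ["apple", "pear", "pear"],
    "D2": ["cat", "dog"],
    "D3": ["cat", "cat", "tiger"],
}
LAMBDA = 0.5  # mixture weight; an arbitrary value for illustration

# Collection language model Mc: term counts over all documents.
collection = Counter(t for terms in docs.values() for t in terms)
collection_size = sum(collection.values())

def log_score(query, doc_terms, lam=LAMBDA):
    """Sum over query terms of log(lam*P(t|Md) + (1-lam)*P(t|Mc)); uniform P(d) omitted."""
    tf = Counter(doc_terms)
    score = 0.0
    for t in query:
        p_doc = tf[t] / len(doc_terms)
        p_coll = collection[t] / collection_size
        p = lam * p_doc + (1 - lam) * p_coll
        if p == 0.0:  # term unseen in the whole collection
            return float("-inf")
        score += math.log(p)
    return score

def rank(query, top_k=2):
    """Return the ids of the top_k documents by query likelihood."""
    ordered = sorted(docs, key=lambda d: log_score(query, docs[d]), reverse=True)
    return ordered[:top_k]

print(rank(["cat", "tiger"]))  # ['D3', 'D2']
```

D2 contains no "tiger", yet it still gets a non-zero score because the collection model supplies probability mass for the missing term.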
Extensions
• Latent Dirichlet Allocation++
[Diagram labels: word w; topics T1, T2, T3, T4, T5; B (general English); Md (document-specific topic)]
Modeling Text (an aside)
• Generate your own Computer Science paper:
http://pdos.csail.mit.edu/scigen
Text Retrieval
Information Need
Query
Text Collection
Search Results
Start Here
Query Expansion
• Fixing spelling errors
• Stemming
• Alternative query suggestion
  – Query log mining
• Synonyms from a thesaurus
  – Medical terms: MeSH (Medical Subject Headings)
  – Manually or automatically created thesauruses
Pseudo-relevance Feedback
1. Assume top retrieved documents are relevant OR ask user to rate returned documents
2. Extract important words from these documents
3. Append to the query
4. Try again
Pseudo-relevance Feedback
• Rocchio algorithm for relevance feedback
q_opt = argmax_q [sim(q, Cr) – sim(q, Cn)]
where Cr is the set of relevant documents and Cn the set of non-relevant documents
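In practice the optimum is usually approximated by moving the query vector toward the centroid of the relevant documents and away from the centroid of the non-relevant ones: q_new = α·q + β·centroid(Cr) – γ·centroid(Cn). The sketch below uses that common update form. The weights α = 1.0, β = 0.75, γ = 0.15 are often-quoted defaults and are an assumption here, not values from the slides.

```python
def centroid(vectors, dim):
    """Component-wise mean of a list of vectors (zero vector if the list is empty)."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """q_new = alpha*q + beta*centroid(relevant) - gamma*centroid(non_relevant)."""
    dim = len(query)
    c_rel = centroid(relevant, dim)
    c_non = centroid(non_relevant, dim)
    updated = [alpha * q + beta * r - gamma * n
               for q, r, n in zip(query, c_rel, c_non)]
    return [max(0.0, w) for w in updated]  # negative term weights are clipped to zero

# Vocabulary order: apple, pear, cat, tiger
query = [0.0, 0.0, 1.0, 0.0]           # the query "cat"
relevant = [[0.0, 0.0, 2.0, 1.0]]      # D3: cat, cat, tiger
non_relevant = [[1.0, 2.0, 0.0, 0.0]]  # D1: apple, pear, pear
print(rocchio(query, relevant, non_relevant))  # [0.0, 0.0, 2.5, 0.75]
```

The expanded query now carries weight on "tiger", a term the user never typed.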
Retrieval Evaluation
• Want
  – Results that address my information need best
  – These results should be on top of the returned list
  – Diverse set of results to choose from
  – Timely?
• Relevance
  – What user says it is
Retrieval Evaluation
• Mean Average Precision
[area under the Precision-Recall curve]
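A sketch of the usual way Mean Average Precision is computed from ranked lists with binary relevance judgments: average precision is the mean of the precision values at each rank where a relevant document appears, and MAP averages that over queries. The two example rankings are invented.

```python
def precision_at_k(relevance, k):
    """Fraction of the top-k results that are relevant (relevance: 0/1 flags by rank)."""
    return sum(relevance[:k]) / k

def average_precision(relevance, total_relevant):
    """Mean of precision@k over the ranks k that hold a relevant document."""
    if total_relevant == 0:
        return 0.0
    precisions = [precision_at_k(relevance, k)
                  for k, rel in enumerate(relevance, start=1) if rel]
    return sum(precisions) / total_relevant

def mean_average_precision(runs):
    """runs: list of (relevance flags, total number of relevant documents), one per query."""
    return sum(average_precision(r, n) for r, n in runs) / len(runs)

runs = [
    ([1, 0, 1, 0, 0], 2),  # query 1: relevant results at ranks 1 and 3
    ([0, 1, 0, 0, 0], 1),  # query 2: relevant result at rank 2
]
print(round(mean_average_precision(runs), 4))  # 0.6667
print(precision_at_k(runs[0][0], 5))           # 0.4
```

Dividing by the total number of relevant documents, rather than by the number retrieved, penalizes a system that never returns some of them.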
Retrieval Evaluation
• In web search results users usually don’t look past the top 5 results
  – Use cutoff: Metric @ 5 or Metric @ 10
• Comparison between systems:
  – Control dataset, queries, relevance judgments
  – Text Retrieval Conference (TREC)
Beyond Retrieval
• Named entity recognition
• Summarization
• Template filling
• Text categorization
• Sentiment analysis
• Taxonomy extraction
• Hypothesis formation
• Social network extraction/analysis
Text Categorization
• Spam detection
• News monitoring
• Faceted search
• Automated labeling
• Authorship attribution
Text Categorization
• Classes already known:
  – Naïve Bayes
  – SVM
  – k-Nearest Neighbor
  – Neural Nets
• Discovering classes:
  – Clustering (e.g. k-means)
  – LSI
Text Categorization
[Diagram labels: positive reviews; negative reviews; feature extraction; classifier training; models Mp and Mn; unlabeled reviews; classifying instances]
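The review-classification setup can be sketched with a small multinomial Naïve Bayes classifier, where the per-class word counts play the role of the models Mp and Mn. The four training reviews are invented, and add-one smoothing is an assumption of this sketch.

```python
import math
from collections import Counter

def train(labeled_docs):
    """labeled_docs: list of (tokens, label). Returns word counts, class counts, vocabulary."""
    word_counts, class_counts = {}, Counter()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocabulary

def classify(tokens, model):
    """Pick the class maximizing log P(class) + sum of log P(word | class), add-one smoothed."""
    word_counts, class_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, counts in word_counts.items():
        score = math.log(class_counts[label] / total_docs)
        denom = sum(counts.values()) + len(vocabulary)
        for tok in tokens:
            if tok in vocabulary:  # words never seen in training are skipped
                score += math.log((counts[tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

training = [
    ("great phone love it".split(), "positive"),
    ("love the screen".split(), "positive"),
    ("terrible battery hate it".split(), "negative"),
    ("hate the screen".split(), "negative"),
]
model = train(training)
print(classify("love the phone".split(), model))    # positive
print(classify("hate the battery".split(), model))  # negative
```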
Who likes my product?
What features do they like?
Do people like my competitor’s product?
What experiences do people have with my product?
Text Summarization
• Information overload
• Article summaries
• Cliff Notes
• TV Guide
• Medical summary
• Document “preview”
Text Summarization
• Extraction
  – copying most important parts of the document
• Abstraction
  – paraphrasing sections of the document
• Single document vs multiple documents
• Generic vs query-focused
Text Summarization
• Finding important text:
  – position
  – cue phrase indicators
  – word/phrase frequency
  – query and title overlap
  – discourse structure criteria
  – formatting
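A toy extractive summarizer using two of these cues, word frequency and position. The scoring formula, the stopword list, and the `position_bonus` value are made up for illustration.

```python
import re
from collections import Counter

STOPWORDS = {"the", "of", "to", "a", "in", "and", "that", "for", "is", "it", "on", "was"}

def summarize(text, n_sentences=1, position_bonus=0.1):
    """Return the n highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [[w for w in re.findall(r"[a-z']+", s.lower()) if w not in STOPWORDS]
                 for s in sentences]
    freq = Counter(w for words in tokenized for w in words)

    def score(i):
        words = tokenized[i]
        if not words:
            return 0.0
        avg_freq = sum(freq[w] for w in words) / len(words)    # word-frequency cue
        return avg_freq + (position_bonus if i == 0 else 0.0)  # position cue: favor the lead

    best = sorted(range(len(sentences)), key=score, reverse=True)[:n_sentences]
    return [sentences[i] for i in sorted(best)]

text = ("The oil spill is spreading along the coast. "
        "Officials met on Tuesday. "
        "Cleaning up the oil spill will be costly.")
print(summarize(text))
# ['The oil spill is spreading along the coast.']
```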
Named Entity Extraction
People, locations, companies, events…
Alberto Maria Segre
Dr Segre
Professor Segre
alberto
AMS
A. M. Segre
Named Entity Extraction
• Vocabulary matching
  – Problem: vocabulary transfer
• Rule-based
  – Regular expressions, rules of thumb
• Bootstrapping
  – Using “seed” NEs to find rules
• Machine learning
  – SVM, HMM, Decision Trees, Maximum Entropy…
[Nadeau & Sekine: Survey (2006)]
Named Entity Extraction
• Features:
  – Case
  – Digit
  – Character
  – Punctuation
  – Morphology
  – Part-of-speech
  – Dictionary entry
  – Meta information
  – Corpus frequency
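A sketch of how a few of these features might be computed per token before being handed to a classifier. The feature names and the small gazetteer are hypothetical. Part-of-speech, meta information, and corpus frequency would need external resources and are left out.

```python
import string

def token_features(token, dictionary=frozenset()):
    """Surface features for one token; `dictionary` is a hypothetical gazetteer of known names."""
    return {
        "is_capitalized": token[:1].isupper(),                                # case
        "is_all_caps": token.isupper(),                                       # case
        "has_digit": any(ch.isdigit() for ch in token),                       # digit
        "has_punctuation": any(ch in string.punctuation for ch in token),     # punctuation
        "suffix_3": token[-3:].lower(),                                       # crude morphology cue
        "in_dictionary": token.lower() in dictionary,                         # dictionary entry
    }

gazetteer = frozenset({"louisiana", "mississippi", "alabama", "florida"})
for tok in ["BP", "Louisiana", "spill", "A.M."]:
    print(tok, token_features(tok, gazetteer))
```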
As Gulf spill spreads, blame game begins
When BP looks at the spreading oil slick in the Gulf of Mexico that now threatens flora, fauna and livelihoods along the coasts of Louisiana, Mississippi, Alabama and Florida, it's really seeing money floating away on the tide. That's why it may be trying to shift some of the blame for the massive undersea leak to Transocean, which was running the rig that exploded on April 20 and eventually sank, leaving one of the worst oil spills in history in its wake. "It wasn't our accident, but we are absolutely responsible for the oil, for cleaning it up, and that's what we intend to do," BP Group CEO Tony Hayward told NBC's "TODAY" show.
http://www.msnbc.msn.com/id/36917929/ns/business-us_business/
Entity Disambiguation
Entities can be referred to differently
Alberto Maria Segre
Dr Segre
Professor Segre
alberto
AMS
A. M. Segre
AI Professor
Masters students adviser at UI
Entity Disambiguation
• Rules
  – Name use, emails, greetings, templates
• Outside sources
  – Wikipedia, ontologies, dictionaries…
• Entity profiles
  – Context
Web Mining
• Peculiarities:
  – Linked structure
  – Multimedia
  – Spam
  – Huge dataset
  – Much used
  – Variety of topics
  – Variety of authors
Web Crawling
• Selection policy: which web pages to crawl?
• Focused crawlers
  – Relevance to the query
• Exploratory crawlers
  – Depth-first, breadth-first, URL, anchor text, quality of in-link, number of in-links, PageRank
Web Crawling
PageRank measures the importance of a page
Important pages point to other important pages
Intuition: how often you visit a page on a random walk
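The random-walk view can be sketched with power iteration. The damping factor 0.85 (the chance of following a link rather than jumping to a random page), the iteration count, and the four-page graph are assumptions of this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns {page: score}; scores sum to 1."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Random jump: every page receives an equal share of (1 - damping).
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            targets = outlinks if outlinks else pages  # dangling page: jump anywhere
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical four-page web in which every other page links to "hub".
web = {"hub": ["a"], "a": ["hub"], "b": ["hub"], "c": ["hub", "a"]}
ranks = pagerank(web)
print(max(ranks, key=ranks.get))  # hub
```

Page "a" has fewer in-links than "hub" but still ranks far above "b" and "c", because its in-link comes from the most important page.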
Web Crawling
• Other policy considerations
  – re-visit policy
  – politeness policy (robots.txt)
    • Robots Exclusion Protocol
  – parallelization policy
• Identify yourself as a bot
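A sketch of a politeness check using Python's standard `urllib.robotparser`. The bot name and the robots.txt content are hypothetical. A real crawler would fetch the file from the site with `set_url()` and `read()` instead of parsing a string.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleCourseBot"  # hypothetical bot name; identify your crawler honestly

# Hypothetical robots.txt content, parsed offline for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(url):
    """True if robots.txt lets our user agent fetch this URL."""
    return parser.can_fetch(USER_AGENT, url)

def polite_delay(default=1.0):
    """Seconds to wait between requests: the site's Crawl-delay if given, else a default."""
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default

for url in ["http://example.com/index.html", "http://example.com/private/data.html"]:
    print(url, allowed(url))
print(polite_delay())
```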
Web Graph Mining
• Authority (search results)
• Overview sites
• Social analysis
• Relationships between topics (site maps)
Web Content Mining
• Sociology
• Epidemiology
• Marketing
• Disaster detection
• Finding people
• Finding information
• …