text mining & retrieval. text information thoughts opinions feelings stories documentaries news...

Text Mining & Retrieval

Text information: thoughts, opinions, feelings, stories, documentaries, news, languages, everyday life, science, discussions, politics, personalities, social connections, disasters, commerce

Text Mining / Retrieval

• RETRIEVAL: discovery of text relevant to an information need

• MINING: discovery of new information in text (or reformulating information already there)

• natural language processing
• computational linguistics
• data mining, statistics

Text: Peculiarities

• Unstructured
• Word dependencies (context, grammars)
• Different languages, styles
• Noisy (misspellings, typos, scanning errors…)
• Burdensome formatting (HTML, XML…)
• Humor, sarcasm, ambiguity, etc.

Representing Text

• “Bag of words”, i.e. Vector Space Model

break the document into its constituent words and put them in a table

Indexing for Retrieval

Document collection → forward index → inverted index

Forward index:

Doc   Terms
D1    Apple, Pear, Pear
D2    Cat, Dog
D3    Cat, Cat, Tiger
…     …

Inverted index:

Term    Documents (count)
Apple   D1 (1)
Pear    D1 (2)
Cat     D2 (1), D3 (2)
…       …

Conceptually, a document is a vector of term counts, e.g. D1:

Apple  Pear  Cat  Tiger  …
1      2     0    0      …
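A minimal sketch of the two indexes, assuming the toy collection from the slide. The function and variable names are illustrative, not taken from any particular library.

```python
from collections import Counter, defaultdict

# Toy collection from the slide; names here are illustrative assumptions.
collection = {
    "D1": ["Apple", "Pear", "Pear"],
    "D2": ["Cat", "Dog"],
    "D3": ["Cat", "Cat", "Tiger"],
}

def build_forward_index(docs):
    """Map each document id to its term counts."""
    return {doc_id: Counter(terms) for doc_id, terms in docs.items()}

def build_inverted_index(forward):
    """Map each term to {doc_id: count}, i.e. the postings for that term."""
    inverted = defaultdict(dict)
    for doc_id, counts in forward.items():
        for term, count in counts.items():
            inverted[term][doc_id] = count
    return dict(inverted)

def to_vector(counts, vocabulary):
    """Lay a document's counts out along a fixed vocabulary order."""
    return [counts.get(term, 0) for term in vocabulary]

forward_index = build_forward_index(collection)
inverted_index = build_inverted_index(forward_index)
vocabulary = ["Apple", "Pear", "Cat", "Tiger"]
d1_vector = to_vector(forward_index["D1"], vocabulary)
```

Looking up `inverted_index["Cat"]` returns the postings for D2 and D3 without scanning D1, which is why retrieval uses the inverted form.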

Representing Text

• Preprocessing
– Clean-up
• remove formatting, tables, HTML…
– Remove stopwords
• the, of, to, a, in, and, that, for, is
– Stem words
• Porter Stemmer – heuristic
• statistical, brute-force (lookup tables)
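The preprocessing steps can be sketched roughly as follows. The stopword list is the one on the slide. The suffix-stripping stemmer is a deliberately crude stand-in with a hypothetical suffix list; it is not the Porter algorithm.

```python
import re

# Stopword list copied from the slide.
STOPWORDS = {"the", "of", "to", "a", "in", "and", "that", "for", "is"}

# Hypothetical suffix list for a crude stemmer; NOT the Porter Stemmer.
SUFFIXES = ("ing", "ed", "es", "s")

def clean(text):
    """Strip HTML-like tags and lowercase the text."""
    return re.sub(r"<[^>]+>", " ", text).lower()

def crude_stem(word):
    """Remove the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Clean up, tokenize, drop stopwords, then stem."""
    tokens = re.findall(r"[a-z]+", clean(text))
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

tokens = preprocess("<p>The cats jumped in the boxes</p>")
```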

Representing Text

• Preserving some meaning of the words:
– Part of Speech tagging
– Word Sense Disambiguation
– Semantic annotation

EYE DROPS OFF SHELF
PROSTITUTES APPEAL TO POPE
KIDS MAKE NUTRITIOUS SNACKS
STOLEN PAINTING FOUND BY TREE
LUNG CANCER IN WOMEN MUSHROOMS
QUEEN MARY HAVING BOTTOM SCRAPED
DEALERS WILL HEAR CAR TALK AT NOON
MINERS REFUSE TO WORK AFTER DEATH
MILK DRINKERS ARE TURNING TO POWDER
DRUNK GETS NINE MONTHS IN VIOLIN CASE
JUVENILE COURT TO TRY SHOOTING DEFENDANT
COMPLAINTS ABOUT NBA REFEREES GROWING UGLY
PANDA MATING FAILS; VETERINARIAN TAKES OVER
MAN EATING PIRANHA MISTAKENLY SOLD AS PET FISH
ASTRONAUT TAKES BLAME FOR GAS IN SPACECRAFT
QUARTER OF A MILLION CHINESE LIVE ON WATER
INCLUDE YOUR CHILDREN WHEN BAKING COOKIES
OLD SCHOOL PILLARS ARE REPLACED BY ALUMNI
GRANDMOTHER OF EIGHT MAKES HOLE IN ONE
HOSPITALS ARE SUED BY 7 FOOT DOCTORS
LAWMEN FROM MEXICO BARBECUE GUESTS
TWO SOVIET SHIPS COLLIDE, ONE DIES
ENRAGED COW INJURES FARMER WITH AX
LACK OF BRAINS HINDERS RESEARCH
RED TAPE HOLDS UP NEW BRIDGE
SQUAD HELPS DOG BITE VICTIM
IRAQI HEAD SEEKS ARMS
HERSHEY BARS PROTEST

Representing Text

• Vector Space Model:

$D = (t_1, w_{d,1};\ t_2, w_{d,2};\ \dots;\ t_v, w_{d,v})$

$w$: binary, count, or TF-IDF

Apple Pear Cat Tiger …

1 2 0 0 …
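The three weighting schemes can be compared on a toy collection. This is a sketch under assumptions: the collection is made up for illustration, and TF-IDF is computed as tf × log(N / df), which is one common variant among several.

```python
import math
from collections import Counter

# Illustrative toy collection (an assumption, mirroring the indexing slide).
docs = {
    "D1": ["apple", "pear", "pear"],
    "D2": ["cat", "dog"],
    "D3": ["cat", "cat", "tiger"],
}

def document_frequency(docs):
    """Count in how many documents each term appears."""
    df = Counter()
    for terms in docs.values():
        df.update(set(terms))
    return df

def weights(terms, df, n_docs, scheme="tfidf"):
    """Return {term: weight} under the chosen scheme."""
    tf = Counter(terms)
    if scheme == "binary":
        return {t: 1 for t in tf}
    if scheme == "count":
        return dict(tf)
    # tf-idf with idf = log(N / df); one common variant, assumed here
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf}

df = document_frequency(docs)
d3_tfidf = weights(docs["D3"], df, len(docs))
```

In D3, "tiger" occurs once and "cat" twice, yet "tiger" ends up with the larger TF-IDF weight because it appears in only one document.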

Problems

• Synonymy
– multiple words that have similar meanings

• Polysemy
– words that have more than one meaning

Latent Semantic Indexing

• Index by the hidden “meaning” of text

“words that are used in the same contexts tend to have similar meanings”

• using Singular Value Decomposition
– a linear algebra technique for factorization of matrices


$X = R\,S\,T^{T}$

The factors capture:
– the distribution of terms for a concept (concept language model)
– the distribution of concepts in a document
– the importance of each concept (the diagonal matrix $S$)


1. Index using concepts instead of terms

2. Query represented like another document

3. Retrieve documents “closest” to query
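The three steps above can be sketched with NumPy's SVD. This is a toy illustration under assumptions: the term-by-document matrix and the choice of k = 2 concepts are made up, and the factors use NumPy's names (U, s, Vt) rather than the slide's R, S, T.

```python
import numpy as np

terms = ["apple", "pear", "cat", "dog", "tiger"]
doc_ids = ["D1", "D2", "D3"]
# Term-by-document count matrix X (rows: terms, columns: documents).
X = np.array([
    [1, 0, 0],   # apple
    [2, 0, 0],   # pear
    [0, 1, 2],   # cat
    [0, 1, 0],   # dog
    [0, 0, 1],   # tiger
], dtype=float)

k = 2  # number of concepts kept; an arbitrary choice for this toy example

# Step 1: factorize X and keep the k strongest concepts.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, s_k = U[:, :k], s[:k]
docs_k = Vt[:k, :].T  # one row of concept coordinates per document

def fold_in(query_terms):
    """Step 2: map a query into concept space like another document."""
    q = np.array([query_terms.count(t) for t in terms], dtype=float)
    return (q @ U_k) / s_k

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank(query_terms):
    """Step 3: score every document by closeness to the query."""
    q_k = fold_in(query_terms)
    return {d: cosine(q_k, docs_k[i]) for i, d in enumerate(doc_ids)}

scores = rank(["dog"])
```

In this toy example the query "dog" scores D3 about as high as D2, even though D3 does not contain "dog": both documents load on the same concept through the shared term "cat".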

Latent Semantic Analysis

• Document categorization (plagiarism)
• Comparing terms (synonymy)
• Works with any language
• Tolerant of noise (misspellings)

• Faults:
– requires lots of memory
– how many concepts should we use?

Probabilistic Text Retrieval

[http://nlp.stanford.edu]

Language Model as a Generative Model


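using chain rule:
$P(t_1 t_2 t_3 t_4) = P(t_1)\,P(t_2 \mid t_1)\,P(t_3 \mid t_1 t_2)\,P(t_4 \mid t_1 t_2 t_3)$

unigram language model:
$P(t_1 t_2 t_3 t_4) = P(t_1)\,P(t_2)\,P(t_3)\,P(t_4)$

bigram language model:
$P(t_1 t_2 t_3 t_4) = P(t_1)\,P(t_2 \mid t_1)\,P(t_3 \mid t_2)\,P(t_4 \mid t_3)$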
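To make the unigram and bigram factorizations concrete, here is a small sketch that estimates both from raw counts. The nine-word corpus is invented for illustration, and the estimates are unsmoothed maximum-likelihood counts.

```python
from collections import Counter

# Invented toy corpus (an assumption for illustration).
corpus = "the cat sat on the mat the cat ate".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_unigram(seq):
    """P(t1..tn) as a product of independent term probabilities."""
    p = 1.0
    for t in seq:
        p *= unigrams[t] / len(corpus)
    return p

def p_bigram(seq):
    """P(t1..tn) where each term is conditioned on the previous term only."""
    p = unigrams[seq[0]] / len(corpus)
    for prev, cur in zip(seq, seq[1:]):
        if unigrams[prev] == 0:
            return 0.0
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p
```

The bigram model prefers "the cat" over "cat the" because it keeps word order, while the unigram model assigns both the same probability.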


• Query likelihood model

Each document d has language model Md

$P(d \mid q) = P(q \mid d)\,P(d) / P(q)$

Naïve Bayes with each document as a class

$P(q \mid M_d) \approx \prod_{t \in V} P(t \mid M_d)^{\mathrm{tf}(t,q)}$


• Estimating $P(t \mid M_d)$:

$P(t \mid M_d) = \mathrm{tf}(t,d) / \mathrm{Length}_d$

• Prior for terms not appearing in the document (smoothing):

$P'(t \mid M_d) = \mathrm{collectionFreq}(t) / \mathrm{collectionSize}$


• In practice: mixture between document language model $M_d$ and collection language model $M_c$

$P(t \mid d) = \lambda P(t \mid M_d) + (1 - \lambda) P(t \mid M_c)$


• In summary:

$P(d \mid q) \propto P(d) \prod_{t \in q} \left( \lambda P(t \mid M_d) + (1 - \lambda) P(t \mid M_c) \right)$

• Rank the documents by $P(d \mid q)$
• Return the top few results
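The summary formula can be sketched as a small ranking function. Assumptions: the collection is a toy example, the document prior P(d) is uniform, and λ = 0.5 is an arbitrary mixing weight that would normally be tuned.

```python
from collections import Counter

# Toy collection (an assumption for illustration).
docs = {
    "D1": "apple pear pear".split(),
    "D2": "cat dog".split(),
    "D3": "cat cat tiger".split(),
}
LAMBDA = 0.5  # mixing weight; an assumed value, normally tuned

collection = [t for terms in docs.values() for t in terms]
collection_tf = Counter(collection)

def p_doc(term, terms):
    """P(t|Md): relative frequency of the term in the document."""
    return terms.count(term) / len(terms)

def p_coll(term):
    """P(t|Mc): relative frequency of the term in the whole collection."""
    return collection_tf[term] / len(collection)

def score(query, terms, prior=1.0):
    """P(d) times the product over query terms of the smoothed mixture."""
    p = prior
    for t in query:
        p *= LAMBDA * p_doc(t, terms) + (1 - LAMBDA) * p_coll(t)
    return p

def rank(query, top=2):
    """Rank documents by score and return the top few."""
    scored = {d: score(query, terms) for d, terms in docs.items()}
    return sorted(scored, key=scored.get, reverse=True)[:top]

top_docs = rank(["cat", "tiger"])
```

Thanks to the collection model, D2 still gets a non-zero score for the query even though it lacks "tiger".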

Extensions

• Latent Dirichlet Allocation++

[Figure labels: word w; topics T1 to T5; B, general English; Md, document-specific topic]

Modeling Text (an aside)

• Generate your own Computer Science paper:

http://pdos.csail.mit.edu/scigen

Text Retrieval

[Diagram: Information Need (start here) → Query → Text Collection → Search Results]

Query Expansion

• Fixing spelling errors
• Stemming
• Alternative query suggestion
– Query log mining
• Synonyms from a thesaurus
– Medical terms: MeSH (Medical Subject Headings)
– Manually or automatically created thesauruses

Pseudo-relevance Feedback

1. Assume top retrieved documents are relevant OR ask user to rate returned documents

2. Extract important words from these documents

3. Append to the query

4. Try again

Pseudo-relevance Feedback

• Rocchio algorithm for relevance feedback

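$q_{opt} = \arg\max_{q} \left[ \mathrm{sim}(q, C_r) - \mathrm{sim}(q, C_n) \right]$

where $C_r$ is the set of relevant documents and $C_n$ the set of non-relevant documents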
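In practice the Rocchio update is usually applied in its weighted-centroid form: move the query toward the centroid of the relevant documents and away from the centroid of the non-relevant ones. A minimal sketch, assuming sparse term-weight dictionaries; the weights alpha, beta and gamma are commonly cited textbook values treated here as assumptions, and the example vectors are invented.

```python
from collections import defaultdict

def centroid(vectors):
    """Average a list of sparse {term: weight} vectors."""
    if not vectors:
        return {}
    total = defaultdict(float)
    for vec in vectors:
        for term, w in vec.items():
            total[term] += w
    return {t: w / len(vectors) for t, w in total.items()}

def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return alpha*q + beta*centroid(Cr) - gamma*centroid(Cn), dropping
    terms whose weight is not positive."""
    c_r, c_n = centroid(relevant), centroid(non_relevant)
    updated = {}
    for term in set(query) | set(c_r) | set(c_n):
        w = (alpha * query.get(term, 0.0)
             + beta * c_r.get(term, 0.0)
             - gamma * c_n.get(term, 0.0))
        if w > 0:
            updated[term] = w
    return updated

# Invented example: the user meant the animal, not the car.
query = {"jaguar": 1.0}
relevant = [{"jaguar": 1.0, "cat": 2.0}, {"jaguar": 1.0, "cat": 1.0, "jungle": 1.0}]
non_relevant = [{"jaguar": 1.0, "car": 3.0}]
new_query = rocchio(query, relevant, non_relevant)
```

The expanded query gains "cat" and "jungle", while "car" receives a negative weight and is dropped.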

Retrieval Evaluation

• Want
– Results that address my information need best
– These results should be on top of the returned list
– Diverse set of results to choose from
– Timely?

• Relevance
– What user says it is


• Mean Average Precision

[area under the Precision-Recall curve]


• In web search results users usually don’t look past the top 5 results
– Use cutoff: Metric @ 5 or Metric @ 10

• Comparison between systems:
– Control dataset, queries, relevance judgments
– Text Retrieval Conference (TREC)
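The cutoff metric and Mean Average Precision can be sketched as below. Function names are illustrative; average precision is computed in its usual non-interpolated form, which approximates the area under the precision-recall curve.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant (Metric @ k)."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant
    document is retrieved."""
    hits, total = 0, 0.0
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """Average AP over (ranked list, relevant set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

ranked = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d5"}
p5 = precision_at_k(ranked, relevant, 5)
ap = average_precision(ranked, relevant)
```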

Beyond Retrieval

• Named entity recognition
• Summarization
• Template filling
• Text categorization
• Sentiment analysis
• Taxonomy extraction
• Hypothesis formation
• Social network extraction/analysis

Text Categorization

• Spam detection
• News monitoring
• Faceted search
• Automated labeling
• Authorship attribution

Text Categorization

• Classes already known:
– Naïve Bayes
– SVM
– k-Nearest Neighbor
– Neural Nets

• Discovering classes:
– kNN Clustering
– LSI

Text Categorization

[Figure: review classification pipeline. Labels: feature extraction; positive reviews and negative reviews → classifier training → models Mp and Mn; unlabeled reviews → classifying instances]

• Who likes my product?
• What features do they like?
• Do people like my competitor’s product?
• What experiences do people have with my product?
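The review pipeline (labeled reviews, classifier training, then classifying unlabeled reviews) can be sketched with a multinomial Naïve Bayes classifier. Assumptions: the four training reviews are invented, features are whitespace-separated words, and add-one smoothing is used. The per-class word counts play the role of the models Mp and Mn.

```python
import math
from collections import Counter

# Invented training data (an assumption for illustration).
train = [
    ("great phone love the screen", "pos"),
    ("love it great battery", "pos"),
    ("terrible battery hate it", "neg"),
    ("screen broke terrible support", "neg"),
]

def train_nb(examples):
    """Collect per-class word counts, class counts and the vocabulary."""
    word_counts = {}
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab

def classify(text, model):
    """Return the class with the highest log posterior."""
    word_counts, class_counts, vocab = model
    n = sum(class_counts.values())
    best, best_lp = None, -math.inf
    for label, counts in word_counts.items():
        lp = math.log(class_counts[label] / n)
        total = sum(counts.values())
        for w in text.split():
            if w in vocab:  # ignore words never seen in training
                lp += math.log((counts[w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

model = train_nb(train)
label = classify("love the great screen", model)
```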

Text Categorization

Text Summarization

• Information overload

• Article summaries
• Cliff Notes
• TV Guide
• Medical summary
• Document “preview”

Text Summarization

• Extraction
– copying most important parts of the document

• Abstraction
– paraphrasing sections of the document

• Single document vs multiple documents
• Generic vs query-focused

Text Summarization

• Finding important text:

– position
– cue phrase indicators
– word/phrase frequency
– query and title overlap
– discourse structure criteria
– formatting
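Of the criteria above, word frequency is the easiest to sketch. The following extractive summarizer scores each sentence by the average document frequency of its non-stopword tokens. It is a simplified illustration: the sentence splitter is a naive regular expression and the other criteria (position, cue phrases, and so on) are ignored.

```python
import re
from collections import Counter

STOPWORDS = {"the", "of", "to", "a", "in", "and", "that", "for", "is"}

def content_words(text):
    """Lowercased word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

def summarize(text, n_sentences=1):
    """Return the n highest-scoring sentences in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return [s for s in sentences if s in top]

text = ("Oil spread across the gulf. The oil spill is an oil disaster. "
        "Officials met on Tuesday.")
summary = summarize(text)
```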

Named Entity Extraction

People, locations, companies, events…

Alberto Maria Segre
Dr Segre
Professor Segre
alberto
AMS
A. M. Segre

Named Entity Extraction

• Vocabulary matching
– Problem: vocabulary transfer

• Rule-based
– Regular expressions, rules of thumb

• Bootstrapping
– Using “seed” NEs to find rules

• Machine learning
– SVM, HMM, Decision Trees, Maximum Entropy…

[Nadeau & Sekine: Survey (2006)]
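As an illustration of the rule-based approach, the sketch below uses one regular expression to find person names introduced by a title. The rule and the title list are hypothetical; a real system would combine many such rules.

```python
import re

# Hypothetical rule: a title, then one or more capitalized words or initials.
PERSON = re.compile(
    r"\b(?:Dr|Professor|Mr|Ms)\.?\s+"
    r"(?:[A-Z][a-z]+|[A-Z]\.)"
    r"(?:\s+(?:[A-Z][a-z]+|[A-Z]\.))*"
)

def find_people(text):
    """Return every span matched by the title-plus-name rule."""
    return PERSON.findall(text)

people = find_people(
    "Yesterday Dr Segre met Professor Alberto Maria Segre and Mr. A. M. Segre."
)
```

The rule misses untitled mentions such as a bare first name, which is one reason rules are often combined with vocabularies or learned models.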

Named Entity Extraction

• Features:
– Case
– Digit
– Character
– Punctuation
– Morphology
– Part-of-speech
– Dictionary entry
– Meta information
– Corpus frequency

As Gulf spill spreads, blame game begins

When BP looks at the spreading oil slick in the Gulf of Mexico that now threatens flora, fauna and livelihoods along the coasts of Louisiana, Mississippi, Alabama and Florida, it's really seeing money floating away on the tide. That's why it may be trying to shift some of the blame for the massive undersea leak to Transocean, which was running the rig that exploded on April 20 and eventually sank, leaving one of the worst oil spills in history in its wake. "It wasn't our accident, but we are absolutely responsible for the oil, for cleaning it up, and that's what we intend to do," BP Group CEO Tony Hayward told NBC's "TODAY" show.

http://www.msnbc.msn.com/id/36917929/ns/business-us_business/
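A few of the listed features (case, digit, punctuation, a crude morphology cue, dictionary entry) can be computed per token as sketched below. The feature names and the three-character suffix are illustrative assumptions; part-of-speech, meta information and corpus frequency would need external resources and are left out.

```python
import string

def token_features(token, dictionary=frozenset()):
    """Return a small feature dictionary for one token."""
    return {
        "is_capitalized": token[:1].isupper(),
        "is_all_caps": token.isupper(),
        "has_digit": any(c.isdigit() for c in token),
        "has_punctuation": any(c in string.punctuation for c in token),
        "suffix3": token[-3:].lower(),          # crude morphology cue
        "in_dictionary": token.lower() in dictionary,
    }

# Hypothetical gazetteer of place names for the dictionary feature.
places = frozenset({"louisiana", "mississippi", "alabama", "florida"})
bp = token_features("BP", places)
louisiana = token_features("Louisiana", places)
```

A classifier such as those listed earlier would take these dictionaries as input for each token in the article.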

Entity Disambiguation

Entities can be referred to differently

Alberto Maria Segre
Dr Segre
Professor Segre
alberto
AMS
A. M. Segre
AI Professor
Masters students adviser at UI

Entity Disambiguation

• Rules
– Name use, emails, greetings, templates

• Outside sources
– Wikipedia, ontologies, dictionaries…

• Entity profiles
– Context

Web Mining

• Peculiarities:
– Linked structure
– Multimedia
– Spam
– Huge dataset
– Much used
– Variety of topics
– Variety of authors

Web Crawling

• Selection policy: which web pages to crawl?

• Focused crawlers
– Relevance to the query

• Exploratory crawlers
– Depth-first, breadth-first, URL, anchor text, quality of in-link, number of in-links, PageRank

Web Crawling

PageRank measures the importance of a page

important pages point to other important pages

number of times you visit a page on a random walk
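The random-walk view can be sketched with power iteration: repeatedly redistribute each page's rank along its out-links until the values settle. Assumptions: the four-page link graph is invented, the damping factor 0.85 is a conventional choice, and pages without out-links spread their rank evenly.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate the share of visits each page gets on a random walk."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # Random jump: every page gets a small baseline share.
        new = {p: (1 - damping) / len(pages) for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for t in pages:
                    new[t] += damping * rank[p] / len(pages)
        rank = new
    return rank

# Invented link graph: page -> pages it links to.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
```

Page C ends up with the highest rank because three pages link to it, and A comes second because the important page C links only to A.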

Web Crawling

• Other policy considerations
– re-visit policy
– politeness policy (robots.txt)
• Robot Exclusion Protocol
• Identify yourself as a bot
– parallelization policy
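A politeness check against robots.txt can be sketched with Python's standard `urllib.robotparser`. To stay self-contained, the example parses an inline robots.txt instead of fetching one; the bot name, the rules and the example.com URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleCrawler"  # hypothetical bot name used to identify ourselves

# Hypothetical robots.txt content; a crawler would fetch this from the site.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

def allowed(url):
    """True if the site's rules let our user agent fetch this URL."""
    return parser.can_fetch(USER_AGENT, url)

ok = allowed("http://example.com/index.html")
blocked = allowed("http://example.com/private/data.html")
```

A crawler would call `allowed` before each request and send the same user-agent string in its HTTP headers.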

Web Graph Mining

• Authority (search results)
• Overview sites
• Social analysis
• Relationships between topics (site maps)

Web Content Mining

• Sociology
• Epidemiology
• Marketing
• Disaster detection
• Finding people
• Finding information
• …
