text mining & retrieval. text information thoughts opinions feelings stories documentaries news...

TEXT Mining & Retrieval

Upload: rosanna-holt

Post on 27-Dec-2015


Page 1: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS

TEXT

Mining & Retrieval


TEXT
INFORMATION
THOUGHTS
OPINIONS
FEELINGS
STORIES
DOCUMENTARIES
NEWS
LANGUAGES
EVERY DAY LIFE
SCIENCE
DISCUSSIONS
POLITICS
PERSONALITIES
SOCIAL CONNECTIONS
DISASTERS
COMMERCE


Text Mining / Retrieval

• RETRIEVAL: discovery of text relevant to an information need

• MINING: discovery of new information in text (or reformulating information already there)

• natural language processing
• computational linguistics
• data mining, statistics


Text: Peculiarities

• Unstructured
• Word dependencies (context, grammars)
• Different languages, styles
• Noisy (misspellings, typos, scanning errors…)
• Burdensome formatting (HTML, XML…)
• Humor, sarcasm, ambiguity, etc…


Representing Text

• “Bag of words”, i.e. Vector Space Model

break the document into its constituent words and put them in a table


Indexing for Retrieval

Forward Index (one row per document in the collection):

Doc   Terms
D1    Apple, Pear, Pear
D2    Cat, Dog
D3    Cat, Cat, Tiger
…     …

Inverted Index (one row per term):

Term    Docs (with counts)
Apple   D1 1
Pear    D1 2
Cat     D2 1, D3 2
…       …

Conceptually, a document is a vector of terms (here, D1):

Apple  Pear  Cat  Tiger  …
1      2     0    0      …
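The tables above can be rebuilt in a few lines. This is a minimal sketch using the toy collection from the slide. The function names `build_inverted_index` and `term_vector` are illustrative, not from the slides.

```python
from collections import Counter, defaultdict

# Toy collection taken from the forward-index table.
docs = {
    "D1": ["Apple", "Pear", "Pear"],
    "D2": ["Cat", "Dog"],
    "D3": ["Cat", "Cat", "Tiger"],
}

def build_inverted_index(forward_index):
    """Map each term to {doc_id: count of the term in that document}."""
    inverted = defaultdict(dict)
    for doc_id, terms in forward_index.items():
        for term, count in Counter(terms).items():
            inverted[term][doc_id] = count
    return dict(inverted)

def term_vector(doc_id, forward_index, vocabulary):
    """Count vector of one document over a fixed vocabulary order."""
    counts = Counter(forward_index[doc_id])
    return [counts[term] for term in vocabulary]

inverted_index = build_inverted_index(docs)
vector_d1 = term_vector("D1", docs, ["Apple", "Pear", "Cat", "Tiger"])
```

Looking up `inverted_index["Cat"]` gives the postings for "Cat" without scanning every document, which is why retrieval systems index this way.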


Representing Text

• Preprocessing
  – Clean-up
    • remove formatting, tables, HTML…
  – Remove stopwords
    • the, of, to, a, in, and, that, for, is
  – Stem words
    • Porter Stemmer – heuristic
    • statistical, brute-force (lookup tables)
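A rough sketch of the three preprocessing steps follows. The stopword list is the one on the slide. `STEM_TABLE` is a tiny hypothetical lookup table standing in for the brute-force option, not the Porter Stemmer, and the tag-stripping regex is a simplification that will not handle all HTML.

```python
import re

STOPWORDS = {"the", "of", "to", "a", "in", "and", "that", "for", "is"}
# Hypothetical lookup table; a real system would use a full stemmer or dictionary.
STEM_TABLE = {"mining": "mine", "mined": "mine", "documents": "document"}

def preprocess(text):
    text = re.sub(r"<[^>]+>", " ", text)              # clean-up: strip HTML/XML tags
    tokens = re.findall(r"[a-z]+", text.lower())      # lowercase word tokens
    tokens = [t for t in tokens if t not in STOPWORDS]  # remove stopwords
    return [STEM_TABLE.get(t, t) for t in tokens]     # stem via table lookup

tokens = preprocess("<p>The mining of <b>documents</b> is fun</p>")
```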


Representing Text

• Preserving some meaning of the words:
  – Part of Speech tagging
  – Word Sense Disambiguation
  – Semantic annotation


EYE DROPS OFF SHELF
PROSTITUTES APPEAL TO POPE
KIDS MAKE NUTRITIOUS SNACKS
STOLEN PAINTING FOUND BY TREE
LUNG CANCER IN WOMEN MUSHROOMS
QUEEN MARY HAVING BOTTOM SCRAPED
DEALERS WILL HEAR CAR TALK AT NOON
MINERS REFUSE TO WORK AFTER DEATH
MILK DRINKERS ARE TURNING TO POWDER
DRUNK GETS NINE MONTHS IN VIOLIN CASE
JUVENILE COURT TO TRY SHOOTING DEFENDANT
COMPLAINTS ABOUT NBA REFEREES GROWING UGLY
PANDA MATING FAILS; VETERINARIAN TAKES OVER
MAN EATING PIRANHA MISTAKENLY SOLD AS PET FISH
ASTRONAUT TAKES BLAME FOR GAS IN SPACECRAFT
QUARTER OF A MILLION CHINESE LIVE ON WATER
INCLUDE YOUR CHILDREN WHEN BAKING COOKIES
OLD SCHOOL PILLARS ARE REPLACED BY ALUMNI
GRANDMOTHER OF EIGHT MAKES HOLE IN ONE
HOSPITALS ARE SUED BY 7 FOOT DOCTORS
LAWMEN FROM MEXICO BARBECUE GUESTS
TWO SOVIET SHIPS COLLIDE, ONE DIES
ENRAGED COW INJURES FARMER WITH AX
LACK OF BRAINS HINDERS RESEARCH
RED TAPE HOLDS UP NEW BRIDGE
SQUAD HELPS DOG BITE VICTIM
IRAQI HEAD SEEKS ARMS
HERSHEY BARS PROTEST


Representing Text

• Vector Space Model:

$D = (t_1, w_{d1};\ t_2, w_{d2};\ \dots;\ t_v, w_{dv})$

$w$: binary, count, TFIDF

Apple  Pear  Cat  Tiger  …
1      2     0    0      …


TFIDF
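The transcript does not preserve the formula from this slide, so the sketch below assumes one common variant: raw term frequency times log(N / document frequency). Other weighting schemes exist, and the function name `tfidf` is illustrative.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: {doc_id: [terms]} -> {doc_id: {term: weight}}.

    Assumed weighting: tf(t, d) * log(N / df(t)).
    """
    n_docs = len(docs)
    df = Counter()                      # number of documents containing each term
    for terms in docs.values():
        df.update(set(terms))
    weights = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        weights[doc_id] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return weights

weights = tfidf({
    "D1": ["Apple", "Pear", "Pear"],
    "D2": ["Cat", "Dog"],
    "D3": ["Cat", "Cat", "Tiger"],
})
```

In this toy collection "Cat" appears in two of three documents, so in D2 it gets a lower weight than "Dog", which appears in only one.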


Problems

• Synonymy
  – multiple words that have similar meanings

• Polysemy
  – words that have more than one meaning


Latent Semantic Indexing

• Index by the hidden “meaning” of text

“words that are used in the same contexts tend to have similar meanings”

• using Singular Value Decomposition
  – a linear algebra technique for factorization of matrices


Latent Semantic Indexing

$X = R\,S\,T^{T}$

distribution of terms for a concept (concept language model)

distribution of concepts in a document

importance of each concept


Latent Semantic Indexing

1. Index using concepts instead of terms

2. Query represented like another document

3. Retrieve documents “closest” to query
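The three steps can be written out as equations. This is a sketch under two assumptions the slides do not state: that $X$ is a term-by-document matrix (so the rows of $R$ correspond to terms), and that only the $k$ largest singular values are kept. The hatted symbols and the subscript $k$ are notation introduced here.

```latex
% Step 1: truncate the decomposition to k concepts
X \approx X_k = R_k \, S_k \, T_k^{T}

% Step 2: fold the query term vector q into concept space, like a new document
\hat{q} = S_k^{-1} \, R_k^{T} \, q

% Step 3: rank documents by cosine similarity, where \hat{d}_j is the j-th row of T_k
\mathrm{sim}(\hat{q}, \hat{d}_j) =
  \frac{\hat{q} \cdot \hat{d}_j}{\lVert \hat{q} \rVert \, \lVert \hat{d}_j \rVert}
```

The folding formula follows from the decomposition itself: a document column $x_j = R_k S_k \hat{d}_j$ gives $\hat{d}_j = S_k^{-1} R_k^{T} x_j$, and the query is treated the same way.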


Latent Semantic Analysis

• Document categorization (plagiarism)
• Comparing terms (synonymy)
• Works with any language
• Tolerant of noise (misspellings)

• Faults:
  – requires lots of memory
  – how many concepts should we use?


Probabilistic Text Retrieval

[http://nlp.stanford.edu]

Language Model

Generative Model


Probabilistic Text Retrieval

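using chain rule:
$P(t_1 t_2 t_3 t_4) = P(t_1)\,P(t_2 \mid t_1)\,P(t_3 \mid t_2 t_1)\,P(t_4 \mid t_3 t_2 t_1)$

unigram language model:
$P(t_1 t_2 t_3 t_4) = P(t_1)\,P(t_2)\,P(t_3)\,P(t_4)$

bigram language model:
$P(t_1 t_2 t_3 t_4) = P(t_1)\,P(t_2 \mid t_1)\,P(t_3 \mid t_2)\,P(t_4 \mid t_3)$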
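To make the two approximations concrete, here is a small sketch that estimates both models by relative frequency from a made-up nine-word corpus. No smoothing is applied, so unseen bigrams get probability zero, and a first word that never occurs in the corpus would cause a division by zero.

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()   # illustrative toy corpus
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_unigram(sequence):
    """P(t1..tn) as the product of P(ti), estimated by relative frequency."""
    p = 1.0
    for t in sequence:
        p *= unigrams[t] / len(corpus)
    return p

def p_bigram(sequence):
    """P(t1..tn) as P(t1) times the product of P(ti | previous term)."""
    p = unigrams[sequence[0]] / len(corpus)
    for prev, cur in zip(sequence, sequence[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p

p_uni = p_unigram(["the", "cat", "sat"])
p_bi = p_bigram(["the", "cat", "sat"])
```

The bigram model scores "the cat sat" higher than the unigram model because it uses the fact that "cat" often follows "the" in this corpus.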


Probabilistic Text Retrieval

• Query likelihood model

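Each document $d$ has a language model $M_d$

$P(d \mid q) = P(q \mid d)\,P(d) / P(q)$

Naïve Bayes with each document as a class

$P(q \mid M_d) \approx \prod_{t \in V} P(t \mid M_d)^{tf(t,q)}$

where $tf(t,q)$ is the number of times term $t$ occurs in the query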


Probabilistic Text Retrieval

• Estimating $P(t \mid M_d)$:

$P(t \mid M_d) = tf(t,d) / \mathrm{Length}_d$

• Prior for terms not appearing in the document (smoothing):

$P'(t \mid M_d) = \mathrm{collectionFreq}(t) / \mathrm{collectionSize}$


Probabilistic Text Retrieval

• In practice: mixture between document language model Md and collection language model Mc

$P(t \mid d) = \lambda P(t \mid M_d) + (1 - \lambda) P(t \mid M_c)$


Probabilistic Text Retrieval

• In summary:

$P(d \mid q) \propto P(d) \prod_{t \in q} \left(\lambda P(t \mid M_d) + (1 - \lambda) P(t \mid M_c)\right)$

• Rank the documents by $P(d \mid q)$
• Return the top few results
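The summary formula translates almost directly into code. The sketch below assumes a uniform prior P(d), so the prior drops out of the ranking, and uses λ = 0.5 purely as an illustrative default. The function name and the toy collection are not from the slides.

```python
from collections import Counter

def rank_by_query_likelihood(query, docs, lam=0.5):
    """Score each document by the product over query terms of
    lam * P(t|Md) + (1 - lam) * P(t|Mc), assuming a uniform prior P(d).
    Documents must be non-empty term lists."""
    collection = Counter()
    for terms in docs.values():
        collection.update(terms)
    collection_size = sum(collection.values())

    scores = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        score = 1.0
        for t in query:
            p_doc = tf[t] / len(terms)                 # tf(t,d) / Length_d
            p_coll = collection[t] / collection_size   # collection language model
            score *= lam * p_doc + (1 - lam) * p_coll
        scores[doc_id] = score
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores

docs = {
    "D1": ["Apple", "Pear", "Pear"],
    "D2": ["Cat", "Dog"],
    "D3": ["Cat", "Cat", "Tiger"],
}
ranking, scores = rank_by_query_likelihood(["Cat", "Tiger"], docs)
```

Because of the collection term, D1 still gets a small non-zero score even though it contains neither query word. For long queries, summing logarithms instead of multiplying avoids floating-point underflow.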


Extensions

• Latent Dirichlet Allocation++

[Figure: diagram with nodes labeled w, B (General English), topics T1–T5, and Md (document-specific topic)]


Modeling Text (an aside)

• Generate your own Computer Science paper:

http://pdos.csail.mit.edu/scigen


Text Retrieval

Information Need

Query

Text Collection

Search Results

Start Here


Query Expansion

• Fixing spelling errors
• Stemming
• Alternative query suggestion
  – Query log mining
• Synonyms from a thesaurus
  – Medical terms: MeSH (Medical Subject Headings)
  – Manually or automatically created thesauruses


Pseudo-relevance Feedback

1. Assume the top retrieved documents are relevant (pseudo-relevance feedback), OR ask the user to rate the returned documents (explicit relevance feedback)

2. Extract important words from these documents

3. Append them to the query

4. Try again
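A minimal sketch of steps 1 to 3 follows. It assumes "important" simply means most frequent among the top documents, which is only one possible choice. `expand_query` and its parameters are hypothetical names.

```python
from collections import Counter

def expand_query(query, ranked_docs, top_k=2, n_terms=2):
    """Append the most frequent new terms from the top_k retrieved documents.

    ranked_docs: list of term lists, best match first, from any retrieval step.
    """
    counts = Counter()
    for terms in ranked_docs[:top_k]:              # step 1: assume these are relevant
        counts.update(t for t in terms if t not in query)
    new_terms = [t for t, _ in counts.most_common(n_terms)]   # step 2
    return list(query) + new_terms                 # step 3; step 4 re-runs retrieval

expanded = expand_query(
    ["cat"],
    [["cat", "tiger", "lion"], ["cat", "tiger"], ["dog", "bone"]],
    n_terms=1,
)
```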


Pseudo-relevance Feedback

• Rocchio algorithm for relevance feedback

$q_{opt} = \arg\max_{q}\,[\mathrm{sim}(q, C_r) - \mathrm{sim}(q, C_n)]$
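In practice the optimum is approximated by moving the query vector toward the centroid of the relevant documents and away from the centroid of the non-relevant ones. The sketch below uses that standard update. The weights α = 1.0, β = 0.75, γ = 0.15 are illustrative defaults rather than values from the slides, and reading $C_r$ and $C_n$ as the relevant and non-relevant sets is an assumption.

```python
def centroid(vectors, dim):
    """Component-wise mean of a list of equal-length vectors."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return alpha*q + beta*centroid(relevant) - gamma*centroid(nonrelevant)."""
    dim = len(query)
    c_r = centroid(relevant, dim)
    c_n = centroid(nonrelevant, dim)
    updated = [alpha * q + beta * r - gamma * n
               for q, r, n in zip(query, c_r, c_n)]
    return [max(0.0, w) for w in updated]   # negative weights are clipped to zero

new_query = rocchio([1, 0, 0],
                    relevant=[[1, 1, 0], [1, 1, 0]],
                    nonrelevant=[[0, 0, 1]])
```

Here the second term, shared by both relevant documents, enters the query, while the third term, seen only in the non-relevant document, stays at zero.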


Retrieval Evaluation

• Want
  – Results that address my information need best
  – These results should be on top of the returned list
  – Diverse set of results to choose from
  – Timely?

• Relevance
  – What user says it is


Retrieval Evaluation

• Mean Average Precision

[area under the Precision-Recall curve]
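One common way to compute it, sketched below, averages precision at each rank where a relevant document appears, then averages that over queries. The function names and document ids are illustrative.

```python
def average_precision(ranked, relevant):
    """Mean of precision@k over the ranks k at which a relevant doc appears."""
    hits, total = 0, 0.0
    for k, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked list, set of relevant ids), one pair per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

ap = average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"})
map_score = mean_average_precision([
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d2", "d1"], {"d1"}),
])
```

Relevant documents that are never retrieved still count in the denominator, so missing them lowers the score.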

Page 37: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS

Retrieval Evaluation

• In web search results users usually don’t look past the top 5 results
  – Use cutoff: Metric @ 5 or Metric @ 10

• Comparison between systems:
  – Control dataset, queries, relevance judgments
  – Text Retrieval Conference (TREC)
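As an example of a cutoff metric, precision at k can be sketched as follows. The document ids are made up.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top k results that are relevant."""
    return sum(1 for doc_id in ranked[:k] if doc_id in relevant) / k

p_at_5 = precision_at_k(["d1", "d2", "d3", "d4", "d5", "d6"],
                        {"d1", "d3", "d6"}, k=5)
```

The relevant document at rank 6 does not count toward the score at cutoff 5.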

Page 38: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS

Beyond Retrieval

• Named entity recognition
• Summarization
• Template filling
• Text categorization
• Sentiment analysis
• Taxonomy extraction
• Hypothesis formation
• Social network extraction/analysis

Page 39: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS

Text Categorization

• Spam detection
• News monitoring
• Faceted search
• Automated labeling
• Authorship attribution

Page 40: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS

Text Categorization

• Classes already known:
  – Naïve Bayes
  – SVM
  – k-Nearest Neighbor
  – Neural Nets

• Discovering classes:
  – kNN Clustering
  – LSI
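For the known-classes case, a multinomial Naïve Bayes classifier fits in a few lines. This is a sketch with add-one (Laplace) smoothing, which is an assumption since the slides do not specify a smoothing method. The training reviews are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (tokens, label). Returns the counts needed to classify."""
    class_docs = Counter()                 # documents per class
    term_counts = defaultdict(Counter)     # term counts per class
    vocab = set()
    for tokens, label in labeled_docs:
        class_docs[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, term_counts, vocab

def classify(tokens, model):
    """Return the class with the highest log prior plus summed log likelihoods."""
    class_docs, term_counts, vocab = model
    n_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        total = sum(term_counts[label].values())
        score = math.log(class_docs[label] / n_docs)
        for t in tokens:
            if t in vocab:   # words never seen in training are ignored
                score += math.log((term_counts[label][t] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train_naive_bayes([
    ("great phone love it".split(), "pos"),
    ("love the screen".split(), "pos"),
    ("terrible battery hate it".split(), "neg"),
    ("hate the screen broke".split(), "neg"),
])
label = classify("love this great screen".split(), model)
```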

Page 41: TEXT Mining & Retrieval. TEXT INFORMATION THOUGHTS OPINIONS FEELINGS STORIES DOCUMENTARIES NEWS LANGUAGES EVERY DAY LIFE SCIENCE DISCUSSIONS POLITICS

Text Categorization

[Figure: review classification pipeline with four numbered stages; labels: positive reviews, negative reviews, feature extraction, classifier training (models Mp and Mn), unlabeled reviews, classifying instances]

Who likes my product?
What features do they like?
Do people like my competitor’s product?
What experiences do people have with my product?


Text Summarization

• Information overload

• Article summaries
• Cliff Notes
• TV Guide
• Medical summary
• Document “preview”


Text Summarization

• Extraction
  – copying most important parts of the document

• Abstraction
  – paraphrasing sections of the document

• Single document vs multiple documents
• Generic vs query-focused


Text Summarization

• Finding important text:

  – position
  – cue phrase indicators
  – word/phrase frequency
  – query and title overlap
  – discourse structure criteria
  – formatting
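Two of these signals, word frequency and position, are enough for a toy extractive summarizer. The scoring below (average word frequency plus a bonus for the first sentence) is an illustrative assumption, not a method given in the slides, and the sentence splitter is deliberately naive.

```python
import re
from collections import Counter

def summarize(text, n_sentences=1):
    """Pick the n highest-scoring sentences, returned in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(index):
        tokens = re.findall(r"[a-z']+", sentences[index].lower())
        if not tokens:
            return 0.0
        frequency_score = sum(freq[t] for t in tokens) / len(tokens)
        position_bonus = 1.0 if index == 0 else 0.0   # lead-sentence heuristic
        return frequency_score + position_bonus

    best = sorted(range(len(sentences)), key=score, reverse=True)[:n_sentences]
    return [sentences[i] for i in sorted(best)]

text = ("Oil spill spreads in the Gulf. The weather was mild. "
        "Oil spill cleanup costs rise as the spill spreads.")
summary = summarize(text, n_sentences=2)
```

Stopwords are not removed here, so frequent function words inflate the scores. Filtering them first would be a natural refinement.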


Named Entity Extraction

People, locations, companies, events…

Alberto Maria Segre
Dr Segre
Professor Segre
alberto
AMS
A. M. Segre


Named Entity Extraction

• Vocabulary matching
  – Problem: vocabulary transfer

• Rule-based
  – Regular expressions, rules of thumb

• Bootstrapping
  – Using “seed” NEs to find rules

• Machine learning
  – SVM, HMM, Decision Trees, Maximum Entropy…

[Nadeau & Sekine: Survey (2006)]
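As an illustration of the rule-based approach, the sketch below uses one regular expression: a title followed by optional initials and capitalised words. The title list and the pattern are hypothetical rules of thumb. They trade recall for precision, so untitled mentions such as "A. M. Segre" on their own are missed.

```python
import re

# Rule of thumb: a title, then optional initials, then one or more capitalised words.
PERSON = re.compile(
    r"\b(?:Dr|Professor|Mr|Ms)\.? (?:[A-Z]\. ?)*[A-Z][a-z]+(?: [A-Z][a-z]+)*"
)

def find_people(text):
    """Return all substrings matching the person-name rule."""
    return PERSON.findall(text)

people = find_people("Yesterday Dr Segre met Professor Alberto Segre and A. M. Segre.")
```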


Named Entity Extraction

• Features:
  – Case
  – Digit
  – Character
  – Punctuation
  – Morphology
  – Part-of-speech
  – Dictionary entry
  – Meta information
  – Corpus frequency

As Gulf spill spreads, blame game begins

When BP looks at the spreading oil slick in the Gulf of Mexico that now threatens flora, fauna and livelihoods along the coasts of Louisiana, Mississippi, Alabama and Florida, it's really seeing money floating away on the tide. That's why it may be trying to shift some of the blame for the massive undersea leak to Transocean, which was running the rig that exploded on April 20 and eventually sank, leaving one of the worst oil spills in history in its wake. "It wasn't our accident, but we are absolutely responsible for the oil, for cleaning it up, and that's what we intend to do," BP Group CEO Tony Hayward told NBC's "TODAY" show.

http://www.msnbc.msn.com/id/36917929/ns/business-us_business/
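Several of the listed features can be read straight off a token. The sketch below covers case, digit, punctuation, a crude morphology cue (the last two characters), and dictionary lookup. The `dictionary` argument is a hypothetical gazetteer. Part-of-speech and corpus-frequency features would need extra resources and are left out.

```python
import string

def token_features(token, dictionary=frozenset()):
    """Surface features of one token for a named-entity classifier."""
    return {
        "is_capitalized": token[:1].isupper(),
        "is_all_caps": token.isupper(),
        "has_digit": any(c.isdigit() for c in token),
        "has_punctuation": any(c in string.punctuation for c in token),
        "suffix": token[-2:].lower(),
        "in_dictionary": token.lower() in dictionary,
    }

features = token_features("BP", dictionary={"bp", "transocean"})
```

A machine-learning tagger would compute such a dictionary for every token in the article above and feed it to the classifier.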


Entity Disambiguation

Entities can be referred to differently

Alberto Maria Segre
Dr Segre
Professor Segre
alberto
AMS
A. M. Segre
AI Professor

Masters students adviser at UI


Entity Disambiguation

• Rules
  – Name use, emails, greetings, templates

• Outside sources
  – Wikipedia, ontologies, dictionaries…

• Entity profiles
  – Context


Web Mining

• Peculiarities:
  – Linked structure
  – Multimedia
  – Spam
  – Huge dataset
  – Much used
  – Variety of topics
  – Variety of authors


Web Crawling

• Selection policy: which web pages to crawl?

• Focused crawlers
  – Relevance to the query

• Exploratory crawlers
  – Depth-first, breadth-first, URL, anchor text, quality of in-link, number of in-links, PageRank
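A breadth-first selection policy amounts to a queue-based frontier. The sketch below crawls a small in-memory link graph instead of fetching real pages. `get_links` is a hypothetical callback that a real crawler would implement with HTTP requests and HTML parsing.

```python
from collections import deque

def crawl_breadth_first(seed, get_links, max_pages=100):
    """Visit pages in breadth-first order; get_links(url) returns outgoing URLs."""
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()      # frontier.pop() would give a depth-first-style order
        order.append(url)
        for link in get_links(url):
            if link not in seen:      # never enqueue the same URL twice
                seen.add(link)
                frontier.append(link)
    return order

# Hypothetical link graph standing in for real pages.
web = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["a"]}
order = crawl_breadth_first("a", lambda url: web.get(url, []))
```

Replacing the queue with a priority queue keyed on relevance, in-link count, or PageRank gives the other selection policies listed above.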


Web Crawling

PageRank measures the importance of a page

important pages point to other important pages

number of times you visit a page on a random walk
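The random-walk view leads to a simple power-iteration sketch. The damping factor 0.85 (the chance of following a link rather than jumping to a random page) and the fixed iteration count are illustrative assumptions, as is the four-page graph.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}; every page must appear as a key."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}   # random jump share
        for page, outgoing in links.items():
            targets = outgoing or pages      # a page with no links jumps anywhere
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]})
```

Page "c" ends up with the highest score because three pages link to it, while "d", which nothing links to, keeps only the random-jump share.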


Web Crawling

• Other policy considerations
  – re-visit policy
  – politeness policy (robots.txt)
    • Robots Exclusion Protocol
  – parallelization policy

• Identify yourself as a bot
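Python's standard library includes a robots.txt parser, so a politeness check can be sketched without third-party code. The bot name and the robots.txt content below are made up. A live crawler would load the file from the site with `set_url()` and `read()` instead of `parse()`.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleCourseBot"   # hypothetical bot name, sent with every request

robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())   # parse rules from an in-memory copy

allowed = parser.can_fetch(USER_AGENT, "http://example.com/index.html")
blocked = parser.can_fetch(USER_AGENT, "http://example.com/private/data.html")
```

The crawler should call `can_fetch` before every request and skip any URL for which it returns False.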


Web Graph Mining

• Authority (search results)
• Overview sites
• Social analysis
• Relationships between topics (site maps)


Web Content Mining

• Sociology
• Epidemiology
• Marketing
• Disaster detection
• Finding people
• Finding information
• …