HICSS Keynote Talk, Jan 2008 © Padhraic Smyth, UC Irvine: From Gauss to Google: Data Analysis in the Digital Age. Padhraic Smyth, Department of Computer Science, University of California, Irvine


From Gauss to Google: Data Analysis in the Digital Age

Padhraic Smyth

Department of Computer Science

University of California, Irvine


Opening Comments

• From Gauss to Google
  – Not just Gauss…
  – Not just Google…

• Broad interpretation of “Web data”, e.g., will include email, etc.

• Many topics in Web data analysis will not be discussed

• Data mining, machine learning, and statistics?
  – All pursuing the same goals, but with different agendas/biases


The Internet Archive

Non-profit organization with a broad goal to crawl and archive the Web

As of June 2007:
- 96 billion Web pages archived since Oct 1996
- 49 billion unique documents
- 500 terabytes of data

Source: Internet Archive ACM/IEEE JCDL Conference Tutorial, June 2007


Computer Architecture 101

[Diagram: CPU, RAM, Disk]


How Far Away are the Data?

[Diagram: CPU, RAM, Disk]

Random Access Times
- RAM: 10^-8 seconds
- Disk: 10^-3 seconds


How Far Away are the Data?

[Diagram: CPU, RAM, Disk]

Effective Distances
- RAM: 1 meter
- Disk: 100 km
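The two distances follow from scaling both access times by the same factor. A minimal sketch of that arithmetic, assuming the slides' rough figures of 10^-8 seconds for RAM and 10^-3 seconds for disk, and assuming RAM is pinned at 1 meter; the function name is hypothetical:

```python
def effective_distance_m(access_time_s, ref_time_s=1e-8, ref_distance_m=1.0):
    """Scale an access time to a distance, with the reference time mapped to 1 meter."""
    return ref_distance_m * access_time_s / ref_time_s

ram_m = effective_distance_m(1e-8)    # RAM: 1 meter by construction
disk_m = effective_distance_m(1e-3)   # disk: about 1e5 times farther
print(ram_m, "m;", disk_m / 1000, "km")
```

The ratio of the two access times is about 10^5, which is what turns 1 meter into roughly 100 km.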


What do we mean by “Web data”?

• Data
  – Objects of interest
  – Measurements we can make on objects

• Examples
  – Object = Web document
    • Measurements = text content, traffic, edit history, …
  – Object = Network
    • Measurements = nodes, links, time-stamps, content, …
  – Object = Human
    • Measurements = browsing behavior, queries, demographics, …


Data as a matrix…
Rows -> objects
Columns -> measurements

ID      Income    Age   …   Monthly Debt   Good Risk?
18276   65,000    55    …   2200           Yes
72514   28,000    19    …   1500           No
28163   120,000   62    …   1800           Yes
17265   90,000    35    …   4500           No
…       …         …     …   …              …
61524   35,000    22    …   900            Yes

In fact Web data is very different:
- sequential record of events per user
- vastly different amounts of data per user
- many categorical variables (e.g., query terms)
- and so on…


Email Network over 3 months from Hewlett Packard Research Labs

Example Research Question: What is the “best” way to detect significant changes in such a network over time?


Discovering Organizational Structure from Email Network

O’Madadhain and Smyth, 2005


Networks of Instant Messagers

Leskovec and Horvitz, 2007

• Network Data
  – 240 million IM users over 1 month
  – 1 billion conversations per day
  – 1.3 billion edges in the graph

Example Research Question: How do these spatial patterns depend on social and economic factors?


Linking Demographics with IM Usage

Leskovec and Horvitz, 2007


Query Data (source: Dan Russell, Google)

Research Question: Predict the age and gender of an individual given their query history

More difficult: Predict how many people are using one account, and their ages and genders


Eye-Tracking: The Golden Triangle for Search

from Hotchkiss, Alston, Edwards, 2005; Enquiro Research

Research Question: Build a probabilistic model that characterizes these patterns at individual and population levels


The State of Web Data

• Text Content, Networks, and Human Behavior

• Complex

• Non-stationary

• Observational versus experimental

• Measurement is non-trivial

• Vast Scale

So should we just forget about statistics? Do we need fundamentally new ways to analyze this type of data?


Key Ideas from Statistics

• Regularity in the aggregate
  – The Normal curve and central limit theorem
  – Ubiquity of power-laws

• …but diversity in individual behavior
  – extremes are prevalent in very large data sets

• Observed versus unobserved variables
  – Using unobserved variables to explain observed data


1800: The Birth of Modern Statistical Thinking

• Before 1800
  – Long history of inferring patterns from data – but largely ad hoc
  – More recent history of probability – limited to games of chance

• Around 1800
  – New data analysis problems in science (astronomy), commerce (navigation), and social sciences
  – Realization of the importance of deriving a systematic approach to data analysis
  – Work of Laplace, Legendre, Gauss, etc., was fundamental

Source: Stephen Stigler, The History of Statistics: The Measurement of Uncertainty before 1900, Harvard University Press, 1986.


A Simple Example from 1800

y_1 = x + e_1

y_2 = x + e_2

y_3 = x + e_3

…

• e.g., astronomy: taking measurements from a telescope
  – y_i = observed position of a planet in the sky for measurement i
  – x = the true position
  – e_i = random measurement error

• Combining multiple measurements: major open problem in 1800


A Simple Example from 1800

y_1 = x + e_1

y_2 = x + e_2

y_3 = x + e_3

…

• Key insights from Laplace, Legendre, Gauss:
  – If e’s are normal/Gaussian, we can estimate x by least-squares
  – We can also make statements about P(x | {y}, e)
  – We can generalize to multiple variables
    • y = x + v + z + … + e
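For the single-unknown model above (each measurement is the true value x plus an error), the least-squares estimate is the value of x that minimizes the sum of squared differences to the measurements, which works out to the sample mean. A minimal sketch, using made-up measurements and hypothetical function names:

```python
import statistics

def least_squares_estimate(ys):
    """Return the x minimizing sum((y - x)**2 for y in ys), i.e. the sample mean."""
    return sum(ys) / len(ys)

def sum_sq_error(ys, x):
    """Sum of squared residuals for a candidate value x."""
    return sum((y - x) ** 2 for y in ys)

# Hypothetical repeated measurements of one planetary position (arbitrary units).
ys = [10.2, 9.8, 10.1, 9.9, 10.4]
x_hat = least_squares_estimate(ys)

# Under independent Gaussian errors, the uncertainty in x_hat shrinks like 1/sqrt(n).
std_err = statistics.stdev(ys) / len(ys) ** 0.5
print(x_hat, std_err)
```

Nudging x away from x_hat in either direction increases sum_sq_error, which is a quick way to check the estimate.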


Probabilistic Model

Observed Data

y = x + e


Probabilistic Model

Observed Data

y = x + e

Least squares


The Average Man

• Early applications of statistical thinking were restricted to scientific problems where x’s were physical quantities

• 1835: enter Adolphe Quetelet
  – L’homme moyen – the average man
  – We can apply ideas like Normal curves to human characteristics and behavior
    • Heights, birth rates, growth curves
  – Introduced statistical thinking to social sciences


Key Concepts from Quetelet

• Conditional dependence
  – E.g., P(height | male) versus P(height | female)

• Latent hidden variables
  – E.g., tendency to commit a crime

• The regularity of human behavior: “The constancy with which the same crimes repeat themselves every year with the same frequency … is one of the most curious facts we learn from the statistics of the courts;”

Do we see such regularities in Web data?


[Figure: Histogram of session length for visitors to department Web site over 1 week (robots removed), on a log-log scale. x-axis: Session Length L (10^0 to 10^2); y-axis: Empirical Frequency of L (10^-6 to 10^0).]
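A plot like this can be produced by counting how often each session length occurs, normalizing the counts, and taking logarithms of both axes. The sketch below assumes synthetic sessions (not the talk's data) whose counts fall off as 1/L^2, and fits a straight line to the log-log points as a rough way to read off a power-law exponent. The function names are hypothetical, and a least-squares fit on log-log points is only a crude estimator.

```python
import math
from collections import Counter

def empirical_frequency(lengths):
    """Map each session length L to its relative frequency in the sample."""
    counts = Counter(lengths)
    total = len(lengths)
    return {L: c / total for L, c in sorted(counts.items())}

def loglog_slope(freq):
    """Least-squares slope of log10(frequency) against log10(L)."""
    xs = [math.log10(L) for L in freq]
    ys = [math.log10(p) for p in freq.values()]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Synthetic sessions: lengths 1, 2, 4, 8 with counts 6400, 1600, 400, 100.
sessions = [L for L in (1, 2, 4, 8) for _ in range(6400 // L**2)]
freq = empirical_frequency(sessions)
slope = loglog_slope(freq)
print(freq, slope)
```

On a log-log plot a power law appears as a straight line, and the fitted slope is the exponent.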


Query Distribution of MSN and AOL search logs

from Adar, Weld, Bershad, Gribble, 2007


Conversation Duration for Instant Messenger Sessions

from Leskovec and Horvitz, 2007


Login and Logout Durations for Instant Messenger Sessions

from Leskovec and Horvitz, 2007

Regularities such as power-laws are abundant in Web data

Highly non-Gaussian

Aggregate behavior – very predictable

Individual behavior – much less so


Probabilistic Model

Observed Data

y = x + e

Least squares

Contribution of Gauss et al.


Probabilistic Model

Observed Data

P(data | model)

P(model | data)

Inverse Probability


UCLA, 1988: Judea Pearl and Graphical Models

• A “language” for modeling dependencies among sets of random variables

• Graphical model
  – Nodes = variables
  – Edges = direct dependencies


UCLA, 1988: Judea Pearl and Graphical Models

• A “language” for modeling dependencies among sets of random variables

• Graphical model
  – Nodes = variables
  – Edges = direct dependencies

• Leverages the idea of conditional independence

[Diagram: nodes Age, Reading Ability, Height]

Reading and height are modeled as conditionally independent given age

But if age is unknown, they are dependent!
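The reading/height example can be checked numerically. The sketch below assumes a graph in which Age is the common parent of Reading Ability and Height (the structure the conditional independence statement suggests) and uses made-up probabilities; all names and numbers are hypothetical.

```python
from itertools import product

# Hypothetical numbers: age in {child, adult}; reading and height are low/high.
p_age = {"child": 0.5, "adult": 0.5}
p_read_high = {"child": 0.2, "adult": 0.9}   # P(reading = high | age)
p_tall = {"child": 0.1, "adult": 0.8}        # P(height = high | age)

def joint(age, read_high, tall):
    """P(age, reading, height) = P(age) * P(reading | age) * P(height | age)."""
    pr = p_read_high[age] if read_high else 1 - p_read_high[age]
    ph = p_tall[age] if tall else 1 - p_tall[age]
    return p_age[age] * pr * ph

def marginal(read_high=None, tall=None):
    """Sum the joint over age and over any argument left as None."""
    total = 0.0
    for age, r, h in product(p_age, (True, False), (True, False)):
        if read_high in (None, r) and tall in (None, h):
            total += joint(age, r, h)
    return total

# Age unknown: the joint differs from the product of marginals (dependent).
p_both = marginal(read_high=True, tall=True)
p_product = marginal(read_high=True) * marginal(tall=True)

# Age known: the joint factorizes (conditionally independent).
p_both_given_adult = joint("adult", True, True) / p_age["adult"]
print(p_both, p_product, p_both_given_adult)
```

With age marginalized out, knowing someone reads well makes it more likely they are an adult and therefore tall, which is why the two variables become dependent.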


Classifying Documents, e.g., Spam Email Filtering

[Diagram: Class node with child nodes Word 1, …, Word i, …, Word n]

Class = {spam, non-spam}

“All models are wrong, but some are useful” from G. E. P. Box


Specifying the Forward Model

[Diagram: Class node with child nodes Word 1, …, Word i, …, Word n]

The parameters of the model are conditional probabilities, e.g., P(w = free | class = spam)


Specifying the Forward Model

[Diagram: Class node with child nodes Word 1, …, Word i, …, Word n]

P(w | class, parameters)


A Compact Notation: Plates

[Diagram: Class node with child node w_i inside a plate, i = 1:n]

Plate = replicates of a node

Nodes within plates are conditionally independent given parent nodes


Plates with Multiple Documents

[Diagram: nodes Class_j and w_i; word plate i = 1:n inside document plate j = 1:D]

Assumes documents are conditionally independent given model parameters


Learning Parameters

[Diagram: nodes Class_j and w_i; word plate i = 1:n inside document plate j = 1:D]

The parameters are in fact unknown

Use “inverse probability” (Bayes rule) to learn the parameters

In essence, information flows from the observed nodes to the unobserved


Being Bayesian

[Diagram: nodes Class_j and w_i with an added Prior node; word plate i = 1:n inside document plate j = 1:D]

Priors can help smooth out data-driven estimates, e.g., dictionary-derived


Making Predictions

[Diagram: nodes Class_j and w_i; word plate i = 1:n inside document plate j = 1:D]

Now we have new documents where “class” is unknown

Again use inverse probability (Bayes rule)
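The steps on the preceding slides (specify the forward model, learn the parameters, smooth with a prior, then predict the class of a new document) can be sketched as a small naive Bayes classifier. This is an illustration under stated assumptions rather than the talk's implementation: the toy documents are invented, the function names are hypothetical, and the prior is assumed to be a simple pseudo-count alpha added to every word count.

```python
import math
from collections import Counter

def train(docs, labels, vocab, alpha=1.0):
    """Estimate P(class) and P(word | class), smoothed by a pseudo-count alpha."""
    classes = sorted(set(labels))
    prior = {c: labels.count(c) / len(labels) for c in classes}
    word_prob = {}
    for c in classes:
        counts = Counter(w for d, y in zip(docs, labels) if y == c for w in d)
        total = sum(counts[w] for w in vocab) + alpha * len(vocab)
        word_prob[c] = {w: (counts[w] + alpha) / total for w in vocab}
    return prior, word_prob

def posterior(doc, prior, word_prob):
    """Bayes rule: P(class | words) is proportional to P(class) * prod P(w | class)."""
    log_score = {}
    for c in prior:
        known = [w for w in doc if w in word_prob[c]]   # skip out-of-vocabulary words
        log_score[c] = math.log(prior[c]) + sum(math.log(word_prob[c][w]) for w in known)
    top = max(log_score.values())
    unnorm = {c: math.exp(s - top) for c, s in log_score.items()}
    z = sum(unnorm.values())
    return {c: u / z for c, u in unnorm.items()}

# Toy training set (hypothetical); each document is a list of words.
docs = [["free", "offer", "free"], ["free", "money"],
        ["meeting", "today"], ["project", "meeting"]]
labels = ["spam", "spam", "non-spam", "non-spam"]
vocab = sorted({w for d in docs for w in d})

prior, word_prob = train(docs, labels, vocab)
post = posterior(["free", "offer"], prior, word_prob)
print(post)
```

Without the pseudo-count, any word never seen in a class would get probability zero and veto that class outright; the prior keeps the estimates away from that extreme.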


Why is Graphical Modeling Important?

• A systematic stochastic modeling framework
  – handles parameters, variables, and data

• Links modeling with computation
  – In other words, it links statistics and computer science

• Allows us to use computers to help build complex models


Graphical Model for Markov Chains

[Diagram: node c_i inside a plate labeled Pages]


Multiple Users…One Common Markov Chain

[Diagram: node c_i inside plates labeled Pages and Users]


Multiple Users…One Chain per User

[Diagram: node c_i inside plates labeled Pages and Users]


One Chain per Cluster of Users

[Diagram: nodes c_i and Cluster_j inside plates labeled Pages and Users]

Cadez, Meek, Heckerman, Smyth, 2003


Clusters of Probabilistic State Machines

[Diagram: two probabilistic state machines, labeled Cluster 1 and Cluster 2, each over states A, B, C, E]

Motivation: approximate the heterogeneity of Web surfing behavior
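One way to read the "one chain per cluster" idea: each cluster has its own first-order Markov chain over pages, and a user's session is assigned to clusters in proportion to how well each chain explains it. The sketch below shows only that scoring step. It assumes the sessions have already been grouped, ignores the initial-state probability, and uses invented sessions and hypothetical function names; learning the clusters themselves is not shown.

```python
from collections import defaultdict

def fit_chain(sessions, states, alpha=1.0):
    """Estimate transition probabilities P(next | current), smoothed by alpha."""
    counts = defaultdict(float)
    for s in sessions:
        for a, b in zip(s, s[1:]):
            counts[a, b] += 1
    trans = {}
    for a in states:
        row_total = sum(counts[a, b] for b in states) + alpha * len(states)
        trans[a] = {b: (counts[a, b] + alpha) / row_total for b in states}
    return trans

def likelihood(session, trans):
    """Probability of the observed transitions in one session under one chain."""
    p = 1.0
    for a, b in zip(session, session[1:]):
        p *= trans[a][b]
    return p

def cluster_posterior(session, chains, weights):
    """P(cluster | session) by Bayes rule over the per-cluster chains."""
    scores = [w * likelihood(session, t) for t, w in zip(chains, weights)]
    z = sum(scores)
    return [s / z for s in scores]

states = ["A", "B", "C", "E"]
# Hypothetical sessions, already grouped into two behavior types.
chain1 = fit_chain([list("ABABAB"), list("ABAB")], states)
chain2 = fit_chain([list("ACECEC"), list("CECE")], states)
post = cluster_posterior(list("ABAB"), [chain1, chain2], weights=[0.5, 0.5])
print(post)
```

A session that bounces between A and B is explained far better by the first chain, so nearly all of its posterior weight lands on that cluster.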


This is the sequence-mining algorithm in SQL Server


Statistical Text Mining

- NYT: 330,000 articles
- Enron: 250,000 emails
- Medline: 16 million articles
- NSF/NIH: 100,000 grants
- CiteSeer: 600,000 abstracts
- Pennsylvania Gazette: 80,000 articles, 1728-1800


Problems of Interest

– What topics do these documents “span”?

– Which documents are about a particular topic?

– How have topics changed over time?

– What does author X write about?

– Who is likely to write about topic Y?

– and so on…..


Collaborators in Text Research

- Mark Steyvers, UCI
- Chaitanya Chemudugunta, UCI
- Michal Rosen-Zvi, IBM
- Dave Newman, UCI
- Tom Griffiths, UC Berkeley


Probabilistic Model

Words in Documents

P(Data | Parameters)


Probabilistic Model

Words in Documents

P(Data | Parameters)

P(Parameters | Data)


The Multinomial Model for Words

[Diagram: word node w_i inside a plate labeled Words]


Multiple Documents: One Multinomial

[Diagram: word node w_i inside plates labeled Words and Documents]


One Multinomial per Document

[Diagram: word node w_i inside plates labeled Words and Documents]


Clusters of Documents

[Diagram: nodes z and w_i; plates labeled Words and Documents]


The Statistical Topic Model

[Diagram: nodes z and w_i; plates labeled Words and Documents]


The Statistical Topic Model

[Diagram: nodes z and w_i; plates labeled Words and Documents]

P(word | topic)

P(topic | doc)


Topic Models

Documents = mixtures of topics

Topics = probability distributions over words

• Model = joint distribution over words, topics, docs

• Answering queries = computing conditional probabilities

• Topics are learned completely automatically from data (no human intervention)
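The "mixture of topics" idea and the "answering queries = computing conditional probabilities" point can both be shown with a tiny hand-built model. The sketch below assumes two invented topics over a four-word vocabulary and one document's topic mixture; in practice these tables are learned from data, and the learning step is not shown here. All names and numbers are hypothetical.

```python
# Hypothetical parameters: 2 topics over a 4-word vocabulary.
p_word_given_topic = {
    "energy": {"power": 0.5, "market": 0.3, "game": 0.1, "team": 0.1},
    "sports": {"power": 0.1, "market": 0.1, "game": 0.4, "team": 0.4},
}
p_topic_given_doc = {"energy": 0.7, "sports": 0.3}   # one document's topic mixture

def p_word_given_doc(word):
    """P(word | doc) = sum over topics of P(word | topic) * P(topic | doc)."""
    return sum(p_word_given_topic[z][word] * p_topic_given_doc[z]
               for z in p_topic_given_doc)

def p_topic_given_word(word):
    """P(topic | word, doc) by Bayes rule: which topic most likely produced the word."""
    joint = {z: p_word_given_topic[z][word] * p_topic_given_doc[z]
             for z in p_topic_given_doc}
    total = sum(joint.values())
    return {z: v / total for z, v in joint.items()}

p_game = p_word_given_doc("game")
resp = p_topic_given_word("game")
print(p_game, resp)
```

Even though this document is mostly about the "energy" topic, the word "game" is still attributed mainly to "sports", because that topic gives the word much higher probability.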


Enron email data

250,000 emails

28,000 authors

1999-2002


Enron email: business topics

WORD                 PROB.   WORD          PROB.   WORD        PROB.   WORD           PROB.
FEEDBACK             0.0781  PROJECT       0.0514  FERC        0.0554  ENVIRONMENTAL  0.0291
PERFORMANCE          0.0462  PLANT         0.028   MARKET      0.0328  AIR            0.0232
PROCESS              0.0455  COST          0.0182  ISO         0.0226  MTBE           0.019
PEP                  0.0446  CONSTRUCTION  0.0169  COMMISSION  0.0215  EMISSIONS      0.017
MANAGEMENT           0.03    UNIT          0.0166  ORDER       0.0212  CLEAN          0.0143
COMPLETE             0.0205  FACILITY      0.0165  FILING      0.0149  EPA            0.0133
QUESTIONS            0.0203  SITE          0.0136  COMMENTS    0.0116  PENDING        0.0129
SELECTED             0.0187  PROJECTS      0.0117  PRICE       0.0116  SAFETY         0.0104
COMPLETED            0.0146  CONTRACT      0.011   CALIFORNIA  0.0110  WATER          0.0092
SYSTEM               0.0146  UNITS         0.0106  FILED       0.0110  GASOLINE       0.0086

SENDER               PROB.   SENDER        PROB.   SENDER      PROB.   SENDER         PROB.
perfmgmt             0.2195  ***           0.0288  ***         0.0532  ***            0.1339
perf eval process    0.0784  ***           0.022   ***         0.0454  ***            0.0275
enron announcements  0.0489  ***           0.0123  ***         0.0384  ***            0.0205
***                  0.0089  ***           0.0111  ***         0.0334  ***            0.0166
***                  0.0048  ***           0.0108  ***         0.0317  ***            0.0129

TOPIC 23   TOPIC 36   TOPIC 72   TOPIC 54


Enron: non-work topics…

WORD                  PROB.   WORD                PROB.   WORD               PROB.   WORD                PROB.
HOLIDAY               0.0857  TEXANS              0.0145  GOD                0.0357  AMAZON              0.0312
PARTY                 0.0368  WIN                 0.0143  LIFE               0.0272  GIFT                0.0226
YEAR                  0.0316  FOOTBALL            0.0137  MAN                0.0116  CLICK               0.0193
SEASON                0.0305  FANTASY             0.0129  PEOPLE             0.0103  SAVE                0.0147
COMPANY               0.0255  SPORTSLINE          0.0129  CHRIST             0.0092  SHOPPING            0.0140
CELEBRATION           0.0199  PLAY                0.0123  FAITH              0.0083  OFFER               0.0124
ENRON                 0.0198  TEAM                0.0114  LORD               0.0079  HOLIDAY             0.0122
TIME                  0.0194  GAME                0.0112  JESUS              0.0075  RECEIVE             0.0102
RECOGNIZE             0.019   SPORTS              0.011   SPIRITUAL          0.0066  SHIPPING            0.0100
MONTH                 0.018   GAMES               0.0109  VISIT              0.0065  FLOWERS             0.0099

SENDER                PROB.   SENDER              PROB.   SENDER             PROB.   SENDER              PROB.
chairman & ceo        0.131   cbs sportsline com  0.0866  crosswalk com      0.2358  amazon com          0.1344
***                   0.0102  houston texans      0.0267  wordsmith          0.0208  jos a bank          0.0266
***                   0.0046  houstontexans       0.0203  ***                0.0107  sharperimageoffers  0.0136
***                   0.0022  sportsline rewards  0.0175  doctor dictionary  0.0101  travelocity com     0.0094
general announcement  0.0017  pro football        0.0136  ***                0.0061  barnes & noble com  0.0089

TOPIC 109   TOPIC 66   TOPIC 182   TOPIC 113


Enron: public-interest topics...

WORD         PROB.   WORD          PROB.   WORD          PROB.   WORD        PROB.
POWER        0.0915  STATE         0.0253  COMMITTEE     0.0197  LAW         0.0380
CALIFORNIA   0.0756  PLAN          0.0245  BILL          0.0189  TESTIMONY   0.0201
ELECTRICITY  0.0331  CALIFORNIA    0.0137  HOUSE         0.0169  ATTORNEY    0.0164
UTILITIES    0.0253  POLITICIAN Y  0.0137  WASHINGTON    0.0140  SETTLEMENT  0.0131
PRICES       0.0249  RATE          0.0131  SENATE        0.0135  LEGAL       0.0100
MARKET       0.0244  BANKRUPTCY    0.0126  POLITICIAN X  0.0114  EXHIBIT     0.0098
PRICE        0.0207  SOCAL         0.0119  CONGRESS      0.0112  CLE         0.0093
UTILITY      0.0140  POWER         0.0114  PRESIDENT     0.0105  SOCALGAS    0.0093
CUSTOMERS    0.0134  BONDS         0.0109  LEGISLATION   0.0099  METALS      0.0091
ELECTRIC     0.0120  MOU           0.0107  DC            0.0093  PERSON Z    0.0083

SENDER       PROB.   SENDER        PROB.   SENDER        PROB.   SENDER      PROB.
***          0.1160  ***           0.0395  ***           0.0696  ***         0.0696
***          0.0518  ***           0.0337  ***           0.0453  ***         0.0453
***          0.0284  ***           0.0295  ***           0.0255  ***         0.0255
***          0.0272  ***           0.0251  ***           0.0173  ***         0.0173
***          0.0266  ***           0.0202  ***           0.0317  ***         0.0317

TOPIC 194   TOPIC 18   TOPIC 22   TOPIC 114


Examples of CiteSeer Topics

WORD         PROB.   WORD           PROB.   WORD           PROB.   WORD           PROB.
SPEECH       0.1134  PROBABILISTIC  0.0778  USER           0.2541  STARS          0.0164
RECOGNITION  0.0349  BAYESIAN       0.0671  INTERFACE      0.1080  OBSERVATIONS   0.0150
WORD         0.0295  PROBABILITY    0.0532  USERS          0.0788  SOLAR          0.0150
SPEAKER      0.0227  CARLO          0.0309  INTERFACES     0.0433  MAGNETIC       0.0145
ACOUSTIC     0.0205  MONTE          0.0308  GRAPHICAL      0.0392  RAY            0.0144
RATE         0.0134  DISTRIBUTION   0.0257  INTERACTIVE    0.0354  EMISSION       0.0134
SPOKEN       0.0132  INFERENCE      0.0253  INTERACTION    0.0261  GALAXIES       0.0124
SOUND        0.0127  PROBABILITIES  0.0253  VISUAL         0.0203  OBSERVED       0.0108
TRAINING     0.0104  CONDITIONAL    0.0229  DISPLAY        0.0128  SUBJECT        0.0101
MUSIC        0.0102  PRIOR          0.0219  MANIPULATION   0.0099  STAR           0.0087

AUTHOR       PROB.   AUTHOR         PROB.   AUTHOR         PROB.   AUTHOR         PROB.
Waibel_A     0.0156  Friedman_N     0.0094  Shneiderman_B  0.0060  Linsky_J       0.0143
Gauvain_J    0.0133  Heckerman_D    0.0067  Rauterberg_M   0.0031  Falcke_H       0.0131
Lamel_L      0.0128  Ghahramani_Z   0.0062  Lavana_H       0.0024  Mursula_K      0.0089
Woodland_P   0.0124  Koller_D       0.0062  Pentland_A     0.0021  Butler_R       0.0083
Ney_H        0.0080  Jordan_M       0.0059  Myers_B        0.0021  Bjorkman_K     0.0078
Hansen_J     0.0078  Neal_R         0.0055  Minas_M        0.0021  Knapp_G        0.0067
Renals_S     0.0072  Raftery_A      0.0054  Burnett_M      0.0021  Kundu_M        0.0063
Noth_E       0.0071  Lukasiewicz_T  0.0053  Winiwarter_W   0.0020  Christensen-J  0.0059
Boves_L      0.0070  Halpern_J      0.0052  Chang_S        0.0019  Cranmer_S      0.0055
Young_S      0.0069  Muller_P       0.0048  Korvemaker_B   0.0019  Nagar_N        0.0050

TOPIC 10   TOPIC 209   TOPIC 87   TOPIC 20


[Figure: Changing Trends in Computer Science. Topic probability (0 to 0.012) by year (1990 to 2002) for the topics INFORMATION RETRIEVAL and WWW.]


[Figure: Changing Trends in Computer Science. Topic probability (0 to 0.012) by year (1990 to 2002) for the topics OPERATING SYSTEMS, INFORMATION RETRIEVAL, WWW, and PROGRAMMING LANGUAGES.]


[Figure: Hot Topics: Machine Learning/Data Mining. Topic probability (1 to 8, ×10^-3) by year (1990 to 2002) for the topics REGRESSION, DATA MINING, and CLASSIFICATION.]


[Figure: Bayes Marches On. Topic probability (1.5 to 5.5, ×10^-3) by year (1990 to 2002) for the topics BAYESIAN, PROBABILITY, and STATISTICAL PREDICTION.]


[Figure: Interesting "Topics". Topic probability (0 to 0.012) by year (1990 to 2002) for three topics: FRENCH WORDS (LA, LES, UNE, NOUS, EST), MATH SYMBOLS (GAMMA, DELTA, OMEGA), and DARPA.]


Topic trends from New York Times

330,000 articles, 2000-2002

[Figure: three panels, one per topic, each plotted over time from Jan 2000 to Jan 2003.]

Tour-de-France: TOUR, RIDER, LANCE_ARMSTRONG, TEAM, BIKE, RACE, FRANCE

Quarterly Earnings: COMPANY, QUARTER, PERCENT, ANALYST, SHARE, SALES, EARNING

Anthrax: ANTHRAX, LETTER, MAIL, WORKER, OFFICE, SPORES, POSTAL, BUILDING

Page 71:

Pennsylvania Gazette

Size | Most likely words in topic | Label added
6.3% | country public great people men liberty many let life friend spirit government | Republicanism
5.7% | say might thing think without against own did know make well reason good | Rhetoric
4.9% | away servant reward old jacket whoever pair named paid run hat coat master | Runaways
4.1% | silk cotton ditto white black linen women cloth blue worsted men thread fine | Cloth for Sale
3.8% | acres good land meadow plantation containing sold tract miles well premise | Real Estate

Joint work with Sharon Block, UC Irvine History Dept
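A table like the one above can be derived from a fitted topic model's token-level assignments. The sketch below assumes, hypothetically, that "Size" is the share of all word tokens assigned to a topic and that the word list is ordered by how often each word is assigned to it. The slides do not confirm either definition. The labels in the last column are added by hand and are not computed.

```python
from collections import Counter


def summarize_topics(assignments, top_n=3):
    """Summarize topics from per-token topic assignments.

    assignments: iterable of (word, topic_id) pairs, one per token.
    Returns a list of (size, topic_id, top_words) tuples, largest topic
    first, where size is the fraction of all tokens in that topic.
    """
    word_counts = {}
    for word, topic in assignments:
        word_counts.setdefault(topic, Counter())[word] += 1
    total = sum(sum(c.values()) for c in word_counts.values())
    summary = []
    for topic, counts in word_counts.items():
        size = sum(counts.values()) / total
        # Most frequent words first; ties broken alphabetically.
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        summary.append((size, topic, [w for w, _ in ranked[:top_n]]))
    summary.sort(key=lambda row: -row[0])
    return summary


# Hypothetical toy assignments (not data from the Pennsylvania Gazette).
assignments = [
    ("land", 0), ("land", 0), ("acres", 0), ("meadow", 0),
    ("land", 0), ("acres", 0), ("silk", 1), ("cotton", 1),
]
summary = summarize_topics(assignments, top_n=2)
for size, topic, words in summary:
    print(f"{size:.1%} topic {topic}: {' '.join(words)}")
```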

Page 73:

Analyzing Austen novels

[SENTIMENT] (3.4%): felt comfort feeling feel spirit mind heart point moment ill letter beyond mother state never event evil fear impossible hope time idea left situation poor distress possible hour end loss relief dearest suffering

[Figure: six panels, one per novel (Emma, Mansfield Park, Northanger Abbey, Persuasion, Pride and Prejudice, Sense and Sensibility), each with a y-axis labeled "words" and an x-axis labeled "time".]

Page 74:

Applications

• Automatically building domain-specific browsers “on the fly”
  – Burns et al. (2007) constructed an interactive visual browser, based on topics, for papers at the Annual Society for Neuroscience Conference
  – Kumar (2006) developed a browser for 40,00 MEDLINE documents about schizophrenia
  – Others in development

• Automated indexing in digital libraries
  – REXA system uses topics to automatically index 1 million computer science papers (McCallum et al., U Mass)
  – California Digital Library (Newman et al., 2006)

• Exploratory analysis “beyond keywords”

Page 75:

Concluding Comments

• Web data analysis
  – Tremendous opportunities and interesting problems
  – Rich measurement of human behavior on a large scale
  – In terms of Web data analysis, it's about 1820-1850

• Probability and statistics remain highly relevant

• We need a new breed of “data scientist”
  – fluent in both computer science and statistics
  – not enough attention being paid to this in education

Page 76:

Further Reading

• Web Data Analysis
  – P. Baldi, P. Frasconi, and P. Smyth, Modeling the Internet and the Web: Probabilistic Methods and Algorithms, Wiley, 2003
  – S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2002

• Topic Modeling
  – M. Steyvers and T. Griffiths, Probabilistic topic models, 2006 (good introductory article, available from Mark Steyvers’ Web page)