HICSS Keynote Talk, Jan 2008 © Padhraic Smyth, UC Irvine:
From Gauss to Google: Data Analysis in the Digital Age
Padhraic Smyth
Department of Computer Science
University of California, Irvine
Opening Comments
• From Gauss to Google
  – Not just Gauss…
  – Not just Google…
• Broad interpretation of “Web data”, e.g., will include email, etc.
• Many topics in Web data analysis will not be discussed
• Data mining, machine learning, and statistics?
  – All pursuing the same goals, but with different agendas/biases
The Internet Archive
Non-profit organization with a broad goal to crawl and archive the Web
As of June 2007:
- 96 billion Web pages archived since Oct 1996
- 49 billion unique documents
- 500 terabytes of data
Source: Internet Archive ACM/IEEE JCDL Conference Tutorial, June 2007
Computer Architecture 101
[Diagram: CPU, RAM, Disk]
How Far Away are the Data?
Random Access Times
CPU to RAM: 10^-8 seconds
CPU to Disk: 10^-3 seconds
How Far Away are the Data?
Effective Distances
CPU to RAM: 1 meter
CPU to Disk: 100 km
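The distance analogy can be checked with a little arithmetic. The sketch below assumes that "effective distance" scales linearly with random access time, with RAM pinned at 1 meter; that scaling rule and the function name are assumptions made for illustration, not something the slides state.

```python
# Random access times taken from the slide (seconds).
RAM_ACCESS_S = 1e-8
DISK_ACCESS_S = 1e-3

# Assumption: effective distance is proportional to access time,
# and RAM is defined to be 1 meter away from the CPU.
RAM_DISTANCE_M = 1.0


def effective_distance_m(access_time_s: float) -> float:
    """Scale an access time to a distance, relative to RAM at 1 meter."""
    return RAM_DISTANCE_M * access_time_s / RAM_ACCESS_S


disk_distance_km = effective_distance_m(DISK_ACCESS_S) / 1000.0
print(f"Disk is roughly {disk_distance_km:.0f} km away")
```

The ratio of the two access times is 10^5, which is what turns 1 meter into 100 km.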
What do we mean by “Web data”?
• Data
  – Objects of interest
  – Measurements we can make on objects
• Examples
  – Object = Web document
    • Measurements = text content, traffic, edit history, …
  – Object = Network
    • Measurements = nodes, links, time-stamps, content, …
  – Object = Human
    • Measurements = browsing behavior, queries, demographics, …
Data as a matrix…
Rows -> objects
Columns -> measurements

ID      Income    Age   …   Monthly Debt   Good Risk?
18276   65,000    55    …   2200           Yes
72514   28,000    19    …   1500           No
28163   120,000   62    …   1800           Yes
17265   90,000    35    …   4500           No
…       …         …     …   …              …
61524   35,000    22    …   900            Yes

In fact Web data is very different:
- sequential record of events per user
- vastly different amounts of data per user
- many categorical variables (e.g., query terms)
- and so on…
Email Network over 3 months
from Hewlett Packard Research Labs
Example Research Question: What is the “best” way to detect significant changes in such a network over time?
Discovering Organizational Structure from Email Network
O’Madadhain and Smyth, 2005
Networks of Instant Messagers
Leskovec and Horvitz, 2007
• Network Data
  – 240 million IM users over 1 month
  – 1 billion conversations per day
  – 1.3 billion edges in the graph
Example Research Question: How do these spatial patterns depend on social and economic factors?
Linking Demographics with IM Usage
Leskovec and Horvitz, 2007
Query Data (source: Dan Russell, Google)
Research Question: Predict the age and gender of an individual given their query history
More difficult: Predict how many people are using one account, and their ages and genders
Eye-Tracking: The Golden Triangle for Search
from Hotchkiss, Alston, Edwards, 2005; Enquiro Research
Research Question: Build a probabilistic model that characterizes these patterns at individual and population levels
The State of Web Data
• Text Content, Networks, and Human Behavior
• Complex
• Non-stationary
• Observational versus experimental
• Measurement is non-trivial
• Vast Scale
So should we just forget about statistics? Do we need fundamentally new ways to analyze this type of data?
Key Ideas from Statistics
• Regularity in the aggregate
  – The Normal curve and central limit theorem
  – Ubiquity of power-laws
• …but diversity in individual behavior
  – extremes are prevalent in very large data sets
• Observed versus unobserved variables
  – Using unobserved variables to explain observed data
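The first two ideas can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: individual "behavior" is modeled as an exponential random variable with mean 1, and the group sizes are arbitrary choices, none of which come from the talk.

```python
import random
import statistics

random.seed(0)

N_PER_GROUP = 400   # hypothetical number of individuals per group
N_GROUPS = 2000     # hypothetical number of groups


def individual_value() -> float:
    # Assumption: a skewed, non-Gaussian individual measurement (mean 1).
    return random.expovariate(1.0)


draws = [[individual_value() for _ in range(N_PER_GROUP)]
         for _ in range(N_GROUPS)]

# Regularity in the aggregate: group averages cluster tightly around 1,
# with spread close to 1 / sqrt(N_PER_GROUP) as the CLT suggests.
group_means = [statistics.fmean(group) for group in draws]
spread_of_means = statistics.stdev(group_means)

# Diversity in individuals: with 800,000 draws, some values are far
# above the typical value of 1.
largest = max(max(group) for group in draws)

print(f"spread of group means: {spread_of_means:.3f}")
print(f"largest individual value: {largest:.1f}")
```

The averages are predictable even though individual draws are not, and the largest individual value grows as more data are collected.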
1800: The Birth of Modern Statistical Thinking
• Before 1800
  – Long history of inferring patterns from data – but largely ad hoc
  – More recent history of probability – limited to games of chance
• Around 1800
  – New data analysis problems in science (astronomy), commerce (navigation), and social sciences
  – Realization of the importance of deriving a systematic approach to data analysis
  – Work of Laplace, Legendre, Gauss, etc., was fundamental
Source: Stephen Stigler, The History of Statistics: The Measurement of Uncertainty before 1900, Harvard University Press, 1986.
A Simple Example from 1800
y_1 = x + e_1
y_2 = x + e_2
y_3 = x + e_3
…
• e.g., astronomy: taking measurements from a telescope
  y_i = observed position of a planet in the sky for measurement i
  x = the true position
  e_i = random measurement error
• Combining multiple measurements: major open problem in 1800
• Key insights from Laplace, Legendre, Gauss:
  – If e’s are normal/Gaussian, we can estimate x by least-squares
  – We can also make statements about P(x | {y}, e)
  – We can generalize to multiple variables
    • y = x + v + z + … + e
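For the repeated-measurement model y_i = x + e_i, the least-squares estimate of x is the sample mean of the measurements, since setting the derivative of the sum of squared residuals to zero gives x = (1/n) Σ y_i. The sketch below checks this numerically; the true position, the noise scale, and the number of measurements are hypothetical values chosen for illustration.

```python
import random

random.seed(1)

TRUE_X = 10.0    # hypothetical true position
NOISE_SD = 0.5   # hypothetical standard deviation of Gaussian error
N_MEASUREMENTS = 50

# Simulated measurements y_i = x + e_i.
y = [TRUE_X + random.gauss(0.0, NOISE_SD) for _ in range(N_MEASUREMENTS)]


def sum_of_squares(x: float, ys: list) -> float:
    """Sum of squared residuals for a candidate value of x."""
    return sum((yi - x) ** 2 for yi in ys)


# Closed-form least-squares solution: the sample mean.
x_hat = sum(y) / len(y)

# Numeric sanity check: nudging x_hat in either direction does not help.
for delta in (-0.01, 0.01):
    assert sum_of_squares(x_hat, y) <= sum_of_squares(x_hat + delta, y)

print(f"least-squares estimate of x: {x_hat:.3f}")
```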
[Diagram: Probabilistic Model and Observed Data, linked by y = x + e and by least squares]
The Average Man
• Early applications of statistical thinking were restricted to scientific problems where x’s were physical quantities
• 1835: enter Adolphe Quetelet
  – L’homme moyen – the average man
  – We can apply ideas like Normal curves to human characteristics and behavior
    • Heights, birth rates, growth curves
  – Introduced statistical thinking to social sciences
Key Concepts from Quetelet
• Conditional dependence
  – E.g., P( height | male) versus P( height | female)
• Latent hidden variables
  – E.g., tendency to commit a crime
• The regularity of human behavior: “The constancy with which the same crimes repeat themselves every year with the same frequency … is one of the most curious facts we learn from the statistics of the courts;”
Do we see such regularities in Web data?
[Figure: Histogram of session length for visitors to department Web site over 1 week (robots removed), on a log-log scale. x-axis: Session Length L (10^0 to 10^2); y-axis: Empirical Frequency of L (10^-6 to 10^0)]
Query Distribution of MSN and AOL search logs
from Adar, Weld, Bershad, Gribble, 2007
Conversation Duration for Instant Messenger Sessions
from Leskovec and Horvitz, 2007
Login and Logout Durations for Instant Messenger Sessions
from Leskovec and Horvitz, 2007
Regularities such as power-laws are abundant in Web data
Highly non-Gaussian
Aggregate behavior – very predictable
Individual behavior – much less so
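One common way to quantify a power-law regularity is to estimate its exponent. The talk does not say how the plotted distributions were fit; the sketch below uses the standard maximum-likelihood estimator for a continuous power law, applied to synthetic data. The exponent, the lower cutoff, and the sample size are hypothetical.

```python
import math
import random

random.seed(2)

ALPHA_TRUE = 2.5   # hypothetical exponent: p(x) proportional to x^(-alpha)
X_MIN = 1.0        # hypothetical lower cutoff
N_SAMPLES = 20000


def sample_power_law(alpha: float, x_min: float) -> float:
    """Draw one value by inverse-transform sampling."""
    u = random.random()  # uniform on [0, 1), so 1 - u is never zero
    return x_min * (1.0 - u) ** (-1.0 / (alpha - 1.0))


xs = [sample_power_law(ALPHA_TRUE, X_MIN) for _ in range(N_SAMPLES)]

# Maximum-likelihood estimate of the exponent for x >= X_MIN.
alpha_hat = 1.0 + len(xs) / sum(math.log(x / X_MIN) for x in xs)

print(f"estimated exponent: {alpha_hat:.2f}")
```

The exponent, an aggregate quantity, is recovered accurately even though individual samples range over several orders of magnitude.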
[Diagram: Probabilistic Model and Observed Data, linked by y = x + e and by least squares]
Contribution of Gauss et al.
[Diagram: Probabilistic Model and Observed Data, linked by P(data | model) and P(model | data)]
Inverse Probability
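Inverse probability is what is now usually called Bayes rule: P(model | data) is proportional to P(data | model) × P(model). A minimal sketch with two candidate models follows; the coin-tossing setup, the two bias values, and the equal prior are hypothetical and chosen only to keep the arithmetic small.

```python
from math import comb

# Two hypothetical models for a coin, with equal prior probability.
models = {"fair": 0.5, "biased": 0.8}   # P(heads) under each model
prior = {"fair": 0.5, "biased": 0.5}

# Hypothetical observed data: 8 heads in 10 tosses.
HEADS, TOSSES = 8, 10


def likelihood(p_heads: float, heads: int, tosses: int) -> float:
    """Forward model: binomial probability of the data, P(data | model)."""
    return comb(tosses, heads) * p_heads ** heads * (1 - p_heads) ** (tosses - heads)


# Inverse probability: P(model | data) via Bayes rule.
joint = {name: likelihood(p, HEADS, TOSSES) * prior[name]
         for name, p in models.items()}
evidence = sum(joint.values())
posterior = {name: value / evidence for name, value in joint.items()}

for name, prob in posterior.items():
    print(f"P({name} | data) = {prob:.3f}")
```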
UCLA, 1988: Judea Pearl and Graphical Models
• A “language” for modeling dependencies among sets of random variables
• Graphical model
  – Nodes = variables
  – Edges = direct dependencies
• Leverages the idea of conditional independence
[Diagram: Age → Reading Ability; Age → Height]
Reading and height are modeled as conditionally independent given age
But if age is unknown, they are dependent!
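The age, reading, and height example can be simulated to see both effects. In the sketch below every number (age range, growth rates, noise levels) is a hypothetical choice; the only structural assumption is the one on the slide, that age drives both height and reading ability and nothing else links them.

```python
import math
import random

random.seed(3)


def pearson(a: list, b: list) -> float:
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


children = []
for _ in range(5000):
    age = random.randint(5, 15)                      # years
    height = 80 + 6 * age + random.gauss(0, 5)       # cm, depends on age only
    reading = 10 * age + random.gauss(0, 10)         # score, depends on age only
    children.append((age, height, reading))

# Age unknown: height and reading look strongly related.
overall = pearson([h for _, h, _ in children], [r for _, _, r in children])

# Age known (fixed at 10): the relationship disappears.
ten_year_olds = [(h, r) for a, h, r in children if a == 10]
within_age = pearson([h for h, _ in ten_year_olds],
                     [r for _, r in ten_year_olds])

print(f"correlation, age unknown: {overall:.2f}")
print(f"correlation, age = 10:    {within_age:.2f}")
```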
Classifying Documents, e.g., Spam Email Filtering
[Diagram: Class → w_1 (Word 1), …, w_i (Word i), …, w_n (Word n)]
Class = {spam, non-spam}
“All models are wrong, but some are useful” from G. E. P. Box
Specifying the Forward Model
[Diagram: Class → w_1 (Word 1), …, w_i (Word i), …, w_n (Word n)]
The parameters of the model are conditional probabilities, e.g., P(w = free | class = spam)
P(w | class, parameters)
A Compact Notation: Plates
[Plate diagram: Class → w_i, with w_i inside a plate i = 1:n]
Plate = replicates of a node
Nodes within plates are conditionally independent given parent nodes
Plates with Multiple Documents
[Plate diagram: Class_j → w_i; plates i = 1:n (words) and j = 1:D (documents)]
Assumes documents are conditionally independent given model parameters
Learning Parameters
[Plate diagram: Class_j → w_i; plates i = 1:n (words) and j = 1:D (documents)]
The parameters are in fact unknown
Use “inverse probability” (Bayes rule) to learn the parameters
In essence, information flows from the observed nodes to the unobserved
Being Bayesian
[Plate diagram: Class_j → w_i; plates i = 1:n (words) and j = 1:D (documents); a Prior node is added]
Priors can help smooth out data-driven estimates, e.g., dictionary-derived
Making Predictions
[Plate diagram: Class_j → w_i; plates i = 1:n (words) and j = 1:D (documents)]
Now we have new documents where “class” is unknown
Again use inverse probability (Bayes rule)
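The spam-filtering slides describe what is commonly called a naive Bayes classifier: words are conditionally independent given the class, parameters are learned from labeled documents, a prior smooths the estimates, and Bayes rule predicts the class of a new document. The sketch below is a toy version under stated assumptions: the four training documents are invented, and the prior is a simple pseudo-count of 1 per word rather than the dictionary-derived prior mentioned in the talk.

```python
import math
from collections import Counter

# Hypothetical labeled training documents.
train = [
    ("spam", "free money free offer"),
    ("spam", "win free prize now"),
    ("non-spam", "meeting agenda for monday"),
    ("non-spam", "project report and meeting notes"),
]

PSEUDO_COUNT = 1.0  # assumption: prior adds one imaginary count per word

# Learning: count classes and words per class.
class_counts = Counter(label for label, _ in train)
word_counts = {label: Counter() for label in class_counts}
for label, text in train:
    word_counts[label].update(text.split())
vocab = {word for counts in word_counts.values() for word in counts}


def log_p_word(word: str, label: str) -> float:
    """Smoothed estimate of log P(w = word | class = label)."""
    total = sum(word_counts[label].values())
    return math.log((word_counts[label][word] + PSEUDO_COUNT)
                    / (total + PSEUDO_COUNT * len(vocab)))


def posterior(text: str) -> dict:
    """P(class | words) by Bayes rule; words outside the vocabulary are skipped."""
    log_joint = {}
    for label in class_counts:
        total = math.log(class_counts[label] / len(train))
        for word in text.split():
            if word in vocab:
                total += log_p_word(word, label)
        log_joint[label] = total
    top = max(log_joint.values())
    unnormalized = {k: math.exp(v - top) for k, v in log_joint.items()}
    norm = sum(unnormalized.values())
    return {k: v / norm for k, v in unnormalized.items()}


print(posterior("free prize offer"))
print(posterior("meeting notes"))
```

Working in log probabilities avoids numerical underflow when documents contain many words.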
Why is Graphical Modeling Important?
• A systematic stochastic modeling framework
  – handles parameters, variables, and data
• Links modeling with computation
  – In other words, it links statistics and computer science
• Allows us to use computers to help build complex models
Graphical Model for Markov Chains
[Plate diagram: node c_i inside a plate over Pages]
Multiple Users…One Common Markov Chain
[Plate diagram: node c_i; plates over Pages and Users]
Multiple Users…One Chain per User
[Plate diagram: node c_i; plates over Pages and Users]
One Chain per Cluster of Users
[Plate diagram: Cluster_j and node c_i; plates over Pages and Users]
Cadez, Meek, Heckerman, Smyth, 2003
Clusters of Probabilistic State Machines
[Diagram: two probabilistic state machines over pages A, B, C, E, labeled Cluster 1 and Cluster 2]
Motivation: approximate the heterogeneity of Web surfing behavior
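With one Markov chain per cluster, a user's session can be assigned to clusters by comparing how likely the page sequence is under each chain. The sketch below shows only that membership computation, with hand-set parameters; the transition probabilities, cluster weights, and the example session are all hypothetical, and the learning procedure used in the cited work is not reproduced here.

```python
import math

PAGES = ["A", "B", "C", "E"]


def make_chain(favored: dict) -> dict:
    """Transition table: 0.7 to the favored next page, 0.1 to each other page."""
    return {src: {dst: (0.7 if dst == favored[src] else 0.1) for dst in PAGES}
            for src in PAGES}


# Two hypothetical clusters that browse the same pages in opposite orders.
clusters = {
    "cluster 1": {"weight": 0.5,
                  "trans": make_chain({"A": "B", "B": "C", "C": "E", "E": "A"})},
    "cluster 2": {"weight": 0.5,
                  "trans": make_chain({"A": "E", "E": "C", "C": "B", "B": "A"})},
}
START_PROB = 1.0 / len(PAGES)  # assumption: uniform distribution over first page


def log_likelihood(session: list, trans: dict) -> float:
    """Log probability of a page sequence under a first-order Markov chain."""
    total = math.log(START_PROB)
    for src, dst in zip(session, session[1:]):
        total += math.log(trans[src][dst])
    return total


def cluster_posterior(session: list) -> dict:
    """P(cluster | session) by Bayes rule."""
    log_joint = {name: math.log(c["weight"]) + log_likelihood(session, c["trans"])
                 for name, c in clusters.items()}
    top = max(log_joint.values())
    unnormalized = {k: math.exp(v - top) for k, v in log_joint.items()}
    norm = sum(unnormalized.values())
    return {k: v / norm for k, v in unnormalized.items()}


print(cluster_posterior(["A", "B", "C", "E", "A", "B"]))
```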
This is the sequence-mining algorithm in SQL Server
Statistical Text Mining
- NYT: 330,000 articles
- Enron: 250,000 emails
- Medline: 16 million articles
- NSF/NIH: 100,000 grants
- CiteSeer: 600,000 abstracts
- Pennsylvania Gazette: 80,000 articles, 1728-1800
Problems of Interest
– What topics do these documents “span”?
– Which documents are about a particular topic?
– How have topics changed over time?
– What does author X write about?
– Who is likely to write about topic Y?
– and so on…..
Collaborators in Text Research
- Mark Steyvers, UCI
- Chaitanya Chemudugunta, UCI
- Michal Rosen-Zvi, IBM
- Dave Newman, UCI
- Tom Griffiths, UC Berkeley
[Diagram: Probabilistic Model and Words in Documents, linked by P(Data | Parameters) and P(Parameters | Data)]
The Multinomial Model for Words
[Plate diagram: node w_i inside a plate over Words]
Multiple Documents: One Multinomial
[Plate diagram: node w_i; plates over Words and Documents]
One Multinomial per Document
[Plate diagram: node w_i; plates over Words and Documents]
Clusters of Documents
[Plate diagram: latent variable z and word w_i; plates over Words and Documents]
The Statistical Topic Model
[Plate diagram: latent topic z and word w_i; plates over Words and Documents; annotated with P(word|topic) and P(topic|doc)]
Topic Models
Documents = mixtures of topics
Topics = probability distributions over words
• Model = joint distribution over words, topics, docs
• Answering queries = computing conditional probabilities
• Topics are learned completely automatically from data (no human intervention)
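Given the two tables P(word|topic) and P(topic|doc), answering a query reduces to summing and normalizing probabilities. The sketch below uses hand-set numbers for a single document and two topics purely for illustration; in the talk these tables are learned automatically from data, and that learning step is not shown here. The topic names and probabilities are hypothetical.

```python
# Hypothetical P(word | topic) for a three-word vocabulary.
p_word_given_topic = {
    "sports":   {"game": 0.5, "team": 0.4, "market": 0.1},
    "business": {"game": 0.1, "team": 0.2, "market": 0.7},
}

# Hypothetical P(topic | doc) for one document: a mixture of topics.
p_topic_given_doc = {"sports": 0.25, "business": 0.75}


def p_word_given_doc(word: str) -> float:
    """P(word | doc) = sum over topics of P(word | topic) * P(topic | doc)."""
    return sum(p_word_given_topic[topic][word] * weight
               for topic, weight in p_topic_given_doc.items())


def p_topic_given_word(word: str) -> dict:
    """Query: which topic generated this word in this document? (Bayes rule)"""
    denom = p_word_given_doc(word)
    return {topic: p_word_given_topic[topic][word] * weight / denom
            for topic, weight in p_topic_given_doc.items()}


for word in ("game", "team", "market"):
    print(word, round(p_word_given_doc(word), 3), p_topic_given_word(word))
```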
Enron email data
- 250,000 emails
- 28,000 authors
- 1999-2002
Enron email: business topics
WORD PROB. WORD PROB. WORD PROB. WORD PROB.
FEEDBACK 0.0781 PROJECT 0.0514 FERC 0.0554 ENVIRONMENTAL 0.0291
PERFORMANCE 0.0462 PLANT 0.028 MARKET 0.0328 AIR 0.0232
PROCESS 0.0455 COST 0.0182 ISO 0.0226 MTBE 0.019
PEP 0.0446 CONSTRUCTION 0.0169 COMMISSION 0.0215 EMISSIONS 0.017
MANAGEMENT 0.03 UNIT 0.0166 ORDER 0.0212 CLEAN 0.0143
COMPLETE 0.0205 FACILITY 0.0165 FILING 0.0149 EPA 0.0133
QUESTIONS 0.0203 SITE 0.0136 COMMENTS 0.0116 PENDING 0.0129
SELECTED 0.0187 PROJECTS 0.0117 PRICE 0.0116 SAFETY 0.0104
COMPLETED 0.0146 CONTRACT 0.011 CALIFORNIA 0.0110 WATER 0.0092
SYSTEM 0.0146 UNITS 0.0106 FILED 0.0110 GASOLINE 0.0086
SENDER PROB. SENDER PROB. SENDER PROB. SENDER PROB.
perfmgmt 0.2195 *** 0.0288 *** 0.0532 *** 0.1339
perf eval process 0.0784 *** 0.022 *** 0.0454 *** 0.0275
enron announcements 0.0489 *** 0.0123 *** 0.0384 *** 0.0205
*** 0.0089 *** 0.0111 *** 0.0334 *** 0.0166
*** 0.0048 *** 0.0108 *** 0.0317 *** 0.0129
TOPIC 23 TOPIC 36 TOPIC 72 TOPIC 54
Enron: non-work topics…
WORD PROB. WORD PROB. WORD PROB. WORD PROB.
HOLIDAY 0.0857 TEXANS 0.0145 GOD 0.0357 AMAZON 0.0312
PARTY 0.0368 WIN 0.0143 LIFE 0.0272 GIFT 0.0226
YEAR 0.0316 FOOTBALL 0.0137 MAN 0.0116 CLICK 0.0193
SEASON 0.0305 FANTASY 0.0129 PEOPLE 0.0103 SAVE 0.0147
COMPANY 0.0255 SPORTSLINE 0.0129 CHRIST 0.0092 SHOPPING 0.0140
CELEBRATION 0.0199 PLAY 0.0123 FAITH 0.0083 OFFER 0.0124
ENRON 0.0198 TEAM 0.0114 LORD 0.0079 HOLIDAY 0.0122
TIME 0.0194 GAME 0.0112 JESUS 0.0075 RECEIVE 0.0102
RECOGNIZE 0.019 SPORTS 0.011 SPIRITUAL 0.0066 SHIPPING 0.0100
MONTH 0.018 GAMES 0.0109 VISIT 0.0065 FLOWERS 0.0099
SENDER PROB. SENDER PROB. SENDER PROB. SENDER PROB.
chairman & ceo 0.131 cbs sportsline com 0.0866 crosswalk com 0.2358 amazon com 0.1344
*** 0.0102 houston texans 0.0267 wordsmith 0.0208 jos a bank 0.0266
*** 0.0046 houstontexans 0.0203 *** 0.0107 sharperimageoffers 0.0136
*** 0.0022 sportsline rewards 0.0175 doctor dictionary 0.0101 travelocity com 0.0094
general announcement 0.0017 pro football 0.0136 *** 0.0061 barnes & noble com 0.0089
TOPIC 109 TOPIC 66 TOPIC 182 TOPIC 113
Enron: public-interest topics...
WORD PROB. WORD PROB. WORD PROB. WORD PROB.
POWER 0.0915 STATE 0.0253 COMMITTEE 0.0197 LAW 0.0380
CALIFORNIA 0.0756 PLAN 0.0245 BILL 0.0189 TESTIMONY 0.0201
ELECTRICITY 0.0331 CALIFORNIA 0.0137 HOUSE 0.0169 ATTORNEY 0.0164
UTILITIES 0.0253 POLITICIAN Y 0.0137 WASHINGTON 0.0140 SETTLEMENT 0.0131
PRICES 0.0249 RATE 0.0131 SENATE 0.0135 LEGAL 0.0100
MARKET 0.0244 BANKRUPTCY 0.0126 POLITICIAN X 0.0114 EXHIBIT 0.0098
PRICE 0.0207 SOCAL 0.0119 CONGRESS 0.0112 CLE 0.0093
UTILITY 0.0140 POWER 0.0114 PRESIDENT 0.0105 SOCALGAS 0.0093
CUSTOMERS 0.0134 BONDS 0.0109 LEGISLATION 0.0099 METALS 0.0091
ELECTRIC 0.0120 MOU 0.0107 DC 0.0093 PERSON Z 0.0083
SENDER PROB. SENDER PROB. SENDER PROB. SENDER PROB.
*** 0.1160 *** 0.0395 *** 0.0696 *** 0.0696
*** 0.0518 *** 0.0337 *** 0.0453 *** 0.0453
*** 0.0284 *** 0.0295 *** 0.0255 *** 0.0255
*** 0.0272 *** 0.0251 *** 0.0173 *** 0.0173
*** 0.0266 *** 0.0202 *** 0.0317 *** 0.0317
TOPIC 194 TOPIC 18 TOPIC 22 TOPIC 114
Examples of CiteSeer Topics
WORD PROB. WORD PROB. WORD PROB. WORD PROB.
SPEECH 0.1134 PROBABILISTIC 0.0778 USER 0.2541 STARS 0.0164
RECOGNITION 0.0349 BAYESIAN 0.0671 INTERFACE 0.1080 OBSERVATIONS 0.0150
WORD 0.0295 PROBABILITY 0.0532 USERS 0.0788 SOLAR 0.0150
SPEAKER 0.0227 CARLO 0.0309 INTERFACES 0.0433 MAGNETIC 0.0145
ACOUSTIC 0.0205 MONTE 0.0308 GRAPHICAL 0.0392 RAY 0.0144
RATE 0.0134 DISTRIBUTION 0.0257 INTERACTIVE 0.0354 EMISSION 0.0134
SPOKEN 0.0132 INFERENCE 0.0253 INTERACTION 0.0261 GALAXIES 0.0124
SOUND 0.0127 PROBABILITIES 0.0253 VISUAL 0.0203 OBSERVED 0.0108
TRAINING 0.0104 CONDITIONAL 0.0229 DISPLAY 0.0128 SUBJECT 0.0101
MUSIC 0.0102 PRIOR 0.0219 MANIPULATION 0.0099 STAR 0.0087
AUTHOR PROB. AUTHOR PROB. AUTHOR PROB. AUTHOR PROB.
Waibel_A 0.0156 Friedman_N 0.0094 Shneiderman_B 0.0060 Linsky_J 0.0143
Gauvain_J 0.0133 Heckerman_D 0.0067 Rauterberg_M 0.0031 Falcke_H 0.0131
Lamel_L 0.0128 Ghahramani_Z 0.0062 Lavana_H 0.0024 Mursula_K 0.0089
Woodland_P 0.0124 Koller_D 0.0062 Pentland_A 0.0021 Butler_R 0.0083
Ney_H 0.0080 Jordan_M 0.0059 Myers_B 0.0021 Bjorkman_K 0.0078
Hansen_J 0.0078 Neal_R 0.0055 Minas_M 0.0021 Knapp_G 0.0067
Renals_S 0.0072 Raftery_A 0.0054 Burnett_M 0.0021 Kundu_M 0.0063
Noth_E 0.0071 Lukasiewicz_T 0.0053 Winiwarter_W 0.0020 Christensen-J 0.0059
Boves_L 0.0070 Halpern_J 0.0052 Chang_S 0.0019 Cranmer_S 0.0055
Young_S 0.0069 Muller_P 0.0048 Korvemaker_B 0.0019 Nagar_N 0.0050
TOPIC 10 TOPIC 209 TOPIC 87 TOPIC 20
[Figure: Changing Trends in Computer Science. Topic probability (0 to 0.012) by year, 1990 to 2002; topics plotted: Operating Systems, Information Retrieval, WWW, Programming Languages]
[Figure: Hot Topics: Machine Learning/Data Mining. Topic probability (1 to 8, ×10^-3) by year, 1990 to 2002; topics plotted: Regression, Data Mining, Classification]
[Figure: Bayes Marches On. Topic probability (1.5 to 5.5, ×10^-3) by year, 1990 to 2002; topics plotted: Bayesian, Probability, Statistical Prediction]
[Figure: Interesting "Topics". Topic probability (0 to 0.012) by year, 1990 to 2002; topics plotted: French words (LA, LES, UNE, NOUS, EST), Math symbols (GAMMA, DELTA, OMEGA), DARPA]
Topic trends from New York Times
330,000 articles, 2000-2002
[Figure: three panels, each plotting one topic over time from Jan 2000 to Jan 2003]
- Tour-de-France: TOUR, RIDER, LANCE_ARMSTRONG, TEAM, BIKE, RACE, FRANCE
- Quarterly Earnings: COMPANY, QUARTER, PERCENT, ANALYST, SHARE, SALES, EARNING
- Anthrax: ANTHRAX, LETTER, MAIL, WORKER, OFFICE, SPORES, POSTAL, BUILDING
Pennsylvania Gazette
Size | Most likely words in topic | Labels Added
6.3% | country public great people men liberty many let life friend spirit government | Republicanism
5.7% | say might thing think without against own did know make well reason good | Rhetoric
4.9% | away servant reward old jacket whoever pair named paid run hat coat master | Runaways
4.1% | silk cotton ditto white black linen women cloth blue worsted men thread fine | Cloth for Sale
3.8% | acres good land meadow plantation containing sold tract miles well premise | Real Estate
Joint work with Sharon Block, UC Irvine History Dept
Analyzing Austen novels
[SENTIMENT] (3.4%) felt comfort feeling feel spirit mind heart point moment ill letter beyond mother state never event evil fear impossible hope time idea left situation poor distress possible hour end loss relief dearest suffering
[Figure: six panels, one per novel (Emma, Mansfield Park, Northanger Abbey, Persuasion, Pride and Prejudice, Sense and Sensibility), each plotting words against time]
Applications
• Automatically building domain-specific browsers “on the fly”
  – Burns et al (2007) constructed an interactive visual browser, based on topics, for papers at the Annual Society for Neuroscience Conference
  – Kumar (2006) developed a browser for 40,00 MEDLINE documents about schizophrenia
  – Others in development
• Automated indexing in digital libraries
  – REXA system uses topics to automatically index 1 million computer science papers (McCallum et al, U Mass)
  – California Digital Library (Newman et al, 2006)
• Exploratory analysis “beyond keywords”
Concluding Comments
• Web data analysis
  – Tremendous opportunities and interesting problems
  – Rich measurement of human behavior on a large scale
  – In terms of Web data analysis, it’s about 1820-1850
• Probability and statistics remain highly relevant
• We need a new breed of “data scientist”
  – fluent in both computer science and statistics
  – not enough attention being paid to this in education
Further Reading
• Web Data Analysis
  – P. Baldi, P. Frasconi, and P. Smyth, Modeling the Internet and the Web: Probabilistic Methods and Algorithms, Wiley, 2003
  – S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2002
• Topic Modeling
  – M. Steyvers and T. Griffiths, Probabilistic topic models, 2006 (Good introductory article, available from Mark Steyvers’ Web page)