graphical models for the internet

520
Graphical Models for the Internet Alexander Smola Yahoo! Research, Santa Clara, CA Australian National University [email protected] blog.smola.org credits to Amr Ahmed, Yahoo Research, CA

Upload: hustwj

Post on 23-Jan-2015

1.495 views

Category:

Technology


0 download

DESCRIPTION

 

TRANSCRIPT

Page 1: Graphical Models for the Internet

Graphical Models for the InternetAlexander Smola

Yahoo! Research, Santa Clara, CAAustralian National University

[email protected] blog.smola.org

credits to Amr Ahmed, Yahoo Research, CA

Page 2: Graphical Models for the Internet

Outline1. Systems

• Hardware (computer archictecture / networks / data centers)• Storage and processing (file systems, MapReduce, Dryad, S4)• Communication and synchronization (star, ring, hashtable, distributed star, tree)

2. Applications on the internet• User modeling (clustering, abuse, profiling)• Content analysis (webpages, links, news)• Search / sponsored search

3. Probabilistic modeling• Basic probability theory• Naive Bayes• Density estimation (exponential families)

4. Directed graphical models• Directed graph semantics (independence, factorization)• Clustering and Markov models (basic model, EM, sampling)• Dirichlet distribution

Page 3: Graphical Models for the Internet

Outline5. Scalable topic modeling

• Latent Dirichlet Allocation• Sampling and parallel communication• Applications to user profiling

6. Applications of latent variable models• Time dependent / context dependent models• News articles / ideology estimation• Recommendation systems

7. Undirected graphical models• Conditional independence and graphs (vertices, chains, trellis)• Message passing (junction trees, variable elimination)• Variational methods, sampling, particle filtering, message passing

8. Applications of undirected graphical models (time permitting)• Conditional random fields• Information extraction• Unlabeled data

Page 4: Graphical Models for the Internet

Part 1 - Systems

Page 5: Graphical Models for the Internet

Hardware

Page 6: Graphical Models for the Internet

• CPU• 8-16 cores (Intel/AMD servers)• 2-3 GHz (close to 1 IPC per core peak) - 10-100 GFlops/socket• 8-16 MB Cache (essentially accessible at clock speed)• Vectorized multimedia instructions (128bit wide, e.g. add, multiply, logical)• Deep pipeline architectures (branching is expensive)

• RAM• 8-128 GB depending on use• 2-3 memory banks (each 32bit wide - atomic writes!)• DDR3 (10GB/s per chip, random access >10x slower)

• Harddisk• 2-3 TB/disk• 100 MB/s sequential read from SATA2• 5ms latency (no change over 10 years), i.e. random access is slow

• Solid State Drives• 500 MB/s sequential read• Random writes are really expensive (read-erase-write cycle for a block)• Latency is 0.5ms or lower (controller & flash cell dependent)

• Anything you can do in bulk & sequence is at least 10x faster

Computers

Page 7: Graphical Models for the Internet

• Network interface• Gigabit network (10-100MB/s throughput)• Copying RAM across network takes 1-10

minutes• Copying 1TB across network takes 1 day• Dozens of computers in a rack (on same switch)

• Power consumption• 200-400W per unit (don’t buy the fastest CPUs)• Energy density big issue (cooling!)• GPUs take much more power (150W each)

• Systems designed to fail• Commodity hardware• Too many machines to ensure 100% uptime• Design software to deal with it (monitoring,

scheduler, data storage)

• Systems designed to grow

Computers

Page 8: Graphical Models for the Internet

Server Centers• 10,000+ servers• Aggregated into rack (same switch)• Reduced bandwidth between racks• Some failure modes

• OS crash• Disk dies (one of many)• Computer dies • Network switch dies (lose rack)• Packet loss / congestion / DNS

• Several server centers worldwide• Applications running distributed (e.g. ranking,

mail, images)• Load balancing• Data transfer between centers

Page 9: Graphical Models for the Internet

Some indicative prices (on Amazon)

server costs

data transfer

storage

Page 10: Graphical Models for the Internet

Processing in the cloud

Page 11: Graphical Models for the Internet

Data storage • Billions of webpages• Billions of queries in query / click log• Millions of data ranked by editors

• Storing data is cheap• less than 100 Billion interesting webpages

• assume 10kB per webpage - 1PB total• Amazon S3 price (1 month) $10k (at $0.01/GB)

• Processing data is expensive• 10 Billion webpages• 10ms per page (only simple algorithms work)• 10 days on 1000 computers ($24k-$240k at $0.1-$1/h)

• Crawling the data is very expensive• Assume 10 Gb/s link - takes >100 days to gather

APCN2 cable has 2.5 Tb/s bandwidth• Amazon EC2 price $100k (at $0.1/GB), with overhead $1M)

Page 12: Graphical Models for the Internet

replicate blocks 3x

File Systems (GoogleFS/HDFS)

replicate blocks 3x

name node server

(replicated)

cheapservers

not quite so cheap

Ghemawat, Gobioff, Leung, 2003

Page 13: Graphical Models for the Internet

File Systems Details• Chunkservers

• store 64MB (or larger) blocks• write to one, replicate transparently (3x)

• Name node• keeps directory of chunks• replicated

• Write• get chunkserver ID from name node• write block(s) to chunkserver (expensive to write parts of chunk)• largely write once scenario (read many times)• distributed write for higher bandwidth (each server 1 chunk)

• Read• from any of the replicated chunks• higher replication for hotspots (many reads of a chunk)

• Elasticity• Add additional chunkservers - name node migrates chunks to new machine• Node failure (name node keeps track of chunk server status) - requests replication

Page 14: Graphical Models for the Internet

Comparison• HDFS/GoogleFS

• Fast block writes / reads• Fault tolerant• Flexible size adjustment• Terrible for random writes / bad for random writes• Not really a filesystem

• NFS & co.• No distributed file management

• Lustre• Proper filesystem• High bandwidth• Explicitly exploits fast interconnects• Cabinet servers replicated with RAID5

• Fails if cabinet dies• Difficult to add more storage

Page 15: Graphical Models for the Internet

MapReduce• Scenario

• Lots of data (much more than what a single computer can process)

• Stored in a distributed fashion• Move computation to data

• MapApply function f to data as distributed over the mappersThis runs (if possible) where data sits

• ReduceCombine data given keys generated in the Map phase

• Fault tolerance• If mapper dies, re-process the data• If reducer dies, re-send the data (cached) from

mappers (requires considerable storage)

Map Reduce

Dean, Ghemawat, 2004

Page 16: Graphical Models for the Internet

Item count in MapReduce• Task: object counting• Map

• Each mapper gets (unsorted) chunk of data

• Preferably local on chunkservers• Perform local item counts• Emit (item, count) data

• Reduce• Aggregate all counts for a given

item (all end up at same reducer)• Emit aggregate counts

(image: gridgain.com)

Page 17: Graphical Models for the Internet

k-means in MapReduce• Initialize random cluster centers• Map

• Assign each data point to a cluster based on current model

• Aggregate data per cluster• Send cluster aggregates to reducer

(e.g. mean, variance, size)• Reduce

• Aggregate all data per cluster• Update cluster centers• Send new cluster centers to new

reducers• Repeat until converged

(image: mathworks.com)

Page 18: Graphical Models for the Internet

k-means in MapReduce• Initialize random cluster centers• Map

• Assign each data point to a cluster based on current model

• Aggregate data per cluster• Send cluster aggregates to reducer

(e.g. mean, variance, size)• Reduce

• Aggregate all data per cluster• Update cluster centers• Send new cluster centers to new

reducers• Repeat until converged

(image: mathworks.com)

needs to re-read data from diskfor each MapReduce iteration

Page 19: Graphical Models for the Internet

Dryad• Data flow graph

(DAG rather than bipartite graph)

• Interface variety• Memory FIFO• Disk• Network

• Modular composition of computation graph

Isard, Budiu, Yu, Birrell & Fetterly, 2007

(image: Microsoft Research)

Page 20: Graphical Models for the Internet

S4 - Online processing

Neumeyer, Robbins, Nair, Kesari, 2011

dispatch withdistributed hash

Page 21: Graphical Models for the Internet

Dataflow Paradigms

Page 22: Graphical Models for the Internet

Pipeline

• Process data sequentially• Parallelizes up to number of tasks

(disk read, feature extraction, logging, output)• Reasonable for a single machine• Parallelization per filter possible

(see e.g. Intel Threading Building Blocks)

CPU CPU CPU CPU

Page 23: Graphical Models for the Internet

Pipeline

• Process data sequentially• Parallelizes up to number of tasks

(disk read, feature extraction, logging, output)• Reasonable for a single machine• Parallelization per filter possible

(see e.g. Intel Threading Building Blocks)

CPU CPU CPU CPUCPUCPUCPU

multicore

Page 24: Graphical Models for the Internet

Tree• Aggregate data hierarchically• Parallelizes at O(log n) cost

(tree traversal)• Communication at the

interface is important (e.g. network latency)

• Good dataflow processing• Poor efficiency for batched

processing O(1/log n)• Poor fault tolerance• Does not map well onto

server centers (need to ensure that we have lower leaves on rack)

CPUCPU CPU

CPU

CPU

CPU

CPU

Page 25: Graphical Models for the Internet

Star• Aggregate data centrally• Does not parallelize at all if

communication cost is high• Perfect parallelization if CPU bound• Latency is O(1) unless the network

is congested. • Network requirements are O(n)• Central node becomes hotspot• Synchronization is very easy• Trivial to add more resources (just

add leaves)• Difficult to parallelize center

CPU

CPU

CPU

CPU

CPU

CPU

CPU

Page 26: Graphical Models for the Internet

Distributed (key,value) storage• Caching problem

• Store many (key,value) pairs• Linear scalability in clients and

servers• Automatic key distribution

mechanism• memcached

• (key,value) servers• client access library distributes

access patterns• randomized O(n) bandwidth• aggregate O(n) bandwidth• load balancing via hashing• no versioned writes

m(key,M) = argminm∈M

h(key,m)

P2P uses similar routing tables

Page 27: Graphical Models for the Internet

Proportional hashing/caching• Machines with different

capacity• Hotspots too hot for a single

machine to handle• Retain nearest neighbor

hashing properties

Chawla, Reed, Juhnke, Syed, 2011, USENIX

• Sparse cover of keyspace proportional to machine capacities• Repeatedly hash a key until it hits a key-range, i.e. hn(key)• Keep busy table and advance to next hash if machine already used

Page 28: Graphical Models for the Internet

Distributed Star

CPU 3

CPU 4

CPU 2

CPU 5

CPU 1

CPU 6

m(key,M) = argminm∈M

h(key,m)

• Aggregate data centrally• Use different center for each key,

as selected by distributed hashing• Linear bandwidth for

synchronization• Perfect scalability O(n) bandwidth

required• Each CPU performs local

computation and stores small fraction of global data

• Works best if all nodes on the same switch / rack

Page 29: Graphical Models for the Internet

Distributed Star• Aggregate data centrally• Use different center for each key,

as selected by distributed hashing• Linear bandwidth for

synchronization• Perfect scalability O(n)

bandwidth required• Each CPU performs local

computation and stores small fraction of global data

• Works best if all nodes on the same switch / rack

CPU 3

CPU 4

CPU 5

CPU 2

CPU 6

m(key,M) = argminm∈M

h(key,m)

CPU 1

Page 30: Graphical Models for the Internet

Distributed Star• Aggregate data centrally• Use different center for each key,

as selected by distributed hashing• Linear bandwidth for

synchronization• Perfect scalability O(n)

bandwidth required• Each CPU performs local

computation and stores small fraction of global data

• Works best if all nodes on the same switch / rack

CPU 4

CPU 5

CPU 3

CPU 6

m(key,M) = argminm∈M

h(key,m)

CPU 1

CPU 2

Page 31: Graphical Models for the Internet

Ringbuffer• Problem

• Disk, RAM and CPU operate at different speeds (>10x difference)

• Want to do maximum data processing (e.g. optimization)

• Idea• Load data from disk into ringbuffer• Process data continuously on buffer• Chain ringbuffers

• Yields consistently maximum throughput for each resource

Page 32: Graphical Models for the Internet

Summary

• HardwareServers, networks, amounts of data

• Processing paradigmsMapReduce, Dryad, S4

• Communication templatesStars, pipelines, distributed hash table, caching

Page 33: Graphical Models for the Internet

Part 2 - Motivation

Page 34: Graphical Models for the Internet

Data on the Internet• Webpages (content, graph)• Clicks (ad, page, social)• Users (OpenID, FB Connect)• e-mails (Hotmail, Y!Mail, Gmail)• Photos, Movies (Flickr, YouTube, Vimeo ...)• Cookies / tracking info (see Ghostery)• Installed apps (Android market etc.)• Location (Latitude, Loopt, Foursquared)• User generated content (Wikipedia & co)• Ads (display, text, DoubleClick, Yahoo)• Comments (Disqus, Facebook)• Reviews (Yelp, Y!Local)• Third party features (e.g. Experian)• Social connections (LinkedIn, Facebook)• Purchase decisions (Netflix, Amazon)• Instant Messages (YIM, Skype, Gtalk)• Search terms (Google, Bing)• Timestamp (everything)• News articles (BBC, NYTimes, Y!News)• Blog posts (Tumblr, Wordpress)• Microblogs (Twitter, Jaiku, Meme)

Finite resources• Editors are expensive• Editors don’t know users• Barrier to i18n• Abuse (intrusions are novel)• Implicit feedback• Data analysis (find interesting stuff

rather than find x)• Integrating many systems• Modular design for data integration• Integrate with given prediction tasks

Invest in modeling and naming rather than data generation

Page 35: Graphical Models for the Internet

Data on the Internet• Webpages (content, graph)• Clicks (ad, page, social)• Users (OpenID, FB Connect)• e-mails (Hotmail, Y!Mail, Gmail)• Photos, Movies (Flickr, YouTube, Vimeo ...)• Cookies / tracking info (see Ghostery)• Installed apps (Android market etc.)• Location (Latitude, Loopt, Foursquared)• User generated content (Wikipedia & co)• Ads (display, text, DoubleClick, Yahoo)• Comments (Disqus, Facebook)• Reviews (Yelp, Y!Local)• Third party features (e.g. Experian)• Social connections (LinkedIn, Facebook)• Purchase decisions (Netflix, Amazon)• Instant Messages (YIM, Skype, Gtalk)• Search terms (Google, Bing)• Timestamp (everything)• News articles (BBC, NYTimes, Y!News)• Blog posts (Tumblr, Wordpress)• Microblogs (Twitter, Jaiku, Meme)

Finite resources• Editors are expensive• Editors don’t know users• Barrier to i18n• Abuse (intrusions are novel)• Implicit feedback• Data analysis (find interesting stuff

rather than find x)• Integrating many systems• Modular design for data integration• Integrate with given prediction tasks

Invest in modeling and naming rather than data generation

unlimited amounts of data

Page 36: Graphical Models for the Internet

Unsupervised Modeling

Page 37: Graphical Models for the Internet

Hierarchical Clustering

NIPS 2010Adams,Ghahramani,Jordan

Page 38: Graphical Models for the Internet

Topics in text

Latent Dirichlet Allocation; Blei, Ng, Jordan, JMLR 2003

Page 39: Graphical Models for the Internet

Word segmentation

Mochihashi, Yamada, Ueda, ACL 2009

Page 40: Graphical Models for the Internet

Language model

automatically synthesizedfrom Penn Treebank

Mochihashi, Yamada, UedaACL 2009

Page 41: Graphical Models for the Internet

User model over time

0 10 20 30 400

0.1

0.2

0.3

Prop

otio

nDay

Baseball

Finance

Jobs

Dating

0 10 20 30 400

0.1

0.2

0.3

0.4

0.5

Prop

otio

n

Day

Baseball

Dating

Celebrity

Health

SnookiTom  CruiseKatie

Holmes  PinkettKudrow

Hollywood

League  baseball  basketball,  doubleheadBergesenGriffeybullpen  Greinke

skinbody  fingers  cells  toes  

wrinkle  layers

women  mendating  singles  

personals  seeking  match

Dating Baseball Celebrity Health

job  careerbusinessassistanthiring

part-­‐timereceptionist

financial  Thomson  chart  real  StockTradingcurrency

Jobs Finance

Ahmed et al., KDD 2011

Page 42: Graphical Models for the Internet

User model over time

0 10 20 30 400

0.1

0.2

0.3

Prop

otio

nDay

Baseball

Finance

Jobs

Dating

0 10 20 30 400

0.1

0.2

0.3

0.4

0.5

Prop

otio

n

Day

Baseball

Dating

Celebrity

Health

SnookiTom  CruiseKatie

Holmes  PinkettKudrow

Hollywood

League  baseball  basketball,  doubleheadBergesenGriffeybullpen  Greinke

skinbody  fingers  cells  toes  

wrinkle  layers

women  mendating  singles  

personals  seeking  match

Dating Baseball Celebrity Health

job  careerbusinessassistanthiring

part-­‐timereceptionist

financial  Thomson  chart  real  StockTradingcurrency

Jobs Finance

Ahmed et al., KDD 2011

Page 43: Graphical Models for the Internet

Face recognition from captions

Jain, Learned-Miller, McCallum, ICCV 2007

Page 44: Graphical Models for the Internet

Storylines from news

Ahmed et al, AISTATS 2011

Page 45: Graphical Models for the Internet

Ideology detection

Ahmed et al, 2010; Bitterlemons collection

Page 46: Graphical Models for the Internet

Hypertext topic extraction

Gruber, Rosen-Zvi, Weiss; UAI 2008

Page 47: Graphical Models for the Internet

Supervised Modeling

Page 48: Graphical Models for the Internet

Ontologies

• continuous maintenance

• no guarantee of coverage

• difficult categories

Page 49: Graphical Models for the Internet

Ontologies

• continuous maintenance

• no guarantee of coverage

• difficult categories

Page 50: Graphical Models for the Internet

Face Classification/Recognition

• 100-1000 people

• 10k faces• curated

(not realistic)• expensive to

generate

Page 51: Graphical Models for the Internet

Topic Detection & Tracking

• editorially curated training data

• expensive to generate

• subjective in selection of threads

• language specific

Page 52: Graphical Models for the Internet

Advertising Targeting

• Needs training data in every language• Is it really relevant for better ads?• Does it cover relevant areas?

Page 53: Graphical Models for the Internet

Advertising Targeting

• Needs training data in every language• Is it really relevant for better ads?• Does it cover relevant areas?

Page 54: Graphical Models for the Internet

Collaborative Filtering

Page 55: Graphical Models for the Internet

Challenges• Scale

• Millions to billions of instances(documents, clicks, users, messages, ads)

• Rich structure of data (ontology, categories, tags)• Model description typically larger than memory of single workstation

• Modeling• Usually clustering or topic models do not solve the problem• Temporal structure of data

• Side information for variables

• Solve problem. Don’t simply apply a model!• Inference

• 10k-100k clusters for hierarchical model• 1M-100M words• Communication is an issue for large state space

Page 56: Graphical Models for the Internet

Summary• Essentially infinite amount of data• Labeling is (in many cases) prohibitively expensive• Editorial data not scalable for i18n• Even for supervised problems unlabeled data

abounds. Use it.• User-understandable structure for representation

purposes• Solutions are often customized to problem

We can only cover building blocks in tutorial.

Page 57: Graphical Models for the Internet

Part 3 - Basic Tools

Page 58: Graphical Models for the Internet

Statistics 101

Page 59: Graphical Models for the Internet

Probability• Space of events X

• server working; slow response; server broken• income of the user (e.g. $95,000)• query text for search (e.g. “statistics tutorial”)

• Probability axioms (Kolmogorov)

• Example queries• P(server working) = 0.999• P(90,000 < income < 100,000) = 0.1

Pr(X) ∈ [0, 1], Pr(X ) = 1Pr(∪iXi) =

�i Pr(Xi) if Xi ∩Xj = ∅

Page 60: Graphical Models for the Internet

All events

Venn Diagram

Page 61: Graphical Models for the Internet

All events

X

X �

Venn Diagram

X ∩X �

Page 62: Graphical Models for the Internet

All events

X

X �

Venn Diagram

X ∩X �

Pr(X ∪X �) = Pr(X) + Pr(X �)− Pr(X ∩X �)

Page 63: Graphical Models for the Internet

(In)dependence• Independence

• Login behavior of two users (approximately)• Disk crash in different colos (approximately)

• Dependent events• Emails• Queries• News stream / Buzz / Tweets• IM communication• Russian Roulette

Pr(x, y) = Pr(x) · Pr(y)

Pr(x, y) �= Pr(x) · Pr(y)

Page 64: Graphical Models for the Internet

(In)dependence• Independence

• Login behavior of two users (approximately)• Disk crash in different colos (approximately)

• Dependent events• Emails• Queries• News stream / Buzz / Tweets• IM communication• Russian Roulette

Pr(x, y) = Pr(x) · Pr(y)

Pr(x, y) �= Pr(x) · Pr(y)

Page 65: Graphical Models for the Internet

(In)dependence• Independence

• Login behavior of two users (approximately)• Disk crash in different colos (approximately)

• Dependent events• Emails• Queries• News stream / Buzz / Tweets• IM communication• Russian Roulette

Pr(x, y) = Pr(x) · Pr(y)

Pr(x, y) �= Pr(x) · Pr(y)

Everywhere!

Page 66: Graphical Models for the Internet

Independence

0.25 0.25

0.25 0.25

Page 67: Graphical Models for the Internet

Dependence

0.45 0.05

0.05 0.45

Page 68: Graphical Models for the Internet

A Graphical Model

Spam Mail

p(spam, mail) = p(spam) p(mail|spam)

Page 69: Graphical Models for the Internet

Bayes Rule

• Joint Probability

• Bayes Rule

• Hypothesis testing• Reverse conditioning

Pr(X|Y ) =Pr(Y |X) · Pr(X)

Pr(Y )

Pr(X,Y ) = Pr(X|Y ) Pr(Y ) = Pr(Y |X) Pr(X)

Page 70: Graphical Models for the Internet

AIDS test (Bayes rule)• Data

• Approximately 0.1% are infected• Test detects all infections• Test reports positive for 1% healthy people

• Probability of having AIDS if test is positive

Page 71: Graphical Models for the Internet

AIDS test (Bayes rule)• Data

• Approximately 0.1% are infected• Test detects all infections• Test reports positive for 1% healthy people

• Probability of having AIDS if test is positive

Pr(a = 1|t) =Pr(t|a = 1) · Pr(a = 1)

Pr(t)

=Pr(t|a = 1) · Pr(a = 1)

Pr(t|a = 1) · Pr(a = 1) + Pr(t|a = 0) · Pr(a = 0)

=1 · 0.001

1 · 0.001 + 0.01 · 0.999= 0.091

Page 72: Graphical Models for the Internet

Improving the diagnosis

Page 73: Graphical Models for the Internet

Improving the diagnosis• Use a follow-up test

• Test 2 reports positive for 90% infections• Test 2 reports positive for 5% healthy people

0.01 · 0.05 · 0.9991 · 0.9 · 0.001 + 0.01 · 0.05 · 0.999

= 0.357

Page 74: Graphical Models for the Internet

Improving the diagnosis• Use a follow-up test

• Test 2 reports positive for 90% infections• Test 2 reports positive for 5% healthy people

• Why can’t we use Test 1 twice?Outcomes are not independent but tests 1 and 2 are conditionally independent

0.01 · 0.05 · 0.9991 · 0.9 · 0.001 + 0.01 · 0.05 · 0.999

= 0.357

Page 75: Graphical Models for the Internet

Improving the diagnosis• Use a follow-up test

• Test 2 reports positive for 90% infections• Test 2 reports positive for 5% healthy people

• Why can’t we use Test 1 twice?Outcomes are not independent but tests 1 and 2 are conditionally independent

0.01 · 0.05 · 0.9991 · 0.9 · 0.001 + 0.01 · 0.05 · 0.999

= 0.357

p(t1, t2|a) = p(t1|a) · p(t2|a)

Page 76: Graphical Models for the Internet

Logarithms are good• Floating point numbers

• Probabilities can be very small. In particular products of many probabilities. Underflow!

• Store data in mantissa, not exponent

• Known bug e.g. in Mahout Dirichlet clustering

52 11 1

mantissasign

exponentπ = log p

i

pi →�

i

πi

i

pi → maxπ + log�

i

exp [πi −maxπ]

Page 77: Graphical Models for the Internet

Application: Naive Bayes

Page 78: Graphical Models for the Internet

Naive Bayes Spam Filter

Page 79: Graphical Models for the Internet

Naive Bayes Spam Filter• Key assumption

Words occur independently of each other given the label of the document

p(w1, . . . , wn|spam) =n�

i=1

p(wi|spam)

Page 80: Graphical Models for the Internet

Naive Bayes Spam Filter• Key assumption

Words occur independently of each other given the label of the document

• Spam classification via Bayes Rulep(w1, . . . , wn|spam) =

n�

i=1

p(wi|spam)

p(spam|w1, . . . , wn) ∝ p(spam)n�

i=1

p(wi|spam)

Page 81: Graphical Models for the Internet

Naive Bayes Spam Filter• Key assumption

Words occur independently of each other given the label of the document

• Spam classification via Bayes Rule

• Parameter estimationCompute spam probability and word distributions for spam and ham

p(w1, . . . , wn|spam) =n�

i=1

p(wi|spam)

p(spam|w1, . . . , wn) ∝ p(spam)n�

i=1

p(wi|spam)

Page 82: Graphical Models for the Internet

Naive Bayes Spam Filter

• Get rich quick. Buy WWW stock.• Buy Viagra. Make your WWW experience last longer.• You deserve a PhD from WWW University.

We recognize your expertise.

• Make your rich WWW PhD experience last longer.

Equally likely phrases

Page 83: Graphical Models for the Internet

Naive Bayes Spam Filter

• Get rich quick. Buy WWW stock.• Buy Viagra. Make your WWW experience last longer.• You deserve a PhD from WWW University.

We recognize your expertise.

• Make your rich WWW PhD experience last longer.

Equally likely phrases

Page 84: Graphical Models for the Internet

A Graphical Model

spam

w1 w2 . . . wn

Page 85: Graphical Models for the Internet

A Graphical Model

spam

w1 w2 . . . wn

p(w1, . . . , wn|spam) =n�

i=1

p(wi|spam)

Page 86: Graphical Models for the Internet

A Graphical Model

spam

w1 w2 . . . wn

p(w1, . . . , wn|spam) =n�

i=1

p(wi|spam)

spam

wi

i=1..n

Page 87: Graphical Models for the Internet

A Graphical Model

spam

w1 w2 . . . wn

p(w1, . . . , wn|spam) =n�

i=1

p(wi|spam)

spam

wi

i=1..n

how to estimate p(w|spam)

Page 88: Graphical Models for the Internet

Simple Algorithm

p(y|x) ∝ p(y)�

j

p(xj |y)

• For each document (x,y) do• Aggregate label counts given y• For each feature xi in x do

• Aggregate statistic for (xi, y) for each y• For y estimate distribution p(y)• For each (xi,y) pair do

Estimate distribution p(xi|y), e.g. Parzen Windows, Exponential family (Gauss, Laplace, Poisson, ...), Mixture

• Given new instance compute

Page 89: Graphical Models for the Internet

Simple Algorithm

p(y|x) ∝ p(y)�

j

p(xj |y)

trivially parallel

pass over all data• For each document (x,y) do

• Aggregate label counts given y• For each feature xi in x do

• Aggregate statistic for (xi, y) for each y• For y estimate distribution p(y)• For each (xi,y) pair do

Estimate distribution p(xi|y), e.g. Parzen Windows, Exponential family (Gauss, Laplace, Poisson, ...), Mixture

• Given new instance compute

Page 90: Graphical Models for the Internet

• Map(document (x,y))• For each mapper for each feature xi in x do

• Aggregate statistic for (xi, y) for each y• Send aggregate statistics to reducer

• Reduce(xi, y)• Aggregate over all messages from mappers• Estimate distribution p(xi|y), e.g. Parzen Windows,

Exponential family (Gauss, Laplace, Poisson, ...), Mixture• Send coordinate-wise model to global storage

• Given new instance compute

MapReduce Algorithm

p(y|x) ∝ p(y)�

j

p(xj |y)

Page 91: Graphical Models for the Internet

• Map(document (x,y))• For each mapper for each feature xi in x do

• Aggregate statistic for (xi, y) for each y• Send aggregate statistics to reducer

• Reduce(xi, y)• Aggregate over all messages from mappers• Estimate distribution p(xi|y), e.g. Parzen Windows,

Exponential family (Gauss, Laplace, Poisson, ...), Mixture• Send coordinate-wise model to global storage

• Given new instance compute

MapReduce Algorithm

p(y|x) ∝ p(y)�

j

p(xj |y)

local perchunkserver

only aggregates needed

Page 92: Graphical Models for the Internet

spam probability

Naive NaiveBayes Classifier• Two classes (spam/ham)• Binary features (e.g. presence of $$$, viagra)• Simplistic Algorithm

• Count occurrences of feature for spam/ham• Count number of spam/ham mails

p(xi = TRUE|y) = n(i, y)

n(y)and p(y) =

n(y)

n

feature probability

p(y|x) ∝ n(y)

n

i:xi=TRUE

n(i, y)

n(y)

i:xi=FALSE

n(y)− n(i, y)

n(y)

Page 93: Graphical Models for the Internet

Naive NaiveBayes Classifier

p(y|x) ∝ n(y)

n

i:xi=TRUE

n(i, y)

n(y)

i:xi=FALSE

n(y)− n(i, y)

n(y)

what if n(i,y)=0?

what if n(i,y)=n(y)?

Page 94: Graphical Models for the Internet

Naive NaiveBayes Classifier

p(y|x) ∝ n(y)

n

i:xi=TRUE

n(i, y)

n(y)

i:xi=FALSE

n(y)− n(i, y)

n(y)

what if n(i,y)=0?

what if n(i,y)=n(y)?

Page 95: Graphical Models for the Internet

Estimating Probabilities

Page 96: Graphical Models for the Internet

Binomial Distribution• Two outcomes (head, tail); (0,1)• Data likelihood

• Maximum Likelihood Estimation• Constrained optimization problem • Incorporate constraint via• Taking derivatives yields

p(X;π) = πn1(1− π)n0

θ = logn1

n0 + n1⇐⇒ p(x = 1) =

n1

n0 + n1

π ∈ [0, 1]

p(x; θ) =exθ

1 + eθ

Page 97: Graphical Models for the Internet

... in detail ...p(X; θ) =

n�

i=1

p(xi; θ) =n�

i=1

eθxi

1 + eθ

=⇒ log p(X; θ) = θn�

i=1

xi − n log�1 + eθ

=⇒ ∂θ log p(X; θ) =n�

i=1

xi − neθ

1 + eθ

⇐⇒ 1n

n�

i=1

xi =eθ

1 + eθ= p(x = 1)

Page 98: Graphical Models for the Internet

... in detail ...p(X; θ) =

n�

i=1

p(xi; θ) =n�

i=1

eθxi

1 + eθ

=⇒ log p(X; θ) = θn�

i=1

xi − n log�1 + eθ

=⇒ ∂θ log p(X; θ) =n�

i=1

xi − neθ

1 + eθ

⇐⇒ 1n

n�

i=1

xi =eθ

1 + eθ= p(x = 1)

empirical probability of x=1

Page 99: Graphical Models for the Internet

Discrete Distribution• n outcomes (e.g. USA, Canada, India, UK, NZ)• Data likelihood

• Maximum Likelihood Estimation• Constrained optimization problem ... or ...• Incorporate constraint via• Taking derivatives yields

p(x; θ) =exp θx�x� exp θx�

p(X;π) =�

i

πnii

θi = logni�j nj

⇐⇒ p(x = i) =ni�j nj

Page 100: Graphical Models for the Internet

Tossing a Dice

24

12060

12

Page 101: Graphical Models for the Internet

Tossing a Dice

24

12060

12

Page 102: Graphical Models for the Internet

Exponential Families

Page 103: Graphical Models for the Internet

Exponential Families

Page 104: Graphical Models for the Internet

Exponential Families

• Density function

p(x; θ) = exp (�φ(x), θ� − g(θ))

where g(θ) = log�

x�

exp (�φ(x�), θ�)

Page 105: Graphical Models for the Internet

Exponential Families

• Density function

• Log partition function generates cumulants

p(x; θ) = exp (�φ(x), θ� − g(θ))

where g(θ) = log�

x�

exp (�φ(x�), θ�)

∂θg(θ) = E [φ(x)]

∂2θg(θ) = Var [φ(x)]

Page 106: Graphical Models for the Internet

Exponential Families

• Density function

• Log partition function generates cumulants

• g is convex (second derivative is p.s.d.)

p(x; θ) = exp (�φ(x), θ� − g(θ))

where g(θ) = log�

x�

exp (�φ(x�), θ�)

∂θg(θ) = E [φ(x)]

∂2θg(θ) = Var [φ(x)]

Page 107: Graphical Models for the Internet

Examples

• Binomial Distribution• Discrete Distribution

(ex is unit vector for x)• Gaussian• Poisson (counting measure 1/x!)• Dirichlet, Beta, Gamma, Wishart, ...

φ(x) = x

φ(x) = ex

φ(x) =�

x,12xx�

φ(x) = x

Page 108: Graphical Models for the Internet

Normal Distribution

Page 109: Graphical Models for the Internet

Poisson Distribution

p(x;λ) =λxe−λ

x!

Page 110: Graphical Models for the Internet

Beta Distribution

p(x;α,β) =xα−1(1− x)β−1

B(α,β)

Page 111: Graphical Models for the Internet

Dirichlet Distribution

... this is a distribution over distributions ...

Page 112: Graphical Models for the Internet

Maximum Likelihood

Page 113: Graphical Models for the Internet

Maximum Likelihood• Negative log-likelihood

− log p(X; θ) =n�

i=1

g(θ) − �φ(xi), θ�

Page 114: Graphical Models for the Internet

Maximum Likelihood• Negative log-likelihood

• Taking derivatives

We pick the parameter such that the distribution matches the empirical average.

− log p(X; θ) =n�

i=1

g(θ) − �φ(xi), θ�

−∂θ log p(X; θ) = m

�E[φ(x)]− 1

m

n�

i=1

φ(xi)

empiricalaveragemean

Page 115: Graphical Models for the Internet

Conjugate Priors• Unless we have lots of data estimates are weak• Usually we have an idea of what to expect

we might even have ‘seen’ such data before• Solution: add ‘fake’ observations

• Inference (generalized Laplace smoothing)

p(θ|X) ∝ p(X|θ) · p(θ)

1n

n�

i=1

φ(xi) −→1

n + m

n�

i=1

φ(xi) +m

n + mµ0

p(θ) ∝ p(Xfake|θ) hence p(θ|X) ∝ p(X|θ)p(Xfake|θ) = p(X ∪Xfake|θ)

fake mean

fake count

Page 116: Graphical Models for the Internet

Example: Gaussian Estimation• Sufficient statistics: • Mean and variance given by

• Maximum Likelihood Estimate

• Maximum a Posteriori Estimate

x, x2

µ = Ex[x] and σ2 = Ex[x2]−E2

x[x]

µ̂ =1

n

n�

i=1

xi and σ2 =1

n

n�

i=1

x2i − µ̂2

µ̂ =1

n+ n0

n�

i=1

xi and σ2 =1

n+ n0

n�

i=1

x2i +

n0

n+ n01− µ̂2

smoother

smoother

Page 117: Graphical Models for the Internet

Collapsing• Conjugate priors

Hence we know how to compute normalization• Prediction

p(θ) ∝ p(Xfake|θ)

p(x|X) =

�p(x|θ)p(θ|X)dθ

∝�

p(x|θ)p(X|θ)p(Xfake|θ)dθ

=

�p({x} ∪X ∪Xfake|θ)dθ

look up closedform expansions

(Beta, binomial)(Dirichlet, multinomial)

(Gamma, Poisson)(Wishart, Gauss)

http://en.wikipedia.org/wiki/Exponential_family

Page 118: Graphical Models for the Internet

Conjugate Prior in action

p(x = i) =ni

n−→ p(x = i) =

ni + mi

n + m

Outcome 1 2 3 4 5 6

Counts 3 6 2 1 4 4

MLE 0.15 0.30 0.10 0.05 0.20 0.20

MAP (m0 = 6) 0.15 0.27 0.12 0.08 0.19 0.19

MAP (m0 = 100) 0.16 0.19 0.16 0.15 0.17 0.17

mi = m · [µ0]i

Page 119: Graphical Models for the Internet

Conjugate Prior in action• Discrete Distribution

• Tossing a dicep(x = i) =

ni

n−→ p(x = i) =

ni + mi

n + m

Outcome 1 2 3 4 5 6

Counts 3 6 2 1 4 4

MLE 0.15 0.30 0.10 0.05 0.20 0.20

MAP (m0 = 6) 0.15 0.27 0.12 0.08 0.19 0.19

MAP (m0 = 100) 0.16 0.19 0.16 0.15 0.17 0.17

mi = m · [µ0]i

Page 120: Graphical Models for the Internet

Conjugate Prior in action• Discrete Distribution

• Tossing a dice

• Rule of thumbneed 10 data points (or prior) per parameter

p(x = i) =ni

n−→ p(x = i) =

ni + mi

n + m

Outcome 1 2 3 4 5 6

Counts 3 6 2 1 4 4

MLE 0.15 0.30 0.10 0.05 0.20 0.20

MAP (m0 = 6) 0.15 0.27 0.12 0.08 0.19 0.19

MAP (m0 = 100) 0.16 0.19 0.16 0.15 0.17 0.17

mi = m · [µ0]i

Page 121: Graphical Models for the Internet

Honest dice

MLE

MAP

Page 122: Graphical Models for the Internet

Tainted dice

MLE

MAP

Page 123: Graphical Models for the Internet

Priors (part deux)• Parameter smoothing

• Posterior

• Convex optimization problem (MAP estimation)

p(θ|x) ∝m�

i=1

p(xi|θ)p(θ)

∝ exp

�m�

i=1

�φ(xi), θ� −mg(θ)− 1

2σ2�θ�22

minimizeθ

g(θ)−�

1

m

m�

i=1

φ(xi), θ

�+

1

2mσ2�θ�22

p(θ) ∝ exp(−λ �θ�1) or p(θ) ∝ exp(−λ �θ�22)

Page 124: Graphical Models for the Internet

Summary

• Basic statistics tools• Estimating probabilities (mainly scalar)• Exponential family introduction

Page 125: Graphical Models for the Internet

Part 4: Directed Graphical

Page 126: Graphical Models for the Internet

Basics

Page 127: Graphical Models for the Internet

... some Web 2.0 serviceMySQL Apache

Website

Page 128: Graphical Models for the Internet

... some Web 2.0 service

• Joint distribution (assume a and m are independent)

MySQL Apache

Website

p(m, a,w) = p(w|m, a)p(m)p(a)

Page 129: Graphical Models for the Internet

... some Web 2.0 service

• Joint distribution (assume a and m are independent)

• Explaining away

a and m are dependent conditioned on w

MySQL Apache

Website

p(m, a,w) = p(w|m, a)p(m)p(a)

p(m, a|w) =p(w|m, a)p(m)p(a)�

m�,a� p(w|m�, a�)p(m�)p(a�)

Page 130: Graphical Models for the Internet

... some Web 2.0 serviceMySQL Apache

Website

Page 131: Graphical Models for the Internet

... some Web 2.0 service

is working

MySQL is workingApache is working

MySQL Apache

Website

Page 132: Graphical Models for the Internet

... some Web 2.0 service

is working

MySQL is workingApache is working

is broken

At least one of the two services is broken

(not independent)

MySQL Apache

Website

Page 133: Graphical Models for the Internet

Directed graphical model

• Easier estimation• 15 parameters for full joint distribution• 1+1+3+1 for factorizing distribution

• Causal relations• Inference for unobserved variables

m a

w

m a

w

m a

w

uuser

action

Page 134: Graphical Models for the Internet

No loops allowed

Page 135: Graphical Models for the Internet

No loops allowed

Page 136: Graphical Models for the Internet

No loops allowed

Page 137: Graphical Models for the Internet

No loops allowed

Page 138: Graphical Models for the Internet

No loops allowed

p(c|e)p(e|c)

Page 139: Graphical Models for the Internet

No loops allowed

p(c|e)p(e) or p(e|c)p(c)

p(c|e)p(e|c)

Page 140: Graphical Models for the Internet

No loops allowed

p(c|e)p(e) or p(e|c)p(c)

p(c|e)p(e|c)

Page 141: Graphical Models for the Internet

Directed Graphical Model

• Joint probability distribution

• Parameter estimation• If x is fully observed the likelihood breaks up

• If x is partially observed things get interestingmaximization, EM, variational, sampling ...

p(x) =�

i

p(xi|xparents(i))

log p(x|θ) =�

i

log p(xi|xparents(i), θ)

Page 142: Graphical Models for the Internet

ChainsMarkov Chain

past past present future future

Page 143: Graphical Models for the Internet

ChainsMarkov Chain

past past present future future

Plate

Page 144: Graphical Models for the Internet

ChainsMarkov Chain

past past present future future

Plate

Hidden Markov Chain

observeduser action

user’smindset

Page 145: Graphical Models for the Internet

ChainsMarkov Chain

past past present future future

Plate

Hidden Markov Chain

observeduser action

user’smindset

user model for traversal through search results

Page 146: Graphical Models for the Internet

ChainsMarkov Chain

past past present future future

Plate

Hidden Markov Chain

observeduser action

user’smindset

user model for traversal through search results

Page 147: Graphical Models for the Internet

ChainsMarkov Chain

Hidden Markov Chain

observeduser action

user’smindset

user model for traversal through search results

p(x, y; θ) = p(x0; θ)n−1�

i=1

p(xi+1|xi; θ)n�

i=1

p(yi|xi)

p(x; θ) = p(x0; θ)n−1�

i=1

p(xi+1|xi; θ)

Plate

Page 148: Graphical Models for the Internet

Factor GraphsLatent Factors

ObservedEffects

Page 149: Graphical Models for the Internet

Factor Graphs

• Observed effectsClick behavior, queries, watched news, emails

Latent Factors

ObservedEffects

Page 150: Graphical Models for the Internet

Factor Graphs

• Observed effectsClick behavior, queries, watched news, emails

• Latent factorsUser profile, news content, hot keywords, social connectivity graph, events

Latent Factors

ObservedEffects

Page 151: Graphical Models for the Internet

Recommender Systems

r

u m

Page 152: Graphical Models for the Internet

Recommender Systems

• Users u• Movies m• Ratings r (but only for a subset of users)

r

u m

Page 153: Graphical Models for the Internet

Recommender Systems

• Users u• Movies m• Ratings r (but only for a subset of users)

r

u m

... intersecting plates ...(like nested for loops)

Page 154: Graphical Models for the Internet

Recommender Systems

• Users u• Movies m• Ratings r (but only for a subset of users)

r

u m

... intersecting plates ...(like nested for loops)

news, SearchMonkey

answerssocial

rankingOMG

personals

Page 155: Graphical Models for the Internet

Challengesyour job

my job

Page 156: Graphical Models for the Internet

Challenges• How to design models

• Common (engineering) sense• Computational tractability

your job

my job

Page 157: Graphical Models for the Internet

Challenges• How to design models

• Common (engineering) sense• Computational tractability

• Dependency analysis• Bayes ball (not in this lecture)

your job

my job

Page 158: Graphical Models for the Internet

Challenges• How to design models

• Common (engineering) sense• Computational tractability

• Dependency analysis• Bayes ball (not in this lecture)

• Inference• Easy for fully observed situations• Many algorithms if not fully observed• Dynamic programming / message passing

your job

my job

Page 159: Graphical Models for the Internet

Dynamic Programming 101

Page 160: Graphical Models for the Internet

Chains

x

p(x; θ) = p(x0; θ)n−1�

i=1

p(xi+1|xi; θ) x0 x1 x2 x3

p(xi) =�

x0,...xi−1,xi+1...xn

p(x0)� �� �:=l0(x0)

n�

j=1

p(xj |xj−1)

=�

x1,...xi−1,xi+1...xn

x0

[l0(x0)p(x1|x0)]

� �� �:=l1(x1)

n�

j=2

p(xj |xj−1)

=�

x2,...xi−1,xi+1...xn

x1

[l1(x1)p(x2|x1)]

� �� �:=l2(x2)

n�

j=3

p(xj |xj−1)

Page 161: Graphical Models for the Internet

Chains

x

p(x; θ) = p(x0; θ)n−1�

i=1

p(xi+1|xi; θ) x0 x1 x2 x3

p(xi) =�

x0,...xi−1,xi+1...xn

p(x0)� �� �:=l0(x0)

n�

j=1

p(xj |xj−1)

=�

x1,...xi−1,xi+1...xn

x0

[l0(x0)p(x1|x0)]

� �� �:=l1(x1)

n�

j=2

p(xj |xj−1)

=�

x2,...xi−1,xi+1...xn

x1

[l1(x1)p(x2|x1)]

� �� �:=l2(x2)

n�

j=3

p(xj |xj−1)

Page 162: Graphical Models for the Internet

Chains

x

p(x; θ) = p(x0; θ)n−1�

i=1

p(xi+1|xi; θ) x0 x1 x2 x3

p(xi) =�

x0,...xi−1,xi+1...xn

p(x0)� �� �:=l0(x0)

n�

j=1

p(xj |xj−1)

=�

x1,...xi−1,xi+1...xn

x0

[l0(x0)p(x1|x0)]

� �� �:=l1(x1)

n�

j=2

p(xj |xj−1)

=�

x2,...xi−1,xi+1...xn

x1

[l1(x1)p(x2|x1)]

� �� �:=l2(x2)

n�

j=3

p(xj |xj−1)

Page 163: Graphical Models for the Internet

Chains

x

p(x; θ) = p(x0; θ)n−1�

i=1

p(xi+1|xi; θ) x0 x1 x2 x3

p(xi) =�

x0,...xi−1,xi+1...xn

p(x0)� �� �:=l0(x0)

n�

j=1

p(xj |xj−1)

=�

x1,...xi−1,xi+1...xn

x0

[l0(x0)p(x1|x0)]

� �� �:=l1(x1)

n�

j=2

p(xj |xj−1)

=�

x2,...xi−1,xi+1...xn

x1

[l1(x1)p(x2|x1)]

� �� �:=l2(x2)

n�

j=3

p(xj |xj−1)

Page 164: Graphical Models for the Internet

Chains

x

p(x; θ) = p(x0; θ)n−1�

i=1

p(xi+1|xi; θ) x0 x1 x2 x3

p(xi) = li(xi)�

xi+1...xn

n−1�

j=i

p(xj+1|xj)

= li(xi)�

xi+1...xn−1

n−2�

j=i

p(xj+1|xj)�

xn

p(xn|xn−1)

� �� �:=rn−1(xn−1)

= li(xi)�

xi+1...xn−2

n−3�

j=i

p(xj+1|xj)�

xn−1

p(xn−1|xn−2)rn−1(xn−1)

� �� �:=rn−2(xn−2)

Page 165: Graphical Models for the Internet

Chainsp(x; θ) = p(x0; θ)

n−1�

i=1

p(xi+1|xi; θ) x0 x1 x2 x3

• Forward recursion

• Backward recursion

• Marginalization & conditioning

l0(x0) := p(x0) and li(xi) :=�

xi−1

li−1(xi−1)p(xi|xi−1)

rn(xn) := 1 and ri(xi) :=�

xi+1

ri+1(xi+1)p(xi+1|xi)

p(xi) = li(xi)ri(xi)

p(x−i|xi) =p(x)p(xi)

p(xi, xi+1) = li(xi)p(xi+1|xi)ri(xi+1)

Page 166: Graphical Models for the Internet

Chains

• Send forward messages starting from left node

• Send backward messages starting from right node

x0 x1 x2 x3 x4 x5

mi+1→i(xi) =�

xi+1

mi+2→i+1(xi+1)f(xi, xi+1)

mi−1→i(xi) =�

xi−1

mi−2→i−1(xi−1)f(xi−1, xi)

Page 167: Graphical Models for the Internet

Trees

• Forward/Backward messages as normal for chain• When we have more edges for a vertex use ...

x0 x1 x2

x3 x4 x5

x6 x7 x8

m2→3(x3) =�

x2

m1→2(x2)m6→2(x2)f(x2, x3)

m2→6(x6) =�

x2

m1→2(x2)m3→2(x2)f(x2, x6)

m2→1(x1) =�

x2

m3→2(x2)m6→2(x2)f(x1, x2)

Page 168: Graphical Models for the Internet

Trees

• Forward/Backward messages as normal for chain• When we have more edges for a vertex use ...

x0 x1 x2

x3 x4 x5

x6 x7 x8

m2→3(x3) =�

x2

m1→2(x2)m6→2(x2)f(x2, x3)

m2→6(x6) =�

x2

m1→2(x2)m3→2(x2)f(x2, x6)

m2→1(x1) =�

x2

m3→2(x2)m6→2(x2)f(x1, x2)

Page 169: Graphical Models for the Internet

Trees

• Forward/Backward messages as normal for chain• When we have more edges for a vertex use ...

x0 x1 x2

x3 x4 x5

x6 x7 x8

m2→3(x3) =�

x2

m1→2(x2)m6→2(x2)f(x2, x3)

m2→6(x6) =�

x2

m1→2(x2)m3→2(x2)f(x2, x6)

m2→1(x1) =�

x2

m3→2(x2)m6→2(x2)f(x1, x2)

Page 170: Graphical Models for the Internet

Trees

• Forward/Backward messages as normal for chain• When we have more edges for a vertex use ...

x0 x1 x2

x3 x4 x5

x6 x7 x8

m2→3(x3) =�

x2

m1→2(x2)m6→2(x2)f(x2, x3)

m2→6(x6) =�

x2

m1→2(x2)m3→2(x2)f(x2, x6)

m2→1(x1) =�

x2

m3→2(x2)m6→2(x2)f(x1, x2)

Page 171: Graphical Models for the Internet

Trees

• Forward/Backward messages as normal for chain• When we have more edges for a vertex use ...

x0 x1 x2

x3 x4 x5

x6 x7 x8

m2→3(x3) =�

x2

m1→2(x2)m6→2(x2)f(x2, x3)

m2→6(x6) =�

x2

m1→2(x2)m3→2(x2)f(x2, x6)

m2→1(x1) =�

x2

m3→2(x2)m6→2(x2)f(x1, x2)

Page 172: Graphical Models for the Internet

Trees

• Forward/Backward messages as normal for chain• When we have more edges for a vertex use ...

x0 x1 x2

x3 x4 x5

x6 x7 x8

m2→3(x3) =�

x2

m1→2(x2)m6→2(x2)f(x2, x3)

m2→6(x6) =�

x2

m1→2(x2)m3→2(x2)f(x2, x6)

m2→1(x1) =�

x2

m3→2(x2)m6→2(x2)f(x1, x2)

Page 173: Graphical Models for the Internet

Trees

• Forward/Backward messages as normal for chain• When we have more edges for a vertex use ...

x0 x1 x2

x3 x4 x5

x6 x7 x8

m2→3(x3) =�

x2

m1→2(x2)m6→2(x2)f(x2, x3)

m2→6(x6) =�

x2

m1→2(x2)m3→2(x2)f(x2, x6)

m2→1(x1) =�

x2

m3→2(x2)m6→2(x2)f(x1, x2)

Page 174: Graphical Models for the Internet

Junction Trees

1 23

4

mi→j(xj) =�

xi

f(xi, xj)�

l �=j

ml→i(xj)

clique potential

separator set

1,2 2,4,5

2,3,4

4,5,7

2

2,4

4,5

m245→234(x24)

=�

x5

f(x245)m12→245(x2)m457→245(x45)

clique potential

Page 175: Graphical Models for the Internet

Junction Trees

1 23

4

mi→j(xj) =�

xi

f(xi, xj)�

l �=j

ml→i(xj)

clique potential

separator set

1,2 2,4,5

2,3,4

4,5,7

2

2,4

4,5

m245→234(x24)

=�

x5

f(x245)m12→245(x2)m457→245(x45)

clique potential

Page 176: Graphical Models for the Internet

Junction Trees

1 23

4

mi→j(xj) =�

xi

f(xi, xj)�

l �=j

ml→i(xj)

clique potential

separator set

1,2 2,4,5

2,3,4

4,5,7

2

2,4

4,5

m245→234(x24)

=�

x5

f(x245)m12→245(x2)m457→245(x45)

clique potential

Page 177: Graphical Models for the Internet

Junction Trees

1 23

4

mi→j(xj) =�

xi

f(xi, xj)�

l �=j

ml→i(xj)

clique potential

separator set

1,2 2,4,5

2,3,4

4,5,7

2

2,4

4,5

m245→234(x24)

=�

x5

f(x245)m12→245(x2)m457→245(x45)

clique potential

Page 178: Graphical Models for the Internet

Junction Trees

1 23

4

mi→j(xj) =�

xi

f(xi, xj)�

l �=j

ml→i(xj)

clique potential

separator set

1,2 2,4,5

2,3,4

4,5,7

2

2,4

4,5

m245→234(x24)

=�

x5

f(x245)m12→245(x2)m457→245(x45)

clique potential

Page 179: Graphical Models for the Internet

Junction Trees

1 23

4

mi→j(xj) =�

xi

f(xi, xj)�

l �=j

ml→i(xj)

clique potential

separator set

1,2 2,4,5

2,3,4

4,5,7

2

2,4

4,5

m245→234(x24)

=�

x5

f(x245)m12→245(x2)m457→245(x45)

clique potential

Page 180: Graphical Models for the Internet

Junction Trees

1 23

4

mi→j(xj) =�

xi

f(xi, xj)�

l �=j

ml→i(xj)

clique potential

separator set

1,2 2,4,5

2,3,4

4,5,7

2

2,4

4,5

m245→234(x24)

=�

x5

f(x245)m12→245(x2)m457→245(x45)

clique potential

Page 181: Graphical Models for the Internet

Junction Trees

1 23

4

mi→j(xj) =�

xi

f(xi, xj)�

l �=j

ml→i(xj)

clique potential separator

set

1,2 2,4,5

2,3,4

4,5,7

2

2,4

4,5

m245→234(x24)

=�

x5

f(x245)m12→245(x2)m457→245(x45)

clique potential

Page 182: Graphical Models for the Internet

Junction Trees

1 23

4

mi→j(xj) =�

xi

f(xi, xj)�

l �=j

ml→i(xj)

clique potential separator

set

1,2 2,4,5

2,3,4

4,5,7

2

2,4

4,5

m245→234(x24)

=�

x5

f(x245)m12→245(x2)m457→245(x45)

clique potential

Page 183: Graphical Models for the Internet

Junction Trees

1 23

4

mi→j(xj) =�

xi

f(xi, xj)�

l �=j

ml→i(xj)

clique potential separator

set

1,2 2,4,5

2,3,4

4,5,7

2

2,4

4,5

m245→234(x24)

=�

x5

f(x245)m12→245(x2)m457→245(x45)

clique potential

Page 184: Graphical Models for the Internet

Junction Trees

1 23

4

mi→j(xj) =�

xi

f(xi, xj)�

l �=j

ml→i(xj)

clique potential separator

set

1,2 2,4,5

2,3,4

4,5,7

2

2,4

4,5

m245→234(x24)

=�

x5

f(x245)m12→245(x2)m457→245(x45)

clique potential

Page 185: Graphical Models for the Internet

Junction Trees

1 23

4

mi→j(xj) =�

xi

f(xi, xj)�

l �=j

ml→i(xj)

clique potential separator

set

1,2 2,4,5

2,3,4

4,5,7

2

2,4

4,5

m245→234(x24)

=�

x5

f(x245)m12→245(x2)m457→245(x45)

clique potential

Page 186: Graphical Models for the Internet

Junction Trees

1 23

4

mi→j(xj) =�

xi

f(xi, xj)�

l �=j

ml→i(xj)

clique potential separator

set

1,2 2,4,5

2,3,4

4,5,7

2

2,4

4,5

m245→234(x24)

=�

x5

f(x245)m12→245(x2)m457→245(x45)

clique potential

Page 187: Graphical Models for the Internet

Junction Trees

1 23

4

mi→j(xj) =�

xi

f(xi, xj)�

l �=j

ml→i(xj)

clique potential separator

set

1,2 2,4,5

2,3,4

4,5,7

2

2,4

4,5

m245→234(x24)

=�

x5

f(x245)m12→245(x2)m457→245(x45)

clique potential

Page 188: Graphical Models for the Internet

Junction Trees

1 23

4

mi→j(xj) =�

xi

f(xi, xj)�

l �=j

ml→i(xj)

clique potential separator

set

1,2 2,4,5

2,3,4

4,5,7

2

2,4

4,5

m245→234(x24)

=�

x5

f(x245)m12→245(x2)m457→245(x45)

clique potential

Page 189: Graphical Models for the Internet

Generalized Distributive Law

• Key IdeaDynamic programming uses only sums and multiplications, hence replace them with equivalent operations from other semirings

• Semiring• ‘addition’ and ‘summation’ equivalent• Associative law: (a+b)+c = a+(b+c)• Distributive law: a(b+c) = ab + ac

Page 190: Graphical Models for the Internet

Generalized Distributive Law• Integrating out probabilities (sum, product)

• Finding the maximum (max, +)

• Set algebra (union, intersection)

• Boolean semiring (AND, OR)• Probability semiring (log +, +)• Tropical semiring (min, +)

a · (b+ c) = a · b+ a · c

a+max(b, c) = max(a+ b, a+ c)

A ∪ (B ∩ C) = (A ∪B) ∩ (A ∪ C)

Page 191: Graphical Models for the Internet

Chains ... again

x

x0 x1 x2 x3s̄ = maxx

s(x0) +n−1�

i=1

s(xi+1|xi)

s̄ = maxx0...n

s(x0)� �� �:=l0(x0)

+n�

j=1

s(xj |xj−1)

= maxx1...n

maxx0

[l0(x0)s(x1|x0)]� �� �

:=l1(x1)

+n�

j=2

s(xj |xj−1)

= maxx2...n

maxx1

[l1(x1)s(x2|x1)]� �� �

:=l2(x2)

+n�

j=3

s(xj |xj−1)

Page 192: Graphical Models for the Internet

m245→234(x24)

=maxx5

f(x245) +m12→245(x2) +m457→245(x45)

Junction Trees

1 23

4

clique potential separator

set

1,2 2,4,5

2,3,4

4,5,7

2

2,4

4,5

clique potential

mi→j(xj) = maxxi

f(xi, xj) +�

l �=j

ml→i(xj)

Page 193: Graphical Models for the Internet

m245→234(x24)

=maxx5

f(x245) +m12→245(x2) +m457→245(x45)

Junction Trees

1 23

4

clique potential separator

set

1,2 2,4,5

2,3,4

4,5,7

2

2,4

4,5

clique potential

mi→j(xj) = maxxi

f(xi, xj) +�

l �=j

ml→i(xj)

Page 194: Graphical Models for the Internet

m245→234(x24)

=maxx5

f(x245) +m12→245(x2) +m457→245(x45)

Junction Trees

1 23

4

clique potential separator

set

1,2 2,4,5

2,3,4

4,5,7

2

2,4

4,5

clique potential

mi→j(xj) = maxxi

f(xi, xj) +�

l �=j

ml→i(xj)

Page 195: Graphical Models for the Internet

m245→234(x24)

=maxx5

f(x245) +m12→245(x2) +m457→245(x45)

Junction Trees

1 23

4

clique potential separator

set

1,2 2,4,5

2,3,4

4,5,7

2

2,4

4,5

clique potential

mi→j(xj) = maxxi

f(xi, xj) +�

l �=j

ml→i(xj)

Page 196: Graphical Models for the Internet

m245→234(x24)

=maxx5

f(x245) +m12→245(x2) +m457→245(x45)

Junction Trees

1 23

4

clique potential separator

set

1,2 2,4,5

2,3,4

4,5,7

2

2,4

4,5

clique potential

mi→j(xj) = maxxi

f(xi, xj) +�

l �=j

ml→i(xj)

Page 197: Graphical Models for the Internet

m245→234(x24)

=maxx5

f(x245) +m12→245(x2) +m457→245(x45)

Junction Trees

1 23

4

clique potential separator

set

1,2 2,4,5

2,3,4

4,5,7

2

2,4

4,5

clique potential

mi→j(xj) = maxxi

f(xi, xj) +�

l �=j

ml→i(xj)

Page 198: Graphical Models for the Internet

m245→234(x24)

=maxx5

f(x245) +m12→245(x2) +m457→245(x45)

Junction Trees

1 23

4

clique potential separator

set

1,2 2,4,5

2,3,4

4,5,7

2

2,4

4,5

clique potential

mi→j(xj) = maxxi

f(xi, xj) +�

l �=j

ml→i(xj)

Page 199: Graphical Models for the Internet

m245→234(x24)

=maxx5

f(x245) +m12→245(x2) +m457→245(x45)

Junction Trees

1 23

4

clique potential separator

set

1,2 2,4,5

2,3,4

4,5,7

2

2,4

4,5

clique potential

mi→j(xj) = maxxi

f(xi, xj) +�

l �=j

ml→i(xj)

Page 200: Graphical Models for the Internet

No loops allowed

x1

x2 x3

x4

If we use it anyway --- Loopy Belief Propagation(Turbo Codes, Markov Random Fields, etc.)

s(x1, x2) + s(x2, x3) + s(x3, x4) + s(x4, x1)

Page 201: Graphical Models for the Internet

No loops allowed

x1

x2 x3

x4

If we use it anyway --- Loopy Belief Propagation(Turbo Codes, Markov Random Fields, etc.)

s(x1, x2) + s(x2, x3) + s(x3, x4) + s(x4, x1)

Page 202: Graphical Models for the Internet

No loops allowed

x1

x2 x3

x4

If we use it anyway --- Loopy Belief Propagation(Turbo Codes, Markov Random Fields, etc.)

s(x1, x2) + s(x2, x3) + s(x3, x4) + s(x4, x1)

Page 203: Graphical Models for the Internet

No loops allowed

x1

x2 x3

x4

If we use it anyway --- Loopy Belief Propagation(Turbo Codes, Markov Random Fields, etc.)

s(x1, x2) + s(x2, x3) + s(x3, x4) + s(x4, x1)

Page 204: Graphical Models for the Internet

No loops allowed

x1

x2 x3

x4

If we use it anyway --- Loopy Belief Propagation(Turbo Codes, Markov Random Fields, etc.)

s(x1, x2) + s(x2, x3) + s(x3, x4) + s(x4, x1)

Page 205: Graphical Models for the Internet

No loops allowed

x1

x2 x3

x4

If we use it anyway --- Loopy Belief Propagation(Turbo Codes, Markov Random Fields, etc.)

s(x1, x2) + s(x2, x3) + s(x3, x4) + s(x4, x1)

Page 206: Graphical Models for the Internet

No loops allowed

x1

x2 x3

x4

If we use it anyway --- Loopy Belief Propagation(Turbo Codes, Markov Random Fields, etc.)

s(x1, x2) + s(x2, x3) + s(x3, x4) + s(x4, x1)

s(x1, x2) + s(x4, x1)

s(x2, x3) + s(x3, x4)

Page 207: Graphical Models for the Internet

Clustering

Page 208: Graphical Models for the Internet

Basic Idea

y1

x1

y2

xi

y3

xi

ym

xm

Θ

...

μk,Σk

yi

xi

Θ

μj,Σj

i=1..m

yi

xi

Θ

i=1..m

α

β

j=1..k

μ1,Σ1 μj,Σj

j=1..k

...

Page 209: Graphical Models for the Internet

Basic Idea

y1

x1

y2

xi

y3

xi

ym

xm

Θ

...

μk,Σk

yi

xi

Θ

μj,Σj

i=1..m

yi

xi

Θ

i=1..m

α

β

j=1..k

μ1,Σ1 μj,Σj

j=1..k

...

Page 210: Graphical Models for the Internet

Basic Idea

y1

x1

y2

xi

y3

xi

ym

xm

Θ

...

μk,Σk

yi

xi

Θ

μj,Σj

i=1..m

yi

xi

Θ

i=1..m

α

β

j=1..k

μ1,Σ1 μj,Σj

j=1..k

...

Page 211: Graphical Models for the Internet

Basic Idea

y1

x1

y2

xi

y3

xi

ym

xm

Θ

...

μk,Σk

yi

xi

Θ

μj,Σj

i=1..m

yi

xi

Θ

i=1..m

α

β

j=1..k

μ1,Σ1 μj,Σj

j=1..k

...

p(X,Y |θ,σ, µ) =n�

i=1

p(xi|yi, σ, µ)p(yi|θ)

Page 212: Graphical Models for the Internet

What can we cluster?

Page 213: Graphical Models for the Internet

What can we cluster?

textnews users

mails

queries

urls

ads

products

eventslocations

spammers

abuse

Page 214: Graphical Models for the Internet

Mixture of Gaussians• Draw cluster label y from discrete distribution• Draw data x from Gaussian for given cluster y

• Prior for discrete distribution - Dirichlet• Prior for Gaussians - Gauss-Wishart distribution

• Problem: we don’t know y• If we knew the parameters we could get y• If we knew y we could get the parameters

Page 215: Graphical Models for the Internet

k-means• Fixed uniform variance for all Gaussians• Fixed uniform distribution over clusters• Initialize centers with random subset of points• Find most likely cluster y for x

• Find most likely center for given cluster

• Repeat until converged

yi = argmaxy

p(xi|y,σ, µ)

µy =1ny

i

{yi = y}xi

Page 216: Graphical Models for the Internet

k-means

• Pro• simple algorithm• can be implemented by MapReduce passes

• Con• no proper probabilistic representation• can get stuck easily in local minima

Page 217: Graphical Models for the Internet

k-means

initialization

partitioning

update

partitioning

Page 218: Graphical Models for the Internet

k-means

initialization

partitioning

update

partitioning

Page 219: Graphical Models for the Internet

Expectation Maximization

Page 220: Graphical Models for the Internet

Expectation Maximization

• Optimization Problem

This problem is nonconvex and difficult to solve• Key idea

If we knew p(y|x) we could estimate the remaining parameters easily and vice versa

maximizeθ,µ,σ

p(X|θ,σ, µ) = maximizeθ,µ,σ

Y

n�

i=1

p(xi|yi, σ, µ)p(yi|θ)

Page 221: Graphical Models for the Internet

100 1 2 3 4 5 6 7 8 9

10

0

1

2

3

4

5

6

7

8

9

parameters

nega

tive

Log

-lik

elih

ood

1

2

34

Nonconvex Minimization

•Find convex upper bound•Minimize it

Page 222: Graphical Models for the Internet

Expectation Maximization

Page 223: Graphical Models for the Internet

Expectation Maximization• Variational Bound

This inequality is tight for p(y|x) = q(y)

log p(x; θ) ≥ log p(x; θ)−D(q(y)�p(y|x; θ))

=�

dq(y) [log p(x; θ) + log p(y|x; θ)− log q(y)]

=�

dq(y) log p(x, y; θ)−�

dq(y) log q(y)

Page 224: Graphical Models for the Internet

Expectation Maximization• Variational Bound

This inequality is tight for p(y|x) = q(y)• Expectation step

log p(x; θ) ≥ log p(x; θ)−D(q(y)�p(y|x; θ))

=�

dq(y) [log p(x; θ) + log p(y|x; θ)− log q(y)]

=�

dq(y) log p(x, y; θ)−�

dq(y) log q(y)

q(y) = p(y|x; θ)

Page 225: Graphical Models for the Internet

Expectation Maximization• Variational Bound

This inequality is tight for p(y|x) = q(y)• Expectation step

log p(x; θ) ≥ log p(x; θ)−D(q(y)�p(y|x; θ))

=�

dq(y) [log p(x; θ) + log p(y|x; θ)− log q(y)]

=�

dq(y) log p(x, y; θ)−�

dq(y) log q(y)

q(y) = p(y|x; θ)find bound

Page 226: Graphical Models for the Internet

Expectation Maximization• Variational Bound

This inequality is tight for p(y|x) = q(y)• Expectation step

• Maximization step

log p(x; θ) ≥ log p(x; θ)−D(q(y)�p(y|x; θ))

=�

dq(y) [log p(x; θ) + log p(y|x; θ)− log q(y)]

=�

dq(y) log p(x, y; θ)−�

dq(y) log q(y)

q(y) = p(y|x; θ)

θ∗ = argmaxθ

�dq(y) log p(x, y; θ)

find bound

Page 227: Graphical Models for the Internet

Expectation Maximization• Variational Bound

This inequality is tight for p(y|x) = q(y)• Expectation step

• Maximization step

log p(x; θ) ≥ log p(x; θ)−D(q(y)�p(y|x; θ))

=�

dq(y) [log p(x; θ) + log p(y|x; θ)− log q(y)]

=�

dq(y) log p(x, y; θ)−�

dq(y) log q(y)

q(y) = p(y|x; θ)

θ∗ = argmaxθ

�dq(y) log p(x, y; θ)maximize it

find bound

Page 228: Graphical Models for the Internet

Expectation Step

qi(y) ∝ p(xi|yi, µ,σ)p(yi|θ) hence

miy :=1

(2π) d2 |Σy| 1

2exp

�−1

2(xi − µy)Σ−1

y (xi − µy)�

p(y)

qi(y) =miy�y� miy�

• Factorizing distribution

• E-Stepq(Y ) =

i

qi(y)

Page 229: Graphical Models for the Internet

Maximization Step• Log-likelihood

• Cluster distribution(weighted Gaussian MLE)

• Cluster probabilities

log p(X,Y |θ, µ,σ) =n�

i=1

log p(xi|yi, µ,σ) + log p(yi|θ)

µy =1ny

n�

i=1

qi(y)xi

Σy =1ny

n�

i=1

qi(y)xix�i − µyµ�y

ny =�

i

qi(y)

θ∗ = argmaxθ

n�

i=1

y

qi(y) log p(yi|θ) hence p(y|θ) =ny

n

Page 230: Graphical Models for the Internet

EM Clustering in action

Page 231: Graphical Models for the Internet

Problem

Estimates will diverge (infinite variance, zero probability, tiny clusters)

Page 232: Graphical Models for the Internet

Solution• Use priors for

• Dirichlet distribution for cluster probabilities• Gauss-Wishart for Gaussian

• Cluster distribution

• Cluster probabilities

µ, σ, θ

ny = n0 +�

i

qi(y)µy =

1ny

n�

i=1

qi(y)xi

Σy =1ny

n�

i=1

qi(y)xix�i +

n0

ny1− µyµ�y

p(y|θ) =ny

n + k · n0

Page 233: Graphical Models for the Internet

Variational Approximation

• Lower bound on log-likelihood

• Key insight - inequality holds for any q• Find q within subset Q to tighten inequality• Find parameters to maximize for fixed q

• Inference for graphical models where joint probability computation is infeasible

log p(x; θ) ≥�

dq(y) log p(x, y; θ)−�

dq(y) log q(y)

Page 234: Graphical Models for the Internet

Beyond mixtures

taxonomies

topics

chains

Page 235: Graphical Models for the Internet

Sampling

Page 236: Graphical Models for the Internet

1 2 3 4 5 6 7 8 9 10

0.6

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0.45

0.5

0.55

parameter1

den

sity

Mod

e

Mod

e

Mea

n

Is maximization (always) good?

p(θ|X) ∝ p(X|θ)p(θ)

Page 237: Graphical Models for the Internet

Sampling• Key idea

• Want accurate distribution of the posterior• Sample from posterior distribution rather than

maximizing it• Problem - direct sampling is usually intractable• Solutions

• Markov Chain Monte Carlo (complicated)• Gibbs Sampling (somewhat simpler)

x ∼ p(x|x�) and then x� ∼ p(x�|x)

Page 238: Graphical Models for the Internet

Gibbs sampling• Gibbs sampling:

• In most cases direct sampling not possible• Draw one set of variables at a time

0.45 0.050.05 0.45

Page 239: Graphical Models for the Internet

Gibbs sampling• Gibbs sampling:

• In most cases direct sampling not possible• Draw one set of variables at a time

0.45 0.050.05 0.45

(b,g) - draw p(.,g)

Page 240: Graphical Models for the Internet

Gibbs sampling• Gibbs sampling:

• In most cases direct sampling not possible• Draw one set of variables at a time

0.45 0.050.05 0.45

(b,g) - draw p(.,g)(g,g) - draw p(g,.)

Page 241: Graphical Models for the Internet

Gibbs sampling• Gibbs sampling:

• In most cases direct sampling not possible• Draw one set of variables at a time

0.45 0.050.05 0.45

(b,g) - draw p(.,g)(g,g) - draw p(g,.)(g,g) - draw p(.,g)

Page 242: Graphical Models for the Internet

Gibbs sampling• Gibbs sampling:

• In most cases direct sampling not possible• Draw one set of variables at a time

0.45 0.050.05 0.45

(b,g) - draw p(.,g)(g,g) - draw p(g,.)(g,g) - draw p(.,g)(b,g) - draw p(b,.)

Page 243: Graphical Models for the Internet

Gibbs sampling• Gibbs sampling:

• In most cases direct sampling not possible• Draw one set of variables at a time

0.45 0.050.05 0.45

(b,g) - draw p(.,g)(g,g) - draw p(g,.)(g,g) - draw p(.,g)(b,g) - draw p(b,.)(b,b) ...

Page 244: Graphical Models for the Internet

Gibbs sampling for clustering

Page 245: Graphical Models for the Internet

Gibbs sampling for clustering

randominitialization

Page 246: Graphical Models for the Internet

Gibbs sampling for clustering

sample cluster labels

Page 247: Graphical Models for the Internet

Gibbs sampling for clustering

resamplecluster model

Page 248: Graphical Models for the Internet

Gibbs sampling for clustering

resamplecluster labels

Page 249: Graphical Models for the Internet

Gibbs sampling for clustering

resamplecluster model

Page 250: Graphical Models for the Internet

Gibbs sampling for clustering

resamplecluster labels

Page 251: Graphical Models for the Internet

Gibbs sampling for clustering

resamplecluster model e.g. Mahout Dirichlet Process Clustering

Page 252: Graphical Models for the Internet

Inference Algorithm ≠ Model

Page 253: Graphical Models for the Internet

Inference Algorithm ≠ ModelCorollary: EM ≠ Clustering

Page 254: Graphical Models for the Internet

Graphical Models Zoology

Page 255: Graphical Models for the Internet
Page 256: Graphical Models for the Internet

Models

InferenceMethods

Statistics

EfficientComputation

Page 257: Graphical Models for the Internet

Models

InferenceMethods

Statistics

EfficientComputation

ExponentialFamilies

ConjugatePrior

l1, l2 Priors

l1, l2 Priors

Page 258: Graphical Models for the Internet

Models

InferenceMethods

Statistics

EfficientComputation

ExponentialFamilies

ConjugatePrior

l1, l2 Priors

l1, l2 Priors

MixturesClusters

ChainsHMM

FactorModels

MatrixFactorization

MRFCRF

directedundirected

Page 259: Graphical Models for the Internet

Models

InferenceMethods

Statistics

EfficientComputation

Exact

GibbsSampling

k-means(direct)

variationalEM

ExponentialFamilies

ConjugatePrior

l1, l2 Priors

l1, l2 Priors

MixturesClusters

ChainsHMM

FactorModels

MatrixFactorization

MRFCRF

directedundirected

Page 260: Graphical Models for the Internet

Models

InferenceMethods

Statistics

EfficientComputation

DynamicProgramming

MessagePassing Convex

Optimization

Exact

GibbsSampling

k-means(direct)

variationalEM

ExponentialFamilies

ConjugatePrior

l1, l2 Priors

l1, l2 Priors

MixturesClusters

ChainsHMM

FactorModels

MatrixFactorization

MRFCRF

directedundirected

Page 261: Graphical Models for the Internet
Page 262: Graphical Models for the Internet

Debugging

SpamFiltering

Clustering

DocumentUnderstanding User

Modeling

Advertising

Exploration

Segmentation

Prediction(time series)

Classification

Annotation

NoveltyDetection

PerformanceTuning

SystemDesign

Page 263: Graphical Models for the Internet

‘Unsupervised’ Models

Θ

x x x

Density Estimation

Θ

x

Novelty Detectionforecasting

intrusion detection

webpagesnewsusersads

queriesimages

Page 264: Graphical Models for the Internet

‘Unsupervised’ Models

Θ

x x x

Density Estimation

Θ

x

Θ

y y y

Clustering

x x xw

Θ

y

xw

Novelty Detectionforecasting

intrusion detection

webpagesnewsusersads

queriesimages

Page 265: Graphical Models for the Internet

‘Unsupervised’ Models

Θ

x x x

Density Estimation

Θ

x

x x’ x’’

FactorAnalysis

Θ’Θ

Page 266: Graphical Models for the Internet

‘Supervised’ ModelsΘClassification

Regression

y y y

w

Θ

yw

x x x xspam filteringtiering

crawlingcategorizationbid estimation

tagging

Page 267: Graphical Models for the Internet

Chains

Θ

x x x

MarkovChain

Θ

x

Page 268: Graphical Models for the Internet

Chains

Θ

x x x

MarkovChain

Θ

x

Θ

y y y

HiddenMarkov ModelKalman Filter

x x xw

Θ

y

xw

Page 269: Graphical Models for the Internet

Collaborative Models

Page 270: Graphical Models for the Internet

Collaborative Models

r

CollaborativeFiltering

u m

Page 271: Graphical Models for the Internet

Collaborative Models

r

CollaborativeFiltering

u m

dWebpageRanking

r

q

Current

Page 272: Graphical Models for the Internet

Collaborative Models

r

CollaborativeFiltering

u m

d’ q’dWebpageRanking

r

q

Page 273: Graphical Models for the Internet

Collaborative Models

r

CollaborativeFiltering

u m

d’ q’dWebpageRanking

r

q

no obviousfeatures

Page 274: Graphical Models for the Internet

Collaborative Models

r

CollaborativeFiltering

u m

d’ q’dWebpageRanking

r

q

no obviousfeatures massive

feature engineering

Page 275: Graphical Models for the Internet

Collaborative Models

r

CollaborativeFiltering

u m

d’ q’dWebpageRanking

r

q

u u’personalized

no obviousfeatures massive

feature engineering

Page 276: Graphical Models for the Internet

Data Integration

uWebpageRanking

d

r

q

Page 277: Graphical Models for the Internet

Data Integration

uWebpageRanking

d

r

q

a

cDisplay

Advertising

Page 278: Graphical Models for the Internet

Data Integration

uWebpageRanking

d

r

q

a

cDisplay

Advertising

Page 279: Graphical Models for the Internet

Data Integration

uWebpageRanking

d

r

q

a

cDisplay

Advertising

nNews d

Page 280: Graphical Models for the Internet

Data Integration

uWebpageRanking

d

r

q

a

cDisplay

Advertising

nNews d

p

Answerst

Page 281: Graphical Models for the Internet

Topic Models

Page 282: Graphical Models for the Internet

Topic Models

β

αTopic

Models

z

θ

w

Ψ

Page 283: Graphical Models for the Internet

Topic Models

β

αTopic

Models

z

θ

w

Ψ

SimplicalMixtures

Page 284: Graphical Models for the Internet

Topic Models

β

αTopic

Models

z

θ

w

Ψ

SimplicalMixtures

UpstreamConditioning

DownstreamConditioning

zd

zu

Page 285: Graphical Models for the Internet

Part 5 - Scalable Topic Models

Page 286: Graphical Models for the Internet

Topic models

Page 287: Graphical Models for the Internet

Grouping objects

Page 288: Graphical Models for the Internet

Grouping objects

Singapore

Page 289: Graphical Models for the Internet

Grouping objects

Page 290: Graphical Models for the Internet

Grouping objects

Page 291: Graphical Models for the Internet

Grouping objects

airline

restaurant

university

Page 292: Graphical Models for the Internet

Grouping objects

Australia

Singapore

USA

Page 293: Graphical Models for the Internet

Topic Models

USA airline

Singaporeairline

Singapore food

USA food

Singapore university

Australia university

Page 294: Graphical Models for the Internet

Clustering & Topic ModelsClustering Topics

?

group objectsby prototypes

decompose objectsinto prototypes

Page 295: Graphical Models for the Internet

Clustering & Topic ModelsClustering Topics

?

group objectsby prototypes

decompose objectsinto prototypes

Page 296: Graphical Models for the Internet

Clustering & Topic Models

x

y

θ

prior

cluster probability

cluster label

instance x

y

θ

prior

topic probability

topic label

instance

clustering Latent Dirichlet Allocation

α α

Page 297: Graphical Models for the Internet

Clustering & Topic Models

DocumentsmembershipCluster/

topicdistributions

x =

clustering: (0, 1) matrixtopic model: stochastic matrixLSI: arbitrary matrices

Page 298: Graphical Models for the Internet

Topics in text

Latent Dirichlet Allocation; Blei, Ng, Jordan, JMLR 2003

Page 299: Graphical Models for the Internet

Collapsed Gibbs Sampler

Page 300: Graphical Models for the Internet

Joint Probability Distribution

xij

zij

θi

language prior

topic probability

topic label

instance

α

ψkβ

p(θ, z,ψ, x|α,β)

=K�

k=1

p(ψk|β)m�

i=1

p(θi|α)

m,mi�

i,j

p(zij |θi)p(xij |zij ,ψ)

Page 301: Graphical Models for the Internet

sample zindependently

sample θindependently

Joint Probability Distribution

xij

zij

θi

language prior

topic probability

topic label

instance

α

ψkβ

p(θ, z,ψ, x|α,β)

=K�

k=1

p(ψk|β)m�

i=1

p(θi|α)

m,mi�

i,j

p(zij |θi)p(xij |zij ,ψ)

sample Ψindependently

Page 302: Graphical Models for the Internet

sample zindependently

sample θindependently

Joint Probability Distribution

xij

zij

θi

language prior

topic probability

topic label

instance

α

ψkβ

p(θ, z,ψ, x|α,β)

=K�

k=1

p(ψk|β)m�

i=1

p(θi|α)

m,mi�

i,j

p(zij |θi)p(xij |zij ,ψ)

sample Ψindependently slow

Page 303: Graphical Models for the Internet

Collapsed Sampler

xij

zij

θi

language prior

topic probability

topic label

instance

α

ψkβ

p(z, x|α,β)

=m�

i=1

p(zi|α)k�

k=1

p({xij |zij = k} |β)

Page 304: Graphical Models for the Internet

sample zsequentially

Collapsed Sampler

xij

zij

θi

language prior

topic probability

topic label

instance

α

ψkβ

p(z, x|α,β)

=m�

i=1

p(zi|α)k�

k=1

p({xij |zij = k} |β)

Page 305: Graphical Models for the Internet

sample zsequentially

Collapsed Sampler

xij

zij

θi

language prior

topic probability

topic label

instance

α

ψkβ

p(z, x|α,β)

=m�

i=1

p(zi|α)k�

k=1

p({xij |zij = k} |β)fast

Page 306: Graphical Models for the Internet

Collapsed Sampler

xij

zij

θi

language prior

topic probability

topic label

instance

α

ψkβ

p(z, x|α,β)

=m�

i=1

p(zi|α)k�

k=1

p({xij |zij = k} |β)

n−ij(t, w) + βt

n−i(t) +�

t βt

n−ij(t, d) + αt

n−i(d) +�

t αt

Griffiths & Steyvers, 2005

Page 307: Graphical Models for the Internet

Collapsed Sampler

xij

zij

θi

language prior

topic probability

topic label

instance

α

ψkβ

p(z, x|α,β)

=m�

i=1

p(zi|α)k�

k=1

p({xij |zij = k} |β)fast

n−ij(t, w) + βt

n−i(t) +�

t βt

n−ij(t, d) + αt

n−i(d) +�

t αt

Griffiths & Steyvers, 2005

Page 308: Graphical Models for the Internet

Sequential Algorithm (Gibbs sampler)

• For 1000 iterations do• For each document do

• For each word in the document do• Resample topic for the word• Update local (document, topic) table• Update CPU local (word, topic) table• Update global (word, topic) table

Page 309: Graphical Models for the Internet

Sequential Algorithm (Gibbs sampler)

this kills parallelism

• For 1000 iterations do• For each document do

• For each word in the document do• Resample topic for the word• Update local (document, topic) table• Update CPU local (word, topic) table• Update global (word, topic) table

Page 310: Graphical Models for the Internet

• For 1000 iterations do• For each document do

• For each word in the document do• Resample topic for the word• Update local (document, topic) table• Update CPU local (word, topic) table

• Update global (word, topic) table

State of the artUMass Mallet, UC Irvine, Google

p(t|wij) ∝ βwαt

n(t) + β̄+ βw

n(t, d = i)n(t) + β̄

+n(t, w = wij) [n(t, d = i) + αt]

n(t) + β̄

Page 311: Graphical Models for the Internet

• For 1000 iterations do• For each document do

• For each word in the document do• Resample topic for the word• Update local (document, topic) table• Update CPU local (word, topic) table

• Update global (word, topic) table

State of the artUMass Mallet, UC Irvine, Google

p(t|wij) ∝ βwαt

n(t) + β̄+ βw

n(t, d = i)n(t) + β̄

+n(t, w = wij) [n(t, d = i) + αt]

n(t) + β̄

slow

Page 312: Graphical Models for the Internet

• For 1000 iterations do• For each document do

• For each word in the document do• Resample topic for the word• Update local (document, topic) table• Update CPU local (word, topic) table

• Update global (word, topic) table

State of the artUMass Mallet, UC Irvine, Google

p(t|wij) ∝ βwαt

n(t) + β̄+ βw

n(t, d = i)n(t) + β̄

+n(t, w = wij) [n(t, d = i) + αt]

n(t) + β̄

slow

changes rapidly

Page 313: Graphical Models for the Internet

• For 1000 iterations do• For each document do

• For each word in the document do• Resample topic for the word• Update local (document, topic) table• Update CPU local (word, topic) table

• Update global (word, topic) table

State of the artUMass Mallet, UC Irvine, Google

p(t|wij) ∝ βwαt

n(t) + β̄+ βw

n(t, d = i)n(t) + β̄

+n(t, w = wij) [n(t, d = i) + αt]

n(t) + β̄

slow

changes rapidly

moderately fast

Page 314: Graphical Models for the Internet

• For 1000 iterations do• For each document do

• For each word in the document do• Resample topic for the word• Update local (document, topic) table• Update CPU local (word, topic) table

• Update global (word, topic) table

State of the artUMass Mallet, UC Irvine, Google

p(t|wij) ∝ βwαt

n(t) + β̄+ βw

n(t, d = i)n(t) + β̄

+n(t, w = wij) [n(t, d = i) + αt]

n(t) + β̄

slow

changes rapidly

moderately fast

table out of sync

blocking

network bound

memoryinefficient

Page 315: Graphical Models for the Internet

Our Approach• For 1000 iterations do (independently per computer)

• For each thread/core do• For each document do

• For each word in the document do• Resample topic for the word• Update local (document, topic) table• Generate computer local (word, topic) message

• In parallel update local (word, topic) table• In parallel update global (word, topic) table

Page 316: Graphical Models for the Internet

Our Approach

network bound

concurrentcpu hdd net

• For 1000 iterations do (independently per computer)• For each thread/core do

• For each document do• For each word in the document do

• Resample topic for the word• Update local (document, topic) table• Generate computer local (word, topic) message

• In parallel update local (word, topic) table• In parallel update global (word, topic) table

Page 317: Graphical Models for the Internet

Our Approach

network bound

memoryinefficient

concurrentcpu hdd net

minimal view

• For 1000 iterations do (independently per computer)• For each thread/core do

• For each document do• For each word in the document do

• Resample topic for the word• Update local (document, topic) table• Generate computer local (word, topic) message

• In parallel update local (word, topic) table• In parallel update global (word, topic) table

Page 318: Graphical Models for the Internet

Our Approach

table out of sync

network bound

memoryinefficient

continuoussync

concurrentcpu hdd net

minimal view

• For 1000 iterations do (independently per computer)• For each thread/core do

• For each document do• For each word in the document do

• Resample topic for the word• Update local (document, topic) table• Generate computer local (word, topic) message

• In parallel update local (word, topic) table• In parallel update global (word, topic) table

Page 319: Graphical Models for the Internet

Our Approach

table out of sync blocking

network bound

memoryinefficient

continuoussync

barrier free

concurrentcpu hdd net

minimal view

• For 1000 iterations do (independently per computer)• For each thread/core do

• For each document do• For each word in the document do

• Resample topic for the word• Update local (document, topic) table• Generate computer local (word, topic) message

• In parallel update local (word, topic) table• In parallel update global (word, topic) table

Page 320: Graphical Models for the Internet

Architecture details

Page 321: Graphical Models for the Internet

Multicore Architecture

• Decouple multithreaded sampling and updating (almost) avoids stalling for locks in the sampler

• Joint state table• much less memory required• samplers syncronized (10 docs vs. millions delay)

• Hyperparameter update via stochastic gradient descent• No need to keep documents in memory (streaming)

tokens

topics

file

combiner

count

updater

diagnostics

&

optimization

output to

filetopics

samplersampler

samplersampler

sampler

Intel Threading Building Blocks

joint state table

Page 322: Graphical Models for the Internet

Cluster Architecture

• Distributed (key,value) storage via memcached• Background asynchronous synchronization

• single word at a time to avoid deadlocks• no need to have joint dictionary• uses disk, network, cpu simultaneously

sampler sampler sampler sampler

iceiceiceice

Page 323: Graphical Models for the Internet

Cluster Architecture

• Distributed (key,value) storage via ICE• Background asynchronous synchronization

• single word at a time to avoid deadlocks• no need to have joint dictionary• uses disk, network, cpu simultaneously

sampler sampler sampler sampler

iceiceiceice

Page 324: Graphical Models for the Internet

Making it work• Startup

• Randomly initialize topics on each node (read from disk if already assigned - hotstart)

• Sequential Monte Carlo for startup much faster• Aggregate changes on the fly

• Failover• State constantly being written to disk

(worst case we lose 1 iteration out of 1000)• Restart via standard startup routine

• Achilles heel: need to restart from checkpoint if even a single machine dies.

Page 325: Graphical Models for the Internet

Easily extensible• Better language model (topical n-grams)

can process millions of users (vs 1000s)• Conditioning on side information (upstream)

estimate topic based on authorship, source, joint user model ...

• Conditioning on dictionaries (downstream)integrate topics between different languages

• Time dependent sampler for user modelapproximate inference per episode

Page 326: Graphical Models for the Internet

Google LDA Mallet Irvine’08 Irvine’09 Yahoo LDA

Multicore no yes yes yes yes

Cluster MPI no MPI point 2 point memcached

State table dictionarysplit

separatesparse separate separate joint

sparse

Schedule synchronousexact

synchronousexact

synchronousexact

asynchronousapproximate

messages

asynchronousexact

Page 327: Graphical Models for the Internet

Speed• 1M documents per day on 1 computer

(1000 topics per doc, 1000 words per doc)• 350k documents per day per node

(context switches & memcached & stray reducers)• 8 Million docs (Pubmed)

(sampler does not burn in well - too short doc)• Irvine: 128 machines, 10 hours• Yahoo: 1 machine, 11 days • Yahoo: 20 machines, 9 hours

• 20 Million docs (Yahoo! News Articles) • Yahoo: 100 machines, 12 hours

Page 328: Graphical Models for the Internet

Scalability

0

10

20

30

40

1 10 20 50 100

200k documents/computer

Runtime (hours) Initial topics per word x10

Likelihood even improves with parallelism!-3.295 (1 node) -3.288 (10 nodes) -3.287 (20 nodes)

CPUs

Page 329: Graphical Models for the Internet

The Competition

0

12500

25000

37500

50000

Google Irvine Yahoo

05

101520

Google Irvine Yahoo

032.5

6597.5130

Google Irvine Yahoo

Dataset size (millions)

Cluster size

Throughput/h

1506.4k

50k

Page 330: Graphical Models for the Internet

Design Principles

Page 331: Graphical Models for the Internet

Variable Replication• Global shared variable

• Make local copy• Distributed (key,value) storage table for global copy• Do all bookkeeping locally (store old versions)• Sync local copies asynchronously using message passing

(no global locks are needed)• This is an approximation!

x y z x y y’ z

local copysynchronize

computer

Page 332: Graphical Models for the Internet

Asymmetric Message Passing• Large global shared state space

(essentially as large as the memory in computer)• Distribute global copy over several machines

(distributed key,value storage)

old copy current copy

global state

Page 333: Graphical Models for the Internet

Out of core storage• Very large state space

• Gibbs sampling requires us to traverse the data sequentially many times (think 1000x)

• Stream local data from disk and update coupling variable each time local data is accessed

• This is exact

x y z

tokens

topics

file

combiner

count

updater

diagnostics

&

optimization

output to

filetopics

samplersampler

samplersampler

sampler

Page 334: Graphical Models for the Internet

Part 6 - Advanced Modeling

Page 335: Graphical Models for the Internet

Advances in Representation

Page 336: Graphical Models for the Internet

• Prior over document topic vector• Usually as Dirichlet distribution• Use correlation between topics (CTM)• Hierarchical structure over topics

• Document structure• Bag of words• n-grams (Li & McCallum)• Simplical Mixture (Girolami & Kaban)

• Side information• Upstream conditioning (Mimno & McCallum)• Downstream conditioning (Petterson et al.)• Supervised LDA (Blei and McAulliffe 2007; Lacoste,

Sha and Jordan 2008; Zhu, Ahmed and Xing 2009)

x

y

θ

α

Extensions to topic models

Page 337: Graphical Models for the Internet

• Dirichlet distribution• Can only model which topics are hot• Does not model relationships between topics

• Key idea•We expect to see documents about sports and health but not about sports and politics• Uses a logistic normal distribution as a prior

• Conjugacy is no longer maintained• Inference is harder than in LDA

Blei & Lafferty 2005; Ahmed & Xing 2007

Correlated topic models

Page 338: Graphical Models for the Internet

• Dirichlet distribution• Can only model which topics are hot• Does not model relationships between topics

• Key idea•We expect to see documents about sports and health but not about sports and politics• Uses a logistic normal distribution as a prior

• Conjugacy is no longer maintained• Inference is harder than in LDA

Blei & Lafferty 2005; Ahmed & Xing 2007

Correlated topic models

Page 339: Graphical Models for the Internet

Dirichlet prior on topics

Page 340: Graphical Models for the Internet

Log-normal prior on topics

θ = eη−g(η) with η ∼ N (µ,Σ)with

Page 341: Graphical Models for the Internet

Blei and Lafferty 2005

Correlated topics

Page 342: Graphical Models for the Internet

Correlated topics

Page 343: Graphical Models for the Internet

• Model the prior as a Directed Acyclic Graph• Each document is modeled as multiple paths• To sample a word, first select a path and then

sample a word from the final topic• The topics reside on the leaves of the tree

Li and McCallum 2006

Pachinko Allocation

Page 344: Graphical Models for the Internet

Li and McCallum 2006

Pachinko Allocation

Page 345: Graphical Models for the Internet

• Topics can appear anywhere in the tree• Each document is modeled as• Single path over the tree (Blei et al., 2004)•Multiple paths over the tree (Mimno et al.,2007)

Topic Hierarchies

Page 346: Graphical Models for the Internet

Blei et al. 2004

Topic Hierarchies

Page 347: Graphical Models for the Internet

• Documents as bag of words• Exploit sequential structure• N-gram models• Capture longer phrases• Switch variables to determine segments• Dynamic programming needed

x

y

θ

α

Topical n-grams

y

x x

Girolami & Kaban, 2003; Wallach, 2006; Wang & McCallum, 2007

Page 348: Graphical Models for the Internet

Topic n-grams

Page 349: Graphical Models for the Internet

Side information• Upstream conditioning (Mimno et al., 2008)

• Document features are informative for topics• Estimate topic distribution e.g. based on authors, links,

timestamp• Downstream conditioning (Petterson et al., 2010)

• Word features are informative on topics• Estimate topic distribution for words e.g. based on dictionary,

lexical similarity, distributional similarity• Class labels (Blei and McAulliffe 2007; Lacoste, Sha and Jordan

2008; Zhu, Ahmed and Xing 2009)• Joint model of unlabeled data and labels• Joint likelihood - semisupervised learning done right!

Page 350: Graphical Models for the Internet

Downstream conditioningEuroparl corpus

without alignment

Page 351: Graphical Models for the Internet

Recommender Systems

Agarwal & Chen, 2010

Page 352: Graphical Models for the Internet

Chinese Restaurant Process

φ2φ1 φ3

Page 353: Graphical Models for the Internet

Problem• How many clusters should we pick?• How about a prior for infinitely many clusters?• Finite model

• Infinite modelAssume that the total smoother weight is constant

p(y|Y,α) = n(y) + αy

n+�

y� αy�

p(y|Y,α) = n(y)

n+�

y� αy�and p(new|Y,α) = α

n+ αand new

Page 354: Graphical Models for the Internet

Chinese Restaurant Metaphor

-­‐For  data  point  xi  

-­‐  Choose  table  j  ∝  mj        and    Sample  xi  ~  f(φj)-­‐  Choose  a  new  table    K+1  ∝  α  

-­‐  Sample  φK+1  ~  G0      and  Sample  xi  ~  f(φK+1)

Genera=ve  Process

φ2φ1 φ3

the rich get richer

Pitman; Antoniak; Ishwaran; Jordan et al.; Teh et al.;

Page 355: Graphical Models for the Internet

Evolutionary Clustering

• Time series of objects, e.g. news stories• Stories appear / disappear• Want to keep track of clusters automatically

Page 356: Graphical Models for the Internet

Recurrent Chinese Restaurant Process

φ2,1φ1,1 φ3,1

T=1

T=2m'1,1=2 m'2,1=3 m'3,1=1

Page 357: Graphical Models for the Internet

Recurrent Chinese Restaurant Process

φ2,1φ1,1 φ3,1

T=1

T=2m'1,1=2 m'2,1=3 m'3,1=1

Page 358: Graphical Models for the Internet

φ2,1φ1,1 φ3,1

Recurrent Chinese Restaurant Process

φ2,1φ1,1 φ3,1

T=1

T=2m'1,1=2 m'2,1=3 m'3,1=1

Page 359: Graphical Models for the Internet

φ2,1φ1,1 φ3,1

Recurrent Chinese Restaurant Process

φ2,1φ1,1 φ3,1

T=1

T=2m'1,1=2 m'2,1=3 m'3,1=1

Page 360: Graphical Models for the Internet

φ2,1φ1,1 φ3,1

Recurrent Chinese Restaurant Process

φ2,1φ1,1 φ3,1

T=1

T=2m'1,1=2 m'2,1=3 m'3,1=1

Sample  φ1,2  ~  P(.| φ1,1)  

Page 361: Graphical Models for the Internet

φ2,1φ1,1 φ3,1

Recurrent Chinese Restaurant Process

φ2,1φ1,1 φ3,1

T=1

T=2m'1,1=2 m'2,1=3 m'3,1=1

Page 362: Graphical Models for the Internet

Recurrent Chinese Restaurant Process

φ2,1φ1,1 φ3,1

T=1

T=2

φ4,2

m'1,1=2 m'2,1=3 m'3,1=1

φ2,2φ1,2 φ3,1

dead cluster new cluster

Page 363: Graphical Models for the Internet

Longer History

φ2,1φ1,1 φ3,1

T=1

T=2

φ2,2φ1,2 φ3,1

m'1,1=2 m'2,1=3 m'3,1=1

φ4,2

T=3φ2,2φ1,2 φ4,2

m'2,3

Page 364: Graphical Models for the Internet

TDPM Generative PowerW= ∞λ = ∞

DPM

W=4λ = .4

TDPM

W= 0λ = ? (any)

Independent DPMs

37

power law

Page 365: Graphical Models for the Internet

User modeling

0 10 20 30 400

0.1

0.2

0.3

Prop

otio

n

Day

Baseball

Finance

Jobs

Dating

Page 366: Graphical Models for the Internet

Buying a camera

time

Page 367: Graphical Models for the Internet

Buying a camera

time

Page 368: Graphical Models for the Internet

Buying a camera

show ads nowtime

Page 369: Graphical Models for the Internet

Buying a camera

show ads now too latetime

Page 370: Graphical Models for the Internet

CarDealsvan

jobHiringdiet

HiringSalaryDietcalories

AutoPriceUsedinspec=on

FlightLondonHotelweather

DietCaloriesRecipechocolate

MoviesTheatreArtgallery

SchoolSuppliesLoancollege

Page 371: Graphical Models for the Internet

CarDealsvan

jobHiringdiet

HiringSalaryDietcalories

AutoPriceUsedinspec=on

FlightLondonHotelweather

DietCaloriesRecipechocolate

MoviesTheatreArtgallery

SchoolSuppliesLoancollege

Page 372: Graphical Models for the Internet

CarDealsvan

jobHiringdiet

HiringSalaryDietcalories

AutoPriceUsedinspec=on

FlightLondonHotelweather

DietCaloriesRecipechocolate

MoviesTheatreArtgallery

SchoolSuppliesLoancollege

CARS Art

DietJobs

Travel

Collegefinance

Page 373: Graphical Models for the Internet

FlightLondonHotelweather

SchoolSuppliesLoancollege

Travel

Collegefinance

Input

• Queries  issued  by  the  user  or  Tags  of  watched  content

• Snippet  of  page  examined  by  user

• Time  stamp  of  each  acEon  (day  resoluEon)

Output

•    Users’  daily  distribuEon  over  intents•    Dynamic  intent  representaEon

User modeling

Page 374: Graphical Models for the Internet

Time dependent models

• LDA for topical model of users where• User interest distribution changes over time• Topics change over time

• This is like a Kalman filter except that• Don’t know what to track (a priori)• Can’t afford a Rauch-Tung-Striebel smoother• Much more messy than plain LDA

Page 375: Graphical Models for the Internet

Graphical Model

wij

zij

θi

α

φk

αt

β

θti

zij

wij

φtk

βt

αt−1 αt+1

θt−1i

θt+1i

φt−1k

βt−1

φt+1k

βt+1

plainLDA

time dependentuser interest

user actions

actions per topic

Page 376: Graphical Models for the Internet

job  CareerBusinessAssistantHiring

Part-­‐=meRecep=onist

CarBlueBookKelleyPricesSmallSpeedlarge

BankOnlineCreditCarddebt  

porZolioFinanceChase

RecipeChocolate

PizzaFood

ChickenMilkBu\erPowder

Time                t                                  t+1                

FoodChickenpizza

recipejobhiring

Part-­‐=meOpeningsalary

foodchickenPizzamillage

Kellyrecipecuisine

Diet Cars Job Finance

Prior  for  user  ac=ons  at  =me  t

Page 377: Graphical Models for the Internet

All

job  CareerBusinessAssistantHiring

Part-­‐=meRecep=onist

CarBlueBookKelleyPricesSmallSpeedlarge

BankOnlineCreditCarddebt  

porZolioFinanceChase

RecipeChocolate

PizzaFood

ChickenMilkBu\erPowder

Time                t                                  t+1                

FoodChickenpizza

recipejobhiring

Part-­‐=meOpeningsalary

foodchickenPizzamillage

Kellyrecipecuisine

Diet Cars Job Finance

Prior  for  user  ac=ons  at  =me  t

Long-­‐term

Page 378: Graphical Models for the Internet

All

job  CareerBusinessAssistantHiring

Part-­‐=meRecep=onist

CarBlueBookKelleyPricesSmallSpeedlarge

BankOnlineCreditCarddebt  

porZolioFinanceChase

RecipeChocolate

PizzaFood

ChickenMilkBu\erPowder

month

Time                t                                  t+1                

FoodChickenpizza

recipejobhiring

Part-­‐=meOpeningsalary

foodchickenPizzamillage

Kellyrecipecuisine

Diet Cars Job Finance

Prior  for  user  ac=ons  at  =me  t

Long-­‐term

Page 379: Graphical Models for the Internet

All

week

job  CareerBusinessAssistantHiring

Part-­‐=meRecep=onist

CarBlueBookKelleyPricesSmallSpeedlarge

BankOnlineCreditCarddebt  

porZolioFinanceChase

RecipeChocolate

PizzaFood

ChickenMilkBu\erPowder

month

Time                t                                  t+1                

FoodChickenpizza

recipejobhiring

Part-­‐=meOpeningsalary

foodchickenPizzamillage

Kellyrecipecuisine

Diet Cars Job Finance

Prior  for  user  ac=ons  at  =me  t

Long-­‐termshort-­‐term

Page 380: Graphical Models for the Internet

All

week

job  CareerBusinessAssistantHiring

Part-­‐=meRecep=onist

CarBlueBookKelleyPricesSmallSpeedlarge

BankOnlineCreditCarddebt  

porZolioFinanceChase

RecipeChocolate

PizzaFood

ChickenMilkBu\erPowder

month

Time                t                                  t+1                

FoodChickenpizza

recipejobhiring

Part-­‐=meOpeningsalary

foodchickenPizzamillage

Kellyrecipecuisine

Diet Cars Job Finance

Prior  for  user  ac=ons  at  =me  t

Long-­‐termshort-­‐term

Page 381: Graphical Models for the Internet

All

week

job  CareerBusinessAssistantHiring

Part-­‐=meRecep=onist

CarBlueBookKelleyPricesSmallSpeedlarge

BankOnlineCreditCarddebt  

porZolioFinanceChase

RecipeChocolate

PizzaFood

ChickenMilkBu\erPowder

month

Time                t                                  t+1                

FoodChickenpizza

recipejobhiring

Part-­‐=meOpeningsalary

foodchickenPizzamillage

Kellyrecipecuisine

Diet Cars Job Finance

Prior  for  user  ac=ons  at  =me  t

μ

μ2

μ3

Long-­‐termshort-­‐term

Page 382: Graphical Models for the Internet

Food  ChickenPizza    mileage

Car  speed  offerCamry  accord  career

At  )me  t At  )me  t+1job  

CareerBusinessAssistantHiring

Part-­‐EmeRecepEoni

st

CarAlEmaAccordBlueBookKelleyPricesSmallSpeed

BankOnlineCreditCarddebt  

porYolioFinanceChase

RecipeChocolate

PizzaFood

ChickenMilkBu\erPowder

Page 383: Graphical Models for the Internet

Food  ChickenPizza    mileage

Car  speed  offerCamry  accord  career

At  )me  t At  )me  t+1job  

CareerBusinessAssistantHiring

Part-­‐EmeRecepEoni

st

CarAlEmaAccordBlueBookKelleyPricesSmallSpeed

BankOnlineCreditCarddebt  

porYolioFinanceChase

RecipeChocolate

PizzaFood

ChickenMilkBu\erPowder

Page 384: Graphical Models for the Internet

Food  ChickenPizza    mileage

Car  speed  offerCamry  accord  career

At  )me  t At  )me  t+1

short-­‐termpriors

job  CareerBusinessAssistantHiring

Part-­‐EmeRecepEoni

st

CarAlEmaAccordBlueBookKelleyPricesSmallSpeed

BankOnlineCreditCarddebt  

porYolioFinanceChase

RecipeChocolate

PizzaFood

ChickenMilkBu\erPowder

Page 385: Graphical Models for the Internet

Food  ChickenPizza    mileage

Car  speed  offerCamry  accord  career

At  )me  t At  )me  t+1

•  For  each  user  interac=on•  Choose  an  intent  from  local  distribu=on• Sample  word  from  the  topic’s  word-­‐distribu=on  

•Choose  a  new  intent    ∝  α  • Sample  a  new  intent  from  the  global  distribu=on•  Sample  word  from  the  new  topic  word-­‐distribu=on  

Genera=ve  Process

short-­‐termpriors

job  CareerBusinessAssistantHiring

Part-­‐EmeRecepEoni

st

CarAlEmaAccordBlueBookKelleyPricesSmallSpeed

BankOnlineCreditCarddebt  

porYolioFinanceChase

RecipeChocolate

PizzaFood

ChickenMilkBu\erPowder

Page 386: Graphical Models for the Internet

At  )me  t At  )me  t+1 At  )me  t+2 At  )me  t+3

User  1process

User  2process

User  3process

Globalprocess

mm'

nn'

Page 387: Graphical Models for the Internet

Sample users

0 10 20 30 400

0.1

0.2

0.3

Prop

otio

nDay

Baseball

Finance

Jobs

Dating

0 10 20 30 400

0.1

0.2

0.3

0.4

0.5

Prop

otio

n

Day

Baseball

Dating

Celebrity

Health

SnookiTom  CruiseKatie

Holmes  PinkettKudrow

Hollywood

League  baseball  basketball,  doubleheadBergesenGriffeybullpen  Greinke

skinbody  fingers  cells  toes  

wrinkle  layers

women  mendating  singles  

personals  seeking  match

Dating Baseball Celebrity Health

job  careerbusinessassistanthiring

part-­‐timereceptionist

financial  Thomson  chart  real  StockTradingcurrency

Jobs Finance

Page 388: Graphical Models for the Internet

Sample users

0 10 20 30 400

0.1

0.2

0.3

Prop

otio

nDay

Baseball

Finance

Jobs

Dating

0 10 20 30 400

0.1

0.2

0.3

0.4

0.5

Prop

otio

n

Day

Baseball

Dating

Celebrity

Health

SnookiTom  CruiseKatie

Holmes  PinkettKudrow

Hollywood

League  baseball  basketball,  doubleheadBergesenGriffeybullpen  Greinke

skinbody  fingers  cells  toes  

wrinkle  layers

women  mendating  singles  

personals  seeking  match

Dating Baseball Celebrity Health

job  careerbusinessassistanthiring

part-­‐timereceptionist

financial  Thomson  chart  real  StockTradingcurrency

Jobs Finance

Page 389: Graphical Models for the Internet

DatasetsData

Page 390: Graphical Models for the Internet

ROC score improvement

Page 391: Graphical Models for the Internet

ROC score improvement

50

52

54

56

58

60

62

Dataset−2

>100

0

[1000

,600]

[600,4

00]

[400,2

00]

[200,1

00]

[100,6

0]

[60,40

]

[40,20

]<2

0

baselineTLDATLDA+Baseline

Page 392: Graphical Models for the Internet

LDA for user profilingSample  ZFor  users

Sample  ZFor  users

Sample  ZFor  users

Sample  ZFor  users

Barrier

Write  counts  to  

memcached

Write  counts  to  

memcached

Write  counts  to  

memcached

Write  counts  to  

memcached

Collect  counts  and  sample  

Do  nothing Do  nothing Do  nothing

Barrier

Read   from  memcached

Read   from  memcached

Read   from  memcached

Read   from  memcached

Page 393: Graphical Models for the Internet

News

Page 394: Graphical Models for the Internet

News Stream

Page 395: Graphical Models for the Internet

News Stream

Page 396: Graphical Models for the Internet

News Stream• Over 1 high quality news article per second • Multiple sources (Reuters, AP, CNN, ...)• Same story from multiple sources• Stories are related

• Goals• Aggregate articles into a storyline• Analyze the storyline (topics, entities)

Page 397: Graphical Models for the Internet

Clustering / RCRP• Assume active story

distribution at time t• Draw story indicator• Draw words from story

distribution • Down-weight story counts for

next day

Ahmed & Xing, 2008

Page 398: Graphical Models for the Internet

Clustering / RCRP• Pro

• Nonparametric model of story generation(no need to model frequency of stories)

• No fixed number of stories• Efficient inference via collapsed sampler

• Con• We learn nothing!• No content analysis

Page 399: Graphical Models for the Internet

Latent Dirichlet Allocation

• Generate topic distribution per article

• Draw topics per word from topic distribution

• Draw words from topic specific word distribution

Blei, Ng, Jordan, 2003

Page 400: Graphical Models for the Internet

Latent Dirichlet Allocation

• Pro• Topical analysis of stories• Topical analysis of words (meaning, saliency)• More documents improve estimates

• Con• No clustering

Page 401: Graphical Models for the Internet

• Named entities are special, topics less(e.g. Tiger Woods and his mistresses)

• Some stories are strange(topical mixture is not enough - dirty models)

• Articles deviate from general story(Hierarchical DP)

More Issues

Page 402: Graphical Models for the Internet

• Named entities are special, topics less(e.g. Tiger Woods and his mistresses)

• Some stories are strange(topical mixture is not enough - dirty models)

• Articles deviate from general story(Hierarchical DP)

More Issues

Page 403: Graphical Models for the Internet

Storylines

Page 404: Graphical Models for the Internet

Storylines Model• Topic model• Topics per cluster• RCRP for cluster• Hierarchical DP for

article• Separate model for

named entities• Story specific

correction

Page 405: Graphical Models for the Internet

Storylines Model• Topic model• Topics per cluster• RCRP for cluster• Hierarchical DP for

article• Separate model for

named entities• Story specific

correction

Page 406: Graphical Models for the Internet

Storylines Model• Topic model• Topics per cluster• RCRP for cluster• Hierarchical DP for

article• Separate model for

named entities• Story specific

correction

Page 407: Graphical Models for the Internet

Storylines Model• Topic model• Topics per cluster• RCRP for cluster• Hierarchical DP for

article• Separate model for

named entities• Story specific

correction

Page 408: Graphical Models for the Internet

Storylines Model• Topic model• Topics per cluster• RCRP for cluster• Hierarchical DP for

article• Separate model for

named entities• Story specific

correction

Page 409: Graphical Models for the Internet

Storylines Model• Topic model• Topics per cluster• RCRP for cluster• Hierarchical DP for

article• Separate model for

named entities• Story specific

correction

Page 410: Graphical Models for the Internet

Storylines Model• Topic model• Topics per cluster• RCRP for cluster• Hierarchical DP for

article• Separate model for

named entities• Story specific

correction

Page 411: Graphical Models for the Internet

Dynamic Cluster-Topic Hybrid

UEFA-soccer

!"#$%&'()*+'#,*!'#-"*./0&120*3&452,4*%2(#,/6*

7892(/8)**:!*3&,#(**;#<&'*='(#,4'*;6'(****

Tax-Bill

>#?*@&,,&'(*!8/*A,#(*@84B2/*C-'('$6*

@8)"*.2(#/2*D,2&)-"20*E"&/2*F'8)2*=2%8G,&-#(*

Border-Tension

H8-,2#0*@'0420*I&#,'B82*I&%,'$#J-*$&,&/#(/*K()80B2(-6*$&))&,2*

A#1&)/#(*K(4&#*L#)"$&0*H2M*I2,"&*K),#$#G#4*38)"#00#N*O#P%#622*

SportsgamesWonTeamFinalSeasonLeagueheld

Poli)csGovernmentMinister

AuthoriEesOpposiEonOfficialsLeadersgroup

AccidentsPoliceA\ackrunmangrouparrestedmove

!!

Page 412: Graphical Models for the Internet

Dynamic Cluster-Topic Hybrid

UEFA-soccer

!"#$%&'()*+'#,*!'#-"*./0&120*3&452,4*%2(#,/6*

7892(/8)**:!*3&,#(**;#<&'*='(#,4'*;6'(****

Tax-Bill

>#?*@&,,&'(*!8/*A,#(*@84B2/*C-'('$6*

@8)"*.2(#/2*D,2&)-"20*E"&/2*F'8)2*=2%8G,&-#(*

Border-Tension

H8-,2#0*@'0420*I&#,'B82*I&%,'$#J-*$&,&/#(/*K()80B2(-6*$&))&,2*

A#1&)/#(*K(4&#*L#)"$&0*H2M*I2,"&*K),#$#G#4*38)"#00#N*O#P%#622*

SportsgamesWonTeamFinalSeasonLeagueheld

Poli)csGovernmentMinister

AuthoriEesOpposiEonOfficialsLeadersgroup

AccidentsPoliceA\ackrunmangrouparrestedmove

!!

Page 413: Graphical Models for the Internet

Dynamic Cluster-Topic Hybrid

UEFA-soccer

!"#$%&'()*+'#,*!'#-"*./0&120*3&452,4*%2(#,/6*

7892(/8)**:!*3&,#(**;#<&'*='(#,4'*;6'(****

Tax-Bill

>#?*@&,,&'(*!8/*A,#(*@84B2/*C-'('$6*

@8)"*.2(#/2*D,2&)-"20*E"&/2*F'8)2*=2%8G,&-#(*

Border-Tension

H8-,2#0*@'0420*I&#,'B82*I&%,'$#J-*$&,&/#(/*K()80B2(-6*$&))&,2*

A#1&)/#(*K(4&#*L#)"$&0*H2M*I2,"&*K),#$#G#4*38)"#00#N*O#P%#622*

SportsgamesWonTeamFinalSeasonLeagueheld

Poli)csGovernmentMinister

AuthoriEesOpposiEonOfficialsLeadersgroup

AccidentsPoliceA\ackrunmangrouparrestedmove

!!

Page 414: Graphical Models for the Internet

Inference

• We receive articles as a streamWant topics & stories now

• Variational inference infeasible(RCRP, sparse to dense, vocabulary size)

• We have a ‘tracking problem’• Sequential Monte Carlo• Use sampled variables of surviving particle• Use ideas from Cannini et al. 2009

Page 415: Graphical Models for the Internet

• Proposal distribution - draw stories s, topics z

using Gibbs Sampling for each particle• Reweight particle via

• Resample particles if l2 norm too large(resample some assignments for diversity, too)

• Compare to multiplicative updates algorithmIn our case predictive likelihood yields weights

Particle Filter

p(xt+1|x1...t, s1...t, z1...t)

p(st+1, zt+1|x1...t+1, s1...t, z1...t)

past state

new data

Page 416: Graphical Models for the Internet

Particle Filter

•s  and  z  are  =ghtly  coupled•Alterna=ve  to  MCMC

•Sample  s  then  sample  z  (high  variance)•Sample  z  then  sample  s  (doesn’t  make  sense)

•Idea  (following  a  similar  trick  by  Jain  and  Neal)•Run  a  few  itera=ons  of  MCMC  over  s  and  z•Take  last  sample  as  the  proposed  value

Page 417: Graphical Models for the Internet

Particle Filter

Page 418: Graphical Models for the Internet

Particle Filter

Page 419: Graphical Models for the Internet

Particle Filter

Page 420: Graphical Models for the Internet

Inheritance Tree!"#$%&'$(&%)*+'!"#$%&',)&-.#%+'

!""#$

%

&'()*+$%$",-.'/*+$0$/)'&1)+$2$

3

0

4)(5#67$ /)'&1)+$8$

(.9.*#):+$%$

&'()*+$;$*)'*"9+$3$

/0"-)#'$&%%'1&%)*2'34&'$(&%)*+5'

!""#$

%

&'()*+$%$",-.'/*+$0$/)'&1)+$2$

3

0

4)(5#67$ /)'&1)+$8$&'()*+$0$

(.9.*#):+$<$

&'()*+$;$*)'*"9+$3$

0 = get(1,’games’) set(2,’games’,3)

set(3,’minister’,7)

6%+)7,#"08''(")&*',)&-.#%+'

!""#$

&'()*+$%$",-.'/*+$0$/)'&1)+$2$

3=%$

0

4)(5#67$ /)'&1)+$8$&'()*+$0$

(.9.*#):+$<$

&'()*+$;$*)'*"9+$3$

copy(2,1)

+,!-&.909+%*':&)0.(%+'

!""#$

&'()*+$%$",-.'/*+$0$/)'&1)+$2$

3=%$

0

4)(5#67$ /)'&1)+$8$&'()*+$0$

(.9.*#):+$<$

&'()*+$;$*)'*"9+$3$

/(00$"*&.#408':&)0.(%+'

!""#$

&'()*+$%$",-.'/*+$0$/)'&1)+$2$

3=%$

0

/)'&1)+$8$&'()*+$0$

(.9.*#):+$<$

3=%$&'()*+$0$*)'*"9+$3$/)'&1)+$8$

maintain_prune() maintain_collapse()

;&%)$%'-&1.0&$2&*'

!""#$

&'()*+$%$",-.'/*+$0$/)'&1)+$2$

0

(.9.*#):+$<$

&'()*+$0$*)'*"9+$3$/)'&1)+$8$

branch(1) branch(2)

% 3

4)(5#67$ 4)(5#67$

<%='"0"-)#'$&%%'1&%)*2'34&'$(&%)*+5'

!""#$

&'()*+$%$",-.'/*+$0$/)'&1)+$2$

0

(.9.*#):+$<$

&'()*+$0$*)'*"9+$3$/)'&1)+$8$

% 3

4)(5#67$ 4)(5#67$

Page 421: Graphical Models for the Internet

Inheritance Tree!"#$%&'$(&%)*+'!"#$%&',)&-.#%+'

!""#$

%

&'()*+$%$",-.'/*+$0$/)'&1)+$2$

3

0

4)(5#67$ /)'&1)+$8$

(.9.*#):+$%$

&'()*+$;$*)'*"9+$3$

/0"-)#'$&%%'1&%)*2'34&'$(&%)*+5'

!""#$

%

&'()*+$%$",-.'/*+$0$/)'&1)+$2$

3

0

4)(5#67$ /)'&1)+$8$&'()*+$0$

(.9.*#):+$<$

&'()*+$;$*)'*"9+$3$

0 = get(1,’games’) set(2,’games’,3)

set(3,’minister’,7)

6%+)7,#"08''(")&*',)&-.#%+'

!""#$

&'()*+$%$",-.'/*+$0$/)'&1)+$2$

3=%$

0

4)(5#67$ /)'&1)+$8$&'()*+$0$

(.9.*#):+$<$

&'()*+$;$*)'*"9+$3$

copy(2,1)

+,!-&.909+%*':&)0.(%+'

!""#$

&'()*+$%$",-.'/*+$0$/)'&1)+$2$

3=%$

0

4)(5#67$ /)'&1)+$8$&'()*+$0$

(.9.*#):+$<$

&'()*+$;$*)'*"9+$3$

/(00$"*&.#408':&)0.(%+'

!""#$

&'()*+$%$",-.'/*+$0$/)'&1)+$2$

3=%$

0

/)'&1)+$8$&'()*+$0$

(.9.*#):+$<$

3=%$&'()*+$0$*)'*"9+$3$/)'&1)+$8$

maintain_prune() maintain_collapse()

;&%)$%'-&1.0&$2&*'

!""#$

&'()*+$%$",-.'/*+$0$/)'&1)+$2$

0

(.9.*#):+$<$

&'()*+$0$*)'*"9+$3$/)'&1)+$8$

branch(1) branch(2)

% 3

4)(5#67$ 4)(5#67$

<%='"0"-)#'$&%%'1&%)*2'34&'$(&%)*+5'

!""#$

&'()*+$%$",-.'/*+$0$/)'&1)+$2$

0

(.9.*#):+$<$

&'()*+$0$*)'*"9+$3$/)'&1)+$8$

% 3

4)(5#67$ 4)(5#67$

not th

read sa

fe

Page 422: Graphical Models for the Internet

Extended Inheritance Tree

Root

1

India:  [(I-­‐P  tension,3),(Tax  bills,1)]Pakistan:  [(I-­‐P  tension,2),(Tax  bills,1)]Congress:  [(I-­‐P  tension,1),(Tax  bills,1)]

2

3

(empty) Congress:  [(I-­‐P  tension,0),(Tax  bills,2)]

Bush:  [(I-­‐P  tension,1),(Tax  bills,2)]India:  [(Tax  bills,0)]

India:  [(I-­‐P  tension,2)]US:  [(I-­‐P  tension,1),[Tax  bills,1)]

Extended Inheritance  Tree

[(I-­P  tension,2),(Tax  bills,1)]  =  get_list

set_entry

-­‐ India-­‐

write only in the leaves

(per thread)

Page 423: Graphical Models for the Internet

Results

Page 424: Graphical Models for the Internet

Ablation studies• TDT5 (Topic Detection and Tracking)

macro-averaged minimum detection cost: 0.714

• Removing features

time entities topics story words

0.84 0.90 0.86 0.75

Page 425: Graphical Models for the Internet

Comparison

Hashing & correlation clustering

Page 426: Graphical Models for the Internet

Time-Accuracy trade off

Page 427: Graphical Models for the Internet

Stories������

���������������

���������� �����

�����

����������������

� ��������������������������������� �

��� ��

����������� ������� �

�����������

�������������� ����

������������������ �������������������������������

����������������� ���������������������� ����������

������� �

���� �������������������������� ������

���������������� ������ ������!�"�# #���#!�#��$$

��������

������������ ����������������������

%���&�����'&�#�( ���)����*�� +�����,#��� ������*��

��

���

����

���

��

Page 428: Graphical Models for the Internet

���������������� �

�������

������������������� ������ �� ���������������

����������������� ���������������������� ����������

����������� �����

���������������������������� ����� �������

�� ���������������������������� ������� ����

� ���� �����������

�������

���� �������������������������

�� ���� ��������� ��!"�#�������$���$

������������� ��������� �����

������������� ��������������� ���

�������������

Related Stories

Page 429: Graphical Models for the Internet

Detecting Ideologies

Ahmed and Xing, 2010

Page 430: Graphical Models for the Internet

Problem  Statement

Build  a  model  to  describe  both  collec=ons  of  data

VisualizaEon

•  How  does  each  ideology  view  mainstream  events?•  On  which  topics  do  they  differ?•  On  which  topics  do  they  agree?

Ideologies

Page 431: Graphical Models for the Internet

Problem  Statement

Build  a  model  to  describe  both  collec=ons  of  data

VisualizaEon

Ideologies

ClassificaEon

•Given  a  new  news  arEcle    or  a  blog  post,  the  system  should  infer•  From  which  side  it  was  wri\en•    JusEfy  its  answer  on  a  topical  level  (view  on  abor=on,  taxes,  health  care)

Page 432: Graphical Models for the Internet

Problem  Statement

Build  a  model  to  describe  both  collec=ons  of  data

VisualizaEon

Ideologies

ClassificaEon

Structured  browsing

•Given  a  new  news  arEcle    or  a  blog  post,  the  user  can  ask  for  :•Examples  of  other  arEcles  from  the  same  ideology  about  the  same  topic•Documents  that  could  exemplify  alterna)ve  views  from  other  ideologies

Page 433: Graphical Models for the Internet

Ω1 Ω2

β1

β1

βk-­‐1

βk

φ1,1

φ1,2

φ1,k

φ2,1

φ2,2

φ2,k

Ideology  1Views

Ideology  2Views

Topics

Building a factored model

Page 434: Graphical Models for the Internet

Building a factored model

Page 435: Graphical Models for the Internet

Ω1 Ω2

β1

β2

βk-­‐1

βk

φ1,1

φ1,2

φ1,k

φ2,1

φ2,2

φ2,k

Ideology  1Views

Ideology  2Views

Topics

Building a factored model

Page 436: Graphical Models for the Internet

Ω1 Ω2

β1

β2

βk-­‐1

βk

φ1,1

φ1,2

φ1,k

φ2,1

φ2,2

φ2,k

Ideology  1Views

Ideology  2Views

Topics

λ

1−λ

λ

1−λ

Building a factored model

Page 437: Graphical Models for the Internet

Datasets

• BiCerlemons:  

• Middle-­‐east  conflict,  document  wri\en  by  Israeli  and  PalesEnian  authors.

•  ~300  documents  form  each  view  with  average  length  740

•  MulE  author  collecEon

•  80-­‐20  split  for  test  and  train

• Poli)cal  Blog-­‐1:

•  American  poliEcal  blogs  (Democrat  and  Republican)

•  2040  posts  with  average  post  length  =  100  words•  Follow  test  and  train  split  as  in  (Yano  et  al.,  2009)

• Poli)cal  Blog-­‐2    (test  generalizaEon  to  a  new  wriEng  style)

•  Same  as  1  but  6  blogs,  3  from  each  side

•    ~14k  posts  with  ~200  words  per  post•  4  blogs  for  training  and  2  blogs  for  test

Data

Page 438: Graphical Models for the Internet

Example:  Bi\erlemons  corpus

palestinian israelipeace

year political process

state end right

government need conflict

waysecurity

palestinian israeliPeacepolitical

occupation process

end security conflict

way government

people time year

force negotiation

bush US president american sharon administration prime pressure policy washington

powell minister colin visit internal policy statement

express pro previous package work transfer

european

arafat state leader roadmap election month iraq yasir

senior involvement clinton terrorism

US    role

PalesEnianView

IsraeliView

roadmap phase security ceasefire state plan

international step authority

end settlement implementation obligation

stop expansion commitment fulfill unit illegal present

previous assassination meet forward

process force terrorism unit provide confidence element

interim discussion union succee point build positive

recognize present timetable

Roadmap  process

syria syrian negotiate lebanon deal conference concession

asad agreement regional october initiative relationship

track negotiation official leadership position

withdrawal time victory present second stand

circumstance represent sense talk strategy issue

participant parti negotiator

peace strategic plo hizballah islamic neighbor territorial radical iran relation think obviou countri mandate

greater conventional intifada affect jihad time

Arab  Involvement

Bitterlemons dataset

Page 439: Graphical Models for the Internet

ClassificaEonClassification accuracy

Page 440: Graphical Models for the Internet

GeneralizaEon  to  New  BlogsGeneralization to new blogs

Page 441: Graphical Models for the Internet

Geong  AlternaEve  View

-­‐ Given  a  document  wri\en  in  one  ideology,  retrieve  the  equivalent-­‐ Baseline:  SVM  +  cosine  similarity

144

Finding alternate views

Page 442: Graphical Models for the Internet

Can  We  use  Unlabeled  data?Unlabeled data

Page 443: Graphical Models for the Internet

Can  We  use  Unlabeled  data?

•  In  theory  this  is  simple•Add  a  step  that  samples  the  document  view  (v)•Doesn’t  mix  in  pracEce  because  Eght  coupling  between  v  and  (x1,x2,z)

Unlabeled data

Page 444: Graphical Models for the Internet

Can  We  use  Unlabeled  data?

•  In  theory  this  is  simple•Add  a  step  that  samples  the  document  view  (v)•Doesn’t  mix  in  pracEce  because  Eght  coupling  between  v  and  (x1,x2,z)

•SoluEon

Unlabeled data

Page 445: Graphical Models for the Internet

Can  We  use  Unlabeled  data?

•  In  theory  this  is  simple•Add  a  step  that  samples  the  document  view  (v)•Doesn’t  mix  in  pracEce  because  Eght  coupling  between  v  and  (x1,x2,z)

•SoluEon•Sample    v  and  (x1,x2,z)    as  a  block    using  a  Metropolis-­‐HasEng  step

Unlabeled data

Page 446: Graphical Models for the Internet

Can  We  use  Unlabeled  data?

•  In  theory  this  is  simple•Add  a  step  that  samples  the  document  view  (v)•Doesn’t  mix  in  pracEce  because  Eght  coupling  between  v  and  (x1,x2,z)

•SoluEon•Sample    v  and  (x1,x2,z)    as  a  block    using  a  Metropolis-­‐HasEng  step•  This  is  a  huge  proposal!

Unlabeled data

Page 447: Graphical Models for the Internet

Part 7 - Undirected Graphical Models

Page 448: Graphical Models for the Internet

Review

Page 449: Graphical Models for the Internet

Debugging

SpamFiltering

Clustering

DocumentUnderstanding User

Modeling

Advertising

Exploration

Segmentation

Prediction(time series)

Classification

Annotation

NoveltyDetection

PerformanceTuning

SystemDesign

Page 450: Graphical Models for the Internet

Data Integration

uWebpageRanking

d

r

q

a

cDisplay

Advertising

nNews d

p

Answerst

Page 451: Graphical Models for the Internet

• Forward/Backward messages as normal for chain• When we have more edges for a vertex use

• For each outgoing message, send it once you have all other incoming messages

• PRINCIPLED HACKIf no message received yet, set it to 1 altogether

x0 x1 x2

x3 x4 x5

x6 x7 x8

Page 452: Graphical Models for the Internet

• Forward/Backward messages as normal for chain• When we have more edges for a vertex use

• For each outgoing message, send it once you have all other incoming messages

• PRINCIPLED HACKIf no message received yet, set it to 1 altogether

x0 x1 x2

x3 x4 x5

x6 x7 x8

Page 453: Graphical Models for the Internet

• Forward/Backward messages as normal for chain• When we have more edges for a vertex use

• For each outgoing message, send it once you have all other incoming messages

• PRINCIPLED HACKIf no message received yet, set it to 1 altogether

x0 x1 x2

x3 x4 x5

x6 x7 x8

Page 454: Graphical Models for the Internet

• Forward/Backward messages as normal for chain• When we have more edges for a vertex use

• For each outgoing message, send it once you have all other incoming messages

• PRINCIPLED HACKIf no message received yet, set it to 1 altogether

x0 x1 x2

x3 x4 x5

x6 x7 x8

Page 455: Graphical Models for the Internet

• Forward/Backward messages as normal for chain• When we have more edges for a vertex use

• For each outgoing message, send it once you have all other incoming messages

• PRINCIPLED HACKIf no message received yet, set it to 1 altogether

x0 x1 x2

x3 x4 x5

x6 x7 x8

Page 456: Graphical Models for the Internet

• Forward/Backward messages as normal for chain• When we have more edges for a vertex use

• For each outgoing message, send it once you have all other incoming messages

• PRINCIPLED HACKIf no message received yet, set it to 1 altogether

x0 x1 x2

x3 x4 x5

x6 x7 x8

Page 457: Graphical Models for the Internet

• Forward/Backward messages as normal for chain• When we have more edges for a vertex use

• For each outgoing message, send it once you have all other incoming messages

• PRINCIPLED HACKIf no message received yet, set it to 1 altogether

x0 x1 x2

x3 x4 x5

x6 x7 x8

Page 458: Graphical Models for the Internet

Blunting the arrows ...

Page 459: Graphical Models for the Internet

Chicken and Egg

Page 460: Graphical Models for the Internet

Chicken and Egg

p(c|e)p(e|c)

Page 461: Graphical Models for the Internet

Chicken and Egg

p(c|e)p(e|c)

Page 462: Graphical Models for the Internet

Chicken and Egg

we know that chicken and egg are correlated

either chicken or egg

Page 463: Graphical Models for the Internet

Chicken and Egg

p(c, e) ∝ exp ψ(c, e)

encode the correlation via the clique potential

between c and ewe know that

chicken and egg are correlated

either chicken or egg

Page 464: Graphical Models for the Internet

Chicken and Egg

we know that chicken and egg are correlated

either chicken or egg

Page 465: Graphical Models for the Internet

Chicken and Egg

we know that chicken and egg are correlated

either chicken or egg

p(c, e) =exp ψ(c, e)�

c�,e� exp ψ(c�, e�)

= exp [ψ(c, e)− g(ψ)] where g(ψ) = log�

c,e

exp ψ(c, e)

Page 466: Graphical Models for the Internet

... some Yahoo service

MySQL Apache

Website

p(w|m, a)p(m)p(a)

Website

m �⊥⊥ a|w

Page 467: Graphical Models for the Internet

... some Yahoo service

MySQL Apache

Website

p(w|m, a)p(m)p(a)

Website

m �⊥⊥ a|w

MySQL Apache

Website

Site affects MySQL

Site affects Apache

p(m, w, a) ∝ φ(m, w)φ(w, a)

Website

m ⊥⊥ a|w

Page 468: Graphical Models for the Internet

... some Yahoo service

MySQL Apache

Website

p(w|m, a)p(m)p(a)

Website

m �⊥⊥ a|w

MySQL Apache

Website

Site affects MySQL

Site affects Apache

p(m, w, a) ∝ φ(m, w)φ(w, a)

Website

m ⊥⊥ a|w

easier “debugging”

easier “modeling”

Page 469: Graphical Models for the Internet

Undirected Graphical Models

Key ConceptObserving nodes makes remainder

conditionally independent

Page 470: Graphical Models for the Internet

Undirected Graphical Models

Key ConceptObserving nodes makes remainder

conditionally independent

Page 471: Graphical Models for the Internet

Undirected Graphical Models

Key ConceptObserving nodes makes remainder

conditionally independent

Page 472: Graphical Models for the Internet

Undirected Graphical Models

Key ConceptObserving nodes makes remainder

conditionally independent

Page 473: Graphical Models for the Internet

Undirected Graphical Models

Key ConceptObserving nodes makes remainder

conditionally independent

Page 474: Graphical Models for the Internet

Undirected Graphical Models

Key ConceptObserving nodes makes remainder

conditionally independent

Page 475: Graphical Models for the Internet

Undirected Graphical Models

Key ConceptObserving nodes makes remainder

conditionally independent

Page 476: Graphical Models for the Internet

Undirected Graphical Models

Key ConceptObserving nodes makes remainder

conditionally independent

Page 477: Graphical Models for the Internet

Cliques

Page 478: Graphical Models for the Internet

Cliques

maximal fully connected subgraph

Page 479: Graphical Models for the Internet

Cliques

maximal fully connected subgraph

Page 480: Graphical Models for the Internet

Cliques

maximal fully connected subgraph

Page 481: Graphical Models for the Internet

Hammersley Clifford Theorem

p(x) =�

c

ψc(xc)

If density has full support then it decomposes into products of clique potentials

Page 482: Graphical Models for the Internet

Directed vs. Undirected

• Causal description• Normalization automatic• Intuitive• Requires knowledge of

dependencies• Conditional

independence tricky(Bayes Ball algorithm)

• Noncausal description (correlation only)

• Intuitive• Easy modeling• Normalization difficult• Conditional independence

easy to read off (graph connectivity)

Page 483: Graphical Models for the Internet

Examples

Page 484: Graphical Models for the Internet

x

p(x) =�

i

ψi(xi, xi+1)

Chains

Page 485: Graphical Models for the Internet

x

p(x) =�

i

ψi(xi, xi+1)

Chains

Page 486: Graphical Models for the Internet

x

p(x) =�

i

ψi(xi, xi+1)

Chains

x

y

p(x, y) =�

i

ψxi (xi, xi+1)ψxy

i (xi, yi)

Page 487: Graphical Models for the Internet

x

p(x) =�

i

ψi(xi, xi+1)

Chains

p(x|y) ∝�

i

ψxi (xi, xi+1)ψxy

i (xi, yi)� �� �=:fi(xi,xi+1)

x

y

p(x, y) =�

i

ψxi (xi, xi+1)ψxy

i (xi, yi)

Page 488: Graphical Models for the Internet

Chains

p(x|y) ∝�

i

ψxi (xi, xi+1)ψxy

i (xi, yi)� �� �=:fi(xi,xi+1)

x

y

Dynamic Programmingl1(x1) = 1 and li+1(xi+1) =

xi

li(xi)fi(xi, xi+1)

rn(xn) = 1 and ri(xi) =�

xi+1

ri+1(xi+1)fi(xi, xi+1)

Page 489: Graphical Models for the Internet

Named Entity Tagging

p(x|y) ∝�

i

ψxi (xi, xi+1)ψxy

i (xi, yi)� �� �=:fi(xi,xi+1)

x

y

Page 490: Graphical Models for the Internet

Trees + Ontologies

• Ontology classification (e.g. YDir, DMOZ)

x

y

y

y

y

y

y y

Document

Labels p(y|x) =�

i

ψ(yi, yparent(i), x)

Page 491: Graphical Models for the Internet

Spin Glasses + Images

xy

observed pixelsreal image

p(x|y) =�

ij

ψright(xij , xi+1,j)ψup(xij , xi,j+1)ψxy(xij , yij)

Page 492: Graphical Models for the Internet

Spin Glasses + Images

xy

observed pixelsreal image

p(x|y) =�

ij

ψright(xij , xi+1,j)ψup(xij , xi,j+1)ψxy(xij , yij)

long range interactions

Page 493: Graphical Models for the Internet

Image Denoising

Li&Huttenlocher, ECCV’08

Page 494: Graphical Models for the Internet

Semi-Markov Models

• Flexible length of an episode• Segmentation between episodes

classification

CRF

SMM

phrase segmentation, activity recognition, motion data analysisShi, Smola, Altun, Vishwanathan, Li, 2007-2009

Page 495: Graphical Models for the Internet

2D CRF for Webpages

web page information extraction, segmentation, annotationBo, Zhu, Nie, Wen, Hon, 2005-2007

Page 496: Graphical Models for the Internet

Exponential Families and Graphical Models

Page 497: Graphical Models for the Internet

Exponential Family Reunion

• Density function

• Log partition function generates cumulants

• g is convex (second derivative is p.s.d.)

p(x; θ) = exp (�φ(x), θ� − g(θ))

where g(θ) = log�

x�

exp (�φ(x�), θ�)

∂θg(θ) = E [φ(x)]

∂2θg(θ) = Var [φ(x)]

Page 498: Graphical Models for the Internet

Log Partition FunctionUnconditional model

p(y|θ, x) = e�φ(x,y),θ�−g(θ|x)

g(θ|x) = log�

y

e�φ(x,y),θ�

∂θg(θ|x) =�

y φ(x, y)e�φ(x,y),θ��

y e�φ(x,y),θ� =�

y

φ(x, y)e�φ(x,y),θ�−g(θ|x)

p(x|θ) = e�φ(x),θ�−g(θ)

g(θ) = log�

x

e�φ(x),θ�

∂θg(θ) =�

x φ(x)e�φ(x),θ��

x e�φ(x),θ� =�

x

φ(x)e�φ(x),θ�−g(θ)

Conditional model

Page 499: Graphical Models for the Internet

• Conditional log-likelihood

• Log-posterior (Gaussian Prior)

• First order optimality conditions

Estimation

log p(y|x; θ) = �φ(x, y), θ� − g(θ|x)

log p(θ|X,Y ) =�

i

log(yi|xi; θ) + log p(θ) + const.

=

��

i

φ(xi, yi), θ

�−

i

g(θ|xi)−1

2σ2�θ�2 + const.

i

φ(xi, yi) =�

i

Ey|xi[φ(xi, y)] +

1σ2

θ

priormaxentmodel

expensive

Page 500: Graphical Models for the Internet

Logistic Regression• Label space

• Log-partition function

• Convex minimization problem

• Prediction

x

y

φ(x, y) = yφ(x) where y ∈ {±1}

g(θ|x) = log�e1·�φ(x),θ� + e−1·�φ(x),θ�

�= log 2 cosh �φ(x), θ�

minimizeθ

12σ2

�θ�2 +�

i

log 2 cosh �φ(xi), θ� − yi �φ(xi, θ�

p(y|x, θ) =ey�φ(x),θ�

e�φ(x),θ� + e−�φ(x),θ� =1

1 + e−2y�φ(x),θ�

Page 501: Graphical Models for the Internet

Exponential Clique Decomposition

p(x) =�

c

ψc(xc)

Theorem: Clique decomposition holds in sufficient statisticsφ(x) = (. . . , φc(xc), . . .) and �φ(x), θ� =

c

�φc(xc), θc�

Corollary: we only need expectations on cliquesEx[φ(x)] = (. . . ,Exc [φc(xc)] , . . .)

Page 502: Graphical Models for the Internet

Conditional Random Fields

y

x

φ(x) = (y1φx(x1), . . . , ynφx(xn), φy(y1, y2), . . . ,φy(yn−1, yn))

�φ(x), θ� =�

i

�φx(xi, yi), θx� +�

i

�φy(yi, yi+1), θy�

g(θ|x) =�

y

i

fi(yi, yi+1) where

fi(yi, yi+1) = e�φx(xi,yi),θx�+�φy(yi,yi+1),θy�

dynamicprogramming

Page 503: Graphical Models for the Internet

Conditional Random Fields

• Compute distribution over marginal and adjacent labels

• Take conditional expectations• Take update step (batch or online)

• More general techniques for computing normalization via message passing ...

Page 504: Graphical Models for the Internet

Dynamic Programming +Message Passing

Page 505: Graphical Models for the Internet

Clique Graph

4

2

3

8

5

1

7

6

p(x) =�

c

ψc(xc)

Page 506: Graphical Models for the Internet

1,2,31,2,3 2,3,4 4,5,8

7,8

6,7

5,6

Clique Graph

4

2

3

8

5

1

7

6

p(x) =�

c

ψc(xc)

Page 507: Graphical Models for the Internet

1,2,31,2,3 2,3,4 4,5,8

7,8

6,7

5,6

Clique Graph

4

2

3

8

5

1

7

6

2,3 4

8

56

7

p(x) =�

c

ψc(xc)

Page 508: Graphical Models for the Internet

1,2,31,2,3 2,3,4 4,5,8

7,8

6,7

5,6

Clique Graph

4

2

3

8

5

1

7

6

2,3 4

8

56

7

p(x) =�

c

ψc(xc)

loop

Page 509: Graphical Models for the Internet

7,8

6,7

5,6

8

56

7

1,2,31,2,3 2,3,4 4,5,8

Junction Tree / Triangulation

4

2

3

8

5

1

7

6

2,3 4

p(x) =�

c

ψc(xc)

Page 510: Graphical Models for the Internet

7,8

6,7

5,6

8

56

7

1,2,31,2,3 2,3,4 4,5,8

Junction Tree / Triangulation

4

2

3

8

5

1

7

6

2,3 4

p(x) =�

c

ψc(xc)

Page 511: Graphical Models for the Internet

5,7,8 5,6,75,8 5,71,2,31,2,3 2,3,4 4,5,8

Junction Tree / Triangulation

4

2

3

8

5

1

7

6

2,3 4

p(x) =�

c

ψc(xc)

message passing possible

Page 512: Graphical Models for the Internet

Triangulation Examples

• Clique size increases

• Separator set size increases

4

2 3

5

1

7

6

Page 513: Graphical Models for the Internet

Triangulation Examples

• Clique size increases

• Separator set size increases

4

2 3

5

1

7

6

Page 514: Graphical Models for the Internet

Triangulation Examples

• Clique size increases

• Separator set size increases

4

2 3

5

1

7

6

1,5,61,2,3 1,3 1,41,3,4 1,4,5 1,5 1,6 1,6,7

Page 515: Graphical Models for the Internet

Message Passing

• Joint Probability

• Computing the normalization

1,5,61,2,3 1,3 1,41,3,4 1,4,5 1,5 1,6 1,6,7

p(x) ∝ ψ(x1, x2, x3)ψ(x1, x3, x4)ψ(x1, x4, x5)ψ(x1, x5, x6)ψ(x1, x6, x7)

m→(x1, x3) =�

x2

ψ(x1, x2, x3)

m→(x1, x4) =�

x3

m→(x1, x3)ψ(x1, x3, x4)

m→(x1, x5) =�

x4

m→(x1, x4)ψ(x1, x4, x5)

Page 516: Graphical Models for the Internet

Message Passing

• Initialize messages with 1• Guaranteed to converge for (junction) trees• Works well in practice even for loopy graphs• Only local computations are required

xs

xc

xs’

xs’’ xc’

mc→c�(xc∩c�) =�

xc\c�

ψc(xc)�

c��∈N(c)\c�

mc��→c(xc∩c��)

p(xc) ∝ ψc(xc)�

c��∈N(c)

mc��→c(xc∩c��)

Page 517: Graphical Models for the Internet

Message Passing in Practice• Incoming messages contain aggregate

uncertainty from neighboring random variables• Message passing combines and transmits this

information in both directionsphase 1ranker

tiering

crawler

Page 518: Graphical Models for the Internet

Message Passing in Practice• Incoming messages contain aggregate

uncertainty from neighboring random variables• Message passing combines and transmits this

information in both directionsphase 1ranker

tiering

crawler

automate

Page 519: Graphical Models for the Internet

Related work

• Tools• GraphLab (CMU - Guestrin, Low,

Gonzalez ...)• Factorie (UMass - McCallum & coworkers)• HBC (Hal Daume)• Variational Bayes .NET (MSR Cambridge)

• See more on alex.smola.org / blog.smola.org

Page 520: Graphical Models for the Internet

Today’s mission

Find structure in the data

Human understandableImproved knowledge for estimation