B490 Mining the Big Data
TRANSCRIPT
2-3
Data Mining

What is Data Mining?
A “definition”: Discovery of useful, possibly unexpected, patterns in data.

I don’t think this is practical until the day machines have intelligence. (You may have a different opinion.)

I think, most of the time, people just mean to:
• Compute some functions defined on the data (efficient algorithms).
• Fit the data into some concrete models (statistical modeling).
3-4
In this course, we will talk about . . .

In this course we will focus on efficient algorithms. In particular, we will discuss:
• Finding similar items
• Mining frequent items
• Clustering (aggregating similar items)
• Link analysis (exploring structure in large graphs)
5-2
Big Data

Big data is everywhere:
• over 2.5 petabytes of sales transactions
• an index of over 19 billion web pages
• over 40 billion pictures
• . . .

Magazine covers: Nature ’06, CACM ’08, Nature ’08, Economist ’10
6-3
Source and Challenge

Source
• Retailer databases: Amazon, Walmart
• Logistics, financial & health data: stock prices
• Social networks: Facebook, Twitter
• Pictures from mobile devices: iPhone
• Internet traffic: IP addresses
• New forms of scientific data: Large Synoptic Survey Telescope

Challenge
• Volume
• Velocity
• Variety (documents, stock records, personal profiles, photographs, audio & video, 3D models, location data, . . . )
These challenges are the focus of algorithm design in this course.
7-4
What does Big Data Really Mean?
We don’t define Big Data in terms of TB, PB, EB, . . .
The data is too big to fit in memory. What can we do?
Process the items one by one as they arrive, and throw some of them away on the fly.

Or store the data on multiple machines, which collaborate via communication.

Either way, the RAM model (a single processor with an infinite-size memory, where probing each memory cell has unit cost) no longer fits.
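One classic way to “throw some away on the fly” is reservoir sampling (my illustration; the slide names no particular algorithm), which keeps a uniform random sample of k items in one pass with O(k) memory:

```python
# Reservoir sampling: one pass, O(k) memory; every stream item ends
# up in the sample with equal probability k/n, even though the stream
# length n is unknown in advance. (Illustrative sketch.)
import random

def reservoir_sample(stream, k):
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)          # fill the reservoir first
        else:
            j = random.randint(0, i)  # keep x with probability k/(i+1)
            if j < k:
                sample[j] = x
    return sample

print(reservoir_sample(range(1_000_000), k=10))
```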
9-2
Data Streams

The data stream model (Alon, Matias & Szegedy 1996): a CPU with a small RAM sees the data once, as a stream. Widely used: Stanford Stream, Aurora, Telegraph, NiagaraCQ, . . .

Applications
• Internet router: packets stream past the router, which has limited space but wants to maintain some statistics on the data, e.g., to detect anomalies for security.
• Stock data, ad auctions, flight logs on tapes, etc.
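The slides leave “maintain some statistics” abstract; one concrete possibility (my choice, previewing Part 3) is the Misra-Gries algorithm, which tracks frequent items, such as heavy-hitter IP addresses, using only k counters:

```python
# Misra-Gries sketch: bounded memory (k counters) in one pass.
# Guarantee: every item with frequency > len(stream)/(k+1) is
# guaranteed to survive in the counters. (Illustrative example.)
def misra_gries(stream, k):
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k:
            counters[x] = 1
        else:
            # Counters full: decrement everything, drop zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

print(misra_gries(["a", "b", "a", "c", "a", "a", "b"], k=2))  # {'a': 3, 'b': 1}
```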
11-1
Difficulty: See and forget!
Game 1: A sequence of numbers
Game 2: Relationships between
Alice, Bob, Carol, Dave, Eva and Paul
Q: What’s the median?
A: 33
11-2
Difficulty: See and forget!
Game 1: A sequence of numbers
Game 2: Relationships between
Alice, Bob, Carol, Dave, Eva and Paul
Q: What’s the median?
A: 33
Alice and Bob become friends
11-3
Difficulty: See and forget!
Game 1: A sequence of numbers
Game 2: Relationships between
Alice, Bob, Carol, Dave, Eva and Paul
Q: What’s the median?
A: 33
Carol and Eva become friends
11-4
Difficulty: See and forget!
Game 1: A sequence of numbers
Game 2: Relationships between
Alice, Bob, Carol, Dave, Eva and Paul
Q: What’s the median?
A: 33
Eva and Bob become friends
11-5
Difficulty: See and forget!
Game 1: A sequence of numbers
Game 2: Relationships between
Alice, Bob, Carol, Dave, Eva and Paul
Q: What’s the median?
A: 33
Dave and Paul become friends
11-6
Difficulty: See and forget!
Game 1: A sequence of numbers
Game 2: Relationships between
Alice, Bob, Carol, Dave, Eva and Paul
Q: What’s the median?
A: 33
Alice and Paul become friends
11-7
Difficulty: See and forget!
Game 1: A sequence of numbers
Game 2: Relationships between
Alice, Bob, Carol, Dave, Eva and Paul
Q: What’s the median?
A: 33
Eva and Bob unfriends
11-8
Difficulty: See and forget!
Game 1: A sequence of numbers
Game 2: Relationships between
Alice, Bob, Carol, Dave, Eva and Paul
Q: What’s the median?
A: 33
Alice and Dave become friends
11-9
Difficulty: See and forget!
Game 1: A sequence of numbers
Game 2: Relationships between
Alice, Bob, Carol, Dave, Eva and Paul
Q: What’s the median?
A: 33
Bob and Paul become friends
11-10
Difficulty: See and forget!
Game 1: A sequence of numbers
Game 2: Relationships between
Alice, Bob, Carol, Dave, Eva and Paul
Q: What’s the median?
A: 33
Dave and Paul unfriends
11-11
Difficulty: See and forget!
Game 1: A sequence of numbers
Game 2: Relationships between
Alice, Bob, Carol, Dave, Eva and Paul
Q: What’s the median?
A: 33
Dave and Carol become friends
11-12
Difficulty: See and forget!
Game 1: A sequence of numbers
Game 2: Relationships between
Alice, Bob, Carol, Dave, Eva and Paul
Q: What’s the median?
A: 33
Q: Are Eva and Bob connected by friends?
11-13
Difficulty: See and forget!
Game 1: A sequence of numbers
Game 2: Relationships between
Alice, Bob, Carol, Dave, Eva and Paul
Q: What’s the median?
A: 33
Q: Are Eva and Bob connected by friends?
A: YES. Eva ⇔ Carol ⇔ Dave ⇔ Alice ⇔ Bob
11-14
Difficulty: See and forget!
Game 1: A sequence of numbers
Game 2: Relationships between
Alice, Bob, Carol, Dave, Eva and Paul
Q: What’s the median?
A: 33
Q: Are Eva and Bob connected by friends?
A: YES. Eva ⇔ Carol ⇔ Dave ⇔ Alice ⇔ Bob
Have to allow approx/randomization given a small memory.
12-3
MapReduce

The MapReduce model (Dean & Ghemawat 2004): the standard model in industry for massive data computation, e.g., Hadoop.

Pipeline: Input → Map → Shuffle → Reduce → Output
• Map: for each value x_i, emit x_i → {(key1, v1), (key2, v2), . . . }
• Shuffle: aggregate pairs by key
• Reduce: {(key1, v1), (key1, v2), . . . } → {y1, y2, . . . }

Goal
Minimize (1) total communication, (2) the number of rounds.
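To make the pipeline concrete, here is a toy word count in plain Python, mimicking Map → Shuffle → Reduce rather than using Hadoop’s actual API:

```python
# A minimal word-count sketch of the MapReduce pipeline.
from collections import defaultdict

def map_fn(doc):
    # Map: each input value emits a list of (key, value) pairs.
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # Shuffle: group all values that share a key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    # Reduce: turn one key's values into an output record.
    return (key, sum(values))

docs = ["big data mining", "mining frequent items", "big graphs"]
pairs = [pair for doc in docs for pair in map_fn(doc)]
counts = [reduce_fn(k, vs) for k, vs in shuffle(pairs).items()]
print(counts)  # e.g., [('big', 2), ('data', 1), ('mining', 2), ...]
```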
13-1
ActiveDHT

The ActiveDHT model (Bahmani, Chowdhury & Goel 2010): used in Yahoo! S4 & Twitter Storm.

Keys are hashed onto a ring (positions 0–15 in the slide’s figure), and each node is responsible for a range of hash values (e.g., one node for keys with hash = 4, 5; another for keys with hash = 6, 7). Two operations:
• Update(key, a_t)
• Query(key)
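A minimal sketch of the idea, under simplifying assumptions (a fixed 16-position ring, 8 nodes owning two positions each, counter-style per-key state; the real model’s details differ):

```python
# Hypothetical ActiveDHT-style sketch: route Update/Query by key hash
# to the responsible node. Node layout and update semantics are my
# assumptions for illustration, not the paper's definitions.
import hashlib
from collections import defaultdict

HASH_SPACE = 16

class Node:
    def __init__(self):
        self.state = defaultdict(int)   # per-key state at this node

    def update(self, key, value):
        self.state[key] += value        # apply the stream update locally

    def query(self, key):
        return self.state[key]

def hash_key(key):
    # Map a key to a position on the ring.
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % HASH_SPACE

nodes = [Node() for _ in range(8)]      # each node owns 2 ring positions

def responsible(key):
    return nodes[hash_key(key) // 2]

responsible("x").update("x", 3)         # Update(key, a_t)
responsible("x").update("x", -1)
print(responsible("x").query("x"))      # Query(key) -> 2
```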
14-1
Tentative course plan
Part 0 : Introductions
Part 1 : Finding Similar Items
– Jaccard Similarity and Min-Hashing
– Locality Sensitive Hashing (LSH) and Distances
– Implementing LSH in ActiveDHT
Part 2 : Clustering
– Hierarchical Clustering
– Assignment-based Clustering (k-center, k-means, k-median)
– Spectral Clustering
Part 3 : Mining Frequent Items
– Finding Frequent Itemsets
– Finding Frequent Items in a Data Stream
Part 4 : Link Analysis
– Markov Chain Basics
– Webpage Similarity and PageRank
– Implementing PageRank in MapReduce
15-1
Resources
There is no official textbook for the class.
Background on Randomized Algorithms:
• Probability and Computing
by Mitzenmacher and Upfal
Main reference book:
• Mining of Massive Datasets
by Anand Rajaraman and Jeff Ullman
16-1
Instructors
Instructor: Qin Zhang
Email: [email protected]
Office hours: by email appointment

Assistant Instructor: Prasanth Velamala
Email: [email protected]
Office hours: Thursdays, 2pm-3pm
17-2
Grading
Assignments (50%): There will be several homework assignments. Solutions should be typeset in LaTeX (highly recommended) or Word.

Project (50%): The project consists of three components:
1. Write a proposal.
2. Write a report.
3. Make a presentation.
(Details will be posted online.)

Most important thing: learn something about models, algorithmic techniques, and theoretical analysis for mining big data.

A, B, . . . will be given for each item (assignments or projects). The final grade will be a weighted average (according to XX%).
18-1
LaTeX

LaTeX: a highly recommended tool for assignments/reports.

1. Read the Wikipedia article: http://en.wikipedia.org/wiki/LaTeX
2. Find a good LaTeX editor.
3. Learn how to use it, e.g., read “A Not So Short Introduction to LaTeX 2e” (Google it).
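To get started, a minimal document might look like the following (an illustrative skeleton, not a required course template):

```latex
% A minimal LaTeX skeleton to start from.
\documentclass{article}
\usepackage{amsmath,amssymb}  % common math packages

\begin{document}
\section*{Homework 1}
Markov's inequality: for a random variable $X \ge 0$ and any $a > 0$,
\[
  \Pr[X \ge a] \le \frac{\mathbb{E}[X]}{a}.
\]
\end{document}
```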
19-1
Prerequisites

You are expected to know the basics of algorithm design and analysis, probability, and programming.

E.g., have taken
(Math) M365 “Introduction to Probability and Statistics”,
(Math) M301 “Linear Algebra and Applications”,
(CS) C241 “Discrete Structures for Computer Science”,
(CS) B403 “Introduction to Algorithm Design and Analysis”,
or equivalent courses.

I will NOT start with things like big-O notation or the definitions of random variables and expectation. But please ask at any time if you don’t understand something.
20-1
Possible project topics
Part 1 : Finding Similar Items
– Locality Sensitive Hashing: Given a dictionary of a large number of documents (or other objects) and a set of query docs, find, for each query doc, all docs in the dictionary that are similar. Compare LSH with other methods that you can think of (e.g., the trivial one: compare the query with each of the docs in the dictionary) in terms of running time.
Part 2 : Clustering
– Assignment-based Clustering (k-center, k-means, k-median): Select clustering algorithms taught in class, and run them on large data sets. One can also try to compare them with hierarchical clustering.
Part 3 : Mining Frequent Items
– Finding Frequent Itemsets: Run the A-Priori algorithm on large data sets to find frequent itemsets.
– Finding Frequent Items in a Data Stream: Implement streaming algorithms taught in class, and run them on large data sets to find frequent items.
In both cases, compare the results with the true frequent items/itemsets.
22-2
Approximation and Randomization

Approximation
Return f̂(A) instead of f(A), where
|f(A) − f̂(A)| ≤ ε·f(A).
Such an f̂(A) is a (1 + ε)-approximation of f(A).

Randomization
Return f̂(A) instead of f(A), where
Pr[ |f(A) − f̂(A)| ≤ ε·f(A) ] ≥ 1 − δ.
Such an f̂(A) is a (1 + ε, δ)-approximation of f(A).
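As a toy illustration (mine, with a heuristic sample size rather than a derived bound), the following estimates f(A) = sum(A) by random sampling; with enough samples the output is a (1 + ε, δ)-approximation in the sense above:

```python
# Estimate f(A) = sum(A) by sampling. The sample size m below is a
# heuristic choice for the demo, not a bound derived from (eps, delta).
import random

def approx_sum(A, m=10_000):
    # Sample m elements with replacement and scale the mean up.
    return len(A) * sum(random.choice(A) for _ in range(m)) / m

A = [random.randint(0, 100) for _ in range(1_000_000)]
est, true = approx_sum(A), sum(A)
print(true, est, abs(true - est) / true)  # relative error, usually small
```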
23-2
Markov and Chebyshev inequalities

Markov Inequality
Let X ≥ 0 be a random variable. Then for all a > 0,
Pr[X ≥ a] ≤ E[X] / a.

Chebyshev’s Inequality
Let X be any random variable with finite variance. Then for all a > 0,
Pr[ |X − E[X]| ≥ a ] ≤ Var[X] / a².
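A worked comparison (my example, not from the slides): let X be the number of heads in n fair coin flips, so E[X] = n/2 and Var[X] = n/4.

```latex
% Markov, using only the mean:
\[
  \Pr[X \ge 3n/4] \;\le\; \frac{\mathbb{E}[X]}{3n/4}
    = \frac{n/2}{3n/4} = \frac{2}{3}.
\]
% Chebyshev, using the variance, is far stronger for large n:
\[
  \Pr[X \ge 3n/4] \;\le\; \Pr\bigl[|X - n/2| \ge n/4\bigr]
    \;\le\; \frac{\mathrm{Var}[X]}{(n/4)^2}
    = \frac{n/4}{n^2/16} = \frac{4}{n}.
\]
```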
24-4
Application: Birthday Paradox

Birthday Paradox
In a set of k randomly chosen people, what is the probability that at least one pair of them have the same birthday? Assume each person’s birthday is chosen uniformly at random from the n = 365 days Jan. 1 to Dec. 31.

Take 1: For any pair of people, the probability that they have the same birthday is 1/n. For k people, we have C(k, 2) = k(k − 1)/2 pairs of people. The probability that none of them have the same birthday is (1 − 1/n)^C(k,2). Thus the answer is 1 − (1 − 1/n)^C(k,2). Wrong! The pair events are not independent, so their probabilities cannot simply be multiplied.

Take 2: Pr[no collision] = (n − 0)/n · (n − 1)/n · (n − 2)/n · . . . · (n − (k − 1))/n, so
Pr[exists collision] = 1 − ∏_{i=0}^{k−1} (n − i)/n ≈ k²/(2n) (the approximation is valid when k² ≪ n).
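A quick numeric sanity check (mine) of Take 2, and of when the k²/(2n) estimate is reliable:

```python
# Exact collision probability (Take 2) vs. the k^2/(2n) estimate, n = 365.
n = 365
for k in (5, 10, 23):
    p_no = 1.0
    for i in range(k):          # product of (n - i)/n over i = 0..k-1
        p_no *= (n - i) / n
    print(k, round(1 - p_no, 3), round(k * k / (2 * n), 3))
# Exact: ~0.027, ~0.117, ~0.507; estimate: ~0.034, ~0.137, ~0.725.
# The k^2/(2n) estimate drifts once k^2 is no longer small next to n.
```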
25-2
Application: Coupon Collector

Coupon Collector
Suppose that each box of cereal contains one of n different coupons. Once you obtain one of every type of coupon, you can send in for a prize.

Assuming that the coupon in each box is chosen independently and uniformly at random from the n possibilities, how many boxes of cereal must you buy before you obtain at least one of every type of coupon?
Analysis (on board)
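The board work is not in the transcript; a standard sketch of the expectation bound reads as follows:

```latex
% Let $X_i$ be the number of boxes bought while you hold exactly $i-1$
% distinct coupon types; $X_i$ is geometric with success probability
% $p_i = (n-i+1)/n$, so $\mathbb{E}[X_i] = n/(n-i+1)$. By linearity,
\[
  \mathbb{E}\Big[\sum_{i=1}^{n} X_i\Big]
    = \sum_{i=1}^{n} \frac{n}{n-i+1}
    = n \sum_{j=1}^{n} \frac{1}{j}
    = n H_n = n \ln n + O(n).
\]
```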
26-1
The Union Bound

The Union Bound
Consider t possibly dependent random events X_1, . . . , X_t. The probability that all events occur is at least
1 − ∑_{i=1}^{t} (1 − Pr[X_i occurs]).
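A typical use (my example): if each of t randomized estimates is correct with probability at least 1 − δ, the bound above gives:

```latex
% All t estimates are simultaneously correct with probability at least
\[
  1 - \sum_{i=1}^{t} \bigl(1 - (1 - \delta)\bigr) \;=\; 1 - t\delta .
\]
```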
27-1
Summary for the introduction
We have discussed Big Data and Data Mining.
We have introduced three popular models for modern computation: data streams, MapReduce, and ActiveDHT.
We have talked about the course plan and assessment.
We have covered some basics of probability.