B490 Mining the Big Data
TRANSCRIPT
2-3
Data Mining

What is Data Mining?
A “definition”: Discovery of useful, possibly unexpected, patterns in data.

I don’t think this is practical until the day machines have intelligence. (You may have a different opinion.)

I think, most of the time, people just mean to:
• Compute some functions defined on the data (efficient algorithms).
• Fit the data into some concrete models (statistical modeling).
3-4
In this course, we will talk about . . .

In this course we will focus on efficient algorithms. In particular, we will discuss:
• Finding similar items
• Mining frequent items
• Clustering (aggregating similar items)
• Link analysis (exploring structure in large graphs)
5-2
Big Data

Big data is everywhere:
• over 2.5 petabytes of sales transactions
• an index of over 19 billion web pages
• over 40 billion pictures
• . . .

Magazine covers: Nature ’06, CACM ’08, Nature ’08, Economist ’10
6-3
Source and Challenge

Source
• Retailer databases: Amazon, Walmart
• Logistics, financial & health data: stock prices
• Social networks: Facebook, Twitter
• Pictures from mobile devices: iPhone
• Internet traffic: IP addresses
• New forms of scientific data: Large Synoptic Survey Telescope

Challenge
• Volume
• Velocity
• Variety (documents, stock records, personal profiles, photographs, audio & video, 3D models, location data, . . . )
These challenges are the focus of algorithm design in this course.
7-4
What does Big Data Really Mean?
We don’t define Big Data in terms of TB, PB, EB, . . .
The data is too big to fit in memory. What can we do?
Process the items one by one as they arrive, and throw some of them away on the fly.

Or store the data on multiple machines, which collaborate via communication.

Either way, the RAM model (a single processor with an infinite-size memory, where probing each memory cell has unit cost) no longer fits.
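One classic way to “throw some away on the fly” is reservoir sampling (my illustration; the slide names no particular algorithm), which keeps a uniform random sample of k items in one pass with O(k) memory:

```python
# Reservoir sampling: one pass, O(k) memory; every stream item ends
# up in the sample with equal probability k/n, even though the stream
# length n is unknown in advance. (Illustrative sketch.)
import random

def reservoir_sample(stream, k):
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)          # fill the reservoir first
        else:
            j = random.randint(0, i)  # keep x with probability k/(i+1)
            if j < k:
                sample[j] = x
    return sample

print(reservoir_sample(range(1_000_000), k=10))
```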
9-2
Data Streams

The data stream model (Alon, Matias & Szegedy 1996): a CPU with a small RAM sees the data once, as a stream. Widely used: Stanford Stream, Aurora, Telegraph, NiagaraCQ, . . .

Applications
• Internet router: packets stream past the router, which has limited space but wants to maintain some statistics on the data, e.g., to detect anomalies for security.
• Stock data, ad auctions, flight logs on tapes, etc.
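The slides leave “maintain some statistics” abstract; one concrete possibility (my choice, previewing Part 3) is the Misra-Gries algorithm, which tracks frequent items, such as heavy-hitter IP addresses, using only k counters:

```python
# Misra-Gries sketch: bounded memory (k counters) in one pass.
# Guarantee: every item with frequency > len(stream)/(k+1) is
# guaranteed to survive in the counters. (Illustrative example.)
def misra_gries(stream, k):
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k:
            counters[x] = 1
        else:
            # Counters full: decrement everything, drop zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

print(misra_gries(["a", "b", "a", "c", "a", "a", "b"], k=2))  # {'a': 3, 'b': 1}
```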
11-1
Difficulty: See and forget!
Game 1: A sequence of numbers
Game 2: Relationships between
Alice, Bob, Carol, Dave, Eva and Paul
Q: What’s the median?
A: 33
11-2
Difficulty: See and forget!
Game 1: A sequence of numbers
Game 2: Relationships between
Alice, Bob, Carol, Dave, Eva and Paul
Q: What’s the median?
A: 33
Alice and Bob become friends
11-3
Difficulty: See and forget!
Game 1: A sequence of numbers
Game 2: Relationships between
Alice, Bob, Carol, Dave, Eva and Paul
Q: What’s the median?
A: 33
Carol and Eva become friends
11-4
Difficulty: See and forget!
Game 1: A sequence of numbers
Game 2: Relationships between
Alice, Bob, Carol, Dave, Eva and Paul
Q: What’s the median?
A: 33
Eva and Bob become friends
11-5
Difficulty: See and forget!
Game 1: A sequence of numbers
Game 2: Relationships between
Alice, Bob, Carol, Dave, Eva and Paul
Q: What’s the median?
A: 33
Dave and Paul become friends
11-6
Difficulty: See and forget!
Game 1: A sequence of numbers
Game 2: Relationships between
Alice, Bob, Carol, Dave, Eva and Paul
Q: What’s the median?
A: 33
Alice and Paul become friends
11-7
Difficulty: See and forget!
Game 1: A sequence of numbers
Game 2: Relationships between
Alice, Bob, Carol, Dave, Eva and Paul
Q: What’s the median?
A: 33
Eva and Bob unfriends
11-8
Difficulty: See and forget!
Game 1: A sequence of numbers
Game 2: Relationships between
Alice, Bob, Carol, Dave, Eva and Paul
Q: What’s the median?
A: 33
Alice and Dave become friends
11-9
Difficulty: See and forget!
Game 1: A sequence of numbers
Game 2: Relationships between
Alice, Bob, Carol, Dave, Eva and Paul
Q: What’s the median?
A: 33
Bob and Paul become friends
11-10
Difficulty: See and forget!
Game 1: A sequence of numbers
Game 2: Relationships between
Alice, Bob, Carol, Dave, Eva and Paul
Q: What’s the median?
A: 33
Dave and Paul unfriends
11-11
Difficulty: See and forget!
Game 1: A sequence of numbers
Game 2: Relationships between
Alice, Bob, Carol, Dave, Eva and Paul
Q: What’s the median?
A: 33
Dave and Carol become friends
11-12
Difficulty: See and forget!
Game 1: A sequence of numbers
Game 2: Relationships between
Alice, Bob, Carol, Dave, Eva and Paul
Q: What’s the median?
A: 33
Q: Are Eva and Bob connected by friends?
11-13
Difficulty: See and forget!
Game 1: A sequence of numbers
Game 2: Relationships between
Alice, Bob, Carol, Dave, Eva and Paul
Q: What’s the median?
A: 33
Q: Are Eva and Bob connected by friends?
A: YES. Eva ⇔ Carol ⇔ Dave ⇔ Alice ⇔ Bob
11-14
Difficulty: See and forget!
Game 1: A sequence of numbers
Game 2: Relationships between
Alice, Bob, Carol, Dave, Eva and Paul
Q: What’s the median?
A: 33
Q: Are Eva and Bob connected by friends?
A: YES. Eva ⇔ Carol ⇔ Dave ⇔ Alice ⇔ Bob
Have to allow approx/randomization given a small memory.
12-3
MapReduce

The MapReduce model (Dean & Ghemawat 2004): the standard model in industry for massive data computation, e.g., Hadoop.

Pipeline: Input → Map → Shuffle → Reduce → Output
• Map: for each value x_i, emit x_i → {(key1, v1), (key2, v2), . . . }
• Shuffle: aggregate pairs by key
• Reduce: {(key1, v1), (key1, v2), . . . } → {y1, y2, . . . }

Goal
Minimize (1) total communication, (2) the number of rounds.
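To make the pipeline concrete, here is a toy word count in plain Python, mimicking Map → Shuffle → Reduce rather than using Hadoop’s actual API:

```python
# A minimal word-count sketch of the MapReduce pipeline.
from collections import defaultdict

def map_fn(doc):
    # Map: each input value emits a list of (key, value) pairs.
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # Shuffle: group all values that share a key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    # Reduce: turn one key's values into an output record.
    return (key, sum(values))

docs = ["big data mining", "mining frequent items", "big graphs"]
pairs = [pair for doc in docs for pair in map_fn(doc)]
counts = [reduce_fn(k, vs) for k, vs in shuffle(pairs).items()]
print(counts)  # e.g., [('big', 2), ('data', 1), ('mining', 2), ...]
```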
13-1
ActiveDHT

The ActiveDHT model (Bahmani, Chowdhury & Goel 2010): used in Yahoo! S4 & Twitter Storm.

Keys are hashed onto a ring (positions 0–15 in the slide’s figure), and each node is responsible for a range of hash values (e.g., one node for keys with hash = 4, 5; another for keys with hash = 6, 7). Two operations:
• Update(key, a_t)
• Query(key)
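A minimal sketch of the idea, under simplifying assumptions (a fixed 16-position ring, 8 nodes owning two positions each, counter-style per-key state; the real model’s details differ):

```python
# Hypothetical ActiveDHT-style sketch: route Update/Query by key hash
# to the responsible node. Node layout and update semantics are my
# assumptions for illustration, not the paper's definitions.
import hashlib
from collections import defaultdict

HASH_SPACE = 16

class Node:
    def __init__(self):
        self.state = defaultdict(int)   # per-key state at this node

    def update(self, key, value):
        self.state[key] += value        # apply the stream update locally

    def query(self, key):
        return self.state[key]

def hash_key(key):
    # Map a key to a position on the ring.
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % HASH_SPACE

nodes = [Node() for _ in range(8)]      # each node owns 2 ring positions

def responsible(key):
    return nodes[hash_key(key) // 2]

responsible("x").update("x", 3)         # Update(key, a_t)
responsible("x").update("x", -1)
print(responsible("x").query("x"))      # Query(key) -> 2
```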
14-1
Tentative course plan
Part 0 : Introductions
Part 1 : Finding Similar Items
– Jaccard Similarity and Min-Hashing
– Locality Sensitive Hashing (LSH) and Distances
– Implementing LSH in ActiveDHT
Part 2 : Clustering
– Hierarchical Clustering
– Assignment-based Clustering (k-center, k-means, k-median)
– Spectral Clustering
Part 3 : Mining Frequent Items
– Finding Frequent Itemsets
– Finding Frequent Items in a Data Stream
Part 4 : Link Analysis
– Markov Chain Basics
– Webpage Similarity and PageRank
– Implementing PageRank in MapReduce
15-1
Resources
There is no official textbook for the class.
Background on Randomized Algorithms:
• Probability and Computing
by Mitzenmacher and Upfal
Main reference book:
• Mining of Massive Datasets
by Anand Rajaraman and Jeff Ullman
16-1
Instructors
Instructor: Qin Zhang
Email: [email protected]
Office hours: by email appointment

Assistant Instructor: Prasanth Velamala
Email: [email protected]
Office hours: Thursdays, 2pm-3pm
17-2
Grading
Assignments (50%): There will be several homework assignments. Solutions should be typeset in LaTeX (highly recommended) or Word.

Project (50%): The project consists of three components:
1. Write a proposal.
2. Write a report.
3. Make a presentation.
(Details will be posted online.)

Most important thing: learn something about models, algorithmic techniques, and theoretical analysis for mining big data.

A, B, . . . will be given for each item (assignments or projects). The final grade will be a weighted average (according to XX%).
18-1
LaTeX

LaTeX: a highly recommended tool for assignments/reports.

1. Read the Wikipedia article: http://en.wikipedia.org/wiki/LaTeX
2. Find a good LaTeX editor.
3. Learn how to use it, e.g., read “A Not So Short Introduction to LaTeX 2e” (Google it).
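To get started, a minimal document might look like the following (an illustrative skeleton, not a required course template):

```latex
% A minimal LaTeX skeleton to start from.
\documentclass{article}
\usepackage{amsmath,amssymb}  % common math packages

\begin{document}
\section*{Homework 1}
Markov's inequality: for a random variable $X \ge 0$ and any $a > 0$,
\[
  \Pr[X \ge a] \le \frac{\mathbb{E}[X]}{a}.
\]
\end{document}
```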
19-1
Prerequisites

You are expected to know the basics of algorithm design and analysis, probability, and programming.

E.g., have taken
(Math) M365 “Introduction to Probability and Statistics”,
(Math) M301 “Linear Algebra and Applications”,
(CS) C241 “Discrete Structures for Computer Science”,
(CS) B403 “Introduction to Algorithm Design and Analysis”,
or equivalent courses.

I will NOT start with things like big-O notation or the definitions of random variables and expectation. But please ask at any time if you don’t understand something.
20-1
Possible project topics
Part 1 : Finding Similar Items
– Locality Sensitive Hashing: Given a dictionary of a large number of documents (or other objects) and a set of query docs, find, for each query doc, all docs in the dictionary that are similar. Compare LSH with other methods that you can think of (e.g., the trivial one: compare the query with each of the docs in the dictionary) in terms of running time.
Part 2 : Clustering
– Assignment-based Clustering (k-center, k-means, k-median): Select clustering algorithms taught in class, and run them on large data sets. One can also try to compare them with hierarchical clustering.
Part 3 : Mining Frequent Items
– Finding Frequent Itemsets: Run the A-Priori algorithm on large data sets to find frequent itemsets.
– Finding Frequent Items in a Data Stream: Implement streaming algorithms taught in class, and run them on large data sets to find frequent items.
In both cases, compare the results with the true frequent items/itemsets.
22-2
Approximation and Randomization

Approximation
Return f̂(A) instead of f(A), where
|f(A) − f̂(A)| ≤ ε·f(A).
Such an f̂(A) is a (1 + ε)-approximation of f(A).

Randomization
Return f̂(A) instead of f(A), where
Pr[ |f(A) − f̂(A)| ≤ ε·f(A) ] ≥ 1 − δ.
Such an f̂(A) is a (1 + ε, δ)-approximation of f(A).
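As a toy illustration (mine, with a heuristic sample size rather than a derived bound), the following estimates f(A) = sum(A) by random sampling; with enough samples the output is a (1 + ε, δ)-approximation in the sense above:

```python
# Estimate f(A) = sum(A) by sampling. The sample size m below is a
# heuristic choice for the demo, not a bound derived from (eps, delta).
import random

def approx_sum(A, m=10_000):
    # Sample m elements with replacement and scale the mean up.
    return len(A) * sum(random.choice(A) for _ in range(m)) / m

A = [random.randint(0, 100) for _ in range(1_000_000)]
est, true = approx_sum(A), sum(A)
print(true, est, abs(true - est) / true)  # relative error, usually small
```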
23-2
Markov and Chebyshev inequalities

Markov Inequality
Let X ≥ 0 be a random variable. Then for all a > 0,
Pr[X ≥ a] ≤ E[X] / a.

Chebyshev’s Inequality
Let X be any random variable with finite variance. Then for all a > 0,
Pr[ |X − E[X]| ≥ a ] ≤ Var[X] / a².
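A worked comparison (my example, not from the slides): let X be the number of heads in n fair coin flips, so E[X] = n/2 and Var[X] = n/4.

```latex
% Markov, using only the mean:
\[
  \Pr[X \ge 3n/4] \;\le\; \frac{\mathbb{E}[X]}{3n/4}
    = \frac{n/2}{3n/4} = \frac{2}{3}.
\]
% Chebyshev, using the variance, is far stronger for large n:
\[
  \Pr[X \ge 3n/4] \;\le\; \Pr\bigl[|X - n/2| \ge n/4\bigr]
    \;\le\; \frac{\mathrm{Var}[X]}{(n/4)^2}
    = \frac{n/4}{n^2/16} = \frac{4}{n}.
\]
```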
24-4
Application: Birthday Paradox

Birthday Paradox
In a set of k randomly chosen people, what is the probability that at least one pair of them have the same birthday? Assume each person’s birthday is chosen uniformly at random from the n = 365 days Jan. 1 to Dec. 31.

Take 1: For any pair of people, the probability that they have the same birthday is 1/n. For k people, we have C(k, 2) = k(k − 1)/2 pairs of people. The probability that none of them have the same birthday is (1 − 1/n)^C(k,2). Thus the answer is 1 − (1 − 1/n)^C(k,2). Wrong! The pair events are not independent, so their probabilities cannot simply be multiplied.

Take 2: Pr[no collision] = (n − 0)/n · (n − 1)/n · (n − 2)/n · . . . · (n − (k − 1))/n, so
Pr[exists collision] = 1 − ∏_{i=0}^{k−1} (n − i)/n ≈ k²/(2n) (the approximation is valid when k² ≪ n).
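A quick numeric sanity check (mine) of Take 2, and of when the k²/(2n) estimate is reliable:

```python
# Exact collision probability (Take 2) vs. the k^2/(2n) estimate, n = 365.
n = 365
for k in (5, 10, 23):
    p_no = 1.0
    for i in range(k):          # product of (n - i)/n over i = 0..k-1
        p_no *= (n - i) / n
    print(k, round(1 - p_no, 3), round(k * k / (2 * n), 3))
# Exact: ~0.027, ~0.117, ~0.507; estimate: ~0.034, ~0.137, ~0.725.
# The k^2/(2n) estimate drifts once k^2 is no longer small next to n.
```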
25-2
Application: Coupon Collector

Coupon Collector
Suppose that each box of cereal contains one of n different coupons. Once you obtain one of every type of coupon, you can send in for a prize.

Assuming that the coupon in each box is chosen independently and uniformly at random from the n possibilities, how many boxes of cereal must you buy before you obtain at least one of every type of coupon?
Analysis (on board)
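The board work is not in the transcript; a standard sketch of the expectation bound reads as follows:

```latex
% Let $X_i$ be the number of boxes bought while you hold exactly $i-1$
% distinct coupon types; $X_i$ is geometric with success probability
% $p_i = (n-i+1)/n$, so $\mathbb{E}[X_i] = n/(n-i+1)$. By linearity,
\[
  \mathbb{E}\Big[\sum_{i=1}^{n} X_i\Big]
    = \sum_{i=1}^{n} \frac{n}{n-i+1}
    = n \sum_{j=1}^{n} \frac{1}{j}
    = n H_n = n \ln n + O(n).
\]
```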
26-1
The Union Bound

The Union Bound
Consider t possibly dependent random events X_1, . . . , X_t. The probability that all events occur is at least
1 − ∑_{i=1}^{t} (1 − Pr[X_i occurs]).
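A typical use (my example): if each of t randomized estimates is correct with probability at least 1 − δ, the bound above gives:

```latex
% All t estimates are simultaneously correct with probability at least
\[
  1 - \sum_{i=1}^{t} \bigl(1 - (1 - \delta)\bigr) \;=\; 1 - t\delta .
\]
```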
27-1
Summary for the introduction
We have discussed Big Data and Data Mining.
We have introduced three popular models for modern computation: data streams, MapReduce, and ActiveDHT.
We have talked about the course plan and assessment.
We have covered some basics of probability.