-
CS294-1 Behavioral
Data Mining
Prof: John Canny
GSI: Huasha Zhao
-
A brief history of behavioral data
-
1997
-
PageRank: The web as a behavioral dataset
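The computation behind it is compact; a minimal power-iteration sketch (the damping factor, toy graph, and iteration count are illustrative assumptions, not from the lecture):

import java.util.Arrays;

// Toy PageRank by power iteration. links[i] lists the pages that page i links to.
public class PageRank {
    static double[] rank(int[][] links, double damping, int iters) {
        int n = links.length;
        double[] r = new double[n];
        Arrays.fill(r, 1.0 / n);
        for (int t = 0; t < iters; t++) {
            double[] next = new double[n];
            Arrays.fill(next, (1.0 - damping) / n);          // teleport mass
            for (int i = 0; i < n; i++) {
                if (links[i].length == 0) {                   // dangling page: spread evenly
                    for (int j = 0; j < n; j++) next[j] += damping * r[i] / n;
                } else {
                    for (int j : links[i]) next[j] += damping * r[i] / links[i].length;
                }
            }
            r = next;
        }
        return r;
    }

    public static void main(String[] args) {
        int[][] links = {{1, 2}, {2}, {0}};                   // 0 -> 1,2;  1 -> 2;  2 -> 0
        System.out.println(Arrays.toString(rank(links, 0.85, 50)));
    }
}

At web scale the same iteration runs over tens of billions of pages, hence the server farms on the next slide.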
-
DB size = 50 billion sites
Google server farms
2 million machines (estimated)
-
1998: Sponsored search (Overture)
2002
-
Sponsored search
Google's revenue last year was around $28 bn from marketing, 97% of the company's revenue.
Sponsored search uses an auction: a pure competition among marketers trying to win access to consumers.
In other words, a competition between models of consumers: their likelihood of responding to the ad, and the right bid for the item.
There are around 20 billion search requests a month, and perhaps a trillion events of history across search providers.
Features that matter: source URL, time-of-day, geo, demographics, and user history. Increasingly real-time.
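To make the auction concrete, here is a toy generalized second-price (GSP) ranking, the textbook mechanism for sponsored search; the ad names, bids, and CTR estimates are invented for illustration:

import java.util.Arrays;
import java.util.Comparator;

// Toy GSP auction: ads ranked by score = bid * estimated CTR, and each
// winner pays just enough per click to hold its slot.
public class SponsoredSearch {
    static class Ad {
        final String name; final double bid, ctr;
        Ad(String name, double bid, double ctr) { this.name = name; this.bid = bid; this.ctr = ctr; }
        double score() { return bid * ctr; }
    }

    public static void main(String[] args) {
        Ad[] ads = { new Ad("A", 2.00, 0.05), new Ad("B", 1.50, 0.10), new Ad("C", 3.00, 0.02) };
        Arrays.sort(ads, Comparator.comparingDouble((Ad a) -> a.score()).reversed());
        for (int i = 0; i < ads.length; i++) {
            // Price per click: the next ad's score divided by this ad's CTR (0 for the last slot).
            double price = (i + 1 < ads.length) ? ads[i + 1].score() / ads[i].ctr : 0.0;
            System.out.printf("slot %d: %s pays %.2f/click%n", i + 1, ads[i].name, price);
        }
    }
}

Note that better consumer models (higher estimated CTR) win better slots at lower prices, which is exactly the competition described above.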
-
Bluefin Technologies
Social Media and Sentiment
-
Bluefin Technologies
Microtargeting
-
Recommenders
-
Tag data + Implicit tagging
Based on Luis von Ahn's ESP game
Flickr hosts around 6 billion images
-
John Kelly, Berkman Center, Harvard
Computational Social Science
-
Sentiment Analysis
Alan Mislove et al., Northeastern
http://www.youtube.com/v/ujcrJZRSGkg&hl=en_US
-
Social Media and Crises
Virginia earthquake 2011: Mislove et al. 2011
http://www.youtube.com/watch?v=XJ1EQbmJ_LQ
-
Leskovec, Backstrom, Kleinberg
-
Modeling Spread of Disease from Social Interactions
Adam Sadilek, Henry Kautz, Vincent Silenzio
Epidemiology
-
Behavioral Data Landscape
Dataset sizes are dictated by the number of people: behavioral data is big, but not outrageously so.
[Chart: a log size scale from 10 GB through 1 TB to 100 TB; in range: web graph, portal logs, Twitter archives, Spinn3r archives, human genome]
NOT in this range: FB w/ images, YouTube, MRI data, climate models, the Web.
-
Course Components
Algorithms for mining behavioral data:
  Machine learning (natural language)
  Causal analysis
  Network algorithms
Tools:
  Algorithm level (BIDMach, Mahout)
  Matrix level (BIDMat, Matlab)
  Data movement (Hadoop, Spark)
Techniques:
  GPU programming
  Performance patterns
  Optimization
  Communication-minimizing algorithms
-
Course Contents
Introduction, example problems, open research questions
Basic statistics, bias-variance tradeoffs, Naïve Bayes classifier
Regression and generalized linear models (GLMs)
Causal analysis: Matching, propensity scores
About people - power laws, traits, social network structure
Performance measurement, significance tests, resampling
Map-Reduce: Hadoop, Spark
Machine biology - bandwidth hierarchy, caching, communication
Scientific computing: communication-minimizing algorithms
GPU architecture and programming
Text and metadata indexing and retrieval, XML content-bases
Natural language processing
-
Course Contents
Natural language processing
Classifiers
Excavating - crawling, web services, processing streams
Query languages
Factor Models - Naive Bayes, LDA, GaP
Optimizers: ADAGRAD, LBFGS, MCMC
Clustering - k-means, PAM, generative models
Causal analysis: Targeted Maximum-Likelihood Estimation
Causal graphical models
Prediction - kNN, kd-trees, kNC
Network algorithms - HITS and Pagerank
Network algorithms - diffusion
Visualizing large datasets
-
Instructor: John Canny (Ph.D. MIT 1987); research in computer vision, robotics.
Data mining projects:
MarkLogic co-founder 2003
BigTribe/Overstock 2004-2005
Yahoo 2007-2008
Ebay 2009-2010
Quantcast 2010-2011
About 40% of time in industry since 2003.
Research in behavior change, health care, education, computational social science: using data mining techniques on social media.
-
Desiderata for BDM Toolkits
Desiderata:
  Interactivity (Scala REPL)
  Natural syntax (+, -, *, etc.) and high-level expressivity
  CPU and GPU acceleration, custom GPU kernels
  Easy threading (Actors)
  Java VM + Java codebase
  C accelerators
Hence BIDMat (matrix algebra) and BIDMach (stats + machine learning), written in Scala (on GitHub).
-
Toolkit Roadmap
Algorithm layer:           BIDMach | Weka, MFE, CRAN | Mahout
Matrix layer:              BIDMat | Jama, Matlab, R
CPU and GPU acceleration:  Intel MKL, JCUDA
Cluster mapping:           Hadoop, Spark, Hyracks, DryadLINQ, Pig, Hive, Shark
-
GraphLab (CS colloquium talk today, 4pm)
Algorithm + graph layers:  GraphLab
CPU acceleration:          Intel MKL
Cluster mapping:           Powergraph
-
Innovations
Algorithm layer:           BIDMach, with optimizers (SGD, MCMC)
Matrix layer:              BIDMat
CPU and GPU acceleration:  Intel MKL, JCUDA
Cluster mapping:           communication-minimizing algorithms
-
A conflict
Algorithm layer:           BIDMach, with optimizers (SGD, MCMC)
Matrix layer:              BIDMat
CPU and GPU acceleration:  Intel MKL, JCUDA
Cluster mapping:           communication-minimizing algorithms
But the frequent model updates that optimizers like SGD and MCMC want collide with communication minimization: communication bloat.
-
Resolved
Algorithm layer:           BIDMach, with optimizers (SGD, MCMC)
Matrix layer:              BIDMat
CPU and GPU acceleration:  Intel MKL, JCUDA
Cluster mapping:           communication-minimizing algorithms
Resolution: butterfly mixing.
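A toy single-process simulation of the butterfly communication pattern (node count, scalar "models", and plain averaging are our illustrative assumptions; the real system interleaves optimizer minibatch steps between stages):

// Butterfly exchange over 2^d nodes: in stage s, node i averages its model
// with partner i XOR 2^s. After d stages every node holds the global
// average, and partial mixing is available after every stage, so SGD
// updates can be interleaved between stages.
public class ButterflyMix {
    public static void main(String[] args) {
        int d = 3, n = 1 << d;                      // 8 simulated nodes
        double[] model = new double[n];
        for (int i = 0; i < n; i++) model[i] = i;   // each node's local "model" (one scalar)

        for (int s = 0; s < d; s++) {
            double[] next = new double[n];
            for (int i = 0; i < n; i++) {
                int partner = i ^ (1 << s);         // butterfly partner at this stage
                next[i] = 0.5 * (model[i] + model[partner]);
            }
            model = next;                           // (an SGD minibatch step would go here)
        }
        System.out.println(java.util.Arrays.toString(model)); // all equal to the mean 3.5
    }
}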
-
Course Mechanics
The class wiki is up at:
http://bid.berkeley.edu/cs294-1-spring13/index.php/
There is no textbook for the course, but several recommended texts are up on the wiki.
There will be required and recommended readings posted before each class. Those are the text for the class.
There is a Piazza site for the course, listed under Berkeley as CS294-1. Please sign up at http://piazza.com
-
TA, Office Hours, Section
Teaching Assistants
Huasha Zhao, Par Lab (5th floor)
Office Hours
John Canny: TuTh 4-5, in 637 Soda Hall for now
Sections TBD
-
Prerequisites
A desire to learn a bunch of new topics and the time to do it.
Background in:
Basic math literacy:
Linear algebra, statistics
Programming experience, ideally in Java
Basic understanding of computers and networks
-
Course Enrollment
Waitlist + current enrollment is larger than the course limit.
We will process the waitlist by next Monday.
If you're on the waitlist, please fill out a petition and hand it in to 637 Soda.
-
What is data mining?
-
Data Mining
The process of exploiting large amounts of data? YES.
The art of discovering novel structure and value in data? YES! And this is where innovation happens.
Agile visualization and analysis of raw data
Quick hypothesis testing with sketch models
Ability to test against samples of big datasets
Ability to easily migrate to large scale
Ability to get performance data on the scaled-up model
-
A Few Words on Performance
Accuracy: as scientists, we care most about accuracy: drawing the most accurate conclusions we can from the data.
Resolution: we also want to be able to tell whether putative structure is there at all, i.e. can we resolve it?
For exploring data, we want answers quickly!
Generally speaking, both accuracy and resolution improve with the amount of data processed. So there are many reasons to improve speed and scale, i.e. algorithm and system performance really matter.
-
Simple Performance Rules
Performance is a small but very important part of the course. You don't need to be a systems specialist to be able to apply these principles:
  Understanding asymptotics: O(n^2) vs. O(n log n)
  Understanding memory: sequential vs. random access, the cache hierarchy
  Having a feel for constants: hashing, string comparison, array access, etc.
Often, the solution is to use an efficient library (e.g. Intel MKL or Atlas for matrix operations, scientific functions, and random numbers).
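As a generic illustration of the asymptotics point (not an example from the lecture), here is a duplicate test written both ways:

import java.util.Arrays;

// Two ways to test an array for duplicates. On 10^6 elements the pairwise
// scan does ~5x10^11 comparisons; sort-then-scan does ~2x10^7 operations.
public class Dups {
    static boolean hasDupQuadratic(int[] a) {       // O(n^2)
        for (int i = 0; i < a.length; i++)
            for (int j = i + 1; j < a.length; j++)
                if (a[i] == a[j]) return true;
        return false;
    }

    static boolean hasDupSorted(int[] a) {          // O(n log n)
        int[] b = a.clone();
        Arrays.sort(b);
        for (int i = 1; i < b.length; i++)
            if (b[i] == b[i - 1]) return true;
        return false;
    }
}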
-
Case Study: Matrix Multiply
In Java: multiply C = A * B, where A is an m x p matrix and B is a p x n matrix. Java has no 2-D indexing A[i,k]; we store matrices as flat column-major arrays and map [i, j] to [i + j*nrows]:

for (int i = 0; i < m; i++) {
    for (int j = 0; j < n; j++) {
        double sum = 0;
        for (int k = 0; k < p; k++)
            sum += A[i + k*m] * B[k + j*p];   // A[i,k] * B[k,j]
        C[i + j*m] = sum;                     // C[i,j]
    }
}
-
Case Study: Matrix Multiply
Performance of Mahout (1000x1000 multiply, Intel i5): 12 secs, 0.16 Gflops.
Matlab performance (i.e. Intel MKL), same machine, same problem: ??
-
Case Study: Matrix Multiply
Performance of Mahout (1000x1000 multiply, Intel i5): 12 secs, 0.16 Gflops.
Matlab performance (i.e. Intel MKL), same machine, same problem: 14 msecs, 156 Gflops, about 1000x faster.
Where does the difference come from?
  Matlab's implementation is multi-threaded: ~4x
  Overhead in array access calculations (getQuick(i,j)): ~2x
  Memory grain: about 10x
  The rest: memory hierarchy (L1, L2, L3 caches), Intel SSE4 instructions, loop unrolling, hand assembly-coding, ...
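(Sanity check on the flop rates: a 1000x1000 multiply takes 2x10^9 floating-point operations, so 12 s gives 2x10^9 / 12 = 0.17 Gflops and 14 ms gives 2x10^9 / 0.014 = 143 Gflops, roughly consistent with the numbers above.)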
-
Case Study: Matrix Multiply
[Diagram: index traversal for C = A * B in the naive loop order; column-major order, memory grain]
-
DRAM
Reading one location really reads a whole row (16 kB for 4 GB DRAM), so sequential access is much faster than random access.
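A crude way to feel this effect from Java (the array size and timing harness are our choices; exact ratios depend on the machine):

import java.util.Random;

// Sum the same array in sequential order, then in a random order that
// defeats the row buffer and prefetcher. On typical hardware the random
// walk is several times slower.
public class MemGrain {
    public static void main(String[] args) {
        int n = 1 << 24;                           // 16M ints (64 MB)
        int[] a = new int[n], perm = new int[n];
        for (int i = 0; i < n; i++) { a[i] = i; perm[i] = i; }
        Random rng = new Random(42);
        for (int i = n - 1; i > 0; i--) {          // Fisher-Yates shuffle
            int j = rng.nextInt(i + 1);
            int t = perm[i]; perm[i] = perm[j]; perm[j] = t;
        }
        long t0 = System.nanoTime(); long seq = 0;
        for (int i = 0; i < n; i++) seq += a[i];
        long t1 = System.nanoTime(); long rnd = 0;
        for (int i = 0; i < n; i++) rnd += a[perm[i]];
        long t2 = System.nanoTime();
        System.out.printf("sequential %.1f ms, random %.1f ms (checksums %d %d)%n",
                (t1 - t0) / 1e6, (t2 - t1) / 1e6, seq, rnd);
    }
}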
-
Fixing Matrix Multiply
[Diagram: the loops reordered so the inner loop runs down columns, matching the column-major memory grain]
-
Case Study: Matrix Multiply
The fix: reorder the loops so the innermost loop runs sequentially down columns. (Assume C is initialized to zero.)

for (int j = 0; j < n; j++)
    for (int k = 0; k < p; k++) {
        double bval = B[k + j*p];              // B[k,j]
        for (int i = 0; i < m; i++)
            C[i + j*m] += A[i + k*m] * bval;   // C[i,j] += A[i,k] * bval
    }

About 10x faster on 1000x1000 matrices (i7 processor). Caveats:
A 100x100 multiply is also much faster with Grandma's algorithm (it fits in cache)!
But you can get similar gains with large dense/sparse operations by paying attention to grain.
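A self-contained harness for the comparison (the sizes, seeding, and crude one-shot timing are our choices; expect JIT warmup noise):

// Side-by-side timing of the two loop orders above, on 1000x1000
// column-major flat arrays. Timings vary by machine and JIT warmup.
public class MatMulGrain {
    static void naive(double[] A, double[] B, double[] C, int m, int n, int p) {
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++) {
                double sum = 0;
                for (int k = 0; k < p; k++) sum += A[i + k * m] * B[k + j * p];
                C[i + j * m] = sum;
            }
    }

    static void grained(double[] A, double[] B, double[] C, int m, int n, int p) {
        java.util.Arrays.fill(C, 0);
        for (int j = 0; j < n; j++)
            for (int k = 0; k < p; k++) {
                double bval = B[k + j * p];
                for (int i = 0; i < m; i++) C[i + j * m] += A[i + k * m] * bval;
            }
    }

    public static void main(String[] args) {
        int m = 1000, n = 1000, p = 1000;
        double[] A = new double[m * p], B = new double[p * n], C = new double[m * n];
        java.util.Random rng = new java.util.Random(0);
        for (int i = 0; i < A.length; i++) A[i] = rng.nextDouble();
        for (int i = 0; i < B.length; i++) B[i] = rng.nextDouble();
        long t0 = System.nanoTime();
        naive(A, B, C, m, n, p);
        long t1 = System.nanoTime();
        grained(A, B, C, m, n, p);
        long t2 = System.nanoTime();
        System.out.printf("naive %.0f ms, grained %.0f ms%n", (t1 - t0) / 1e6, (t2 - t1) / 1e6);
    }
}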
-
Constants
Typical throughput by implementation level. Each row spans several orders of magnitude from naïve Java to GPU (annotated on the slides as ~10^5, 10^4, 10^4, 10^4):

Dense matrix algebra:     Naïve Java (Mahout, GL): 0.1-1 Gflop/s | Linear-memory Java (Weka): 10 Gflop/s | Blocked, native (Intel MKL, 8C): 300 Gflop/s | GPU (4G): 4 Tflop/s
Sparse matrix algebra:    Naïve Java: 0.1-1 Gflop/s | Intel MKL (8C): 5-7 Gflop/s | GPU (4G): 80-160 Gflop/s
Transcendental functions: Naïve Java: 0.2 Gfev/s | Intel MKL (8C): 4 Gfev/s | GPU (4G): 160 Gfev/s
Random numbers:           Naïve Java: 0.09 GRN/s | Intel MKL (8C): 1 GRN/s | GPU (4G): 160 GRN/s
-
Benchmarks
Latent Dirichlet Allocation (1000 dims, 20 million docs)

System        Type           Nodes x cores  Time       Perf
Smola et al.  Gibbs sampler  100 x 8        12 hours   20 Gflops
Powergraph    Gibbs sampler  64 x 16        16 hours?  15 Gflops
BIDMach       VB             1 x 8 x 4G     30 mins    110 Gflops
-
Benchmarks
Alternating Least Squares (Netflix synthetic data) on a single node.
But the basic ALS algorithm can be improved by an order of magnitude by replacing matrix inversion with iterative solving (see the sketch below).

Dimension   20           50           100          1000
Mahout      0.13 Gflops  0.1 Gflops   *            *
Powergraph  0.5 Gflops   1.1 Gflops   1 Gflops     **
BIDMach     5 Gflops     100 Gflops   300 Gflops   500 Gflops
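To make the iterative-solving remark concrete: each ALS update solves a small symmetric positive-definite system, and a few conjugate-gradient steps can stand in for the explicit inverse. A generic CG sketch (the flat column-major storage and iteration cap are illustrative; this is the standard algorithm, not BIDMach's code):

// Conjugate gradient for A x = b, with A symmetric positive-definite,
// stored as a flat column-major n x n array. A few iterations of this can
// replace the explicit matrix inversion inside each ALS update.
public class CG {
    static double[] solve(double[] A, double[] b, int n, int iters) {
        double[] x = new double[n], r = b.clone(), p = b.clone(), Ap = new double[n];
        double rs = dot(r, r);
        for (int it = 0; it < iters; it++) {
            for (int i = 0; i < n; i++) {                 // Ap = A * p
                double s = 0;
                for (int j = 0; j < n; j++) s += A[i + j * n] * p[j];
                Ap[i] = s;
            }
            double alpha = rs / dot(p, Ap);
            for (int i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
            double rsNew = dot(r, r);
            if (Math.sqrt(rsNew) < 1e-10) break;          // converged
            for (int i = 0; i < n; i++) p[i] = r[i] + (rsNew / rs) * p[i];
            rs = rsNew;
        }
        return x;
    }

    static double dot(double[] u, double[] v) {
        double s = 0;
        for (int i = 0; i < u.length; i++) s += u[i] * v[i];
        return s;
    }
}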
-
Why Causal Analysis?
Pervasive in Economics, Public Health, Biology; not so much, so far, in data mining.
A motivating example from Mechanical Turk:
-
Mechanical Turker Demographics
-
Causal Analysis
If we know the Turkers' country of origin, this all makes sense. But if not, we could easily conclude:
  On average, women earn about 10x what men do.
  University education cuts expected income by about 5x.
  The households that men live in have on average 2x as many people in them as the households that women live in.*
Nation of origin is a confounder in all these relationships. Causal analysis allows us to separate its effects from genuine influences of these variables. (A toy numeric illustration follows.)
* Graph data not provided for this
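A toy numeric illustration of how a confounder manufactures these effects (all numbers invented, not Turker data; within each hypothetical country men and women earn exactly the same):

// Hypothetical illustration of confounding: earnings are identical within
// each country, yet the pooled male/female averages differ because country
// of origin (the confounder) drives both the gender mix and the wage level.
public class Confounding {
    public static void main(String[] args) {
        // {countryWage, womenCount, menCount} for two hypothetical countries.
        double[][] groups = { {8.00, 90, 10},    // higher-wage country, mostly women
                              {0.80, 10, 90} };  // lower-wage country, mostly men
        double wSum = 0, wN = 0, mSum = 0, mN = 0;
        for (double[] g : groups) {
            wSum += g[0] * g[1]; wN += g[1];
            mSum += g[0] * g[2]; mN += g[2];
        }
        // Pooled averages suggest women out-earn men ~5x; conditioning on
        // country shows no gender effect at all.
        System.out.printf("women avg %.2f, men avg %.2f%n", wSum / wN, mSum / mN);
    }
}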
-
Why Natural Language?
Weaknesses of bag-of-words and n-gram features:
  No attachment of actors, sentiments, and targets
  No sense of agency: who did what to whom?
  Conflation of voices in a text
  How similar are two texts? Did one derive from the other?
  Generally low-quality as features
Parsing is one approach to addressing these challenges. But high-quality parsing is extremely slow: tens of sentences per second (constituency or PCFG parsing).
-
Natural Language Parsing
A surprisingly good match for GPU computation. High-quality parsing requires around 1 billion PCFG rule evaluations per sentence, which is only practical on CPUs with heavy pruning (99.7%).
Using new techniques we achieve ~1 rule evaluation per GPU instruction, 3 Teraflops (4G), and 1000 sentences per second without any pruning. This is similar to GPU performance on dense matrix multiply, and close to the device's max speed.
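(Consistency check: 1000 sentences/s x ~10^9 rule evaluations per sentence is ~10^12 evaluations/s, which at 3 Tflops works out to about 3 operations per evaluation, in line with the ~1 evaluation per GPU instruction claim.)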
-
Natural Language Alignment
Tracking stories in social media
Tracking influence of scientific ideas and theories
Tracking literary plots and themes
DNA sequence alignment
-
Summary
Behavioral data history
Some example problems
Course components
Why single-machine performance matters
Causal analysis
Natural language