-
CS294-1 Behavioral
Data Mining
Prof: John Canny
GSI: Huasha Zhao
-
A brief history of behavioral data
-
1997
-
PageRank: The web as a behavioral dataset
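The computation behind it is compact; a minimal power-iteration sketch (the damping factor, toy graph, and iteration count are illustrative assumptions, not from the lecture):

import java.util.Arrays;

// Toy PageRank by power iteration. links[i] lists the pages that page i links to.
public class PageRank {
    static double[] rank(int[][] links, double damping, int iters) {
        int n = links.length;
        double[] r = new double[n];
        Arrays.fill(r, 1.0 / n);
        for (int t = 0; t < iters; t++) {
            double[] next = new double[n];
            Arrays.fill(next, (1.0 - damping) / n);          // teleport mass
            for (int i = 0; i < n; i++) {
                if (links[i].length == 0) {                   // dangling page: spread evenly
                    for (int j = 0; j < n; j++) next[j] += damping * r[i] / n;
                } else {
                    for (int j : links[i]) next[j] += damping * r[i] / links[i].length;
                }
            }
            r = next;
        }
        return r;
    }

    public static void main(String[] args) {
        int[][] links = {{1, 2}, {2}, {0}};                   // 0 -> 1,2;  1 -> 2;  2 -> 0
        System.out.println(Arrays.toString(rank(links, 0.85, 50)));
    }
}

At web scale the same iteration runs over tens of billions of pages, hence the server farms on the next slide.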
-
DB size = 50 billion sites
Google server farms
2 million machines (estimated)
-
1998: Sponsored search (Overture)
2002
-
Sponsored search
Google's revenue last year was around $28 bn from marketing, 97% of the company's revenue.
Sponsored search uses an auction: a pure competition among marketers trying to win access to consumers.
In other words, a competition between models of consumers: their likelihood of responding to the ad, and the right bid for the item.
There are around 20 billion search requests a month, and perhaps a trillion events of history across search providers.
Features that matter: source URL, time-of-day, geo, demographics, and user history. Increasingly real-time.
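To make the auction concrete, here is a toy generalized second-price (GSP) ranking, the textbook mechanism for sponsored search; the ad names, bids, and CTR estimates are invented for illustration:

import java.util.Arrays;
import java.util.Comparator;

// Toy GSP auction: ads ranked by score = bid * estimated CTR, and each
// winner pays just enough per click to hold its slot.
public class SponsoredSearch {
    static class Ad {
        final String name; final double bid, ctr;
        Ad(String name, double bid, double ctr) { this.name = name; this.bid = bid; this.ctr = ctr; }
        double score() { return bid * ctr; }
    }

    public static void main(String[] args) {
        Ad[] ads = { new Ad("A", 2.00, 0.05), new Ad("B", 1.50, 0.10), new Ad("C", 3.00, 0.02) };
        Arrays.sort(ads, Comparator.comparingDouble((Ad a) -> a.score()).reversed());
        for (int i = 0; i < ads.length; i++) {
            // Price per click: the next ad's score divided by this ad's CTR (0 for the last slot).
            double price = (i + 1 < ads.length) ? ads[i + 1].score() / ads[i].ctr : 0.0;
            System.out.printf("slot %d: %s pays %.2f/click%n", i + 1, ads[i].name, price);
        }
    }
}

Note that better consumer models (higher estimated CTR) win better slots at lower prices, which is exactly the competition described above.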
-
Bluefin Technologies
Social Media and Sentiment
-
Bluefin Technologies
Microtargeting
-
Recommenders
-
Tag data + Implicit tagging
Based on Luis von Ahn's ESP game
Flickr hosts around 6 billion images
-
John Kelly, Berkman Center, Harvard
Computational Social Science
-
Sentiment Analysis
Alan Mislove et al., Northeastern
http://www.youtube.com/v/ujcrJZRSGkg&hl=en_US
-
Social Media and Crises
Virginia earthquake 2011: Mislove et al. 2011
http://www.youtube.com/watch?v=XJ1EQbmJ_LQ
-
Leskovec, Backstrom, Kleinberg
-
Modeling Spread of Disease from Social Interactions
Adam Sadilek, Henry Kautz, Vincent Silenzio
Epidemiology
-
Behavioral Data Landscape
Dataset sizes are dictated by the number of people: behavioral data is big, but not outrageously so.
[Chart: a log size scale from 10 GB through 1 TB to 100 TB; in range: web graph, portal logs, Twitter archives, Spinn3r archives, human genome]
NOT in this range: FB w/ images, YouTube, MRI data, climate models, the Web.
-
Course Components
Algorithms for mining behavioral data:
  Machine learning (natural language)
  Causal analysis
  Network algorithms
Tools:
  Algorithm level (BIDMach, Mahout)
  Matrix level (BIDMat, Matlab)
  Data movement (Hadoop, Spark)
Techniques:
  GPU programming
  Performance patterns
  Optimization
  Communication-minimizing algorithms
-
Course Contents
Introduction, example problems, open research questions
Basic statistics, bias-variance tradeoffs, Naïve Bayes classifier
Regression and generalized linear models (GLMs)
Causal analysis: Matching, propensity scores
About people - power laws, traits, social network structure
Performance measurement, significance tests, resampling
Map-Reduce: Hadoop, Spark
Machine biology - bandwidth hierarchy, caching, communication
Scientific computing: communication-minimizing algorithms
GPU architecture and programming
Text and metadata indexing and retrieval, XML content-bases
Natural language processing
-
Course Contents
Natural language processing
Classifiers
Excavating - crawling, web services, processing streams
Query languages
Factor Models - Naive Bayes, LDA, GaP
Optimizers: ADAGRAD, LBFGS, MCMC
Clustering - k-means, PAM, generative models
Causal analysis: Targeted Maximum-Likelihood Estimation
Causal graphical models
Prediction - kNN, kd-trees, kNC
Network algorithms - HITS and Pagerank
Network algorithms - diffusion
Visualizing large datasets
-
Instructor: John Canny (Ph.D. MIT 1987); research in computer vision, robotics.
Data mining projects:
MarkLogic co-founder 2003
BigTribe/Overstock 2004-2005
Yahoo 2007-2008
Ebay 2009-2010
Quantcast 2010-2011
About 40% of time in industry since 2003.
Research in behavior change, health care, education, computational social science: using data mining techniques on social media.
-
Desiderata for BDM Toolkits
Desiderata:
  Interactivity (Scala REPL)
  Natural syntax (+, -, *, etc.) and high-level expressivity
  CPU and GPU acceleration, custom GPU kernels
  Easy threading (Actors)
  Java VM + Java codebase
  C accelerators
Hence BIDMat (matrix algebra) and BIDMach (stats + machine learning), written in Scala (on GitHub).
-
Toolkit Roadmap
Algorithm layer:           BIDMach | Weka, MFE, CRAN | Mahout
Matrix layer:              BIDMat | Jama, Matlab, R
CPU and GPU acceleration:  Intel MKL, JCUDA
Cluster mapping:           Hadoop, Spark, Hyracks, DryadLINQ, Pig, Hive, Shark
-
GraphLab (CS colloquium talk today, 4pm)
Algorithm + graph layers:  GraphLab
CPU acceleration:          Intel MKL
Cluster mapping:           Powergraph
-
Innovations
Algorithm layer:           BIDMach, with optimizers (SGD, MCMC)
Matrix layer:              BIDMat
CPU and GPU acceleration:  Intel MKL, JCUDA
Cluster mapping:           communication-minimizing algorithms
-
A conflict
Algorithm layer:           BIDMach, with optimizers (SGD, MCMC)
Matrix layer:              BIDMat
CPU and GPU acceleration:  Intel MKL, JCUDA
Cluster mapping:           communication-minimizing algorithms
But the frequent model updates that optimizers like SGD and MCMC want collide with communication minimization: communication bloat.
-
Resolved
Algorithm layer:           BIDMach, with optimizers (SGD, MCMC)
Matrix layer:              BIDMat
CPU and GPU acceleration:  Intel MKL, JCUDA
Cluster mapping:           communication-minimizing algorithms
Resolution: butterfly mixing.
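A toy single-process simulation of the butterfly communication pattern (node count, scalar "models", and plain averaging are our illustrative assumptions; the real system interleaves optimizer minibatch steps between stages):

// Butterfly exchange over 2^d nodes: in stage s, node i averages its model
// with partner i XOR 2^s. After d stages every node holds the global
// average, and partial mixing is available after every stage, so SGD
// updates can be interleaved between stages.
public class ButterflyMix {
    public static void main(String[] args) {
        int d = 3, n = 1 << d;                      // 8 simulated nodes
        double[] model = new double[n];
        for (int i = 0; i < n; i++) model[i] = i;   // each node's local "model" (one scalar)

        for (int s = 0; s < d; s++) {
            double[] next = new double[n];
            for (int i = 0; i < n; i++) {
                int partner = i ^ (1 << s);         // butterfly partner at this stage
                next[i] = 0.5 * (model[i] + model[partner]);
            }
            model = next;                           // (an SGD minibatch step would go here)
        }
        System.out.println(java.util.Arrays.toString(model)); // all equal to the mean 3.5
    }
}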
-
Course Mechanics
The class wiki is up at:
http://bid.berkeley.edu/cs294-1-spring13/index.php/
There is no textbook for the course, but several recommended texts are up on the wiki.
There will be required and recommended readings posted before each class. Those are the text for the class.
There is a Piazza site for the course, listed under Berkeley as CS294-1. Please sign up at http://piazza.com
-
TA, Office Hours, Section
Teaching Assistants
Huasha Zhao, Par Lab (5th floor)
Office Hours
John Canny: TuTh 4-5, in 637 Soda Hall for now
Sections TBD
-
Prerequisites
A desire to learn a bunch of new topics and the time to do it.
Background in:
Basic math literacy:
Linear algebra, statistics
Programming experience, ideally in Java
Basic understanding of computers and networks
-
Course Enrollment
Waitlist + current enrollment is larger than the course limit.
We will process the waitlist by next Monday.
If you're on the waitlist, please fill out a petition and hand it in to 637 Soda.
-
What is data mining?
-
Data Mining
The process of exploiting large amounts of data? YES.
The art of discovering novel structure and value in data? YES! And this is where innovation happens.
Agile visualization and analysis of raw data
Quick hypothesis testing with sketch models
Ability to test against samples of big datasets
Ability to easily migrate to large scale
Ability to get performance data on the scaled-up model
-
A Few Words on Performance
Accuracy: as scientists, we care most about accuracy: drawing the most accurate conclusions we can from the data.
Resolution: we also want to be able to tell whether putative structure is there at all, i.e. can we resolve it?
For exploring data, we want answers quickly!
Generally speaking, both accuracy and resolution improve with the amount of data processed. So there are many reasons to improve speed and scale, i.e. algorithm and system performance really matter.
-
Simple Performance Rules
Performance is a small but very important part of the course. You don't need to be a systems specialist to be able to apply these principles:
  Understanding asymptotics: O(n^2) vs. O(n log n)
  Understanding memory: sequential vs. random access, the cache hierarchy
  Having a feel for constants: hashing, string comparison, array access, etc.
Often, the solution is to use an efficient library (e.g. Intel MKL or Atlas for matrix operations, scientific functions, and random numbers).
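As a generic illustration of the asymptotics point (not an example from the lecture), here is a duplicate test written both ways:

import java.util.Arrays;

// Two ways to test an array for duplicates. On 10^6 elements the pairwise
// scan does ~5x10^11 comparisons; sort-then-scan does ~2x10^7 operations.
public class Dups {
    static boolean hasDupQuadratic(int[] a) {       // O(n^2)
        for (int i = 0; i < a.length; i++)
            for (int j = i + 1; j < a.length; j++)
                if (a[i] == a[j]) return true;
        return false;
    }

    static boolean hasDupSorted(int[] a) {          // O(n log n)
        int[] b = a.clone();
        Arrays.sort(b);
        for (int i = 1; i < b.length; i++)
            if (b[i] == b[i - 1]) return true;
        return false;
    }
}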
-
Case Study: Matrix Multiply
In Java: multiply C = A * B, where A is an m x p matrix and B is a p x n matrix. Java has no 2-D indexing A[i,k]; we store matrices as flat column-major arrays and map [i, j] to [i + j*nrows]:

for (int i = 0; i < m; i++) {
    for (int j = 0; j < n; j++) {
        double sum = 0;
        for (int k = 0; k < p; k++)
            sum += A[i + k*m] * B[k + j*p];   // A[i,k] * B[k,j]
        C[i + j*m] = sum;                     // C[i,j]
    }
}
-
Case Study: Matrix Multiply
Performance of Mahout (1000x1000 multiply, Intel i5): 12 secs, 0.16 Gflops.
Matlab performance (i.e. Intel MKL), same machine, same problem: ??
-
Case Study: Matrix Multiply
Performance of Mahout (1000x1000 multiply, Intel i5): 12 secs, 0.16 Gflops.
Matlab performance (i.e. Intel MKL), same machine, same problem: 14 msecs, 156 Gflops, about 1000x faster.
Where does the difference come from?
  Matlab's implementation is multi-threaded: ~4x
  Overhead in array access calculations (getQuick(i,j)): ~2x
  Memory grain: about 10x
  The rest: memory hierarchy (L1, L2, L3 caches), Intel SSE4 instructions, loop unrolling, hand assembly-coding, ...
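(Sanity check on the flop rates: a 1000x1000 multiply takes 2x10^9 floating-point operations, so 12 s gives 2x10^9 / 12 = 0.17 Gflops and 14 ms gives 2x10^9 / 0.014 = 143 Gflops, roughly consistent with the numbers above.)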
-
Case Study: Matrix Multiply
[Diagram: index traversal for C = A * B in the naive loop order; column-major order, memory grain]
-
DRAM
Reading one location really reads a whole row (16 kB for 4 GB DRAM), so sequential access is much faster than random access.
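A crude way to feel this effect from Java (the array size and timing harness are our choices; exact ratios depend on the machine):

import java.util.Random;

// Sum the same array in sequential order, then in a random order that
// defeats the row buffer and prefetcher. On typical hardware the random
// walk is several times slower.
public class MemGrain {
    public static void main(String[] args) {
        int n = 1 << 24;                           // 16M ints (64 MB)
        int[] a = new int[n], perm = new int[n];
        for (int i = 0; i < n; i++) { a[i] = i; perm[i] = i; }
        Random rng = new Random(42);
        for (int i = n - 1; i > 0; i--) {          // Fisher-Yates shuffle
            int j = rng.nextInt(i + 1);
            int t = perm[i]; perm[i] = perm[j]; perm[j] = t;
        }
        long t0 = System.nanoTime(); long seq = 0;
        for (int i = 0; i < n; i++) seq += a[i];
        long t1 = System.nanoTime(); long rnd = 0;
        for (int i = 0; i < n; i++) rnd += a[perm[i]];
        long t2 = System.nanoTime();
        System.out.printf("sequential %.1f ms, random %.1f ms (checksums %d %d)%n",
                (t1 - t0) / 1e6, (t2 - t1) / 1e6, seq, rnd);
    }
}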
-
Fixing Matrix Multiply
[Diagram: the loops reordered so the inner loop runs down columns, matching the column-major memory grain]
-
Case Study: Matrix Multiply
The fix: reorder the loops so the innermost loop runs sequentially down columns. (Assume C is initialized to zero.)

for (int j = 0; j < n; j++)
    for (int k = 0; k < p; k++) {
        double bval = B[k + j*p];              // B[k,j]
        for (int i = 0; i < m; i++)
            C[i + j*m] += A[i + k*m] * bval;   // C[i,j] += A[i,k] * bval
    }

About 10x faster on 1000x1000 matrices (i7 processor). Caveats:
A 100x100 multiply is also much faster with Grandma's algorithm (it fits in cache)!
But you can get similar gains with large dense/sparse operations by paying attention to grain.
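A self-contained harness for the comparison (the sizes, seeding, and crude one-shot timing are our choices; expect JIT warmup noise):

// Side-by-side timing of the two loop orders above, on 1000x1000
// column-major flat arrays. Timings vary by machine and JIT warmup.
public class MatMulGrain {
    static void naive(double[] A, double[] B, double[] C, int m, int n, int p) {
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++) {
                double sum = 0;
                for (int k = 0; k < p; k++) sum += A[i + k * m] * B[k + j * p];
                C[i + j * m] = sum;
            }
    }

    static void grained(double[] A, double[] B, double[] C, int m, int n, int p) {
        java.util.Arrays.fill(C, 0);
        for (int j = 0; j < n; j++)
            for (int k = 0; k < p; k++) {
                double bval = B[k + j * p];
                for (int i = 0; i < m; i++) C[i + j * m] += A[i + k * m] * bval;
            }
    }

    public static void main(String[] args) {
        int m = 1000, n = 1000, p = 1000;
        double[] A = new double[m * p], B = new double[p * n], C = new double[m * n];
        java.util.Random rng = new java.util.Random(0);
        for (int i = 0; i < A.length; i++) A[i] = rng.nextDouble();
        for (int i = 0; i < B.length; i++) B[i] = rng.nextDouble();
        long t0 = System.nanoTime();
        naive(A, B, C, m, n, p);
        long t1 = System.nanoTime();
        grained(A, B, C, m, n, p);
        long t2 = System.nanoTime();
        System.out.printf("naive %.0f ms, grained %.0f ms%n", (t1 - t0) / 1e6, (t2 - t1) / 1e6);
    }
}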
-
Constants
Typical throughput by implementation level. Each row spans several orders of magnitude from naïve Java to GPU (annotated on the slides as ~10^5, 10^4, 10^4, 10^4):

Dense matrix algebra:     Naïve Java (Mahout, GL): 0.1-1 Gflop/s | Linear-memory Java (Weka): 10 Gflop/s | Blocked, native (Intel MKL, 8C): 300 Gflop/s | GPU (4G): 4 Tflop/s
Sparse matrix algebra:    Naïve Java: 0.1-1 Gflop/s | Intel MKL (8C): 5-7 Gflop/s | GPU (4G): 80-160 Gflop/s
Transcendental functions: Naïve Java: 0.2 Gfev/s | Intel MKL (8C): 4 Gfev/s | GPU (4G): 160 Gfev/s
Random numbers:           Naïve Java: 0.09 GRN/s | Intel MKL (8C): 1 GRN/s | GPU (4G): 160 GRN/s
-
Benchmarks
Latent Dirichlet Allocation (1000 dims, 20 million docs)

System        Type           Nodes x cores  Time       Perf
Smola et al.  Gibbs sampler  100 x 8        12 hours   20 Gflops
Powergraph    Gibbs sampler  64 x 16        16 hours?  15 Gflops
BIDMach       VB             1 x 8 x 4G     30 mins    110 Gflops
-
Benchmarks
Alternating Least Squares (Netflix synthetic data) on a single node.
But the basic ALS algorithm can be improved by an order of magnitude by replacing matrix inversion with iterative solving (see the sketch below).

Dimension   20           50           100          1000
Mahout      0.13 Gflops  0.1 Gflops   *            *
Powergraph  0.5 Gflops   1.1 Gflops   1 Gflops     **
BIDMach     5 Gflops     100 Gflops   300 Gflops   500 Gflops
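To make the iterative-solving remark concrete: each ALS update solves a small symmetric positive-definite system, and a few conjugate-gradient steps can stand in for the explicit inverse. A generic CG sketch (the flat column-major storage and iteration cap are illustrative; this is the standard algorithm, not BIDMach's code):

// Conjugate gradient for A x = b, with A symmetric positive-definite,
// stored as a flat column-major n x n array. A few iterations of this can
// replace the explicit matrix inversion inside each ALS update.
public class CG {
    static double[] solve(double[] A, double[] b, int n, int iters) {
        double[] x = new double[n], r = b.clone(), p = b.clone(), Ap = new double[n];
        double rs = dot(r, r);
        for (int it = 0; it < iters; it++) {
            for (int i = 0; i < n; i++) {                 // Ap = A * p
                double s = 0;
                for (int j = 0; j < n; j++) s += A[i + j * n] * p[j];
                Ap[i] = s;
            }
            double alpha = rs / dot(p, Ap);
            for (int i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
            double rsNew = dot(r, r);
            if (Math.sqrt(rsNew) < 1e-10) break;          // converged
            for (int i = 0; i < n; i++) p[i] = r[i] + (rsNew / rs) * p[i];
            rs = rsNew;
        }
        return x;
    }

    static double dot(double[] u, double[] v) {
        double s = 0;
        for (int i = 0; i < u.length; i++) s += u[i] * v[i];
        return s;
    }
}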
-
Why Causal Analysis?
Pervasive in Economics, Public Health, Biology; not so much, so far, in data mining.
A motivating example from Mechanical Turk:
-
Mechanical Turker Demographics
-
Causal Analysis
If we know the Turkers' country of origin, this all makes sense. But if not, we could easily conclude:
  On average, women earn about 10x what men do.
  University education cuts expected income by about 5x.
  The households that men live in have on average 2x as many people in them as the households that women live in.*
Nation of origin is a confounder in all these relationships. Causal analysis allows us to separate its effects from genuine influences of these variables. (A toy numeric illustration follows.)
* Graph data not provided for this
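A toy numeric illustration of how a confounder manufactures these effects (all numbers invented, not Turker data; within each hypothetical country men and women earn exactly the same):

// Hypothetical illustration of confounding: earnings are identical within
// each country, yet the pooled male/female averages differ because country
// of origin (the confounder) drives both the gender mix and the wage level.
public class Confounding {
    public static void main(String[] args) {
        // {countryWage, womenCount, menCount} for two hypothetical countries.
        double[][] groups = { {8.00, 90, 10},    // higher-wage country, mostly women
                              {0.80, 10, 90} };  // lower-wage country, mostly men
        double wSum = 0, wN = 0, mSum = 0, mN = 0;
        for (double[] g : groups) {
            wSum += g[0] * g[1]; wN += g[1];
            mSum += g[0] * g[2]; mN += g[2];
        }
        // Pooled averages suggest women out-earn men ~5x; conditioning on
        // country shows no gender effect at all.
        System.out.printf("women avg %.2f, men avg %.2f%n", wSum / wN, mSum / mN);
    }
}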
-
Why Natural Language?
Weaknesses of bag-of-words and n-gram features:
  No attachment of actors, sentiments, and targets
  No sense of agency: who did what to whom?
  Conflation of voices in a text
  How similar are two texts? Did one derive from the other?
  Generally low-quality as features
Parsing is one approach to addressing these challenges. But high-quality parsing is extremely slow: tens of sentences per second (constituency or PCFG parsing).
-
Natural Language Parsing
A surprisingly good match for GPU computation. High-quality parsing requires around 1 billion PCFG rule evaluations per sentence, which is only practical on CPUs with heavy pruning (99.7%).
Using new techniques we achieve ~1 rule evaluation per GPU instruction, 3 Teraflops (4G), and 1000 sentences per second without any pruning. This is similar to GPU performance on dense matrix multiply, and close to the device's max speed.
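(Consistency check: 1000 sentences/s x ~10^9 rule evaluations per sentence is ~10^12 evaluations/s, which at 3 Tflops works out to about 3 operations per evaluation, in line with the ~1 evaluation per GPU instruction claim.)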
-
Natural Language Alignment
Tracking stories in social media
Tracking influence of scientific ideas and theories
Tracking literary plots and themes
DNA sequence alignment
-
Summary
Behavioral data history
Some example problems
Course components
Why single-machine performance matters
Causal analysis
Natural language