1
CS525: Big Data Analytics
Machine Learning on Hadoop
Fall 2013
Elke A. Rundensteiner
2
Analytics?
• Machine learning, data mining & statistics tools
• Analyze/mine/summarize large datasets
• Extract knowledge from past or streaming data
• Predict trends in future data
ML Today
• Internet search clustering
• Social network analysis
• Taxonomy transformations
• Market analytics
• Recommendation systems
• Log analysis & event filtering
• SPAM filtering
• Fraud detection
4
Tools & Algorithms
• Collaborative Filtering
• Clustering Techniques
• Classification Algorithms
• Association Rules
• Frequent Pattern Mining
• Statistical libraries (Regression, SVM, …)
• Others…
5
Common Use Cases
6
Make It Industry Strength: Big Data
-- Efficient in analyzing/mining data
-- Do not scale
-- Efficient in managing big data
-- Does not analyze or mine data
How to integrate these two worlds?
8
Some Projects
• Apache Mahout
  • Open-source package on Hadoop for data mining and machine learning
• Revolution R (R-Hadoop or Radoop)
  • Extensions to the R package to run on Hadoop
9
Apache Mahout
10
Apache Mahout
• Apache Software Foundation project
• Create scalable machine learning libraries
• Why?
• Many open-source ML libraries either:
  • Lack community
  • Lack documentation
  • Lack scalability
  • Or are research-oriented only
Support Machine Learning
12
But Must Scale & Perform
• Be as fast as possible
• Scale to as much data as possible
13
But Must Scale & Perform
• Be as fast as possible given the intrinsic algorithm!
• What is expressible as map-reduce jobs?
• Work in progress...
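To make "expressible as map-reduce jobs" concrete, here is a minimal sketch, not Mahout code: an average rating per item, written as a per-record map step and a per-key reduce step. The names `map_rating`, `reduce_ratings` and `run_map_reduce` are hypothetical, and the in-memory grouping only stands in for the shuffle that Hadoop would perform across machines.

```python
from collections import defaultdict

def map_rating(record):
    """Map step: emit (item, (rating, 1)) for one input record."""
    user, item, rating = record
    yield item, (rating, 1)

def reduce_ratings(item, values):
    """Reduce step: combine all (rating, count) pairs for one item."""
    total = sum(v[0] for v in values)
    count = sum(v[1] for v in values)
    return item, total / count

def run_map_reduce(records, mapper, reducer):
    """In-memory stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())

# Hypothetical (user, item, rating) records.
ratings = [("u1", "a", 4.0), ("u2", "a", 2.0), ("u1", "b", 5.0)]
average_rating = run_map_reduce(ratings, map_rating, reduce_ratings)
```

An algorithm fits this model when each record can be mapped independently and each key can be reduced independently. Iterative algorithms typically need one such job per iteration.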
14
C1: Collaborative Filtering
15
C2: Clustering
• Group similar objects together
• K-Means, Fuzzy K-Means, Density-Based, …
• Different distance measures
  • Manhattan, Euclidean, …
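As an illustration of K-Means with a pluggable distance measure, here is a small single-machine sketch. It is not Mahout's implementation. The function names and sample points are assumptions, and the starting centroids are passed in explicitly so the run is deterministic.

```python
def euclidean(p, q):
    """Straight-line distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def k_means(points, centroids, distance, iterations=10):
    """Assign each point to its nearest centroid, then move each
    centroid to the mean of its members, until nothing changes."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: distance(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical sample data: two well-separated groups.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
start = [(0.0, 0.0), (10.0, 10.0)]
centers_euclidean, _ = k_means(points, start, euclidean)
centers_manhattan, _ = k_means(points, start, manhattan)
```

Only the assignment step depends on the chosen measure. The update step always takes the mean, which is the natural centre for Euclidean distance and only an approximation for others.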
16
C3: Classification
17
FPM: Frequent Pattern Mining
• Find the frequent itemsets
  • <milk, bread, cheese> are sold frequently together
• Very common in market analysis, access pattern analysis, etc.
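The idea of a frequent itemset can be sketched with brute-force support counting. This is deliberately the simplest approach, not a scalable algorithm such as FP-Growth. The function name, the `min_support` and `max_size` parameters, and the sample baskets are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=3):
    """Count every itemset of up to max_size items and keep those
    that appear in at least min_support transactions."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # canonical order, no duplicates
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}

# Hypothetical market baskets.
baskets = [
    ["milk", "bread", "cheese"],
    ["milk", "bread", "cheese", "eggs"],
    ["milk", "bread"],
    ["eggs", "juice"],
]
frequent = frequent_itemsets(baskets, min_support=2)
```

Here `frequent` maps each itemset (as a sorted tuple) to the number of baskets containing it. Enumerating all combinations grows quickly with basket size, which is why this approach does not scale to large datasets.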
18
Matrices and Statistics
• Math libraries
  • Vectors, matrices, etc.
• Noise reduction
• Similarity Functions
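As one example of a similarity function over vectors, here is a sketch of cosine similarity in plain Python. The slides do not name a specific function, so choosing cosine is an assumption, as are the helper names and the convention of returning 0.0 for a zero vector.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Cosine of the angle between u and v, or 0.0 if either is all zeros."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot(u, v) / (norm_u * norm_v)

# Hypothetical vectors: b points the same way as a, c is orthogonal to d.
a, b = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
c, d = [1.0, 0.0], [0.0, 1.0]
same_direction = cosine_similarity(a, b)
orthogonal = cosine_similarity(c, d)
```

Cosine similarity ignores vector length and compares only direction, so it can be used to compare rating or term-count vectors of very different magnitudes.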