Introduction to Microsoft Azure Machine Learning
![Page 1: Introduction to Microsoft Azure Machine Learning… · PPT file · Web view · 2018-04-09 · Microsoft Azure Machine Learning. ... drag-and-drop modules ... For every example, we](https://reader031.vdocuments.site/reader031/viewer/2022012305/5addadee7f8b9ae1408d4bbc/html5/thumbnails/1.jpg)
Girish Nathan, Misha Bilenko
Microsoft Azure Machine Learning
How to Work with Large Datasets to Build Predictive Models
Agenda
1. How to Work with Large Datasets
• Sample Dataset: NYC Taxi
• HDInsight (Hadoop on Azure)
• IPython notebook and HDInsight
2. Building Predictive Models
• Azure ML Studio
• Learning with Counts
3. Putting it all together: Learning with Counts and HDInsight
Sample Data: NYC Taxi
• One-year log of NYC taxi rides
• 60 GB, publicly available at http://www.andresmh.com/nyctaxitrips/
• Trip data (driver ID, times, locations) and fare data (fare, tip, tolls)
• Rest of tutorial: data wrangling and tip prediction
• Tools: AzCopy, HDInsight, IPython, Azure ML Studio
• 100% Apache Hadoop as an Azure service
• Can be deployed on Windows or Linux
• Provides MapReduce capability over big data in Azure blobs
• Head node: job and cluster monitoring
• Hive: SQL-like queries as an alternative to writing code

    SELECT Col1, COUNT(*) AS Count_Col1
    FROM Your_Table
    GROUP BY Col1
    ORDER BY Count_Col1 DESC
    LIMIT 10;

HDInsight: Hadoop on Azure
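To make the Hive query on this slide concrete, here is a minimal in-memory Python sketch of the same aggregation (group by a column, count, sort descending, keep the top rows). The sample rows and their values are hypothetical; on the real dataset the query would run on the cluster rather than in local memory.

```python
from collections import Counter

def top_values(rows, column, limit=10):
    """Mimic SELECT col, COUNT(*) ... GROUP BY col ORDER BY count DESC LIMIT n."""
    counts = Counter(row[column] for row in rows)
    return counts.most_common(limit)

# Hypothetical sample rows standing in for Your_Table.
rows = [
    {"Col1": "CMT"}, {"Col1": "VTS"}, {"Col1": "CMT"},
    {"Col1": "CMT"}, {"Col1": "VTS"}, {"Col1": "DDS"},
]
result = top_values(rows, "Col1", limit=2)
print(result)
```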
• Web-based Python REPL environment
• Combines authoring, execution, and visualization
• Can author and execute HDInsight Hive queries
• Sample query (Python 2 code snippet, methods of a client class)

    import json
    import urllib2

    def submit_hive_query(self):
        # POST the prepared Hive parameters to the cluster endpoint
        response = urllib2.urlopen(self.url, self.hiveParams)
        data = json.load(response)
        self.hiveJobID = data['id']  # job id returned by the service

    def query(self, queryString):
        self.submit_hive_query()

Example query string: SELECT * FROM sample_table LIMIT 10;
IPython Notebook
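For readers on Python 3, the submission step might be sketched as below. This is an illustration under assumptions, not the speakers' code: it assumes the cluster exposes the WebHCat (Templeton) Hive endpoint at `/templeton/v1/hive` with `user.name`, `execute`, and `statusdir` parameters; the cluster URL, user name, and status directory are placeholders; authentication is omitted. The request is only built, never sent, and the job id is parsed from a canned response.

```python
import json
import urllib.parse
import urllib.request

def build_hive_request(cluster_url, user, query, status_dir="hive_status"):
    """Build (but do not send) a POST request that submits a Hive query."""
    params = urllib.parse.urlencode({
        "user.name": user,        # assumed WebHCat parameter names
        "execute": query,
        "statusdir": status_dir,
    }).encode("utf-8")
    return urllib.request.Request(
        cluster_url + "/templeton/v1/hive", data=params, method="POST"
    )

def parse_job_id(response_body):
    """Extract the job id from the JSON body returned on submission."""
    return json.loads(response_body)["id"]

# Placeholder cluster URL and user; replace with real values before sending.
request = build_hive_request(
    "https://CLUSTERNAME.azurehdinsight.net",
    "admin",
    "SELECT * FROM sample_table LIMIT 10;",
)
job_id = parse_job_id('{"id": "job_0001"}')  # canned response for illustration
```

Sending the request would be a call to `urllib.request.urlopen(request)` with suitable credentials attached.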
• Fully managed cloud service
• Browser-based authoring of dataflow
• Best-in-class machine learning algorithms
• Support for R/Python/SQL
• Collaborative data science
• Quickly deploy models as web services/REST APIs
• Publish to a gallery for collaboration with the community
What is Azure ML Studio?
Learning with Counts, a.k.a. Dracula
(Distributed Robust Algorithm for CoUnt-based LeArning)
Misha Bilenko
Microsoft Azure Machine Learning, Microsoft Research
adid = 1010054353
adText = K2 ski sale!
adURL = www.k2.com/sale
userid = 0xb49129827048dd9b
IP = 131.107.65.14
query = powder skis
qCategories = {skiing, outdoor gear}
#users ~ 10^9, #queries ~ 10^9+, #ads ~ 10^7, #(ad × query) ~ 10^10+
• Information retrieval
• Advertising, recommending, search: item, page/query, user
• Transaction classification
• Payment fraud: transaction, product, user
• Email spam: message, sender, recipient
• Intrusion detection: session, system, user
• IoT: device, location
Large-scale learning in multi-entity domains
adid: 1010054353
adText: Fall ski sale!
adURL: www.k2.com/sale
userid: 0xb49129827048dd9b
IP: 131.107.65.14
query: powder skis
qCategories: {skiing, outdoor gear}
• Problem: representing high-cardinality attributes as features
• Scalable: to billions of attribute values
• Efficient: predictions/sec
• Flexible: for a variety of downstream learners
• Adaptive: to distribution change
• Standard approaches: binary features, hashing, projections
• What everyone uses in industry: learning with counts
• This talk: formalization and generalization
Large-scale learning in multi-entity domains
• Features are transforms of conditional statistics (per-label counts): each attribute value maps to [N+, N-, log(N+) - log(N-), IsBackoff]
• log(N+) - log(N-) = log(N+/N-): log-odds / Naïve Bayes estimate
• N+, N-: indicators of confidence of the naïve estimate
• IsBackoff (IsFromRest): indicator of back-off vs. "real count"
[Figure: count lookups for the example values IP 131.107.65.14, URL k2.com, query "powder skis", and the pair (powder skis, k2.com)]
IP              N+       N-
173.194.33.9    46964    993424
87.250.251.11   31       843
131.107.65.14   12       430
…               …        …
REST            745623   13964931
Learning with Counts
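The feature lookup on this slide can be sketched in a few lines of Python. This is an illustrative sketch, not the DRACULA implementation: it assumes the first number in each table row is N+ and the second is N-, that all counts are nonzero (so no smoothing is applied), and that any value missing from the table backs off to the REST row.

```python
import math

# Per-IP (N+, N-) counts copied from the slide's table.
# Assumption: the first column is N+, the second is N-.
IP_COUNTS = {
    "173.194.33.9": (46964, 993424),
    "87.250.251.11": (31, 843),
    "131.107.65.14": (12, 430),
    "REST": (745623, 13964931),  # back-off bin for all other values
}

def count_features(table, value):
    """Return [N+, N-, log(N+) - log(N-), IsBackoff] for one attribute value."""
    is_backoff = value not in table
    n_pos, n_neg = table["REST"] if is_backoff else table[value]
    # Assumes both counts are positive; real data would need smoothing.
    log_odds = math.log(n_pos) - math.log(n_neg)
    return [n_pos, n_neg, log_odds, int(is_backoff)]

seen = count_features(IP_COUNTS, "131.107.65.14")
unseen = count_features(IP_COUNTS, "10.0.0.1")  # hypothetical unseen IP
print(seen)
print(unseen)
```

However many distinct IPs exist, each example contributes only four numbers per attribute, which is the low dimensionality the talk refers to.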
• Features are transforms of conditional counts: [N+, N-, log(N+) - log(N-), IsBackoff]
• Scalable: "head" in memory + tail in back-off; or: count-min sketch
• Efficient: low cost, low dimensionality
• Flexible: low dimensionality works well with non-linear learners
• Adaptive: new values easily added, back-off for infrequent values, temporal counts
Learning with Counts
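The count-min sketch mentioned as an alternative to keeping the "head" in memory can be illustrated as follows. This is a generic textbook-style sketch, not the variant used in the talk; the width, depth, and SHA-256-based hashing are arbitrary choices made for the example. Estimates never undercount, but they may overcount when keys collide.

```python
import hashlib

class CountMinSketch:
    """Fixed-size approximate counter: estimates are upper bounds on true counts."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        # One hash per row, derived by salting the key with the row number.
        digest = hashlib.sha256(f"{row}:{key}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, amount=1):
        for row in range(self.depth):
            self.rows[row][self._index(row, key)] += amount

    def estimate(self, key):
        # The minimum across rows is the least-contaminated counter.
        return min(self.rows[row][self._index(row, key)]
                   for row in range(self.depth))

# One sketch per label gives approximate N+ and N- for any value.
positives = CountMinSketch()
negatives = CountMinSketch()
positives.add("131.107.65.14", 12)
negatives.add("131.107.65.14", 430)
n_pos = positives.estimate("131.107.65.14")
n_neg = negatives.estimate("131.107.65.14")
```

Memory use is fixed at width × depth counters per label, independent of the number of distinct attribute values.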
Aggregate counts for different attributes and bin functions
• Standard MapReduce
• Bin function: any projection
• Back-off options: "tail bin", hashing, hierarchical (shrinkage)
IP              N+       N-
173.194.33.9    46964    993424
87.250.251.11   31       843
131.253.13.32   12       430
…               …        …
REST            745623   13964931

query           N+        N-
facebook        281912    7957321
dozen roses     32791     640964
…               …         …
REST            6321789   43477252

Query × AdId        N+        N-
facebook, ad1       54546     978964
facebook, ad2       232343    8431467
dozen roses, ad3    12973     430982
…                   …         …
REST                4419312   52754683

IP[2]           N+       N-
173.194.*.*     46964    993424
87.250.*.*      6341     91356
131.253.*.*     75126    430826
…               …        …

[Figure: counting proceeds along a time axis up to T_now]
Learning with Counts: aggregation
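A single-machine stand-in for the aggregation step might look like the sketch below. It is only an illustration: the examples are hypothetical, `ip_prefix` is one possible bin function (matching the IP[2] table), and the "tail bin" back-off is implemented by folding any bin with fewer than `min_total` observations into the REST row. On a cluster, the same per-bin counting would be expressed as a MapReduce job.

```python
from collections import defaultdict

def ip_prefix(ip, octets=2):
    """Bin function: keep the first `octets` octets of an IPv4 address."""
    parts = ip.split(".")
    return ".".join(parts[:octets] + ["*"] * (4 - octets))

def aggregate_counts(examples, bin_fn, min_total=2):
    """Count [N+, N-] per bin, folding rare bins into a REST tail bin."""
    counts = defaultdict(lambda: [0, 0])
    for value, label in examples:
        counts[bin_fn(value)][0 if label else 1] += 1
    table = {"REST": [0, 0]}
    for key, (n_pos, n_neg) in counts.items():
        # Bins with too few observations back off to the shared REST row.
        target = key if n_pos + n_neg >= min_total else "REST"
        row = table.setdefault(target, [0, 0])
        row[0] += n_pos
        row[1] += n_neg
    return table

# Hypothetical (value, label) examples.
examples = [
    ("173.194.33.9", True), ("173.194.33.9", False),
    ("173.194.40.1", False), ("87.250.251.11", False),
]
by_ip = aggregate_counts(examples, lambda ip: ip)
by_prefix = aggregate_counts(examples, ip_prefix)
print(by_ip)
print(by_prefix)
```

Coarser bins such as the two-octet prefix collect more observations per row, so fewer values end up in the tail bin.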
[Figure: counting up to T_now produces aggregated features (N+, N-, ln N+ - ln N-, IsBackoff, …), which are combined with the original numeric features to train the predictor]
Train non-linear model on count-based features
• Counts, transforms, lookup properties
• Additional features can be injected
Learning with Counts: combiner training
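As a rough sketch of how a training row for the combiner could be assembled, the code below looks up each count table, appends the four count features per table, and then injects the original numeric features. The counts come from the tables in this talk (assuming the first column is N+ and the second is N-); the `position` field and the example record are hypothetical, and the learner itself is left out.

```python
import math

# Count tables keyed by the attributes they are built on.
TABLES = {
    ("query",): {
        ("facebook",): (281912, 7957321),
        ("dozen roses",): (32791, 640964),
        "REST": (6321789, 43477252),
    },
    ("query", "adid"): {
        ("facebook", "ad1"): (54546, 978964),
        "REST": (4419312, 52754683),
    },
}

def featurize(example, tables, numeric_fields):
    """Concatenate count features from every table with numeric features."""
    row = []
    for attributes, table in tables.items():
        key = tuple(example[name] for name in attributes)
        is_backoff = key not in table
        n_pos, n_neg = table["REST"] if is_backoff else table[key]
        row += [n_pos, n_neg, math.log(n_pos) - math.log(n_neg), int(is_backoff)]
    # Additional (original numeric) features are simply appended.
    row += [example[name] for name in numeric_fields]
    return row

# Hypothetical example; "position" stands in for an original numeric feature.
example = {"query": "dozen roses", "adid": "ad1", "position": 2}
row = featurize(example, TABLES, ["position"])
print(row)
```

Rows like this one would then be passed to whatever non-linear learner is chosen as the combiner.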
URL × Country   N+        N-
url1, US        54546     978964
url2, CA        232343    8431467
url3, FR        12973     430982
…               …         …
REST            4419312   52754683
• Counts are updated continuously
• Combiner re-training is infrequent
[Figure: on a time axis from T_train to T_now, counting keeps updating the aggregated features (N+, N-, ln N+ - ln N-, IsBackoff, …), which are combined with the original numeric features]
Prediction with counts
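The split between continuously updated counts and an infrequently retrained combiner can be illustrated with the toy sketch below. Everything in it is an assumption made for illustration: the combiner is a fixed logistic function standing in for a model trained at T_train, the add-one smoothing is not from the talk, and the key `url1` is a placeholder.

```python
import math
from collections import defaultdict

class CountTable:
    """Counts that keep updating after the combiner was trained."""

    def __init__(self):
        self.counts = defaultdict(lambda: [0, 0])

    def update(self, key, label):
        self.counts[key][0 if label else 1] += 1

    def log_odds(self, key):
        n_pos, n_neg = self.counts.get(key, (0, 0))
        # Add-one smoothing is an assumption to avoid log(0).
        return math.log(n_pos + 1) - math.log(n_neg + 1)

def combiner(log_odds, weight=1.0, bias=0.0):
    """Stand-in for the predictor frozen at T_train (not retrained here)."""
    return 1.0 / (1.0 + math.exp(-(weight * log_odds + bias)))

table = CountTable()
before = combiner(table.log_odds("url1"))
for _ in range(3):
    table.update("url1", True)   # new positive observations arrive
after = combiner(table.log_odds("url1"))
print(before, after)
```

The prediction for `url1` changes as new observations arrive even though the combiner's parameters stay fixed.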
• State-of-the-art accuracy
• Good fit for MapReduce
• Modular (vs. monolithic)
• Learner can be tuned/monitored/replaced in isolation
• Monitorable, debuggable (this is HUGE in practice!)
• Temporal changes easy to monitor
• Easy emergency recovery (remove bot attacks, etc.)
• Decomposable predictions
• Error debugging (which feature can we blame…)
What is great about Learning with Counts?
Learning with Counts in Azure ML
• HDInsight: large data storage and map-reduce processing
• Azure ML: cloud ML and analytics accessible anywhere
• Learning with Counts: intuitive, flexible large-scale ML solution
Putting it all together
Thanks for your time
Useful links:
http://azure.microsoft.com/ml - Sign up for your free Azure ML trial
http://bit.ly/datasc_ebook - Free tutorial on how to use Azure ML
Need Azure ML for classroom teaching? Contact the speakers
Other questions? Contact the speakers
Speakers:
Misha Bilenko: [email protected]
Girish Nathan: [email protected]