Using Hadoop for Big Data
Hadoop for (Young) Data Scientist
Komes Chandavimol and Team
Data Science Lab, Thailand
Agenda
• Big Data, Analytics and Data Science
• Hadoop + Spark Workshops
• Sharing Experience: Hadoop (Real) Use Cases
• Hadoop + Spark Trends
Big Data, Analytics and Data Science
Big Data
http://www.adweek.com/prnewser/how-many-times-do-the-worlds-social-media-users-click-every-minute/117427
https://www.domo.com/learn/data-never-sleeps-3-0
Internet of Things
http://topmanagement.com.mx/innovacion-social-y-empresarial-objetivo-de-hitachi/
http://www.adweek.com/prnewser/how-many-times-do-the-worlds-social-media-users-click-every-minute/117427
https://www.domo.com/learn/data-never-sleeps-3-0
The Growth of Data
http://www.adweek.com/prnewser/how-many-times-do-the-worlds-social-media-users-click-every-minute/117427
https://www.domo.com/learn/data-never-sleeps-3-0
What is Big Data?
http://blogs.forrester.com/category/hadoop
http://solutions.forrester.com/Global/FileLib/webinars/Big_Data_-_Gold_Rush_or_Illusion.pdf
The Big Data Tools
http://thebigdatablog.weebly.com/blog/the-hadoop-ecosystem-overview
http://hortonworks.com/blog/optimize-your-data-architecture-with-hadoop/
Traditional Data Management Architecture
http://hortonworks.com/blog/optimize-your-data-architecture-with-hadoop/
New Data Management Architecture
http://www.kdnuggets.com/2014/05/big-data-landscape-v30-analyzed.html
https://www.digitalnewsasia.com/business/forget-data-warehousing-its-data-lakes-now
Data Lake
How Does the Data Lake Work?
http://www.clearpeaks.com/blog/category/tableau
Traditional Enterprise Data Warehouse
What Do You Consume from the Data Lake?
https://www.digitalnewsasia.com/business/forget-data-warehousing-its-data-lakes-now
Volume? Variety? Velocity?
https://www.digitalnewsasia.com/business/forget-data-warehousing-its-data-lakes-now
Value
https://www.digitalnewsasia.com/business/forget-data-warehousing-its-data-lakes-now
Big Data + Analytics = Value
https://www.digitalnewsasia.com/business/forget-data-warehousing-its-data-lakes-now
Big Data Analytics
http://hortonworks.com/blog/big-data-refinery-fuels-next-generation-data-architecture/
Big Data Analytics
http://dataofthings.blogspot.com/2014/04/the-bbbt-sessions-hortonworks-big-data.html
Big Data Analytics
http://www.gartner.com/it-glossary/predictive-analytics
How to do Big Data Analytics?
https://www.digitalnewsasia.com/business/forget-data-warehousing-its-data-lakes-now
Data Science Experience Sharing, Big Data Challenge #2, Bangkok, Thailand
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
What is Data Science?
The Rise of the Data Scientist
http://flowingdata.com/2009/06/04/rise-of-the-data-scientist/
2009
https://hbr.org/
http://hrb.org
http://www.anlytcs.com/2014/01/data-science-venn-diagram-v20.html
2014
The Rise of the Data Scientist
http://www.anlytcs.com/2014/01/data-science-venn-diagram-v20.html
2014
The Data Science
The Solution, Data Science Team
Data Science Team
Doing Data Science by O'Neil et al (2013)
Data Science Team
Analyzing the Analyzers, Harris (2013)
Data Science Team: Data Scientist & Data Engineer
http://www.kdnuggets.com/2015/11/different-data-science-roles-industry.html
Data Science Team: Data Scientist & Data Engineer
http://www.kdnuggets.com/2015/11/different-data-science-roles-industry.html
https://www.facebook.com/DataScienceTh/posts/931828353527079:0
Data Science Professionals
http://www.kdnuggets.com/2015/11/different-data-science-roles-industry.html
Data Science for Dummies, Pierson (2015)
∗ Build an in-house team
• Train existing employees
• Train existing employees and hire experts
• Hire experts
∗ Outsource requirements to private DS consultants
• Outsource comprehensive DS strategy development
• Outsource DS solutions to specific problems
∗ Leverage cloud-based platform solutions
How to Build a DS Team?
Machine Learning
“Improving performance in some task with experience.” Tom Mitchell (1998)
The field of study that gives computers the ability to learn without being explicitly programmed. Arthur Samuel (1959)
Wikipedia, Data Visualization for Dummies (2014)
Data Points: Visualization That Means Something (2013)
Machine Learning deals with systems that can learn from data.
Machine Learning Discovery
• Class Discovery
• Correlation Discovery
• Novelty (Surprise) Discovery
• Association (or Link) Discovery
KirkBorne-workshop-ODSC2016.pdf
The XYZ of Data Science
Smart X:
• Smart Cities
• Smart Highways
• Smart Supply Chain
Precision Y:
• Precision Medicine
• Precision Farming
• Precision Pricing
Personalized Z:
• Personalized Health
• Personalized Learning
• Personalized Shopping Experience
KirkBorne-Workshop-ODSC2016.pdf
Intelligence at the edge of the network… at the point of data collection
DataInquest – Predictive Analytics and Data Science Bootcamp
Data Science is a Team Sport
http://www.ibmbigdatahub.com/blog/why-data-science-team-sport
How to Start?
Hadoop + Spark Workshops
Workshop #1 Installing HDFS and YARN
Workshop #2 WordCount
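As a rough illustration of what a WordCount job computes, the sketch below imitates the map, shuffle, and reduce phases in plain Python. It is not the workshop's code and does not call Hadoop; the function names `map_phase`, `shuffle_phase`, `reduce_phase`, and `word_count` are made up for this example.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group values by key, mimicking what happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    return reduce_phase(shuffle_phase(map_phase(lines)))
```

Calling `word_count(["big data big value", "data lake"])` returns a dictionary of word totals. On a cluster the same three phases run in parallel across many machines, which this single-process sketch does not attempt to show.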
Workshop #3 WordCount (Streaming)
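Hadoop Streaming passes records to a mapper and a reducer as lines of text, with key and value separated by a tab, and hands the reducer its input sorted by key. The sketch below shows that contract in plain Python under those assumptions. The names `mapper`, `reducer`, and `run_locally` are hypothetical, and `run_locally` only simulates the sort step in memory; it is not the workshop's script.

```python
from itertools import groupby


def mapper(lines):
    """Map step: emit one 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Reduce step: sum counts per word; input must already be sorted by key."""
    parsed = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_locally(lines):
    """Simulate map -> sort -> reduce in one process for quick testing."""
    return list(reducer(sorted(mapper(lines))))
```

In an actual streaming job the mapper and reducer would typically be separate scripts that read `sys.stdin` and print each record, with the framework doing the sort between them.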
Workshop #4 WordCount (Frequency Sort)
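A frequency sort orders the WordCount output by count rather than by word. The sketch below shows only the ordering logic in plain Python, assuming the whole input fits in memory; the function name `frequency_sort` and the alphabetical tie-break are choices made for this example, not taken from the workshop.

```python
from collections import Counter


def frequency_sort(lines):
    """Count words, then order by descending count (ties broken alphabetically)."""
    counts = Counter(word for line in lines for word in line.lower().split())
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

On Hadoop this step is often done as a second pass over the WordCount output, because the first job emits its results ordered by word, not by count.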
Workshop #5 Setup Cloudera QuickStart
Workshop #6 Exploring HBase Data in Hue
Workshop #7 Design a Schema for Quick Twitter Relationship Lookup
Workshop #8 Design a Schema for IoT Log (Smart Meter)
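HBase keeps rows sorted lexicographically by row key, so the key layout decides which reads are cheap. One commonly used pattern for time-series logs, shown here only as a possible design and not as the workshop's answer, is a key made of the meter ID followed by a reversed timestamp, so that the newest reading for a meter sorts first. The names `make_row_key`, `parse_row_key`, `MAX_TS`, and the `#` separator are all assumptions for this sketch.

```python
MAX_TS = 10**13 - 1  # assumed upper bound for epoch milliseconds (13 digits)


def make_row_key(meter_id: str, epoch_ms: int) -> str:
    """Build 'meter_id#reversed_timestamp' so newer readings sort first."""
    if not 0 <= epoch_ms <= MAX_TS:
        raise ValueError("timestamp outside the assumed range")
    reversed_ts = MAX_TS - epoch_ms
    # Zero-pad so that string order matches numeric order.
    return f"{meter_id}#{reversed_ts:013d}"


def parse_row_key(row_key: str):
    """Recover (meter_id, epoch_ms) from a key built by make_row_key."""
    meter_id, reversed_ts = row_key.rsplit("#", 1)
    return meter_id, MAX_TS - int(reversed_ts)
```

With this layout a scan that starts at the prefix `meter_id#` returns the most recent readings first. The trade-off is that all writes for one meter land in adjacent rows, which may or may not suit a given workload.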
Workshop #9 Create an HBase Table for Smart Meter Data
Workshop #10 Bank Customer Snapshot
Workshop #10.1-10.4
10.1 Create Hive Tables
10.2 Create External Hive Tables
10.3 Create External Hive Tables
10.4 Partition
Workshop #11 Sqoop
Workshop spk1 WordCount
Workshop spk2 WordCount
Workshop spk3 WordCount
Workshop spk4 SparkSQL + ML
Sharing Experience: Hadoop (Real) Use Cases
Source: Analytics: The New Path to Value, a joint MIT Sloan Management Review and IBM Institute for Business Value study. Copyright © Massachusetts Institute of Technology 2010.
Top Performers Use Analytics 5 Times More Than Lower Performers
Revenue - Cost = Profit
Monitoring and Maintenance
Data sources: IoT sensors in factory
Data products: predictive maintenance models
http://www.electrex.it/en/news/600-automated-energy-management-system-a-enms-for-cement-production-plants.html
Customer Engagement + Location
Data sources: Mobile App, Loyalty Program, GIS
Data products: Buying behavior analysis, coupon-response model, location visualization
http://www.fastcompany.com/3020859/most-creative-people/how-chinas-one-child-policy-forced-starbucks-to-rethink-its-beijing-sto
Fuel Saving
Data sources: Telematics (sensor), GPS
Data products: Prescriptive analytics – route optimization, predictive maintenance (parts/malfunction)
http://www.cnet.com/news/ups-turns-data-analysis-into-big-savings/
Fraud Detection
Data sources: historical pattern of transaction data
Data products: predictive models – fraud/non-fraud
https://bluefishway.com/2013/09/13/panic-oh-no-not-again/
HR Analytics – Google Hiring
Data sources: Historical hiring attributes
Data products: Predictive model – recruiting high performers
Behavioral Test
Situational Test
GPA
Brain Teaser
Good School
Average ROI of Analytics/Data Science
Hadoop + Spark Trends