Hortonworks & Bilot: Data Driven Transformations with Hadoop
Post on 13-Jan-2017
Data driven transformations
Mats Johansson, Solutions Engineer - EMEA
© Hortonworks Inc. 2011 – 2015. All Rights Reserved
Page 2 © Hortonworks Inc. 2014
Traditional systems under pressure
Challenges
• Constrains data to app
• Can’t manage new data
• Costly to Scale
Business Value
Clickstream
Geolocation
Web Data
Internet of Things
Docs, emails
Server logs
2012: 2.8 Zettabytes
2020: 40 Zettabytes
LAGGARDS
INDUSTRY LEADERS
1
2 New Data
ERP CRM SCM
New
Traditional
Modern Data Architecture emerges to unify data & processing
Modern Data Architecture
• Enable applications to access all your enterprise data through an efficient centralized platform
• Supported by a centralized approach to governance, security and operations
• Versatile to handle any applications and datasets no matter the size or type
Clickstream Web & Social
Geolocation Sensor & Machine
Server Logs
Unstructured
SOURCES
Existing Systems
ERP CRM SCM
ANALYTICS
Data Marts
Business Analytics
Visualization & Dashboards
ANALYTICS
Applications Business Analytics
Visualization & Dashboards
HDFS (Hadoop Distributed File System)
YARN: Data Operating System
Interactive Real-Time Batch Partner ISV Batch Batch MPP
EDW
Hortonworks Data Platform powered by Apache Hadoop
Enrich Context
Store Data and Metadata
Internet of Anything
Hortonworks DataFlow powered by Apache NiFi
Perishable Insights
Historical Insights
Hortonworks DataFlow Adds to Hadoop Capabilities
Hortonworks DataFlow and Hortonworks Data Platform deliver the industry’s most complete solution for Big Data management
Only Hortonworks Delivers Open Enterprise Hadoop
HORTONWORKS DATA PLATFORM
YARN: Data Operating System
CLICKSTREAM SENSOR SOCIAL MOBILE GEOLOCATION SERVERLOG
Batch Interactive Search Streaming Machine Learning
EXISTING
YARN: DATA OPERATING SYSTEM
OPERATIONS SECURITY
GOVERNANCE
STORAGE
Machine Learning
Batch
Streaming
Interactive
Search
Centralized Platform: for operations, governance and security
Diverse Applications: run simultaneously on a single cluster
Maximum Data Ingest: including existing and new sources, regardless of raw format
Shared Big Data Assets: across business groups, functions and users
Centralized Platform with YARN-Based Architecture
Offering You the Most Flexibility
ANY DATA: Existing and new datasets
ANY APPLICATION: Multiple engines for data analysis
ANYWHERE: Complete range of deployment options
Batch
Interactive
Search
Streaming
Machine Learning
Click-stream Sensor
Social Mobile
Geo-Location
ServerLog Linux Windows
Cloud On-Premise
Hortonworks Capabilities
The Data Flow Thing
Process and Analyze
Collect
Store & Integrate
Hadoop Driver: Cost optimization
Archive Data off EDW: Move rarely used data to Hadoop as an active archive; store more data for longer
Offload costly ETL processes: Free your EDW to perform high-value functions like analytics & operations, not ETL
Enrich the value of your EDW: Use Hadoop to refine new data sources, such as web and machine data, for new analytical context
ANALYTICS
Data Marts
Business Analytics
Visualization & Dashboards
HDP helps you reduce costs and optimize the value associated with your EDW
ANALYTICS
DATA SYSTEMS
Data Marts
Business Analytics
Visualization & Dashboards
HDP 2.3
ELT
Cold Data, Deeper Archive & New Sources
Enterprise Data Warehouse
Hot
MPP
In-Memory
Clickstream Web & Social
Geolocation Sensor & Machine
Server Logs
Unstructured
Existing Systems
ERP CRM SCM
SOURCES
Single View: Improve acquisition and retention
Predictive Analytics: Identify your next best action
Data Discovery: Uncover new findings
Financial Services
• New Account Risk Screens
• Trading Risk
• Insurance Underwriting
• Improved Customer Service
• Insurance Underwriting
• Aggregate Banking Data as a Service
• Cross-sell & Upsell of Financial Products
• Risk Analysis for Usage-Based Car Insurance
• Identify Claims Errors for Reimbursement
Telecom
• Unified Household View of the Customer
• Searchable Data for NPTB Recommendations
• Protect Customer Data from Employee Misuse
• Analyze Call Center Contacts Records
• Network Infrastructure Capacity Planning
• Call Detail Records (CDR) Analysis
• Inferred Demographics for Improved Targeting
• Proactive Maintenance on Transmission Equipment
• Tiered Service for High-Value Customers
Retail
• 360° View of the Customer
• Supply Chain Optimization
• Website Optimization for Path to Purchase
• Localized, Personalized Promotions
• A/B Testing for Online Advertisements
• Data-Driven Pricing, improved loyalty programs
• Customer Segmentation
• Personalized, Real-time Offers
• In-Store Shopper Behavior
Manufacturing
• Supply Chain and Logistics
• Optimize Warehouse Inventory Levels
• Product Insight from Electronic Usage Data
• Assembly Line Quality Assurance
• Proactive Equipment Maintenance
• Crowdsource Quality Assurance
• Single View of a Product Throughout Lifecycle
• Connected Car Data for Ongoing Innovation
• Improve Manufacturing Yields
Healthcare
• Electronic Medical Records
• Monitor Patient Vitals in Real-Time
• Use Genomic Data in Medical Trials
• Improving Lifelong Care for Epilepsy
• Rapid Stroke Detection and Intervention
• Monitor Medical Supply Chain to Reduce Waste
• Reduce Patient Re-Admittance Rates
• Video Analysis for Surgical Decision Support
• Healthcare Analytics as a Service
Oil & Gas
• Unify Exploration & Production Data
• Monitor Rig Safety in Real-Time
• Geographic exploration
• DCA to Slow Well Declines Curves
• Proactive Maintenance for Oil Field Equipment
• Define Operational Set Points for Wells
Government
• Single View of Entity
• CBM & Autonomic Logistic Analysis
• Sentiment Analysis on Program Effectiveness
• Prevent Fraud, Waste and Abuse
• Proactive Maintenance for Public Infrastructure
• Meet Deadlines for Government Reporting
Hadoop Driver: Advanced analytic applications
Hortonworks Data Platform
Hortonworks Data Platform 2.3
Hortonworks Data Platform provides Hadoop for the Enterprise: a centralized architecture of core enterprise services, for any application and any data.
Open & Enterprise
• HDP incorporates every element required of an enterprise data platform: data storage, data access, governance, security, operations
• All components are developed in open source and then rigorously tested, certified, and delivered as an integrated open source platform that’s easy to consume and use by the enterprise and ecosystem.
YARN: Data Operating System (Cluster Resource Management)
Apache Pig
HDFS (Hadoop Distributed File System)
INTEGRATION GOVERNANCE BATCH, INTERACTIVE & REAL-TIME DATA ACCESS
Apache Falcon
Apache Hive
Apache Slider
Apache HBase
Apache Accumulo
Apache Solr
Apache Spark
Apache Storm
Apache Sqoop
Apache Flume
Apache Kafka
SECURITY
Apache Ranger
Apache Knox
Apache Falcon
OPERATIONS
Apache Ambari
Apache ZooKeeper
Apache Oozie
Apache Atlas
Apache Atlas Cloudbreak
HDP: Any Data, Any Application, Anywhere
Any Application
• Deep integration with ecosystem partners to extend existing investments and skills
• Broadest set of applications through the stable of YARN-Ready applications
Any Data
• Deploy applications fueled by clickstream, sensor, social, mobile, geo-location, server log, and other new-paradigm datasets alongside existing legacy datasets
Anywhere
• Implement HDP naturally across the complete range of deployment options
Clickstream Web & Social
Geolocation Internet of Things
Server Logs
Files, emails
ERP CRM SCM
hybrid
commodity appliance cloud
Over 70 Hortonworks Certified YARN Apps
The Data Lake: Use Cases
What is a Data Lake?
• It is a PLATFORM for your data (NOT a database).
• Multipurpose open PLATFORM to land all data in a single place and interact with it in many ways (Stream, Batch, Interactive Query).
• A platform that allows the ecosystem to provide higher-level services (SAP, SAS, Microsoft, Teradata, etc.).
• Provides first-class APIs and frameworks to enable integration.
• Provides first-class data management capabilities (metadata management, security, governance, transformation pipelines, replication, retention, etc.).
Spotify Use Case
Full presentation available at:
http://www.slideshare.net/JoshBaer/how-apache-drives-music-recommendations-at-spotify?related=1
Data Discovery and Predictive Analytics
Elefante Wine Inc. Use Case & Demo
Mats Johansson, Solutions Engineer EMEA, Hortonworks
Tweet: #hadooproadshow
Elefante Wine Current Challenges
The Company
Elefante Wine is a boutique wine fulfillment company with a large fleet of trucks. It delivers wine in a highly regulated industry with stringent transportation requirements.
The Situation
Recently a number of driver violations led to fines and increased insurance rates.
The Challenges
• Rising Operational Costs
• Driver Safety
• Risk Management
• Logistics Optimization
Elefante Wine Risk and Driver Safety Challenges
Trucks outfitted with new sensors generating large volumes of new data:
• Location
• Speed
• Driver Violations
Need to integrate real-time & historical data
Increase safety and reduce liabilities
Anticipate driver violations BEFORE they happen and take precautionary actions
Find predictive correlations in driver behavior over large volumes of real-time data
Difficult to deliver timely insights to the right people and systems to take action
Data Discovery: Uncover new findings
Predictive Analytics: Identify your next best action
Better Understanding of the Past
Better Prediction of the Future
Elefante Wine’s YARN-enabled Architecture
Distributed Storage: HDFS
Many Workloads: YARN
Stream Processing (Storm)
Inbound Messaging (Kafka)
Real-time Serving (HBase)
Alerts & Events (ActiveMQ)
Real-Time Web App
SQL
Interactive Query (Hive on Tez)
Truck Sensors
One cluster with consistent security, governance & operations
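A minimal sketch of the event flow in this architecture, using plain Python stand-ins rather than the real components: a deque plays the Kafka topic, a dict plays the HBase table, and a list plays the ActiveMQ alert queue. The field names (`truck_id`, `speed`, `event_type`), the 80 mph threshold and the sample events are all hypothetical and not taken from the demo.

```python
from collections import deque

SPEED_LIMIT_MPH = 80  # hypothetical threshold, not from the demo


def process_stream(inbound, serving_store, alerts):
    """Drain the inbound queue, update the serving store, and collect alerts."""
    while inbound:
        event = inbound.popleft()
        # Real-time serving: keep the latest event per truck (stand-in for HBase).
        serving_store[event["truck_id"]] = event
        # Alerts & events: flag speeding or any non-normal event (stand-in for ActiveMQ).
        speeding = event["speed"] > SPEED_LIMIT_MPH
        if speeding or event["event_type"] != "normal":
            alerts.append({
                "truck_id": event["truck_id"],
                "reason": "speeding" if speeding else event["event_type"],
            })
    return serving_store, alerts


# Made-up sample events, for illustration only.
inbound = deque([
    {"truck_id": 1, "speed": 62, "event_type": "normal"},
    {"truck_id": 2, "speed": 91, "event_type": "normal"},
    {"truck_id": 1, "speed": 55, "event_type": "lane_departure"},
])
store, alerts = process_stream(inbound, {}, [])
print(alerts)
```

In the deck's architecture the same three roles are filled by Kafka, Storm with HBase, and ActiveMQ; the sketch only shows how one event fans out to serving and alerting.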
Explore Enriched Events to Build a Predictive Model
Apache Zeppelin
• Notebook environment that supports Spark
• Agile data visualizations
Zeppelin Supports Spark Jobs on YARN
Data Scientists
• Explore and visualize events in Zeppelin
• Build a machine-learning model in Spark, to predict driver violations
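As a rough illustration of what "a model to predict driver violations" could look like, here is a small logistic regression trained by gradient descent in plain Python. It is a stand-in for the Spark model mentioned on the slide, not the demo's code. The three features (`foggy`, `rainy`, `driver_certified`) and the eight training rows are invented for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """Estimated probability of a violation for one feature vector."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit weights and bias with full-batch gradient descent on log loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    m = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = predict_proba(weights, bias, x) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / m for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / m
    return weights, bias


# Invented toy rows: [foggy, rainy, driver_certified]; label 1 means a violation.
rows = [[1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 0, 1],
        [0, 0, 1], [1, 0, 1], [0, 1, 1], [0, 0, 0]]
labels = [1, 1, 1, 0, 0, 0, 0, 0]

weights, bias = train_logistic(rows, labels)
risky = predict_proba(weights, bias, [1, 1, 0])  # fog and rain, uncertified driver
safe = predict_proba(weights, bias, [0, 0, 1])   # clear weather, certified driver
print(risky > safe)
```

On a real cluster the training data would be the enriched events and the fitting would run as a Spark job on YARN; the sketch only shows the shape of the problem.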
Streaming Demo
Data Discovery Through Streaming Sensor Data from Trucks
Enriching Truck Events for Analysis with Pig
HDFS Raw Truck Events
Weather Data Sets
Raw Weather Data
HCatalog (Metadata)
Payroll Data
HR & Payroll DBs
Load Raw Truck Events
Clean & Filter
Cleaned Events
Transformed Events
Transform
Join with HR & weather data
Enriched Events
Store
Zeppelin
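The flow on this slide (load, clean & filter, transform, join with HR and weather data, store) can be sketched in a few lines. The deck does this in Pig; the version below uses plain Python lists and dicts purely as an illustration. All field names, lookup keys and sample records are assumptions made up for the example.

```python
import json


def clean(events):
    """Clean & filter: drop events missing a driver id or a speed reading."""
    return [e for e in events
            if e.get("driver_id") is not None and e.get("speed") is not None]


def transform(events):
    """Transform: derive a boolean violation flag from the event type."""
    return [{**e, "is_violation": e["event_type"] != "normal"} for e in events]


def enrich(events, payroll_by_driver, weather_by_city_date):
    """Join each event with payroll data (by driver) and weather (by city and date)."""
    enriched = []
    for e in events:
        payroll = payroll_by_driver.get(e["driver_id"], {})
        weather = weather_by_city_date.get((e["city"], e["date"]), {})
        enriched.append({**e, **payroll, **weather})
    return enriched


def store(events, sink):
    """Store: append one JSON line per enriched event to the sink."""
    for e in events:
        sink.append(json.dumps(e, sort_keys=True))
    return sink


def run_pipeline(raw_events, payroll_by_driver, weather_by_city_date):
    return enrich(transform(clean(raw_events)), payroll_by_driver, weather_by_city_date)


# Made-up sample inputs, for illustration only.
raw = [
    {"driver_id": 10, "city": "Oslo", "date": "2015-06-01", "speed": 70, "event_type": "normal"},
    {"driver_id": None, "city": "Oslo", "date": "2015-06-01", "speed": 65, "event_type": "normal"},
    {"driver_id": 11, "city": "Oslo", "date": "2015-06-01", "speed": 88, "event_type": "overspeed"},
]
payroll = {10: {"hours_logged": 45}, 11: {"hours_logged": 60}}
weather = {("Oslo", "2015-06-01"): {"foggy": True}}

enriched = run_pipeline(raw, payroll, weather)
lines = store(enriched, [])
print(len(lines))
```

Each function corresponds to one box on the slide, so the same structure would carry over to a Pig script with HCatalog supplying the table metadata.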
Apache Zeppelin Visualization Demo
Exploring and Model Building on enriched sensor data
Recommendations from the CDO
Investment recommendations, in order of priority
1. Visibility sensors and auto braking systems to deal with foggy conditions
2. Slip-resistant tires for improved safety during rainy conditions
3. Driver certification to minimize violations
Apps on YARN
Trucking company datasets stored in HDFS
Real-time and Predictive Application Architecture
Your BI Tool
Predictive application
Truck sensors
App alerts (ActiveMQ)
Messages
SQL Stream NoSQL ML Use Model
Large Scale Machine-Learning Insights for Elefante Wine
Improve Predictive Power
• Algorithms on Terabytes of data
• Improve confidence by testing hypotheses over huge datasets
Accelerate Time to Market
• Rapidly test out machine-learning algorithms
Integrate Predictive Models into Apps
• Run models in Storm or your other apps
Run it All in a Multi-Tenant Cluster
• Large scale machine learning on YARN respects other tenants in an HDP cluster
Try It Yourself, Download the Sandbox:
hortonworks.com/sandbox
Thank you!
Mats Johansson
mjohansson@hortonworks.com
@matsjo66
https://se.linkedin.com/in/matsjo66