TRANSCRIPT
Solving Big Data Problems using Hortonworks
© Hortonworks Inc. 2011 – 2015. All Rights Reserved
Hortonworks Company Profile
100% open source Apache Hadoop data platform
Founded in 2011
1st Hadoop provider to go public: IPO 4Q14 (NASDAQ: HDP)
800+ employees across 17 countries
1,350 technology partners
Fastest company to reach $100M in revenue
Let’s talk about Big Data
September 2014 survey of 100 CIOs from the US and Europe
What problems and opportunities does Big Data create?
NEW: data that traditional platforms cannot handle, alongside TRADITIONAL data.
The Opportunity: unlock transformational business value from full-fidelity data and analytics, for all data.
Traditional Data Sources: ERP, CRM, SCM
New Data Sources: geolocation, server logs, files & emails, sensors and machines, clickstream, social media
The Future of Data: Actionable Intelligence
Diagram: data in motion from the Internet of Anything feeds data at rest, stored and processed across storage groups 1 through 4.
Hortonworks Data Platform
Batch, Interactive, Search, Streaming and Machine Learning workloads run on the YARN resource management system, over clickstream, sensor, social, mobile, geolocation, server log and existing data sources.
HDP is a collection of Apache Projects
HORTONWORKS DATA PLATFORM
Data Management: Hadoop (HDFS) & YARN. Data Access: Pig, Hive, Tez, HBase, Accumulo, Phoenix, Storm, Spark, Solr, Slider. Governance & Integration: Flume, Sqoop, Kafka, Falcon, Atlas. Operations: Ambari, Cloudbreak, ZooKeeper, Oozie. Security: Ranger, Knox.
Ongoing Innovation in Apache: component versions advance with each release, from HDP 2.0 (Oct 2013) through HDP 2.1 (April 2014) and HDP 2.2 (Dec 2014) to HDP 2.3 (July 2015).
Hortonworks Data Flow
Visual User Interface: drag and drop for efficient, agile operations
Immediate Feedback: start, stop, tune and replay dataflows in real time
Adaptive to Volume and Bandwidth: any data, big or small
Event-Level Data Provenance: governance, compliance & data evaluation
Secure Data Acquisition & Transport: fine-grained encryption for controlled data sharing and selective data democratization
Powered by Apache NiFi
HDF and HDP Deliver a Complete Big Data Solution
• HDF dynamically connects HDP to data at the edge
• HDF secures and encrypts the movement of data into HDP
• HDF includes mature IoAT data protocols that improve device extensibility
• HDF supports easily adjustable, bi-directional IoAT dataflows
• HDF offers traceability of IoAT data with lineage and audit trails
• HDF brings a real-time, visual user interface to manipulate live dataflows
Hortonworks Revenue Model
HDP and HDF are 100% free and open source – no license. Our customers subscribe to support, consulting experts and training programs. Annual Subscriptions align your success with ours.
Expert Consulting & Training help your team get to actionable intelligence as efficiently as possible.
Diagram: the customer journey runs from Architect & Develop through Deploy and Operate to Expand, repeated across Projects 1 through 6.
Sales Plays
Hadoop Driver: Cost optimization
Archive data off the EDW: move rarely used data to Hadoop as an active archive and store more data longer.
Offload costly ETL processes: free your EDW to perform high-value functions like analytics & operations, not ETL.
Enrich the value of your EDW: use Hadoop to refine new data sources, such as web and machine data, for new analytical context.
Analytics: data marts, business analytics, visualization & dashboards.
HDP helps you reduce costs and optimize the value associated with your EDW
Diagram: sources (clickstream, web & social, geolocation, sensor & machine, server logs, unstructured data, and existing systems such as ERP, CRM and SCM) feed HDP 2.3, which handles ELT and holds cold data, a deeper archive and new sources across nodes 1 to N; the enterprise data warehouse keeps hot data alongside MPP and in-memory systems; analytics (data marts, business analytics, visualization & dashboards) sit on top.
Use-case patterns: Single View (improve acquisition and retention); Predictive Analytics (identify your next best action); Data Discovery (uncover new findings).
Financial Services: New Account Risk Screens; Trading Risk; Insurance Underwriting; Improved Customer Service; Aggregate Banking Data as a Service; Cross-sell & Upsell of Financial Products; Risk Analysis for Usage-Based Car Insurance; Identify Claims Errors for Reimbursement.
Telecom: Unified Household View of the Customer; Searchable Data for NPTB Recommendations; Protect Customer Data from Employee Misuse; Analyze Call Center Contact Records; Network Infrastructure Capacity Planning; Call Detail Record (CDR) Analysis; Inferred Demographics for Improved Targeting; Proactive Maintenance on Transmission Equipment; Tiered Service for High-Value Customers.
Retail: 360° View of the Customer; Supply Chain Optimization; Website Optimization for Path to Purchase; Localized, Personalized Promotions; A/B Testing for Online Advertisements; Data-Driven Pricing and Improved Loyalty Programs; Customer Segmentation; Personalized, Real-Time Offers; In-Store Shopper Behavior.
Manufacturing: Supply Chain and Logistics; Optimize Warehouse Inventory Levels; Product Insight from Electronic Usage Data; Assembly Line Quality Assurance; Proactive Equipment Maintenance; Crowdsource Quality Assurance; Single View of a Product Throughout Its Lifecycle; Connected Car Data for Ongoing Innovation; Improve Manufacturing Yields.
Healthcare: Electronic Medical Records; Monitor Patient Vitals in Real Time; Use Genomic Data in Medical Trials; Improving Lifelong Care for Epilepsy; Rapid Stroke Detection and Intervention; Monitor the Medical Supply Chain to Reduce Waste; Reduce Patient Re-Admittance Rates; Video Analysis for Surgical Decision Support; Healthcare Analytics as a Service.
Oil & Gas: Unify Exploration & Production Data; Monitor Rig Safety in Real Time; Geographic Exploration; DCA to Slow Well Decline Curves; Proactive Maintenance for Oil Field Equipment; Define Operational Set Points for Wells.
Government: Single View of Entity; CBM & Autonomic Logistics Analysis; Sentiment Analysis on Program Effectiveness; Prevent Fraud, Waste and Abuse; Proactive Maintenance for Public Infrastructure; Meet Deadlines for Government Reporting.
Hadoop Driver: Advanced analytic applications
NiFi and HDF Drivers
Optimize Splunk: Reduce costs by pre-filtering data so that only relevant content is forwarded into Splunk
Ingest Logs for Cyber Security: Integrated and secure log collection for real-time data analytics and threat detection
Feed Data to Streaming Analytics: Accelerate big data ROI by streaming data into analytics systems such as Apache Storm or Apache Spark Streaming
Move Data Internally: Optimize resource utilization by moving data between data centers or between on-premises infrastructure and cloud infrastructure
Capture IoT Data: Transport disparate and often remote IoT data in real time, despite any limitations in device footprint, power or connectivity, avoiding data loss
Hadoop Driver: Enabling the Data Lake (scale and scope)
Data Lake Definition
• Centralized architecture: multiple applications on a shared data set with consistent levels of service.
• Any app, any data: multiple applications accessing all data, affording new insights and opportunities.
• Unlocks 'systems of insight': advanced algorithms and applications used to derive new value and optimize existing value.
Drivers: 1. cost optimization, 2. advanced analytic apps
Goal: a centralized architecture and a data-driven business
DATA LAKE
Journey to the Data Lake with Hadoop
Systems of Insight
Case Study: 12-Month Hadoop Evolution at TrueCar (Data Platform Capabilities)
12-month execution plan:
June 2013: begin Hadoop execution
July 2013: Hortonworks partnership
Aug 2013: training & development begins
Nov 2013: production cluster, 60 nodes, 2 PB
Dec 2013: three production apps (3 total)
Jan 2014: 40% dev staff Perficient
Feb 2014: three more production apps (6 total)
May 2014: IPO
12-month results at TrueCar: six production Hadoop applications; sixty nodes / 2 PB of data; storage and compute costs down from $19/GB to $0.12/GB.
“We addressed our data platform capabilities strategically as a pre-cursor to IPO.”
Hortonworks Data Platform
Hadoop emerged as foundation of new data architecture
Apache Hadoop is an open source data platform for managing large volumes of high-velocity, high-variety data.
• Built by Yahoo! to be the heartbeat of its ad & search business
• Donated to the Apache Software Foundation in 2005, with rapid adoption by large web properties & early-adopter enterprises
• Incredibly disruptive to current platform economics
Traditional Hadoop advantages: manages the new data paradigm; handles data at scale; cost effective; open source.
Traditional Hadoop had limitations: batch-only architecture; single-purpose clusters and specific data sets; difficult to integrate with existing investments; not enterprise-grade.
Hadoop with MapReduce (2006–2009): applications ran against MapReduce (largely batch processing) on top of HDFS (Hadoop Distributed File System) storage, across nodes 1 to N. The result was siloed clusters, a largely batch-only system, and difficulty integrating with existing systems.
MR-279: YARN
Hadoop 2 & YARN-based architecture: YARN, the data operating system, runs over HDFS and supports batch, interactive and real-time workloads on a single cluster.
Architected & led development of YARN to enable the Modern Data Architecture
October 23, 2013
Apache Hadoop – Data Operating System
Shared compute & workload management: a common data platform for many applications; multi-tenant access & processing; batch, interactive & real-time use cases.
Common & shared scale-out storage: shared data assets, flexible schema, cross-workload access.
YARN: Data Operating System (cluster resource management) over HDFS (Hadoop Distributed File System), spanning nodes 1 to N.
Batch, interactive & real-time data access engines: Script (Pig) and SQL (Hive) on Tez; Java/Scala (Cascading) on Tez; Stream (Storm); Search (Solr); NoSQL (HBase, Accumulo) on Slider; In-Memory (Spark); plus other ISV engines.
Enterprise Hadoop
Core Capabilities of Enterprise Hadoop
Data Management: store and process all of your corporate data assets.
Data Access: access your data simultaneously in multiple ways (batch, interactive, real-time).
Governance & Integration: load data and manage it according to policy.
Operations: deploy and effectively manage the platform.
Security: provide a layered approach to security through authentication, authorization, accounting and data protection.
Presentation & Application: enable both existing and new applications to provide value to the organization.
Enterprise Management & Security: empower existing operations and security tools to manage Hadoop.
Deployment Options: provide deployment choice across physical, virtual and cloud.
Hortonworks Data Platform 2.3
YARN: Data Operating System over HDFS (Hadoop Distributed File System), spanning nodes 1 to N.
Data Access: Batch (MapReduce), Script (Pig), SQL (Hive), Search (Solr), NoSQL (HBase, Accumulo, Phoenix), Stream (Storm), In-Memory (Spark) and other ISV engines, running on Tez and Slider.
Security: administration, authentication, authorization, auditing and data protection via Ranger, Knox, Atlas and HDFS encryption.
Governance & Integration: data workflow with Sqoop, Flume, Kafka, NFS and WebHDFS; data lifecycle & governance with Falcon and Atlas.
Operations: provisioning, managing & monitoring with Ambari, Cloudbreak and ZooKeeper; scheduling with Oozie.
Deployment choice: Linux, Windows, on-premises or cloud.
Architectures
Basic EDW Cost Optimization Architecture
1. Fetch: Sqoop batch-extracts data from the EDW into raw HDFS, exposed through Hive external tables.
2. Transform: Hive transforms the raw data into processed tables.
3. Load: processed results are loaded back into the EDW for existing analytics.
4. Interactive: HiveServer supports reporting directly from BI tools.
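To make the flow concrete, here is a minimal, hypothetical sketch of these steps driven through HiveServer2's JDBC interface; the host name, user, table names and HDFS path are illustrative assumptions, and the Sqoop extract is presumed to have already landed the raw files.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class EdwOffloadSketch {
    public static void main(String[] args) throws Exception {
        // Assumed HiveServer2 host and paths; Sqoop has already written /data/raw/sales
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hiveserver:10000/default", "etl_user", "");
             Statement st = conn.createStatement()) {

            // External table over the raw files in HDFS (fetch step)
            st.execute("CREATE EXTERNAL TABLE IF NOT EXISTS sales_raw ("
                    + "order_id BIGINT, amount DOUBLE, order_date STRING) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
                    + "LOCATION '/data/raw/sales'");

            // Transform raw data into a processed, columnar table
            st.execute("DROP TABLE IF EXISTS sales_processed");
            st.execute("CREATE TABLE sales_processed STORED AS ORC AS "
                    + "SELECT order_date, SUM(amount) AS daily_total FROM sales_raw GROUP BY order_date");

            // Interactive query through HiveServer2, as a BI tool would issue it
            try (ResultSet rs = st.executeQuery(
                    "SELECT * FROM sales_processed ORDER BY order_date LIMIT 10")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
                }
            }
        }
    }
}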
More Than Cost Savings: Enrich With New Data
1. Fetch: Sqoop batch-extracts data from the EDW into raw HDFS, exposed through Hive external tables.
2. Stream: NiFi loads data from new sources into HDFS.
3. Transform: Hive transforms the raw data into processed tables.
4. Load: processed results are loaded back into the EDW for existing analytics.
5. Interactive: HiveServer supports reporting from BI tools.
6. New analytics run directly against the enriched data in Hadoop.
Streaming Solution Architecture
Real-time data feeds flow into Apache Kafka for high-speed ingest on the HDP 2.x data lake (YARN over HDFS). Storm handles real-time stream processing, HBase and Accumulo handle online data processing, Solr on Slider provides search, and Hive provides SQL with streaming ingest into HDFS.
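As a minimal illustration of how a real-time feed could be published into Kafka for this architecture (not from the original deck; the broker host, HDP's default port 6667 and the topic name are assumptions), a producer might look like:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TruckEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka1:6667");  // assumed broker host and port
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        Producer<String, String> producer = new KafkaProducer<String, String>(props);
        // Keying by truck id keeps each truck's events in one partition, preserving order
        String event = "{\"truckId\":\"truck-42\",\"speed\":87,\"ts\":" + System.currentTimeMillis() + "}";
        producer.send(new ProducerRecord<String, String>("truck-events", "truck-42", event));
        producer.close();
    }
}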
Key Tenets of Lambda Architecture
Batch layer: manages the master data set, an immutable, append-only set of raw data; cleanses, normalizes and pre-computes batch views; advanced statistical calculations.
Speed layer: real-time event stream processing; computes real-time views.
Serving layer: low-latency, ad-hoc query; reporting, BI & dashboards.
New data streams into both layers: the batch layer stores data and pre-computes views, the speed layer processes streams into incremental views, and queries merge the business views from both through the serving layer.
HDP and HDF
High Level Big Data IoT Architecture
IoT on HDP
Problem Statement
Reference Architecture & Sizing
Solution Design & Customer Case Studies
Implementation Plan
Project Cost & ROI
Ms. Brady knows that to get a handle on skyrocketing premiums, she will need to better understand what is causing the incidents and how to prevent them. Ms. Brady sets the goal of reducing incidents by 5% within 90 days.
Incidents involving maintenance vehicles have continued to increase under COO Brady's watch. (Chart: insurance premiums rising from 2012 through 2015, reaching $17.5M.)
Ms. Brady tasks her Business Analyst, Tam, with gathering the necessary data to understand the cause of incidents and reduce them.
Business Analyst Tam
Mega Corp has a problem
Given the current premium cost of $3,500 per truck on 5,000 trucks, a 10% reduction in incidents will move the company out of its current high-risk insurance category and save $1,000 per truck per year on premiums, or $5,000,000 annually.
Business Analyst Tam
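Spelling out the arithmetic behind Tam's estimate:

\[
5{,}000 \text{ trucks} \times \$3{,}500 = \$17{,}500{,}000 \text{ in annual premiums}, \qquad
5{,}000 \text{ trucks} \times \$1{,}000 = \$5{,}000{,}000 \text{ saved per year}
\]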
Tam considers four questions she must answer to better understand and mitigate incidents. They are:
1) Is there a correlation of driver training to incidents?
2) Is there a correlation of weather to incidents?
3) Is there a correlation between certain driving behaviors and incidents?
4) Is it possible to predict incidents before they occur?
Business Analyst Tam
Shift from Reactive to Proactive & Prescriptive:
From reaction to behavioral insight into human activity
From static resource planning to resource optimization
From break-then-fix to preventative maintenance
Initially, Tam's team is concerned that they may not be able to capture all the data necessary to answer the questions Tam has posed and help her mitigate incidents. They know that not all of the data is structured, some of it is created in real time and transmitted over the Internet, and some will have to be captured from external sources.
Vehicle Data
Route Data
Weather Data
Structured Driver Data
Semi-Structured Maintenance Data
Sue, Varun, Jeff
The Team Recognizes That the Current Data Architecture Limits Predictive Capabilities
1. Data silos: difficult to find predictive correlations
2. Data volumes: cannot store enough data to find patterns
3. New data sources: unable to capture and use new data for real-time analysis
Diagram: systems of record (RDBMS, ERP, CRM) feed an enterprise data warehouse with hot, MPP and in-memory tiers; new sources (clickstream, web & social, geolocation, sensor & machine, server logs, unstructured data) sit outside it; analytics (data marts, business analytics, visualization & dashboards) run on top.
The Team Leverages HDF & HDP to Expand the Capabilities of Their Existing Data Platform
Diagram: HDF and HDP are added alongside the existing enterprise data warehouse (hot, MPP and in-memory tiers), the systems of record (RDBMS, ERP, CRM) and the analytics layer (data marts, business analytics, visualization & dashboards).
The team engages their favorite SI and attends Hortonworks University training to get the project under way. Business Analyst Tam is joined by Sue, Varun and Jeff (developer, system admin and SME):
Business Analyst + HDP Data Analyst Training = HDP Data Analyst
Developer + Developer Training = HDP Developer
System Admin + HDP System Admin Training = HDP Sys Admin
SME + Data Science Training = HDP Data Scientist
IoT on HDP
Problem Statement
Reference Architecture & Sizing
Solution Design & Customer Case Studies
Implementation Plan
Project Cost & ROI
Solution Architecture
Truck sensors and weather data are collected, conducted and curated by HDF (bidirectional data flow) into a single HDP cluster with consistent security, governance & operations: distributed storage (HDFS), many workloads (YARN), stream processing & modeling (Kafka, Storm & Spark), real-time serving & searching (HBase), interactive query (Hive on Tez, SQL), and alerts & events feeding a real-time web app, with EDW integration via Sqoop.
The chosen solution provides Mega Corp with the foundation to capture all the required data, analyze correlations, and ultimately create a model that allows them to predict and mitigate incidents before they happen.
Tam and Varun build the application.
Tam, HDP Analyst; Varun, Developer
Ms. Brady is happy with the results. She is able to determine that a subset of drivers is responsible for the increased cost. But like most managers she is not happy for long. Now she wants to be able to predict future incidents.
Data Scientist Jeff points out that HDP has a tremendous statistical algorithm library for machine learning, and that he can use these libraries to predict which drivers are likely to have an event before the event occurs.
Jeff implements the predicted-violations logic using HDP machine learning and is able to predict events before they happen.
Ms. Brady is happy now that she can isolate where problems exist, identify causal events, and build models that help predict events before they occur.
< TODO: Show St. Louis Case Study >
http://hortonworks.com/blog/st-louis-buses-run-with-lhp-telematics-and-hortonworks/
IoT on HDP
Problem Statement
Reference Architecture & Sizing
Solution Design & Customer Case Studies
Implementation Plan
Project Cost & ROI
Big Data Functional Architecture: Key Tenets of Lambda Architecture
Batch layer: manages the master data set, an immutable, append-only set of raw data; cleanses, normalizes and pre-computes batch views; advanced statistical calculations.
Speed layer: real-time event stream processing; computes real-time views.
Serving layer: low-latency, ad-hoc query; reporting, BI & dashboards.
New data streams into both layers: the batch layer stores data and pre-computes views, the speed layer processes streams into incremental views, and queries merge the business views from both through the serving layer.
HDP and HDF
High Level Big Data IoT Architecture
Detailed Reference Architecture for IoT Applications
Source data: server logs, application logs, firewall logs, CRM/ERP, sensors.
High-speed ingest: HDF and Flume stream events in; Kafka buffers them and forwards to Storm or Spark Streaming.
Real-time: Storm/Spark Streaming sinks and bolts write to HDFS, enrich events against HBase/Phoenix real-time storage, apply Spark-ML machine learning models, and raise JMS alerts to a dashboard (Silk).
Batch: Pig transforms, Sqoop exchanges data with existing systems, Spark runs iterative ML, and results land in Hive on HDFS.
Interactive: HiveServer and Spark-Thrift serve reporting through BI tools and a UI framework.
Sample Ingest: NiFi
Apache Storm – Key Attributes
Open source, real-time event stream processing platform that provides fixed, continuous, low-latency processing for very high-frequency streaming data.
Highly scalable: horizontally scalable like Hadoop; e.g., a 10-node cluster can process 1M tuples per second.
Fault-tolerant: automatically reassigns tasks on failed nodes.
Guarantees processing: supports at-least-once and exactly-once processing semantics.
Language agnostic: processing logic can be defined in any language.
Apache project: brand, governance & a large, active community.
Storm – Basic Concepts
Spouts: generate streams.
Tuple: the most fundamental data structure; a named list of values that can be of any datatype.
Streams: groups of tuples.
Bolts: contain data processing, persistence and alerting logic; can also emit tuples for downstream bolts.
Tuple tree: the first spout tuple and all the tuples that were emitted by the bolts that processed it.
Topology: a group of spouts and bolts wired together into a workflow.
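To ground these terms, here is a small, self-contained topology sketch (not from the deck; the spout, bolt and the speed threshold are invented for illustration) using the Storm Java API of this era (backtype.storm packages):

import java.util.Map;
import java.util.Random;
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class TruckTopology {

    // Spout: generates a stream of (truckId, speed) tuples
    public static class TruckEventSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final Random rand = new Random();
        public void open(Map conf, TopologyContext ctx, SpoutOutputCollector collector) {
            this.collector = collector;
        }
        public void nextTuple() {
            collector.emit(new Values("truck-" + rand.nextInt(5000), rand.nextInt(120)));
        }
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("truckId", "speed"));
        }
    }

    // Bolt: processing logic; emits an alert tuple for downstream bolts when speed is excessive
    public static class SpeedAlertBolt extends BaseBasicBolt {
        public void execute(Tuple tuple, BasicOutputCollector out) {
            if (tuple.getIntegerByField("speed") > 80) {
                out.emit(new Values(tuple.getStringByField("truckId")));
            }
        }
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("truckId"));
        }
    }

    public static void main(String[] args) {
        // Topology: spout and bolt wired together into a workflow
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("truck-events", new TruckEventSpout(), 1);
        builder.setBolt("speed-alerts", new SpeedAlertBolt(), 2).shuffleGrouping("truck-events");
        new LocalCluster().submitTopology("truck-topology", new Config(), builder.createTopology());
    }
}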
Distributed Database With Apache HBase
100% open source; store and process petabytes of data; flexible schema; scale out on commodity servers; high performance, high availability; integrated with YARN; SQL and NoSQL interfaces.
Diagram: HBase RegionServers run on the YARN data operating system across nodes 1 to N, with HDFS as permanent data storage.
Dynamic schema; scales horizontally to petabytes of data; directly integrated with Hadoop (HDP).
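For illustration only (the table name "truck_events" and column family "d" are assumptions and must already exist in the cluster), a basic put and get through the HBase Java client might look like:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseEventStore {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath for the ZooKeeper quorum, etc.
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("truck_events"))) {

            // Row key: truck id plus reversed timestamp keeps a truck's newest events together
            String rowKey = "truck-42|" + (Long.MAX_VALUE - System.currentTimeMillis());
            Put put = new Put(Bytes.toBytes(rowKey));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("speed"), Bytes.toBytes(87));
            table.put(put);

            Result result = table.get(new Get(Bytes.toBytes(rowKey)));
            int speed = Bytes.toInt(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("speed")));
            System.out.println("stored speed = " + speed);
        }
    }
}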
Apache Phoenix – Relational Database Layer Over HBase
A SQL skin for HBase: provides a SQL interface for managing data in HBase; supports a large subset of the SQL:1999 mandatory feature set; create tables, insert and update data, and perform low-latency point lookups through JDBC; the Phoenix JDBC driver is easily embeddable in any app that supports JDBC.
Phoenix makes HBase better: oriented toward online/transactional apps; if HBase is a good fit for your app, Phoenix makes it even better; Phoenix gets you out of the "one table per query" model many other NoSQL stores force you into.
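A minimal sketch of what that JDBC usage can look like (the ZooKeeper quorum, table and columns are illustrative assumptions):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.Timestamp;

public class PhoenixSketch {
    public static void main(String[] args) throws Exception {
        // "zk1:2181" is an assumed ZooKeeper quorum; point this at your cluster
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk1:2181")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS truck_events ("
                        + "truck_id BIGINT NOT NULL, event_time TIMESTAMP NOT NULL, speed INTEGER, "
                        + "CONSTRAINT pk PRIMARY KEY (truck_id, event_time))");
            }
            try (PreparedStatement ps = conn.prepareStatement("UPSERT INTO truck_events VALUES (?, ?, ?)")) {
                ps.setLong(1, 42L);
                ps.setTimestamp(2, new Timestamp(System.currentTimeMillis()));
                ps.setInt(3, 87);
                ps.executeUpdate();
            }
            conn.commit(); // Phoenix batches mutations on the client until commit
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery("SELECT truck_id, MAX(speed) FROM truck_events GROUP BY truck_id")) {
                while (rs.next()) {
                    System.out.println(rs.getLong(1) + " -> " + rs.getInt(2));
                }
            }
        }
    }
}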
In-Memory With Spark
Components: Spark SQL, Spark Streaming, MLlib, GraphX
§ A data access engine for fast, large-scale data processing
§ Designed for iterative in-memory computations and interactive data mining
§ Provides expressive multi-language APIs for Scala, Java and Python
Spark ML for Machine Learning
Democratizes machine learning.
Unsupervised tasks: clustering (K-means); recommendation; collaborative filtering (alternating least squares); dimensionality reduction (PCA, SVD).
Supervised tasks: classification (Naïve Bayes, decision tree, random forest, gradient-boosted trees); regression (linear models: SVM, linear regression, logistic regression).
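As a hedged illustration of the MLlib API (the driver-behavior features and local master setting are invented for the example, not taken from the deck), K-means clustering from Java might look like:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

public class DriverClustering {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("driver-clustering").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Hypothetical per-driver features: [avg speed, hard-brake rate, hours driven]
        JavaRDD<Vector> features = sc.parallelize(Arrays.asList(
                Vectors.dense(62.0, 0.1, 7.5),
                Vectors.dense(75.0, 0.9, 10.2),
                Vectors.dense(58.0, 0.2, 6.9),
                Vectors.dense(80.0, 1.1, 11.0)));

        // Cluster drivers into 2 groups (e.g., lower-risk vs higher-risk behavior), 20 iterations
        KMeansModel model = KMeans.train(features.rdd(), 2, 20);

        for (Vector center : model.clusterCenters()) {
            System.out.println("cluster center: " + center);
        }
        System.out.println("cluster for new driver: " + model.predict(Vectors.dense(70.0, 0.8, 9.0)));
        sc.stop();
    }
}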
Apache Hive: SQL in Hadoop
• Created by a team at Facebook
• Provides a standard SQL interface to data stored in Hadoop; quickly analyze data in raw data files; proven at petabyte scale
• Compatible with all major BI tools such as Tableau, Excel, MicroStrategy, Business Objects, etc.
Diagram: SQL queries over sensor, mobile, weblog and operational data, alongside MPP systems.
Comparing SQL Options in HDP
Apache Hive. Strengths: most comprehensive SQL, scale, maturity. Use cases: ETL offload, reporting, large-scale aggregations. Unique capabilities: robust cost-based optimizer; mature ecosystem (BI, backup, security and replication).
Spark SQL. Strengths: in-memory, low latency. Use cases: exploratory analytics, dashboards. Unique capabilities: language-integrated query.
Apache Phoenix. Strengths: real-time read/write, transactions, high concurrency. Use cases: dashboards, systems of engagement, drill-down/drill-up. Unique capabilities: real-time read/write.
Comparing Streaming Options in HDP
Apache Storm: one-at-a-time processing; low latency; operates on tuple streams; at-least-once semantics (exactly-once with Trident); multiple language support.
Spark Streaming: micro-batch (minimum batch latency ~500 ms); higher throughput; operates on streams of tuple batches; exactly-once semantics; multiple language support.
Sizing
HDF Sizing & Best Practices: Sustained Throughput
For ~50 MB/s sustained and thousands of events per second: 1-2 nodes; 8+ cores per node (more is better); 6+ disks per node (SSD or spinning); 2 GB of memory per node; bonded 1 Gb NICs ideally.
For ~100 MB/s sustained and tens of thousands of events per second: 3-4 nodes; 8+ cores per node (more is better); 6+ disks per node (SSD or spinning); 2 GB of memory per node; bonded 1 Gb NICs ideally.
For ~200 MB/s sustained and hundreds of thousands of events per second: 5-7 nodes; 24+ cores per node (effective CPUs); 12+ disks per node (SSD or spinning); 4 GB of memory per node; bonded 10 Gb NICs.
For 400-500 MB/s sustained and hundreds of thousands of events per second: 7-10 nodes; 24+ cores per node (effective CPUs); 12+ disks per node (SSD or spinning); 6 GB of memory per node; bonded 10 Gb NICs.
Kafka – Sizing & Best Practices
Cluster sizing rule of thumb: 10 MB/s per node or 100,000 events/s per node; higher throughput for large batch sizes.
Configuration best practices:
§ Number of partitions = max(total producer throughput / throughput per partition, total consumer throughput / throughput per partition). Over-estimate the number of partitions per topic; the partition count cannot be increased later without breaking message-ordering guarantees.
§ Co-locate Kafka and Storm processes: Storm is CPU bound while Kafka is throughput bound. In high-throughput scenarios, separate Kafka and Storm onto independent nodes.
Storm – Sizing & Best Practices
Cluster sizing rule of thumb: 100,000 events per second per supervisor node. This is predicated on the work being performed by the bolt's execute method; mileage will vary by project; testing is critical.
Configuration best practices: 1 worker per machine per topology; 1 executor per CPU core; topology parallelism = number of machines × (number of cores per machine - 1); distribute total parallelism among spouts and bolts to maximize topology throughput.
HBase – Sizing & Best Practices
Cluster sizing rule of thumb: 10 MB/s per node of write throughput; 1-3 TB per node of compressed data (non-replicated), i.e., an HDFS volume of 6-12 TB. Sizing = max(required ingestion rate / write throughput per node, total data size / data per node).
Configuration best practices: RegionServer size ~10 GB; ~100-200 regions per RegionServer; pre-split tables. For IoT scenarios: consider using Hive to store raw data while using Phoenix to store aggregates; batch-insert data into Phoenix using MapReduce; tailor the batch interval to application SLAs.
Problem Statement Recap
Incidents involving maintenance vehicles have continued to increase under COO Brady's watch, and the Department of Transportation has contacted Mega Corporation. (Chart: insurance premiums rising from 2012 through 2015, reaching $17.5M.)
Ms. Brady knows that to get a handle on skyrocketing premiums, she will need to better understand what is causing the incidents and how to prevent them. She sets the goal of reducing incidents by 5% within 90 days, and tasks her Business Analyst, Tam, with gathering the necessary data to understand the cause of incidents and reduce them.
Given the current premium cost of $3,500 per truck on 5,000 trucks, a 10% reduction in incidents will move the company out of its current high-risk insurance category and save $1,000 per truck per year on premiums, or $5,000,000 annually.
Business Analyst Tam
Problem Statement Recap
Sizing – Cluster Storage Requirement
Required cluster capacity ≈ (raw data size × intermediate/materialized data factor × replication count × temp space factor) / compression ratio
Rule of thumb: replication count 3; temp space factor 1.2.
Varies greatly: intermediate/materialized data 30-50%; compression ratio 2-4.
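Applying the rule of thumb to an illustrative 200 TB of raw data (the 200 TB figure comes from the Mega Corp sizing that follows; the 40% intermediate factor and 3:1 compression are assumptions within the stated ranges):

\[
\frac{200\,\mathrm{TB} \times 1.4 \times 3 \times 1.2}{3} \approx 336\,\mathrm{TB}\ \text{of required cluster capacity}
\]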
Data Volume for Mega Corp
§ Number of trucks = 5,000; events per second per truck = 10; size of each event = 128 bytes
§ 1 year of raw sensor data: 5,000 × 10 × 128 × 60 × 60 × 24 × 365 ≈ 200 TB
§ 5 years of sensor data: 200 TB × 5 × 1.5 (processing overhead) = 1.5 PB
§ Q: How many nodes are needed to store 1.5 PB? (answered later)
HBase, Kafka, Storm and NiFi Requirements
Ingest rate = 128 bytes × 5,000 trucks × 10 events/s = 6.4 MB/s
Q: For a 6.4 MB/s ingest rate, how many NiFi, Kafka and Storm nodes are needed?
We will store the last 15 days of data in HBase.
HBase storage needed: 5,000 × 10 × 60 × 60 × 24 × 15 × 128 ≈ 8.2 TB
Q: How many HBase nodes are needed for 8.2 TB of storage?
Sizing – Number of Worker Nodes for Sensor Data
Number of worker nodes = total cluster storage / storage per server = 1.5 PB / 48 TB ≈ 32
Sizing – NiFi, Kafka, HBase and Storm Nodes
Node counts: 32 DataNode & HBase nodes, 2 NiFi nodes, 3 Kafka & Storm ingest nodes, 2 client nodes, 5 master nodes; 44 total.
§ Recall that: NiFi can collect at ~50 MB/s/node; Kafka can ingest at ~10 MB/s/node or 100,000 events/s/node; Storm can process ~100,000 events/s/node; each HBase RegionServer can store ~1 TB.
§ So for a 6.4 MB/s ingest rate, 1 NiFi, 1 Kafka and 1 Storm node would be sufficient; we will use 2 NiFi and 3 Kafka nodes for HA. HBase nodes needed = 8.2 TB / 1 TB ≈ 8. Co-locate Kafka and Storm; co-locate DataNode and HBase.
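The Mega Corp figures above can be reproduced with a small back-of-the-envelope calculation (a sketch, not from the deck; it assumes the 48 TB effective capacity per worker node and ~1 TB per RegionServer quoted in these slides):

public class MegaCorpSizing {
    public static void main(String[] args) {
        final long trucks = 5_000;
        final long eventsPerTruckPerSec = 10;
        final long eventSizeBytes = 128;

        long eventsPerSec = trucks * eventsPerTruckPerSec;                        // 50,000 events/s
        double ingestMBps = eventsPerSec * eventSizeBytes / 1e6;                  // ~6.4 MB/s

        double oneYearTB = eventsPerSec * eventSizeBytes * 86_400.0 * 365 / 1e12; // ~200 TB of raw sensor data
        double fiveYearPB = oneYearTB * 5 * 1.5 / 1_000;                          // x1.5 processing overhead -> ~1.5 PB

        long workerNodes = (long) Math.ceil(fiveYearPB * 1_000 / 48);             // 48 TB effective per worker -> ~32

        double hbase15DayTB = eventsPerSec * eventSizeBytes * 86_400.0 * 15 / 1e12; // ~8.3 TB kept in HBase
        long hbaseNodes = Math.round(hbase15DayTB);                               // ~1 TB per RegionServer -> ~8

        // NiFi ~50 MB/s/node, Kafka ~10 MB/s/node, Storm ~100,000 events/s/node,
        // so one of each would cope with 6.4 MB/s; the plan uses 2 NiFi and 3 Kafka/Storm nodes for HA.
        System.out.printf("ingest: %.1f MB/s (%d events/s)%n", ingestMBps, eventsPerSec);
        System.out.printf("raw data: %.0f TB/year, %.2f PB over 5 years%n", oneYearTB, fiveYearPB);
        System.out.printf("worker nodes: %d, HBase RegionServers: %d%n", workerNodes, hbaseNodes);
    }
}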
Cluster layout, Megacorp datacenter: trucks 1 through 5,000 send sensor data over HDF to 2 NiFi nodes at the edge; 3 ingest nodes co-locate Storm 1-3 and Kafka 1-3; 32 worker nodes run DataNodes 1-32, with HBase RegionServers 1-8 co-located on DataNodes 1-8; 5 master nodes and 2 client nodes complete the HDP cluster.
HDP Service Layout
Master Nodes 1-3: NameNodes 1 and 2, ResourceManagers 1 and 2, ZooKeeper (one per node), JournalNodes (one per node), Oozie, History Server, Timeline Server, HiveServer2, and HBase Masters 1 and 2.
Master Node 4: HiveServer, WebHCat, Falcon.
Master Node 5: ZooKeeper, History Server, Ambari, monitoring & metrics.
Ingest Nodes 1-3: Storm and Kafka.
Worker Nodes 1-32: NodeManager, DataNode and HBase RegionServer.
Edge Nodes 1-2: clients and Knox.
Master Node Specs
12+ cores; 128-256 GB RAM; 1 × 256 GB SSD drive for OS; 2 × 1 TB drives; 2 × 1-10 Gb network connections.
Approximate cost per node: $8,000 - $18,000
NiFi Node Specs
8+ cores; 16 GB RAM; 1 × 256 GB SSD drive for OS; 2 × 1 TB drives; 2 × 1-10 Gb network connections.
Approximate cost per node: $5,000 - $8,000
Slave (Worker) Node Specs
12+ cores; 32-64 GB RAM; 12 × 1 TB SATA drives (processing/IOPS optimized), 12 × 2 TB SATA drives (balanced), or 12 × 4 TB SATA drives (storage optimized); 1 × 1-10 Gb network connection.
Approximate cost per node: $5,000 - $12,000
IoT on HDP
Problem Statement
Reference Architecture & Sizing
Solution Design & Customer Case Studies
Implementation Plan
Project Cost & ROI
Project Plan
Tam puts together a quick project plan and estimates it will take 120 days to get Ms. Brady her solution:
Strategy (10 days): use case workshop
Training (10 days)
Design & Build (60 days): cluster build-out and solution build-out
Test (30 days): prove-out
Promote (10 days): promote the solution
Resource Plan
Data Scientist Consultant
Tam, Data Flow Consultant
Varun, Architect Consultant
Jeff, Developer Consultant
Sue
Jen, Project Manager
Jim, Engagement Manager Consultant
Frank, Enterprise Architect
Sue, Business Analyst
Jim, Developer
IoT on HDP
Problem Statement
Reference Architecture & Sizing
Solution Design & Customer Case Studies
Implementation Plan
Project Cost & ROI
Project Cost
Hardware: 44 nodes × $10,000 = $440K
Software, HDP: 11 SKUs × $18,000/SKU = $198K
Software, HDF: 2 SKUs × $36,000/SKU = $72K
Dev and test consulting: 3,040 hrs* × $300/hr = $912K
Engagement consulting: 360 hrs* × $300/hr = $108K
Training: 30** × $2,500 = $75K
Travel & expense: $100K
Total: $1.885M
* 4 resources × 8 hrs × 95 days; engagement manager for 45 days
** Admin, Analyst & Data Science training for 30 associates
Project ROI
§ Insurance cost reduction: $5M
§ Project cost: $1.885M
§ First year savings: ~$3.1M
Tweet: #hadooproadshow
Thank You