TRANSCRIPT
Solving Big Data Problems using Hortonworks
© Hortonworks Inc. 2011 – 2015. All Rights Reserved
Hortonworks Company Profile
100% open source Apache Hadoop data platform
Founded in 2011
1st Hadoop provider to go public: IPO 4Q14 (NASDAQ: HDP)
800+ employees across 17 countries
1,350 technology partners
Fastest company to reach $100M in revenue
Let’s talk about Big Data
September 2014 survey of 100 CIOs from the US and Europe
What problems and opportunities does Big Data create?
NEW: data that traditional platforms cannot handle, alongside TRADITIONAL data.
The Opportunity: unlock transformational business value from full-fidelity data and analytics, for all data.
Traditional Data Sources: ERP, CRM, SCM
New Data Sources: geolocation, server logs, files & emails, sensors and machines, clickstream, social media
The Future of Data: Actionable Intelligence
Diagram: data in motion from the Internet of Anything feeds data at rest, stored and processed across storage groups 1 through 4.
Hortonworks Data Platform
Batch, Interactive, Search, Streaming and Machine Learning workloads run on the YARN resource management system, over clickstream, sensor, social, mobile, geolocation, server log and existing data sources.
HDP is a collection of Apache Projects
HORTONWORKS DATA PLATFORM
Data Management: Hadoop (HDFS) & YARN. Data Access: Pig, Hive, Tez, HBase, Accumulo, Phoenix, Storm, Spark, Solr, Slider. Governance & Integration: Flume, Sqoop, Kafka, Falcon, Atlas. Operations: Ambari, Cloudbreak, ZooKeeper, Oozie. Security: Ranger, Knox.
Ongoing Innovation in Apache: component versions advance with each release, from HDP 2.0 (Oct 2013) through HDP 2.1 (April 2014) and HDP 2.2 (Dec 2014) to HDP 2.3 (July 2015).
Hortonworks Data Flow
Visual User Interface: drag and drop for efficient, agile operations
Immediate Feedback: start, stop, tune and replay dataflows in real time
Adaptive to Volume and Bandwidth: any data, big or small
Event-Level Data Provenance: governance, compliance & data evaluation
Secure Data Acquisition & Transport: fine-grained encryption for controlled data sharing and selective data democratization
Powered by Apache NiFi
HDF and HDP Deliver a Complete Big Data Solution
• HDF dynamically connects HDP to data at the edge
• HDF secures and encrypts the movement of data into HDP
• HDF includes mature IoAT data protocols that improve device extensibility
• HDF supports easily adjustable, bi-directional IoAT dataflows
• HDF offers traceability of IoAT data with lineage and audit trails
• HDF brings a real-time, visual user interface to manipulate live dataflows
Hortonworks Revenue Model
HDP and HDF are 100% free and open source – no license. Our customers subscribe to support, consulting experts and training programs. Annual Subscriptions align your success with ours.
Expert Consulting & Training help your team get to actionable intelligence as efficiently as possible.
Diagram: the customer journey runs from Architect & Develop through Deploy and Operate to Expand, repeated across Projects 1 through 6.
Sales Plays
Hadoop Driver: Cost optimization
Archive data off the EDW: move rarely used data to Hadoop as an active archive and store more data longer.
Offload costly ETL processes: free your EDW to perform high-value functions like analytics & operations, not ETL.
Enrich the value of your EDW: use Hadoop to refine new data sources, such as web and machine data, for new analytical context.
Analytics: data marts, business analytics, visualization & dashboards.
HDP helps you reduce costs and optimize the value associated with your EDW
Diagram: sources (clickstream, web & social, geolocation, sensor & machine, server logs, unstructured data, and existing systems such as ERP, CRM and SCM) feed HDP 2.3, which handles ELT and holds cold data, a deeper archive and new sources across nodes 1 to N; the enterprise data warehouse keeps hot data alongside MPP and in-memory systems; analytics (data marts, business analytics, visualization & dashboards) sit on top.
Use-case patterns: Single View (improve acquisition and retention); Predictive Analytics (identify your next best action); Data Discovery (uncover new findings).
Financial Services: New Account Risk Screens; Trading Risk; Insurance Underwriting; Improved Customer Service; Aggregate Banking Data as a Service; Cross-sell & Upsell of Financial Products; Risk Analysis for Usage-Based Car Insurance; Identify Claims Errors for Reimbursement.
Telecom: Unified Household View of the Customer; Searchable Data for NPTB Recommendations; Protect Customer Data from Employee Misuse; Analyze Call Center Contact Records; Network Infrastructure Capacity Planning; Call Detail Record (CDR) Analysis; Inferred Demographics for Improved Targeting; Proactive Maintenance on Transmission Equipment; Tiered Service for High-Value Customers.
Retail: 360° View of the Customer; Supply Chain Optimization; Website Optimization for Path to Purchase; Localized, Personalized Promotions; A/B Testing for Online Advertisements; Data-Driven Pricing and Improved Loyalty Programs; Customer Segmentation; Personalized, Real-Time Offers; In-Store Shopper Behavior.
Manufacturing: Supply Chain and Logistics; Optimize Warehouse Inventory Levels; Product Insight from Electronic Usage Data; Assembly Line Quality Assurance; Proactive Equipment Maintenance; Crowdsource Quality Assurance; Single View of a Product Throughout Its Lifecycle; Connected Car Data for Ongoing Innovation; Improve Manufacturing Yields.
Healthcare: Electronic Medical Records; Monitor Patient Vitals in Real Time; Use Genomic Data in Medical Trials; Improving Lifelong Care for Epilepsy; Rapid Stroke Detection and Intervention; Monitor the Medical Supply Chain to Reduce Waste; Reduce Patient Re-Admittance Rates; Video Analysis for Surgical Decision Support; Healthcare Analytics as a Service.
Oil & Gas: Unify Exploration & Production Data; Monitor Rig Safety in Real Time; Geographic Exploration; DCA to Slow Well Decline Curves; Proactive Maintenance for Oil Field Equipment; Define Operational Set Points for Wells.
Government: Single View of Entity; CBM & Autonomic Logistics Analysis; Sentiment Analysis on Program Effectiveness; Prevent Fraud, Waste and Abuse; Proactive Maintenance for Public Infrastructure; Meet Deadlines for Government Reporting.
Hadoop Driver: Advanced analytic applications
NiFi and HDF Drivers
Optimize Splunk: Reduce costs by pre-filtering data so that only relevant content is forwarded into Splunk
Ingest Logs for Cyber Security: Integrated and secure log collection for real-time data analytics and threat detection
Feed Data to Streaming Analytics: Accelerate big data ROI by streaming data into analytics systems such as Apache Storm or Apache Spark Streaming
Move Data Internally: Optimize resource utilization by moving data between data centers or between on-premises infrastructure and cloud infrastructure
Capture IoT Data: Transport disparate and often remote IoT data in real time, despite any limitations in device footprint, power or connectivity, avoiding data loss
Hadoop Driver: Enabling the Data Lake (scale and scope)
Data Lake Definition
• Centralized architecture: multiple applications on a shared data set with consistent levels of service.
• Any app, any data: multiple applications accessing all data, affording new insights and opportunities.
• Unlocks 'systems of insight': advanced algorithms and applications used to derive new value and optimize existing value.
Drivers: 1. cost optimization, 2. advanced analytic apps
Goal: a centralized architecture and a data-driven business
DATA LAKE
Journey to the Data Lake with Hadoop
Systems of Insight
Case Study: 12-Month Hadoop Evolution at TrueCar (Data Platform Capabilities)
12-month execution plan:
June 2013: begin Hadoop execution
July 2013: Hortonworks partnership
Aug 2013: training & development begins
Nov 2013: production cluster, 60 nodes, 2 PB
Dec 2013: three production apps (3 total)
Jan 2014: 40% dev staff Perficient
Feb 2014: three more production apps (6 total)
May 2014: IPO
12-month results at TrueCar: six production Hadoop applications; sixty nodes / 2 PB of data; storage and compute costs down from $19/GB to $0.12/GB.
“We addressed our data platform capabilities strategically as a pre-cursor to IPO.”
Hortonworks Data Platform
Hadoop emerged as foundation of new data architecture
Apache Hadoop is an open source data platform for managing large volumes of high-velocity, high-variety data.
• Built by Yahoo! to be the heartbeat of its ad & search business
• Donated to the Apache Software Foundation in 2005, with rapid adoption by large web properties & early-adopter enterprises
• Incredibly disruptive to current platform economics
Traditional Hadoop advantages: manages the new data paradigm; handles data at scale; cost effective; open source.
Traditional Hadoop had limitations: batch-only architecture; single-purpose clusters and specific data sets; difficult to integrate with existing investments; not enterprise-grade.
Hadoop with MapReduce (2006–2009): applications ran against MapReduce (largely batch processing) on top of HDFS (Hadoop Distributed File System) storage, across nodes 1 to N. The result was siloed clusters, a largely batch-only system, and difficulty integrating with existing systems.
MR-279: YARN
Hadoop 2 & YARN-based architecture: YARN, the data operating system, runs over HDFS and supports batch, interactive and real-time workloads on a single cluster.
Architected & led development of YARN to enable the Modern Data Architecture
October 23, 2013
Apache Hadoop – Data Operating System
Shared compute & workload management: a common data platform for many applications; multi-tenant access & processing; batch, interactive & real-time use cases.
Common & shared scale-out storage: shared data assets, flexible schema, cross-workload access.
YARN: Data Operating System (cluster resource management) over HDFS (Hadoop Distributed File System), spanning nodes 1 to N.
Batch, interactive & real-time data access engines: Script (Pig) and SQL (Hive) on Tez; Java/Scala (Cascading) on Tez; Stream (Storm); Search (Solr); NoSQL (HBase, Accumulo) on Slider; In-Memory (Spark); plus other ISV engines.
Enterprise Hadoop
Core Capabilities of Enterprise Hadoop
Data Management: store and process all of your corporate data assets.
Data Access: access your data simultaneously in multiple ways (batch, interactive, real-time).
Governance & Integration: load data and manage it according to policy.
Operations: deploy and effectively manage the platform.
Security: provide a layered approach to security through authentication, authorization, accounting and data protection.
Presentation & Application: enable both existing and new applications to provide value to the organization.
Enterprise Management & Security: empower existing operations and security tools to manage Hadoop.
Deployment Options: provide deployment choice across physical, virtual and cloud.
Hortonworks Data Platform 2.3
YARN: Data Operating System over HDFS (Hadoop Distributed File System), spanning nodes 1 to N.
Data Access: Batch (MapReduce), Script (Pig), SQL (Hive), Search (Solr), NoSQL (HBase, Accumulo, Phoenix), Stream (Storm), In-Memory (Spark) and other ISV engines, running on Tez and Slider.
Security: administration, authentication, authorization, auditing and data protection via Ranger, Knox, Atlas and HDFS encryption.
Governance & Integration: data workflow with Sqoop, Flume, Kafka, NFS and WebHDFS; data lifecycle & governance with Falcon and Atlas.
Operations: provisioning, managing & monitoring with Ambari, Cloudbreak and ZooKeeper; scheduling with Oozie.
Deployment choice: Linux, Windows, on-premises or cloud.
Architectures
Basic EDW Cost Optimization Architecture
1. Fetch: Sqoop batch-extracts data from the EDW into raw HDFS, exposed through Hive external tables.
2. Transform: Hive transforms the raw data into processed tables.
3. Load: processed results are loaded back into the EDW for existing analytics.
4. Interactive: HiveServer supports reporting directly from BI tools.
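To make the flow concrete, here is a minimal, hypothetical sketch of these steps driven through HiveServer2's JDBC interface; the host name, user, table names and HDFS path are illustrative assumptions, and the Sqoop extract is presumed to have already landed the raw files.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class EdwOffloadSketch {
    public static void main(String[] args) throws Exception {
        // Assumed HiveServer2 host and paths; Sqoop has already written /data/raw/sales
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hiveserver:10000/default", "etl_user", "");
             Statement st = conn.createStatement()) {

            // External table over the raw files in HDFS (fetch step)
            st.execute("CREATE EXTERNAL TABLE IF NOT EXISTS sales_raw ("
                    + "order_id BIGINT, amount DOUBLE, order_date STRING) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
                    + "LOCATION '/data/raw/sales'");

            // Transform raw data into a processed, columnar table
            st.execute("DROP TABLE IF EXISTS sales_processed");
            st.execute("CREATE TABLE sales_processed STORED AS ORC AS "
                    + "SELECT order_date, SUM(amount) AS daily_total FROM sales_raw GROUP BY order_date");

            // Interactive query through HiveServer2, as a BI tool would issue it
            try (ResultSet rs = st.executeQuery(
                    "SELECT * FROM sales_processed ORDER BY order_date LIMIT 10")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
                }
            }
        }
    }
}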
More Than Cost Savings: Enrich With New Data
1. Fetch: Sqoop batch-extracts data from the EDW into raw HDFS, exposed through Hive external tables.
2. Stream: NiFi loads data from new sources into HDFS.
3. Transform: Hive transforms the raw data into processed tables.
4. Load: processed results are loaded back into the EDW for existing analytics.
5. Interactive: HiveServer supports reporting from BI tools.
6. New analytics run directly against the enriched data in Hadoop.
Streaming Solution Architecture
Real-time data feeds flow into Apache Kafka for high-speed ingest on the HDP 2.x data lake (YARN over HDFS). Storm handles real-time stream processing, HBase and Accumulo handle online data processing, Solr on Slider provides search, and Hive provides SQL with streaming ingest into HDFS.
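As a minimal illustration of how a real-time feed could be published into Kafka for this architecture (not from the original deck; the broker host, HDP's default port 6667 and the topic name are assumptions), a producer might look like:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TruckEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka1:6667");  // assumed broker host and port
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        Producer<String, String> producer = new KafkaProducer<String, String>(props);
        // Keying by truck id keeps each truck's events in one partition, preserving order
        String event = "{\"truckId\":\"truck-42\",\"speed\":87,\"ts\":" + System.currentTimeMillis() + "}";
        producer.send(new ProducerRecord<String, String>("truck-events", "truck-42", event));
        producer.close();
    }
}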
Key Tenets of Lambda Architecture
Batch layer: manages the master data set, an immutable, append-only set of raw data; cleanses, normalizes and pre-computes batch views; advanced statistical calculations.
Speed layer: real-time event stream processing; computes real-time views.
Serving layer: low-latency, ad-hoc query; reporting, BI & dashboards.
New data streams into both layers: the batch layer stores data and pre-computes views, the speed layer processes streams into incremental views, and queries merge the business views from both through the serving layer.
HDP and HDF
High Level Big Data IoT Architecture
IoT on HDP
Problem Statement
Reference Architecture & Sizing
Solution Design & Customer Case Studies
Implementation Plan
Project Cost & ROI
Ms. Brady knows that to get a handle on skyrocketing premiums, she will need to better understand what is causing the incidents and how to prevent them. Ms. Brady sets the goal of reducing incidents by 5% within 90 days.
Incidents involving maintenance vehicles have continued to increase under COO Brady's watch. (Chart: insurance premiums rising from 2012 through 2015, reaching $17.5M.)
Ms. Brady tasks her Business Analyst, Tam, with gathering the necessary data to understand the cause of incidents and reduce them.
Business Analyst Tam
Mega Corp has a problem
Given the current premium cost of $3,500 per truck on 5,000 trucks, a 10% reduction in incidents will move the company out of its current high-risk insurance category and save $1,000 per truck per year on premiums, or $5,000,000 annually.
Business Analyst Tam
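Spelling out the arithmetic behind Tam's estimate:

\[
5{,}000 \text{ trucks} \times \$3{,}500 = \$17{,}500{,}000 \text{ in annual premiums}, \qquad
5{,}000 \text{ trucks} \times \$1{,}000 = \$5{,}000{,}000 \text{ saved per year}
\]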
Tam considers four questions she must answer to better understand and mitigate incidents. They are:
1) Is there a correlation of driver training to incidents?
2) Is there a correlation of weather to incidents?
3) Is there a correlation between certain driving behaviors and incidents?
4) Is it possible to predict incidents before they occur?
Business Analyst Tam
Shift from Reactive to Proactive & Prescriptive:
From reaction to behavioral insight into human activity
From static resource planning to resource optimization
From break-then-fix to preventative maintenance
Initially, Tam's team is concerned that they may not be able to capture all the data necessary to answer the questions Tam has posed and help her mitigate incidents. They know that not all of the data is structured, some of it is created in real time and transmitted over the Internet, and some will have to be captured from external sources.
Vehicle Data
Route Data
Weather Data
Structured Driver Data
Semi-Structured Maintenance Data
Sue, Varun, Jeff
The Team Recognizes That the Current Data Architecture Limits Predictive Capabilities
1. Data silos: difficult to find predictive correlations
2. Data volumes: cannot store enough data to find patterns
3. New data sources: unable to capture and use new data for real-time analysis
Diagram: systems of record (RDBMS, ERP, CRM) feed an enterprise data warehouse with hot, MPP and in-memory tiers; new sources (clickstream, web & social, geolocation, sensor & machine, server logs, unstructured data) sit outside it; analytics (data marts, business analytics, visualization & dashboards) run on top.
The Team Leverages HDF & HDP to Expand the Capabilities of Their Existing Data Platform
Diagram: HDF and HDP are added alongside the existing enterprise data warehouse (hot, MPP and in-memory tiers), the systems of record (RDBMS, ERP, CRM) and the analytics layer (data marts, business analytics, visualization & dashboards).
The team engages their favorite SI and attends Hortonworks University training to get the project under way. Business Analyst Tam is joined by Sue, Varun and Jeff (developer, system admin and SME):
Business Analyst + HDP Data Analyst Training = HDP Data Analyst
Developer + Developer Training = HDP Developer
System Admin + HDP System Admin Training = HDP Sys Admin
SME + Data Science Training = HDP Data Scientist
IoT on HDP
Problem Statement
Reference Architecture & Sizing
Solution Design & Customer Case Studies
Implementation Plan
Project Cost & ROI
Solution Architecture
Truck sensors and weather data are collected, conducted and curated by HDF (bidirectional data flow) into a single HDP cluster with consistent security, governance & operations: distributed storage (HDFS), many workloads (YARN), stream processing & modeling (Kafka, Storm & Spark), real-time serving & searching (HBase), interactive query (Hive on Tez, SQL), and alerts & events feeding a real-time web app, with EDW integration via Sqoop.
The chosen solution provides Mega Corp with the foundation to capture all the required data, analyze correlations, and ultimately create a model that allows them to predict and mitigate incidents before they happen.
Tam and Varun build the application.
Tam, HDP Analyst; Varun, Developer
Ms. Brady is happy with the results. She is able to determine that a subset of drivers is responsible for the increased cost. But like most managers she is not happy for long. Now she wants to be able to predict future incidents.
Data Scientist Jeff points out that HDP has a tremendous statistical algorithm library for machine learning, and that he can use these libraries to predict which drivers are likely to have an event before the event occurs.
Jeff implements the predicted-violations logic using HDP machine learning and is able to predict events before they happen.
Ms. Brady is happy now that she can isolate where problems exist, identify causal events, and build models that help predict events before they occur.
< TODO: Show St. Louis Case Study >
http://hortonworks.com/blog/st-louis-buses-run-with-lhp-telematics-and-hortonworks/
IoT on HDP
Problem Statement
Reference Architecture & Sizing
Solution Design & Customer Case Studies
Implementation Plan
Project Cost & ROI
Big Data Functional Architecture: Key Tenets of Lambda Architecture
Batch layer: manages the master data set, an immutable, append-only set of raw data; cleanses, normalizes and pre-computes batch views; advanced statistical calculations.
Speed layer: real-time event stream processing; computes real-time views.
Serving layer: low-latency, ad-hoc query; reporting, BI & dashboards.
New data streams into both layers: the batch layer stores data and pre-computes views, the speed layer processes streams into incremental views, and queries merge the business views from both through the serving layer.
HDP and HDF
High Level Big Data IoT Architecture
Detailed Reference Architecture for IoT Applications
Source data: server logs, application logs, firewall logs, CRM/ERP, sensors.
High-speed ingest: HDF and Flume stream events in; Kafka buffers them and forwards to Storm or Spark Streaming.
Real-time: Storm/Spark Streaming sinks and bolts write to HDFS, enrich events against HBase/Phoenix real-time storage, apply Spark-ML machine learning models, and raise JMS alerts to a dashboard (Silk).
Batch: Pig transforms, Sqoop exchanges data with existing systems, Spark runs iterative ML, and results land in Hive on HDFS.
Interactive: HiveServer and Spark-Thrift serve reporting through BI tools and a UI framework.
Sample Ingest: NiFi
Apache Storm – Key Attributes
Open source, real-time event stream processing platform that provides fixed, continuous, low-latency processing for very high-frequency streaming data.
Highly scalable: horizontally scalable like Hadoop; e.g., a 10-node cluster can process 1M tuples per second.
Fault-tolerant: automatically reassigns tasks on failed nodes.
Guarantees processing: supports at-least-once and exactly-once processing semantics.
Language agnostic: processing logic can be defined in any language.
Apache project: brand, governance & a large, active community.
Storm – Basic Concepts
Spouts: generate streams.
Tuple: the most fundamental data structure; a named list of values that can be of any datatype.
Streams: groups of tuples.
Bolts: contain data processing, persistence and alerting logic; can also emit tuples for downstream bolts.
Tuple tree: the first spout tuple and all the tuples that were emitted by the bolts that processed it.
Topology: a group of spouts and bolts wired together into a workflow.
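To ground these terms, here is a small, self-contained topology sketch (not from the deck; the spout, bolt and the speed threshold are invented for illustration) using the Storm Java API of this era (backtype.storm packages):

import java.util.Map;
import java.util.Random;
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class TruckTopology {

    // Spout: generates a stream of (truckId, speed) tuples
    public static class TruckEventSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final Random rand = new Random();
        public void open(Map conf, TopologyContext ctx, SpoutOutputCollector collector) {
            this.collector = collector;
        }
        public void nextTuple() {
            collector.emit(new Values("truck-" + rand.nextInt(5000), rand.nextInt(120)));
        }
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("truckId", "speed"));
        }
    }

    // Bolt: processing logic; emits an alert tuple for downstream bolts when speed is excessive
    public static class SpeedAlertBolt extends BaseBasicBolt {
        public void execute(Tuple tuple, BasicOutputCollector out) {
            if (tuple.getIntegerByField("speed") > 80) {
                out.emit(new Values(tuple.getStringByField("truckId")));
            }
        }
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("truckId"));
        }
    }

    public static void main(String[] args) {
        // Topology: spout and bolt wired together into a workflow
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("truck-events", new TruckEventSpout(), 1);
        builder.setBolt("speed-alerts", new SpeedAlertBolt(), 2).shuffleGrouping("truck-events");
        new LocalCluster().submitTopology("truck-topology", new Config(), builder.createTopology());
    }
}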
Distributed Database With Apache HBase
100% open source; store and process petabytes of data; flexible schema; scale out on commodity servers; high performance, high availability; integrated with YARN; SQL and NoSQL interfaces.
Diagram: HBase RegionServers run on the YARN data operating system across nodes 1 to N, with HDFS as permanent data storage.
Dynamic schema; scales horizontally to petabytes of data; directly integrated with Hadoop (HDP).
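For illustration only (the table name "truck_events" and column family "d" are assumptions and must already exist in the cluster), a basic put and get through the HBase Java client might look like:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseEventStore {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath for the ZooKeeper quorum, etc.
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("truck_events"))) {

            // Row key: truck id plus reversed timestamp keeps a truck's newest events together
            String rowKey = "truck-42|" + (Long.MAX_VALUE - System.currentTimeMillis());
            Put put = new Put(Bytes.toBytes(rowKey));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("speed"), Bytes.toBytes(87));
            table.put(put);

            Result result = table.get(new Get(Bytes.toBytes(rowKey)));
            int speed = Bytes.toInt(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("speed")));
            System.out.println("stored speed = " + speed);
        }
    }
}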
Apache Phoenix – Relational Database Layer Over HBase
A SQL skin for HBase: provides a SQL interface for managing data in HBase; supports a large subset of the SQL:1999 mandatory feature set; create tables, insert and update data, and perform low-latency point lookups through JDBC; the Phoenix JDBC driver is easily embeddable in any app that supports JDBC.
Phoenix makes HBase better: oriented toward online/transactional apps; if HBase is a good fit for your app, Phoenix makes it even better; Phoenix gets you out of the "one table per query" model many other NoSQL stores force you into.
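A minimal sketch of what that JDBC usage can look like (the ZooKeeper quorum, table and columns are illustrative assumptions):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.Timestamp;

public class PhoenixSketch {
    public static void main(String[] args) throws Exception {
        // "zk1:2181" is an assumed ZooKeeper quorum; point this at your cluster
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk1:2181")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS truck_events ("
                        + "truck_id BIGINT NOT NULL, event_time TIMESTAMP NOT NULL, speed INTEGER, "
                        + "CONSTRAINT pk PRIMARY KEY (truck_id, event_time))");
            }
            try (PreparedStatement ps = conn.prepareStatement("UPSERT INTO truck_events VALUES (?, ?, ?)")) {
                ps.setLong(1, 42L);
                ps.setTimestamp(2, new Timestamp(System.currentTimeMillis()));
                ps.setInt(3, 87);
                ps.executeUpdate();
            }
            conn.commit(); // Phoenix batches mutations on the client until commit
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery("SELECT truck_id, MAX(speed) FROM truck_events GROUP BY truck_id")) {
                while (rs.next()) {
                    System.out.println(rs.getLong(1) + " -> " + rs.getInt(2));
                }
            }
        }
    }
}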
In-Memory With Spark
Components: Spark SQL, Spark Streaming, MLlib, GraphX
§ A data access engine for fast, large-scale data processing
§ Designed for iterative in-memory computations and interactive data mining
§ Provides expressive multi-language APIs for Scala, Java and Python
Spark ML for Machine Learning
Democratizes machine learning.
Unsupervised tasks: clustering (K-means); recommendation; collaborative filtering (alternating least squares); dimensionality reduction (PCA, SVD).
Supervised tasks: classification (Naïve Bayes, decision tree, random forest, gradient-boosted trees); regression (linear models: SVM, linear regression, logistic regression).
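As a hedged illustration of the MLlib API (the driver-behavior features and local master setting are invented for the example, not taken from the deck), K-means clustering from Java might look like:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

public class DriverClustering {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("driver-clustering").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Hypothetical per-driver features: [avg speed, hard-brake rate, hours driven]
        JavaRDD<Vector> features = sc.parallelize(Arrays.asList(
                Vectors.dense(62.0, 0.1, 7.5),
                Vectors.dense(75.0, 0.9, 10.2),
                Vectors.dense(58.0, 0.2, 6.9),
                Vectors.dense(80.0, 1.1, 11.0)));

        // Cluster drivers into 2 groups (e.g., lower-risk vs higher-risk behavior), 20 iterations
        KMeansModel model = KMeans.train(features.rdd(), 2, 20);

        for (Vector center : model.clusterCenters()) {
            System.out.println("cluster center: " + center);
        }
        System.out.println("cluster for new driver: " + model.predict(Vectors.dense(70.0, 0.8, 9.0)));
        sc.stop();
    }
}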
Apache Hive: SQL in Hadoop
• Created by a team at Facebook
• Provides a standard SQL interface to data stored in Hadoop; quickly analyze data in raw data files; proven at petabyte scale
• Compatible with all major BI tools such as Tableau, Excel, MicroStrategy, Business Objects, etc.
Diagram: SQL queries over sensor, mobile, weblog and operational data, alongside MPP systems.
Comparing SQL Options in HDP
Apache Hive. Strengths: most comprehensive SQL, scale, maturity. Use cases: ETL offload, reporting, large-scale aggregations. Unique capabilities: robust cost-based optimizer; mature ecosystem (BI, backup, security and replication).
Spark SQL. Strengths: in-memory, low latency. Use cases: exploratory analytics, dashboards. Unique capabilities: language-integrated query.
Apache Phoenix. Strengths: real-time read/write, transactions, high concurrency. Use cases: dashboards, systems of engagement, drill-down/drill-up. Unique capabilities: real-time read/write.
Comparing Streaming Options in HDP
Apache Storm: one-at-a-time processing; low latency; operates on tuple streams; at-least-once semantics (exactly-once with Trident); multiple language support.
Spark Streaming: micro-batch (minimum batch latency ~500 ms); higher throughput; operates on streams of tuple batches; exactly-once semantics; multiple language support.
Sizing
HDF Sizing & Best Practices: Sustained Throughput
For ~50 MB/s sustained and thousands of events per second: 1-2 nodes; 8+ cores per node (more is better); 6+ disks per node (SSD or spinning); 2 GB of memory per node; bonded 1 Gb NICs ideally.
For ~100 MB/s sustained and tens of thousands of events per second: 3-4 nodes; 8+ cores per node (more is better); 6+ disks per node (SSD or spinning); 2 GB of memory per node; bonded 1 Gb NICs ideally.
For ~200 MB/s sustained and hundreds of thousands of events per second: 5-7 nodes; 24+ cores per node (effective CPUs); 12+ disks per node (SSD or spinning); 4 GB of memory per node; bonded 10 Gb NICs.
For 400-500 MB/s sustained and hundreds of thousands of events per second: 7-10 nodes; 24+ cores per node (effective CPUs); 12+ disks per node (SSD or spinning); 6 GB of memory per node; bonded 10 Gb NICs.
Kafka – Sizing & Best Practices
Cluster sizing rule of thumb: 10 MB/s per node or 100,000 events/s per node; higher throughput for large batch sizes.
Configuration best practices:
§ Number of partitions = max(total producer throughput / throughput per partition, total consumer throughput / throughput per partition). Over-estimate the number of partitions per topic; the partition count cannot be increased later without breaking message-ordering guarantees.
§ Co-locate Kafka and Storm processes: Storm is CPU bound while Kafka is throughput bound. In high-throughput scenarios, separate Kafka and Storm onto independent nodes.
Storm – Sizing & Best Practices
Cluster sizing rule of thumb: 100,000 events per second per supervisor node. This is predicated on the work being performed by the bolt's execute method; mileage will vary by project; testing is critical.
Configuration best practices: 1 worker per machine per topology; 1 executor per CPU core; topology parallelism = number of machines × (number of cores per machine - 1); distribute total parallelism among spouts and bolts to maximize topology throughput.
HBase – Sizing & Best Practices
Cluster sizing rule of thumb: 10 MB/s per node of write throughput; 1-3 TB per node of compressed data (non-replicated), i.e., an HDFS volume of 6-12 TB. Sizing = max(required ingestion rate / write throughput per node, total data size / data per node).
Configuration best practices: RegionServer size ~10 GB; ~100-200 regions per RegionServer; pre-split tables. For IoT scenarios: consider using Hive to store raw data while using Phoenix to store aggregates; batch-insert data into Phoenix using MapReduce; tailor the batch interval to application SLAs.
Problem Statement Recap
Incidents involving maintenance vehicles have continued to increase under COO Brady's watch, and the Department of Transportation has contacted Mega Corporation. (Chart: insurance premiums rising from 2012 through 2015, reaching $17.5M.)
Ms. Brady knows that to get a handle on skyrocketing premiums, she will need to better understand what is causing the incidents and how to prevent them. She sets the goal of reducing incidents by 5% within 90 days, and tasks her Business Analyst, Tam, with gathering the necessary data to understand the cause of incidents and reduce them.
Given the current premium cost of $3,500 per truck on 5,000 trucks, a 10% reduction in incidents will move the company out of its current high-risk insurance category and save $1,000 per truck per year on premiums, or $5,000,000 annually.
Business Analyst Tam
Problem Statement Recap
Sizing – Cluster Storage Requirement
Required cluster capacity ≈ (raw data size × intermediate/materialized data factor × replication count × temp space factor) / compression ratio
Rule of thumb: replication count 3; temp space factor 1.2.
Varies greatly: intermediate/materialized data 30-50%; compression ratio 2-4.
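Applying the rule of thumb to an illustrative 200 TB of raw data (the 200 TB figure comes from the Mega Corp sizing that follows; the 40% intermediate factor and 3:1 compression are assumptions within the stated ranges):

\[
\frac{200\,\mathrm{TB} \times 1.4 \times 3 \times 1.2}{3} \approx 336\,\mathrm{TB}\ \text{of required cluster capacity}
\]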
Data Volume for Mega Corp
§ Number of trucks = 5,000; events per second per truck = 10; size of each event = 128 bytes
§ 1 year of raw sensor data: 5,000 × 10 × 128 × 60 × 60 × 24 × 365 ≈ 200 TB
§ 5 years of sensor data: 200 TB × 5 × 1.5 (processing overhead) = 1.5 PB
§ Q: How many nodes are needed to store 1.5 PB? (answered later)
HBase, Kafka, Storm and NiFi Requirements
Ingest rate = 128 bytes × 5,000 trucks × 10 events/s = 6.4 MB/s
Q: For a 6.4 MB/s ingest rate, how many NiFi, Kafka and Storm nodes are needed?
We will store the last 15 days of data in HBase.
HBase storage needed: 5,000 × 10 × 60 × 60 × 24 × 15 × 128 ≈ 8.2 TB
Q: How many HBase nodes are needed for 8.2 TB of storage?
Sizing – Number of Worker Nodes for Sensor Data
Number of worker nodes = total cluster storage / storage per server = 1.5 PB / 48 TB ≈ 32
Sizing – NiFi, Kafka, HBase and Storm Nodes
Node counts: 32 DataNode & HBase nodes, 2 NiFi nodes, 3 Kafka & Storm ingest nodes, 2 client nodes, 5 master nodes; 44 total.
§ Recall that: NiFi can collect at ~50 MB/s/node; Kafka can ingest at ~10 MB/s/node or 100,000 events/s/node; Storm can process ~100,000 events/s/node; each HBase RegionServer can store ~1 TB.
§ So for a 6.4 MB/s ingest rate, 1 NiFi, 1 Kafka and 1 Storm node would be sufficient; we will use 2 NiFi and 3 Kafka nodes for HA. HBase nodes needed = 8.2 TB / 1 TB ≈ 8. Co-locate Kafka and Storm; co-locate DataNode and HBase.
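The Mega Corp figures above can be reproduced with a small back-of-the-envelope calculation (a sketch, not from the deck; it assumes the 48 TB effective capacity per worker node and ~1 TB per RegionServer quoted in these slides):

public class MegaCorpSizing {
    public static void main(String[] args) {
        final long trucks = 5_000;
        final long eventsPerTruckPerSec = 10;
        final long eventSizeBytes = 128;

        long eventsPerSec = trucks * eventsPerTruckPerSec;                        // 50,000 events/s
        double ingestMBps = eventsPerSec * eventSizeBytes / 1e6;                  // ~6.4 MB/s

        double oneYearTB = eventsPerSec * eventSizeBytes * 86_400.0 * 365 / 1e12; // ~200 TB of raw sensor data
        double fiveYearPB = oneYearTB * 5 * 1.5 / 1_000;                          // x1.5 processing overhead -> ~1.5 PB

        long workerNodes = (long) Math.ceil(fiveYearPB * 1_000 / 48);             // 48 TB effective per worker -> ~32

        double hbase15DayTB = eventsPerSec * eventSizeBytes * 86_400.0 * 15 / 1e12; // ~8.3 TB kept in HBase
        long hbaseNodes = Math.round(hbase15DayTB);                               // ~1 TB per RegionServer -> ~8

        // NiFi ~50 MB/s/node, Kafka ~10 MB/s/node, Storm ~100,000 events/s/node,
        // so one of each would cope with 6.4 MB/s; the plan uses 2 NiFi and 3 Kafka/Storm nodes for HA.
        System.out.printf("ingest: %.1f MB/s (%d events/s)%n", ingestMBps, eventsPerSec);
        System.out.printf("raw data: %.0f TB/year, %.2f PB over 5 years%n", oneYearTB, fiveYearPB);
        System.out.printf("worker nodes: %d, HBase RegionServers: %d%n", workerNodes, hbaseNodes);
    }
}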
Cluster layout, Megacorp datacenter: trucks 1 through 5,000 send sensor data over HDF to 2 NiFi nodes at the edge; 3 ingest nodes co-locate Storm 1-3 and Kafka 1-3; 32 worker nodes run DataNodes 1-32, with HBase RegionServers 1-8 co-located on DataNodes 1-8; 5 master nodes and 2 client nodes complete the HDP cluster.
HDP Service Layout
Master Nodes 1-3: NameNodes 1 and 2, ResourceManagers 1 and 2, ZooKeeper (one per node), JournalNodes (one per node), Oozie, History Server, Timeline Server, HiveServer2, and HBase Masters 1 and 2.
Master Node 4: HiveServer, WebHCat, Falcon.
Master Node 5: ZooKeeper, History Server, Ambari, monitoring & metrics.
Ingest Nodes 1-3: Storm and Kafka.
Worker Nodes 1-32: NodeManager, DataNode and HBase RegionServer.
Edge Nodes 1-2: clients and Knox.
Master Node Specs
12+ cores; 128-256 GB RAM; 1 × 256 GB SSD drive for OS; 2 × 1 TB drives; 2 × 1-10 Gb network connections.
Approximate cost per node: $8,000 - $18,000
NiFi Node Specs
8+ cores; 16 GB RAM; 1 × 256 GB SSD drive for OS; 2 × 1 TB drives; 2 × 1-10 Gb network connections.
Approximate cost per node: $5,000 - $8,000
Slave (Worker) Node Specs
12+ cores; 32-64 GB RAM; 12 × 1 TB SATA drives (processing/IOPS optimized), 12 × 2 TB SATA drives (balanced), or 12 × 4 TB SATA drives (storage optimized); 1 × 1-10 Gb network connection.
Approximate cost per node: $5,000 - $12,000
IoT on HDP
Problem Statement
Reference Architecture & Sizing
Solution Design & Customer Case Studies
Implementation Plan
Project Cost & ROI
Project Plan
Tam puts together a quick project plan and estimates it will take 120 days to get Ms. Brady her solution:
Strategy (10 days): use case workshop
Training (10 days)
Design & Build (60 days): cluster build-out and solution build-out
Test (30 days): prove-out
Promote (10 days): promote the solution
Resource Plan
Data Scientist Consultant
Tam, Data Flow Consultant
Varun, Architect Consultant
Jeff, Developer Consultant
Sue
Jen, Project Manager
Jim, Engagement Manager Consultant
Frank, Enterprise Architect
Sue, Business Analyst
Jim, Developer
IoT on HDP
Problem Statement
Reference Architecture & Sizing
Solution Design & Customer Case Studies
Implementation Plan
Project Cost & ROI
Project Cost
Hardware: 44 nodes × $10,000 = $440K
Software, HDP: 11 SKUs × $18,000/SKU = $198K
Software, HDF: 2 SKUs × $36,000/SKU = $72K
Dev and test consulting: 3,040 hrs* × $300/hr = $912K
Engagement consulting: 360 hrs* × $300/hr = $108K
Training: 30** × $2,500 = $75K
Travel & expense: $100K
Total: $1.885M
* 4 resources × 8 hrs × 95 days; engagement manager for 45 days
** Admin, Analyst & Data Science training for 30 associates
Project ROI
§ Insurance cost reduction: $5M
§ Project cost: $1.885M
§ First year savings: ~$3.1M
Tweet: #hadooproadshow
Thank You