Discover HDP 2.1: Apache Storm for Stream Data Processing in Hadoop

Upload: hortonworks

Post on 27-Aug-2014

Category:

Software


DESCRIPTION

For the first time, Hortonworks Data Platform ships with Apache Storm for processing stream data in Hadoop. In this presentation, Himanshu Bari, Hortonworks senior product manager, and Taylor Goetz, Hortonworks engineer and committer to Apache Storm, cover Storm and stream processing in HDP 2.1:

•  Key requirements of a streaming solution and common use cases

•  An overview of Apache Storm

•  Q & A

TRANSCRIPT

Page 1: Discover HDP2.1: Apache Storm for Stream Data Processing in Hadoop

Page 1 © Hortonworks Inc. 2014

Discover HDP 2.1 Apache Storm for Stream Data Processing in Hadoop

Hortonworks. We do Hadoop.

Page 2: Discover HDP2.1: Apache Storm for Stream Data Processing in Hadoop

Speakers

Justin Sears

Hortonworks Product Marketing Manager

Himanshu Bari

Hortonworks Senior Product Manager & PM for Apache Storm & Apache Falcon in Hortonworks Data Platform

Taylor Goetz

Hortonworks Engineer & Committer for Apache Storm, with deep expertise in master data management

Page 3: Discover HDP2.1: Apache Storm for Stream Data Processing in Hadoop

Agenda

•  Why Stream Processing?

•  Overview of Apache Storm

•  Q & A

Page 4: Discover HDP2.1: Apache Storm for Stream Data Processing in Hadoop

A Modern Data Architecture

[Diagram; recoverable labels listed below]

APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications

DATA SYSTEM

•  REPOSITORIES: RDBMS, EDW, MPP

•  ENTERPRISE HADOOP: Governance & Integration, Security, Operations, Data Access, Data Management

SOURCES: OLTP, ERP, CRM Systems; Documents, Emails; Web Logs, Click Streams; Social Networks; Machine Generated; Sensor Data; Geolocation Data

OPERATIONS TOOLS: Provision, Manage & Monitor

DEV & DATA TOOLS: Build & Test

Page 5: Discover HDP2.1: Apache Storm for Stream Data Processing in Hadoop

HDP 2.1: Enterprise Hadoop

HDP 2.1 Hortonworks Data Platform

[Diagram; recoverable labels listed below]

GOVERNANCE & INTEGRATION

•  Data Workflow, Lifecycle & Governance: Falcon, Sqoop, Flume, NFS, WebHDFS

DATA ACCESS

•  Batch: Map Reduce

•  Script: Pig

•  SQL: Hive/Tez, HCatalog

•  NoSQL: HBase, Accumulo

•  Stream: Storm

•  Search: Solr

•  Others: In-Memory Analytics, ISV engines

•  YARN: Data Operating System

DATA MANAGEMENT

•  HDFS (Hadoop Distributed File System), nodes 1 to N

SECURITY

•  Authentication, Authorization, Accounting, Data Protection

•  Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox

OPERATIONS

•  Provision, Manage & Monitor: Ambari, Zookeeper

•  Scheduling: Oozie

Page 6: Discover HDP2.1: Apache Storm for Stream Data Processing in Hadoop

HDP 2.1: Enterprise Hadoop

HDP 2.1 Hortonworks Data Platform

[Diagram: the HDP 2.1 stack from Page 5, repeated with the same components, including Stream: Storm under DATA ACCESS on YARN (Data Operating System)]

Page 7: Discover HDP2.1: Apache Storm for Stream Data Processing in Hadoop

Agenda

•  Why Stream Processing?

•  Storm Overview

•  Q & A

Page 8: Discover HDP2.1: Apache Storm for Stream Data Processing in Hadoop

Why Stream Processing in Hadoop?

What is the need?

–  Exponential rise in real-time data

–  Ability to process real-time data opens new business opportunities

Why now?

–  Economics of open source software & commodity hardware

–  YARN allows multiple computing paradigms to co-exist in the data lake

Stream processing has emerged as a key use case

[Diagram: HADOOP 2.x, Multi Use Data Platform (Batch, Interactive, Online, Streaming, …)]

•  MapReduce (batch), Tez (interactive), Apache Storm (streaming)

•  YARN (cluster resource management)

•  HDFS2 (redundant, reliable storage)

Page 9: Discover HDP2.1: Apache Storm for Stream Data Processing in Hadoop

Why Apache Storm?

Open source real-time event stream processing platform that provides fixed, continuous & low latency processing for very high frequency streaming data

Highly scalable

•  Horizontally scalable like Hadoop

•  E.g. a 10 node cluster can process 1M tuples per second per node

Fault-tolerant

•  Automatically reassigns tasks on failed nodes

Guarantees processing

•  Supports at-least-once & exactly-once processing semantics

Language agnostic

•  Processing logic can be defined in any language

Apache project

•  Brand, governance & a large active community

Page 10: Discover HDP2.1: Apache Storm for Stream Data Processing in Hadoop

Typical Stream Processing Flow

[Diagram; recoverable labels listed below]

•  Real-time data feeds

•  Stream processing solution

•  Persist data: relational or non-relational data store

•  Batch feeds

•  Batch processing

•  Update event models (pattern templates, KPIs & alerts)

•  Dashboards & Applications

Page 11: Discover HDP2.1: Apache Storm for Stream Data Processing in Hadoop

Who is Using Storm today?

Industries shown: AD-TECH, SOCIAL MEDIA, FINANCE, TELCO, HEALTHCARE, E-COMMERCE, AND MANY OTHERS…

Source: Storm-project.net

Page 12: Discover HDP2.1: Apache Storm for Stream Data Processing in Hadoop

Patterns Driving Most Streaming Use Cases

Monitor real-time data to…

Finance

•  Prevent: Securities fraud, Compliance violations

•  Optimize: Order routing, Pricing

Telco

•  Prevent: Security breaches, Network outages

•  Optimize: Bandwidth allocation, Customer service

Retail

•  Optimize: Offers, Pricing

Manufacturing

•  Prevent: Machine failures

•  Optimize: Supply chain

Transportation

•  Prevent: Driver & fleet issues

•  Optimize: Routes, Pricing

Web

•  Prevent: Application failures, Operational issues

•  Optimize: Site content

Data types: Sentiment, Clickstream, Machine/Sensor, Server Logs, Geo-location

Page 13: Discover HDP2.1: Apache Storm for Stream Data Processing in Hadoop

A Key Storm Benefit: Flexibility

Storm

•  Kafka or JMS: pump data into Storm; send notifications from Storm

•  HDFS: data lake

•  Any RDBMS: provide reference data for Storm topologies

•  In-memory caching platforms: temporary data storage

•  Any NoSQL database: real-time views for operational dashboards

•  Any search platform: search interface for analysts & dashboards

•  Any app development platform: simplify development of Storm topologies

Page 14: Discover HDP2.1: Apache Storm for Stream Data Processing in Hadoop

Agenda

•  Why Stream Processing?

•  Storm Overview

•  Q & A

Page 15: Discover HDP2.1: Apache Storm for Stream Data Processing in Hadoop

Storm Architecture

Nimbus (management server)

•  Similar to job tracker

•  Distributes code around cluster

•  Assigns tasks

•  Handles failures

Supervisor (worker nodes)

•  Similar to task tracker

•  Runs bolts and spouts as ‘tasks’

Zookeeper

•  Cluster co-ordination

•  Nimbus HA

•  Stores cluster metrics

•  Consumption related metadata for Trident topologies
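The division of labor on this slide (Nimbus assigns tasks, supervisors run them, and tasks are reassigned when a node fails) can be illustrated with a toy model. This is a plain-Python sketch, not Storm code: `assign`, `reassign_after_failure`, the node names, and the round-robin policy are all assumptions made for illustration and do not describe Storm's actual scheduler.

```python
def assign(tasks, supervisors):
    """Spread tasks over supervisors round-robin (toy policy, not Storm's scheduler)."""
    if not supervisors:
        raise ValueError("no live supervisors")
    return {task: supervisors[i % len(supervisors)] for i, task in enumerate(tasks)}


def reassign_after_failure(assignment, failed, supervisors):
    """Move only the failed node's tasks onto the remaining supervisors."""
    live = [s for s in supervisors if s != failed]
    if not live:
        raise ValueError("no live supervisors")
    orphaned = [t for t, s in assignment.items() if s == failed]
    updated = dict(assignment)          # tasks on healthy nodes stay where they are
    updated.update(assign(orphaned, live))
    return updated


# Hypothetical task and node names.
tasks = ["spout-1", "bolt-1", "bolt-2", "bolt-3"]
supervisors = ["node-a", "node-b"]
before = assign(tasks, supervisors)
after = reassign_after_failure(before, "node-b", supervisors)
```

In this model the assignment map plays the role of the cluster state that the slide attributes to Zookeeper: the management side writes it, and the worker side reads it to learn which tasks to run.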

Page 16: Discover HDP2.1: Apache Storm for Stream Data Processing in Hadoop

Basic Storm Concepts

Tuple: The most fundamental data structure; a named list of values that can be of any datatype

Streams: Groups of tuples

Spouts: Generate streams

Bolts: Contain data processing, persistence and alerting logic. Can also emit tuples for downstream bolts

Tuple Tree: The first tuple and all the tuples that were emitted by the bolts that processed it

Topology: Group of spouts and bolts wired together into a workflow
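The tuple tree matters because a spout tuple counts as fully processed only once every tuple in its tree has been acknowledged. Storm's acker tracks this by XOR-ing tuple ids into a single value that returns to zero when every emitted tuple has also been acked. The sketch below illustrates that idea in plain Python; it is not Storm's API, and `TupleTreeTracker`, `emit` and `ack` are hypothetical names.

```python
import random


class TupleTreeTracker:
    """Tracks one tuple tree with a single integer (hypothetical helper)."""

    def __init__(self):
        self.ack_val = 0        # XOR of every id emitted or acked so far
        self.started = False

    def emit(self):
        """Register a newly emitted tuple and return its id."""
        tuple_id = random.getrandbits(64) or 1   # random non-zero 64-bit id
        self.ack_val ^= tuple_id
        self.started = True
        return tuple_id

    def ack(self, tuple_id):
        """Mark a tuple as processed; XOR-ing the same id again cancels it out."""
        self.ack_val ^= tuple_id

    def fully_processed(self):
        return self.started and self.ack_val == 0


tracker = TupleTreeTracker()
root = tracker.emit()                  # first tuple, emitted by the spout
child_a = tracker.emit()               # emitted by a bolt while processing root
child_b = tracker.emit()
tracker.ack(root)
tracker.ack(child_a)
pending = tracker.fully_processed()    # child_b is still outstanding
tracker.ack(child_b)
done = tracker.fully_processed()
```

The appeal of this design is constant memory per tree: the tracker never stores the tree itself, only one integer, at the cost of a very small chance that random ids cancel out early.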

Page 17: Discover HDP2.1: Apache Storm for Stream Data Processing in Hadoop

Storm Topology
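The slide's diagram did not survive in this transcript. As a stand-in, here is a minimal single-process sketch of how a spout and bolts are wired into a topology. It is plain Python rather than the Storm API, it only models a linear chain (real topologies can branch and run in parallel across workers), and every name in it (`sentence_spout`, `split_bolt`, `CountBolt`, `run_topology`) is hypothetical.

```python
from collections import Counter


def sentence_spout():
    """Spout: source of a stream; yields tuples (here, dicts of named values)."""
    for sentence in ["storm processes streams", "storm runs on hadoop"]:
        yield {"sentence": sentence}


def split_bolt(tup):
    """Bolt: processes one tuple and emits new tuples for downstream bolts."""
    for word in tup["sentence"].split():
        yield {"word": word}


class CountBolt:
    """Bolt with state: keeps a running count per word."""

    def __init__(self):
        self.counts = Counter()

    def __call__(self, tup):
        self.counts[tup["word"]] += 1
        yield {"word": tup["word"], "count": self.counts[tup["word"]]}


def apply_bolt(bolt, stream):
    """Feed every tuple of a stream through one bolt."""
    for tup in stream:
        yield from bolt(tup)


def run_topology(spout, bolts):
    """Topology: a spout and bolts wired into a linear workflow."""
    stream = spout()
    for bolt in bolts:
        stream = apply_bolt(bolt, stream)
    return list(stream)


counter = CountBolt()
emitted = run_topology(sentence_spout, [split_bolt, counter])
```

Each sentence tuple, together with the word and count tuples derived from it, forms one tuple tree in the sense defined on the previous slide.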

Page 18: Discover HDP2.1: Apache Storm for Stream Data Processing in Hadoop

What is Trident?

Provides exactly once processing semantics in Storm using real-time batch processing

Core concept: process a group of tuples as a ‘batch’ rather than one tuple at a time, as core Storm does

Provides a ‘higher level abstraction’ for Storm operations, much as Cascading does for MapReduce

All Trident topologies are automatically converted into core Storm concepts (Spouts & Bolts)
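The batching idea can be sketched in a few lines: consecutive tuples from a stream are grouped into small batches, each tagged with an increasing transaction id. This is a plain-Python illustration, not Trident code; `to_batches`, the fixed `batch_size`, and the txid numbering starting at 1 are assumptions.

```python
from itertools import islice


def to_batches(stream, batch_size):
    """Group a stream of tuples into (txid, batch) pairs with increasing txids."""
    iterator = iter(stream)
    txid = 1
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return              # stream exhausted
        yield txid, batch
        txid += 1


tuples = [{"word": w} for w in ["a", "b", "a", "c", "a"]]
batches = list(to_batches(tuples, batch_size=2))
```

A real stream is unbounded, so a batcher like this would keep producing batches for as long as tuples arrive; the finite list here only keeps the example short.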

Page 19: Discover HDP2.1: Apache Storm for Stream Data Processing in Hadoop

Key Trident Concepts

Spouts and Tuples

•  Remain the same as core Storm topologies

Transactions

•  Way of tagging tuples together so they can be processed with exactly-once semantics

Batches

•  All tuples tied to the same transactionID form a batch

Partitions

•  Segments of a batch that are guaranteed to process their tuples in order

•  Multiple partitions in a given batch can/will be processed in parallel

Streams

•  A series of batches forms a stream (just as a series of tuples forms a stream in core Storm)

Operations

•  The higher level abstractions for processing tuples are called ‘operations’

•  Multiple inbuilt operations available for joins, grouping, aggregations & filtering
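One way transaction ids yield exactly-once results is to store the id of the last applied batch next to the state it updated, and to skip a batch whose id is already stored. The sketch below shows that idea in plain Python. It is not the Trident API; `TransactionalCount` and `apply_batch` are hypothetical, and the sketch assumes that a replayed batch carries the same txid and the same tuples, and that batches are applied in txid order.

```python
class TransactionalCount:
    """Stores the last applied transaction id next to the value (hypothetical helper)."""

    def __init__(self):
        self.count = 0
        self.last_txid = None

    def apply_batch(self, txid, batch):
        """Apply a batch once; return False if it was already applied."""
        if txid == self.last_txid:
            return False        # replayed batch: its effect is already in the state
        self.count += len(batch)
        self.last_txid = txid
        return True


state = TransactionalCount()
first = state.apply_batch(1, ["t1", "t2", "t3"])
replay = state.apply_batch(1, ["t1", "t2", "t3"])   # same batch replayed after a failure
second = state.apply_batch(2, ["t4"])
```

The tuples may be processed more than once, but the state reflects each batch exactly once, which is the sense in which the slide's 'exactly-once semantics' applies.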

Page 20: Discover HDP2.1: Apache Storm for Stream Data Processing in Hadoop

Apache Storm and Apache Ambari

Apache Ambari is now integrated with Apache Storm

•  Install Storm with Ambari

•  Monitor Storm services with Ambari

Page 21: Discover HDP2.1: Apache Storm for Stream Data Processing in Hadoop

Agenda

•  Why Stream Processing?

•  Storm Overview

•  Q & A

Page 22: Discover HDP2.1: Apache Storm for Stream Data Processing in Hadoop

Learn More About Stream Processing in Hadoop

Hortonworks.com/labs/storm/

Register for the final Discover HDP 2.1 Webinar: Hortonworks.com/webinars

Final Webinar: Using Apache Ambari to Manage Hadoop Clusters

Thursday, June 26, 10am Pacific

Page 23: Discover HDP2.1: Apache Storm for Stream Data Processing in Hadoop

Thank you!