webinar: increasing business agility with real-time processing with apache hadoop and spark

Increasing Business Agility with Real-time Processing using Apache Hadoop and Spark


Agenda

• Big Data and Real-time Processing

– Use cases

– Why Hadoop and Spark?

– What’s required?

• Successfully Designing an Elastic Compute Infrastructure

• Solutions Demo

– Hadoop and Spark, powered by Nebula and Scalr

Presenters

Huy Nguyen, Sr. Director, Product Marketing

Thomas Orozco, Product Manager

Evolution of Big Data and its Impact

• Businesses are pressed to operate in real-time for a competitive edge

• Mere minutes can make the difference between a brilliantly handled crisis and a full-blown social media disaster

• User-, machine-, or sensor-generated data must be processed in real-time

• Weekly reports, scheduled jobs, and batch reporting alone are no longer sufficient

• Data analyzed after the fact loses its competitive advantage

• Data is more relevant to the business if it’s “fresh data”

• Ability to act right now, as things are happening

Batch Processing and Real-time Processing: It’s all about ‘now’

Batch Processing

• Acting on “Data at Rest”

• Static Infrastructure

Real-time Processing

• Acting on “Data in Motion”

• Requires an Elastic Infrastructure

Uses for Real-time, Stream Processing

IT Management:

Log processing, analysis, and log-driven alerting, infrastructure fault protection, intelligence and surveillance, fraud detection, etc.

Brand Management and Customer Engagement:

Sentiment analysis, data mining on social media streams and user-generated content, algorithmic trading, geospatial location, etc.

Conversion Optimization:

Clickstream analysis and real-time targeted offer generation

Why use Hadoop + Spark for Real-Time Processing?

Plenty of alternatives exist:

• Mesos (+ Spark), Storm, Message Queue (+ custom processing tier)

Hadoop + Spark stack offers unique benefits:

• Familiar and high-level API (HDFS distributed storage abstraction, YARN scheduling… and rescheduling).

• Integrates naturally with traditional batch jobs (e.g. process log streams in real-time to flag high-priority events, and run traditional map-reduce jobs on them later on).
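The log-stream example above can be illustrated with a minimal sketch. It is plain Python rather than the Spark API, and the severity markers, function names, and sample log lines are all hypothetical. Incoming lines are handled in small micro-batches: high-priority lines are flagged as they arrive, while every line is also kept in an archive (standing in for HDFS) so that a traditional batch job can run over it later.

```python
# Hypothetical severity markers; a real deployment would define its own rules.
HIGH_PRIORITY_MARKERS = ("ERROR", "FATAL")


def micro_batches(lines, batch_size):
    """Group an incoming iterable of log lines into small batches."""
    batch = []
    for line in lines:
        batch.append(line)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly shorter, batch


def process_stream(lines, batch_size=2):
    """Flag high-priority lines per micro-batch and archive every line."""
    alerts = []   # acted on immediately ("data in motion")
    archive = []  # stands in for HDFS; read later by batch jobs ("data at rest")
    for batch in micro_batches(lines, batch_size):
        alerts.extend(
            line for line in batch
            if any(marker in line for marker in HIGH_PRIORITY_MARKERS)
        )
        archive.extend(batch)
    return alerts, archive


def count_by_level(archive):
    """A later batch pass over the archive: count lines per log level."""
    counts = {}
    for line in archive:
        level = line.split(" ", 1)[0]
        counts[level] = counts.get(level, 0) + 1
    return counts


sample = [
    "INFO service started",
    "ERROR disk nearly full",
    "INFO request served",
    "FATAL node unreachable",
    "INFO request served",
]
alerts, archive = process_stream(sample)
report = count_by_level(archive)
```

In this sketch the same data feeds both paths, which is the point of the integration benefit: the real-time pass only decides what needs attention now, and nothing is lost for the batch pass.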

What’s Required: The Move from Batch Processing to Real-time Processing

Hadoop YARN & Apache Spark: Builds processing workflows that parse, categorize, and score information in real-time

Hadoop evolved from being “MapReduce + HDFS” to “YARN + HDFS”

YARN is used to distribute tasks across a set of computing nodes, regardless of whether these tasks are batch, interactive, or real-time data access

Apache Spark is a cluster-computing platform that supports real-time, streaming workloads, backed by the robust HDFS storage engine

Big Data

Storage

Compute

Decouple the compute tier from the storage tier for real-time processing

• Dynamically scaling the storage tier would result in major inefficiencies or data loss

Processing Tier

• The processing tier (application and infrastructure) must be able to “auto scale” compute resources as the volume, velocity, and variety of big data increase

What’s Required: Decoupling the Compute/Storage Tier & Auto-scaling
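As a rough illustration of the auto-scaling requirement, the sketch below derives a target size for the compute tier from the incoming event rate. The function names, the per-node capacity figure, and the node limits are assumptions made for this example; they do not come from YARN, Scalr, or Nebula.

```python
import math


def desired_nodes(events_per_sec, capacity_per_node, min_nodes=1, max_nodes=10):
    """Return how many compute nodes the current event rate calls for.

    capacity_per_node is an assumed, measured throughput of a single node.
    The result is clamped so the tier never drops below min_nodes and never
    grows beyond max_nodes (what the shared pool can supply).
    """
    if capacity_per_node <= 0:
        raise ValueError("capacity_per_node must be positive")
    needed = math.ceil(events_per_sec / capacity_per_node)
    return max(min_nodes, min(max_nodes, needed))


def scaling_action(current_nodes, target_nodes):
    """Translate the gap between current and target size into an action."""
    if target_nodes > current_nodes:
        return ("scale_out", target_nodes - current_nodes)
    if target_nodes < current_nodes:
        return ("scale_in", current_nodes - target_nodes)
    return ("hold", 0)
```

For example, `desired_nodes(4500, 1000)` returns 5, and `scaling_action(3, 5)` returns `("scale_out", 2)`. Only the compute tier is resized here; the storage tier is deliberately left out, in line with the decoupling argument.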

Suggested Architecture for Real-time Big Data Processing

A Hadoop Compute Tier (YARN)

• One resource manager

• One history server

• Multiple node managers

B Hadoop Storage Tier (HDFS)

• One name node

• Multiple data nodes

C Client Nodes

• Dispatch real-time data processing jobs

D Intelligent Cloud Mgmt Platform from Scalr

• Orchestration and auto-scaling of applications

E Turnkey Private Cloud Infrastructure from Nebula

• Elastic, on-demand cloud computing infrastructure
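The Hadoop tiers in this architecture can be written down as data, which makes it explicit which roles are expected to scale. This is a sketch only: the `Role` type, the role names, and the counts used for the "multiple" roles are placeholders, and marking only the node managers as elastic is an assumption drawn from the decoupling argument, not a stated configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Role:
    tier: str      # "compute", "storage", or "client"
    name: str
    count: int     # placeholder counts for roles described as "multiple"
    elastic: bool  # True if the role may be resized at run time


TOPOLOGY = [
    Role("compute", "resource_manager", 1, False),
    Role("compute", "history_server", 1, False),
    Role("compute", "node_manager", 3, True),   # assumed elastic
    Role("storage", "name_node", 1, False),
    Role("storage", "data_node", 3, False),     # storage is not auto-scaled
    Role("client", "client_node", 1, False),
]


def elastic_roles(topology):
    """Names of the roles that are allowed to scale."""
    return [role.name for role in topology if role.elastic]


def storage_is_fixed(topology):
    """Check that no storage role is marked elastic."""
    return all(not role.elastic for role in topology if role.tier == "storage")
```

A check such as `storage_is_fixed(TOPOLOGY)` gives a simple guard against accidentally attaching an auto-scaling rule to the HDFS tier.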

INTRODUCTION TO NEBULA

Nebula Turnkey Private Cloud

Fastest path to OpenStack

Nebula productizes OpenStack in a highly cost-efficient, fast time-to-value, secure and scalable enterprise-class product

Cost-efficient: Software delivered using an appliance with off-the-shelf, industry-standard servers and storage – freedom of choice

Fast time-to-value: Curated OpenStack (rack integration or multi-rack integration), enabling customers/partners to spend their resources building applications, not building infrastructure

Open, Secure & Scalable: Identical clouds to deliver consistent and predictable performance, with open connectors for a turnkey ecosystem

Enterprise-class: Highly available, with connectors to existing enterprise workflows & architecture (identity, storage, networking) for zero disruption to IT

Nebula Turnkey Private Cloud

Workloads: DevOps / DevTest, Genome Sequencing, Big Data / Real-time, Media Rendering

Self-Service IT, Process Improvements, API / Integration

Cosmos Software: Compute, Storage, Network, Management & Orchestration, Identity/Security

Enterprise integrations: Active Directory, Identity, Storage, Networking

The Only Enterprise-ready,

Turnkey Solution for OpenStack Private Clouds

Traditional Infrastructure: Fixed Compute, Storage, Network

Private Cloud: Shared Resource Pool

• As real-time data feeds increase, the YARN tier can be provisioned to scale out across multiple servers

• As data feeds decrease, resources can be de-provisioned and returned to the shared pool

• Nebula enables resource pooling of compute, storage, and network services for scale-out readiness

Auto-scaling with Nebula and Scalr
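The shared-pool behavior described here can be modeled in a few lines. The `SharedPool` class, its method names, and the server counts are illustrative assumptions; this is not the Nebula or Scalr API, only a sketch of servers moving between a pool and the YARN tier as feed volume changes.

```python
class SharedPool:
    """A fixed pool of interchangeable servers shared by several workloads."""

    def __init__(self, total_servers):
        self.free = total_servers
        self.allocated = {}  # workload name -> servers currently held

    def provision(self, workload, n):
        """Give up to n free servers to a workload; return how many were granted."""
        granted = min(n, self.free)
        self.free -= granted
        self.allocated[workload] = self.allocated.get(workload, 0) + granted
        return granted

    def deprovision(self, workload, n):
        """Return up to n of a workload's servers to the pool."""
        held = self.allocated.get(workload, 0)
        released = min(n, held)
        self.allocated[workload] = held - released
        self.free += released
        return released


pool = SharedPool(total_servers=8)
pool.provision("yarn_tier", 3)    # baseline capacity
pool.provision("yarn_tier", 4)    # data feeds increase: scale out
pool.deprovision("yarn_tier", 5)  # data feeds decrease: return servers
```

After these three calls the YARN tier holds 2 servers and 6 are free for other workloads, which is the contrast with fixed, per-application infrastructure.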

INTRODUCTION TO SCALR

Scalr is used to:

• Orchestrate Resources: Provisioning, Templating, Auto-scaling

• Define and Enforce Policies: Lease Management, Network Policies

• Centrally Manage Clouds: Multi-Cloud, Cost Analytics, SSO, CMDB, and ITSM integrations


SOLUTIONS DEMO

Summary

Businesses need to operate in real-time to maintain a competitive edge

Emergent big data technology such as Hadoop YARN and Apache Spark can build processing workflows that parse, categorize, and score information in real-time

Data processing tiers (from application to infrastructure) must be able to auto-scale to accommodate the 3 Vs of Big Data

Nebula’s turnkey private cloud and Scalr’s intelligent Cloud Management Platform meet these demands by delivering an orchestrated infrastructure that can auto scale compute and storage resources on-demand to process data feeds in real-time

For more information: www.nebula.com or www.scalr.com

Benefits to Real-Time Processing

React to changing business conditions in real time

• Adapt and react quickly to data, market conditions, and events happening in the outside world

Faster time-to-market

• Development and deployment

Delivering the best user experience

• Personalized experience

THANK YOU
