Hadoop on OpenStack - Trove Day 2014

Hadoop on OpenStack with Sahara

August 19, 2014

Matthew Farrellee (@spinningmatt)
Emerging Technology and Strategy
CTO Office, Red Hat

Hadoop is

8/19/14 tesora.com

• Narrow definition - Apache Hadoop - a specific Apache project originally from Yahoo!, based on papers from Google

• Broad definition - the ecosystem of projects, primarily within Apache, that integrate in some form with Apache Hadoop

• I’m going to use the broad definition

Hadoop often looks like

• Multiple, loosely coupled projects focused on data storage and processing

• Includes: workload, resource, system management; data ingest & storage; compute frameworks and domain languages

Hadoop is often used to

• Store data

• ETL data

• Analyze data
  • Structured and unstructured

Data today

• Structured or unstructured

• >2.5x more unstructured

• Rate of growth for unstructured is 2x structured

Data problems

• It’s not just that processing data is expensive
  • In hardware costs
  • In computational time
  • Most of all, in human time

• Data creation outpaces storage capacity

Data flows

[Diagram: two data flows drawn with the labels "Data", "Database" and "DB". Captions: "Many still look like this..." and "...but start to look like this..."]

The analysis itself is hard

• Data sources are hard to find, or create

• Data is always dirty and needs cleaning

• Clean data is always approximate

• Figuring out the right question to ask takes iterations

Sahara’s goal

Make managing data processing (e.g. Hadoop) infrastructure and tools so simple they just get out of your way

Sahara’s history

• Started at the Portland summit (April 2013)
  • Joint effort by Red Hat, Mirantis and Hortonworks
  • Originally called Savanna

• Incubated in Icehouse (released April 2014)
  • Supported Apache and Hortonworks Hadoop

• Integrated for Juno (release October 2014)

Sahara’s use cases

• Cluster
  • Start / stop / scale
  • Different shapes and sizes
  • Repeatable (template mechanism)

• Workload (Elastic Data Processing, a.k.a. EDP)
  • Job = Analysis code + Data URLs
  • Queued and run across clusters (ephemeral or persistent)

Sahara’s architecture

[Architecture diagram. Labels recovered from the figure: Horizon; Sahara Pages; Sahara Python Client; REST API; Auth; Sahara Service; Cluster Configuration Manager; Resources Orchestration Manager; Job Manager; Data Access Layer; Vendors Plugins; Data Sources; Job Sources; Hadoop VM (x4); Keystone; Swift; Heat; Nova; Glance; Cinder; Neutron; Trove DB]

Sahara’s vendor plugins

• It’s how users pick different software versions

• It’s how data processing frameworks are integrated
  • e.g. Vanilla (ref. impl. w/ Apache versions), HDP (via Ambari), Spark (based on Vanilla), CDH (spec approved), MapR (spec in review), IDH (being removed)

Sahara’s API

• Both REST and Python (of course)

• Accessible from CLI and Horizon

Sahara’s basic structures

• Plugins - controller for specific software collections

• Images - in Glance, w/ special plugin-specific tags

• Templates
  • Two kinds, node group and cluster
  • Combine node groups to form a cluster template

• Clusters - the live clusters
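
The relationship between node group templates, cluster templates and live clusters can be sketched in plain Python. This is a minimal illustration, not Sahara's API: the class names, field names and every example value (the "vanilla" plugin label, flavor names, process names) are assumptions picked for readability.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class NodeGroupTemplate:
    """Hypothetical model of a node group template: one kind of node."""
    name: str
    flavor: str            # assumed: VM size each node is booted with
    processes: List[str]   # assumed: services that run on each node


@dataclass
class ClusterTemplate:
    """Hypothetical model of a cluster template: node groups plus counts."""
    name: str
    plugin: str            # assumed: vendor plugin that provisions the cluster
    node_groups: List[Tuple[NodeGroupTemplate, int]] = field(default_factory=list)

    def add(self, group: NodeGroupTemplate, count: int) -> "ClusterTemplate":
        """Combine another node group into this template; returns self for chaining."""
        self.node_groups.append((group, count))
        return self

    def total_nodes(self) -> int:
        return sum(count for _, count in self.node_groups)

    def instantiate(self, cluster_name: str) -> List[str]:
        """Names of the instances a live cluster built from this template would hold."""
        return [
            f"{cluster_name}-{group.name}-{i}"
            for group, count in self.node_groups
            for i in range(count)
        ]


# Illustrative values only.
master = NodeGroupTemplate("master", "m1.large", ["namenode", "resourcemanager"])
worker = NodeGroupTemplate("worker", "m1.medium", ["datanode", "nodemanager"])
template = ClusterTemplate("small-hadoop", plugin="vanilla").add(master, 1).add(worker, 3)
names = template.instantiate("demo")
```

Because the template is separate from the live cluster, calling `instantiate` again with a different cluster name yields a second cluster of the same shape, which is the "repeatable" property the template mechanism is meant to provide.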

Sahara’s EDP structures

• Data sources
  • Input and output locations (Swift/HDFS/etc. URLs)

• Job binaries
  • Often JARs or scripts, stored in a data source

• Jobs
  • Templates for a job w/ parameters empty

• Job executions
  • Instances of templates w/ parameters filled
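
The split between a job (a template with empty parameters) and a job execution (the template with parameters filled) can be sketched as follows. This is a hedged illustration rather than Sahara's object model: all class names, fields, the `execute` helper and the example URLs are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DataSource:
    """Hypothetical: a named input or output location."""
    name: str
    url: str   # e.g. a Swift or HDFS URL


@dataclass
class JobBinary:
    """Hypothetical: a JAR or script the job runs."""
    name: str
    url: str


@dataclass
class Job:
    """A job template: the code is fixed, the parameters are still empty."""
    name: str
    binaries: List[JobBinary]
    parameter_names: List[str]

    def execute(self, cluster: str, input_source: DataSource,
                output_source: DataSource, **params: str) -> "JobExecution":
        """Fill in the parameters and bind the job to a cluster and data sources."""
        missing = [p for p in self.parameter_names if p not in params]
        if missing:
            raise ValueError(f"unfilled parameters: {missing}")
        return JobExecution(self, cluster, input_source, output_source, dict(params))


@dataclass
class JobExecution:
    """An instance of a job template with every parameter filled."""
    job: Job
    cluster: str
    input_source: DataSource
    output_source: DataSource
    parameters: Dict[str, str]


# Illustrative values only; the URLs are made up.
src = DataSource("in", "swift://demo-container/input")
dst = DataSource("out", "hdfs://demo-namenode/output")
wordcount = Job(
    "wordcount",
    [JobBinary("wordcount.jar", "swift://demo-container/wordcount.jar")],
    ["reducers"],
)
run = wordcount.execute("demo-cluster", src, dst, reducers="4")
```

The same `Job` can be executed many times against different clusters or data sources, which is what lets work be queued across ephemeral or persistent clusters.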

Juno roadmap

https://review.openstack.org/#/q/sahara-specs+AND+status:merged,n,z

https://blueprints.launchpad.net/sahara

• Highlights:
  • Dashboard merged into Horizon
  • Spark w/ EDP
  • CDH plugin
  • Storm plugin
  • Security group and Swift auth

Demo video: http://youtu.be/vmry_kXqn4c

Questions?
