Hadoop on OpenStack - Trove Day 2014

Hadoop on OpenStack with Sahara August 19, 2014 Matthew Farrellee (@spinningmatt) Emerging Technology and Strategy CTO Office, Red Hat

Upload: tesora

Post on 01-Jul-2015


DESCRIPTION

Presentation from OpenStack Trove Day 2014 by Matthew Farrellee, Emerging Technology & Strategy, CTO Office at Red Hat

TRANSCRIPT

Page 1: Hadoop on OpenStack - Trove Day 2014

Hadoop on OpenStack with Sahara

August 19, 2014

Matthew Farrellee (@spinningmatt)
Emerging Technology and Strategy
CTO Office, Red Hat

Page 2: Hadoop on OpenStack - Trove Day 2014

Hadoop is

8/19/14 tesora.com

• Narrow definition - Apache Hadoop - a specific Apache project originally from Yahoo!, based on papers from Google

• Broad definition - the ecosystem of projects, primarily within Apache, that integrate in some form with Apache Hadoop

• I’m going to use the broad definition

Page 3: Hadoop on OpenStack - Trove Day 2014

Hadoop often looks like


• Multiple, loosely coupled projects focused on data storage and processing

• Includes: workload, resource, system management; data ingest & storage; compute frameworks and domain languages

Page 4: Hadoop on OpenStack - Trove Day 2014

Hadoop is often used to


• Store data

• ETL data

• Analyze data
  • Structured and unstructured

Page 5: Hadoop on OpenStack - Trove Day 2014

Data today


• Structured or unstructured

• >2.5x more unstructured

• Rate of growth for unstructured is 2x structured

Page 6: Hadoop on OpenStack - Trove Day 2014

Data problems


• It’s not just that processing data is expensive
  • In hardware costs
  • In computational time
  • Most of all, in human time

• Data creation outpaces storage capacity

Page 7: Hadoop on OpenStack - Trove Day 2014

Data flows

[Diagram: two data-flow figures with the labels "Data", "Database" / "DB" and "Value", captioned "Many still look like this..." and "...but start to look like this..."]

Page 8: Hadoop on OpenStack - Trove Day 2014

The analysis itself is hard


• Data sources are hard to find, or create

• Data is always dirty and needs cleaning

• Clean data is always approximate

• Figuring out the right question to ask takes iterations

Page 9: Hadoop on OpenStack - Trove Day 2014

Sahara’s goal


Make managing data processing (e.g. Hadoop) infrastructure and tools so simple they just get out of your way

Page 10: Hadoop on OpenStack - Trove Day 2014

Sahara’s history


• Started at the Portland summit (April 2013)
  • Joint effort by Red Hat, Mirantis and Hortonworks
  • Originally called Savanna

• Incubated in Icehouse (released April 2014)
  • Supported Apache and Hortonworks Hadoop

• Integrated for Juno (release October 2014)

Page 11: Hadoop on OpenStack - Trove Day 2014

Sahara’s use cases


• Cluster
  • Start / stop / scale
  • Different shapes and sizes
  • Repeatable (template mechanism)

• Workload (Elastic Data Processing, a.k.a. EDP)
  • Job = Analysis code + Data URLs
  • Queued and run across clusters (ephemeral or persistent)
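
The workload use case above can be pictured with a small, self-contained Python sketch. It is not Sahara's API: the `Job`, `Cluster` and `drain` names and the Swift-style URLs are all hypothetical. It only illustrates the idea of a job as code plus data URLs, queued and run on either an ephemeral or a persistent cluster.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Job:
    """Job = analysis code + data URLs (hypothetical model, not Sahara's)."""
    code_url: str
    input_url: str
    output_url: str


@dataclass
class Cluster:
    """Toy cluster that only records which outputs it has produced."""
    name: str
    persistent: bool
    running: bool = False
    completed: list = field(default_factory=list)

    def run(self, job: Job) -> None:
        # Start the cluster on demand.
        if not self.running:
            self.running = True
        self.completed.append(job.output_url)
        # An ephemeral cluster is stopped again after each job;
        # a persistent one is left running for the next job.
        if not self.persistent:
            self.running = False


def drain(queue: deque, cluster: Cluster) -> None:
    """Run queued jobs in FIFO order on the given cluster."""
    while queue:
        cluster.run(queue.popleft())


# Illustrative URLs only; they do not point at real objects.
queue = deque([
    Job("swift://jobs/wordcount.jar", "swift://data/in", "swift://data/out-1"),
    Job("swift://jobs/wordcount.jar", "swift://data/in", "swift://data/out-2"),
])
ephemeral = Cluster("scratch", persistent=False)
drain(queue, ephemeral)
```

After `drain` returns, the queue is empty and the ephemeral cluster is stopped; passing `persistent=True` instead would leave it running between jobs.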

Page 12: Hadoop on OpenStack - Trove Day 2014

Sahara’s architecture


[Architecture diagram. Labeled components: Horizon, Sahara Pages, Sahara Python Client, REST API, Keystone, Auth, Data Access Layer, Cluster Configuration Manager, Resources Orchestration Manager, Vendors Plugins, Job Manager, Job Sources, Data Sources, Swift, Hadoop VMs, Heat, Nova, Glance, Cinder, Neutron, Trove DB, Sahara Service]

Page 13: Hadoop on OpenStack - Trove Day 2014

Sahara’s vendor plugins


• It’s how users pick different software versions

• It’s how data processing frameworks are integrated
  • e.g. Vanilla (ref. impl. w/ Apache versions), HDP (via Ambari), Spark (based on Vanilla), CDH (spec approved), MapR (spec in review), IDH (being removed)

Page 14: Hadoop on OpenStack - Trove Day 2014

Sahara’s API


• Both REST and Python (of course)

• Accessible from CLI and Horizon

Page 15: Hadoop on OpenStack - Trove Day 2014

Sahara’s basic structures


• Plugins - controller for specific software collections

• Images - in Glance, w/ special plugin-specific tags

• Templates
  • Two kinds, node group and cluster
  • Combine node groups to form a cluster template

• Clusters - the live clusters
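
One way to read the template structures above is as plain data: node group templates describe one kind of node, and a cluster template combines them with counts. The following Python sketch assumes hypothetical field names, flavors and process names; it is not Sahara's schema, only an illustration of how the two kinds of template compose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeGroupTemplate:
    """One kind of node: a flavor plus the processes it runs (hypothetical fields)."""
    name: str
    flavor: str
    processes: tuple


@dataclass(frozen=True)
class ClusterTemplate:
    """A cluster shape: (node group template, instance count) pairs for one plugin."""
    name: str
    plugin: str
    node_groups: tuple

    def total_instances(self) -> int:
        # Sum the instance counts across all node groups.
        return sum(count for _, count in self.node_groups)


# Illustrative values; flavor and process names are assumptions.
master = NodeGroupTemplate("master", "m1.large", ("namenode", "resourcemanager"))
worker = NodeGroupTemplate("worker", "m1.medium", ("datanode", "nodemanager"))

# The same node groups are reused to describe clusters of different sizes.
small = ClusterTemplate("small-hadoop", "vanilla", ((master, 1), (worker, 3)))
large = ClusterTemplate("large-hadoop", "vanilla", ((master, 1), (worker, 10)))
```

Reusing `master` and `worker` in both cluster templates mirrors the "repeatable" and "different shapes and sizes" points from the use-case slide.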

Page 16: Hadoop on OpenStack - Trove Day 2014

Sahara’s EDP structures


• Data sources
  • Input and output locations (Swift/HDFS/etc. URLs)

• Job binaries
  • Often JARs or scripts, stored in a data source

• Jobs
  • Templates for a job w/ parameters empty

• Job executions
  • Instances of templates w/ parameters filled
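
The distinction between a job (a template with empty parameters) and a job execution (an instance with parameters filled) can be sketched as follows. All class names, fields and URLs are hypothetical and chosen for illustration; the real Sahara objects carry more state than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    """An input or output location, identified by URL."""
    name: str
    url: str


@dataclass(frozen=True)
class JobBinary:
    """The analysis code, e.g. a JAR or script, identified by URL."""
    name: str
    url: str


@dataclass(frozen=True)
class JobTemplate:
    """A job with its parameters still empty."""
    name: str
    binary: JobBinary
    parameter_names: tuple

    def execute(self, input_source, output_source, **params) -> "JobExecution":
        # Refuse to create an execution while any parameter is unfilled.
        missing = set(self.parameter_names) - set(params)
        if missing:
            raise ValueError(f"unfilled parameters: {sorted(missing)}")
        return JobExecution(self, input_source, output_source, dict(params))


@dataclass
class JobExecution:
    """An instance of a job template with parameters filled in."""
    template: JobTemplate
    input_source: DataSource
    output_source: DataSource
    parameters: dict


# Illustrative values only.
src = DataSource("logs", "swift://data/logs")
dst = DataSource("counts", "swift://data/counts")
wordcount = JobTemplate(
    "wordcount", JobBinary("wc", "swift://jobs/wordcount.jar"), ("reducers",)
)
run = wordcount.execute(src, dst, reducers=4)
```

The same `wordcount` template can back many executions, each with its own data sources and parameter values.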

Page 17: Hadoop on OpenStack - Trove Day 2014

Juno roadmap


https://review.openstack.org/#/q/sahara-specs+AND+status:merged,n,z
https://blueprints.launchpad.net/sahara

• Highlights -
  • Dashboard merged into Horizon
  • Spark w/ EDP
  • CDH plugin
  • Storm plugin
  • Security group and Swift auth

Page 18: Hadoop on OpenStack - Trove Day 2014


Demo video: http://youtu.be/vmry_kXqn4c

Questions?