Hadoop on OpenStack - Trove Day 2014
DESCRIPTION
Presentation from OpenStack Trove Day 2014 by Matthew Farrellee, Emerging Technology & Strategy, CTO Office at Red Hat
TRANSCRIPT
Hadoop on OpenStack with Sahara
August 19, 2014
Matthew Farrellee (@spinningmatt)
Emerging Technology and Strategy
CTO Office, Red Hat
Hadoop is
8/19/14 tesora.com
• Narrow definition - Apache Hadoop - a specific Apache project originally from Yahoo!, based on papers from Google
• Broad definition - the ecosystem of projects, primarily within Apache, that integrate in some form with Apache Hadoop
• I’m going to use the broad definition
Hadoop often looks like
• Multiple, loosely coupled projects focused on data storage and processing
• Includes: workload, resource, system management; data ingest & storage; compute frameworks and domain languages
Hadoop is often used to
• Store data
• ETL data
• Analyze data
  • Structured and unstructured
Data today
• Structured or unstructured
• Over 2.5x more unstructured data than structured
• Unstructured data is growing at 2x the rate of structured data
Data problems
• It’s not just that processing data is expensive
  • In hardware costs
  • In computational time
  • Most of all, in human time
• Data creation outpaces storage capacity
[Figure labels: Value, Value]
Data flows
[Figure: two data-flow diagrams captioned "Many still look like this..." and "...but start to look like this..."; labels: Database, Data, DB, Data]
The analysis itself is hard
• Data sources are hard to find, or create
• Data is always dirty and needs cleaning
• Clean data is always approximate
• Figuring out the right question to ask takes iterations
Sahara’s goal
Make managing data processing (e.g. Hadoop) infrastructure and tools so simple they just get out of your way
Sahara’s history
• Started at the Portland summit (April 2013)
  • Joint effort by Red Hat, Mirantis and Hortonworks
  • Originally called Savanna
• Incubated in Icehouse (released April 2014)
  • Supported Apache and Hortonworks Hadoop
• Integrated for Juno (release October 2014)
Sahara’s use cases
• Cluster
  • Start / stop / scale
  • Different shapes and sizes
  • Repeatable (template mechanism)
• Workload (Elastic Data Processing, a.k.a. EDP)
  • Job = Analysis code + Data URLs
  • Queued and run across clusters (ephemeral or persistent)
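The two use cases above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Sahara code: the Cluster, Job and run_job names, the Swift-style URLs and the worker counts are all made up for the example. It shows a job (analysis code plus data URLs) running either on an existing persistent cluster or on an ephemeral one built from a template and stopped afterwards.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Cluster:
    """Hypothetical stand-in for a running cluster (not a Sahara class)."""
    name: str
    workers: int
    running: bool = True
    log: List[str] = field(default_factory=list)

    def scale(self, workers: int) -> None:
        # "Scale" from the slide: change the number of worker nodes.
        self.workers = workers

    def stop(self) -> None:
        self.running = False


@dataclass
class Job:
    """Job = analysis code + data URLs, as on the slide."""
    code: str
    input_url: str
    output_url: str


def run_job(job: Job,
            make_cluster: Callable[[], Cluster],
            cluster: Optional[Cluster] = None) -> Cluster:
    """Run on the given persistent cluster, or on an ephemeral one."""
    ephemeral = cluster is None
    target = make_cluster() if ephemeral else cluster
    target.log.append(f"{job.code}: {job.input_url} -> {job.output_url}")
    if ephemeral:
        # An ephemeral cluster exists only for the lifetime of the job.
        target.stop()
    return target


def from_template() -> Cluster:
    """Plays the role of the repeatable template mechanism."""
    return Cluster(name="from-template", workers=3)


# Hypothetical job: the binary name and URLs are placeholders.
job = Job("wordcount.jar", "swift://example/in", "swift://example/out")

transient = run_job(job, from_template)          # ephemeral cluster
shared = Cluster(name="shared", workers=5)       # persistent cluster
run_job(job, from_template, cluster=shared)
shared.scale(8)
```

The persistent cluster keeps running and can be scaled after the job, while the ephemeral one is stopped as soon as its job has run.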
Sahara’s architecture
[Figure: Sahara architecture diagram. Labels: Sahara Service, REST API, Sahara Python Client, Horizon, Sahara Pages, Keystone, Auth, Cluster Configuration Manager, Resources Orchestration Manager, Job Manager, Job Sources, Data Sources, Data Access Layer, Vendors Plugins, Hadoop VM (x4), Swift, Heat, Nova, Glance, Cinder, Neutron, Trove DB]
Sahara’s vendor plugins
• It’s how users pick different software versions
• It’s how data processing frameworks are integrated
• e.g. Vanilla (ref. impl. w/ Apache versions), HDP (via Ambari), Spark (based on Vanilla), CDH (spec approved), MapR (spec in review), IDH (being removed)
Sahara’s API
• Both REST and Python (of course)
• Accessible from CLI and Horizon
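As a rough illustration of the REST side, the sketch below builds (but does not send) an authenticated GET request using only the Python standard library. Everything specific here is an assumption, not something stated in the slides: the host name, the port, the "/v1.1/{tenant_id}/{resource}" path layout, the resource name and the token value are placeholders, and sahara_request is a hypothetical helper. The only OpenStack convention relied on is passing a Keystone token in the X-Auth-Token header.

```python
import urllib.request


def sahara_request(endpoint: str, tenant_id: str, resource: str,
                   token: str) -> urllib.request.Request:
    """Build a GET request for a resource collection.

    The "/v1.1/<tenant>/<resource>" layout is an assumption made for
    this sketch; check the API reference of the deployed release.
    """
    url = f"{endpoint.rstrip('/')}/v1.1/{tenant_id}/{resource}"
    headers = {
        "X-Auth-Token": token,          # Keystone-issued token
        "Accept": "application/json",
    }
    return urllib.request.Request(url, headers=headers, method="GET")


# Placeholder endpoint, tenant and token; nothing is sent over the network.
req = sahara_request("http://sahara.example.com:8386/", "TENANT_ID",
                     "clusters", "TOKEN")
print(req.get_method(), req.full_url)
```

Sending the request would be a matter of passing req to urllib.request.urlopen against a real endpoint; the CLI and Horizon are described in the slides as other front ends to the same API.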
Sahara’s basic structures
• Plugins - controller for specific software collections
• Images - in Glance, w/ special plugin specific tags
• Templates
  • Two kinds, node group and cluster
  • Combine node groups to form a cluster template
• Clusters - the live clusters
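The relationship between these structures can be sketched with plain Python dataclasses. This is a toy model, not Sahara's data model: the class and field names, the flavor names, the image name and the launch function are hypothetical. It shows node group templates being combined, with counts, into a cluster template, which is then turned into a live cluster for a given plugin and image.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class NodeGroupTemplate:
    """One kind of node: a VM size plus the processes it runs."""
    name: str
    flavor: str                    # hypothetical flavor name
    processes: Tuple[str, ...]


@dataclass(frozen=True)
class ClusterTemplate:
    """Node groups combined, each with a count, under one plugin."""
    name: str
    plugin: str
    node_groups: Tuple[Tuple[NodeGroupTemplate, int], ...]

    def total_nodes(self) -> int:
        return sum(count for _, count in self.node_groups)


def launch(template: ClusterTemplate, cluster_name: str,
           image: str) -> Dict[str, object]:
    """Expand a template into a description of a live cluster."""
    instances: List[str] = [
        f"{cluster_name}-{group.name}-{index}"
        for group, count in template.node_groups
        for index in range(count)
    ]
    return {
        "name": cluster_name,
        "plugin": template.plugin,
        "image": image,
        "instances": instances,
    }


master = NodeGroupTemplate("master", "m1.medium", ("namenode",))
worker = NodeGroupTemplate("worker", "m1.small", ("datanode",))
small = ClusterTemplate("small", "vanilla", ((master, 1), (worker, 3)))

cluster = launch(small, "demo", image="hadoop-image")
```

Because the template is separate from the live cluster, the same shape can be launched repeatedly under different names.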
Sahara’s EDP structures
• Data sources
  • Input and output locations (Swift/HDFS/etc. URLs)
• Job binaries
  • Often JARs or scripts, stored in a data source
• Jobs
  • Templates for a job w/ parameters empty
• Job executions
  • Instances of templates w/ parameters filled
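The split between a job (a template with empty parameters) and a job execution (the template with parameters filled) can be modelled roughly as follows. This is an illustrative sketch only: the class names mirror the slide, but the fields, the execute method, the parameter name and the URLs are all hypothetical and not taken from Sahara.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class DataSource:
    """An input or output location, identified by URL."""
    name: str
    url: str


@dataclass(frozen=True)
class JobBinary:
    """A JAR or script, referenced by where it is stored."""
    name: str
    url: str


@dataclass(frozen=True)
class Job:
    """A job template: binaries plus parameter names without values."""
    name: str
    binaries: Tuple[JobBinary, ...]
    parameters: Tuple[str, ...]

    def execute(self, input_source: DataSource, output_source: DataSource,
                **values: str) -> "JobExecution":
        # Every declared parameter must be filled before running.
        missing = [p for p in self.parameters if p not in values]
        if missing:
            raise ValueError(f"unfilled parameters: {missing}")
        return JobExecution(self, input_source, output_source, dict(values))


@dataclass(frozen=True)
class JobExecution:
    """An instance of a job template with its parameters filled."""
    job: Job
    input_source: DataSource
    output_source: DataSource
    values: Dict[str, str]


# Placeholder names and URLs, for illustration only.
binary = JobBinary("wordcount.jar", "swift://example/wordcount.jar")
wordcount = Job("wordcount", (binary,), ("min_length",))
logs = DataSource("logs", "swift://example/logs")
counts = DataSource("counts", "hdfs://example/counts")

run = wordcount.execute(logs, counts, min_length="3")
```

One job can produce many executions, each with different data sources and parameter values, which is what makes the template reusable.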
Juno roadmap
https://review.openstack.org/#/q/sahara-specs+AND+status:merged,n,z
https://blueprints.launchpad.net/sahara
• Highlights
  • Dashboard merged into Horizon
  • Spark w/ EDP
  • CDH plugin
  • Storm plugin
  • Security group and Swift auth