Hadoop in Production
Charles Zedlewski, VP, Product
Cloudera In One Slide
Hadoop…
Jeff Hammerbacher
Amr Awadallah
Doug Cutting
… meets enterprise
Mike Olson - CEO
Omer Trajman – Customer Solutions
John Kreisa – Marketing
Charles Zedlewski – Product
Ed Albanese – Business Development
Investors: Greylock Partners, Accel Partners, Meritech Capital Partners, In-Q-Tel
Product category: Data management
Business model: Cloudera offers Software, Support, Training, and Professional Services
Employees: 70+
Customers: 75+
Headquarters: Palo Alto, California
Elevator pitch: The leading provider of Apache Hadoop-based software and services for the enterprise
Vision: We enable organizations to profit from all of their data
Copyright 2011 Cloudera, Inc. All rights reserved.
Cloudera’s Distribution for Hadoop, Version 3
• Open – 100% Apache licensed
• Predictable – Regular releases & updates you can plan against
• Adopted – Widely used in web, retail, financial, media, telecom & federal. Deployed at 10 to 1,000+ node clusters, integrated with various enterprise technologies
• Supported – Employs project founders and committers for >90% of packages
[Diagram: CDH3 components, including Hue, Hue SDK, Oozie, HBase, Flume, Sqoop, Zookeeper, Hive and Pig]
The Industry’s Leading Hadoop Distribution
Outline
• Know what you want
• The organization will evolve with Hadoop
• Some basics
• Avoid the web of scripts
• Multitenancy
• Wrap up
First principles – what do you really want?
• What are the business outcomes Hadoop is supposed to deliver?
• New insights
• Lower business costs
• Lower IT costs
• More data under management
• More revenue through better targeting, conversion
• ?
What are the operational outcomes you need to optimize for to get what you really want?
• Performance
• Utilization
• Cost of operations
• Availability
• Quality of service
• Flexibility / elasticity
• Security
• Transparency
• ?
How the organization evolves – the hunter-gatherers
Everyone does the same job
Little distinction between job code, Hadoop code & configuration
Still amazed by this fire thing
Life is nasty, brutish and short
How the organization evolves – the tribe
We have a chief that looks out for the tribe
Job code distinct from the rest of Hadoop
Make sure there’s enough fire for everyone
Survival of the tribe is still the main concern
How the organization evolves – the city
Patching, updating, upgrading, configuring and tuning are all distinct
Pollution, congestion, etc. are a concern
Fire is not a big deal any more.
More specialized roles
Some basics – use the right gear
• Hadoop needs to know where its hard drives are
• Running on a virtualized layer is a bad idea
• RAID is a bad idea
• Running on remote storage is the worst idea
• Servers – prioritize flexibility over bells and whistles
• How easily will you be able to expand your cluster?
• How easily can you evolve your core / spindle ratio?
• How many companies support that exotic chip, card, drive, power supply, etc.?
• Network – prioritize quality over bells & whistles
• 10G on the backplane is usually unnecessary
• Amazingly high failure rates for some gear
• Plan how to adapt your topology as your cluster grows
Some basics – system should be available and never lose data
• Name node availability
• HA is tricky
• Manual recovery can be simple, fast, effective
• Backup Strategy
• Name node metadata – hourly, ~2 day retention
• User data
• Log shipping style strategies
• DistCp
• “Fan out” to multiple clusters on ingestion
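The hourly, roughly two-day retention policy for name node metadata can be sketched as a simple pruning rule. This is a minimal illustration only: it assumes each backup is identified by its timestamp, the function name `backups_to_delete` and the 48-hour default are hypothetical, and actually copying the metadata is out of scope.

```python
from datetime import datetime, timedelta

def backups_to_delete(backup_times, now, retention=timedelta(hours=48)):
    """Return timestamps of backups that fall outside the retention window.

    backup_times: iterable of datetime objects, one per metadata backup.
    retention: assumed to be about 2 days, as on the slide.
    """
    cutoff = now - retention
    return sorted(t for t in backup_times if t < cutoff)

# Example: hourly backups taken over the last three days.
now = datetime(2011, 1, 10, 12, 0)
hourly = [now - timedelta(hours=h) for h in range(72)]
expired = backups_to_delete(hourly, now)
print(len(expired), "backups are older than the retention window")
```

Keeping the rule as a pure function of timestamps makes it easy to test separately from whatever tool performs the copy.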
The web of scripts – how it starts
• Script to run a check
• Script to import a file
• Script to preempt a job
• Script to instrument a daemon
• Script to….
The web of scripts – where it ends
Garish, jerry-rigged
Nothing ever changes or improves
One and only one person loves it
Time goes into maintaining scripts, not achieving the operational objectives
Script avoidance – data ingestion
• Many data sources
• Streaming data sources (log files, mostly)
• RDBMS
• EDW
• Files (usually exports from 3rd party)
• Common place we see DIY
• You probably shouldn’t
• Sqoop, Flume, Oozie (but I’m biased)
• No matter what - fault tolerant, performant, monitored
Script avoidance - ETL and Data Processing
• Non-interactive jobs
• Establish a common directory structure for processes
• Need tools to handle complex chains of jobs
• Workflow tools support
• Job dependencies, error handling
• Tracking
• Invocation based on time or events
• Most common mistake: depending on jobs always completing successfully or within a window of time.
• Monitor for SLA rather than pray
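To make "monitor for SLA rather than pray" concrete, here is a minimal sketch of a check that flags jobs which failed, finished late, or are still running past their deadline. The `JobRun` record, its field names, and the job names are hypothetical; in practice a workflow tool such as Oozie would supply the job status.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class JobRun:
    name: str
    deadline: datetime                      # when downstream needs the output
    finished_at: Optional[datetime] = None  # None while the job is still running
    succeeded: bool = False

def sla_violations(runs, now):
    """Return names of jobs that failed, finished late, or are overdue."""
    flagged = []
    for run in runs:
        if run.finished_at is None:
            if now > run.deadline:
                flagged.append(run.name)    # still running past its deadline
        elif not run.succeeded or run.finished_at > run.deadline:
            flagged.append(run.name)        # failed, or finished too late
    return flagged

now = datetime(2011, 1, 10, 6, 0)
runs = [
    JobRun("ingest_logs", datetime(2011, 1, 10, 2, 0),
           datetime(2011, 1, 10, 1, 30), True),
    JobRun("daily_rollup", datetime(2011, 1, 10, 5, 0)),  # never finished
    JobRun("export_report", datetime(2011, 1, 10, 8, 0),
           datetime(2011, 1, 10, 5, 45), False),          # finished, but failed
]
violations = sla_violations(runs, now)
print(violations)
```

The point of the sketch is that a job which never completes is treated as a violation in its own right, rather than being silently waited on.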
Script avoidance - monitoring
• Helps keep things running
• System monitoring
• Duh.
• Traditional monitoring tools: Nagios, Hyperic, Zenoss – use what you already have
• Host checks, service checks
• When to alert? It’s tricky.
Script avoidance - monitoring
• Hadoop aware cluster monitoring
• Traditional tools don’t cut it; Hadoop monitoring is inherently Hadoop specific
• Analogous to RDBMS monitoring tools
• Job level “monitoring”
• More like analysis
• “What resources does this job use?”
• “How does this run compare to last run?”
• “How can I make this run faster, more resource efficient?”
• Two views we care about
• Job perspective
• Resource perspective (task slots, scheduler pool)
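The question "How does this run compare to last run?" can be sketched as a comparison of per-run metrics. This is an illustration under stated assumptions: the metric names and the 20% threshold are hypothetical, and the numbers would come from whatever job history you already collect.

```python
def compare_runs(previous, current, threshold=0.2):
    """Return metrics whose relative change from the previous run exceeds threshold.

    previous, current: dicts mapping metric name -> numeric value.
    """
    changes = {}
    for metric, old in previous.items():
        new = current.get(metric)
        if new is None or old == 0:
            continue  # metric missing, or no baseline to compare against
        delta = (new - old) / old
        if abs(delta) > threshold:
            changes[metric] = delta
    return changes

last_run = {"runtime_s": 600, "map_tasks": 100, "hdfs_bytes_read": 1000}
this_run = {"runtime_s": 900, "map_tasks": 105, "hdfs_bytes_read": 1000}
regressions = compare_runs(last_run, this_run)
print(regressions)
```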
Multi-tenancy
• Definition – ability of disparate groups, users, data and workloads to operate concurrently on 1 logical Hadoop system
• Multi-tenancy helps you get more of what you really want
• Better performance
• Better cost of operations
• New insights
• Greater availability
• Multi-tenancy has some additional considerations
Multi-tenancy – auth and auth
• Authentication
• Don’t talk to strangers
• Should integrate with existing IT infrastructure
• Yahoo! security (Kerberos) patches now part of CDH3b3
• Authorization
• Not everyone can access everything
• Ex. Production data sets are read-only to quants / analysts. Analysts have home or group directories for derived data sets.
• Mostly enforced via HDFS permissions; directory structure and organization is critical
• Not as fine grained as column level access in EDW, RDBMS (but this is coming)
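The read-only production data and writable home directory example can be modeled with a POSIX-style owner/group/other check, which is the general shape of HDFS permissions. This is a simplified sketch, not HDFS itself: the `PathInfo` record, the users and groups, and the modes are hypothetical, and it ignores the superuser and permissions on parent directories.

```python
from dataclasses import dataclass

READ, WRITE, EXECUTE = 4, 2, 1

@dataclass
class PathInfo:
    owner: str
    group: str
    mode: int  # octal permission bits, e.g. 0o750

def can_access(path, user, user_groups, want):
    """Check the owner, then group, then other bits against the wanted bitmask."""
    if user == path.owner:
        bits = (path.mode >> 6) & 7
    elif path.group in user_groups:
        bits = (path.mode >> 3) & 7
    else:
        bits = path.mode & 7
    return (bits & want) == want

# Hypothetical layout: production data owned by an ETL user, readable by analysts.
production = PathInfo(owner="etl", group="analysts", mode=0o750)
alice_home = PathInfo(owner="alice", group="analysts", mode=0o700)

print(can_access(production, "alice", {"analysts"}, READ))   # analysts may read
print(can_access(production, "alice", {"analysts"}, WRITE))  # but not write
print(can_access(alice_home, "alice", {"analysts"}, WRITE))  # derived data goes here
```

Because the check works per path, the directory structure itself carries the policy, which is why the slide calls its organization critical.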
Multi-tenancy - resources
• “Stop hogging the cluster!”
• Tracking & establishing policies around cluster resources
• Files, bytes and quotas thereof
• Tasks, memory, IO, CPU, network and scheduling thereof
• By now you’ve almost certainly graduated to a fancy scheduler
• Policies to prevent bad behavior (e.g. auto-kill)
• Monitor and track resource utilization across all groups
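Tracking usage against quotas across groups can be sketched as follows. The group names, the numbers, and the 90% warning level are hypothetical; the usage figures would come from your own accounting of files or bytes per group.

```python
def near_quota(usage, quotas, warn_at=0.9):
    """Return groups whose usage is at or above warn_at of their quota.

    usage, quotas: dicts mapping group name -> amount (e.g. bytes or file count).
    Groups without a quota are skipped.
    """
    flagged = {}
    for group, used in usage.items():
        quota = quotas.get(group)
        if quota and used / quota >= warn_at:
            flagged[group] = used / quota
    return flagged

usage_tb = {"ads": 95, "research": 40, "scratch": 500}
quota_tb = {"ads": 100, "research": 100}  # "scratch" has no quota set
flagged = near_quota(usage_tb, quota_tb)
print(sorted(flagged))
```

A report like this is only the tracking half; enforcement (quotas, scheduler pools, auto-kill) still has to be configured in the cluster.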
Metadata Management
• Tool independent metadata about…
• Data sets we know about and their location (on HDFS)
• Schema
• Authorization (currently HDFS permissions only)
• Partitioning
• Format and compression
• Guarantees (consistency, timeliness, permits duplicates)
• Popularity & usage
• Currently still DIY in many ways, tool-dependent
• Most people rely on prayer and hard coding
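As an alternative to hard coding, the attributes listed above could be captured in a small tool-independent record. This is a DIY sketch, not an existing Hadoop API: the `DatasetMetadata` class, its field names, the in-memory registry, and the example path are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class DatasetMetadata:
    name: str
    hdfs_path: str                    # location on HDFS
    schema: Dict[str, str]            # column name -> type
    partition_keys: List[str] = field(default_factory=list)
    file_format: str = "text"
    compression: str = "none"
    permits_duplicates: bool = False  # one of the "guarantees"
    read_count: int = 0               # crude popularity measure

registry: Dict[str, DatasetMetadata] = {}

def register(dataset):
    """Add or replace a data set's metadata in the registry."""
    registry[dataset.name] = dataset

def lookup(name):
    """Resolve a data set by name and record the access."""
    dataset = registry[name]
    dataset.read_count += 1
    return dataset

register(DatasetMetadata(
    name="clickstream",
    hdfs_path="/data/prod/clickstream",  # hypothetical path
    schema={"ts": "bigint", "user_id": "string", "url": "string"},
    partition_keys=["dt"],
    file_format="sequencefile",
    compression="gzip",
))
print(lookup("clickstream").hdfs_path)
```

Jobs that resolve paths through `lookup` no longer need the location, format, or partitioning baked into their code.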
Wrapping it up
• The basics are not a good place to get creative
• Visualize the city you want to become
• Think command center, not man cave
• Multi-tenancy is a big opportunity with a few additional operational burdens
• There’s lots more work to do