Page 1: Hadoop in Production - Meetupfiles.meetup.com/1767817/ProductionizingHadoop.pdf · Hadoop in Production Charles Zedlewski, VP, Product. Cloudera In One Slide ... Cloudera’sDistribution

Hadoop in Production

Charles Zedlewski, VP, Product


Cloudera In One Slide

Hadoop…

Jeff Hammerbacher

Amr Awadallah

Doug Cutting

… meets enterprise

Mike Olson - CEO

Omer Trajman – Customer Solutions

John Kreisa – Marketing

Charles Zedlewski – Product

Ed Albanese – Business Development

Investors Greylock Partners, Accel Partners, Meritech Capital Partners, In-Q-Tel

Product category Data management

Business model Cloudera offers Software, Support, Training, and Professional Services

Employees 70+

Customers 75+

Headquarters Palo Alto, California

Elevator pitch The leading provider of Apache Hadoop-based software and services for the enterprise

Vision We enable organizations to profit from all of their data

Copyright 2011 Cloudera, Inc. All rights reserved. 2


Cloudera’s Distribution for Hadoop, Version 3

• Open – 100% Apache licensed

• Predictable – Regular releases & updates you can plan against

• Adopted – Widely used in web, retail, financial, media, telecom & federal. Deployed at 10 to 1,000+ node clusters, integrated with various enterprise technologies

• Supported – Employs project founders and committers for >90% of packages


[Diagram: CDH3 component stack showing Hue, Hue SDK, Oozie, HBase, Flume, Sqoop, Zookeeper, Hive and Pig, captioned “The Industry’s Leading Hadoop Distribution”]


Outline

• Know what you want

• The organization will evolve with Hadoop

• Some basics

• Avoid the web of scripts

• Multitenancy

• Wrap up


First principles – what do you really want?

• What are the business outcomes Hadoop is supposed to deliver?

• New insights

• Lower business costs

• Lower IT costs

• More data under management

• More revenue through better targeting, conversion

• ?


What are the operational outcomes you need to optimize for to get what you really want?

• Performance

• Utilization

• Cost of operations

• Availability

• Quality of service

• Flexibility / elasticity

• Security

• Transparency

• ?


How the organization evolves – the hunter-gatherers


Everyone does the same job

Little distinction between job code, Hadoop code & configuration

Still amazed by this fire thing

Life is nasty, brutish and short


How the organization evolves – the tribe


We have a chief that looks out for the tribe

Job code distinct from the rest of Hadoop

Make sure there’s enough fire for everyone

Survival of the tribe is still the main concern


How the organization evolves – the city


Patching, updating, upgrading, configuring and tuning are all distinct

Pollution, congestion, etc. are a concern

Fire is not a big deal any more.

More specialized roles


Some basics – use the right gear

• Hadoop needs to know where its hard drives are

• Running on a virtualized layer is a bad idea

• RAID is a bad idea

• Running on remote storage is the worst idea

• Servers – prioritize flexibility over bells and whistles

• How easily will you be able to expand your cluster?

• How easily can you evolve your core / spindle ratio?

• How many companies support that exotic chip, card, drive, power supply, etc.?

• Network – prioritize quality over bells & whistles

• 10G on the backplane is usually unnecessary

• Amazingly high failure rates for some gear

• Plan how to adapt your topology as your cluster grows


Some basics – system should be available and never lose data

• Name node availability

• HA is tricky

• Manual recovery can be simple, fast, effective

• Backup Strategy

• Name node metadata – hourly, ~2 day retention

• User data

• Log shipping style strategies

• DistCp

• “Fan out” to multiple clusters on ingestion
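As a rough illustration of the "hourly, ~2 day retention" idea for name node metadata, here is a minimal sketch of the pruning decision. The function name, the 48-hour default and the use of plain timestamps to identify backups are assumptions made for the example; they are not from the talk or from any Hadoop tool.

```python
from datetime import datetime, timedelta


def backups_to_prune(backup_times, now, retention=timedelta(hours=48)):
    """Return backup timestamps that fall outside the retention window.

    Hypothetical helper. The newest backup is always kept, even when it is
    older than the window, so a stalled backup job never causes the last
    good copy to be deleted.
    """
    if not backup_times:
        return []
    newest = max(backup_times)
    cutoff = now - retention
    return sorted(t for t in backup_times if t < cutoff and t != newest)
```

Keeping the newest copy unconditionally is a design choice for the sketch: pruning and backing up fail independently, and the pruner should not make a bad day worse.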


The web of scripts – how it starts

• Script to run a check

• Script to import a file

• Script to preempt a job

• Script to instrument a daemon

• Script to….


The web of scripts – where it ends


Garish, jerry-rigged

Nothing ever changes or improves

One and only one person loves it

Time goes into maintaining scripts, not achieving the operational objectives


Script avoidance – data ingestion

• Many data sources

• Streaming data sources (log files, mostly)

• RDBMS

• EDW

• Files (usually exports from 3rd party)

• Common place we see DIY

• You probably shouldn’t

• Sqoop, Flume, Oozie (but I’m biased)

• No matter what - fault tolerant, performant, monitored


Script avoidance - ETL and Data Processing

• Non-interactive jobs

• Establish a common directory structure for processes

• Need tools to handle complex chains of jobs

• Workflow tools support

• Job dependencies, error handling

• Tracking

• Invocation based on time or events

• Most common mistake: depending on jobs always completing successfully or within a window of time.

• Monitor for SLA rather than pray
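To make the points about job dependencies, error handling and SLA monitoring concrete, here is a toy sketch of what a workflow tool does for a chain of jobs. It is not how Oozie or any other real tool is implemented; the function name, the status strings and the job callables are all hypothetical, and it assumes Python 3.9+ for `graphlib`.

```python
import time
from graphlib import TopologicalSorter


def run_workflow(jobs, deps, sla_seconds, clock=time.monotonic):
    """Run jobs in dependency order; report per-job status and SLA misses.

    jobs: dict of job name -> zero-argument callable (hypothetical jobs).
    deps: dict of job name -> set of job names it depends on.
    sla_seconds: dict of job name -> allowed run time in seconds.
    """
    status = {}
    sla_misses = []
    graph = {name: deps.get(name, set()) for name in jobs}
    for name in TopologicalSorter(graph).static_order():
        # Never assume upstream success: skip explicitly if a dependency
        # failed or was itself skipped.
        if any(status.get(d) != "ok" for d in deps.get(name, ())):
            status[name] = "skipped"
            continue
        start = clock()
        try:
            jobs[name]()
            status[name] = "ok"
        except Exception:
            status[name] = "failed"
        # Jobs without a configured SLA never count as a miss.
        if clock() - start > sla_seconds.get(name, float("inf")):
            sla_misses.append(name)
    return status, sla_misses
```

The returned status map and SLA list are what you would feed to monitoring, rather than relying on every job finishing inside its window.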


Script avoidance - monitoring

• Helps keep things running

• System monitoring

• Duh.

• Traditional monitoring tools: Nagios, Hyperic, Zenoss – use what you already have

• Host checks, service checks

• When to alert? It’s tricky.
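One common way to tame the "when to alert" question is to page only after several consecutive failed checks, so a single blip does not wake anyone up. The sketch below shows that rule only; the function name and the default of three failures are assumptions for the example, not a recommendation from the talk or a feature of any particular monitoring tool.

```python
def should_alert(check_results, consecutive_failures=3):
    """Decide whether to alert from a history of check results.

    check_results: list of booleans, oldest first, True meaning the check
    passed. Alerts only when the most recent `consecutive_failures`
    results all failed.
    """
    recent = check_results[-consecutive_failures:]
    return len(recent) == consecutive_failures and not any(recent)
```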


Script avoidance - monitoring

• Hadoop aware cluster monitoring

• Traditional tools don’t cut it; Hadoop monitoring is inherently Hadoop specific

• Analogous to RDBMS monitoring tools

• Job level “monitoring”

• More like analysis

• “What resources does this job use?”

• “How does this run compare to last run?”

• “How can I make this run faster, more resource efficient?”

• Two views we care about

• Job perspective

• Resource perspective (task slots, scheduler pool)
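The "how does this run compare to last run?" question can be sketched as a simple diff over per-job metrics. The metric names and the 20% threshold below are hypothetical, and the sketch assumes you already collect such numbers with your own tooling.

```python
def compare_runs(previous, current, threshold=0.2):
    """Return metrics whose relative change from the last run exceeds threshold.

    previous, current: dicts of metric name -> number (hypothetical metrics
    such as wall-clock seconds or slot-seconds).
    """
    changes = {}
    for metric, old in previous.items():
        new = current.get(metric)
        if new is None or old == 0:
            continue  # no baseline to compute a relative change against
        change = (new - old) / old
        if abs(change) > threshold:
            changes[metric] = round(change, 3)
    return changes
```

For example, a job whose wall-clock time goes from 600 to 900 seconds while its slot usage barely moves would be reported with a 0.5 change on the wall-clock metric only.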


Multi-tenancy

• Definition – ability of disparate groups, users, data and workloads to operate concurrently on 1 logical Hadoop system

• Multi-tenancy helps you get more of what you really want

• Better performance

• Better cost of operations

• New insights

• Greater availability

• Multi-tenancy has some additional considerations


Multi-tenancy – auth and auth

• Authentication

• Don’t talk to strangers

• Should integrate with existing IT infrastructure

• Yahoo! security (Kerberos) patches now part of CDH3b3

• Authorization

• Not everyone can access everything

• Ex. Production data sets are read-only to quants / analysts. Analysts have home or group directories for derived data sets.

• Mostly enforced via HDFS permissions; directory structure and organization is critical

• Not as fine grained as column level access in EDW, RDBMS (but this is coming)
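HDFS permissions follow the familiar owner / group / other model, which is why directory layout does so much of the authorization work. The sketch below is a simplified model of that check, not HDFS code: it ignores the superuser and parent-directory traversal, and the user, group and directory names in the usage note are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    owner: str
    group: str
    mode: int  # octal permission bits, e.g. 0o750


def can_access(entry, user, groups, want):
    """Simplified POSIX-style check; `want` is 'r', 'w' or 'x'."""
    bit = {"r": 4, "w": 2, "x": 1}[want]
    if user == entry.owner:
        shift = 6  # owner bits
    elif entry.group in groups:
        shift = 3  # group bits
    else:
        shift = 0  # other bits
    return bool((entry.mode >> shift) & bit)
```

With a production directory modeled as `Entry("etl", "analysts", 0o750)`, a member of the `analysts` group can read but not write, which matches the read-only production data example above; a group directory with mode `0o770` gives the same analysts somewhere to write derived data sets.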


Multi-tenancy - resources

• “Stop hogging the cluster!”

• Tracking & establishing policies around cluster resources

• Files, bytes and quotas thereof

• Tasks, memory, IO, CPU, network and scheduling thereof

• By now you’ve almost certainly graduated to a fancy scheduler

• Policies to prevent bad behavior (e.g. auto-kill)

• Monitor and track resource utilization across all groups
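A policy such as auto-kill boils down to comparing each running job against limits for its scheduler pool. The sketch below shows only that selection step. The job fields, pool names and limit keys are hypothetical, and actually killing a job is left to whatever scheduler or tooling you run.

```python
def select_jobs_to_kill(running_jobs, limits):
    """Return ids of jobs that exceed the limits configured for their pool.

    running_jobs: list of dicts with 'id', 'pool', 'runtime_s' and 'tasks'.
    limits: dict of pool name -> dict with 'max_runtime_s' and 'max_tasks'.
    Jobs in pools without a configured limit are never selected.
    """
    victims = []
    for job in running_jobs:
        limit = limits.get(job["pool"])
        if limit is None:
            continue
        too_long = job["runtime_s"] > limit["max_runtime_s"]
        too_wide = job["tasks"] > limit["max_tasks"]
        if too_long or too_wide:
            victims.append(job["id"])
    return victims
```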


Metadata Management

• Tool independent metadata about…

• Data sets we know about and their location (on HDFS)

• Schema

• Authorization (currently HDFS permissions only)

• Partitioning

• Format and compression

• Guarantees (consistency, timeliness, permits duplicates)

• Popularity & usage

• Currently still DIY in many ways, tool-dependent

• Most people rely on prayer and hard coding
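As an alternative to hard coding, the items listed above can be held in one tool-independent record per data set. The sketch below is one possible shape for such a record and a tiny in-memory catalog; the class names, field names and defaults are assumptions for illustration and do not describe an existing Hadoop component.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    name: str
    location: str  # path on HDFS
    schema: dict  # column name -> type name
    partition_keys: list = field(default_factory=list)
    file_format: str = "text"
    compression: str = "none"
    allows_duplicates: bool = False  # one of the "guarantees"


class Catalog:
    """Minimal in-memory registry keyed by data set name."""

    def __init__(self):
        self._records = {}

    def register(self, record):
        if record.name in self._records:
            raise ValueError(f"data set already registered: {record.name}")
        self._records[record.name] = record

    def locate(self, name):
        """Return the HDFS location so jobs do not hard code paths."""
        return self._records[name].location
```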


Wrapping it up

• The basics are not a good place to get creative

• Visualize the city you want to become

• Think command center, not man cave

• Multi-tenancy is a big opportunity with a few additional operational burdens

• There’s lots more work to do
