Big Data Anti-Patterns: Lessons from the Front Lines

Big Data Anti-Patterns: Lessons from the Front Lines Strata NYC October 17, 2014 Douglas Moore

Upload: douglas-moore

Post on 01-Jul-2015

DESCRIPTION

October 2014 Strata NYC presentations

TRANSCRIPT

Page 1: Big Data Anti-Patterns: Lessons from the Front Lines

Big Data Anti-Patterns:

Lessons from the Front Lines

Strata NYC

October 17, 2014

Douglas Moore

Page 2: Big Data Anti-Patterns: Lessons from the Front Lines

About Douglas Moore

Think Big – 3 Years

- Delivery

• BDW, Search, Streaming

- Roadmaps

- Tech Assessments

Before Big Data

- Data Warehousing

- OLTP

- Systems Architecture

- Electricity

- High End Graphics

- Supercomputers

- Numerical Analysis

Contact me at:

@douglas_ma

Page 3: Big Data Anti-Patterns: Lessons from the Front Lines

Think Big

4-Year-Old “Big Data” Professional Services Firm

- Roadmaps

- Engineering

- Data Science

- Hands-on Training

Recently acquired by Teradata

• Maintaining Independence

Page 4: Big Data Anti-Patterns: Lessons from the Front Lines

Content Drawn From Vast Amounts of Experience

50+ Clients

Leading security software vendor

Leading Discount Retailer

Page 5: Big Data Anti-Patterns: Lessons from the Front Lines

Introduction

I started out with just 3 topics…

Then while on the road to Strata,

I met 7 big data architects

- Who had 7 clients

• Who had 7 projects

• That demonstrated 7 Anti-Patterns

Big Data Anti-pattern:

“Commonly applied but bad solution” (Wikipedia)

[Image: I95]

Page 6: Big Data Anti-Patterns: Lessons from the Front Lines

Three Focus Areas

• Hardware and Infrastructure

• Tooling

• Big Data Warehousing

Page 7: Big Data Anti-Patterns: Lessons from the Front Lines

Hardware & Infrastructure

Reference Architecture Driven

- 90’s & 00’s data center patterns

- Servers MUST NOT FAIL

- Standard Server Config

• $35,000/node

• Dual Power supply

• RAID

• SAS 15K RPM

• SAN

• VMs for Production

• Flat Network

Automated provisioning is a good thing!

[Image source: HP: The transformation to HP Converged Infrastructure]

Page 8: Big Data Anti-Patterns: Lessons from the Front Lines

#1 Locality

Locality Locality Locality

- Bring Computation to Data

Co-locate data and compute

Locally Attached Storage

Localize & isolate network traffic

Rack Awareness

[Diagram: Hadoop Cluster (each disk paired with a CPU core) vs. VM Cluster (disks grouped separately from CPU cores)]
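Rack Awareness is commonly enabled in Hadoop by pointing the `net.topology.script.file.name` property at an executable that maps host addresses to rack paths. The following is a minimal sketch of such a script, not a production configuration; the subnet-to-rack table and the rack names are invented for illustration.

```python
#!/usr/bin/env python3
"""Toy topology script: maps each host argument to a rack path."""
import sys

# Hypothetical mapping from the first three octets of an IPv4 address to a rack.
RACKS = {
    "10.1.1": "/dc1/rack1",
    "10.1.2": "/dc1/rack2",
}
DEFAULT_RACK = "/default-rack"


def rack_for(host: str) -> str:
    """Return the rack path for an address, or the default rack if unknown."""
    prefix = host.rsplit(".", 1)[0]
    return RACKS.get(prefix, DEFAULT_RACK)


def resolve(hosts):
    """Resolve a list of addresses to rack paths, preserving order."""
    return [rack_for(h) for h in hosts]


if __name__ == "__main__":
    # The script is called with one or more addresses as arguments and
    # answers with one rack path per address, whitespace separated.
    print("\n".join(resolve(sys.argv[1:])))
```

With a mapping like this in place, the framework can prefer placing block replicas and tasks so that traffic stays within a rack where possible.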

Page 9: Big Data Anti-Patterns: Lessons from the Front Lines

#2 Sequential IO

Sequential IO >> Random Access

http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html

Image credit: Wikipedia.org

Large block IO

Append only writes

JBOD
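A back-of-envelope calculation shows why sequential IO dominates on spinning disks. The figures below (100 MB/s sequential throughput, 10 ms per random access, 4 KB random reads) are illustrative assumptions, not numbers from the slides.

```python
# Rough time to read 1 TB from a single spinning disk.
# All constants are assumed for illustration, not measured.
SEQ_MB_PER_S = 100      # assumed sequential throughput
SEEK_S = 0.010          # assumed seek + rotational delay per random read
RANDOM_READ_KB = 4      # assumed size of each random read
DATA_MB = 1_000_000     # 1 TB in decimal megabytes


def sequential_seconds(data_mb: float) -> float:
    """Time to stream the data in one sequential pass."""
    return data_mb / SEQ_MB_PER_S


def random_seconds(data_mb: float) -> float:
    """Time to read the same data as small random reads (seek cost + transfer)."""
    reads = data_mb * 1000 / RANDOM_READ_KB
    transfer = data_mb / SEQ_MB_PER_S
    return reads * SEEK_S + transfer


seq = sequential_seconds(DATA_MB)
rnd = random_seconds(DATA_MB)
print(f"sequential: {seq / 3600:.1f} h, random 4 KB reads: {rnd / 86400:.1f} days")
```

Under these assumptions the random-access pass is a few hundred times slower, which is the motivation for large block IO and append-only writes.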

Page 10: Big Data Anti-Patterns: Lessons from the Front Lines

#3 Increase parallelism

Increase # parallel components

- Reduce component cost

Data block replication

- Availability

- Performance

Commodity++ (2014)

- High density data nodes

- $8-12,000

- ~12 drives

- ~12-16 cores

- Buy 4-5 servers for the cost of 1

• 4-5x spindles

• 4-5x cores

[Diagram: cluster of many nodes, each disk paired with a CPU core]
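The cost comparison can be worked through with the figures quoted in the slides: a $35,000 reference node versus a commodity data node at $8,000 to $12,000 with about 12 drives. This is only a sketch of the arithmetic; real pricing varies by vendor and year.

```python
# Figures taken from the slides; treat them as rough 2014 list prices.
REFERENCE_NODE_COST = 35_000
DRIVES_PER_NODE = 12


def commodity_nodes(budget: int, node_cost: int) -> int:
    """Whole commodity nodes that fit in the given budget."""
    return budget // node_cost


for cost in (8_000, 12_000):
    n = commodity_nodes(REFERENCE_NODE_COST, cost)
    print(f"${cost:,}/node -> {n} nodes, {n * DRIVES_PER_NODE} drives")
```

With these inputs the low end of the price range buys about four nodes per reference server, so the 4-5x figure presumably reflects that low end or a more expensive reference configuration.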

Page 11: Big Data Anti-Patterns: Lessons from the Front Lines

#4 Failure

Expect Failure [1, 2]

Rack Awareness

Data Block Replication

Task Retry

Node Black Listing

Monitor Everything

Name Node HA

[Diagram: cluster of many nodes, each disk paired with a CPU core]
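Task retry and node blacklisting can be illustrated with a toy scheduler. This is a simplified sketch, not Hadoop's actual scheduling logic; the class name, thresholds, and node names are hypothetical.

```python
from collections import Counter


class ToyScheduler:
    """Retries a task on other nodes and blacklists nodes that keep failing."""

    def __init__(self, nodes, max_attempts=3, blacklist_after=2):
        self.nodes = list(nodes)
        self.max_attempts = max_attempts
        self.blacklist_after = blacklist_after
        self.failures = Counter()
        self.blacklist = set()

    def run(self, task):
        """Run task(node) on healthy nodes in order until one succeeds."""
        healthy = [n for n in self.nodes if n not in self.blacklist]
        for node in healthy[: self.max_attempts]:
            try:
                return task(node)
            except Exception:
                # Count the failure and stop using the node once it crosses the threshold.
                self.failures[node] += 1
                if self.failures[node] >= self.blacklist_after:
                    self.blacklist.add(node)
        raise RuntimeError("task failed on every attempted node")


def flaky(node):
    """Hypothetical task that always fails on node 'a'."""
    if node == "a":
        raise IOError("disk failure")
    return f"ran on {node}"


scheduler = ToyScheduler(["a", "b", "c"])
results = [scheduler.run(flaky) for _ in range(3)]
```

After two failures node `a` is skipped entirely, so later tasks no longer pay the cost of a doomed first attempt.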

Page 12: Big Data Anti-Patterns: Lessons from the Front Lines

Tooling

Hadoop Ecosystem Tools

Page 13: Big Data Anti-Patterns: Lessons from the Front Lines

Tooling: Just looking inside the box

“If it came in the box then I should use it”

Example

- Oozie for scheduling

Best Practice:

• Use your current enterprise scheduler

Page 14: Big Data Anti-Patterns: Lessons from the Front Lines

Tooling: NoSQL

• “Now I have all of my log data in NoSQL, let’s do analytics over it”

Example

- Streaming data into MongoDB

• Running aggregates

• Running MR jobs

Page 15: Big Data Anti-Patterns: Lessons from the Front Lines

Best Practice:

• Split the stream

• Real-time access in NoSQL

• Batch analytics in Hadoop
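Splitting the stream can be sketched as a fan-out at ingest time. In this minimal sketch a dictionary stands in for the NoSQL store and a list stands in for append-only files in Hadoop; the class, field names, and events are hypothetical.

```python
from collections import defaultdict


class SplitStream:
    """Sends every event to a lookup store and to an append-only batch log."""

    def __init__(self):
        self.latest_by_user = {}   # stand-in for a NoSQL store (real-time lookups)
        self.batch_log = []        # stand-in for append-only files in Hadoop

    def ingest(self, event: dict) -> None:
        self.latest_by_user[event["user"]] = event   # point lookups by key
        self.batch_log.append(event)                 # full history for batch jobs

    def lookup(self, user: str) -> dict:
        """Real-time path: fetch the latest event for one user."""
        return self.latest_by_user[user]

    def count_by_status(self) -> dict:
        """Batch path: full scan over the history."""
        counts = defaultdict(int)
        for event in self.batch_log:
            counts[event["status"]] += 1
        return dict(counts)


stream = SplitStream()
for e in [
    {"user": "u1", "status": 200},
    {"user": "u2", "status": 500},
    {"user": "u1", "status": 500},
]:
    stream.ingest(e)
```

The lookup path never scans history and the aggregate never touches the serving store, so each system handles only the access pattern it is suited to.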

Page 16: Big Data Anti-Patterns: Lessons from the Front Lines

Right Framework, Right Need…

Key Purpose

- Integrate legacy code

- Integrate analytic tools

• Data science libs

Hadoop supports integrating any type of application tooling

- Hadoop Streaming

• Python

• R

• C, C++

• Fortran

• Cobol

• Ruby
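Hadoop Streaming runs any executable that reads records from standard input and writes tab-separated key/value lines to standard output. A minimal word-count sketch in Python follows; the script name `wordcount.py` and the `map`/`reduce` argument convention are assumptions made for this example.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit 'word<TAB>1' for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum counts per key; input must arrive sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


if __name__ == "__main__" and len(sys.argv) == 2 and sys.argv[1] in ("map", "reduce"):
    # Hypothetical invocation: `wordcount.py map` or `wordcount.py reduce`.
    step = mapper if sys.argv[1] == "map" else reducer
    for out in step(sys.stdin):
        print(out)
```

The same two commands would be passed to the streaming jar through its `-mapper` and `-reducer` options; the jar location depends on the distribution. Locally, the pipeline can be approximated by piping the mapper output through `sort` into the reducer.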

Page 17: Big Data Anti-Patterns: Lessons from the Front Lines

Right Use Case – ETL, Wrong Framework

Got to love Ruby

- Very Cool (or it was)

- Dynamic Language

- Expressive

- Compact

- Fast Iteration

Got to Hate Ruby

- Slow

- Hard to follow & debug

- Does not play well with threading

“It’s much faster to develop in, developer time is valuable, just throw a couple more boxes at it”

Bench tested at 5,000 records / second

Page 18: Big Data Anti-Patterns: Lessons from the Front Lines

Right Use Case – ETL, Wrong Framework…

Best Practice:

• Write new code in the fastest execution framework

• For high-value legacy code and analytic tools, use Hadoop Streaming

DO THE MATH:

Storm Java: ~1MM+ events / second / server

Storm Ruby: 5,000 * 12 cores = 60,000 events / second / server

= 16.67 times more servers

“Test and Learn!”
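The arithmetic on this slide can be reproduced directly. The throughput figures come from the slides; the 3,000,000 events/second target used for sizing is a hypothetical workload added for illustration.

```python
import math

JAVA_EVENTS_PER_SERVER = 1_000_000   # "~1MM+ events / second / server"
RUBY_EVENTS_PER_CORE = 5_000         # bench-tested figure from the slides
CORES_PER_SERVER = 12


def ruby_events_per_server() -> int:
    """Per-server Ruby throughput, assuming one worker per core."""
    return RUBY_EVENTS_PER_CORE * CORES_PER_SERVER


def server_ratio() -> float:
    """How many Ruby servers match one Java server."""
    return JAVA_EVENTS_PER_SERVER / ruby_events_per_server()


def servers_needed(target_events_per_s: int, per_server: int) -> int:
    """Whole servers required to sustain the target rate."""
    return math.ceil(target_events_per_s / per_server)


TARGET = 3_000_000  # hypothetical workload, not from the slides
print(f"ratio: {server_ratio():.2f}x")
print("Java servers:", servers_needed(TARGET, JAVA_EVENTS_PER_SERVER))
print("Ruby servers:", servers_needed(TARGET, ruby_events_per_server()))
```

For the hypothetical target this comes to 3 servers against 50, which is the kind of gap that outweighs the developer-time argument quoted on the previous slide.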

Page 19: Big Data Anti-Patterns: Lessons from the Front Lines

Big Data Warehousing

#1 ETL Offload

#2 Data Warehousing

Page 20: Big Data Anti-Patterns: Lessons from the Front Lines

Right Schema

OLTP: 3NF - Transactional Source System Schema

- order, order line, customer, product, contract, sales_person (separate normalized tables)

Data Warehouse: Dimensional Schema

- order, order line, contract, customer, product, sales_person

Hadoop: De-normalized schema

- order line, order, contract, customer, product, sales_person (combined into one wide record)
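De-normalizing for Hadoop means joining the related entities once, at load time, into one wide record per order line. A minimal sketch follows, using the entity names from the slide; the sample rows and column names are hypothetical.

```python
# Hypothetical sample rows keyed by id.
customers = {1: {"customer_name": "Acme"}}
products = {10: {"product_name": "Widget"}}
orders = {100: {"customer_id": 1, "order_date": "2014-10-17"}}
order_lines = [
    {"order_id": 100, "product_id": 10, "quantity": 3},
]


def denormalize(order_lines, orders, customers, products):
    """Join each order line with its order, customer and product into one flat record."""
    for line in order_lines:
        order = orders[line["order_id"]]
        yield {
            **line,
            **order,
            **customers[order["customer_id"]],
            **products[line["product_id"]],
        }


flat = list(denormalize(order_lines, orders, customers, products))
```

Each flat record can then be scanned on its own, with no joins required at query time, at the cost of repeating customer and product attributes on every line.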

Page 21: Big Data Anti-Patterns: Lessons from the Front Lines

Right Workload, Right Tool

Platforms compared: Hadoop; NoSQL; MPP, Reporting DBs, Mainframe

Workloads:

- ETL

- Business Intelligence

- Cross business reporting

- Sub-set analytics

- Full scan analytics

- Decision Support (TBs-PBs, GB-TBs)

- Operational Reports

- Complex security requirements

- Search

- Fast Lookup

[Table: per-platform markings for each workload not recoverable]

Page 22: Big Data Anti-Patterns: Lessons from the Front Lines

Summary

Understand strengths & weaknesses of each choice

- Get help if needed

Deploy the right tool for the right workload

Test and Learn

Page 23: Big Data Anti-Patterns: Lessons from the Front Lines

Thank You

Work with the best on a wide variety of cool projects:

[email protected]

@douglas_ma

Douglas Moore

Page 24: Big Data Anti-Patterns: Lessons from the Front Lines

Think Big, Start Smart, Scale Fast

Work with the Leading Innovator in Big Data

DATA SCIENTISTS

DATA ARCHITECTS

DATA SOLUTIONS