Big Data Anti-Patterns: Lessons from the Front Lines
DESCRIPTION
October 2014 Strata NYC presentation
TRANSCRIPT
Big Data Anti-Patterns:
Lessons from the Front Lines
Strata NYC
October 17, 2014
Douglas Moore
| 2
Think Big – 3 Years
- Delivery
• BDW, Search, Streaming
- Roadmaps
- Tech Assessments
About Douglas Moore
Before Big Data
- Data Warehousing
- OLTP
- Systems Architecture
- Electricity
- High End Graphics
- Supercomputers
- Numerical Analysis
Contact me at: @douglas_ma
| 3
Think Big
4yr Old “Big Data” Professional Services Firm
- Roadmaps
- Engineering
- Data Science
- Hands on Training
Recently acquired by Teradata
• Maintaining Independence
| 4
Content Drawn From Vast Amounts of Experience
…
50+ Clients
Leading security software vendor
Leading discount retailer
| 5
I started out with just 3 topics…
Then while on the road to Strata,
I met 7 big data architects
- Who had 7 clients
• Who had 7 projects
• That demonstrated 7 Anti-Patterns
Introduction
Big Data Anti-pattern:
“Commonly applied but bad solution”
I95 Wikipedia
| 6
• Hardware and Infrastructure
• Tooling
• Big Data Warehousing
Three Focus Areas
| 7
Reference Architecture Driven
- 90’s & 00’s data center patterns
- Servers MUST NOT FAIL
- Standard Server Config
• $35,000/node
• Dual Power supply
• RAID
• SAS 15K RPM
• SAN
• VMs for Production
• Flat Network
Hardware & Infrastructure
[Image source: HP: The transformation to HP Converged Infrastructure]
Automated provisioning is a good thing!
| 8
Locality Locality Locality
- Bring Computation to Data
#1 Locality
Co-locate data and compute
Locally Attached Storage
Localize & isolate network traffic
Rack Awareness
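Rack awareness in Hadoop is typically enabled by pointing the `net.topology.script.file.name` property at a script that receives host names or IP addresses as arguments and prints one rack path per argument. The following is a minimal sketch of such a script; the subnet prefixes and rack names are hypothetical placeholders, not values from the talk.

```python
#!/usr/bin/env python3
"""Sketch of a Hadoop rack-topology script (hypothetical subnet-to-rack mapping)."""
import sys

# Hypothetical mapping from IP prefix to rack path; replace with real inventory.
RACKS = {
    "10.1.1.": "/dc1/rack1",
    "10.1.2.": "/dc1/rack2",
}
DEFAULT_RACK = "/default-rack"


def rack_for(host: str) -> str:
    """Return the rack path for a host or IP, falling back to the default rack."""
    for prefix, rack in RACKS.items():
        if host.startswith(prefix):
            return rack
    return DEFAULT_RACK


def main(argv):
    # Hadoop passes one or more hosts as arguments and expects one rack per host.
    print(" ".join(rack_for(h) for h in argv))


if __name__ == "__main__":
    main(sys.argv[1:])
```

With a mapping like this in place, the scheduler and HDFS can prefer node-local and rack-local placement, which is what keeps network traffic localized.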
[Figure: Hadoop Cluster (each disk co-located with a CPU core) vs. VM Cluster (disks pooled separately from the CPU cores)]
| 9
Sequential IO >> Random Access
#2 Sequential IO
http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html
Image credit: Wikipedia.org
Large block IO
Append only writes
JBOD
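A rough back-of-the-envelope model shows why large-block sequential IO wins on spinning disks. The seek time and transfer rate below are illustrative assumptions for a generic hard drive, not figures from the talk; the model simply charges one seek per block read.

```python
# Illustrative drive parameters (assumptions, not measurements from the talk).
SEEK_MS = 10.0          # average seek + rotational delay per random access
TRANSFER_MB_S = 100.0   # sustained sequential transfer rate


def read_seconds(total_mb: float, block_mb: float) -> float:
    """Time to read total_mb when each block of block_mb costs one seek."""
    blocks = total_mb / block_mb
    seek_time = blocks * SEEK_MS / 1000.0
    transfer_time = total_mb / TRANSFER_MB_S
    return seek_time + transfer_time


if __name__ == "__main__":
    total_mb = 1024.0  # 1 GB of data
    small = read_seconds(total_mb, block_mb=4 / 1024)   # 4 KB random reads
    large = read_seconds(total_mb, block_mb=128.0)      # 128 MB blocks
    print(f"4 KB blocks:   {small:8.1f} s")
    print(f"128 MB blocks: {large:8.1f} s")
```

Under these assumptions seek time dominates completely for small blocks, while for large blocks the read is limited almost entirely by transfer rate.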
| 10
Increase # parallel components
- Reduce component cost
Data block replication
- Availability
- Performance
Commodity++ (2014)
- High density data nodes
- $8-12,000
- ~12 drives
- ~12-16 cores
- Buy 4-5 servers for the cost of 1
• 4-5x spindles
• 4-5x cores
#3 Increase parallelism
[Figure: cluster drawn as a grid of many nodes, each pairing a disk with a CPU core]
| 11
Expect Failure [1,2]
Rack Awareness
Data Block Replication
Task Retry
Node Black Listing
Monitor Everything
Name Node HA
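Task retry and node blacklisting can be illustrated with a small sketch. This is not Hadoop's actual scheduler logic; the function name, the round-robin node choice, and the thresholds are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Iterable, Set


def run_with_retry(task: Callable[[str], str],
                   nodes: Iterable[str],
                   max_attempts: int = 4,
                   blacklist_after: int = 2) -> str:
    """Run task on available nodes, retrying on failure and blacklisting bad nodes."""
    failures = Counter()
    blacklist: Set[str] = set()
    nodes = list(nodes)
    for attempt in range(max_attempts):
        candidates = [n for n in nodes if n not in blacklist]
        if not candidates:
            raise RuntimeError("all nodes blacklisted")
        # Rotate through the healthy nodes so a retry lands somewhere else.
        node = candidates[attempt % len(candidates)]
        try:
            return task(node)
        except Exception:
            failures[node] += 1
            if failures[node] >= blacklist_after:
                blacklist.add(node)
    raise RuntimeError(f"task failed after {max_attempts} attempts")


def flaky(node: str) -> str:
    """Hypothetical task that always fails on node1 and succeeds elsewhere."""
    if node == "node1":
        raise IOError("disk failure on node1")
    return f"ok on {node}"


if __name__ == "__main__":
    print(run_with_retry(flaky, ["node1", "node2"]))
```

The point of the design is that a single failure is treated as routine: the work moves to another node, and only repeated failures take a node out of rotation.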
#4 Failure
[Figure: cluster drawn as a grid of many nodes, each pairing a disk with a CPU core]
| 12
Hadoop Ecosystem Tools
Tooling
| 13
“If it came in the box then I should use it”
Example
- Oozie for scheduling
Tooling: Just looking inside the box
Best Practice:
• Use your current enterprise scheduler
| 14
Tooling: NoSQL
• “Now I have all of my log data in NoSQL, let’s do analytics over it”
Example
- Streaming data into Mongo DB
• Running aggregates
• Running MR jobs
| 15
Best Practice
Best Practice:
• Split the stream
• Real-time access in NoSQL
• Batch analytics in Hadoop
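Splitting the stream can be sketched as a simple fan-out: every incoming event is delivered both to a low-latency store and to an append-only batch log. The two sink classes below are in-memory stand-ins invented for this sketch; a real deployment would replace them with a NoSQL client and files landed in Hadoop.

```python
from typing import Callable, Dict, Iterable, List

Event = Dict[str, object]
Sink = Callable[[Event], None]


def split_stream(events: Iterable[Event], sinks: List[Sink]) -> int:
    """Deliver every event to every sink; return the number of events processed."""
    count = 0
    for event in events:
        for sink in sinks:
            sink(event)
        count += 1
    return count


class KeyValueStore:
    """Stand-in for a NoSQL store: keeps the latest event per user for fast lookup."""

    def __init__(self) -> None:
        self.latest: Dict[object, Event] = {}

    def write(self, event: Event) -> None:
        self.latest[event["user"]] = event


class BatchLog:
    """Stand-in for append-only files landed in Hadoop for later batch analytics."""

    def __init__(self) -> None:
        self.records: List[Event] = []

    def write(self, event: Event) -> None:
        self.records.append(event)


if __name__ == "__main__":
    kv, log = KeyValueStore(), BatchLog()
    events = [{"user": "a", "bytes": 10}, {"user": "b", "bytes": 5}]
    split_stream(events, [kv.write, log.write])
    print(kv.latest["a"], len(log.records))
```

Each store then serves the workload it suits: point lookups against the key-value side, full scans and aggregates against the batch side.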
| 18
Key Purpose
- Integrate legacy code
- Integrate analytic tools
• Data science libs
Hadoop supports integrating any type of application tooling
- Hadoop Streaming
• Python
• R
• C, C++
• Fortran
• Cobol
• Ruby
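Hadoop Streaming runs any executable that reads records from standard input and writes tab-separated key/value lines to standard output. Below is a minimal word-count sketch in Python; the `map`/`reduce` command-line argument is a convention of this example only, not something Hadoop Streaming requires.

```python
#!/usr/bin/env python3
"""Word-count mapper and reducer in the style Hadoop Streaming expects."""
import sys
from itertools import groupby


def mapper(lines):
    """Emit 'word<TAB>1' for every whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(lines):
    """Sum counts per key; input must arrive sorted by key, as Hadoop provides it."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(v) for _, v in group)}"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Invoked as `script.py map` or `script.py reduce` (this sketch's convention).
    step = mapper if sys.argv[1] == "map" else reducer
    for out in step(sys.stdin):
        print(out)
```

The same script can be tested locally by piping a file through the map step, `sort`, and the reduce step before submitting it through the Hadoop Streaming jar with its `-mapper` and `-reducer` options.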
Right Framework, Right Need…
| 19
Got to love Ruby
- Very Cool (or it was)
- Dynamic Language
- Expressive
- Compact
- Fast Iteration
Got to Hate Ruby
- Slow
- Hard to follow & debug
- Does not play well with threading
Right Use Case – ETL, Wrong Framework
“It’s much faster to develop in,
developer time is valuable,
just throw a couple more boxes at it”
Bench tested at 5,000 records / second
| 20
Right Use Case – ETL, Wrong Framework…
Best Practice:
• Write new code in the fastest-executing framework
• For high-value legacy code and analytic tools, use Hadoop Streaming
DO THE MATH:
Storm Java: ~ 1MM+ events / second / Server
Storm Ruby: 5000 * 12 cores = 60,000 events / second / Server
= 16.67 times more servers
“Test and Learn!”
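The arithmetic on this slide can be reproduced in a few lines. The throughput figures are the ones quoted on the slide; the 3,000,000 events/second target in the demo is a hypothetical workload added for illustration.

```python
import math

JAVA_EVENTS_PER_SERVER = 1_000_000   # "~1MM+ events / second / server" from the slide
RUBY_EVENTS_PER_CORE = 5_000         # bench-tested Ruby rate from the slide
CORES_PER_SERVER = 12

ruby_events_per_server = RUBY_EVENTS_PER_CORE * CORES_PER_SERVER
server_ratio = JAVA_EVENTS_PER_SERVER / ruby_events_per_server


def servers_needed(target_events_per_s: int, per_server: int) -> int:
    """Smallest whole number of servers that can sustain the target rate."""
    return math.ceil(target_events_per_s / per_server)


if __name__ == "__main__":
    target = 3_000_000  # hypothetical target load, events per second
    print(f"Ruby per server: {ruby_events_per_server:,} events/s")
    print(f"Server ratio:    {server_ratio:.2f}x")
    print(f"Java servers:    {servers_needed(target, JAVA_EVENTS_PER_SERVER)}")
    print(f"Ruby servers:    {servers_needed(target, ruby_events_per_server)}")
```

Plugging in measured rates from your own bench tests is the "test and learn" step: the ratio, not the developer-time argument, determines how many extra boxes are needed.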
| 21
#1 ETL Offload
#2 Data Warehousing
Big Data Warehousing
| 22
Right Schema
3NF - Transactional Source System Schema (OLTP)
- order
- order line
- customer
- product
- contract
- sales_person
Dimensional Schema (Data Warehouse)
- order
- contract
- customer
- product
- order line
- sales_person
De-normalized Schema (Hadoop)
- order line, order, contract, customer, product, sales_person in one flat record
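De-normalizing means joining the 3NF tables once, up front, so that each order line carries its order, customer, and product attributes in a single wide record. The sketch below uses the table names from the slide; every column name and value is hypothetical.

```python
# Hypothetical rows; only the table names come from the slide.
customers = {1: {"customer_name": "Acme"}}
products = {10: {"product_name": "Widget"}}
orders = {100: {"customer_id": 1, "order_date": "2014-10-17"}}
order_lines = [
    {"order_id": 100, "line_no": 1, "product_id": 10, "qty": 3},
    {"order_id": 100, "line_no": 2, "product_id": 10, "qty": 1},
]


def denormalize(order_lines, orders, customers, products):
    """Join each order line with its order, customer, and product into one flat record."""
    for line in order_lines:
        order = orders[line["order_id"]]
        customer = customers[order["customer_id"]]
        product = products[line["product_id"]]
        yield {**line, **order, **customer, **product}


if __name__ == "__main__":
    for row in denormalize(order_lines, orders, customers, products):
        print(row)
```

The wide records cost more storage but let full-scan jobs run without joins, which fits the sequential, large-block access pattern discussed earlier.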
| 23
Workload / Hadoop / NoSQL / MPP, Reporting DBs, Mainframe
ETL
Business Intelligence
Cross business reporting
Sub-set analytics
Full scan analytics
Decision Support TBs-PBs GB-TBs
Operational Reports
Complex security requirements
Search
Fast Lookup
Right Workload, Right Tool
| 24
Understand strengths & weaknesses of each choice
- Get help if needed
Deploy the right tool for the right workload
Test and Learn
Summary
| 25
Thank You
Work with the best on a wide variety of cool projects:
@douglas_ma
Douglas Moore
DATA SCIENTISTS
DATA ARCHITECTS
DATA SOLUTIONS
Think Big Start Smart Scale Fast
Work with the Leading Innovator in Big Data