Performance Management in ‘Big Data’ Applications

It’s still about the Application

Michael Kopp, Technology Strategist

michael.kopp@compuware.com

@mikopp

blog.dynatrace.com

Edward Capriolo

edward@m6d.com

@edwardcapriolo

m6d.com/blog

High Volume/Low Latency DBs

[Diagram labels: Java, Web, BigData]

Key Benefits

1) Fast Read/Write

2) Horizontal Scalability

3) Redundancy and High Availability

Key Challenges

1) Even Distribution

2) Correct Schema and Access Patterns

3) Understanding Application Impact

Large Parallel Batch Processing

[Diagram labels: Hive (high-level map/reduce query), JOB, Hive Server (batch trigger), BigData nodes 1, 2, 3, 4 ... 754]

Key Benefits

1) Massive Horizontal Batch Job

2) Split big Problems into smaller ones

Key Challenges

1) Optimal Distribution

2) Unwieldy Configuration

3) Can easily waste your resources

What is m6d?

Impressions look like…

Map Reduce Performance

Typical MapReduce Job at m6d

Hadoop at m6d

• Critical piece of infrastructure

• Long Term Data Storage

– Raw logs

– Aggregations

– Reports

– Generated data (feedback loops)

• Numerous ETL (Extract, Transform, Load) processes

• Scheduled and ad hoc processes

• Used directly by Tech-Team, Ad Ops, Data Science

Hadoop at m6d

• Two deployments: 'production' and 'research'

– ~500 TB, 40+ Nodes

– ~350 TB, 20+ Nodes

• Thousands of jobs

– <5 minute jobs and 12 hour Job Flows

– Mostly Hive Jobs

– Some custom code and streaming jobs

Hadoop Design Tenets

• Linear scalability by adding more hardware

• HDFS Distributed file system

– User space file system

– Blocks are replicated across nodes

– Limited semantics

• MapReduce

– Paradigm that models computation as map and reduce steps

– Data Locality

– Split Job into Tasks by Data

– Retry on failure
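The MapReduce bullets above can be illustrated with a minimal single-process sketch in Python. This is not Hadoop code: the function names (`map_impression`, `reduce_counts`, `run_job`) and the sample records are hypothetical, and the sketch only shows how a job is split into map tasks by data, shuffled by key, and reduced.

```python
from collections import defaultdict

def map_impression(record):
    """Map step: emit (site, 1) for one log record (hypothetical "user,site" format)."""
    user, site = record.split(",")
    yield site, 1

def reduce_counts(key, values):
    """Reduce step: sum all values seen for one key."""
    return key, sum(values)

def run_job(splits):
    """Run one map pass per input split, shuffle by key, then reduce."""
    shuffled = defaultdict(list)
    for split in splits:               # each split would be a separate map task
        for record in split:
            for key, value in map_impression(record):
                shuffled[key].append(value)
    return dict(reduce_counts(k, v) for k, v in shuffled.items())

# Two input splits standing in for two HDFS blocks (made-up data).
splits = [["u1,news", "u2,sports"], ["u3,news", "u1,news"]]
result = run_job(splits)
print(result)
```

In a real cluster each split would be processed on the node that holds the block (data locality), and a failed task would simply be run again.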

Schema Design Challenges

• Partition data for good distribution

– By time interval (optionally a second level): partition pruning with WHERE

– Clustering (aka bucketing): optimized sampling and joins

– Columnar: column oriented

• Raw Data Growth

• Data features change (more distinct X)
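Partition pruning can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Hive internals: the `scan` function, the table layout, and the day keys are hypothetical. It shows that a filter on the partition key limits how many partitions are opened at all.

```python
def scan(partitions, wanted_days=None):
    """Return (matching rows, number of partitions actually read).

    partitions maps a partition key (here a day string) to its rows.
    With wanted_days set, only those partitions are opened, which is the
    effect a WHERE clause on the partition column is meant to have.
    """
    if wanted_days is None:
        keys = list(partitions)
    else:
        keys = [day for day in wanted_days if day in partitions]
    rows = [row for day in keys for row in partitions[day]]
    return rows, len(keys)

# Hypothetical table partitioned by day.
table = {
    "2012-06-01": ["a", "b"],
    "2012-06-02": ["c"],
    "2012-06-03": ["d", "e", "f"],
}

all_rows, full_scan_reads = scan(table)
day_rows, pruned_reads = scan(table, wanted_days=["2012-06-02"])
print(full_scan_reads, pruned_reads)
```

A query without a predicate on the partition column behaves like the first call and reads every partition.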

Key Performance Challenges

• Intermediate I/O

– Compression codec

– Block size

– Splittable formats

• Contention between jobs

• Data and Map/Reduce Distribution

• Data Skew

• Non-Uniform Computation (long running tasks)

• ‘Cost’ of new feature – is this justified?

• Tuning variables (spills, buffers, etc.)
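Data skew from the list above can be made visible with a small Python sketch. It assumes keys are assigned to reducers by hash modulo the reducer count; the helper names, the CRC32 hash, and the sample keys are choices made for this illustration and are not taken from Hadoop.

```python
from collections import Counter
import zlib

def reducer_for(key, num_reducers):
    """Assign a key to a reducer by hashing (CRC32 chosen for repeatability)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def reducer_load(keys, num_reducers):
    """Count how many records each reducer receives."""
    load = Counter({r: 0 for r in range(num_reducers)})
    for key in keys:
        load[reducer_for(key, num_reducers)] += 1
    return load

def skew_ratio(load):
    """Largest reducer load divided by the mean load (1.0 means even)."""
    mean = sum(load.values()) / len(load)
    return max(load.values()) / mean if mean else 0.0

# Made-up input: one hot key dominates.
keys = ["hot_user"] * 900 + [f"user{i}" for i in range(100)]
load = reducer_load(keys, num_reducers=4)
print(dict(load), round(skew_ratio(load), 2))
```

All records for the hot key land on one reducer, so that reducer becomes the long running task no matter how many reducers are added.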

How to handle Performance Issues?

• Profile the Job / Query?

– Who should do this? (DBA, Dev, Ops, DevOps, NoOps, Big Data Guru)

– How should we do this?

• Look at job run times day over day?

• Look at code and micro-benchmark?

• Collect Job Counters?

• Upgrade often for latest performance features?

• Investigate/purchase newer, better hardware?

– More cores? RAM? 10G Ethernet? SSD?

• Read blogs?

Test Data is not like Real Data

But how to optimize the job itself?

Understanding Map/Reduce Performance

Maximum Parallelism

Actual Mapping Parallelism

Also your own Code

Attention: Data Volume!

Attention: Potential Choke Point!

Maximum Reduce Parallelism

Actual Reduce Parallelism

Also your own Code

Millions of Executions!!!

Understanding Map/Reduce Performance

Map/Reduce Performance

Map/Reduce behind the scenes

Serialize

De-Serialize and Serialize

Potentially Inefficient

Too Many Files, Same Key spread all over

Expensive Synchronous Combine

De-Serialize and Serialize

Map/Reduce Combine and Spill Performance

1) Pre Combine in Mapping Step

2) Avoid many intermediate files and combines
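Pre-combining in the mapping step can be sketched as follows. This is plain Python rather than Hadoop code, and the function names and sample records are hypothetical; the point is only that aggregating inside the map task shrinks the intermediate output without changing the reduce result.

```python
from collections import Counter

def map_without_combine(records):
    """Emit one (key, 1) pair per record: many intermediate records."""
    return [(key, 1) for key in records]

def map_with_combine(records):
    """Pre-aggregate inside the map task and emit one pair per distinct key."""
    return sorted(Counter(records).items())

def reduce_sum(pairs):
    """Sum values per key, as the reduce step would after the shuffle."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Made-up map input: 1000 records over 3 distinct keys.
records = ["news"] * 600 + ["sports"] * 300 + ["weather"] * 100
plain = map_without_combine(records)
combined = map_with_combine(records)
print(len(plain), len(combined))   # prints "1000 3"
```

Fewer intermediate records mean less serialization, fewer spill files, and less data to shuffle to the reducers.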

Map/Reduce “Map” Performance

Focus on Big Hotspots

Avoid Brute Force

Save a lot of Hardware

Then Optimize Hadoop

Map/Reduce to the Max!

• Ensure Data Locality

• Optimize Map/Reduce Hotspots

• Reduce Intermediate Data and “Overhead”

• Ensure optimal Data and Compute Distribution

• Tune Hadoop Environment

Cassandra and Application Performance

A High Level look at RTB

1. Browsers visit Publishers and create impressions.

2. Publishers sell impressions via Exchanges.

3. Exchanges serve as auction houses for the impressions.

4. On behalf of the marketer, m6d bids on the impressions via the auction house. If m6d wins, we display our ad to the browser.

Cassandra at m6d for Real Time Bidding

• RTB: limited data is provided by the exchange

• System to store information on users

– Frequency Capping

– Visit History

– Segments (product service affinity)

• Low latency Requirements

– Less than 100 ms

– Requires fast read/write on discrete data

Cassandra design

Key Cassandra Design Tenets

• Swap/paging not possible

• Mostly schema-less

• Writes do not read

– Read/Write is an anti-pattern

• Optimize around put and get

– Not for scan and query

• De-Normalize data

– Attempt to get all data in single read*
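The "all data in a single read" tenet can be illustrated with a toy Python sketch. `CountingStore` is a hypothetical in-memory stand-in for a put/get-oriented store, not a Cassandra client, and the key names and values are made up. It compares the number of reads needed for a normalized and a denormalized layout of the same user data.

```python
class CountingStore:
    """Tiny in-memory key-value store that counts get() calls."""

    def __init__(self):
        self.data = {}
        self.reads = 0

    def put(self, key, value):
        self.data[key] = value

    def get(self, key):
        self.reads += 1
        return self.data.get(key)

# Normalized layout: one key per attribute, so a lookup needs three reads.
normalized = CountingStore()
normalized.put("freq:u1", 3)
normalized.put("visits:u1", ["news", "sports"])
normalized.put("segments:u1", ["travel"])
profile_a = {
    "frequency": normalized.get("freq:u1"),
    "visits": normalized.get("visits:u1"),
    "segments": normalized.get("segments:u1"),
}

# Denormalized layout: everything for the user under one key, one read.
denormalized = CountingStore()
denormalized.put("user:u1", {
    "frequency": 3,
    "visits": ["news", "sports"],
    "segments": ["travel"],
})
profile_b = denormalized.get("user:u1")

print(normalized.reads, denormalized.reads)   # prints "3 1"
```

With a latency budget under 100 ms per bid, each avoided round trip matters, which is why the data is stored in the shape it will be read.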

Cassandra Design Challenges

• De-normalize

– Store data to optimize reads

– Composite (multi-column) keys

• Multi-column family and Multi-tenant scenarios

• Compression settings

– Disk and cache savings

– CPU and JVM costs

• Data/Compaction settings

– Size-tiered vs. Leveled (LevelDB-style)

• Caching, Memtable and other tuning

How to handle performance issues?

• Monitor standard vitals (CPU, disk)?

• Read blogs and documentation?

• Use Cassandra JMX to track req/sec

• Use Cassandra JMX to track size of Column Families, rows and columns

• Upgrade often to get latest performance enhancements? *

What about the Application?

APM for Cassandra

NoSQL APM is not so different after all…

JavaWeb

Key APM Problems Identified

1) Response Time Contribution

2) Data Access Patterns

3) Transaction to Query Relationship (Transaction Flow)

Database

Response Time Contribution

Access Pattern

Contribution to Business Transaction

Connection Pool

Statement Analysis

Contribution to Business Transaction

Executions per Transaction, and Average and Total Execution Time

Where, Why, How and which Transaction…

Where and why in my Transaction

Single Statement Performance

Which Web Service

Which Business Transaction

How does this apply to NoSQL Databases?

Key APM Problems Identified

1) Response Time Contribution

2) Data Access Patterns

3) Transaction to Query Relationship (Transaction Flow)

NoSQL-specific additions

1) Data Access Distribution

2) End-to-End Monitoring

3) Storage (I/O, GC) Bottlenecks

4) Consistency Level

JavaWeb

Real End-to-End Application Performance

Third Party Services

External

End User

Our Application

End User Response Time Contribution

Understanding Cassandra’s Contribution

Which statements did the Transaction execute?

Which node were they executed against?

Which Consistency Level was used?

Contribution of each Statement

Too many calls? Data Access Patterns

Understand Response Time Contribution

4 Calls, ~15 ms Contribution

5 Calls, ~50-80 ms Contribution?
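Response time contribution per statement can be computed from a call trace roughly as follows. This is a sketch under assumptions: the trace format (transaction, statement, duration in ms), the statement names, and the timings are invented for illustration and do not come from any APM product.

```python
from collections import defaultdict

def contribution(calls):
    """Aggregate call count, total time and average time per statement.

    calls is a list of (transaction, statement, duration_ms) tuples,
    a hypothetical trace format.
    """
    stats = defaultdict(lambda: {"calls": 0, "total_ms": 0.0})
    for transaction, statement, duration_ms in calls:
        entry = stats[(transaction, statement)]
        entry["calls"] += 1
        entry["total_ms"] += duration_ms
    for entry in stats.values():
        entry["avg_ms"] = entry["total_ms"] / entry["calls"]
    return dict(stats)

# Made-up trace of one business transaction.
trace = [
    ("bid", "get_user_profile", 4.0),
    ("bid", "get_user_profile", 3.0),
    ("bid", "get_user_profile", 5.0),
    ("bid", "update_frequency", 12.0),
]
stats = contribution(trace)
for (transaction, statement), entry in sorted(stats.items()):
    print(transaction, statement, entry["calls"], entry["total_ms"], entry["avg_ms"])
```

Looking at calls and total time together separates a statement that is slow once from a fast statement that is simply executed too often.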

Access and Data Distribution

Why and how was a statement executed?

45ms latency? 60ms waiting on the server?

Any Hotspots on the Cassandra Nodes?

Much more load on Node3?

Which Transactions are responsible?
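The two questions above can be answered from a per-call log, sketched here in Python. The log format (node, transaction), the threshold `factor`, and the sample data are assumptions made for this example.

```python
from collections import Counter

def hot_nodes(calls, factor=1.5):
    """Return nodes whose call count exceeds factor times the mean load.

    calls is a list of (node, transaction) tuples; factor is an arbitrary
    threshold chosen for this sketch.
    """
    per_node = Counter(node for node, _ in calls)
    if not per_node:
        return []
    mean = sum(per_node.values()) / len(per_node)
    return sorted(n for n, count in per_node.items() if count > factor * mean)

def transactions_on(calls, node):
    """Count which transactions generated the load on one node."""
    return Counter(tx for n, tx in calls if n == node)

# Made-up call log: Node3 receives most of the requests.
calls = (
    [("Node1", "bid")] * 10
    + [("Node2", "bid")] * 10
    + [("Node3", "bid")] * 10
    + [("Node3", "report")] * 50
)
hot = hot_nodes(calls)
print(hot, transactions_on(calls, "Node3").most_common(1))
```

In this made-up log the extra load on Node3 comes from a single transaction type, which points at an access pattern problem rather than at the database.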

Specific Cassandra Health Metrics

General Health of Cassandra

Too many GC Suspensions?

Memory Issues?

Too many requests?

Conclusion

Extend Performance Focus on Application

JavaWeb

A Fast Database doesn’t make a fast Application

Intelligent MapReduce APM

[Diagram labels: Hive (high-level map/reduce query), JOB, Hive Server (batch trigger), master node, data/task nodes 1, 2, 3, 4 ...]

Simple Optimizations with big impact

Big Data is about solving Application Problems

APM is about Application Performance and Efficiency

THANK YOU

