
Maximizing Hadoop Performance with Hardware Compression

Robert Reiner
Director of Marketing, Data Compression and Security
Exar Corporation

Santa Clara, CA, November 2012


What is Big Data?

• “Data sets whose size is beyond the ability of typical database software tools to capture, store, and analyze” (McKinsey Global Institute)

• “Data sets so large and complex that it becomes difficult to process using on-hand database management tools.” (Wikipedia)

• “Data that’s an order of magnitude bigger than what you’re accustomed to, Grasshopper” (eBay)


Sources of Big Data

Type of Data                 Industry
Web Logs                     Social Networking, Online Transactions
RFID                         Retail, Manufacturing, Casinos
Smart Grid                   Utilities
Sensor                       Industrial Equipment, Engines
Telemetry                    Video Games
Telematics                   Auto Insurance
Text Analysis                Multiple
Time and Location Analysis   Multiple

Source: Franks, Taming the Big Data Tidal Wave


Big Data Growth

• Twitter generates >7 TB of data per day [2]
• Facebook generates >10 TB of data per day [2]
• 80% of the world’s information is unstructured [2]
• 40% projected growth in global data generated per year [1]
• 5% growth in global IT spending [1]

Sources: [1] McKinsey; [2] Zikopoulos et al., Understanding Big Data


Three Vs of Big Data

• Volume: Terabytes, Records, Transactions, Tables, Files
• Velocity: Batch, Near Time, Real Time, Streams
• Variety: Structured, Unstructured, Semistructured, All of the Above

Source: TDWI


Hadoop Addresses Big Data Challenges


MapReduce is the Core of Hadoop

• Parallel Programming Framework
• Simplifies Data Processing Across Massive Data Sets
• Enables Processing Data in a File System without First Storing It in a Database
• Ability to Process Unstructured Data
• Excels at Sifting through Huge Masses of Data to Find What Is Useful

[Diagram: Hadoop stack showing HDFS, MapReduce, Pig, Hive, Sqoop, HBase]


MapReduce Data Flow

[Figure: MapReduce data flow. Data Inputs 1-3 (HDFS) each feed a Map followed by a Sort; the sorted map outputs are copied and merged into Reduce tasks, which write Data Outputs (HDFS) with HDFS Replication.]

Source: White, Hadoop: The Definitive Guide
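The stages named in the data flow figure (Map, Sort, Copy, Merge, Reduce) can be sketched in a few lines of plain Python. This is only an in-memory illustration of the flow, not Hadoop code: the word-count job, the function names, and the byte-sum partitioner are all assumptions chosen for the example.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def sort_phase(pairs):
    """Sort: order one mapper's output by key before it is copied."""
    return sorted(pairs, key=itemgetter(0))


def partition_for(key, num_reducers):
    """Pick a reducer for a key (illustrative byte-sum partitioner)."""
    return sum(key.encode("utf-8")) % num_reducers


def copy_and_merge(sorted_outputs, num_reducers):
    """Copy each pair to its reducer, then merge the runs into key order."""
    partitions = defaultdict(list)
    for run in sorted_outputs:
        for key, value in run:
            partitions[partition_for(key, num_reducers)].append((key, value))
    return {p: sorted(pairs, key=itemgetter(0)) for p, pairs in partitions.items()}


def reduce_phase(merged):
    """Reduce: sum the values of each key in one merged partition."""
    return {key: sum(value for _, value in group)
            for key, group in groupby(merged, key=itemgetter(0))}


def run_job(splits, num_reducers=2):
    """Run Map and Sort per split, then Copy/Merge and Reduce per partition."""
    sorted_outputs = [sort_phase(map_phase(split)) for split in splits]
    partitions = copy_and_merge(sorted_outputs, num_reducers)
    return [reduce_phase(partitions.get(p, [])) for p in range(num_reducers)]


if __name__ == "__main__":
    # Three splits stand in for Data Inputs 1-3 in the figure.
    splits = [["big data big"], ["data flow"], ["big flow flow"]]
    for reducer_id, output in enumerate(run_job(splits)):
        print(f"reducer {reducer_id}: {output}")
```

Every pair that leaves `sort_phase` is moved again in `copy_and_merge`, which is where a real cluster pays for disk and network I/O.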


MapReduce I/O Bottlenecks

[Figure: MapReduce data flow diagram (Data Inputs from HDFS, Map, Sort, Copy, Merge, Reduce, Data Outputs to HDFS, HDFS Replication), illustrating the I/O bottlenecks.]


Compression Reduces Networking and Disk I/O – Addresses Bottlenecks

[Figure: MapReduce data flow diagram (Data Inputs from HDFS, Map, Sort, Copy, Merge, Reduce, Data Outputs to HDFS, HDFS Replication).]
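To make the I/O argument concrete, the following sketch counts the bytes that intermediate map output would put on disk and on the network with and without compression. It is a rough illustration only: the synthetic log-like records and the helper names are assumptions, and `zlib` (Deflate) from the Python standard library stands in for whichever codec a cluster actually uses.

```python
import zlib


def make_map_output(mapper_id, records=5000):
    """Build hypothetical, log-like intermediate data for one mapper."""
    lines = (
        f"mapper={mapper_id} key={i % 50:04d} status=OK bytes={i % 7 * 512}\n"
        for i in range(records)
    )
    return "".join(lines).encode("utf-8")


def shuffle_bytes(map_outputs, compress=False):
    """Total bytes written and copied for the given map outputs."""
    total = 0
    for payload in map_outputs:
        total += len(zlib.compress(payload)) if compress else len(payload)
    return total


map_outputs = [make_map_output(m) for m in range(3)]
raw = shuffle_bytes(map_outputs)
packed = shuffle_bytes(map_outputs, compress=True)
print(f"uncompressed: {raw} bytes")
print(f"compressed:   {packed} bytes ({raw / packed:.1f}x smaller)")
```

The ratio printed here depends entirely on the synthetic data; real map output will compress differently.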


Hadoop Software Compression Codecs

Codec          Compression Performance   Compression Ratio   CPU Overhead
Deflate/Gzip   Low                       High                High
Bzip2          Very Low                  Very High           Very High
LZO            Medium                    Medium              Medium

Source: White, Hadoop: The Definitive Guide
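The speed versus ratio trade-off in the table can be explored with a small measurement script. This is a sketch under stated assumptions: it uses the `zlib` and `bz2` modules from the Python standard library as stand-ins for the Deflate/Gzip and Bzip2 codecs, leaves out LZO because the standard library does not include it, and runs on synthetic text, so the numbers it prints are not comparable to Hadoop codec results.

```python
import bz2
import time
import zlib

# Standard-library stand-ins for two of the codecs in the table.
CODECS = {
    "deflate/gzip (zlib)": (zlib.compress, zlib.decompress),
    "bzip2 (bz2)": (bz2.compress, bz2.decompress),
}


def benchmark(data):
    """Measure compression ratio and throughput of each codec on data."""
    results = {}
    for name, (compress, decompress) in CODECS.items():
        start = time.perf_counter()
        packed = compress(data)
        elapsed = max(time.perf_counter() - start, 1e-9)
        if decompress(packed) != data:
            raise ValueError(f"{name} did not round-trip the input")
        results[name] = {
            "ratio": len(data) / len(packed),
            "mb_per_sec": len(data) / 1e6 / elapsed,
        }
    return results


if __name__ == "__main__":
    # Hypothetical sample input; replace with representative data.
    sample = b"".join(
        f"{i:08d} GET /item/{i % 97} 200 {i % 4096}\n".encode("ascii")
        for i in range(50000)
    )
    for name, stats in benchmark(sample).items():
        print(f"{name:22s} ratio {stats['ratio']:.2f}  "
              f"{stats['mb_per_sec']:.1f} MB/s")
```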


Hardware Accelerated Compression

[Diagram: Hadoop stack showing HDFS, MapReduce, Pig, Hive, Sqoop, HBase, and a Codec block]

• Processor-Intensive Compression Algorithms Executed in Hardware
• Increases Performance
• Reduces MapReduce Execution Time
• Offloads CPU
• Lowers Energy Consumption


Hardware Accelerated Compression - Exar Solutions

[Diagram: Hadoop stack showing HDFS, MapReduce, Pig, Hive, Sqoop, HBase, and a Codec block]

• DX1800 and DX1700 Series PCIe Cards
• Java codec calls C libraries and invokes card SDK
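The layering described above (a codec that delegates to a native library, which in turn drives the card) can be pictured with a small facade. The sketch below is written in Python rather than Java and is purely illustrative: `HardwareBackend`, `SoftwareBackend`, `open_codec`, and the library name `libhypothetical_card_sdk.so` are all made up for this example and do not correspond to the Exar SDK or to Hadoop's codec interfaces.

```python
import ctypes
import zlib


class SoftwareBackend:
    """Software path: compress on the CPU with zlib."""

    name = "software-zlib"

    def compress(self, data: bytes) -> bytes:
        return zlib.compress(data)

    def decompress(self, data: bytes) -> bytes:
        return zlib.decompress(data)


class HardwareBackend:
    """Placeholder for a card-backed codec reached through a C library."""

    name = "hardware"

    def __init__(self, library_name="libhypothetical_card_sdk.so"):
        # The library name is invented; loading raises OSError when the
        # shared library is not installed on the machine.
        self._sdk = ctypes.CDLL(library_name)

    def compress(self, data: bytes) -> bytes:
        raise NotImplementedError("vendor-specific SDK calls are not shown")

    def decompress(self, data: bytes) -> bytes:
        raise NotImplementedError("vendor-specific SDK calls are not shown")


def open_codec():
    """Prefer the hardware backend and fall back to software if absent."""
    try:
        return HardwareBackend()
    except OSError:
        return SoftwareBackend()


if __name__ == "__main__":
    codec = open_codec()
    payload = b"hadoop " * 1000
    packed = codec.compress(payload)
    print(f"backend={codec.name} in={len(payload)} out={len(packed)}")
```

Keeping the backend behind one small interface means callers do not change when the card is present or absent.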


Hardware and Software Compression for Hadoop

Codec             Compression Performance   Compression Ratio   CPU Overhead
HW eLZS           Very High                 Medium              Low
HW Deflate/Gzip   Very High                 High                Low
Deflate/Gzip      Low                       High                High
Bzip2             Very Low                  Very High           Very High
LZO               Medium                    Medium              Medium


Hadoop Codec Benchmarking

• Benchmarked Multiple Codecs using Terasort
• Terasort Input Size: 100 GB
• Three-Node Hadoop Cluster
  • 1GbE Switch Interconnect
• Hadoop version 1.0.0
• Node Configuration
  • Dual E5620 per node (8 cores, 16 threads)
  • 16 GB DRAM
  • 4 x SATA-300 (108 MB/sec, 500 GB)
  • RHEL 5.4


Benchmarking Results – SW LZO and HW eLZS Codecs

SW LZO
• Total Time: 24 min, 8 sec
• Compression Ratio: 5.126
• Energy: 0.2478 kWh

HW eLZS**
• Total Time: 19 min, 8 sec
• Compression Ratio: 5.136
• Energy: 0.2061 kWh

[Chart labels: 19 min, 1 sec; 24 min, 10 sec]

**HW eLZS benchmarked with DX 1845 Compression and Security Acceleration Card


Benchmarking Results – SW Gzip and HW Gzip Codecs

SW Gzip
• Total Time: 44 min, 11 sec
• Compression Ratio: 7.645

HW Gzip**
• Total Time: 15 min, 56 sec
• Compression Ratio: 6.329

**HW Gzip benchmarked with next generation Compression and Security Acceleration product prototype


Terasort Benchmarking Summary

• HW eLZS is 21% faster than SW LZO
• HW Gzip is 64% faster than SW Gzip
• HW Gzip provides the fastest Terasort time of all codecs benchmarked
• HW Accelerated Compression is more energy efficient than SW compression
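The summary percentages can be reproduced from the total times and energy figures reported on the two results slides. The short calculation below reads "faster" as the reduction in total Terasort time relative to the software codec, which is an interpretation rather than something the slides state; the helper names are illustrative.

```python
def seconds(minutes, secs):
    """Convert a 'min, sec' reading from the slides into seconds."""
    return minutes * 60 + secs


def percent_reduction(baseline, improved):
    """Reduction of improved relative to baseline, in percent."""
    return (baseline - improved) / baseline * 100


# Total Terasort times reported on the results slides.
sw_lzo, hw_elzs = seconds(24, 8), seconds(19, 8)
sw_gzip, hw_gzip = seconds(44, 11), seconds(15, 56)

# Energy figures (kWh) reported for the LZO / eLZS comparison.
sw_lzo_kwh, hw_elzs_kwh = 0.2478, 0.2061

print(f"HW eLZS vs SW LZO time:   {percent_reduction(sw_lzo, hw_elzs):.1f}% less")
print(f"HW Gzip vs SW Gzip time:  {percent_reduction(sw_gzip, hw_gzip):.1f}% less")
print(f"HW eLZS vs SW LZO energy: {percent_reduction(sw_lzo_kwh, hw_elzs_kwh):.1f}% less")
```

Rounded to whole percentages, the two time reductions match the 21% and 64% figures in the summary.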


Future of Hardware Acceleration for Hadoop

• Security
  • Add encryption to Hardware-Accelerated Codec
  • Exar’s technology enables single pass compression and encryption
• Enhanced Benchmarks
  • Expand beyond Terasort to include benchmarks that represent additional production workloads

[Diagram labels: Compression, Encryption, Hashing]
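The idea of a single pass can be illustrated in software: each chunk of input is read once and handed to more than one transform before the next chunk is read. The sketch below combines compression with hashing, two of the three functions named on the slide. Encryption is left out because the Python standard library has no cipher, so hashing stands in for the second operation here; the function name and chunk size are assumptions, and nothing in it reflects how the hardware works internally.

```python
import hashlib
import io
import zlib


def compress_and_hash(stream, chunk_size=64 * 1024):
    """Read the stream once, compressing and hashing each chunk."""
    compressor = zlib.compressobj()
    digest = hashlib.sha256()
    compressed = bytearray()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)                       # hash the plaintext chunk
        compressed += compressor.compress(chunk)   # compress the same chunk
    compressed += compressor.flush()
    return bytes(compressed), digest.hexdigest()


if __name__ == "__main__":
    data = b"terasort record " * 100000
    packed, sha256_hex = compress_and_hash(io.BytesIO(data))
    print(f"in={len(data)} out={len(packed)} sha256={sha256_hex[:16]}...")
```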


Thank You

Robert Reiner
Director of Marketing, Data Compression and Security
Exar [email protected]