gyorgy balogh modern_big_data_technologies_sec_world_2014

BIG DATA – MODERN TECHNOLOGIES. György Balogh, LogDrill Ltd. SECWorld – 7 May 2014


DESCRIPTION

György Balogh gave a presentation at the SECWorld 2014 conference about cutting-edge yet affordable Big Data technologies.

TRANSCRIPT

Page 1:

BIG DATA – MODERN TECHNOLOGIES

György Balogh, LogDrill Ltd.

SECWorld – 7 May 2014

Page 2:

AGENDA

• What is Big Data?
• Why do we have to talk about it?
• Paradigm shift in information management
• Technology and efficiency

Page 3:

WHAT IS BIG DATA?

• Data volume that cannot be handled by traditional solutions (e.g. a relational database)

• More than 100 million data rows, typically multi-billion

Page 4:
Page 5:

GLOBAL RATE OF DATA PRODUCTION (PER SECOND)

• 30 TB/sec (22 000 films)
• Digital media
  • 2 hours of YouTube video
• Communication
  • 3000 business emails
  • 300 000 SMS
• Web
  • Half a million page views
• Logs
  • Billions

Page 6:

BIG DATA MARKET

Page 7:

HYPE OR REALITY?

Page 8:

WHY NOW?

● Long-term trends
  ○ The amount of stored data has doubled every 40 months since the 1980s
  ○ Moore's law: the number of transistors on integrated circuits doubles every 18 months

Page 9:

DIFFERENT EXPONENTIAL TRENDS

Page 10:

HARD DRIVES IN 1991 AND 2012

● 1991
  ● 40 MB
  ● 3500 RPM
  ● 0.7 MB/sec
  ● full scan: 1 minute
● 2012
  ● 4 TB (×100 000)
  ● 7200 RPM
  ● 120 MB/sec (×170)
  ● full scan: 8 hours (×480)
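A quick back-of-envelope check of these full-scan times: scan time is simply capacity divided by sequential throughput. Using the slide's own figures, the 2012 result comes out closer to 9 hours (the slide rounds down):

```python
def full_scan_seconds(capacity_mb, throughput_mb_s):
    # A full sequential scan takes capacity / throughput seconds.
    return capacity_mb / throughput_mb_s

scan_1991 = full_scan_seconds(40, 0.7)           # 40 MB drive, 0.7 MB/s
scan_2012 = full_scan_seconds(4_000_000, 120.0)  # 4 TB drive, 120 MB/s

print(f"1991: {scan_1991:.0f} s (about 1 minute)")
print(f"2012: {scan_2012 / 3600:.1f} h")
print(f"full-scan time grew about {scan_2012 / scan_1991:.0f}x")
```

Capacity grew five orders of magnitude while throughput grew two, so the time needed to read the whole drive exploded, which is exactly the point of the next slide.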

Page 11:

DATA ACCESS BECOMES THE SCARCE RESOURCE!

Page 12:

GOOGLE’S HARDWARE IN 1998

Page 13:

GOOGLE’S HARDWARE IN 2013

• 12 data centers worldwide
• More than a million nodes
• A data center costs $600 million to build
• Oregon data center
  • 15 000 m²
  • power of 30 000 homes

Page 14:

GOOGLE’S HARDWARE IN 2013

• Cheap commodity hardware
  • each server has its own battery!
• Modular data centers
  • Standard containers
  • 1160 servers per container
• Efficiency: 11% overhead (power transformation, cooling)

Page 15:

THE BIG DATA PARADIGM SHIFT

Page 16:

TECHNOLOGIES

• Hadoop 2.0
• Google BigQuery
• Cloudera Impala
• Apache Spark

Page 17:

HADOOP DISTRIBUTED FILE SYSTEM (HDFS)

Page 18:

HADOOP MAP REDUCE
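The flow on this slide can be sketched as a toy, single-process MapReduce in Python; a word count, the canonical example. Hadoop runs the same map / shuffle / reduce pattern, but distributed over HDFS blocks across many nodes:

```python
from collections import defaultdict

def map_phase(lines):
    # map: emit a (word, 1) pair for every word in the input split
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # shuffle + reduce: group pairs by key and sum the counts
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

logs = ["GET /a/b/c 200", "GET /a/b/c 404", "POST /a/b/c 200"]
result = reduce_phase(map_phase(logs))
print(result)  # {'GET': 2, '/a/b/c': 3, '200': 2, '404': 1, 'POST': 1}
```

In real Hadoop the map tasks run on the nodes that hold the data blocks, and the shuffle moves each key to the reducer responsible for it; the program structure is the same.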

Page 19:

HADOOP

• Who uses Hadoop?
  • Facebook: 100 PB
  • Yahoo: 4000 nodes
  • More than half of the Fortune 50 companies!
• History
  • A replica of the Google architecture (GFS, BigTable), written in Java under the Apache licence
• Hadoop 2.0
  • Full High Availability
  • Advanced resource management (YARN)

Page 20:

GOOGLE BIG QUERY

• SQL queries on terabytes of data in seconds

• Data is distributed over thousands of nodes

• Each node processes one part of the dataset

• Thousands of nodes work for us for a few milliseconds

select year,
       SUM(mother_age * record_weight) / SUM(record_weight) as age
from publicdata:samples.natality
where ever_born = 1
group by year
order by year;

Page 21:

GOOGLE BIG QUERY


Page 22:

CLOUDERA IMPALA

• Same as BigQuery, but on top of Hadoop
• Standard SQL on Big Data
• On a 10 million HUF cluster, terabytes of data can be analyzed interactively
• Scales to thousands of nodes
• Technical highlights
  • Run-time code generation with LLVM
  • Parquet format (column oriented)

Page 23:

APACHE SPARK

• UC Berkeley
• Achieves a 100× speedup over Hadoop on certain tasks
• In-memory computation within the cluster

Page 24:

INEFFICIENCY CAN WASTE A HUGE AMOUNT OF RESOURCES

• A 300-node cluster running Hadoop + Hive
  =
• One node running Vectorwise
• Vectorwise holds the world speed record in analytical database queries on a single node

Page 25:

CLEVER WAYS TO IMPROVE EFFICIENCY

• Lossless data compression (even 50×!)
• Clever lossy compression of data (e.g. OLAP cubes)
• Cache-aware implementations (asymmetric trends, memory access bottleneck)

Page 26:

LOSSLESS DATA COMPRESSION

• Compression can boost sequential data access even 50×! (100 MB/sec → 5 GB/sec)
• Less data → fewer I/O operations
• One CPU can decompress data at up to 5 GB/sec
• gzip decompression is very slow
• snappy, LZO and LZ4 can reach 1 GB/sec decompression speed
• The decompression used by column-oriented databases can reach 5 GB/sec (PFOR)
• That is two billion integers per second (almost one integer per clock cycle!)
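The 50× claim follows from simple arithmetic: storing data compressed multiplies the disk's effective output rate by the compression ratio, but only if the CPU can decompress fast enough. A small sketch with a hypothetical helper (the 150 MB/s gzip figure is an assumed ballpark for a slow codec, not from the slides):

```python
def effective_rate_mb_s(disk_mb_s, ratio, decompress_mb_s):
    # Effective sequential read rate of compressed data: capped by the
    # slower of (disk input expanded by the ratio) and CPU decompression.
    return min(disk_mb_s * ratio, decompress_mb_s)

# Slide's numbers: 100 MB/s disk, 50x compression, 5 GB/s decompression (PFOR)
fast = effective_rate_mb_s(100, 50, 5000)
# With a slow codec like gzip, the CPU becomes the bottleneck
slow = effective_rate_mb_s(100, 50, 150)

print(fast)  # 5000 MB/s = 5 GB/s
print(slow)  # 150 MB/s
```

This is why the codec choice matters: with gzip the CPU, not the disk, limits throughput, while lightweight schemes like PFOR let a single core keep up with the fully expanded stream.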

Page 27:

EXAMPLE: LOGDRILL

2011-01-08 00:00:01 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 22957 562

2011-01-08 00:00:09 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 2957 321

2011-01-08 00:01:04 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 43422 522

2011-01-08 00:01:08 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 234 425

2011-01-08 00:02:23 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 404 0 0 234 432

2011-01-08 00:02:45 X1 Y1 1.2.3.4 POST /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 4353 134

Aggregated per minute by method and status code:

minute            method  status  count
2011-01-08 00:00  GET     200     2
2011-01-08 00:01  GET     200     2
2011-01-08 00:02  GET     404     1
2011-01-08 00:02  POST    200     1
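The per-minute aggregation shown above can be reproduced with a short Python sketch. The field positions are read off the sample lines; a real log processor would parse according to the server's configured log format:

```python
from collections import Counter

raw = [
    "2011-01-08 00:00:01 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 22957 562",
    "2011-01-08 00:00:09 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 2957 321",
    "2011-01-08 00:01:04 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 43422 522",
    "2011-01-08 00:01:08 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 234 425",
    "2011-01-08 00:02:23 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 404 0 0 234 432",
    "2011-01-08 00:02:45 X1 Y1 1.2.3.4 POST /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 4353 134",
]

counts = Counter()
for line in raw:
    f = line.split()
    minute = f[0] + " " + f[1][:5]   # truncate seconds -> per-minute bucket
    method, status = f[5], f[11]     # HTTP method and status code fields
    counts[(minute, method, status)] += 1

for (minute, method, status), n in sorted(counts.items()):
    print(minute, method, status, n)
```

Pre-aggregating raw logs like this (a lossy but query-friendly summary, in the spirit of the OLAP cubes mentioned earlier) is how billions of log rows stay interactively queryable.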

Page 28:

CACHE AWARE PROGRAMMING

• CPU speed has been increasing about 60% a year
• Memory speed has been increasing only about 10% a year
• The growing gap is bridged with multi-level cache memories
• Cache is under-exploited: up to a 100× speedup is possible!

Page 29:

LESSONS LEARNED

• Big Data is not hype, at least from the technological viewpoint

• Modern technologies (Impala, Spark) can reach the theoretical limits of the cluster hardware configuration

• A deep understanding of both the problem and the technologies is required to create efficient Big Data solutions

Page 30:

THANK YOU! Q&A?