DESCRIPTION
György Balogh gave a presentation at the SECWorld 2014 conference about cutting-edge yet affordable Big Data technologies.

TRANSCRIPT
BIG DATA: MODERN TECHNOLOGIES
György Balogh, LogDrill Ltd.
SECWorld – 7 May 2014
AGENDA
• What is Big Data?
• Why do we have to talk about it?
• Paradigm shift in information management
• Technology and efficiency
WHAT IS BIG DATA?
• Data volumes that cannot be handled by traditional solutions (e.g. relational databases)
• More than 100 million data rows, typically multiple billions
GLOBAL RATE OF DATA PRODUCTION (PER SECOND)
• 30 TB/sec (the equivalent of 22,000 films)
• Digital media: 2 hours of YouTube video
• Communication: 3,000 business emails, 300,000 SMS
• Web: half a million page views
• Logs: billions of entries
BIG DATA MARKET
HYPE OR REALITY?
WHY NOW?
● Long-term trends
○ The amount of stored data has doubled every 40 months since the 1980s
○ Moore's law: the number of transistors on integrated circuits doubles every 18 months
DIFFERENT EXPONENTIAL TRENDS
HARD DRIVES IN 1991 AND 2012
● 1991: 40 MB, 3500 RPM, 0.7 MB/sec, full scan: 1 minute
● 2012: 4 TB (×100,000), 7200 RPM, 120 MB/sec (×170), full scan: 8 hours (×480)
DATA ACCESS BECOMES THE SCARCE RESOURCE!
GOOGLE’S HARDWARE IN 1998
GOOGLE’S HARDWARE IN 2013
• 12 data centers worldwide
• More than a million nodes
• A data center costs $600 million to build
• Oregon data center: 15,000 m², consumes the power of 30,000 homes
GOOGLE’S HARDWARE IN 2013
• Cheap commodity hardware, each server with its own battery!
• Modular data centers: standard containers, 1,160 servers per container
• Efficiency: 11% overhead (power transformation, cooling)
THE BIG DATA PARADIGM SHIFT
TECHNOLOGIES
• Hadoop 2.0
• Google BigQuery
• Cloudera Impala
• Apache Spark
HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
HADOOP MAP REDUCE
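The MapReduce model splits a computation into a map phase that emits key–value pairs and a reduce phase that aggregates all pairs sharing a key. A minimal single-process Python sketch of the model (word count, the canonical example; this illustrates the programming model only and is not Hadoop code):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Mapper: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle: group pairs by key (Hadoop does this with a distributed sort).
    pairs = sorted(pairs, key=itemgetter(0))
    # Reducer: sum the counts for each word.
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["big data big cluster", "data cluster data"]
counts = dict(reduce_phase(map_phase(lines)))
print(counts)  # {'big': 2, 'cluster': 2, 'data': 3}
```

In a real Hadoop job the mappers and reducers run on many nodes, with the framework handling the shuffle, fault tolerance and data locality.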
HADOOP
• Who uses Hadoop?
• Facebook: 100 PB
• Yahoo: 4,000 nodes
• More than half of the Fortune 50 companies!
• History: a replica of the Google architecture (GFS, BigTable), written in Java under the Apache licence
• Hadoop 2.0: full high availability, advanced resource management (YARN)
GOOGLE BIG QUERY
• SQL queries on terabytes of data in seconds
• Data is distributed over thousands of nodes
• Each node processes one part of the dataset
• Thousands of nodes work for us for a few milliseconds
select year, SUM(mother_age * record_weight) / SUM(record_weight) as age
from publicdata:samples.natality
where ever_born = 1
group by year
order by year;
CLOUDERA IMPALA
• The same idea as BigQuery, on top of Hadoop
• Standard SQL on Big Data
• On a 10 million HUF cluster, terabytes of data can be analyzed interactively
• Scales to thousands of nodes
• Technical highlights: run-time code generation with LLVM; Parquet format (column-oriented)
APACHE SPARK
• Developed at UC Berkeley
• Achieves a 100× speedup over Hadoop on certain tasks
• In-memory computation across the cluster
INEFFICIENCY CAN WASTE A HUGE AMOUNT OF RESOURCES
• 300-node cluster running Hadoop and Hive
≈
• One node running Vectorwise
• Vectorwise holds the world speed record in analytical database queries on a single node
CLEVER WAYS TO IMPROVE EFFICIENCY
• Lossless data compression (up to 50×!)
• Clever lossy compression of data (e.g. OLAP cubes)
• Cache-aware implementations (asymmetric trends, memory access bottleneck)
LOSSLESS DATA COMPRESSION
• Compression can boost sequential data access by up to 50× (100 MB/sec → 5 GB/sec)
• Less data → fewer I/O operations
• A single CPU core can decompress data at up to 5 GB/sec
• gzip decompression is very slow
• snappy, LZO and LZ4 can reach 1 GB/sec decompression speed
• The decompression schemes used by column-oriented databases can reach 5 GB/sec (PFOR)
• That is two billion integers per second: almost one integer per clock cycle!
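The effect is easy to demonstrate on log data, which is highly repetitive. A small sketch using Python's standard zlib (stated assumption: zlib stands in for snappy/LZ4 here purely for reproducibility; the real codecs trade some ratio for much higher decompression speed):

```python
import zlib

# A repetitive web-server log: 10,000 near-identical lines.
line = "2011-01-08 00:00:01 1.2.3.4 GET /a/b/c HTTP/1.1 200 22957\n"
raw = (line * 10_000).encode()

packed = zlib.compress(raw, level=6)
ratio = len(raw) / len(packed)

print(f"{len(raw):,} bytes -> {len(packed):,} bytes ({ratio:.0f}x)")
# On data this repetitive the ratio comfortably exceeds 50x, so a disk
# delivering 100 MB/sec of compressed bytes feeds the CPU with the
# equivalent of several GB/sec of raw log lines.
```

This is why "less data → fewer I/O operations" translates directly into throughput: the disk, not the CPU, is the scarce resource.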
EXAMPLE: LOGDRILL
2011-01-08 00:00:01 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 22957 562
2011-01-08 00:00:09 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 2957 321
2011-01-08 00:01:04 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 43422 522
2011-01-08 00:01:08 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 234 425
2011-01-08 00:02:23 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 404 0 0 234 432
2011-01-08 00:02:45 X1 Y1 1.2.3.4 POST /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 4353 134
Aggregated (minute, HTTP method, status code → request count):
2011-01-08 00:00  GET   200  2
2011-01-08 00:01  GET   200  2
2011-01-08 00:02  GET   404  1
2011-01-08 00:02  POST  200  1
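The aggregation shown above can be reproduced in a few lines of Python: parse each raw log line, truncate the timestamp to the minute, and count requests per (minute, method, status) key. (The field positions below match the sample lines; a real LogDrill pipeline is of course far more elaborate than this sketch.)

```python
from collections import Counter

raw_logs = [
    "2011-01-08 00:00:01 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 22957 562",
    "2011-01-08 00:00:09 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 2957 321",
    "2011-01-08 00:01:04 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 43422 522",
    "2011-01-08 00:01:08 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 234 425",
    "2011-01-08 00:02:23 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 404 0 0 234 432",
    "2011-01-08 00:02:45 X1 Y1 1.2.3.4 POST /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 4353 134",
]

cube = Counter()
for entry in raw_logs:
    f = entry.split()
    minute = f"{f[0]} {f[1][:5]}"   # timestamp truncated to the minute
    method, status = f[5], f[11]    # HTTP method and status code
    cube[(minute, method, status)] += 1

for (minute, method, status), n in sorted(cube.items()):
    print(minute, method, status, n)
```

This pre-aggregation is exactly the lossy "OLAP cube" compression mentioned earlier: six raw rows collapse into four cube cells, and queries over the cube never touch the raw log again.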
CACHE-AWARE PROGRAMMING
• CPU speed has been increasing by about 60% a year
• Memory speed by only about 10% a year
• The widening gap is bridged with multi-level cache memories
• Cache is under-exploited: exploiting it properly can yield a 100× speedup!
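The classic illustration of cache awareness is traversal order over a matrix stored row by row. A Python sketch of the two access patterns (note: CPython's interpreter overhead hides most of the gap; in a compiled language such as C the row-major loop can be an order of magnitude faster on large matrices):

```python
N = 1_000
grid = [[1] * N for _ in range(N)]  # N x N matrix, stored row by row

def sum_row_major(m):
    # Visits elements in storage order: consecutive accesses fall on the
    # same cache line, so the hardware prefetcher keeps the CPU fed.
    total = 0
    for row in m:
        for v in row:
            total += v
    return total

def sum_column_major(m):
    # Jumps a whole row between consecutive accesses: each access tends
    # to touch a different cache line, defeating the prefetcher.
    total = 0
    for j in range(len(m[0])):
        for i in range(len(m)):
            total += m[i][j]
    return total

# Both orders compute the same result; only the memory traffic differs.
assert sum_row_major(grid) == sum_column_major(grid) == N * N
```

The same idea underlies column-oriented formats like Parquet: laying data out to match the access pattern turns the memory hierarchy from a bottleneck into an accelerator.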
LESSONS LEARNED
• Big Data is not hype, at least from a technological viewpoint
• Modern technologies (Impala, Spark) can approach the theoretical limits of the cluster hardware configuration
• A deep understanding of both the problem and the technologies is required to create efficient Big Data solutions
THANK YOU!
Q&A?