MAKING BIG DATA, SMALL
Using distributed systems for processing, analysing and managing large data sets
Marcin Jedyk
Software Professional’s Network, Cheshire Datasystems Ltd
WARM-UP QUESTIONS
How many of you have heard of Big Data before?
How many about NoSQL?
Hadoop?
AGENDA
Intro – motivation, goal and ‘not about…’
What is Big Data?
NoSQL and systems classification
Hadoop & HDFS
MapReduce & live demo
HBase
AGENDA
Pig
Building Hadoop cluster
Conclusions
Q&A
MOTIVATION
Data is everywhere – why not analyse it?
With Hadoop and NoSQL systems, building distributed systems is easier than ever before
Relying on software & cheap hardware rather than expensive hardware works better!
GOAL
To explain basic ideas behind Big Data
To present different approaches towards BD
To show that Big Data systems are easy to build
To show you where to start with such systems
WHAT IS IT NOT ABOUT?
Not a detailed lecture on a single system
Not about advanced techniques in Big Data
Not only about technology – but also about its
application
WHAT IS BIG DATA?
Data characterised by 3 Vs:
Volume
Variety
Velocity
The interesting ones: variety & velocity
WHAT IS BIG DATA?
Data of high velocity: can't store it? Process it on the fly!
Data of high variety: doesn't fit a relational schema? Drop the schema, use NoSQL!
Data which is impractical to process on a single server
NOSQL
Goes hand in hand with Big Data
NoSQL – an umbrella term for non-relational databases and data stores
It's not always possible to replace an RDBMS with NoSQL! (the opposite is also true)
NOSQL
NoSQL DBs are built around different principles
Key-value stores: e.g. Redis, Riak
Document stores: e.g. MongoDB – each record is a document with its own metadata (JSON-like, BSON)
Table stores: e.g. HBase – data persisted in multiple columns (even millions), billions of rows and multiple versions of records
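The three storage models above can be contrasted with a small sketch. This is illustrative only – plain Python structures standing in for the real Redis/MongoDB/HBase APIs, with hypothetical keys and field names:

```python
# Key-value store (Redis-like): an opaque value looked up by a single key.
kv_store = {"user:1001": "Marcin,Cheshire"}

# Document store (MongoDB-like): each entry is a self-describing document (JSON/BSON-like).
doc_store = {"user:1001": {"name": "Marcin", "company": "Cheshire Datasystems"}}

# Table store (HBase-like): row key -> column family -> column -> timestamped versions per cell.
table_store = {
    "user:1001": {
        "info": {"name": {1368000000: "Marcin"}},
    }
}

# The document and table models keep structure queryable; the key-value model does not.
print(doc_store["user:1001"]["name"])
print(table_store["user:1001"]["info"]["name"][1368000000])
```

The point of the comparison: the richer the model, the more the store itself knows about your data – and the more it can do for you at query time.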
HADOOP
Existed before ‘Big Data’ buzzword emerged
A simple idea – MapReduce
A primary purpose – to crunch tera- and
petabytes of data
HDFS as underlying distributed file system
HADOOP – ARCHITECTURE BY EXAMPLE
Imagine you need to process 1TB of logs
What would you need?
A server!
HADOOP – ARCHITECTURE BY EXAMPLE
But 1TB is quite a lot of data… we want it
quicker!
Ok, what about distributed environment?
HADOOP – ARCHITECTURE BY EXAMPLE
So what about that Hadoop stuff?
Each node can: store data & process it (DataNode
& TaskTracker)
HADOOP – ARCHITECTURE BY EXAMPLE
How about allocating jobs to slaves? We need a
JobTracker!
HADOOP – ARCHITECTURE BY EXAMPLE
How about HDFS – how are data blocks assembled into files?
The NameNode does it.
HADOOP – ARCHITECTURE BY EXAMPLE
NameNode – manages HDFS metadata, doesn’t deal with files directly
JobTracker – schedules, allocates and monitors job execution on slaves – TaskTrackers
TaskTracker – runs MapReduce operations
DataNode – stores blocks of HDFS – default replication level for each block: 3
HADOOP - LIMITATIONS
DataNodes & TaskTrackers are fault tolerant
NameNode & JobTracker are NOT! (workarounds exist for this problem)
HDFS deals nicely with large files, doesn’t do
well with billions of small files
MAPREDUCE
MapReduce – parallelisation approach
Two main stages:
Map – does an actual bit of work, e.g. extracts info
Reduce – summarises, aggregates or filters the outputs of Map operations
For each job there are multiple Map and Reduce operations – each may run on a different node = parallelism
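The two stages above can be sketched in a few lines. A minimal, single-process sketch in Python (Hadoop itself runs these stages in Java, distributed across many nodes); the shuffle step that groups Map output by key is implicit in Hadoop but written out here:

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Toy MapReduce: map each record to (key, value) pairs,
    group values by key (the 'shuffle'), then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

# The classic word count as a mapper/reducer pair:
words = map_reduce(
    ["big data", "big systems"],
    mapper=lambda line: [(w, 1) for w in line.split()],
    reducer=lambda key, values: sum(values),
)
print(words)  # {'big': 2, 'data': 1, 'systems': 1}
```

Because each `mapper` call and each `reducer` call is independent, Hadoop can run them on different nodes – that independence is where the parallelism comes from.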
MAPREDUCE – AN EXAMPLE
Let’s process 1TB of raw logs and extract traffic by host.
After the job is submitted, the JobTracker allocates tasks to slaves – the input may be divided into 64MB chunks = 16,384 Map operations!
Map – analyse the logs and return the results as a set of <key,value> pairs
Reduce – merge the outputs of the Map operations
MAPREDUCE – AN EXAMPLE
Take a look at a mocked log extract:
[IP – bandwidth]
10.0.0.1 – 1234
10.0.0.1 – 900
10.0.0.2 – 1230
10.0.0.3 – 999
MAPREDUCE – AN EXAMPLE
It’s important to define the key – in this case, the IP
<10.0.0.1;2134>
<10.0.0.2;1230>
<10.0.0.3;999>
Now, assume another Map operation returned:
<10.0.0.1;1500>
<10.0.0.3;1000>
<10.0.0.4;500>
MAPREDUCE – AN EXAMPLE
Now, Reduce will merge those results:
<10.0.0.1;3634>
<10.0.0.2;1230>
<10.0.0.3;1999>
<10.0.0.4;500>
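The merge above can be sketched in Python: the two Map outputs are taken from the slides, and the Reduce recomputes the per-IP totals (a single-process sketch; in Hadoop the Map outputs would arrive from different nodes):

```python
from collections import defaultdict

# Outputs of the two Map operations from the example.
map_output_1 = [("10.0.0.1", 2134), ("10.0.0.2", 1230), ("10.0.0.3", 999)]
map_output_2 = [("10.0.0.1", 1500), ("10.0.0.3", 1000), ("10.0.0.4", 500)]

def reduce_traffic(*map_outputs):
    """Reduce stage: sum the bandwidth values for each IP key."""
    totals = defaultdict(int)
    for output in map_outputs:
        for ip, bandwidth in output:
            totals[ip] += bandwidth
    return dict(totals)

print(reduce_traffic(map_output_1, map_output_2))
# {'10.0.0.1': 3634, '10.0.0.2': 1230, '10.0.0.3': 1999, '10.0.0.4': 500}
```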
MAPREDUCE
Selecting a key is important
It’s possible to define a composite key, e.g. IP+date
For more complex tasks, it’s possible to chain
MapReduce jobs
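A composite key only changes what the mapper emits. A small sketch, assuming a hypothetical log format with a date field (the field positions are made up for illustration):

```python
def mapper(log_line):
    """Emit ((IP, date), bandwidth) so traffic is aggregated
    per host per day, not per host overall."""
    ip, date, bandwidth = log_line.split()
    return ((ip, date), int(bandwidth))  # composite key: (IP, date)

pairs = [mapper(line) for line in [
    "10.0.0.1 2013-05-01 1234",
    "10.0.0.1 2013-05-02 900",
]]
print(pairs)
```

The same IP now produces two distinct keys, so the Reduce stage keeps the two days separate instead of summing them together.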
HBASE
Another layer on top of Hadoop/HDFS
A distributed data store
Not a replacement for RDBMS!
Can be used with MapReduce
Good for unstructured data – no need to worry
about exact schema in advance
PIG – HBASE ENHANCEMENT
HBase lacks a proper query language
Pig – makes life easier for HBase users
It translates queries into MapReduce jobs
When working with Pig or HBase, forget what you know about SQL – it will make your life easier
BUILDING HADOOP CLUSTER
Post-production servers are OK
Don’t take ‘cheap hardware’ too literally
Good connection between nodes is a must!
>=1Gbps between nodes
>=10Gbps between racks
1 disk per CPU core
More RAM, more caching!
FINAL CONCLUSIONS
Hadoop and NoSQL-like DBs/data stores scale very well
Hadoop is ideal for crunching huge data sets
It does very well in production environments
The cluster of slaves is fault tolerant; the NameNode and JobTracker are not!
EXTERNAL RESOURCES
Trending Topic – built on Wikipedia access logs: http://goo.gl/BWWO1
Building web crawler with Hadoop: http://goo.gl/xPTlJ
Analysing adverse drug events: http://goo.gl/HFXAx
Moving average for large data sets: http://goo.gl/O4oml
EXTERNAL RESOURCES – USEFUL LINKS
http://www.slideshare.net/fullscreen/jpatanooga/la-hug-dec-2011-recommendation-talk/1
https://ccp.cloudera.com/display/CDH4DOC/CDH4+Installation+Guide
http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
http://hstack.org/hbase-performance-testing/
http://www.theregister.co.uk/2012/06/12/hortonworks_data_platform_one/
http://wiki.apache.org/hadoop/MachineScaling
http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf
http://www.cloudera.com/resource-types/video/
http://hstack.org/why-were-using-hbase-part-2/
QUESTIONS?