hadoop clustering performance testing on the small scale. jonathan pingilley, garrison vaughan,...

Hadoop Clustering

Performance testing on the small scale.

Jonathan Pingilley, Garrison Vaughan, Calvin Sauerbier, Joshua Nester, Adam Albertson

Hadoop – A Quick Look

What is Hadoop?

• Distributed Computing framework for data-intensive distributed applications

• Commonly used in large clusters of Commercial-Off-The-Shelf Hardware

• Noted for Reliability and Speed and failure/fault tolorance.

THE QUESTION?Small Cluster Performance and reliability.

Testing Overview

• Three Main Tests– Speed and Data loss– Fault Tolerance– Node Recovery

• Hardware– repurposed Dell Optiplex 270 and 280 units for

compatibility reasons

Test 1

DataLoss Tolerance• Single simplest test of our testing procedure

• Word count on cluster, deleting all books on DFS I minute in and monitoring the result

Test 2

Speed Baselines• Baseline test, with only a single node

– Exact command not usable on just a single node, but a close duplicate was located to simulate similar results:

» Cat *.txt | tr ‘’ ‘\n’ |sort |uniq –ic

• Baseline with cluster– Nearly identical to the single node test, but using

the cluster as a whole, using 1-4 nodes

• Tests run 3 times and averaged for consistency

Test 3

Speed with Node Failure• Variable tests with 1 to 3 nodes removed and

complete task analysis.

• Each variation run 3 times and averaged for time comparisons

Test 4

Speed with Node Recovery• Variable tests with 1 to 3 nodes removed 1

minute in, reconnected 1 minute later and complete task analyzed.

• Each variation run 3 times and averaged for time comparisons

Test Parameters

• All books loaded onto the master node and DFS.

• Default timeout changed from 10 minutes to 30 seconds to allow for timely testing.

• Node removal was 1 minute in.

RESULTSYou are required to maneuver straight down this trench…

Data Loss Tolerance

• Test Group 1 Presentation.

Hadoop Speed• Test Group 1 Presentation– Independent Test

• 22m 33s– 1 node

• 29m 50s w/ 22s deviation– 2 nodes



• 3m 54s w/6s deviation

Speed w/ Node Failure

• One Node removed– 13m 57s w/ 17s deviation

• 2 nodes– 16m 5s w/ 25s deviation

• 3 nodes– 28m 19s w/ 19s deviation

Speed w/ Node Recovery

• One Node Removed and Recovered– 5m 9s w/ 6s deviation– Recovery: 1m 3s w/ 3s deviation

• 2 nodes – 5m 27s w/ 8s deviation– Recovery: 51s w/ 2s deviation

• 3 nodes– 5m 31s w/ 6s deviation– Recovery: 54s w/ 5s deviation

CONCLUSIONIs this the end?

Conclusion

• Hadoop overhead is large on clusters numbering less than 4 nodes– Roughly 24% overhead w/ a performance degradation

of 50%• Upon introduction of a 4th node, average node

performance dramatically increases up to 144% due to optimizations.

• Performance numbers were reflected in the tests performed, and loss of nodes impacted total time to compute minimally

Conclusion, Part Deux.

• Recovery performance was outstanding – nodes were disconnected for 1 minute and aside for a couple seconds of resync and overhead reintegrated without trouble.

The Final Word

• Ultimately, Hadoop performed above and beyond expectations, proving to be a valid and relatively inexpensive way to handle managing large volumes of certain kinds of data when used above 4 nodes.

• Excellent recovery and performance, and relatively easy to use.

hadoop clustering performance testing on the small scale. jonathan pingilley, garrison vaughan,...

Documents