Hadoop Clustering Performance testing on the small scale. Jonathan Pingilley, Garrison Vaughan, Calvin Sauerbier, Joshua Nester, Adam Albertson

Post on 05-Jan-2016

Page 1

Hadoop Clustering

Performance testing on the small scale.

Jonathan Pingilley, Garrison Vaughan, Calvin Sauerbier, Joshua Nester, Adam Albertson

Page 2

Hadoop – A Quick Look

What is Hadoop?

Page 3

• Distributed Computing framework for data-intensive distributed applications

• Commonly used in large clusters of Commercial-Off-The-Shelf Hardware

• Noted for reliability, speed, and fault tolerance.

Page 4

THE QUESTION?

Small Cluster Performance and Reliability.

Page 5

Testing Overview

• Three Main Tests

– Speed and Data Loss

– Fault Tolerance

– Node Recovery

• Hardware

– Repurposed Dell Optiplex 270 and 280 units, for compatibility reasons

Page 6

Test 1

Data Loss Tolerance

• The simplest test in our testing procedure

• Word count run on the cluster, deleting all books on the DFS 1 minute in and monitoring the result

Page 7

Test 2

Speed Baselines

• Baseline test, with only a single node

– The exact command was not usable on a single node, but a close equivalent was found to produce comparable results:

» cat *.txt | tr ' ' '\n' | sort | uniq -ic

• Baseline with cluster

– Nearly identical to the single-node test, but using the cluster as a whole, with 1-4 nodes

• Tests run 3 times and averaged for consistency
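For readers who want to see what the single-node pipeline computes, here is a minimal Python sketch of a case-insensitive word count. It only approximates the shell command: it splits on any whitespace rather than on single spaces, and the name `count_words` and the sample lines are illustrative assumptions, not part of the original tests.

```python
from collections import Counter


def count_words(lines):
    """Count words case-insensitively, splitting on whitespace.

    Roughly mirrors `tr ' ' '\n' | sort | uniq -ic`; unlike the shell
    pipeline it also splits on tabs and skips empty tokens.
    """
    counts = Counter()
    for line in lines:
        counts.update(word.lower() for word in line.split())
    return counts


# Illustrative input only; the tests in the slides used a set of books.
sample = ["The quick brown fox", "the lazy dog"]
for word, n in sorted(count_words(sample).items()):
    print(n, word)
```

On a real run the lines would come from the `*.txt` book files instead of the in-memory sample.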

Page 8

Test 3

Speed with Node Failure

• Variable tests with 1 to 3 nodes removed and the complete task analyzed.

• Each variation run 3 times and averaged for time comparisons
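The averaging step can be sketched as follows. The slides do not say how the "deviation" was computed, so this sketch assumes a population standard deviation. The three run times are hypothetical placeholders, not measurements from the tests, and the names `summarize` and `fmt` are illustrative.

```python
from statistics import mean, pstdev


def summarize(run_seconds):
    """Return (average, deviation) in seconds for a list of run times.

    Assumption: "deviation" means population standard deviation.
    """
    return mean(run_seconds), pstdev(run_seconds)


def fmt(seconds):
    """Format seconds in the slides' 'Xm Ys' style."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}m {secs}s"


# Hypothetical run times in seconds, for illustration only.
runs = [600, 630, 660]
avg, dev = summarize(runs)
print(fmt(avg), "w/", fmt(dev), "deviation")
```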

Page 9

Test 4

Speed with Node Recovery

• Variable tests with 1 to 3 nodes removed 1 minute in and reconnected 1 minute later, with the complete task analyzed.

• Each variation run 3 times and averaged for time comparisons

Page 10

Test Parameters

• All books loaded onto the master node and DFS.

• Default timeout changed from 10 minutes to 30 seconds to allow for timely testing.

• Node removal was 1 minute in.

Page 11

RESULTS

You are required to maneuver straight down this trench…

Page 12

Data Loss Tolerance

• Test Group 1 Presentation.

Page 13

Hadoop Speed

• Test Group 1 Presentation

– Independent test: 22m 33s

– 1 node: 29m 50s w/ 22s deviation

– 2 nodes: 17m 32s w/ 18s deviation

– 3 nodes: 15m 6s w/ 16s deviation

– 4 nodes: 3m 54s w/ 6s deviation
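To put these times side by side, the sketch below computes each cluster size's speedup relative to the independent (non-Hadoop) test. It assumes the slide's figures pair up as independent test 22m 33s, then 29m 50s, 17m 32s, 15m 6s and 3m 54s for 1 to 4 nodes. The helper names are illustrative.

```python
import re


def to_seconds(text):
    """Parse a duration such as '29m 50s' into seconds."""
    minutes, seconds = re.fullmatch(r"(\d+)m\s*(\d+)s", text.strip()).groups()
    return int(minutes) * 60 + int(seconds)


# Times as read from the slide (assumed label/value pairing, see lead-in).
INDEPENDENT = to_seconds("22m 33s")
HADOOP = {1: "29m 50s", 2: "17m 32s", 3: "15m 6s", 4: "3m 54s"}


def speedup(nodes):
    """Independent-test time divided by the Hadoop time for `nodes` nodes."""
    return INDEPENDENT / to_seconds(HADOOP[nodes])


for n in sorted(HADOOP):
    print(f"{n} node(s): {speedup(n):.2f}x the independent test")
```

A value below 1 means the Hadoop run was slower than the plain pipeline, which is the case for the single-node cluster.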

Page 14

Speed w/ Node Failure

• 1 node removed: 13m 57s w/ 17s deviation

• 2 nodes removed: 16m 5s w/ 25s deviation

• 3 nodes removed: 28m 19s w/ 19s deviation
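One way to read these numbers is to compare each failure run with the baseline run that used the same number of surviving nodes from the start. The sketch below does that arithmetic. It assumes a 4-node cluster and the baseline times of 29m 50s, 17m 32s, 15m 6s and 3m 54s for 1 to 4 nodes from the Hadoop Speed slide. The function names are illustrative.

```python
TOTAL_NODES = 4  # cluster size assumed from the 1-4 node baseline tests


def to_seconds(minutes, seconds):
    """Convert a (minutes, seconds) pair to seconds."""
    return minutes * 60 + seconds


# Baseline: node count -> (minutes, seconds), as read from the speed slide.
BASELINE = {1: (29, 50), 2: (17, 32), 3: (15, 6), 4: (3, 54)}
# Failure runs: nodes removed -> (minutes, seconds).
FAILURE = {1: (13, 57), 2: (16, 5), 3: (28, 19)}


def seconds_saved(removed):
    """Baseline time for the surviving node count minus the failure-run time."""
    survivors = TOTAL_NODES - removed
    return to_seconds(*BASELINE[survivors]) - to_seconds(*FAILURE[removed])


for removed in sorted(FAILURE):
    print(f"{removed} removed: {seconds_saved(removed)}s faster than "
          f"the {TOTAL_NODES - removed}-node baseline")
```

Under this reading, each failure run finishes somewhat sooner than the matching smaller-cluster baseline, which would be consistent with the full cluster working for the first minute before nodes were removed.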

Page 15

Speed w/ Node Recovery

• 1 node removed and recovered

– 5m 9s w/ 6s deviation

– Recovery: 1m 3s w/ 3s deviation

• 2 nodes removed and recovered

– 5m 27s w/ 8s deviation

– Recovery: 51s w/ 2s deviation

• 3 nodes removed and recovered

– 5m 31s w/ 6s deviation

– Recovery: 54s w/ 5s deviation
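The cost of a disconnect-and-recover cycle can be estimated by subtracting the uninterrupted 4-node time from each recovery run. The sketch below assumes the 4-node baseline of 3m 54s reported on the Hadoop Speed slide. The names are illustrative.

```python
BASELINE_4_NODES = 3 * 60 + 54  # 3m 54s, the uninterrupted 4-node run

# Recovery runs: nodes removed and reconnected -> total run time in seconds.
RECOVERY_RUNS = {
    1: 5 * 60 + 9,   # 5m 9s
    2: 5 * 60 + 27,  # 5m 27s
    3: 5 * 60 + 31,  # 5m 31s
}


def extra_seconds(removed):
    """Extra run time compared with the uninterrupted 4-node baseline."""
    return RECOVERY_RUNS[removed] - BASELINE_4_NODES


for removed in sorted(RECOVERY_RUNS):
    extra = extra_seconds(removed)
    print(f"{removed} node(s): +{extra}s "
          f"({extra / BASELINE_4_NODES:.0%} over baseline)")
```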

Page 16

CONCLUSION

Is this the end?

Page 17

Conclusion

• Hadoop overhead is large on clusters of fewer than 4 nodes

– Roughly 24% overhead, with a performance degradation of 50%

• Upon introduction of a 4th node, average node performance increases dramatically, up to 144%, due to optimizations.

• Performance numbers were reflected in the tests performed, and the loss of nodes had minimal impact on total compute time
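The slides do not state how these percentages were derived. One reading that reproduces them from the reported times is per-node efficiency: the independent-test time divided by the node count times the cluster time. The sketch below is an inference under that assumption, not the authors' stated method. It uses 22m 33s for the independent test and 29m 50s, 17m 32s, 15m 6s and 3m 54s for 1 to 4 nodes.

```python
INDEPENDENT_S = 22 * 60 + 33  # independent (non-Hadoop) test, 22m 33s

# Hadoop run time in seconds by node count, as read from the results slide.
HADOOP_S = {
    1: 29 * 60 + 50,
    2: 17 * 60 + 32,
    3: 15 * 60 + 6,
    4: 3 * 60 + 54,
}


def efficiency(nodes):
    """Assumed metric: independent time / (nodes * cluster time)."""
    return INDEPENDENT_S / (nodes * HADOOP_S[nodes])


for n in sorted(HADOOP_S):
    print(f"{n} node(s): {efficiency(n):.1%} per-node efficiency")

# Single-node overhead under this reading: 1 - efficiency(1).
print(f"1-node overhead: {1 - efficiency(1):.1%}")
```

Under this reading the single node loses about 24% to overhead, three nodes run at roughly half efficiency, and four nodes reach about 144%.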

Page 18

Conclusion, Part Deux.

• Recovery performance was outstanding: nodes were disconnected for 1 minute and, aside from a couple of seconds of resync and overhead, reintegrated without trouble.

Page 19

The Final Word

• Ultimately, Hadoop performed above and beyond expectations, proving to be a valid and relatively inexpensive way to manage large volumes of certain kinds of data when used with 4 or more nodes.

• Excellent recovery and performance, and relatively easy to use.