Hadoop Clustering
Performance testing on the small scale.
Jonathan Pingilley, Garrison Vaughan, Calvin Sauerbier, Joshua Nester, Adam Albertson
Hadoop – A Quick Look
What is Hadoop?
• Distributed Computing framework for data-intensive distributed applications
• Commonly used in large clusters of Commercial-Off-The-Shelf Hardware
• Noted for reliability, speed, and fault tolerance.
THE QUESTION?
Small Cluster Performance and Reliability.
Testing Overview
• Three Main Tests
– Speed and Data Loss
– Fault Tolerance
– Node Recovery
• Hardware
– Repurposed Dell Optiplex 270 and 280 units for compatibility reasons
Test 1
Data Loss Tolerance
• The simplest single test in our testing procedure
• Word count on the cluster, deleting all books on the DFS 1 minute in and monitoring the result
Test 2
Speed Baselines
• Baseline test, with only a single node
– Exact command not usable on just a single node, but a close duplicate was located to simulate similar results:
» cat *.txt | tr ' ' '\n' | sort | uniq -ic
• Baseline with cluster
– Nearly identical to the single node test, but using the cluster as a whole, using 1-4 nodes
• Tests run 3 times and averaged for consistency
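For readers less familiar with shell pipelines, the single-node baseline can be approximated in Python. This is a minimal sketch, not the authors' code: the function name `word_count` and the sample strings are hypothetical, and it splits on any whitespace and skips empty tokens, whereas the shell command splits only on spaces and newlines.

```python
from collections import Counter


def word_count(texts):
    """Count words case-insensitively, splitting each text on whitespace."""
    counts = Counter()
    for text in texts:
        # Lowercasing mirrors the case-insensitive grouping of `uniq -i`.
        counts.update(word.lower() for word in text.split())
    return counts


# Hypothetical sample input standing in for the contents of the book files.
sample = ["The quick brown fox", "the lazy dog\nThe end"]
counts = word_count(sample)

for word, n in counts.most_common(3):
    print(n, word)
```

To run it against real files, the list of strings could be built by reading each `.txt` file in the working directory.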
Test 3
Speed with Node Failure
• Variable tests with 1 to 3 nodes removed and complete task analysis.
• Each variation run 3 times and averaged for time comparisons
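Averaging three runs and reporting a deviation can be sketched as follows. The slides do not say which deviation measure was used, so the sample standard deviation here is an assumption, and the three timings are hypothetical placeholders rather than measurements from these tests.

```python
from statistics import mean, stdev


def summarize(run_seconds):
    """Return (mean, sample standard deviation) of run times in seconds."""
    return mean(run_seconds), stdev(run_seconds)


def fmt(seconds):
    """Format a duration in seconds as 'Xm Ys'."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}m {secs}s"


# Hypothetical timings for three runs; NOT data from the slides.
runs = [600.0, 610.0, 620.0]
avg, dev = summarize(runs)
print(f"{fmt(avg)} w/ {round(dev)}s deviation")
```

With only three runs per variation, the deviation is a rough indicator of spread rather than a precise estimate.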
Test 4
Speed with Node Recovery
• Variable tests with 1 to 3 nodes removed 1 minute in, reconnected 1 minute later, and the complete task analyzed.
• Each variation run 3 times and averaged for time comparisons
Test Parameters
• All books loaded onto the master node and DFS.
• Default timeout changed from 10 minutes to 30 seconds to allow for timely testing.
• Node removal was 1 minute in.
RESULTS
You are required to maneuver straight down this trench…
Data Loss Tolerance
• Test Group 1 Presentation.
Hadoop Speed
• Test Group 1 Presentation
– Independent test: 22m 33s
– 1 node: 29m 50s w/ 22s deviation
– 2 nodes: 17m 32s w/ 18s deviation
– 3 nodes: 15m 6s w/ 16s deviation
– 4 nodes: 3m 54s w/ 6s deviation
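To make the scaling easier to compare, the reported mean times can be converted to seconds and expressed as a speedup over the 1-node Hadoop run. This is a small sketch; the helper `to_seconds` and the variable names are illustrative, and treating the 1-node Hadoop run as the reference is a choice made here, not something the slides specify.

```python
def to_seconds(minutes, seconds):
    """Convert a 'Xm Ys' duration to seconds."""
    return minutes * 60 + seconds


# Mean times as reported on the slide (deviations omitted).
independent = to_seconds(22, 33)
hadoop = {
    1: to_seconds(29, 50),
    2: to_seconds(17, 32),
    3: to_seconds(15, 6),
    4: to_seconds(3, 54),
}

# Speedup of each cluster size relative to the 1-node Hadoop run.
speedup = {n: hadoop[1] / t for n, t in hadoop.items()}

for n in sorted(hadoop):
    print(f"{n} node(s): {hadoop[n]}s, {speedup[n]:.2f}x vs 1 node")
```

The jump between 3 and 4 nodes is much larger than the step from 1 to 2 or 2 to 3, which is the effect the conclusion slides discuss.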
Speed w/ Node Failure
• One Node removed
– 13m 57s w/ 17s deviation
• 2 nodes
– 16m 5s w/ 25s deviation
• 3 nodes
– 28m 19s w/ 19s deviation
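One way to read these numbers is to compare each failure run with the baseline for the number of nodes left running. The sketch below assumes every failure test started with 4 nodes, which the slides imply but do not state outright; the baseline times are the cluster means reported on the Hadoop Speed slide.

```python
def to_seconds(minutes, seconds):
    """Convert a 'Xm Ys' duration to seconds."""
    return minutes * 60 + seconds


# Baseline cluster times from the Hadoop Speed slide, keyed by node count.
baseline = {1: to_seconds(29, 50), 2: to_seconds(17, 32), 3: to_seconds(15, 6)}

# Failure-test times from this slide, keyed by number of nodes removed.
failure = {1: to_seconds(13, 57), 2: to_seconds(16, 5), 3: to_seconds(28, 19)}

START_NODES = 4  # assumption: each failure test began with the full cluster

# Ratio of failure-run time to the baseline for the surviving node count.
ratio = {k: failure[k] / baseline[START_NODES - k] for k in failure}

for k in sorted(ratio):
    print(f"{k} removed ({START_NODES - k} left): {ratio[k]:.2f}x of baseline")
```

Under that assumption, each failure run finishes slightly faster than a cluster that had only the surviving nodes from the start, which fits with the removed nodes contributing work during the first minute.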
Speed w/ Node Recovery
• One Node Removed and Recovered
– 5m 9s w/ 6s deviation
– Recovery: 1m 3s w/ 3s deviation
• 2 nodes
– 5m 27s w/ 8s deviation
– Recovery: 51s w/ 2s deviation
• 3 nodes
– 5m 31s w/ 6s deviation
– Recovery: 54s w/ 5s deviation
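The cost of a temporary outage can be estimated by comparing each recovery run with the undisturbed 4-node time of 3m 54s from the Hadoop Speed slide. This is a rough sketch that assumes the recovery tests also started with 4 nodes; the variable names are illustrative.

```python
def to_seconds(minutes, seconds):
    """Convert a 'Xm Ys' duration to seconds."""
    return minutes * 60 + seconds


# Undisturbed 4-node run from the Hadoop Speed slide (assumed starting point).
four_node = to_seconds(3, 54)

# Total job times from this slide, keyed by number of nodes removed and recovered.
recovery_runs = {1: to_seconds(5, 9), 2: to_seconds(5, 27), 3: to_seconds(5, 31)}

# Extra wall-clock time attributable to the outage, in seconds.
extra = {k: t - four_node for k, t in recovery_runs.items()}

for k in sorted(extra):
    print(f"{k} node(s) out for 1 minute: +{extra[k]}s over the 4-node run")
```

The added time stays between one and two minutes regardless of how many nodes dropped out, which is in line with the one-minute disconnection window.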
CONCLUSION
Is this the end?
Conclusion
• Hadoop overhead is large on clusters numbering fewer than 4 nodes
– Roughly 24% overhead w/ a performance degradation of 50%
• Upon introduction of a 4th node, average node performance dramatically increases, up to 144%, due to optimizations.
• Performance numbers were reflected in the tests performed, and the loss of nodes had minimal impact on total compute time.
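The slides do not show how the 24%, 50%, and 144% figures were derived. One reading that reproduces them, offered here as an assumption, is per-node performance relative to the independent test: the independent time divided by (node count × cluster time), using the means from the Hadoop Speed slide.

```python
def to_seconds(minutes, seconds):
    """Convert a 'Xm Ys' duration to seconds."""
    return minutes * 60 + seconds


# Mean times from the Hadoop Speed slide.
independent = to_seconds(22, 33)
hadoop = {
    1: to_seconds(29, 50),
    2: to_seconds(17, 32),
    3: to_seconds(15, 6),
    4: to_seconds(3, 54),
}

# Assumed metric: per-node performance relative to the independent test.
per_node = {n: independent / (n * t) for n, t in hadoop.items()}

for n in sorted(per_node):
    print(f"{n} node(s): {per_node[n]:.1%} of independent-test performance per node")

# Overhead of a single Hadoop node relative to the independent test.
overhead = 1 - per_node[1]
print(f"single-node overhead: {overhead:.1%}")
```

Under this reading, a single Hadoop node runs at about 76% of the independent test (roughly 24% overhead), per-node performance falls to about 50% at 3 nodes, and it rises to about 144% at 4 nodes.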
Conclusion, Part Deux.
• Recovery performance was outstanding: nodes were disconnected for 1 minute and, aside from a couple of seconds of resync and overhead, reintegrated without trouble.
The Final Word
• Ultimately, Hadoop performed above and beyond expectations, proving to be a valid and relatively inexpensive way to manage large volumes of certain kinds of data when used above 4 nodes.
• Excellent recovery and performance, and relatively easy to use.