TRANSCRIPT
Matt Singer (@mattbytes)
April 17th, 2019
Pachyzoom: Understanding & Optimizing Hadoop servers with Intel VPP
Agenda
1. Hadoop & Twitter
2. Problem Statement
3. Project Objectives
4. Experimenting and Analyzing
5. Results
What is Hadoop?
● A file system: HDFS
● Distributed applications API and compute runtime: YARN
● Data processing frameworks: MapReduce, Hive, TEZ, Spark, …
Twitter uses Hadoop for:
● Data: Core Data (Tweets, Users, …), Logs, Durable Storage
● Analytics: Metrics
● Insight: Model Training, Experiments, Ad-Hoc Analysis
Hadoop @ Twitter
Introduction
>1T Events Per Day
>500K Compute Threads
>1 Exabyte Physical Storage
>12,500 Peak Cluster Size
Hadoop @ Twitter Scale
Introduction
Problem Statement
Twitter has an exascale Hadoop infrastructure built on many relatively small (1/2/4TB) HDDs.
How will we have enough IOPS if we use far fewer, but larger (6/8/12TB) HDDs?
The Project’s Objectives
1. MEASURE: I/O and CPU usage in our test and production Hadoop clusters.
2. ENABLE: Adoption of bigger data HDDs.
3. DENSIFY: Reduce the number of nodes needed for future clusters.
Project Objectives
VTune™ Amplifier Platform Profiler
We made extensive use of VPP to visualize things like:
● CPU Utilization
● Memory Use
● Disk Throughput
● Disk Latency
● IOPS
● Queue Depths
● Network Throughput
Experiment Configuration
The Teams Involved
@Twitter:
● Hardware Engineering
● Hadoop Development
● Hadoop SRE
● Infrastructure Optimization and Performance
@Intel:
● SSG: Hadoop
● SSG: VPP
● NSG: Engineering
● NSG: Intel CAS Team
● NSG: Technical Sales
● SMG: Account Team

Collaboration
The Plan
● Collaborate with Intel to test Intel Cache Acceleration Software
  ○ Try to reduce contention on disk access as there were more and more reads and writes per disk.
● Test the Intel Optane SSD DC P4800X
  ○ Test really fast, really high-endurance storage for this cache
● Use a combination of Hadoop benchmarks
Twitter Hadoop Test Approach

Functional: Gridmix
● Capture of a real production cluster workload trace (1000+ jobs)
● Replays reads, writes, shuffles, and compute
● Used over three generations of hardware
● Standard (Apache Hadoop)

Base: TestDFSIO, Teragen, Terasort
● Low-level I/O tests
● Repeatable
● Easy to use
● Standard (Apache Hadoop)
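These base benchmarks ship with Apache Hadoop and are driven from the command line; a minimal sketch of how a Teragen/Terasort run could be scripted is shown below. The examples jar path and data sizes are assumptions and vary by distribution.

```python
# Minimal sketch of driving the standard Apache Hadoop benchmarks from Python.
# The examples jar path and data sizes are assumptions; adjust for your distribution.
import subprocess

EXAMPLES_JAR = "/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar"  # assumed path

def hadoop_jar(*args: str) -> None:
    """Run `hadoop jar ...` and fail loudly if it exits non-zero."""
    subprocess.run(["hadoop", "jar", *args], check=True)

# Teragen writes synthetic 100-byte rows into HDFS; Terasort sorts them,
# exercising HDFS I/O plus the shuffle (YARN temporary data) path.
hadoop_jar(EXAMPLES_JAR, "teragen", "10000000", "/bench/teragen")         # ~1GB of input
hadoop_jar(EXAMPLES_JAR, "terasort", "/bench/teragen", "/bench/terasort")
# TestDFSIO (raw HDFS read/write) lives in the jobclient tests jar in most distributions.
```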
Experiment Configuration
Twitter Experiment System Setup
● Dual-socket E5-2640 v4 CPU
● 128GB RAM
● 12x 6TB 7200 RPM SATA disk
● 1x SATA SSD boot disk
● 1x 750GB Intel P4800X NVMe SSD
● 25G Ethernet
102 nodes spread across 6 racks
3.33 CPU threads / HDD
Intel Lab Test System Setup
● Dual-socket Xeon 8180
● 128GB RAM
● 8x 4TB 7200 RPM SATA disk
● 1x SATA SSD boot disk
● 1x 750GB Intel P4800X NVMe SSD
● 2x 10G Ethernet
9 nodes
14 CPU threads / HDD
Experiment Configuration
Test Configurations

Why does Twitter use YARN on HDD? In the past, we haven't wanted to make an a priori determination of how much spill a job could produce. Also, SSDs used to be much smaller and more expensive.

Twitter ran tests in 10+ configs:
● Baseline Config: 12 HDDs with YARN temp files/logs on HDDs
● 12 HDDs, HDFS Cached
● YARN on Optane + HDFS Cached
● YARN on HDDs + HDFS Cached
● YARN on Optane
● Baseline with 6 HDDs
● YARN on Optane, 6 HDDs
● YARN on Optane, 3 HDDs
● YARN on Optane, 6 HDDs, HDFS Cached
● YARN on Optane, 3 HDDs, HDFS Cached
● Baseline with half the threads
● YARN on Optane with half the threads

Intel ran Gridmix in 10 configs (some similar, some different).
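To make "YARN on Optane" concrete: in Hadoop terms it means pointing the NodeManager's local (shuffle/spill) and log directories at the NVMe mount instead of spreading them across the HDDs. A minimal, hypothetical sketch follows; the property names are standard YARN settings, but the mount points and values are illustrative assumptions, not Twitter's actual configuration.

```python
# Hypothetical sketch of the yarn-site.xml change behind "YARN on Optane".
# Property names are standard YARN settings; mount points are assumptions.
yarn_on_hdd = {  # baseline: temp/spill data and logs spread across the 12 HDD mounts
    "yarn.nodemanager.local-dirs": ",".join(f"/data/{i}/yarn/local" for i in range(12)),
    "yarn.nodemanager.log-dirs":   ",".join(f"/data/{i}/yarn/logs" for i in range(12)),
}
yarn_on_nvme = {  # test config: everything on the 750GB Optane mount
    "yarn.nodemanager.local-dirs": "/mnt/nvme0/yarn/local",
    "yarn.nodemanager.log-dirs":   "/mnt/nvme0/yarn/logs",
}

def to_xml(props: dict) -> str:
    """Render a property dict as a yarn-site.xml fragment."""
    body = "".join(
        f"  <property><name>{k}</name><value>{v}</value></property>\n"
        for k, v in props.items()
    )
    return f"<configuration>\n{body}</configuration>"

print(to_xml(yarn_on_nvme))
```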
Experiment Configuration
Baseline Example
This is data from one of the 12 disks in each system in the test cluster. VPP is showing us:
● Throughput: peaks to 200MB/s, but the average is 32MB/s.
● IOPS: averages about 70 reads + 80 writes during the run. 150 IOPS is really high for an HDD!
● Queue depth: about ⅔ of the samples show a queue depth > 20, but block size is all over the place.
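One way to read numbers like these: the average request size implied by throughput and IOPS can be estimated with a quick back-of-the-envelope calculation. The figures below are the ones quoted above; the arithmetic is only approximate since reads and writes are mixed.

```python
def implied_request_size_kib(throughput_mb_s: float, iops: float) -> float:
    """Average I/O request size (KiB) implied by sustained throughput and IOPS."""
    return throughput_mb_s * 1024 / iops

# ~32MB/s average at ~150 combined R+W IOPS -> ~218KiB per request on average;
# the 200MB/s peaks imply much larger requests, consistent with the
# "block size is all over the place" observation.
print(implied_request_size_kib(32, 150))    # ≈ 218
print(implied_request_size_kib(200, 150))   # ≈ 1365
```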
Results
Intel’s Cluster Baseline Config, Gridmix 75
Source: Intel Corporation
We saw a significant reduction in runtimes on both Gridmix (27.5%) and Terasort (52%) with the same system.
Results
Why?
HDD utilization drops dramatically now that YARN and HDFS aren’t contending for the same disks.
Gridmix HDD Utilization, Baseline Config (37MB/s)
Gridmix HDD Utilization, YARN on Optane Config (6MB/s)
Results
Why?
Since the benchmark was significantly I/O bound before, CPU utilization increases from an average of 40% to an average of 57%. The CPU was doing work at 1.4x the rate, which correlates with the reduced runtime.
CPU Utilization, Baseline Config (40%)
CPU Utilization, New Config (57%)
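That correlation can be checked with simple arithmetic using only the figures quoted on this slide:

```python
# Back-of-the-envelope check of the correlation described above.
baseline_cpu, new_cpu = 0.40, 0.57
cpu_rate_ratio = new_cpu / baseline_cpu                  # ≈ 1.43x more CPU work per unit time

gridmix_runtime_reduction = 0.275                        # the 27.5% Gridmix runtime reduction
implied_speedup = 1 / (1 - gridmix_runtime_reduction)    # ≈ 1.38x

print(f"CPU rate ratio ≈ {cpu_rate_ratio:.2f}x; "
      f"speedup implied by the runtime reduction ≈ {implied_speedup:.2f}x")
```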
Results
Why?
We can now see how the temporary data really wants to behave: peaks to 20K IOPS, averaging more than 6,000 total R+W IOPS.
There was no way that spinning disks could handle this alone, let alone while sharing the HDD with HDFS load.
Results
In Contrast: Intel’s Results (Gridmix 75)
Source: Intel Corporation
From Phase 1 (Baseline) to Phase 2 (YARN data on Optane; 750G Optane + 1.6TB NAND), the processing time was reduced by 51.7%.
Adding in some HDFS caching (Phase 5), the processing time was reduced by 56.2%, or about 9% incrementally.
In comparison, the maximum reduction in runtime in the Twitter test cluster was 27.5%.
Results
Why didn’t we see the same scaling? Different systems!
● Much more compute power in Intel’s system.
● Twitter retrofitted the NVMe disks into an existing system, and couldn’t accommodate the optimal attachment of the NVMe disk to CPU socket 0. Intel did some testing that indicated that this may have had a 10% penalty when trying to move big data from HDD to cache.

                 Intel Config     Twitter Config
CPU              Xeon 8160 x 2    E5-2640 v4 x 2
Threads          112              40
HDDs             8                12
Threads / Disk   14               3.33
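The threads-per-disk ratio in this table is the metric the rest of the deck keeps coming back to; a trivial helper makes the comparison explicit (the third entry uses the 32-thread : 8-disk balance proposed on the "Shifting the Balance" slide later in the deck).

```python
def threads_per_disk(cpu_threads: int, hdds: int) -> float:
    """CPU threads available to feed each spinning disk."""
    return cpu_threads / hdds

configs = {
    "Intel mini cluster":   (112, 8),   # 2x Xeon 8160
    "Twitter test cluster": (40, 12),   # 2x E5-2640 v4
    "Proposed future node": (32, 8),    # from the "Shifting the Balance" slide
}
for name, (threads, disks) in configs.items():
    print(f"{name}: {threads_per_disk(threads, disks):.2f} threads/disk")
```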
Opportunity to apply what we discovered
1. Moving the YARN data to the Optane SSD dramatically changed the disk utilization pattern.
2. Our test cluster only had 3.33 threads/disk, vs 14 threads/disk in Intel’s mini cluster.
3. We couldn’t push the HDDs hard enough with just HDFS data to see a benefit from caching metadata and small files with Intel’s Cache Acceleration Software.
Next: explore the other dimensions of testing: CPU and HDD count.
Results
Experiment: Removing disks from the cluster
[Charts: Gridmix and Terasort runtimes with 12, 6, and 3 disks per node; per-HDD throughput callouts of 9MB/s, 23MB/s, 40MB/s, and 45MB/s; runtime change callouts of 27%, 38%, and 68%.]
Results
Strawman Example - 2X CPU with ¼ HDD
[Chart: Gridmix runtime, with callouts of +27%, -9%, -27%, and -55% across configurations.]
YARN on SSD enabled scaling, allowing the benchmark to run in less than half the time with 75% fewer HDDs.
Results
With ½ of Cores Disabled, no NVMe
CPU goes right to 100%.*
*Due to the way that we disabled the cores (in the scheduler rather than the BIOS), Linux reports 50%.
The cluster is CPU bound for most of the run!
Results
½ Cores Disabled, with NVMe SSD
HDD Utilization drops dramatically, but we’re still CPU bound. Thus, a minimal impact to runtime.
Results
All cores, with NVMe SSD, 3 HDDs
The CPU is now highly utilized (but not pegged at 100%) for most of the run.
The HDDs are moving a lot of data without a lot of IOPS.
Great condition to be in!
Results
CPU Utilization and Disk Access In Real Life
CPU Utilization, Processing Cluster 1
Results
CPU Utilization and Disk Access In Real Life
HDD Utilization, Processing Cluster 1
Results
CPU Utilization and Disk Access In Real Life
CPU Utilization, Realtime Cluster
Results
CPU Utilization and Disk Access In Real Life
Disk Utilization, Realtime Cluster
Results
vs. the experiment clusters
Intel 8 HDD: ~30MB/s (HDFS was partially cached)
Twitter 12 HDD: ~9MB/s
Results
vs. the experiment clusters
Intel 8 HDD: ~30MB/s (HDFS was partially cached)
Twitter 6 HDD: ~22MB/s
Twitter 3 HDD: ~45MB/s
[Additional chart callouts: 800MB/s; ~3,000 IOPS.]
Impact and Next Steps
Shifting the Balance
32 threads to 8 disks is a 4:1 ratio, a better fit for today’s more compute-intensive workloads.
A 3-6TB SSD for YARN data enables this shift in the Threads:HDD ratio.
Impact
Impact and Next Steps
                Legacy Config            Possible Config
CPU             Xeon E3 Series 4-Core    6262V CLX 24-Core
Memory          32-64GB                  192GB
HDD             12 x 2TB HDD             8 x 6TB HDD
Boot            240GB Boot               240GB Boot
YARN Storage    N/A                      6.4TB High Endurance NVMe
Compute         1X                       >4X
Storage         1X                       4X
Racks           4X                       1

                Original Goal            End Result
Scaling         2X-3X                    4-5X
Estimated 30% more cost efficient.
Impact and Next Steps
Best Practices
1. MEASURE: A visualization tool such as VTune™ Amplifier Platform Profiler made it really easy to measure what was happening in the experiment and production clusters.
2. EXPERIMENT: Challenge prior assumptions about how things are and how things will work.
3. REPEAT: Learn from the data that you collected, adjust your experiments, and try again!
Learnings
1. YARN Data on NVMe SSD: This single tweak alone changed the disk access patterns dramatically.
2. Adopt High Density Drives: Once the YARN data was moved to the SSD, it became clear that we didn’t need as many HDDs.
3. More Compute Power for Every Disk: Now we know that the next platform that we design needs to have more compute threads for each disk in the system.
What’s Next?
1. YARN Temporary Capacity Size Analysis
2. SSD Endurance Needs Analysis (total YARN daily write load); a rough sizing sketch follows below
3. Determine Optimal Balance of HDD, Threads, and NVMe SSD
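As an illustration of what the endurance analysis in item 2 might involve: given the total YARN daily write load per node, the required drive-writes-per-day (DWPD) rating follows from simple arithmetic. A hedged sketch with placeholder numbers (these are not Twitter measurements):

```python
# Hypothetical SSD endurance sizing sketch; all figures are placeholders.
def required_dwpd(daily_yarn_writes_tb: float, ssd_capacity_tb: float,
                  write_amplification: float = 1.0) -> float:
    """Drive writes per day needed to absorb the daily YARN temp-data write load."""
    return daily_yarn_writes_tb * write_amplification / ssd_capacity_tb

def rated_lifetime_writes_tb(ssd_capacity_tb: float, rated_dwpd: float,
                             warranty_years: float = 5.0) -> float:
    """Total terabytes written that a drive is rated for over its warranty period."""
    return ssd_capacity_tb * rated_dwpd * warranty_years * 365

# Example: if a node wrote 20TB/day of YARN data to a 6.4TB drive,
# it would need a rating of roughly 3 DWPD (~35PB written over 5 years).
print(required_dwpd(daily_yarn_writes_tb=20, ssd_capacity_tb=6.4))      # ≈ 3.1
print(rated_lifetime_writes_tb(ssd_capacity_tb=6.4, rated_dwpd=3.0))    # ≈ 35,040
```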
Thank you!
Thank you, Intel team! Ali Alavi (@TheAliAlavi), Felipe Barajas, Mauricio Cuervo (@mauriciocuervo), Milind Damle, Juan Fernandez (@cachegordon), Uma Gangumalla, Fabrizio Giamello (@giame), Andrzej Jakowski, David Leone, Anup Navare, Chris Parry, Brien Porter, Rakeshr Radhakrishnan Potty, David Tuhy, Barrie Wheeler (@AndBarrie), Michal Wysoczanski
Thank you, Twitter team! Tu Lam (@tulam_tu), Brian Martin (@brayniac), Derrick Tseng (@dstseng), Mark Schonbach (@markbach), Matt Silver (@msilver)
#Collaborate
Twitter @ VTune Summit 2019
Thank You!
April 17th, 2019