@let@token tacc stats: a comprehensive and transparent ...€¦ · linked into splunk logs - link...

22
TACC Stats: A Comprehensive and Transparent Resource Usage Monitoring Tool for HPC Systems Texas Advanced Computing Center R. Todd Evans, Bill Barth, James Browne

Upload: others

Post on 19-Aug-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: @let@token TACC Stats: A Comprehensive and Transparent ...€¦ · Linked into Splunk logs - link takes user to logs ltered by host and time range for a given job. Used for day to

TACC Stats: A Comprehensive andTransparent Resource Usage Monitoring Tool

for HPC Systems

Texas Advanced Computing CenterR. Todd Evans, Bill Barth, James Browne

Page 2: @let@token TACC Stats: A Comprehensive and Transparent ...€¦ · Linked into Splunk logs - link takes user to logs ltered by host and time range for a given job. Used for day to

TACC Stats is NOT a Line-levelPerformance Analysis Tool

TACC Stats . . .

• Is NOT like: TAU, IPM, gprof, Vampir, Perfexpert, IntelVTune etc.

• Does NOT provide profiling at the level of function calls,control structures, or statements

• Does NOT require instrumentation

• Does NOT need to be invoked by the user or system managers

R.T. Evans 2

Page 3: @let@token TACC Stats: A Comprehensive and Transparent ...€¦ · Linked into Splunk logs - link takes user to logs ltered by host and time range for a given job. Used for day to

TACC Stats IS a Transparent ResourceUsage Monitoring Tool

TACC Stats . . .

• Monitors all jobs on all nodes automatically

• Collects hardware counters: Flops, MBW, cache hit rates . . .

• Linux/IB/Lustre stats: I/O, memory usage, CPU usage . . .

• Tests jobs automatically for inefficient resource utilization

• Uses a SQL database w/ Web Interface

• Is Open Source: https://github.com/TACC/tacc_stats

• TACC Stats + XALT provideslibraries, modules, threads, executables . . .

R.T. Evans 3

Page 4: @let@token TACC Stats: A Comprehensive and Transparent ...€¦ · Linked into Splunk logs - link takes user to logs ltered by host and time range for a given job. Used for day to

TACC Stats OperationDeployment

• Distributed as a Python package

• Light-weight component (raw C) runs on compute nodes

• Starts automatically after install

Automated Operation

• Measurements for all jobs & nodes in prolog/epilog + every10 minutes → at least two measurements per job

• rsync’d once a day at random times between 2 and 4 am.

• pickled once per day

• ingested into database once per day

• tests run once per day

• Every Production Job is subjected to Daily Tests

R.T. Evans 4

Page 5: @let@token TACC Stats: A Comprehensive and Transparent ...€¦ · Linked into Splunk logs - link takes user to logs ltered by host and time range for a given job. Used for day to

TACC Stats OperationNode0

(tacc stats)Node1

(tacc stats)NodeN

(tacc stats)

Scratch(Archive: raw stats/hostfiles/accounting file)

Corral(Processed data: Python Pickles)

Server(Database, Web Server)

rsync rsync rsync

pickle

ingest into databaseoptionally combine with XALT

R.T. Evans 5

Page 6: @let@token TACC Stats: A Comprehensive and Transparent ...€¦ · Linked into Splunk logs - link takes user to logs ltered by host and time range for a given job. Used for day to

Data Collected

Summary: This data is collected for every job

• I/O: Block/Lustre/NFS/GigE

• CPU utilization

• IB stats

• Chip stats - ”on-core” and ”off-core” - instructions, cache hitrates, flops (SSE scalar/SSE vector/256), memoryreads/writes . . .

• Memory, Virtual Memory, NUMA data

• 150KB per host per day - Stampede 0.8GB/day

R.T. Evans 6

Page 7: @let@token TACC Stats: A Comprehensive and Transparent ...€¦ · Linked into Splunk logs - link takes user to logs ltered by host and time range for a given job. Used for day to

Data Computed

Metrics computed - averaged over time and hosts/cores• CPU Usage

• Cycles per Instruction (CPI)

• Cycles per L1D replacement (CPLD)

• Load L1Hits

• Load L2 Hits

• Load LLC Hits

• All Load Operations

• Memory Bandwidth

• Memory Usage High Water Mark

• Infiniband Packet Rate

• Infiniband Packet Size

• FLOPS

• Vectorization Fraction

• Ethernet Bandwidth

R.T. Evans 7

Page 8: @let@token TACC Stats: A Comprehensive and Transparent ...€¦ · Linked into Splunk logs - link takes user to logs ltered by host and time range for a given job. Used for day to

Tests

We flag jobs for the following every day

• High CPI

• High Memory Usage on normal nodes

• Low memory usage on 1 TB nodes (mean is ∼120 GB)

• Low vectorization percentage VecPercent - often indicatesuser is not using AVX

• MPI over GigE network instead of IB

• Will soon test Lustre usage and MDS load

• Inefficiently using their reserved nodes (to be continued) . . .

R.T. Evans 8

Page 9: @let@token TACC Stats: A Comprehensive and Transparent ...€¦ · Linked into Splunk logs - link takes user to logs ltered by host and time range for a given job. Used for day to

Case Studies - Many SUs & CPISU = Service Unit

Node hours

MILC code• One project using MILC found to be running higher than

expected CPI (1.1 vs 0.7)

• Members were not using available vectorized intrinsics

• 11% reduction in runtime

DNS Turbulence Code

• CPI of 1.1 w/ lots of SUs

• Line-level profiling revealed MPI/OpenMP hot spots

• Converted OpenMP workshare block to parallel do block

• Improvement: 7% overall, 10% in main loop, 76% in codeblock

R.T. Evans 9

Page 10: @let@token TACC Stats: A Comprehensive and Transparent ...€¦ · Linked into Splunk logs - link takes user to logs ltered by host and time range for a given job. Used for day to

Case Studies - CPI

GR application - Singularity

• CPI of 1.15 w/ lots of SUs: 239,631 for 1st quarter 2014

• Code was making many extraneous calls to cat and rm

• Code was not using any optimization flags (-O3 or -xhost)

• 26% decrease in run time after simple changes made

R.T. Evans 10

Page 11: @let@token TACC Stats: A Comprehensive and Transparent ...€¦ · Linked into Splunk logs - link takes user to logs ltered by host and time range for a given job. Used for day to

Case Studies - Performance Drop

R.T. Evans 11

Page 12: @let@token TACC Stats: A Comprehensive and Transparent ...€¦ · Linked into Splunk logs - link takes user to logs ltered by host and time range for a given job. Used for day to

Case Studies - Idle nodes

• Roughly ∼ 1% of SUsspend on productionjobs w/ idle hosts

0 5 10 15Time (hrs)

c453-203(0.45)

c453-803(2.46)

c455-001(2.70)

c455-203(2.42)

c455-301(2.67)

c455-701(2.50)

c455-801(2.53)

c457-002(2.46)

CLOCKS_UNHALTED_REF/INSTRUCTIONS_RETIRED¯Mean=2.28±0.74

0.0

0.5

1.0

1.5

2.0

2.5

3.0

3.5

4.0

4.5

5.0

ID: 4420133, u: userxxx, q: normal, N: p1, D: 2014-11-10 18:03:40, NH: 8

E: unknown

R.T. Evans 12

Page 13: @let@token TACC Stats: A Comprehensive and Transparent ...€¦ · Linked into Splunk logs - link takes user to logs ltered by host and time range for a given job. Used for day to

Case Studies - Undersubscription

Test if MPI/OpenMP

• Alerts us to usersusing less tasks pernode than they should

• CPU Usage is 50%,memory usage is<20%

• Halve nodes for job,double MPI wayness

R.T. Evans 13

Page 14: @let@token TACC Stats: A Comprehensive and Transparent ...€¦ · Linked into Splunk logs - link takes user to logs ltered by host and time range for a given job. Used for day to

Web Interface

Backed by SQL Database

• Combined queries are supported in search fields

• Search by jobid, date, user, executable, project, nodes, runtime, metrics . . .

• Data is ingested from accounting file, XALT, processed statsdata

• Raw time-series data is served and plots are generated on thefly

• Linked into Splunk logs - link takes user to logs filtered byhost and time range for a given job.

• Used for day to day consultant activities typically fordiagnosing issues experienced by our users

R.T. Evans 14

Page 15: @let@token TACC Stats: A Comprehensive and Transparent ...€¦ · Linked into Splunk logs - link takes user to logs ltered by host and time range for a given job. Used for day to

Web Interface11/18/2014 TACC Stats Stampede

http://tacc-stats.tacc.utexas.edu/stampede/ 1/2

TACC Stats Stampede

Search Using JOBID:

Search

Search Using Combined Fields:

Start Date mm/dd/yyyy End Date mm/dd/yyyy

UID

User Name

Project

Executable

Search Field Value

Search

2013/ 09 17 18 19 20 21 22 23 24 25 26 27 28 29 30

2013/ 10 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

2013/ 11 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

2013/ 12 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

2014/ 01 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

2014/ 02 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28

2014/ 03 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

2014/ 04 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

2014/ 05 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

2014/ 06 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 17 18 19 20 21 22 23 24 25 26 27 28 29 30

2014/ 07 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

2014/ 08 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

2014/ 09 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

R.T. Evans 15

Page 16: @let@token TACC Stats: A Comprehensive and Transparent ...€¦ · Linked into Splunk logs - link takes user to logs ltered by host and time range for a given job. Used for day to

Web Interface11/18/2014 TACC Stats

http://tacc-stats.tacc.utexas.edu/stampede/date/2014-11-17/ 1/265

[date=2014­11­17]­search

# Jobs over 1 mn in run time = 3626

Job ListingJobs flagged for:

GigE BW > 1.049e+06 B

lagarder 4457699(1.612e+06)

iplant 4457846(5.406e+06) 4456558(2.384e+06)

unknown 4456785(2.247e+06) 4453596(1.806e+06) 4453592(1.768e+06) 4453591(1.612e+06)

dodo5575 4453658(4.480e+06) 4453656(4.531e+06) 4453655(4.517e+06)

smk5g5 4458359(5.084e+07)

subhahdr 4457020(8.163e+06) 4457018(8.092e+06) 4457017(7.950e+06) 4457015(7.874e+06) 4456907(8.041e+06) 4456891(7.670e+06) 4456888

beckverm 4459588(1.207e+06)

jcesar 4457694(6.063e+06)

Memory HWM < 30 GB on a Largemem Node

aa55784 4458440(23.085)

rogerab 4460225(22.543) 4459343(21.842)

xiaq 4455836(21.498) 4455153(29.324)

smk5g5 4458461(21.257)

ruhollah 4446077(21.165) 4446075(21.149) 4446063(21.269)

Idle-ness

mturilli 4458498(0.993) 4456031(0.993)

lagarder 4456658(1.0) 4456164(0.999) 4456084(1.0)

priyaunt 4460299(0.996)

mzivel 4457586(0.992)

R.T. Evans 16

Page 17: @let@token TACC Stats: A Comprehensive and Transparent ...€¦ · Linked into Splunk logs - link takes user to logs ltered by host and time range for a given job. Used for day to

Breakdown of MPI/OpenMP Usage

Q1 2015Type % SUsSerial 22Hybrid 19

Pure OpenMP 0.2Pure MPI 59

Hybrid Jobs (MPI&OpenMP) Tasks/Threads Breakdown(> 500,000 SUs March 2014 - March 2015)

Tasks/node Threads/task CPI CPU Usage MemUsage (GB) SUs16 1 0.54(6) 1580(17) 7(5) 14,000,0002 8 0.76(9) 1130(450) 13(7) 3,400,0001 16 0.78(13) 532(264) 17(6) 1,200,0004 4 0.69(10) 1150(335) 9(6) 580,0004 1 0.65(18) 437(305) 9(3) 820,0008 1 0.65(21) 731(133) 11(6) 850,0001 1 0.59(4) 198(297) 7(1) 540,000

R.T. Evans 17

Page 18: @let@token TACC Stats: A Comprehensive and Transparent ...€¦ · Linked into Splunk logs - link takes user to logs ltered by host and time range for a given job. Used for day to

Top Applications by SUs: March2014-March2015

Production runs: Jobs/SUs = 983,675/548,368,901

Top 10 Executables

Application Executable SUs Memory Usage (GB)

NAMD namd2 59,423,911 6(3)GROMACS mdrun mpi 18,668,824 4.0(2)LAMMPS lmp stampede 13,119,414 4.2(9)

VASP vasp std 11,639,474 10(5)FLASH flash4 9,313,134 14(7)WRF wrf.exe 8,973,222 10(7)

CHARMM charmm 8,507,615 6(1)VASP vasp 8,452,074 9(5)

KORAL ko 7,862,293 6(3)ENZO enzo.exe 7,691,763 21(6)

R.T. Evans 18

Page 19: @let@token TACC Stats: A Comprehensive and Transparent ...€¦ · Linked into Splunk logs - link takes user to logs ltered by host and time range for a given job. Used for day to

MPI comparisons of Applications

NAMD - MPI comparison

36,812 Jobs from March 2014 - March 201537,000,000 SUs mvapich2 74,000 SUs impi (0.2%)

metric mvapich2 impi

cpi 0.49(2) 0.49(8)cpld 25(12) 31(3)mem bw 0.12(3) 0.09(1)L1 reuse 0.88(5) 0.93(1)GFLOPs 17(6) 16(6)Vec Fraction 0.05(1) 0.062(5)

R.T. Evans 19

Page 20: @let@token TACC Stats: A Comprehensive and Transparent ...€¦ · Linked into Splunk logs - link takes user to logs ltered by host and time range for a given job. Used for day to

Scaling of Applications

NAMD Jobs from March 2014 - March 2015

NAMD Jobs over 1 hour

Nodes GLOPS/node2 14(7)4 23(5)8 21(5)

16 20(8)32 17(3)64 15(5)

128 14(6)256 19(5)

R.T. Evans 20

Page 21: @let@token TACC Stats: A Comprehensive and Transparent ...€¦ · Linked into Splunk logs - link takes user to logs ltered by host and time range for a given job. Used for day to

Work in Progress

• Build a basis of metrics on which applications can be mapped.

• User/project performance can be compared to expectedperformance

• Validating/interpreting raw data

• Real-time, continual stats updates via the network: syslog orRabbitMQ (available but not deployed)

• Deployment at additional Centers - TACC (Lonestar,Stampede, Wrangler)/SDSC(Gordon,Comet)/LSU(SuperMIC)/Purdue(Conte,others???)

• Adding support for additional architectures and accelerators:MIC, GPU

• Data is being incorporated into XDMoD (XD Metrics onDemand Portal, a comprehensive HPC management tool)

R.T. Evans 21

Page 22: @let@token TACC Stats: A Comprehensive and Transparent ...€¦ · Linked into Splunk logs - link takes user to logs ltered by host and time range for a given job. Used for day to

Acknowledgements

• NSF-funded STCI project ACI-1203604

• Robert McLay (TACC), Mark Fahey (NICS) for XALT

• University at Buffalo, SUNY Robert Deleon, Thomas Furlani,Steven Gallo, Matthew Jones, Abani Patra

R.T. Evans 22