
Deep and Wide Metrics for HPC Resource Capability and Project Usage

David Hart National Center for Atmospheric Research

PO Box 3000 Boulder, CO 80307-3000

+1 303-497-1234

[email protected]

ABSTRACT This paper defines and demonstrates application of possible quantitative metrics for the qualitative notions of “deep” and “wide” HPC system use along with the related concepts of capability and capacity computing. By summarizing HPC workloads according to the science-oriented projects using the systems, rather than solely on job sizes and distributions, one can identify differences between the workloads on different systems as well as highlight certain instances of unique usage modalities. Specific definitions of depth and width are suggested, along with metrics that permit comparisons to determine which systems are deeper or wider than others. Similarly, the same data permit an alternate means of defining the degree to which HPC system activities are capability- or capacity-oriented.

Categories and Subject Descriptors K.6.2 [Management of Computing and Information Systems]: Installation Management – performance and usage measurement.

General Terms Management, Measurement, Performance

Keywords High-performance computing, capability, capacity, deep, wide, TeraGrid, metrics, measurement

1. INTRODUCTION HPC workload analysis presents many challenges and resists simple categorization, and workloads have often been described only in anecdotal and intuitive terms. The HPC community has adopted the terms “capability” and “capacity” to distinguish HPC systems and their workloads. The distinctions are qualitative: intuitively, “capability” computing is characterized by workloads that can only be carried out on large-scale systems or that require a large fraction of a given system, while “capacity” refers to high-volume workloads composed of relatively small work units [1].

Similarly, the TeraGrid has used the related terms “deep” and “wide” to describe the needs of users and their science problems [2]. “Deep” problems are those whose science objectives generally require capability computing or at least large amounts of computing, while “wide” describes the larger community of users who can benefit from HPC resources, whose computing needs individually may be small but collectively represent a potentially large capacity computing challenge.

1.1 Issues and Challenges While useful for qualitative characterization, these pairs of terms have remained subjective and open to interpretation. But the notions of capability and capacity offer a starting point for some quantitative definitions. Capability and capacity may be viewed as end points on a continuum describing the scale of job size, although the point at which capability computing becomes capacity computing cannot easily be pinned down. That is, the capability work of one system may be considered capacity work in the context of a much larger system. One measurable target for “capability” was defined by NERSC as part of the Office of Management and Budget’s Performance Assessment Rating Tool (PART) for the Department of Energy’s Advanced Scientific Computing Research program. For that purpose, NERSC set a performance target of 40% of computing time for capability work, defined as computations that required at least 1/8 of the resource’s total cores, which was then 512 cores of the Seaborg IBM SP RS/6000 Power3 system [3].
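As a point of reference, the job-based PART metric described above can be stated in a few lines. The following is a minimal sketch of that definition under assumed record fields and function names, not NERSC's actual reporting code.

# Illustrative sketch of the job-based NERSC/PART capability check described
# above: the fraction of delivered core-hours from jobs using at least 1/8 of
# the system's cores, compared against the 40% target. Field names are
# placeholders, not NERSC's actual accounting schema.
def part_capability_fraction(jobs, system_cores):
    """jobs: iterable of (cores, wallclock_hours) tuples for one system."""
    threshold = system_cores / 8
    total = sum(cores * hours for cores, hours in jobs)
    capability = sum(cores * hours for cores, hours in jobs if cores >= threshold)
    return capability / total

# The PART target is met when part_capability_fraction(jobs, n) >= 0.40.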

Using an analogous perspective, “deep” and “wide” can be viewed as end points on two additional dimensions. In common parlance, both “deep” and “wide” suggest largeness, in vertical and horizontal directions, respectively. In the computing realm, we take the largeness implied by “deep” as referring to the scale of the scientific need with respect to computing. In contrast, the largeness of “wide” refers to the scale of the community, and users in the widest portion of the community are associated with smaller computational demands. Thus, the dimensions of deep and wide are actually somewhat orthogonal from one another, just as their common definitions would suggest.

Figure 1 suggests a visual relationship between these three dimensions of scale for computing, science need, and community. The arrangement intentionally reflects the shape of the Branscomb pyramid [4], with the capability computing, deep science need, and narrow (as in “not wide”) community ends of each spectrum converging at the apex of the pyramid and collectively representing the small number of users with extensive computing needs requiring large-scale HPC capability. In contrast, the capacity, shallow (as in “not deep”), and wide ends of the dimensions diverge to reflect the variety of computing, science and user needs. (“Shallow” is a somewhat unfortunate contrast for “deep,” and it should be emphasized that the shallowness refers to only the need for computing associated with the science problem, not the scientific problem itself.)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyright is held by the author/owner(s). SC’11, November 12–18, 2011, Seattle, WA, USA. ACM 978-1-4503-0771-0/11/11.

While we think abstractly and subjectively in these dimensions, the jobs and workloads of HPC systems have not typically been described in comparable terms. Existing workloads, including those on TeraGrid, are recognized to be complex and multidimensional in terms of usage, job size, and job duration. At the same time, workloads reliably demonstrate a number of common features [5]. For example, a standard plot of HPC usage by different groups on a system shows a power-law distribution, as shown in Figure 2 for TeraGrid usage on the Abe cluster at NCSA in 2010. But aside from confirming that a few projects use most of the computing time, such distributions are not terribly informative. That is, we can infer no further distinguishing characteristics of any given project, notably how it would be categorized along the “deep” or “capability” dimensions, nor can we distinguish one resource’s usage patterns from another’s.

The challenge, then, is to present measurable metrics for the intuitive deep, wide, and capability dimensions that can be compared across systems. At present, there is no generally accepted way to measure whether a project is “deep,” how deep such projects are, or how “wide” the capacity-oriented community of a resource is. While the NERSC definitions can apply to the system workload in aggregate and provide some notion of capability versus capacity, we can show that they do not necessarily help in understanding the deep and wide dimensions of a system’s science and user community.

In Section 2, we present definitions that help us translate deep and wide into measurable terms, propose metrics and calculations for capturing these dimensions, and apply them to aggregate TeraGrid usage data. Section 3 examines the metrics for individual TeraGrid resources and for Bluefire at NCAR’s Computational and Information Systems Laboratory, and Section 4 discusses what the results reveal.

2. TERMS AND METRICS For the purposes of metrics, we will take the standard notion of a job as the basic unit of work on shared HPC resources. We further define the size of a job as the number of cores used by the job. Because the number of cores per node varies across systems, the number of cores permits better comparisons between resources than the number of nodes does.

Next, we define project to be the basic unit of scientific work. A project is comprised of all the jobs conducted by a set of users in pursuit of a common scientific objective. For TeraGrid, NCAR, and other shared facilities, allocation processes award HPC time to exactly this notion of project, and the corresponding accounting systems track and report usage by those projects [6][7][8]. For completeness, we define a user to be an individual with access to an HPC system who submits jobs as part of a project.

In practice, it could be argued that the basic unit of work is the sum of all jobs submitted by an individual user. This alternate definition is often equivalent to the proposed one; in many cases, a project is comprised mainly of the work of a primary user. However, the deepest computational work is often collaborative, involving multiple individuals. Thus, we prefer to use the allocated project as the unit of scientific work, which is consistent with the goals of the allocation processes and policies for TeraGrid, NCAR, and other shared facilities.

2.1 Defining Deep and Wide Defining “deep,” “wide,” and “capability” presents greater challenges, though, since each term is relative. For example, a 2,048-core job is undoubtedly deep for a 2,048-core system, but considerably less deep for a system with 131,072 (217) cores.

We can, however, measure depth and width. Without a great stretch, the depth of a job is thus defined to be its size in cores. More significantly, we define the depth of a project as the maximum depth of any job run as part of that project. (The depth of a user can be defined similarly.)

The notion of width does not provide such a straightforward measure, although we have shown it should measure some aspect of the user community on a resource. One possibility is to simply define width as the number of projects (or users) on a resource. In this case, a wider system would have more projects or users than a narrower system. However, this simple interpretation does not capture the full scope implied by the notion of “wide.” It also does not permit fine-grained comparisons between resources. For example, if a system has a width of 100 projects or users, you cannot tell whether the width represents a small number of projects on a very large system or a relatively large number of projects on a smaller system. Thus, we must define “width” as a relative measure. Namely, the width of a system is defined as the fraction of projects at or below a given depth.

Finally, for “capability,” we generalize from the NERSC definition, which is actually a specific target for a metric. Namely, the capability of a system is defined as the fraction of use by projects above a given depth. This definition introduces an important distinction from the NERSC definition. Where NERSC defined capability in terms of job depth and usage, we define it in terms of project depth and usage.

The rationale for this definition recognizes the use of HPC systems in practice. No project conducts work or submits jobs at only its largest job size. The scientific objectives may require test runs, pre-processing, post-processing, low-resolution as well as high-resolution experiments, and so forth. While all jobs may not be capability jobs, all the jobs by a given project support the scientific objective, and accomplishing the scientific objective requires the computing capability to complete all the jobs.

Figure 1. Dimensions of scale in HPC systems, science, and communities.

Figure 2. NCSA Abe projects and usage in 2010.
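To make these definitions concrete, here is a minimal sketch of how project depth, width, and capability could be computed from per-job accounting records. It is illustrative only; the record layout and function names are assumptions, not the accounting systems' actual interfaces.

# Illustrative sketch (not the production accounting code): computing project
# depth, system width, and capability fraction as defined above. Job records
# are assumed to be (project_id, cores, wallclock_hours) tuples.
from collections import defaultdict

def project_depths(jobs):
    """Depth of a project = largest core count of any of its jobs."""
    depths = defaultdict(int)
    for project, cores, _ in jobs:
        depths[project] = max(depths[project], cores)
    return depths

def width(jobs, depth_limit):
    """Width = fraction of projects whose depth is at or below depth_limit."""
    depths = project_depths(jobs)
    at_or_below = sum(1 for d in depths.values() if d <= depth_limit)
    return at_or_below / len(depths)

def capability(jobs, depth_limit):
    """Capability = fraction of usage from projects deeper than depth_limit."""
    depths = project_depths(jobs)
    usage = defaultdict(float)
    for project, cores, hours in jobs:
        usage[project] += cores * hours
    total = sum(usage.values())
    deep = sum(u for p, u in usage.items() if depths[p] > depth_limit)
    return deep / total

# For example, capability(jobs, system_cores // 8) applies a 1/8-system
# threshold, but to project depth rather than to individual job sizes.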

2.2 Calculating Depth and Width Given these definitions, we now need a way to measure, present, and compare these metrics. While summary descriptions of complex HPC workflows over time may lend themselves to graphical representations, we ideally would like small sets of numbers that capture the relevant dimensions and permit meaningful comparisons between systems.

To achieve these objectives, job usage data must be summarized appropriately. The critical summarization point proposed here is to classify each job by rounding its size up to the next greatest power of two. That is, a job’s depth is its size in cores rounded to 2^ceil(log2(cores)); for example, a 100-core job has a depth of 128. Therefore, the depth of a project or user is the maximum of 2^ceil(log2(cores)) over all of its jobs in the time period in question.

The rationale for this rounding is as follows. First, it reduces the noisiness in the data to a manageable number of data points. Second, many HPC analyses have shown that most HPC system usage, including that on TeraGrid, is the result of jobs using a power-of-2 number of cores [5], so these are natural points on which to focus.¹ Finally, this summarization allows us to state that all the work at a given data point could be completed on a machine of at least that many cores (given sufficient time).

Thus, the basic data collection for these metrics is reduced to a relatively straightforward operation. In generic SQL:

select ProjectID,
       max(power(2, ceil(log(2, JobCores)))) as Depth,
       count(JobID) as NumberOfJobs,
       sum(JobCores * wallclock) as Usage
from jobs_table
where EndTime between [start] and [end]
group by ProjectID

From these results, we further summarize to tally the number of projects, number of jobs, and the amount of usage at a given depth. Although a single nested SQL query can return fully summarized data, we will show there is value in retaining project-level summaries for further exploration. Finally, and as a key element of this analysis, we plot the cumulative percentage of projects, jobs, and usage at increasing depths.
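As an illustration, this second summarization step might look as follows, starting from the per-project rows produced by the query above. This is a sketch only; the field layout matches the query, but the code is not the original analysis script.

# Illustrative second-stage summary: from per-project rows of
# (ProjectID, Depth, NumberOfJobs, Usage), tally projects, jobs, and usage at
# each depth and compute the cumulative percentages that are plotted.
from collections import defaultdict

def depth_distribution(project_rows):
    projects = defaultdict(int)
    jobs = defaultdict(int)
    usage = defaultdict(float)
    for _, depth, njobs, use in project_rows:
        projects[depth] += 1
        jobs[depth] += njobs
        usage[depth] += use
    tot_p = sum(projects.values())
    tot_j = sum(jobs.values())
    tot_u = sum(usage.values())
    dist = []
    cum_p = cum_j = cum_u = 0.0
    for depth in sorted(projects):
        cum_p += projects[depth]
        cum_j += jobs[depth]
        cum_u += usage[depth]
        dist.append((depth,
                     100 * cum_p / tot_p,
                     100 * cum_j / tot_j,
                     100 * cum_u / tot_u))
    return dist  # (depth, cumulative % projects, % jobs, % usage)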

We note that this query may lead to potential “edge cases” due to the fixed start and end times, while projects may cross those calendar boundaries. On TeraGrid, for example, larger projects begin four times per year, and small projects can start anytime. Thus, projects starting up or winding down near the end or beginning, respectively, of the time period in question may not be reported at their “true” depth, were their full workload represented. Conceivably, at the expense of a more complex query, you could exclude some of these edge cases by considering only projects that were allowed to consume resources for a “significant” portion of the time period in question.

¹ Obviously, hex-core processors require revisiting this assumption. A quick look at 2010 usage on Kraken suggests some evidence to reinforce the assumption, at least in part. Distinct usage spikes appear at some node counts needed to provide power-of-two cores. E.g., the data show notable spikes at 43 nodes, or 512+4 cores; 86 nodes, or 1,024+8 cores; and 342 nodes, or 4,096+8 cores. There are also spikes at power-of-two node counts including 128, 256, and 512, which provide 3 × 2^x cores. Thus, while the power-of-two assumption may largely hold, the straightforward calculation for rounding to the nearest power of two was refined for Kraken in this analysis; namely, we took the floor of the log if the core count was less than 12 cores more than a power of 2.
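A small sketch of the rounding rule, including the Kraken refinement described in the footnote above, may make the calculation concrete; the function name and argument are illustrative rather than the exact script used.

# Sketch of the depth rounding, with the Kraken adjustment from the footnote:
# core counts that exceed a power of two by fewer than 12 cores (one pair of
# hex-core chips) are rounded down rather than up.
import math

def depth_class(cores, hexcore_adjust=False):
    if cores <= 1:
        return 1
    exact = math.log2(cores)
    if hexcore_adjust and cores - 2 ** math.floor(exact) < 12:
        return 2 ** math.floor(exact)
    return 2 ** math.ceil(exact)

# Examples from the footnote: 43 Kraken nodes = 516 cores -> 512, and
# 86 nodes = 1,032 cores -> 1,024, while 600 cores still rounds up to 1,024.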

2.3 Examining the Results The result of applying this analysis to TeraGrid HPC jobs from calendar year 2010 is shown in Figure 3. The query excludes jobs and usage from non-HPC systems, such as high-throughput Condor systems, which skew the job numbers and reflect non-HPC usage modalities [9]. Thus, the data reflect usage spanning 13 HPC systems, and minimal data cleaning was performed aside from the noted adjustment for Kraken’s hex-core chips and excluding jobs with a wall-clock time of zero.

The figure shows a striking “80-20” rule spanning TeraGrid. At a depth of 1,024 cores, 77% of TeraGrid projects consume 18% of the computational usage on TeraGrid. And conversely, the remaining projects consume the remaining delivered core-hours. More formally, it is possible to calculate the “joint ratio” for projects versus usage [10]. In this case, the joint ratio gives the specific values at which x% of the projects use y% of the resources, and the remaining y% of the projects use the other x% of the resources. To compare projects and usage across resources, the joint ratio offers a more meaningful point of reference than the median or average project usage as might be calculated from the data in Figure 2. By interpolating between the data points at 1,024 and 2,048, we arrive at a joint ratio that is exactly 79.9/20.1 or—entirely coincidentally—80/20 at a depth just over 1,024 cores.

Figure 3. Project depth and usage for TeraGrid, 2010.

Figure 4. Job depth and usage for TeraGrid, 2010.
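As an illustration, the joint ratio can be located by scanning the cumulative distribution for the point where the project and usage percentages sum to 100 and interpolating between adjacent depth classes. The linear interpolation below is an assumption, since the exact interpolation used is not specified.

# Sketch of the joint-ratio calculation over the (depth, cum % projects,
# cum % jobs, cum % usage) tuples produced by depth_distribution() above.
def joint_ratio(dist):
    prev = None
    for depth, pct_proj, _, pct_use in dist:
        gap = pct_proj + pct_use - 100
        if gap >= 0:
            if prev is None or gap == 0:
                return pct_proj, 100 - pct_proj, depth
            # Linear interpolation between the previous point and this one.
            p_depth, p_proj, p_use = prev
            frac = (100 - p_proj - p_use) / ((pct_proj + pct_use) - (p_proj + p_use))
            proj = p_proj + frac * (pct_proj - p_proj)
            return proj, 100 - proj, p_depth + frac * (depth - p_depth)
        prev = (depth, pct_proj, pct_use)
    return None

# The paper reports a joint ratio of about 80/20 for TeraGrid in 2010 (Figure 3).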

For comparison, Figure 4 shows the same set of jobs, but categorized only by job depth. The difference between the figures is dramatic. In Figure 4, the joint ratio for jobs versus usage is 88/12 at a depth of 64 cores. By looking only at the job-based summary, you might (incorrectly) assume that roughly 90% of TeraGrid projects are very small scale and that only a small fraction of projects need significant capability—only 1.9% of jobs are larger than 1,024 cores. In contrast, Figure 3 shows that more than 20% of TeraGrid projects need a system with more than 1,024 cores to complete their scientific objectives.

Figure 3 also overlays the cumulative distribution of jobs at each depth. In this case, the figure does show an “interesting” leap in jobs at 128 cores. The 10% of additional projects at depth 128 submitted 35% of all the TeraGrid HPC jobs in 2010. We show the job distribution because, as we will demonstrate, it can provide pointers for identifying such interesting areas for further analysis and exploration of project behavior. By including all three distributions, we represent aspects of the science need (usage), community (projects), and computing scale (jobs).

Figures 3 and 4 also reveal two different ways of assessing the NERSC capability target of 40% of system activity associated with jobs at 1/8 the system size. Figure 4 shows that only about 13% of TeraGrid usage is due to “capability” jobs at 1/8 the total (hypothetical single-system) size; to reach 40% of the usage, you have to include all jobs larger than 2,048 cores, or 1/64 of the system. However, Figure 3 shows that 33% of the usage is due to projects that need nearly the 1/8-scale capability (more than 8,192 cores) for at least some of their work.

3. SINGLE-RESOURCE EXAMPLES While the global behavior of TeraGrid users does mimic many usage patterns seen in single-system HPC use [5], we need to apply the same analysis to individual systems to see whether it provides insight into user behavior and allows resource comparisons. We therefore look at several TeraGrid systems as well as the non-TeraGrid Bluefire system at NCAR. We show 2010 calendar year data for Kraken, the 112,896-core Cray XT5 at the National Institute for Computational Sciences (NICS); Abe, the 9,600-core Dell PowerEdge cluster at NCSA; Steele, the 7,144-core Dell/Intel cluster at Purdue; and Bluefire, the 4,064-core IBM Power6 cluster at NCAR.

In Figure 5, Kraken shows a 71/29 joint ratio at a depth of just above 4,096 cores. That is, Kraken’s width at 4,096 cores is 231 projects (71% of Kraken’s 326 projects), with 95 at greater depth consuming 71% of the system’s delivered core-hours. Assuming a target capability of 1/8 of the system (i.e., depth greater than 8,192), we see about 25% of Kraken’s projects have such a depth and they consume 67% of the delivered core-hours. Several features in the figure highlight the capability orientation of Kraken. Notably, at high core counts the slopes of both the project and usage curves continue to climb rather than flatten out; 30% of Kraken’s usage is attributed to projects at the full-machine depth.

In Figure 6, NCSA’s Abe shows a joint ratio of 69/31 at a depth just below 128 cores. Abe’s width is therefore 323 projects (69% of 468) at 128 cores, which consume 31% of the system’s core-hours. At 1/8 of the system size (a depth of 2,048 cores), Abe has only about 7% of its projects, which consume 26% of its usage.

Figure 5. NICS Kraken project depth and usage, 2010

Figure 6. NCSA Abe project depth and usage, 2010

Figure 7. Purdue Steele project depth and usage, 2010

Figure 8. NCAR Bluefire project depth and usage, 2010


The figure shows several notable features. First, about a third of Abe’s sizeable user community runs only single-node jobs (8 cores) and is responsible for nearly half of the jobs submitted. Collectively, though, these projects use less than 10% of the system’s core hours. Similarly, projects at a depth of 64 cores produce another 30% of the jobs. From a usage standpoint, however, the heaviest consumers run at depths of 256 and 2,048 cores, each of which accounts for 25% of the usage. In contrast to Kraken, Abe’s curves flatten out at the highest core counts.

Figure 7 shows the data for Purdue’s Steele. Steele has a joint ratio of 78/22 at just below 64 cores. Thus, 122 projects (78% of 157) need no more than 64 cores, and they consume almost a quarter of the core-hours. At greater depths, very little capability work is conducted on Steele.

Finally, in Figure 8, NCAR’s Bluefire use for 2010 is shown, with a joint ratio of 69/31 at 256 cores.² Bluefire’s width is therefore 338 projects (69% of 490) at 256 cores, with those projects consuming 31% of the delivered usage. At 1/8 of the system size (depth greater than 256 cores), Bluefire sees only about 11% of its projects, which consume 31% of the delivered usage. The “sweet spot” for Bluefire usage is clearly at a depth of 512 cores, where projects consume 40% of the system’s usage.

The figure shows two distinct features. First, a sharp “knee” appears below 32 cores, which is the node size on the system. A small but non-zero number of projects need only very small jobs. Second, the figure shows Bluefire projects to be concentrated at middle depths, with a steep climb in projects and usage between depths of 64 and 1,024, with little usage outside that range. Thus, Bluefire’s pattern shows steeper growth than the much flatter growth curves for Abe, but tapers off far more rapidly at larger core counts than the distributions for Kraken.

4. DISCUSSION Independently, the distributions for the four resources paint unique pictures of the system usage. But since we are using a standard summarization method, we are also able to discuss and compare usage patterns among different systems.

4.1 “Interesting” Modalities We have already seen some examples of “interesting” features in the distributions. Steele (Figure 7) clearly shows an unusual pattern at a depth of 128. While the project curve increases by 8%, the usage curve jumps by more than 60%, and the jobs curve jumps by 87%. The sharp discontinuity in the Steele job and usage distributions points to a possible job “flurry” [11] or a very-large-scale ensemble project. In either case, this chart shows evidence of a “narrow” set of one or more projects that is “deep” in terms of total computing need but runs only capacity jobs. In fact, closer examination of the Steele data points to a single project being responsible for the bulk of the jobs and usage on the system. Clearly, the deepest project on Steele is not a capability project.

The Abe and Steele data point to other interesting avenues for exploration, notably the large fraction of projects running only on a single node, suggesting many projects with serial computing needs. Science gateways [12] are one class of projects that typically generate many small-scale jobs, but the number of gateways operating on TeraGrid is much smaller than the number of projects at depth 8 on Abe.

² You might note that Bluefire has a system size of 4,064 cores, but reports some jobs at a depth of 8,192 cores. This is due to a feature of the Power6 processors: each physical core can act like two virtual cores by taking advantage of idle cycles. For determining depth, we reported the number of virtual cores used if they were greater than the number of physical cores used, to reflect the same capability on other systems. Usage was calculated based on physical cores used.

In the case of Bluefire, it may be worth examining the projects that need less than 32 cores (1 node) to see if they are making efficient use of the resources or are perhaps edge cases in the data.

4.2 Absolute and Relative Depth and Width Initially, it was expected that the joint ratio would be a simple, yet distinguishing characteristic of these project and usage distribution curves, but in fact, the calculated joint ratios for the example systems were all close to one another; Abe and Bluefire showed exactly the same joint ratio (69/31) for markedly different project and usage curves.

Instead, the value of calculating the joint ratio lies in identifying the depth at which the joint ratio occurs. Each example system had a different joint ratio depth: 64 for Steele, 128 for Abe, 256 for Bluefire, and 4,096 for Kraken. In absolute terms, Kraken is clearly more capability oriented, but it is also three levels deeper—that is, O(23) more cores—than the next largest system here. So, we can also consider relative depth of the systems. In Kraken’s case, the joint ratio depth is five levels below the maximum system depth. For Abe and Steele, the distance is 7, while Bluefire shows a distance of 5. Thus, Abe and Steele are also relatively more capacity-oriented for their size than Kraken. Comparing Kraken and Bluefire is not so cut and dried, however, and we must look more closely at the shape of the distributions. On the other hand, extreme values in a joint ratio may be useful for indicating extreme resource usage patterns. For example, a joint ratio of 90/10 would indicate a very wide user community making very shallow use of the resource, while a very narrow portion of the community would be extremely deep users. In contrast, a joint ratio of 50/50 would indicate the resource usage being split evenly between projects at lesser depths and at greater depths. Such a situation may suggest a wider community with deeper computing needs comprised of capacity-scale, ensemble-oriented projects, for example. A similar analysis allows us to compare the “width” of the systems, either in absolute or relative terms. And because the width dimension describes the scale of the user community, it is most appropriate to compare the percentage and number of projects at a given depth. For example, at a depth of 64, Bluefire has a width of 36% (179 projects). At the same depth, Abe has a width of 60% (280 projects), and Steele has a width of 80% (126 projects), while Kraken’s width is only 10% (39 projects). In this case, Bluefire is “wider” than Steele in absolute terms (number of projects), but Steele is twice as wide in relative terms. Similarly, Abe is widest and Kraken is narrowest in both absolute and relative terms.

4.3 Stability of the Analysis Many studies of HPC workloads have identified short-duration jobs as a complication in calculating meaningful metrics and workload predictions. At the same time, such short jobs represent very small portions of an HPC system’s overall workload. In [5], we showed that excluding such short-duration jobs and focusing only on “high-use” jobs can reveal further insights into a system’s usage patterns. Here, we want to determine whether these depth and width distributions are affected or skewed by short jobs.


In the current depth and width analysis, focusing on high-use jobs allows us to see whether the all-job analysis included a significant number of outlier projects or edge cases. Excluding any projects limited to very short jobs would provide an opportunity to examine the excluded projects to determine whether they represent further interesting usage modalities—for example, a science gateway—or simply represent edge cases, such as projects that recorded only a few jobs as the project was winding down or starting up during the time period under consideration.

Looking only at high-use jobs also helps to assess the validity of summarizing usage by projects according to their single deepest job. That is, would the patterns change significantly due to many projects being categorized at greater depths than the bulk of their workload? Therefore, Figures 9 and 10 show the high-use depth patterns for NCSA’s Abe and NCAR’s Bluefire, which we selected here because both had large numbers of jobs and projects, as well as distinct workload distributions. Figure 9 shows that, in the case of Abe, a high-use analysis retains 87% of the projects and 96% of the usage, while only encompassing 17% of the jobs. The joint ratio for the high-use case is 68/32, about halfway between 64 and 128 cores. Thus, both visual inspection and the calculated joint ratio indicate that the overall usage pattern persists, even when looking at only 1/6 of the system’s jobs. Examination of the data further confirms that the excluded projects were distributed across the various depths and not limited to either high- or low-depth projects.

In Figure 10, Bluefire’s depth distributions demonstrate similar resilience. The high-use analysis captures 93% of the projects and 92% of the usage, while encompassing an even smaller fraction (13%) of the year’s jobs. The calculated joint ratio is 68/32 at 256 cores, essentially the same as before. In this example, we can see at least one edge case that was adjusted—the sole project classified at a depth of 8,192 in Figure 8 has disappeared, indicating that its largest jobs were in fact not a large amount of work. As with Abe, Bluefire’s excluded projects were distributed across the various depths and not limited to either high- or low-depth projects.

Thus, the high-use analysis shows that the depth and width distributions capture a stable pattern out of the all-job workload. Conversely, the all-job results are not dramatically altered by the presence of the short jobs, as may be the case with other metrics.
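A sketch of this stability check might recompute the distributions after dropping short jobs and report the retained fractions. The wall-clock cutoff below is an arbitrary placeholder, since the precise “high-use” criterion comes from [5] and is not restated in this paper.

# Sketch of the stability check: drop short jobs and compare what fraction of
# projects, jobs, and usage survives. Job records are assumed to be
# (project_id, cores, wallclock_hours) tuples; min_hours is a placeholder.
def high_use_subset(jobs, min_hours=1.0):
    return [j for j in jobs if j[2] >= min_hours]

def retained_fractions(all_jobs, kept_jobs):
    def totals(js):
        projects = {p for p, _, _ in js}
        usage = sum(c * h for _, c, h in js)
        return len(projects), len(js), usage
    p_all, j_all, u_all = totals(all_jobs)
    p_kept, j_kept, u_kept = totals(kept_jobs)
    return p_kept / p_all, j_kept / j_all, u_kept / u_all

# For Abe in 2010, the high-use subset retains 87% of projects and 96% of
# usage while covering only 17% of the jobs (Figure 9).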

5. CONCLUSIONS The analyses conducted thus far suggest that, for system operations and management staff, summarizing project-oriented HPC system use according to a project’s largest recorded job size can capture a sense of how a system’s user community makes use of the resource. In addition, these analyses allow us to translate a number of intuitive terms—deep, wide, capability, capacity—into measurable and comparable notions of depth and width aligned with the science need and community size on a resource.

However, further examination and study are needed to understand the limits and application of these analyses. For example, the general project-usage distribution in Figure 2 is known to persist across different time periods (yearly, quarterly, monthly, etc.). Whether the depth and width distributions persist at shorter time scales, and what any differences would imply, remains to be explored.

There may also be additional information or relationships that can be inferred from the depth and width distributions about system differences and potentially the factors contributing to the differences. For example, are there predictable relationships between depth and width distributions and more familiar metrics, such as wait time or turnaround time? Do the science domains or dominant applications affect the distributions? Bluefire use is dedicated to atmospheric and related sciences, while TeraGrid systems serve a much broader portfolio of domains. In addition, Bluefire’s workload is dominated by a few community models, notably the Community Earth System Model (CESM) [13] and the Weather Research and Forecasting (WRF) model [14], which may lend themselves to “preferred” job sizes.

Finally, it remains to be considered if operational controls can affect the distributions in predictable ways. The Trestles system at SDSC, for example, was deployed in early 2011 with the goal of serving a wider community with less need for capability computing [15]. In particular, SDSC adopted allocation size limits, a maximum job size of 1,024 cores, and flexible scheduling policies designed to support a variety of shallower usage modalities and a wider range of users. Further analysis is needed to determine whether modifying usage policies or scheduling approaches can “steer” the distributions in a desired direction.

6. ACKNOWLEDGMENTS This work was supported by NSF grant number OCI-0503697, which funds the TeraGrid’s Grid Infrastructure Group, including the accounting and allocations systems, and by NSF AGS-0753581, which supports the activities of the National Center for Atmospheric Research and its Computational and Information Systems Laboratory.

Figure 9. NCSA Abe, high-use jobs in 2010.

Figure 10. NCAR Bluefire, high-use jobs in 2010.


7. REFERENCES [1] Strohmaier, E., Dongarra, J.J., Meuer, H.W., and Simon, H.D. 2005. Recent trends in the marketplace of high performance computing. Parallel Computing, 31, 3-4 (March-April 2005), 261-273. DOI=10.1016/j.parco.2005.02.001.

[2] Catlett, C. 2005. TeraGrid: A Foundation for US Cyberinfrastructure. In Network and Parallel Computing, H. Jin, D. Reed, W. Jiang, eds. Lecture Notes in Computer Science series. Springer, Berlin/Heidelberg, 3779:1. http://dx.doi.org/10.1007/11577188_1.

[3] Ndousse, T. 2006. ASCR Performance Measures. Advanced Scientific Computing Advisory Committee (ASCAC) Meeting, (Washington, DC, March 15-16, 2006). http://science.energy.gov/ascr/ascac/meetings/mar-2006/.

[4] Branscomb, L. et al. 1993. From Desktop to TeraFlop: Exploiting the U.S. Lead in High-Performance Computing. Final Report of the National Science Foundation Blue Ribbon Panel on High-Performance Computing, National Science Foundation, Arlington, VA, 1993; http://www.nsf.gov/pubs/stis1993/nsb93205/nsb93205.txt.

[5] Hart, D.L. 2011. Measuring TeraGrid: workload characterization for a high-performance computing federation. IJHPCA, Published online before print, February 10, 2011, DOI=10.1177/1094342010394382.

[6] NCAR. 2011. Allocations. http://www2.cisl.ucar.edu/docs/allocations.

[7] TeraGrid. 2011. NSF Resource Allocations Policies. https://www.teragrid.org/web/user-support/allocations_policy.

[8] Department of Energy. 2011. INCITE Call for Proposals. http://hpc.science.doe.gov/allocations/calls/incite2012.

[9] Katz, D., Hart, D., Jordan, C., Majumdar, A., Navarro, J.P., Smith, W., Towns, J., Welch, V., and Wilkins-Diehr, N. 2011. Cyberinfrastructure Usage Modalities on the TeraGrid. In Proceedings of the 25th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2011), Anchorage, Alaska, USA, May 16-20, 2011.

[10] Feitelson, D.G. 2006. Metrics for mass-count disparity. In Proceedings of the 14th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), Monterey, Calif., USA, September 11-14, 2006, pp. 61-68.

[11] Feitelson, D.G., and Tsafrir, D. 2006. Workload Sanitation for Performance Evaluation. In IEEE International Symposium on Performance Analysis of Systems and Software 2006. IEEE Computer Society, pp. 221-230.

[12] Wilkins-Diehr, N., Gannon, D., Klimeck, G., Oster, S., and Pamidighantam, S. 2008. TeraGrid science gateways and their impact on science. Computer, 41: 32–41.

[13] Community Earth System Model (CESM), http://www.cesm.ucar.edu/.

[14] Weather Research and Forecasting (WRF) Model, http://wrf-model.org/.

[15] Moore, R.L., Hart, D.L., Pfeiffer, W., Tatineni, M., Yoshimoto, K., and Young, W.S. 2011. Trestles: A high-productivity HPC system targeted to modest-scale and gateway users. TeraGrid’11, July 18-21, 2011, Salt Lake City, Utah.