Lawrence Livermore National Laboratory

Automated Extraction of Skeleton Apps from Apps

February 2012

Daniel Quinlan (LLNL), Matt Sottile (Galois), Aaron Tomb (Galois)

Lawrence Livermore National Laboratory, P. O. Box 808, Livermore, CA 94551

Operated by Lawrence Livermore National Security, LLC, for the U.S. Department of Energy, National Nuclear Security Administration under Contract DE-AC52-07NA27344

What is a Skeleton and why you want one

A skeleton is a reduced-size version of an application that focuses on one or more aspects of the behavior of the full original application. Examples include:
• MPI usage, message passing patterns
• memory traversal
• I/O demands

This is important for Exascale:
• Provides inputs to simulators for evaluation of expected Exascale architectures and features (e.g. SST/macro)
• Provides smaller applications for independent study

A skeleton program will not get the same answer as the original application

There is prior work in this area… I think we are the only ones with a distributed tool for this…

CoDesign Tool Flow: Automatic Generation of Skeletons for Rapid Analysis

[Figure: CoDesign tool flow] This talk is about these arrows

We can generate many skeletons from an App

Many skeletons could be generated from a single application

The process can work on full applications or smaller compact applications

[Figure: a single App with many files, seen through Aspect A, Aspect B, ..., Aspect X, yields Skeleton A, Skeleton B, ..., Skeleton X: many Skeleton Apps, each with maybe many files.]

An Automated or Semi-Automated Process

We treat this as a compiler research problem

We are building tools to automate the generation of skeletons, but some questions are difficult to resolve:
• May require dynamic analysis to identify important values
• May require some user annotations to define some behavior

We start with the original application and transform it, modifying and removing code through an automated process; this is a source-to-source solution

We are using the ROSE Source-To-Source Compiler to support this work

Science & Technology: Computation Directorate

[Figure: ROSE-based Skeleton Generation Tool. Labels: Source Code (Fortran/C/C++, OpenMP); ROSE Frontend; ROSE IR; Analyses / Transformations / Optimizations (system dependency, sliced system dependency, control flow, control dependency); Unparser; Transformed Source Code.]

A Non-trivial problem to Automate

Different aspects are related (they are not actually orthogonal)
• Example: inter-message timings are a function of the computational work that an app does.

Static analysis is not always precise, and dynamic analysis is not always complete

Our research focuses on using static analysis and formal methods to generate plausible, realistic skeletons.

Example of Automated Skeleton Code Generation: Before/After

Before:

    do {
        if (rank < size - 1)
            MPI_Send( xlocal[maxn/size], maxn, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD );
        if (rank > 0)
            MPI_Recv( xlocal[0], maxn, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD, &status );
        if (rank > 0)
            MPI_Send( xlocal[1], maxn, MPI_DOUBLE, rank - 1, 1, MPI_COMM_WORLD );
        if (rank < size - 1)
            MPI_Recv( xlocal[maxn/size+1], maxn, MPI_DOUBLE, rank + 1, 1, MPI_COMM_WORLD, &status );
        itcnt ++;
        diffnorm = 0.0;
        for (i=i_first; i<=i_last; i++)
            for (j=1; j<maxn-1; j++) {
                xnew[i][j] = (xlocal[i][j+1] + xlocal[i][j-1] + xlocal[i+1][j] + xlocal[i-1][j]) / 4.0;
                diffnorm += (xnew[i][j] - xlocal[i][j]) * (xnew[i][j] - xlocal[i][j]);
            }
        for (i=i_first; i<=i_last; i++)
            for (j=1; j<maxn-1; j++)
                xlocal[i][j] = xnew[i][j];
        MPI_Allreduce( &diffnorm, &gdiffnorm, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD );
        gdiffnorm = sqrt( gdiffnorm );
        if (rank == 0)
            printf( "At iteration %d, diff is %e\n", itcnt, gdiffnorm );
    } while (gdiffnorm > 1.0e-2 && itcnt < 100);

After:

    do {
        if (rank < size - 1)
            MPI_Send( xlocal[maxn / size], maxn, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD );
        if (rank > 0)
            MPI_Recv( xlocal[0], maxn, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD, &status );
        if (rank > 0)
            MPI_Send( xlocal[1], maxn, MPI_DOUBLE, rank - 1, 1, MPI_COMM_WORLD );
        if (rank < size - 1)
            MPI_Recv( xlocal[maxn/size+1], maxn, MPI_DOUBLE, rank + 1, 1, MPI_COMM_WORLD, &status );
        itcnt ++;

        MPI_Allreduce( &diffnorm, &gdiffnorm, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD );

    } while (gdiffnorm > 1.0e-2 && itcnt < 100);

Example of Automated Skeleton Code Generation: Larger example

Source-to-source transformation

Def-use analysis of variables leading to MPI calls

Future work will explore use of:
• System Dependence Graph (SDG)
• Data flow framework and defined concepts of dead-code elimination.
• Can be supplemented with dynamic information
• Can be applied to abstract other things than MPI use

Original Source Code: rank(int iteration)

    void rank( int iteration )
    {
        INT_TYPE i, k;

        INT_TYPE shift = MAX_KEY_LOG_2 - NUM_BUCKETS_LOG_2;
        INT_TYPE key;
        INT_TYPE2 bucket_sum_accumulator, j, m;
        INT_TYPE local_bucket_sum_accumulator;
        INT_TYPE min_key_val, max_key_val;
        INT_TYPE *key_buff_ptr;

        TIMER_START( T_RANK );

        /* Iteration alteration of keys */
        if(my_rank == 0 )
        {
            key_array[iteration] = iteration;
            key_array[iteration+MAX_ITERATIONS] = MAX_KEY - iteration;
        }

        /* Initialize */
        for( i=0; i<NUM_BUCKETS+TEST_ARRAY_SIZE; i++ )
        {
            bucket_size[i] = 0;
            bucket_size_totals[i] = 0;
            process_bucket_distrib_ptr1[i] = 0;
            process_bucket_distrib_ptr2[i] = 0;
        }

        /* Determine where the partial verify test keys are, load into */
        /* top of array bucket_size */
        for( i=0; i<TEST_ARRAY_SIZE; i++ )
            if( (test_index_array[i]/NUM_KEYS) == my_rank )
                bucket_size[NUM_BUCKETS+i] = key_array[test_index_array[i] % NUM_KEYS];

        /* Determine the number of keys in each bucket */
        for( i=0; i<NUM_KEYS; i++ )
            bucket_size[key_array[i] >> shift]++;

        /* Accumulative bucket sizes are the bucket pointers */
        bucket_ptrs[0] = 0;
        for( i=1; i< NUM_BUCKETS; i++ )
            bucket_ptrs[i] = bucket_ptrs[i-1] + bucket_size[i-1];

        /* Sort into appropriate bucket */
        for( i=0; i<NUM_KEYS; i++ )
        {
            key = key_array[i];
            key_buff1[bucket_ptrs[key >> shift]++] = key;
        }

        TIMER_STOP( T_RANK );
        TIMER_START( T_RCOMM );

        /* Get the bucket size totals for the entire problem. These
           will be used to determine the redistribution of keys */
        MPI_Allreduce( bucket_size,
                       bucket_size_totals,
                       NUM_BUCKETS+TEST_ARRAY_SIZE,
                       MP_KEY_TYPE,
                       MPI_SUM,
                       MPI_COMM_WORLD );

        TIMER_STOP( T_RCOMM );
        TIMER_START( T_RANK );

        /* Determine Redistribution of keys: accumulate the bucket size totals
           till this number surpasses NUM_KEYS (which is the average number of
           keys per processor). Then all keys in these buckets go to processor 0.
           Continue accumulating again until surpassing 2*NUM_KEYS. All keys
           in these buckets go to processor 1, etc. This algorithm guarantees
           that all processors have work ranking; no processors are left idle.
           The optimum number of buckets, however, does not result in as high
           a degree of load balancing (as even a distribution of keys as is
           possible) as is obtained from increasing the number of buckets, but
           more buckets results in more computation per processor so that the
           optimum number of buckets turns out to be 1024 for machines tested.
           Note that process_bucket_distrib_ptr1 [...] */

        [...]

        /* [...] NUM_BUCKETS, it is highly possible
           that the last few processors don't get any buckets. So, we
           need to set counts properly in this case to avoid any fallouts. */
        while( j < comm_size )
        {
            send_count[j] = 0;
            process_bucket_distrib_ptr1[j] = 1;
            j++;
        }

        TIMER_STOP( T_RANK );
        TIMER_START( T_RCOMM );

        /* This is the redistribution section: first find out how many keys
           each processor will send to every other processor: */
        MPI_Alltoall( send_count,
                      1,
                      MPI_INT,
                      recv_count,
                      1,
                      MPI_INT,
                      MPI_COMM_WORLD );

        /* Determine the receive array displacements for the buckets */
        recv_displ[0] = 0;
        for( i=1; i<comm_size; i++ )
            recv_displ[i] = recv_displ[i-1] + recv_count[i-1];

        /* Now send the keys to respective processors */
        MPI_Alltoallv( key_buff1,
                       send_count,
                       send_displ,
                       MP_KEY_TYPE,
                       key_buff2,
                       recv_count,
                       recv_displ,
                       MP_KEY_TYPE,
                       MPI_COMM_WORLD );

        TIMER_STOP( T_RCOMM );
        TIMER_START( T_RANK );

        /* The starting and ending bucket numbers on each processor are
           multiplied by the interval size of the buckets to obtain the
           smallest possible min and greatest possible max value of any
           key on each processor */
        min_key_val = process_bucket_distrib_ptr1[my_rank] << shift;
        max_key_val = ((process_bucket_distrib_ptr2[my_rank] + 1) << shift)-1;

        /* Clear the work array */
        for( i=0; i<max_key_val-min_key_val+1; i++ )
            key_buff1[i] = 0;

        /* Determine the total number of keys on all other
           processors holding keys of lesser value */
        m = 0;
        for( k=0; k<my_rank; k++ )
            for( i= process_bucket_distrib_ptr1[k];
                 i<=process_bucket_distrib_ptr2[k];
                 i++ )
                m += bucket_size_totals[i]; /* m has total # of lesser keys */

        /* Determine total number of keys on this processor */
        j = 0;
        for( i= process_bucket_distrib_ptr1[my_rank];
             i<=process_bucket_distrib_ptr2[my_rank];
             i++ )
            j += bucket_size_totals[i]; /* j has total # of local keys */

        /* Ranking of all keys occurs in this section: */
        /* shift it backwards so no subtractions are necessary in loop */
        key_buff_ptr = key_buff1 - min_key_val;

        /* In this section, the keys themselves are used as their
           own indexes to determine how many of each there are: their
           individual population */
        for( i=0; i<j; i++ )
            key_buff_ptr[key_buff2[i]]++; /* Now they have individual key */
                                          /* population */

        /* To obtain ranks of each key, successively add the individual key
           population, not forgetting the total of lesser keys, m. [...] */

        [...]

                failed = 0;

                switch( CLASS )
                {
                    case 'S':
                        if( i <= 2 )
                        {
                            if( key_rank != test_rank_array[i]+iteration )
                                failed = 1;
                            else
                                passed_verification++;
                        }
                        else
                        {
                            if( key_rank != test_rank_array[i]-iteration )
                                failed = 1;
                            else
                                passed_verification++;
                        }
                        break;
                    case 'W':
                        if( i < 2 )
                        {
                            if( key_rank != test_rank_array[i]+(iteration-2) )
                                failed = 1;
                            else
                                passed_verification++;
                        }
                        else
                        {
                            if( key_rank != test_rank_array[i]-iteration )
                                failed = 1;
                            else
                                passed_verification++;
                        }
                        break;
                    case 'A':
                        if( i <= 2 )
                        {
                            if( key_rank != test_rank_array[i]+(iteration-1) )
                                failed = 1;
                            else
                                passed_verification++;
                        }
                        else
                        {
                            if( key_rank != test_rank_array[i]-(iteration-1) )
                                failed = 1;
                            else
                                passed_verification++;
                        }
                        break;
                    case 'B':
                        if( i == 1 || i == 2 || i == 4 )
                        {
                            if( key_rank != test_rank_array[i]+iteration )
                                failed = 1;
                            else
                                passed_verification++;
                        }
                        else
                        {
                            if( key_rank != test_rank_array[i]-iteration )
                                failed = 1;
                            else
                                passed_verification++;
                        }
                        break;
                    case 'C':
                        if( i <= 2 )
                        [...]
                        else
                        {
                            if( key_rank != test_rank_array[i]-iteration )
                                failed = 1;
                            else
                                passed_verification++;
                        }
                        [...]
                    case 'D':
                        if( i < 2 )
                        [...]
                        else
                        {
                            if( key_rank != test_rank_array[i]-iteration )
                                failed = 1;
                            else
                                passed_verification++;
                        }
                        break;
                }
                if( failed == 1 )
                    printf( "Failed partial verification: "
                            "iteration %d, processor %d, test key %d\n",
                            iteration, my_rank, (int)i );
            }
        }

        TIMER_STOP( T_RANK );

        /* Make copies of rank info for use by full_verify: these variables
           in rank are local; making them global slows down the code, probably
           since they cannot be made register by compiler */

        if( iteration == MAX_ITERATIONS )
        {
            key_buff_ptr_global = key_buff_ptr;
            total_local_keys = j;
            total_lesser_keys = 0; /* no longer set to 'm', see note above */
        }

    }

Generated Skeleton Code: rank(int iteration)

    void rank(int iteration)
    {
        INT_TYPE i;
        INT_TYPE k;
        INT_TYPE shift = (23 - 10);
        INT_TYPE key;
        INT_TYPE2 bucket_sum_accumulator;
        INT_TYPE2 j;
        INT_TYPE2 m;
        INT_TYPE local_bucket_sum_accumulator;
        INT_TYPE min_key_val;
        INT_TYPE max_key_val;
        INT_TYPE *key_buff_ptr;
        /* Get the bucket size totals for the entire problem. These
           will be used to determine the redistribution of keys */
        MPI_Allreduce(bucket_size,bucket_size_totals,((1 << 10) + 5),MPI_INT,MPI_SUM,MPI_COMM_WORLD);
        /* This is the redistribution section: first find out how many keys
           each processor will send to every other processor: */
        MPI_Alltoall(send_count,1,MPI_INT,recv_count,1,MPI_INT,MPI_COMM_WORLD);
        /* Now send the keys to respective processors */
        MPI_Alltoallv(key_buff1,send_count,send_displ,MPI_INT,key_buff2,recv_count,recv_displ,MPI_INT,MPI_COMM_WORLD);
    }

Static Analysis Drives Skeleton Generation

First prototype:
• Generate skeleton representing message passing via static analysis (using the use-def analysis in ROSE)

Basic concept, where MPI is the target aspect:
• Identify message passing (MPI) operations.
• Preserve MPI operations and code that they depend on, removing superfluous code.
• Aim to remove large blocks of computational code, replacing them with simpler surrogate code, to produce a skeleton of the app that contains the essential message passing structure without the actual work.

Our research approach has been to explore four different forms of analysis to drive the skeleton generation:

1) Use-def analysis (to generate a form of program slice); works on the AST directly, not directly using the inter-procedural control flow graph (CFG)

2) Program slicing using ROSE's System Dependence Graph (SDG), which captures the def-use analysis and more on the inter-procedural control flow graph in ROSE

3) A new Data-Flow Framework in ROSE; another form of analysis using the inter-procedural control flow graph in ROSE

4) Connections to formal methods
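The "preserve MPI operations and the code they depend on" idea can be illustrated with a toy model. The sketch below is not the ROSE implementation: the `Stmt` type, `skeleton_keep`, and `demo_kept_mask` are hypothetical names, statements are reduced to def/use bitmasks, and the analysis ignores statement order and control flow. It also assumes that only the non-payload arguments of a target call (counts, destinations) count as uses, which is one plausible reading of the Before/After example, where the code producing the message payload is removed.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model: a statement is described by the variables it
   defines and uses, encoded as bitmasks (bit i = variable i). */
typedef struct {
    unsigned defs;    /* variables written by the statement */
    unsigned uses;    /* variables read by the statement */
    bool is_target;   /* true for calls to the target API (e.g. MPI) */
} Stmt;

/* Marks in keep[] every target call plus every statement that defines a
   variable some kept statement uses. Returns the number of kept statements. */
size_t skeleton_keep(const Stmt *prog, size_t n, bool *keep)
{
    unsigned needed = 0;   /* variables whose definitions must survive */
    size_t kept = 0;
    bool changed = true;

    for (size_t i = 0; i < n; i++) {
        keep[i] = prog[i].is_target;
        if (keep[i]) {
            needed |= prog[i].uses;
            kept++;
        }
    }
    /* Fixed point: keeping a definition makes its own uses needed too. */
    while (changed) {
        changed = false;
        for (size_t i = 0; i < n; i++) {
            if (!keep[i] && (prog[i].defs & needed)) {
                keep[i] = true;
                needed |= prog[i].uses;
                kept++;
                changed = true;
            }
        }
    }
    return kept;
}

/* Runs the analysis on a six-statement toy program; bit i of the result
   is set when statement i is kept. */
unsigned demo_kept_mask(void)
{
    enum { N = 1u << 0, DEST = 1u << 1, BUF = 1u << 2, TMP = 1u << 3, RANK = 1u << 4 };
    const Stmt prog[] = {
        { RANK, 0,        true  },  /* 0: MPI_Comm_rank(..., &rank)   */
        { N,    0,        false },  /* 1: n = 100                     */
        { DEST, RANK,     false },  /* 2: dest = rank + 1             */
        { TMP,  BUF,      false },  /* 3: tmp = compute(buf)          */
        { BUF,  TMP,      false },  /* 4: buf = tmp * 2               */
        { 0,    N | DEST, true  },  /* 5: MPI_Send(buf, n, ..., dest) */
    };
    bool keep[6];
    unsigned mask = 0;

    skeleton_keep(prog, 6, keep);
    for (size_t i = 0; i < 6; i++)
        if (keep[i])
            mask |= 1u << i;
    return mask;
}
```

In this toy program the two MPI calls and the assignments to `n` and `dest` are kept, while the two computational statements (3 and 4) are dropped. A real tool also has to respect control dependences such as enclosing loops and branches, which this sketch leaves out.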


Static Analysis: Program Slicing

    int returnMe (int me) {
        return me;
    }

    int main (int argc, char ** argv) {
        int a = 1;
        int b;
        returnMe(a);
        b = returnMe(a);
    #pragma SliceTarget
        return b;
    }

[Figure: System (Inter-procedural) Dependence Analysis]

A sequence of directed edges defines a slice

Can be used for model extraction
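As an illustration of what a backward slice on the `return b;` target could contain for the `returnMe` example, the sketch below keeps only the statements that `b` depends on. It is written by hand rather than produced by the ROSE slicer, and `sliced_main` is a hypothetical name standing in for `main` after slicing. The first call `returnMe(a);` is dropped because its result is discarded and the callee has no side effects.

```c
/* The callee stays in the slice: its return value flows into b. */
static int returnMe(int me)
{
    return me;
}

/* Hypothetical name; stands in for main() after slicing on "return b". */
int sliced_main(void)
{
    int a = 1;          /* kept: a reaches the call that defines b */
    int b;
    b = returnMe(a);    /* kept: defines b */
    return b;           /* the slice target */
}
```

The sliced function returns the same value as the original `main`, which is the property a slice on that statement is meant to preserve.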


Data Flow as an alternative approach to Drive Skeleton Generation

Future work will explore the use of a new Data Flow Framework in ROSE to support analysis required to generate skeletons
• May be an easier way (for users) to specify aspects
• It is related to slicing in that it uses the same inter-procedural control flow graph internally

Each form of analysis (Use-def, SDG, and Data-Flow) is an orthogonal direction of work; all share the common infrastructure we have built for skeleton generation.

The analysis and infrastructure are implemented using ROSE

A Generic API for Skeletonization

Generalized skeletonization target APIs
• Original work focused on skeletonizing relative to the MPI API.
• Current code extended to allow skeletons against any API (e.g., Visualization and Data Analysis, I/O and Storage, use of domain-specific abstractions, etc.)
• Important for building skeletons to probe different aspects of program behavior – I/O, message passing, threading, app-specific libraries
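One simple way to picture a configurable target API is a list of function-name prefixes that decides which calls the skeleton must preserve. The sketch below assumes that representation; the talk does not say how the tool actually specifies an API, and `is_target_call`, `MPI_API`, and `IO_API` are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true when `callee` starts with one of the `count` prefixes. */
bool is_target_call(const char *callee, const char *const *prefixes, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        size_t len = strlen(prefixes[i]);
        if (len > 0 && strncmp(callee, prefixes[i], len) == 0)
            return true;
    }
    return false;
}

/* Illustrative target sets (assumptions, not taken from the tool). */
static const char *const MPI_API[] = { "MPI_" };
static const char *const IO_API[]  = { "fopen", "fclose", "fread", "fwrite" };
```

With `MPI_API` as the target, a call to `MPI_Send` is a slicing seed and `fwrite` is ordinary code; with `IO_API` the roles are reversed, giving a different skeleton of the same application.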


Annotation guided skeletonization

• Previous work focused on purely dependency-based slicing. This led to problems:
  - Removal of computational code could cause loops to cease to converge (iterate forever).
  - Branching patterns no longer meaningful with computational code gone.
• Annotations let the user guide skeletonization to add semantics to the skeleton that are impossible/difficult to statically infer:
  - Loop iteration counts; branching probabilities; variable initialization values.
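To make the three kinds of annotation concrete, here is a hand-written sketch of the surrogate code they might lead to: a loop with a fixed trip count, a branch taken with a given probability, and an explicit initial value. This is not output of the skeleton tool; `surrogate_loop`, `take_branch`, and `lcg_next` are hypothetical names, and the pseudo-random generator is just one reproducible way to realize a branch probability.

```c
#include <stdint.h>

/* Small deterministic generator so the skeleton behaves the same on
   every run (constants are the common Numerical Recipes LCG values). */
static uint32_t lcg_next(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state;
}

/* Returns nonzero with probability of roughly percent/100. */
static int take_branch(uint32_t *state, unsigned percent)
{
    return (lcg_next(state) >> 16) % 100u < percent;
}

/* Surrogate for a loop whose original exit test and branch condition
   depended on removed computation. Returns how often the branch was taken. */
unsigned surrogate_loop(unsigned iterations, unsigned percent, uint32_t seed)
{
    uint32_t state = seed;                          /* annotated initial value */
    unsigned taken = 0;

    for (unsigned k = 0; k < iterations; k++) {     /* annotated trip count */
        if (take_branch(&state, percent))           /* annotated probability */
            taken++;    /* stands in for the work on the taken path */
    }
    return taken;
}
```

Because the trip count no longer depends on a convergence test, the loop always terminates, which addresses the "iterate forever" problem listed above.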


Use of an Annotation Before/After

Before:

    int main() {
        int x = 0;
        int i;
        // execute exactly 10 times
        #pragma skel loopIterate 10
        for (i = 0; x < 100 ; i++) {
            if (x % 2)
                x += 5;
        }
        return x;
    }

After:

    int main() {
        int x = 0;
        int i;
        // execute exactly 10 times
        #pragma skel loopIterate 10
        int k = 0;
        for (i = 0; k < 10; k++) {
            {
                if ((x % 2) != 0)
                    x += 5;
            }
            rose_label__1: i++;
        }
        return x;
    }

User Work Flow for Skeletonization

[Figure: workflow diagram. Boxes: Original Application Program; Dynamic Measurements of Program; Annotated Application Program; Skeleton Extraction Tool; Skeleton Program; Observe Behavior of Skeleton.]

Satisfactory behavior: keep skeleton

Unsatisfactory behavior: modify or add annotations to tune skeleton generator
- Branch probabilities
- Average loop iteration counts
- Legitimate data values

Future work

SDG version of analysis for skeletonization

Using the new Data Flow framework in ROSE for skeletonization

Galois will be working on adding formal-methods-based analysis to the skeleton generator to analyze regions of code to remove.
• Floating point range analysis.
• Symbolic execution.

Formal methods will aim to answer questions to aid skeleton generation such as:
• What range of values do we expect a complex computation to produce?
  - Allows us to automatically select surrogate values for populating data structures
  - Know when specific values are critical
• Under specific input conditions, what code is reachable or not reachable?
  - Allows us to build skeletons for specific input circumstances, instead of generic skeletons

This is a connection to path feasibility analysis currently being developed in ROSE
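The range question can be illustrated with plain interval arithmetic. The sketch below is an assumption about what such an analysis might compute, not the planned Galois analysis: it bounds the four-point averaging update from the earlier Before/After example, the `Interval` type and function names are hypothetical, and floating-point rounding is ignored (a sound analysis would round bounds outward).

```c
/* Closed interval [lo, hi] of possible values. */
typedef struct { double lo, hi; } Interval;

static Interval iv_add(Interval a, Interval b)
{
    return (Interval){ a.lo + b.lo, a.hi + b.hi };
}

/* Scale by a non-negative constant (a negative one would swap the bounds). */
static Interval iv_scale(Interval a, double c)
{
    return (Interval){ a.lo * c, a.hi * c };
}

/* Range of (east + west + south + north) / 4.0 when every neighbour
   value lies in `cell`. */
Interval jacobi_update_range(Interval cell)
{
    Interval sum = iv_add(iv_add(cell, cell), iv_add(cell, cell));
    return iv_scale(sum, 0.25);
}
```

If every grid value starts in [-1, 2], the update also stays in [-1, 2], so a skeleton could fill the array with any surrogate value from that range without running the real computation.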

ROSE Compiler Design

[Figure: ROSE Compiler Design. Labels: General Purpose Languages used within DOE (C & C++, Fortran (F77-F2003), Python, UPC 1.1, OpenMP 3.0, CUDA); Front-End; Mid-End; Back-End; AST Builder API; High Level IRs (AST); IR Extension API (ROSETTA); High Level Analysis & Optimization Framework; Unparser; Low Level IR (LLVM); Low Level Analysis & Optimization; Existing LLVM Analysis & Optimization; LLVM Backend Code Generation; Exascale Vendor Compiler Infrastructures; Exascale Vendor Compilers; Exascale Architecture.]
