
Page 1: C012-2-Sub1.doc.doc

XII. Research Plan Content: (2) Background and objectives of the research plan. Please describe in detail the background, objectives, and importance of this research plan, the state of related research at home and abroad, and a review of important references. If this plan is a sub-project of an integrated research project, please explain, for each of the above points, its relationship to the other sub-projects.

(3) Research methods, procedures, and schedule. Please describe by year: 1. The research methods adopted in this plan and the reasons for them. 2. Anticipated difficulties and approaches to resolving them. 3. How important instruments will be used. 4. If this is an integrated research project, please explain, for each of the above points, the relationship to the other sub-projects. 5. If research abroad or in mainland China is required, please describe its necessity and expected results.

(4) Expected work items and results. Please describe by year: 1. The work items expected to be completed. 2. Expected contributions to academic research, national development, and other applications. 3. The training the participating personnel are expected to receive. 4. If this plan is a sub-project of an integrated research project, please explain, for each of the above points, its relationship to the other sub-projects.

A. Background

In the past, once binary code was generated by a static compiler and then linked and loaded into memory, it was left largely untouched during program execution and for the rest of its software life cycle, unless the code was re-compiled and changed. On more recent systems, however, code manipulation has been extended beyond static compilation time. The most notable examples are virtual machines such as the Java virtual machine (JVM), the C# runtime, and some recent scripting language engines. An intermediate code format, such as JVM bytecode, is generated first and then interpreted at runtime. The intermediate code can also be compiled dynamically after hot code traces are detected during execution, and further optimized in its binary form, allowing continuous improvement and transformation while the program runs.

As a matter of fact, binary code manipulation is not limited to virtual machines. Virtualization at all system levels has become increasingly important with the advent of multi-cores and the dawn of utility computing, also known as cloud computing. When many compute resources are available on a platform, utilizing them efficiently and effectively often requires code compiled for one instruction set architecture (ISA) to be moved across the Internet and run on platforms with different ISAs. Binary translation and optimization thus become very important in supporting such system virtualization.

There are also other important applications that require manipulation of binary code. The most notable is binary instrumentation, in which additional binary code segments are inserted at specific points of the original binary code. These added code segments can monitor program execution by collecting runtime information that can be fed back to profile-based static compilers to further improve their compiled code, detect potential breaches of security protocols, trace program execution for testing and debugging, or reverse-engineer the binary code to retrieve its algorithms or other vital program information.


The main objective of this research is to support high-performance binary code manipulation, and in particular system virtualization, in which binary translation and binary optimization are crucial to performance.

Existing binary manipulation systems such as DynamoRIO, QEMU, and Simics all assume that the binary code they take as input comes straight from existing optimizing compilers. Such binary code is often assembled after linking with runtime libraries and the other files needed for its execution, so it contains the entire executable code image. This allows binary manipulation systems to take a global view of the code and perform global analyses that no individual code piece could undergo during its own compilation phase. However, such global analyses are notoriously time-consuming. Furthermore, much vital program information, such as control- and data-flow graphs, data types and the liveness of variables, as well as alias information, is often unavailable and very difficult to recover from the original binary code. Such program information could be extremely valuable for a runtime binary manipulation system carrying out more advanced kinds of code translation and optimization.

Binary translation and optimization have been used very successfully in many applications. In functional simulators such as QEMU and Simics, an entire guest operating system using one ISA can be brought up on a host platform with a completely different ISA and operating system. Guest applications running on the guest operating system are totally unaware of the host platform and host operating system they run on. Such a virtualization system has many important applications. One is to allow system and application software to be developed in parallel with a new hardware system still under development: the new hardware can be virtualized and run on a host system with a completely different ISA and OS, saving a significant amount of software development time and substantially shortening the time to market. Other virtualization systems being developed on multicore platforms allow different OSes to run simultaneously on the same platform, providing excellent isolation among the OSes for high security and reliability: a crashed OS will not affect the other OSes running concurrently on the same platform.

B. Previous related work

We have many years of experience with dynamic binary optimization systems. ADORE [ ], developed on single-core platforms, was started in the early 2000s on Intel's Itanium processor and later ported to microprocessors developed by Sun Microsystems. ADORE (see Figure 1) has several major components. It relies on the hardware performance monitoring system to help identify program phases, program control flow structures, and performance bottlenecks in the programs.

Because such a runtime binary manipulation system takes compute resources away from the application currently running, its overhead counts as part of the execution time of the very application it tries to optimize. Hence, the resulting program improvement has to be substantial enough to offset that overhead, or the overhead has to be kept small enough not to interfere with the original program execution. Even though the binary manipulation could be done on a different core, and thus not interfere with program execution directly, it still takes core resources away from other useful work.

To minimize runtime overhead, ADORE uses the hardware performance monitoring system to sample machine states at a fixed interval. The interval size determines the overhead, the program phases that can be detected, and the amount of runtime information that can be collected. Once a stable program phase is detected, hot traces of the program execution are generated and optimizations are applied to them. Optimized hot traces are then placed in a code cache, and the original program is patched to transfer control to the optimized code stored there. If, for most of the execution time, the program executes from the optimized code in the code cache, good performance can be achieved.

ADORE, COBRA and current NSF grants

C. Proposed approach

To make the valuable program information obtained during sophisticated static compiler analyses available for more effective and efficient runtime binary manipulation, we plan to have the static compiler annotate the generated binary code with such information. The types and extent of program information useful for runtime binary manipulation will be one of the main subjects of this research.

For example, from the experience of ADORE, we found it extremely difficult to find free registers that can be safely used by the runtime binary optimizer. As the runtime binary optimizer shares the same register file as the code it tries to optimize, it needs to spill some registers to memory for its own use. However, no application programming interface (API) convention is defined for such interaction between the static compiler and the runtime optimizer. Hence, it is not even easy to find available registers with which to execute the spill code itself, because spilling registers requires at least one register to hold the address of the memory location being spilled to.

Another example is that it is quite difficult to determine the boundaries of a loop, especially for loops with complex control flow structures. This makes it extremely difficult to insert memory prefetch instructions for identified long-latency delinquent load operations, because such prefetch instructions often need to be inserted at a location that executes several loop iterations before the intended delinquent load in order to offset the long miss penalty. The difficulty arises because loops are often implemented with jump instructions that are hard to distinguish from the other jump instructions in the same loop or in nesting loops. Hence, annotating loop boundaries and/or the control flow graph of each procedure could save a great deal of analysis time and give the runtime binary optimizer a tremendous advantage in finding optimization opportunities.

Several potentially useful types of program annotation were identified in [Das]. They include: (1) control flow annotations; (2) register usage; (3) data flow annotations; (4) annotations useful to exception handlers; and (5) annotations describing load-prefetch relationships.

Our research on annotations will be driven first by the needs of binary translation and binary optimization, in particular for multi-core systems. As most studies in this area have so far focused primarily on single-core systems, adding annotations for multi-core applications brings additional complexity and challenges. In particular, information that can be used at runtime to balance workload and to help mitigate synchronization overhead will be of particular interest to this research. (Expand on this)

C.1. Annotate binary code by expanding ELF and adding required information

The annotations will primarily be incorporated into the binary code. We plan to use the ELF binary format in our prototype, as ELF is the standard binary format used on most Linux systems. The challenges lie primarily in several important areas:

(1) Size of annotations. Annotating all of the information mentioned in [ ] could take up a lot of memory space and expand the binary code to an unmanageable size. Much annotated information may not be useful for a particular program. A carefully designed annotation format and encoding scheme could drastically reduce the annotation size and keep the ELF file as compact as possible.

(2) API for annotations. Annotating only the types of runtime information useful to a particular program on a particular platform could further reduce size. In many cases, the programmer has the needed inside knowledge, for example, whether the target platform is single-core or multi-core, or whether the program is integer-intensive or floating-point intensive. Such knowledge can affect which runtime optimizations are useful to the code. Hence, a carefully designed API that allows programmers to direct the types of annotations to be generated for particular code regions, e.g. the CFG, DFG, register liveness information, or likelihood of exceptions, could benefit both the generation of annotations and the selection of runtime optimizations to apply.

(3) Dynamic update of annotations. The types of useful annotation might change after each phase of binary manipulation. For example, during binary translation we might want to add information useful to the later runtime optimization; binary translation might also change the original CFG or DFG, and runtime information may reveal that some original annotations have become useless, so that they can be trimmed to reduce overall code size.

(4) Code security. As runtime binary manipulation can alter the original binary code, avoiding the accidental overwriting of code regions during annotation updates, binary translation, and optimization must be carefully considered. For example, using offsets from a base address for all memory references, instead of absolute memory addresses, could avoid accidentally stepping into forbidden code regions and improve security.

C.2. API for specifying the program information to be included in the binary code

To allow flexibility and "annotate as you go", we need to provide a good API for the user/compiler to specify the types of annotation needed, and a good API that allows the binary manipulator (i.e. the translator and optimizer) to access annotations without knowing their format and arrangement. This also makes future upgrades and changes to the annotations possible without changing the users of the annotations, as the API is standardized.

C.3. Annotations useful in sub-project 2 for binary translation

The main purpose of binary translation is to translate binary code in the guest ISA into binary code in the host ISA. In sub-project 2, we propose to build a binary translator based on a QEMU-like functional simulator. In such a translator, each guest binary instruction is first interpreted and then translated into host binary instructions. In QEMU, each guest instruction is translated into a string of micro-operations defined in QEMU; these micro-operations are then converted to the host ISA. The advantage of this approach is that the machine state can be accurately preserved and converted from one ISA to another. Hence, even privileged instructions can be interpreted and translated this way, which allows an OS to be booted and run on QEMU accurately. However, the overhead of this approach is quite high, as each guest instruction requires several micro-operations to carry out, and each micro-operation may in turn require several host machine instructions to execute.

As shown in sub-project 2, we plan to interpret and translate guest instructions on a per-basic-block basis. These basic blocks will be translated into an intermediate representation (IR) format, such as that used in LLVM, and optimized at the IR level using the annotated information provided in the guest binary code. Such optimization will usually be mostly machine-independent and fast. The LLVM code generator can then be used to generate machine-dependent host binary code. When hot traces are identified, further machine-dependent optimizations will be applied to them, and they will be kept in a hot-trace code cache for fast program execution and further optimization. Information collected by the runtime hardware performance monitoring system can be used in these further optimizations.

There are many implications for how the annotated information in the guest binary code is used, and this in turn determines what kinds of annotated information will be useful. Past experience shows that, instead of interpreting each guest binary instruction first and only starting to translate after hot traces are identified, it is actually more efficient to translate each guest binary instruction as it is being executed. Code optimization can then be applied after hot traces are identified, and optimized hot traces are placed in the code cache for future execution. If the coverage of the traces in the code cache is high, most of the execution time will come from the optimized hot traces in the code cache, and overall performance can be significantly improved.


In the approach proposed in sub-project 2, we will first translate the guest binary into a machine-independent IR format on a per-basic-block basis, carry out machine-independent code optimization at the IR level, and then generate code for the targeted host machine with another ISA. The annotations in the guest binary code can then be used directly in machine-independent code optimization within each translated basic block before host machine code is generated. The annotated information could also help in forming hot traces and in global optimization across basic blocks within an identified hot trace. Some annotations, such as the branch target information for indirect branches identified in [Das], could be useful in this process as well.

C.4. Annotations useful in sub-project 3 for further binary optimization

Many types of annotation that could be useful in binary optimizations for single-core platforms have been identified in [Das]. We plan to identify more annotated information useful on multi-core platforms.

Here, we have two possible scenarios. One is similar to a traditional binary optimizer, such as ADORE, in which the guest binary code uses the same ISA as the host platform. The other is for the binary optimizer to optimize the translated binary code produced by a binary translator; in that case, annotated information may have to be converted to match the translated host binary code as well. In sub-project 3, our main focus will be on optimizing the translated host binary code. As mentioned earlier, two levels of optimization can be performed: machine-independent optimization at the machine-independent IR level, and optimization at the machine-dependent host binary code level after runtime information is collected during program execution. Even some machine-dependent optimizations, such as register allocation, might be parameterized and performed during the machine-independent optimization phase at the IR level. For example, if we know the number of registers on the host platform and its ABI conventions, such as register assignment in a procedure call, annotations in the guest binary code such as alias and data dependence information could be used to optimize register allocation.

Other useful annotations include:

(1) Control flow graph. It could be used to determine the loop structures for hot traces to increase optimized code coverage. It has also been proposed that edge profiling information could be used to annotate each branch instruction with whether it is most likely to be taken or not taken; hot traces could then be formed this way during the binary translation phase. Indirect branch instructions, whose branch targets are not known until runtime, are the most challenging part of forming a CFG. However, many indirect branches come from high-level program structures such as the switch statement in a C program or the return statement of a procedure. If such information is annotated, their target instructions can be identified and a more accurate CFG can be constructed.

(2) Live-in and live-out variables could be annotated to enable better-optimized local register allocation. If the optimization is performed after binary translation, the live-in and live-out information might need to be updated, as the translated code might introduce additional information that must be passed on to the following basic blocks. Here, the register acquisition problem may not be as serious as in ADORE, because register allocation will be performed during the binary translation process; available free registers can be annotated after the translation phase for the binary optimizer to use later.

(3) Data flow graph. Def-use and use-def information could be useful in many well-known compiler optimizations. Other useful information includes alias information and data dependence information, which could help in register allocation and some well-known partial redundancy elimination (PRE) optimizations. It has even been suggested that some information obtained from value profiling could be annotated to help track data flow information. However, the effectiveness of such optimizations and the amount of overhead they require at runtime are interesting research issues.

(4) Exception handlers. Many optimizations can alter the order of program execution and, hence, the original machine state at the point an exception is thrown. However, binary code usually carries no information on where exception handlers will be used. To produce accurate machine states when exceptions occur in the original guest binary code, much potential code optimization, such as code scheduling, must be conservatively suppressed, which can have a very significant impact on program performance. By annotating the regions covered by exception handlers and avoiding aggressive optimizations only in those regions, such concerns can be eliminated.

(5) Prefetch instructions and their corresponding load instructions. Prefetching can be very effective in reducing miss penalties and cache misses on single-core platforms. However, prefetch instructions usually come with additional instructions to compute prefetch addresses. Also, because of its proven effectiveness, existing optimizing compilers often generate very aggressive, and often excessive, prefetch instructions. Such excessive prefetching can consume a lot of bus and memory bandwidth if the corresponding load instructions are not delinquent loads as originally assumed. It has been shown that selectively eliminating such aggressive prefetch instructions actually improves performance in many programs on single-core platforms [ ]. In a multi-core environment, where multiple programs may execute concurrently, overly aggressive prefetching from one program can adversely affect the other programs running on the same platform because it takes away valuable bus and memory bandwidth they need. Identifying a particular load instruction together with its corresponding prefetch instruction and the supporting address calculation instructions could help eliminate those instructions when the load is no longer delinquent in a particular run.

(6) Workload and synchronization information for parallel applications. For parallel applications running on multi-core platforms, we will identify and study the types of annotation that could help improve program performance, or help track program execution for testing and debugging. Well-known information that can help improve parallel programs includes the workload estimate of each thread and the synchronization information that identifies critical sections or signal and wait instructions. Such information could help balance workload at runtime and reduce synchronization wait time and overhead. Other useful information, such as that identified in (5), which could improve bus and memory bandwidth as well as load latency on a multi-core platform, and the tracking of cache coherence traffic to remove potential false data sharing, will be of great interest to our study.


C.5. Evaluation

Evaluating the effectiveness of adding annotations to binary code will be another major effort in this project. The Open64 compiler will be used as our main platform for generating annotations. Open64 is a very robust, production-level, high-quality open-source compiler. It has all the major components of a production compiler and also supports profile-based compilation. It has good documentation and a large, active user community. It is currently supported by major companies such as HP and AMD, and by major compiler groups at the University of Minnesota, the University of Delaware, the University of Houston, and several others. It supports almost all major platforms, including Intel IA-32, Itanium (IA-64), MIPS, and CUDA, and is a general-purpose compiler supporting C, C++, Fortran, and Java.

We plan to use it to generate all of the useful binary annotations and to study their effectiveness in improving program performance, the resulting code size expansion, and the API support needed.

D. Other related work

E. Methodology and our Work Plan (for each year)

E.1 The annotation framework

Figure 1 shows this sub-project's framework, which can be divided into two parts: the annotation producer and the annotation consumer. The producer generates annotations and embeds them into the binary file; its main component is the compiler. The consumer reads annotations from the binary file and tries to leverage them efficiently; its main components are the binary translator and the binary optimizer.

Annotation data comes from two sources: static compiler analysis and analysis of the program's profiles. We will adopt Open64 as the compiler. Open64 is an open-source optimizing compiler for the Itanium and x86-64 microprocessor architectures; it derives from the SGI compilers for the MIPS R10000 processor and was released under the GNU GPL in 2000. Open64 supports Fortran 77/95 and C/C++, as well as the shared-memory programming model OpenMP. Its major components are the front-ends for C/C++ (using GCC) and Fortran 77/90 (using the CraySoft front-end and libraries), inter-procedural analysis (IPA), the loop nest optimizer (LNO), the global optimizer (WOPT), and the code generator (CG). It can conduct high-quality inter-procedural analysis, data-flow analysis, data dependence analysis, and array region analysis.


Figure 1: The annotation framework.

E.2 The Producer

Figure 2: The producer’s components and data flowchart.

Figure 2 shows how the producer annotates information into the ELF executable file. There are two ways to produce the annotated ELF executable file, according to the annotation source: static compiler analysis and profile analysis.

In the static-compiler-analysis path, the program source code is compiled by the modified Open64 compiler, which produces assembly code and annotation data. At the compilation phase, however, virtual addresses have not yet been assigned, and only function labels can be identified. Therefore, if the annotation data contains information about locations within a function, the function label serves as the base point, and the position of the annotation data is recorded as an offset from the function label. After the assembler turns the assembly code into an ELF executable file, the virtual addresses of functions and instructions are assigned. The annotation data and the corresponding virtual addresses are then combined by the annotation combiner to produce the annotated ELF executable file.

In the profile-analysis path, profile data is analyzed by the profile analyzer to produce useful annotation data. This annotation data is then combined by the annotation combiner, together with the annotation data from the static-compiler-analysis path, to produce the annotated ELF executable file.

Annotation information can be stored as a new section in the ELF file (for example, ".annotate"). This section can be loaded into memory after the text segment by setting the SHF_ALLOC flag in the section header for the .annotate section and adding the section to the program header table as a PT_LOAD entry. In order for the consumer (sub-project 2 or sub-project 3) to find the location of the .annotate section, the section-name string table should be loaded into memory too; modifications can be made to the ELF file to enable this. The consumer can then read the in-memory representation of the ELF headers and load the contents of the .annotate section.

E.3 The annotation granularity

There are four annotation data level, such as the program level, the procedural unit level, the

basic block level and instruction level, according to the range of the described information. The

instruction level needs the biggest storage space; it is because the annotation data has to record each

instruction’s information. Each instruction maps to a virtual address at runtime, therefore, the

annotation data will contain the instruction information and its corresponding virtual address. On the

other side, the program level needs the smallest storage space; it is because the described

information represents for the whole program and without to keep the corresponding virtual address.

Data at the program level describes information about the whole program. For example, the hot trace annotation marks a set of basic blocks that form a frequently executed path, called a hot trace, which is then treated as an optimization unit at runtime. The phase change annotation marks which program regions are hit frequently; when the program's execution path hits these regions more than a threshold number of times within a period at runtime, a phase change can be identified. The inter-procedure loop annotation records the relationships of loops across procedures and can help build hot traces efficiently.

Data at the procedure level describes information about each procedure in the program, such as the intra-procedure loop annotation, which records the loops inside a procedure. A loop is usually executed many times at runtime and is therefore often treated as an optimization unit.

Data at the basic block level describes information about each basic block in the program. Register usage is an annotation that labels the register usage of each basic block; this information can be used to identify free registers at runtime.

The data at the instruction level describes information about the instructions in the program, such as the memory reference annotation, which records each instruction's memory references. If the memory reference information is known precisely, the dynamic binary optimizer can further perform instruction rescheduling.
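The link between memory reference annotations and rescheduling can be sketched as an alias check: two memory instructions may be reordered only when their annotated references are known not to overlap. The region-name encoding here is hypothetical:

```python
# Sketch (assumed encoding): each memory instruction carries an
# annotation naming the memory region it touches. The optimizer may
# reorder two instructions only if their regions provably differ.

def may_alias(ann_a, ann_b):
    """Conservatively assume aliasing unless both annotations name
    distinct regions; a missing annotation forces the safe answer."""
    if ann_a is None or ann_b is None:
        return True
    return ann_a == ann_b

def can_reorder(insn_a, insn_b):
    """insn = (opcode, region_annotation); allow reordering only for
    instructions whose annotated regions cannot overlap."""
    return not may_alias(insn_a[1], insn_b[1])

assert can_reorder(("load", "stack_slot_1"), ("store", "global_x"))
assert not can_reorder(("load", "heap"), ("store", None))
```

The conservative default for missing annotations matters: without it, rescheduling on incomplete information could reorder a load past a store to the same address.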

E.4 The Consumer

Page 11: C012-2-Sub1.doc.doc

Figure 3: Annotation framework’s position in the virtualization system.

Figure 3 shows this sub-project's position in the whole virtualization system (red parts). The goal of this sub-project is to help sub-project 2 perform dynamic binary translation efficiently and to help sub-project 3 perform advanced optimizations; sub-project 2 and sub-project 3 are therefore the consumers of this sub-project. The annotation data is first attached to the ELF executable file and then read by sub-project 2. When sub-project 2 performs dynamic binary translation, the address ranges described by the annotation data are affected, because the memory layout of the guest machine is changed to the memory layout of the host machine. Therefore, the annotation data must be adjusted while sub-project 2 performs dynamic binary translation so that the dynamic binary optimizer can read the correct annotation data.
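The adjustment step described above can be sketched as an address rewrite: guest virtual addresses in the annotation data are mapped to the host addresses chosen by the translator. The mapping table and record shape here are hypothetical:

```python
# Sketch of the annotation adjustment: when the translator relocates a
# guest block, guest virtual addresses keyed in the annotation data must
# be rewritten to host code-cache addresses so that the dynamic binary
# optimizer reads the right records. guest_to_host is an assumed table
# produced by the translator.

def adjust_annotations(annotations, guest_to_host):
    """Rewrite each {guest_addr: info} record to use host addresses.

    Records whose guest address has no translation yet are dropped,
    since the optimizer can only consume annotations for translated
    code."""
    return {guest_to_host[g]: info
            for g, info in annotations.items() if g in guest_to_host}

guest_ann = {0x400100: "hot", 0x400200: "cold"}
mapping = {0x400100: 0x7F0000001000}  # only the hot block is translated
assert adjust_annotations(guest_ann, mapping) == {0x7F0000001000: "hot"}
```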

F. Work Plan

The first year has three goals. The first is to exploit useful annotations from static compiler analysis to improve the efficiency of the dynamic binary translator. The second is to exploit beneficial annotations from static compiler analysis to help the dynamic binary optimizer perform advanced optimizations. To achieve these two goals, we must explicitly understand the benchmarks' runtime behavior as handled by the dynamic binary translator and the dynamic binary optimizer. The last goal is to modify the compiler to produce the annotation data identified in the previous two goals.

The second year has three goals. The first is to design a guest binary annotation encoding format for the dynamic binary translator. The second is to implement, in the dynamic binary translator, the annotations identified in the first year. The last is to exploit useful annotations from the profile data produced by the dynamic binary translator and feed them back to the dynamic binary translator.

The third year has three goals. The first is to translate the guest binary annotation encoding format into the host annotation format for the dynamic binary optimizer. The second is to implement, in the dynamic binary optimizer, the annotations identified earlier. The last is to exploit useful annotations from the profile data produced by the dynamic binary optimizer and feed them back to the dynamic binary optimizer.

G. Milestones and Deliverables

First Year

1. Exploiting various annotations from static compiler analysis to help binary translation and binary optimization, and estimating the performance gain and overhead of using these annotations.

2. Modifying the open64 compiler to produce the corresponding annotations and estimating the size of the annotations.

Second Year

1. Designing the annotation format that the binary translator consumes and attaching the annotations to the ELF file.

2. Equipping the binary translator with the annotation schemes identified in the first year and evaluating the performance gain and overhead when the dynamic translator utilizes the annotations.

3. Exploiting various annotations that help binary translation from the profiles produced by the binary translator and implementing these annotation schemes in the binary translator.

Third Year

1. Transforming the annotation format from the binary translator's format to the binary optimizer's format.

2. Equipping the binary optimizer with the annotation schemes identified in the first year and evaluating the performance gain and overhead when the dynamic optimizer utilizes the annotations.

3. Exploiting various annotations that help binary optimization from the profiles produced by the binary optimizer and implementing these annotation schemes in the binary optimizer.

4. Evaluating the performance gain and overhead for the virtualization system when using the

annotation framework.
