mainframe assembler translation


Relogix: Converting assembler to high quality C

MicroAPL Ltd

Simon Marsden, MicroAPL Ltd

22nd September 2010


Introduction

MicroAPL’s Relogix translator is designed to convert programs written in IBM mainframe assembly language to C.

Although there have been other attempts at doing this, some of them are fairly simplistic. What distinguishes Relogix is that it produces high quality C code. Our aim is to produce C code of a standard that’s close to what a human programmer would write: readable, easy to understand and easy to maintain.

To achieve this it’s necessary to think carefully about how the original program should be represented in C. If you model the behaviour of the processor too closely you will get a translation that’s slow and unreadable. On the other hand, the author of the assembler code may have used some quite subtle coding tricks, so a very detailed analysis of the program is necessary.

This document explores some of the techniques used by Relogix to achieve high-quality code. Along the way we’ll look at how not to convert assembler code to C, and how choosing the right level of abstraction brings benefits in both readability and performance.

Representation of Registers and Conditions

The IBM 370 mainframe architecture has 16 user-programmable General Purpose Registers, R0-R15. The instruction set is very rich, with both register- and memory-based operands, and most instructions set the condition code.

For a simple example, consider the following register-to-register add instruction:

             AR    R1,R2

This instruction adds register R2 to R1, but also sets the condition code as follows:

    Condition code   Meaning
    0                Result zero; no overflow
    1                Result less than zero; no overflow
    2                Result greater than zero; no overflow
    3                Overflow
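To make the table concrete, here is a minimal sketch of a C function that performs a 32-bit signed add and returns the condition code. The name add_with_cc is hypothetical and is not part of Relogix; the arithmetic is done on unsigned values because signed overflow is undefined behaviour in C.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not part of Relogix): add two 32-bit signed values
 * the way AR does, store the wrapped sum in *result, and return the
 * condition code from the table above.
 */
int add_with_cc(int32_t a, int32_t b, int32_t *result)
{
    uint32_t ua = (uint32_t)a;
    uint32_t ub = (uint32_t)b;
    uint32_t usum = ua + ub;        /* unsigned add wraps; no undefined behaviour */

    /* Overflow: both operands have the same sign and the sum's sign differs. */
    int overflow = (int)(((~(ua ^ ub)) & (ua ^ usum)) >> 31);

    /* Convert the wrapped bit pattern back to a signed value portably. */
    if (usum <= (uint32_t)INT32_MAX)
        *result = (int32_t)usum;
    else
        *result = (int32_t)(usum - 0x80000000u) + INT32_MIN;

    if (overflow)
        return 3;
    if (*result == 0)
        return 0;
    return (*result < 0) ? 1 : 2;
}
```

A caller would write something like `cc = add_with_cc(r1, r2, &r1);`. Doing this for every add is exactly the cost that the rest of this document tries to avoid.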

How not to do it!

First let’s look at how not to translate this instruction to C. Some simplistic translators model the general purpose registers in a C global variable similar to the following:

    struct {
        long r0;
        long r1;
        long r2;
        ...
        long r15;
    } registers;

The translation of the AR instruction is then something like:

    long temp;

    temp = registers.r1;
    registers.r1 += registers.r2;
    if ( (temp > 0 && registers.r2 > 0 && registers.r1 <= 0) ||
         (temp < 0 && registers.r2 < 0 && registers.r1 >= 0))
        condition = 3;
    else if (registers.r1 > 0)
        condition = 2;
    else if (registers.r1 < 0)
        condition = 1;
    else
        condition = 0;

Even if you use a C pre-processor macro like AR(r1,r2) to hide the implementation details, this code is ugly and inefficient, and does not lead to a translation that’s easy to understand or maintain.

Representation of registers in Relogix-translated code

By contrast, Relogix maps the general purpose registers onto normal C variables. These are not global, but rather are ordinary local variables within a subroutine (or subroutine arguments if the register is passed into a routine as a parameter).

In addition, even within a subroutine a particular register like R1 is not always modelled by the same local variable. Instead Relogix performs variable ‘lifetime’ analysis: it looks at how R1 is used within the subroutine. In the following example there are two different lives, the second load instruction completely overwriting the results of the first with new data:

             L     R1,MYPTR
             STC   R3,0(R1)
             L     R1,OFFSET     << This use of R1 is unconnected to the last
             AR    R1,R2
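The idea of splitting a register into separate lives can be sketched in a few lines of C. This is only an illustration under strong assumptions: it handles straight-line code with no branches, the struct insn layout and the function number_lives are invented for the example, and it is not a description of how Relogix itself is implemented.

```c
#include <stddef.h>

enum { NUM_REGS = 16 };

/* Hypothetical, simplified view of one instruction's effect on one register. */
struct insn {
    int reg;     /* register of interest, 0..15 */
    int reads;   /* nonzero if the instruction reads the old value of reg */
    int writes;  /* nonzero if the instruction writes reg */
};

/*
 * Straight-line code only: give every instruction the number of the "life"
 * of the register it touches. A write that does not read the old value
 * (a plain load, for example) starts a new life.
 */
void number_lives(const struct insn *code, size_t n, int *life_out)
{
    int current[NUM_REGS] = {0};

    for (size_t i = 0; i < n; i++) {
        int r = code[i].reg;
        if (code[i].writes && !code[i].reads)
            current[r]++;            /* old value is dead: new life begins */
        life_out[i] = current[r];
    }
}
```

Applied to the four instructions above (viewed from R1), the two loads start lives 1 and 2, the STC belongs to life 1 and the AR to life 2. A real analysis also has to follow branches and merge lives where paths join.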

One important consequence is that Relogix can allocate proper types to the local variables representing registers, as shown in the following example (ignore for a moment the clumsy names given to the local variables; they’re shown this way to make the example clear, but Relogix normally chooses more meaningful names):

    char *r1_1;      /* First version of R1 has type: char * */
    char r3;
    long r1_2;       /* Second version of R1 has type: long */
    long r2;
    ...
    r1_1 = myptr;
    *r1_1 = r3;
    r1_2 = offset;
    r1_2 += r2;

Variable types are chosen by Relogix after a deep inspection of the code, examining all the ways in which a variable is used. Does it seem to be a signed quantity or an unsigned quantity? A pointer? What sort of pointer? Relogix is able to choose sensible variable types, even discovering structures and unions.

A second consequence of modelling registers as local variables is that Relogix can perform intermediate variable elimination, so that the example above becomes:

    char r3;
    long r1_2;
    long r2;
    ...
    *myptr = r3;
    r1_2 = offset + r2;

Representation of conditions in Relogix-translated code

We saw how the simplistic translation of the AR instruction faithfully reproduced its effect on the condition code. The translation is accurate but inefficient. In effect this approach is close to writing an IBM 370 instruction-set simulator, just without the overhead of interpreting the opcodes at runtime.

Relogix takes a different approach. It analyses what comes after the AR instruction. Does the program actually check the conditions? If so, which ones?

For example, the AR instruction might occur in a sequence like this:

             AR    R1,R2
             A     R1,8(R4)

In the example above the programmer doesn’t care about how the AR instruction sets the condition code, since this is immediately overwritten by the instruction following. In this case the translation of AR can just be:

    r1 += r2;

Or maybe the programmer tests whether the result of the instruction is negative:

             AR    R1,R2
             BM    MINUS
             LA    R3,1
             B     DONE
    MINUS    LA    R3,-1
    DONE     DS    0H

In this case, the Relogix translation might be:

    r1 += r2;
    if (r1 >= 0)
        r3 = 1;
    else
        r3 = -1;

If the final value in R1 is not needed again, the translation might even be:

    if (r1 + r2 >= 0)
        r3 = 1;
    else
        r3 = -1;

The key points here are:

- The code matches what a human programmer might write, apart from the unusual choice of variable names. In fact, Relogix can choose better names, a fact which we will explore in a subsequent section.

- The code is efficient: it doesn’t reproduce unwanted side-effects from the original IBM 370 instruction.

- In order to determine the best translation of an instruction it is necessary to look at it in context, analysing what comes before and what comes after. Relogix makes heavy use of recursive analysis techniques to perform a detailed investigation of the code.
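The question "is the condition code set by this instruction ever tested?" can be illustrated with a small forward scan. The sketch below is an assumption-laden simplification: it only handles straight-line code, the struct cc_insn and the function cc_is_needed are made-up names, and Relogix's real analysis is described above only as recursive and far more detailed.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of how one instruction interacts with the condition code. */
struct cc_insn {
    bool sets_cc;   /* instruction overwrites the condition code */
    bool tests_cc;  /* instruction (e.g. a conditional branch) reads it */
};

/*
 * Straight-line code only: report whether the condition code produced by
 * instruction i is read before another instruction overwrites it.
 */
bool cc_is_needed(const struct cc_insn *code, size_t n, size_t i)
{
    for (size_t j = i + 1; j < n; j++) {
        if (code[j].tests_cc)
            return true;     /* somebody looks at the condition code */
        if (code[j].sets_cc)
            return false;    /* overwritten before anyone looked */
    }
    return true;             /* end of known code: stay conservative */
}
```

For the sequence AR followed by A the function reports false for the AR, so a plain `r1 += r2;` is enough; for AR followed by BM it reports true, and the translator must express the tested condition.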

Unravelling Spaghetti: ‘Go To Statement Considered Harmful’

Edsger Dijkstra famously wrote a paper in 1968 called ‘Go To Statement Considered Harmful’, in which he argues that the go to statement is ‘an invitation to make a mess of one’s program.’

As assembly-language programmers we’re used to dealing with this - unconditional and conditional branches, subroutine calls and returns are pretty much all we have. However, no good C programmer would make excessive use of the goto statement.

Relogix is able to analyse the assembler program and recover a higher-level structure. In particular it can detect the following C flow-control constructs:

- if...else if...else statements
- do...while loops
- while loops
- for loops
- switch statements
- subroutine calls and returns

This includes handling more complex ‘spaghetti’, like code which jumps into the middle of a loop, code which jumps out of a loop to more than one exit point, and code which manipulates a return address so that a subroutine doesn’t return cleanly to its caller.

As an example of flow recovery, consider the following piece of assembler code which uses a typical jump-table idiom found in 370 assembler:

             B     *+4(R3)
             B     L1
             B     L2
             B     L3
    L1       LA    R1,=C'ANIMAL'       TYPE IS ANIMAL
             B     DONE
    L2       LA    R1,=C'VEGETABLE'    TYPE IS VEGETABLE
             B     DONE
    L3       LA    R1,=C'MINERAL'      TYPE IS MINERAL
    DONE     DS    0H

A bad translation of this might be as follows (Relogix never produces this):

        if (r3 == 0)
            goto L1;
        if (r3 == 4)
            goto L2;
        if (r3 == 8)
            goto L3;
        _rlx_flow_error_trap ();    /* (Does not return) */
    L1: r1 = "ANIMAL";
        goto DONE;
    L2: r1 = "VEGETABLE";
        goto DONE;
    L3: r1 = "MINERAL";
    DONE:

As a first step, Relogix will convert this sequence into the following (notice also that the comments in the assembler code are carried over into the C code):

    if (r3 == 0)
        r1 = "ANIMAL";              /* type is animal */
    else if (r3 == 4)
        r1 = "VEGETABLE";           /* type is vegetable */
    else if (r3 == 8)
        r1 = "MINERAL";             /* type is mineral */
    else
        _rlx_flow_error_trap ();    /* (Does not return) */

However, if there are enough cases to make it worthwhile, Relogix will convert this into a switch statement:

    switch (r3) {
    case 0:
        r1 = "ANIMAL";              /* type is animal */
        break;
    case 4:
        r1 = "VEGETABLE";           /* type is vegetable */
        break;
    case 8:
        r1 = "MINERAL";             /* type is mineral */
        break;
    default:
        _rlx_flow_error_trap ();    /* (Does not return) */
        break;
    }

Note that Relogix adds a call to a routine named _rlx_flow_error_trap which is called in the event that an unexpected value is passed in R3. This is just a safety feature which helps in catching programming errors. The implementation of _rlx_flow_error_trap typically prints an error message and aborts the program.
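For readers who want to try the switch form, the following is a self-contained sketch. The function type_name and the trap routine flow_error_trap are stand-ins invented for this example; the real _rlx_flow_error_trap is supplied by Relogix and its exact behaviour is only summarised above.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for _rlx_flow_error_trap: report and abort. */
static void flow_error_trap(void)
{
    fprintf(stderr, "unexpected jump-table index\n");
    abort();
}

/* The jump table from the example, wrapped in a function for testing. */
const char *type_name(long r3)
{
    switch (r3) {
    case 0:
        return "ANIMAL";        /* type is animal */
    case 4:
        return "VEGETABLE";     /* type is vegetable */
    case 8:
        return "MINERAL";       /* type is mineral */
    default:
        flow_error_trap();      /* does not return */
        return NULL;            /* not reached; keeps the compiler happy */
    }
}
```

The case values are 0, 4 and 8 rather than 0, 1 and 2 because each entry in the original branch table is a 4-byte B instruction.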

Better Variable Names

In the examples so far we have used variable names like r3 to make it clear which registers the variables represent. To get closer to the goal of producing C code that a human programmer might have written, Relogix needs to choose better variable names.

Relogix includes a module known as the name manager which takes care of this. The name manager uses a number of techniques to choose a suitable name for each variable:

- The name manager inspects the way that a variable is used. If r1 is used in an example like this:

      *r1 = 10;

  ...then it is some kind of pointer. A generic name like ptr might be suitable. Similarly a variable used as a counter in a for loop might be called i.

- If a pointer variable is initialised to point to a global data item, we can improve on the name. For example, instead of

      ptr = &date;

  ...we could use the name date_ptr.

- Similarly if a variable is loaded from a named structure field, we can get a useful name:

      elapsed_time = performance.elapsed_time;

  And if it’s a pointer to a structure of type time, Relogix might choose a name like time_ptr.

- Another useful source of variable names is the comments. For example if we see the assembler statement:

               XR    R1,R1             CLEAR THE TOTAL

  ...then a good name for the variable which represents r1 might be total.

- The names of variables can also be specified explicitly by the Relogix user.

To make it easy to relate the translated C code back to the original assembler code, the original location of each variable is included in a comment when the variable is declared, e.g.

    unsigned long total;    /* [Originally in R1] */
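As a toy illustration of the comment-based technique, the sketch below takes the last word of an assembler comment and lower-cases it to form a candidate identifier. This heuristic and the function name name_from_comment are purely hypothetical; the document does not say how the name manager actually extracts names from comments.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical heuristic: use the last word of a comment, lower-cased, as a
 * candidate variable name. Returns 1 and fills 'out' on success, 0 if no
 * usable word was found or the buffer is too small.
 */
int name_from_comment(const char *comment, char *out, size_t outsz)
{
    size_t end = strlen(comment);

    /* Skip trailing punctuation and blanks. */
    while (end > 0 && !isalnum((unsigned char)comment[end - 1]))
        end--;

    /* Walk back to the start of the last word. */
    size_t start = end;
    while (start > 0 && isalnum((unsigned char)comment[start - 1]))
        start--;

    size_t len = end - start;
    if (len == 0 || len >= outsz || isdigit((unsigned char)comment[start]))
        return 0;    /* nothing usable, too long, or not a valid identifier */

    for (size_t i = 0; i < len; i++)
        out[i] = (char)tolower((unsigned char)comment[start + i]);
    out[len] = '\0';
    return 1;
}
```

Given the comment CLEAR THE TOTAL this yields the candidate name total. A practical name manager would need to weigh many candidates and avoid clashes with C keywords and existing names.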

Data Types

Relogix performs a detailed analysis of the way that variables are used in order to determine their types. For example in the following code R1 is tested to check whether it’s negative. This indicates that it’s a signed value, and so by inference PROFIT is signed too:

             L     R1,PROFIT
             LTR   R1,R1
             BM    MADE_A_LOSS

The translation might be something like:

    long profit;

    if (profit < 0)
        ...

In the example below, R2 is used in a logical shift right operation. It’s probably an unsigned value, and R3 is a pointer to an unsigned value:

             L     R2,0(R3)
             SRL   R2,8

In this case the translation might be:

    unsigned long r2;
    unsigned long *r3;

    r2 = *r3 >> 8;
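The reason signedness matters for shifts can be seen in a short sketch. The functions srl and sra below are illustrative names only. In C, right-shifting an unsigned value is always zero-filled, which matches SRL, whereas right-shifting a negative signed value is implementation-defined, so the arithmetic shift is written out explicitly here.

```c
#include <stdint.h>

/* Logical shift right (like SRL): vacated bits are filled with zeros. */
uint32_t srl(uint32_t value, unsigned count)
{
    count &= 63;                       /* SRL uses a 6-bit shift amount */
    return (count > 31) ? 0 : (value >> count);
}

/* Arithmetic shift right (like SRA): the sign bit is propagated. */
int32_t sra(int32_t value, unsigned count)
{
    count &= 63;
    if (count > 31)
        count = 31;
    /* Shift the complement of a negative value so that only non-negative
       values are ever shifted, which is fully defined in C. */
    return (value < 0) ? ~(~value >> count) : (value >> count);
}
```

With the same bit pattern X'80000000', srl gives X'00800000' for a shift of 8 while sra gives a negative result. Choosing unsigned long for r2 therefore lets the translator emit a plain >> operator.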

Type analysis also allows Relogix to recover C structures and unions from the code, as in this example:

    ADDSALES CSECT
    REGIONS  EQU   5
             USING SALES,R3
             XR    R1,R1
             LA    R2,REGIONS
    LOOP     CLI   TYPE,'A'
             BNZ   SKIP
             A     R1,VALUE
    SKIP     LA    R3,SIZE(R3)
             BCT   R2,LOOP
             ST    R1,TOTAL
             BR    R14
    TOTAL    DC    A(0)
    SALES    DSECT
    VALUE    DS    A
    TYPE     DS    CL1
    SIZE     EQU   *-SALES

In this example the DSECT declaration converts to a C structure of the following type:

    struct sales {
        long value;
        char type;
    };

The converted assembler code is shown below (notice that it’s close to what a human programmer might write):

    #define REGIONS 5

    /* Private file-scope variables */
    static long total = 0;

    /*
     ***************************************************************
     * addsales                                                    *
     ***************************************************************
     *
     * Parameters:
     *
     *   struct sales *sales_ptr   [Originally in r3; In]
     */
    void addsales (struct sales *sales_ptr)
    {
        long i;     /* [Originally in r2] */
        long v;     /* [Originally in r1] */

        v = 0;
        for (i = 0; i < REGIONS; i++) {
            if (sales_ptr->type == 'A')
                v += sales_ptr->value;
            sales_ptr++;
        }
        total = v;
    }
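To experiment with the translated routine, the pieces above can be combined into one compilable unit. The sketch below repeats the structure and function from the document and adds a made-up helper, sample_total, with invented sample data.

```c
#define REGIONS 5

struct sales {
    long value;
    char type;
};

/* Private file-scope variable, as in the translated code. */
static long total = 0;

void addsales(struct sales *sales_ptr)
{
    long i;     /* [Originally in r2] */
    long v;     /* [Originally in r1] */

    v = 0;
    for (i = 0; i < REGIONS; i++) {
        if (sales_ptr->type == 'A')
            v += sales_ptr->value;
        sales_ptr++;
    }
    total = v;
}

/* Hypothetical driver: sum the 'A' entries of some invented sample data. */
long sample_total(void)
{
    struct sales regions[REGIONS] = {
        {100, 'A'}, {250, 'B'}, {40, 'A'}, {7, 'C'}, {3, 'A'}
    };

    addsales(regions);
    return total;       /* 100 + 40 + 3 */
}
```

One point worth keeping in mind: the DSECT is 5 bytes long (SIZE EQU *-SALES), while a C compiler will normally pad struct sales to a larger size. That is harmless when the C program also creates the data, but it would matter if the records came from an external file laid out by the original assembler program.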

All the examples shown above are produced by Relogix automatically, without any human intervention. However in all cases the type solving system can also be guided by supplementary information provided by the user.

Self-Modifying Code

The technique of modifying instructions at runtime is very often used in IBM mainframe assembler applications. In fact it seems to be far more frequent than on other processors which we’ve seen.

The following is a typical example:

             LTR   R1,R1
             BZ    LABEL
             MVI   LABEL+1,C'Y'
    LABEL    CLI   0(R2),C'X'

Viewed in isolation the CLI instruction seems unambiguous - it’s just comparing the memory location pointed to by R2 with the immediate value 'X'. A simple translator might mistakenly convert the instruction to something like this:

    if (*r2 == 'X')
        ...

However, it’s necessary to look at the whole code. The MVI instruction is modifying the CLI instruction itself: it patches the second byte of the instruction, which holds the immediate operand, overwriting the 'X' with a 'Y'. Relogix is able to detect this case and it produces the following translation:

    static unsigned char immediate = 'X';

    if (r1 != 0)
        immediate = 'Y';
    if (*r2 == immediate)
        ...
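The choice of a static variable in this translation is significant, and a small sketch makes the point. The wrapper function compare_byte is invented for the illustration; the body follows the translation above.

```c
/* The patched immediate operand, modelled as persistent state. */
static unsigned char immediate = 'X';

/* Hypothetical wrapper around the translated fragment. */
int compare_byte(long r1, const unsigned char *r2)
{
    if (r1 != 0)
        immediate = 'Y';        /* corresponds to MVI LABEL+1,C'Y' */
    return *r2 == immediate;    /* corresponds to CLI 0(R2),<patched byte> */
}
```

Once the instruction has been patched, it stays patched: a later call with r1 equal to zero still compares against 'Y'. A local, non-static variable would silently lose that behaviour.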

Other examples of self-modifying code include:

- Branch conditions modified at runtime so that a branch is either taken or not taken:

      LABEL    BC    0,TARGET
               OI    LABEL+1,X'F0'

- String lengths modified at runtime:

               STC   R1,*+5
               MVC   WORKB(0),0(R3)

- Operand offsets modified at runtime (displacement DISP is modified in the following example):

               SR    R3,R4             Calculate displacement
               A     R3,=X'00009000'   OR in the R9 field
               STH   R3,LABEL+4        Poke the instruction
      LABEL    MVC   RESULT,DISP(R9)
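As a rough idea of what the first two cases could look like in C, here is a sketch. It is one plausible rendering, not confirmed Relogix output, and the names branch_enabled, enable_branch, branch_taken and copy_patched_length are all invented. It relies on two facts about the instruction encodings: the byte at LABEL+1 of a BC holds the branch mask in its high four bits, and the length byte of an MVC holds the number of bytes to move minus one.

```c
#include <stddef.h>
#include <string.h>

/* Models the mask of "LABEL BC 0,TARGET": a mask of 0 never branches. */
static int branch_enabled = 0;

/* Models "OI LABEL+1,X'F0'": mask 15 makes the branch unconditional. */
void enable_branch(void)
{
    branch_enabled = 1;
}

/* Models executing the BC: nonzero means control goes to TARGET. */
int branch_taken(void)
{
    return branch_enabled;
}

/*
 * Models "STC R1,*+5" followed by "MVC WORKB(0),0(R3)": the low byte of R1
 * is stored into the MVC length field, so (r1 & 0xFF) + 1 bytes are moved.
 * The caller must make sure both buffers are large enough and do not overlap.
 */
void copy_patched_length(unsigned char *workb, const unsigned char *r3,
                         unsigned long r1)
{
    size_t length = (size_t)(r1 & 0xFF) + 1;

    memcpy(workb, r3, length);
}
```

In both cases the modified instruction byte becomes an ordinary variable, and the instruction that used it becomes ordinary C that reads the variable.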

Many simple cases of this kind are handled automatically. By analysing the whole of the program, Relogix chooses a translation which reproduces the original behaviour.

Sometimes a program will actually generate whole sequences of instructions at runtime. For example to get maximum performance it might compile an in-house query language into machine code which it then executes. Although Relogix can detect and warn about such cases it doesn’t attempt to translate them automatically; they typically require rewriting by hand.

If your application generates code at runtime please contact MicroAPL for advice. We have wide experience in handling this type of problem. One approach which we’ve used successfully on several past projects is to change the application to generate pseudo-code, which is then executed using a simple pseudo-code interpreter.
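To give a flavour of the pseudo-code approach, the sketch below shows a tiny stack-based interpreter. Everything in it is hypothetical: the opcode set, the struct pinsn layout and the function run_pseudo_code are invented for illustration and say nothing about the interpreters MicroAPL has actually built.

```c
#include <stddef.h>

/* Invented pseudo-code instruction set for the illustration. */
enum opcode { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

struct pinsn {
    enum opcode op;
    long operand;       /* used by OP_PUSH only */
};

#define STACK_MAX 32

/*
 * Execute a pseudo-code program. Returns 0 and stores the final value in
 * *result on success, or -1 if the program is malformed.
 */
int run_pseudo_code(const struct pinsn *prog, size_t n, long *result)
{
    long stack[STACK_MAX];
    size_t sp = 0;

    for (size_t pc = 0; pc < n; pc++) {
        switch (prog[pc].op) {
        case OP_PUSH:
            if (sp == STACK_MAX)
                return -1;                      /* stack overflow */
            stack[sp++] = prog[pc].operand;
            break;
        case OP_ADD:
        case OP_MUL:
            if (sp < 2)
                return -1;                      /* not enough operands */
            sp--;
            if (prog[pc].op == OP_ADD)
                stack[sp - 1] += stack[sp];
            else
                stack[sp - 1] *= stack[sp];
            break;
        case OP_HALT:
            if (sp != 1)
                return -1;                      /* exactly one result expected */
            *result = stack[0];
            return 0;
        default:
            return -1;                          /* unknown opcode */
        }
    }
    return -1;                                  /* program ran off the end */
}
```

Instead of emitting machine instructions for a query such as (2 + 3) * 4, the application would emit PUSH 2, PUSH 3, ADD, PUSH 4, MUL, HALT and hand that array to the interpreter.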

Function calls and parameter passing

Relogix typically uses a one-to-one mapping such that each subroutine in the original assembler code becomes a function of the same name in the C code.

There are two parameter-passing techniques commonly used in mainframe assembler code:

- Parameters can be passed in registers
- Parameters can be passed in a parameter block pointed to by the R1 register

Alternatively a subroutine may take an R1 parameter block and additional parameters in registers.

Parameters in Registers

The case where parameters are passed in registers is simple to handle: each register becomes a separate parameter to the subroutine. Consider the following assembler code example:

             LA    R3,=C'THE TOTAL IS: '
             L     R4,SUM
             BAL   R14,PRINTVAL

In this case the translation is straightforward:

    char *r3;
    long r4;

    r3 = "THE TOTAL IS: ";
    r4 = sum;
    printval (r3, r4);

...which, after intermediate variable elimination, is simply:

    printval ("THE TOTAL IS: ", sum);

If the subroutine only returns a single register, that becomes the explicit result of the function. Additional results are handled by passing values by pointer, just as a human programmer would write:

    result = myfunction (&res2);
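A minimal sketch of that pattern follows. The function divide is a made-up example of a routine with two results, chosen because division naturally yields both a quotient and a remainder.

```c
/*
 * Hypothetical routine with two results: the quotient is the explicit
 * return value and the remainder is passed back through a pointer.
 * The caller must not pass a zero divisor.
 */
long divide(long dividend, long divisor, long *remainder)
{
    *remainder = dividend % divisor;
    return dividend / divisor;
}
```

A call then reads `quotient = divide (17, 5, &rem);`, which has the same shape as the myfunction call above.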

Parameters in a Parameter Block pointed to by R1

A very common technique in mainframe assembler code is to pass subroutine parameters in a parameter block. Typically the parameter block is declared in-line, and filled in with pointers to all the parameters, and then the parameter block address is passed in register R1:

             BAL   R1,*+12             Branch around in-line param block
             DC    2A(0)
             LA    R2,=C'THE TOTAL IS: '
             ST    R2,0(R1)
             LA    R2,SUM
             ST    R2,4(R1)
             BAL   R14,PRINTVAL

What should the C translation be in this case? One approach would be a very literal translation, but it would produce very unnatural C code:

    void * param_block [2];
    void **r1;
    char *r2_1;
    long *r2_2;

    r1 = &param_block[0];       // BAL R1,*+12
    r2_1 = "THE TOTAL IS: ";    // LA R2,=C'THE TOTAL IS: '
    r1 [0] = r2_1;              // ST R2,0(R1)
    r2_2 = &sum;                // LA R2,SUM
    r1 [1] = r2_2;              // ST R2,4(R1)
    printval (r1);              // BAL R14,PRINTVAL

...which, after intermediate variable elimination, becomes:

    void * param_block [2];

    param_block [0] = "THE TOTAL IS: ";
    param_block [1] = &sum;
    printval (param_block);

Instead, Relogix models each entry in the parameter block as a separate argument to the subroutine:

    char *param1;
    long *param2;
    char *r2_1;
    long *r2_2;

    r2_1 = "THE TOTAL IS: ";       // LA R2,=C'THE TOTAL IS: '
    param1 = r2_1;                 // ST R2,0(R1)
    r2_2 = &sum;                   // LA R2,SUM
    param2 = r2_2;                 // ST R2,4(R1)
    printval (param1, param2);     // BAL R14,PRINTVAL

...which, after intermediate variable elimination, is just:

    printval ("THE TOTAL IS: ", &sum);

Once again the best code is obtained by choosing the appropriate level of abstraction. The goal of Relogix is to separate what the code really does - pass some parameters into a subroutine - from the assembler-specific way that it does this.
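The difference between the two styles shows up on the callee side as well. The sketch below contrasts them; both functions are invented for this illustration (the document does not show the body of PRINTVAL), and they format into a caller-supplied buffer rather than printing so that the result is easy to check.

```c
#include <stddef.h>
#include <stdio.h>

/* Literal style: the callee has to unpick an untyped parameter block. */
int printval_block(void **params, char *out, size_t outsz)
{
    const char *label = params[0];
    const long *value = params[1];

    return snprintf(out, outsz, "%s%ld", label, *value);
}

/* Idiomatic style: each parameter-block entry is a typed argument. */
int printval(const char *label, const long *value, char *out, size_t outsz)
{
    return snprintf(out, outsz, "%s%ld", label, *value);
}
```

With separate typed arguments the compiler can check every call, whereas the void * block defeats type checking completely.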

Dynamic loading of modules

As a final example, consider the treatment of dynamically loaded modules.

In mainframe assembler code it’s typical to write one major function per file. Each file is separately assembled, but unlike a typical C program the object modules are not linked together. Instead code which wants to call a function in a different module must dynamically load the module during program execution:

    MOD1     CSECT
             LOAD  EP=MOD2             Dynamically load module
             ST    R0,MOD2ADR          Save entry point
             ...
             L     R15,MOD2ADR         Get MOD2 entry point
             BALR  R14,R15             and call function
             ...
    MOD2ADR  DS    A

In most cases the C version of the application will be built from source files which are compiled separately, but with the resulting object files then linked together into a single executable.

By default, Relogix detects sequences involving the LOAD call and tracks which module is loaded and where the module address is stored. When converting this code to C, Relogix strips out the assembler-specific detail and substitutes a normal external function call. The LOAD instruction, the code which stores its entry address, and the variable used to store the address are not required:

    extern void mod2 (void);
    ...
    mod2 ();

Note that in this example the external function takes no parameters and returns no result. However this is just to make the example assembler code simple. In general Relogix performs detailed cross-module analysis of the code to determine the parameters and results of each function.

Although the default action of Relogix is to strip out the dynamic loading of an external module and substitute a simple function call, there are times when this is not appropriate. Sometimes the application genuinely needs to load external modules at runtime. To accommodate this, Relogix can optionally convert the code to use a dynamic loading technique in the target environment.
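What a dynamic loading technique looks like depends on the target platform, and the document does not say what code Relogix generates for it. As a portable illustration of the idea only, the sketch below keeps the stored entry point as a function pointer and looks it up by name in a table. The names load_module, modules and call_mod2_dynamically are invented; on a POSIX system the table lookup could instead be done with dlopen and dlsym.

```c
#include <stddef.h>
#include <string.h>

typedef void (*entry_point)(void);

/* Stand-in for the separately built module MOD2. */
static int mod2_calls = 0;

static void mod2(void)
{
    mod2_calls++;
}

/* Hypothetical registry mapping module names to entry points. */
struct module {
    const char *name;
    entry_point entry;
};

static const struct module modules[] = {
    { "MOD2", mod2 },
};

/* Plays the role of LOAD EP=name: returns the entry point, or NULL. */
entry_point load_module(const char *name)
{
    for (size_t i = 0; i < sizeof modules / sizeof modules[0]; i++) {
        if (strcmp(modules[i].name, name) == 0)
            return modules[i].entry;
    }
    return NULL;
}

/* Mirrors the assembler: load, save the entry point, then call through it. */
int call_mod2_dynamically(void)
{
    entry_point mod2adr = load_module("MOD2");   /* LOAD + ST R0,MOD2ADR */

    if (mod2adr == NULL)
        return -1;
    mod2adr();                                   /* L R15,MOD2ADR + BALR */
    return mod2_calls;
}
```

The function pointer corresponds directly to MOD2ADR, which is why this form is only worth keeping when the module really must be chosen at runtime.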

Conclusion

If you have a small assembler application it’s possible to re-code it in C by hand, but for larger applications this rapidly becomes unthinkable. You will spend years rewriting the code and probably introduce many bugs along the way.

To quickly get the code working you need to consider some form of automatic translation. Reliably converting assembler code to C is hard, even for a human. Producing C code that’s easy to read and easy to maintain is harder, and not possible using a simplistic approach to automatic translation. However a tool like Relogix which performs detailed analysis of the code can produce remarkable results.

For more information about Relogix please visit our website:

http://www.microapl.co.uk/asm2c/index.html

About MicroAPL

MicroAPL was founded in 1979, and developed one of the world's first multi-user microcomputers.

Since 1990 the company has concentrated on the translation of assembly language, working for major clients such as Apple Computer, EMC, Motorola/Freescale Semiconductor, Novell, Schneider, Philips, DaimlerChrysler, Nortel, Alcatel, and many others.

Our first automated translation tool, PortAsm, translated code from one architecture to assembler of another architecture. However, the main emphasis since 2003 has been on translation to C, using our Relogix translation tool which builds on the considerable practical experience we had built up in assembler-to-assembler translation.