Information and Software Technology 1995 37 (3) 165-175

Domain data typing

David Livingstone Department of Computing, University of Northumbria at Newcastle, Newcastle upon Tyne NE1 8ST, UK

A form of data typing for imperative languages is described which is based on the idea of a domain of permissible values that a programming language variable can logically take. A particularly novel aspect is that instead of using a data typing sub-language, the computational part of a programming language is used in conjunction with a new type-assignment operation, in an equivalent fashion to conventional value assignment. This considerably increases the power of the language with only a negligible increase in language size and complexity. The data typing has been implemented and tested in APL.

Keywords: domains, type assignment, data typing, types

Introduction

There are various views of data typing. In principle, it can be used to specify for a variable:

(1) The permissible values it may contain. This is equivalent to the relational database principle of domains 1. The relational idea is that a domain holds only those values that directly reflect those that are valid in the real world. (In practice relational DBMSs generally only support domains to the level of conventional data types.)

(2) The permissible structure it has: e.g. the number and size of the dimensions of an array.

(3) The permissible functions or operations that may be applied to it. This is the equivalent of the Abstract Data Type, or the permissible messages that can be sent to an object in an OO programming language.

(4) The nature of permissible values and their unit of measurement. By 'nature' is meant what kind of value it is: a currency, a speed, a weight, or something else. The unit of measurement is what the value is measured in; if it is a speed, is it expressed in miles per hour or metres per second? House has written a proposal for dealing with this aspect 2.

These concepts may appear in various ways within a programming language. For example, OO programming languages use the concept of classes of objects to provide significant typing facilities. A class is a type definition that describes the state and behaviour of a group of similar objects. For each class there is a set of permissible operations (defined in terms of messages and methods) for the objects of that class. The state of an object is described by a set of properties, attributes or instance variables, each of which has a type that may be a class or a more conventional 3GL data type. Details vary significantly between different OO languages, but typically the first three kinds of data typing are all implemented in some form.

Domain data typing--the subject of this paper--is of the first-mentioned kind. However, where the host programming language permits the manipulation of data structures, it could be applied there as well; this was the case with the APL implementation of domain data typing.

The original reason for putting data types in a programming language was to specify how the data was to be physically stored. For example, referring to COBOL, Harrison states3:

Data has three main attributes: location, size and encoding. Each piece of data has a specific address in the computer's internal storage as its starting location and extends from that point for a certain length in that storage. It also has an encoding that represents the data . . . 'good' programming, and especially cost-reduction optimisation, is best done with some knowledge of these areas.

Of course, the encoding affected the nature of the routines that could be used to manipulate the data, i.e. in effect the permissible operations. See Fischer and Grodzinsky 4 for a history of data typing, and Cardelli and Wegner 5 for a survey of ideas on types.

Nowadays a much more important virtue of data typing is considered to be the assurance that a programming language variable can only contain the correct kind of value. For example, referring to ADA, Pyle states6:

A fundamental idea in ADA is that every data item has a particular type, which determines the possibilities for the values it may have . . . specifying not so much what its value currently happens to be, but what it might ever legitimately be.

0950-5849/95/$0.95 © 1995 Elsevier Science B.V. All rights reserved

Nevertheless the defining of numeric types by their encoding and size has become so habitual that it has led to their continued use even in ADA--e.g. SHORT_INTEGER, LONG_FLOAT--since 'Every type has a representation' (= an encoding). Such types can still be viewed as sets of permissible values, but not in terms of the application. Even so, ADA concedes the usefulness of application-oriented types by having enumeration types, numeric ranges, subtypes, etc., that are domains of values which are intended to be defined and used in an application-specific way. However, they are still limited in their usefulness. They cannot be applied in all circumstances sufficiently extensively, and so validation code is still commonly employed in the body of a program to ensure that a variable only contains permissible values. To further develop user-defined types along conventional lines so as to solve this problem would result in a larger and more comprehensive language. Presumably for this reason, it has not been pursued.

This paper describes a new strategy for creating application-specific data types as domains of permissible logical values. The strategy is to use the full power of the computational part of a programming language to generate data type domains, and then assign these domains to variables as their type. This requires only one addition to the language-- type assignment--but it yields a data typing facility of great power and flexibility. The power to size ratio of the language is maximized and language learning is minimized, as a programmer will already know how to use the data typing facility from using the computational part.

It is worth noting that the need for this domain data typing arose as a result of the author's industrial experience of developing commercial software. In this situation it is important to be able to separate different concerns. Specifically in this context it is important to separate the functionality of software from data integrity constraints, so as to give one's full attention to each in turn. Conventional third generation programming languages hinder this because stating the data integrity constraints is split into two phases. First a large set of permissible values must be chosen (i.e. a conventional data type) and later this can be refined and reduced in size to a more accurate set of permissible values (i.e. using validation code). Moreover the first phase is usually necessary just to have variables to use for developing the functionality, which then often intervenes between the two phases. The fact that data integrity is often not recognized as a single task but as the two independent tasks of typing and validation stems from the influence of conventional languages on our thinking 7; indeed the term data integrity is taken from the database field.

It is more productive to be able to concentrate on either the functionality or the integrity and not mix them. The proposed domain data typing facilitates this. It does not prevent one from moving between the two concerns at will--after all they may interact--or refining the design of each, as required.

The research project that developed the data typing principles did so in the context of the programming language APL, and implemented, tested and evaluated them in APL. The reasons for this were:

• APL is a mathematically based language, and hence is supportive of a precise, mathematically based definition of domain data types.

• APL is such a simple and weakly typed language (it just has numbers and characters) that it provides a 'clean sheet' for the development of domain data types. It would have been much more confusing to develop a new approach to typing in a language which already has a well-developed typing system (say PASCAL), or even an object-oriented language where as well as fundamental types, a class is also a type.

The remainder of this paper describes the basic principles of the proposal, its implementation in APL, its application to arrays and user-defined functions, and its use in program design; finally some conclusions are drawn. While the principles of domain data types are language independent, their description must necessarily be couched in terms of some language or other, and so APL is also used for that.

Basic principles

Three design principles were used:

• The philosophy, design and implementation of the data typing extension should be consistent with those of the host programming language (in this case, APL).

• The extension should utilize the minimum of new constructs.

• The extension should be as simple, general, and as orthogonal to other language design constructs as possible.

Thus the primary design decision was that assigning a type (defined as a mathematical set of all possible values) to a variable should be done in an identical manner to assigning actual value(s) to a variable. Type-assign should therefore be analogous to value-assign in its syntax and semantics. Whatever can legitimately appear to the left and right of one assignment symbol--i.e. a variable and an expression respectively--should also legitimately be able to appear to the left and right of the other, subject only to the semantic constraints of what is being assigned, as explained below.

Consequently a variable name has two kinds of value associated with it, its actual value (corresponding to a variable's conventional value) and a type value. Conventional value-assignment associates a variable with the former, and a type-assignment with the latter. In APL the symbol chosen for type assignment was ⇐, to demonstrate its similarity to ←, the APL symbol for value assignment.

In mathematics a set can be expressed in one of two ways. Either all the values in it can be enumerated, or a predicate or test is supplied which determines whether any given value is in the set. Therefore it would initially seem reasonable to allow both methods to be used to specify a type. The two kinds of type specification would be expected to lead to two kinds of values:


• the set of values constituting the type (an enumeration expression);

• a value of TRUE or FALSE indicating that the value applied in the specification is or is not a member of the type (a test expression).

However, we might wish a domain to consist of the single truth value TRUE (or FALSE). So the problem then arises of how to distinguish an enumeration type of TRUE (or FALSE) from a test result of TRUE (or FALSE). (Actually in APL the problem is worse than this in that TRUE and FALSE have no special identifiers and are represented by one and zero respectively.) Using two different formats is one possibility. For example, the presence of a parameter to represent the value to be tested would indicate a test expression. The absence of a parameter would indicate an enumeration expression.

But still the two kinds of expression require different actions from the interpreter or compiler. The result of a test expression states directly whether the tested value can be value assigned or not, whereas a test still has to be carried out after an enumeration expression to see if the proposed value is contained in that domain.

Bearing in mind our design principles and that type and value assignment should be analogous, since value assignment only causes one simple direct action, so too should type assignment. Thus it was decided to achieve this by allowing only test expressions and abandoning enumeration expressions. The latter can always be simply expressed in terms of the former, anyway, by adding a test to the enumeration that checks whether the proposed value is a member of the enumerated set. (In APL, this is trivially done by means of the membership test function ∊.) In this way, an enumeration can be considered as a special component of a test. Hence a single, general principle is provided to cover all cases. It also ensures that an interpreter's or compiler's actions are explicitly and completely specified in the test expression. For this reason, it is proposed that only test expressions should be used to implement domain data typing, regardless of the host programming language.

It follows that the proposed value must be able to appear in the test expression. For simplicity it should always be represented by a standard identifier. This has been chosen to be the symbol :# in our APL implementation.

It also follows that the type expression must evaluate to a code fragment which, when executed, will carry out the test on the value in :#. Thus, whereas the expression to the right of value assignment must evaluate to a 'conventional' value (which is then assigned), the expression to the right of type assignment must evaluate to a value that is an expression--a test expression--which is then assigned and will thereafter be automatically invoked and executed to test any putative value assignment to that variable.

So although a domain type might initially appear to be a set of values assigned as a type--thereby making it analogous to a normal value assignment to a variable--the principle of using test expressions means that what is type assigned is more general; the expression may correspond to a simple enumerated set of values, but can potentially be far more powerful than this, being only limited by the computational power of the language. This is why domain typing is so much more powerful than (say) the collection of application-oriented types in ADA. Despite being a single mechanism, it gives the opportunity for larger and more complex types to be expressed.
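The mechanism described so far can be imitated outside APL. The following is a minimal sketch in Python, not the paper's implementation: the names `Store`, `type_assign`, `value_assign` and `TypeViolation` are hypothetical, and a Python callable stands in for the test expression that is type assigned.

```python
class TypeViolation(Exception):
    """Raised when a proposed value fails a variable's domain test."""


class Store:
    """Variables that carry an actual value and, optionally, a type value."""

    def __init__(self):
        self.values = {}   # actual values
        self.types = {}    # type values (test expressions)

    def type_assign(self, name, test):
        # Analogue of type assignment: remember the test for later use.
        self.types[name] = test

    def value_assign(self, name, proposed):
        test = self.types.get(name)
        # An untyped variable is assigned regardless (the APL norm).
        if test is not None and not test(proposed):
            raise TypeViolation(f"{name} <- {proposed!r}")
        self.values[name] = proposed


env = Store()
env.type_assign("S", lambda v: v in range(1, 11))  # S may hold 1..10

env.value_assign("S", 1)        # accepted
rejected = False
try:
    env.value_assign("S", 11)   # rejected; S keeps the value 1
except TypeViolation:
    rejected = True
```

The only addition to an ordinary variable store is the second dictionary and the one check in `value_assign`, which is the sense in which the extension is small.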

Consider some simple examples of domain data typing:

(1) The scalar variable S has a domain type of the integers 1 to 10.

    S ⇐ ':# ∊ ⍳10'

The whole expression to the right of ⇐ is in quotes to indicate that it is a character string. It evaluates to a test expression in APL code. The integers 1-10 are generated by the function '⍳' with parameter 10. The expression states that the proposed value, :#, must be a member of the set of integers 1-10.

    S ← 1

S is assigned the value 1. It works successfully because 1 replaces :# in the type expression, which then evaluates to TRUE (1 in APL), so the program then proceeds with the value assignment.

    S ← 11
    TYPE ERROR
    S ← 11
      ^

S is assigned the value 11. It fails because 11 is not in the domain, and so the value assignment is not carried out but a type error generated instead.

(NB We consider later how we might take application-specific action as the result of detecting a type error; but for the moment we just consider the typing mechanism.)

(2) Variable S has a domain type of the two scalar values JAN and FEB. This is analogous to the previous example but with character data, except the data is held in a character array.

    MONTHS ← ' "JAN" "FEB" '

The APL character strings have been nested using double quote symbols, i.e. turned from 3-character arrays to single composite values, and then value assigned to the array MONTHS.

    S ⇐ ':# ∊ MONTHS'

MONTHS is used in assigning the type of S.

    S ← ⊂'JAN'

S is successfully assigned the single composite value JAN; '⊂' nests the character string JAN.

    S ← ⊂'APR'
    TYPE ERROR
    S ← ⊂'APR'
      ^

S is unsuccessfully assigned the single composite value APR, causing a type error.

MONTHS in effect holds an enumerated type. Since variables can be manipulated by programs, it also shows how dynamic rather than just static domain types can be provided. Although APL is dynamically typed, anyway, this approach could be used as a means of providing dynamic types even in an otherwise statically typed language such as PASCAL.


(3) One variable's domain type is made to depend on the value of another variable. Suppose that an employee's salary is dependent on their salary grade, of which there are five. Let MIN and MAX be arrays holding the minimum and maximum salaries respectively of each grade.

    MIN ← 5000 10000 15000 20000 25000
    MAX ← 9999 14999 19999 24999 29999

The minimum and maximum salary values of the 5 grades are assigned to MIN and MAX.

    GRADE ⇐ ':# ∊ ⍳5'

The variable GRADE is typed so that it can only take one of the integer values 1-5.

    SALARY ⇐ '(:# ≥ MIN[GRADE]) ∧ (:# ≤ MAX[GRADE])'

SALARY is typed so that it can take any value between a minimum and maximum level that is determined by the current value of GRADE (which indexes into the two arrays). Thus SALARY's type can vary, depending on GRADE's value.

    GRADE ← ⎕

A value for GRADE is input. '⎕' is an APL symbol indicating that input is to be taken directly from the terminal. It is checked by its type to see if it is a permissible grade.

    SALARY ← ⎕

SALARY is likewise input. GRADE is used to provide the relevant type for SALARY.
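The dependence of one variable's domain on another's current value can be sketched in Python as follows. This is an illustration under assumptions rather than the APL implementation: `values`, `tests` and `value_assign` are hypothetical names, fixed literals replace terminal input, and list indexing is shifted by one because Python counts from zero.

```python
MIN = [5000, 10000, 15000, 20000, 25000]
MAX = [9999, 14999, 19999, 24999, 29999]

values = {}
tests = {
    "GRADE": lambda v: v in range(1, 6),
    # GRADE's current value is looked up each time the test runs,
    # so SALARY's domain follows GRADE.
    "SALARY": lambda v: MIN[values["GRADE"] - 1] <= v <= MAX[values["GRADE"] - 1],
}


def value_assign(name, proposed):
    if not tests[name](proposed):
        raise ValueError(f"TYPE ERROR: {name} <- {proposed!r}")
    values[name] = proposed


value_assign("GRADE", 2)
value_assign("SALARY", 12000)    # within 10000..14999 for grade 2

value_assign("GRADE", 4)         # SALARY's domain is now 20000..24999
try:
    value_assign("SALARY", 12000)
    outcome = "accepted"
except ValueError:
    outcome = "rejected"
```

Note that after GRADE changes, SALARY still holds a value accepted under the old grade; the sketch only checks values at the moment they are assigned.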

In the examples so far, the result of the test expression has always evaluated to a single value of TRUE or FALSE (one or zero in APL). However, in a programming language where multiple values can be assigned in one statement (as in APL, which can be regarded as an array processing language), complications could arise if some of the values to be assigned were permissible and some were not. Normally when such an assignment happens, it is an atomic transaction, with all the individual assignments being made logically in parallel; it must all happen or none of it must happen. This is certainly the case in APL whenever a value assignment assigns a whole set of values to elements of an array, and it will be assumed to be the correct norm. It is undesirable for the type test to result in a mixture of TRUEs and FALSEs because it leaves the decision up to the interpreter or compiler as to how it must deal with it. What to do should be explicitly stated within the test, consonant with the view that there should only be test expressions and no enumeration expressions. This principle seems valid regardless of the host programming language. The following example illustrates the principle.

(4) The vector (one-dimensional array) V, constrained to hold only integers that are also positive, is assigned a new set of values:

    V ⇐ '∧/ (:# = ⌈:#) ∧ (:# > 0)'

:# represents the proposed new value of V, i.e. it is also an array of the same dimensions as V. The second comparison checks that every value in the array is positive, and the first comparison to its left checks that each value equals the mathematical ceiling, ⌈:#, of itself, i.e. is an integer. In APL such comparisons automatically apply to every element of the array. Logically ANDing the two comparisons together gives us a TRUE/FALSE value for each array element. The distributed AND over these results (accomplished by '∧/') yields TRUE or FALSE for the complete test.

    V ← 2 4 6 8

All the values assigned here are in the domain, so the assignment is successful.

    V ← 0 2 2.5 4
    TYPE ERROR
    V ← 0 2 2.5 4
      ^

As one value here is not positive and another is not an integer, the assignment fails with a type error.

In APL, '∧/' and '∨/' (a distributed OR operation) will be common ways of ensuring that a test gives a single result.
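For readers less familiar with APL reductions, the positive-integer test of example (4) can be sketched in Python, where the built-in `all` plays the part of the distributed AND (and `any` that of the distributed OR). The function name `positive_integers` is hypothetical.

```python
import math


def positive_integers(proposed):
    # One TRUE/FALSE per element: equals its own ceiling, and is positive.
    per_element = [(x == math.ceil(x)) and (x > 0) for x in proposed]
    # Distributed AND: collapse the per-element results to a single answer.
    return all(per_element)


ok = positive_integers([2, 4, 6, 8])
bad = positive_integers([0, 2, 2.5, 4])
```

Because the test returns one result for the whole array, the assignment it guards can remain all-or-nothing.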

The principles governing the use of domain types are now described. When value assignment is encountered, before assigning a value to the variable on its LHS, a check is made to see whether that variable has a type definition. If it does not, the action that is standard for the language in question is carried out: in APL, the value is assigned regardless; in other interpreted languages, the program may halt with an error; in most compiled languages, the lack of a type would be flagged as an error at compile time. If the variable has a type definition, the type definition is treated as a fragment of program code with :# replaced by the value to be assigned, and executed. If the execution yields TRUE (i.e. a one in APL) then the value assignment is carried out as normal. If it yields FALSE (i.e. a zero in APL) then a TYPE ERROR is generated and the assignment is not carried out; execution halts there. If the execution evaluates without an error to a value other than a single TRUE or FALSE value, execution will halt with a DOMAIN ERROR, since it has not been possible to ascertain unequivocally whether the proposed value(s) conformed to the variable's domain or not. In our APL implementation, execution would halt pointing at ⇐ in the type assignment statement. For example:

    V ⇐ ':# = 2 3 4 5'

The type statement states that V must be a vector of the four values 2, 3, 4 and 5.

    V ← 1
    DOMAIN ERROR
    V ⇐ ':# = 2 3 4 5'
      ^

V is assigned the value one. The assignment fails because one is not equal to any of the four values and so four values of FALSE are generated. Even though all four truth values are the same, it is not a single TRUE or FALSE result that tells the interpreter unequivocally whether or not to carry out the value assignment.
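The three possible outcomes of a checked value assignment can be laid out as a short Python sketch. It is an approximation under assumptions: the exception and function names are hypothetical, a Python `bool` stands in for APL's single one or zero, and a list comprehension mimics APL's element-by-element comparison.

```python
class TypeViolation(Exception):
    """The test yielded a single FALSE."""


class DomainError(Exception):
    """The test did not yield a single TRUE or FALSE."""


def value_assign(store, tests, name, proposed):
    test = tests.get(name)
    if test is not None:
        result = test(proposed)
        if not isinstance(result, bool):
            raise DomainError(name)      # no unequivocal answer
        if not result:
            raise TypeViolation(name)    # value is outside the domain
    store[name] = proposed               # untyped, or test yielded TRUE


def outcome(store, tests, name, proposed):
    try:
        value_assign(store, tests, name, proposed)
        return "assigned"
    except TypeViolation:
        return "TYPE ERROR"
    except DomainError:
        return "DOMAIN ERROR"


store = {}
tests = {
    # One comparison per element, as in the example above: four results.
    "V": lambda v: [v == x for x in (2, 3, 4, 5)],
    # The same comparisons reduced to a single result.
    "W": lambda v: any(v == x for x in (2, 3, 4, 5)),
}
results = [
    outcome(store, tests, "V", 1),
    outcome(store, tests, "W", 1),
    outcome(store, tests, "W", 3),
    outcome(store, tests, "U", 99),   # U has no type
]
```

Four FALSE values are reported as a DOMAIN ERROR rather than silently treated as FALSE, which keeps the decision inside the test expression instead of leaving it to the interpreter.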

If execution of the type definition is halted by some error, it should halt with the appropriate error message at the relevant point in the type definition, just as it would if that fragment of code had been executed for any other purpose. If the type assignment statement contains a syntax error, then it should be detected at input and reported on in the normal way.

Although all the type expressions shown in the examples have been quite simple, there is no inherent limitation to the complexity allowed. If the complexity is too large or too inconvenient to be expressed in one line of code, it could be embodied in a sub-routine (a function in APL) which is then used in the type statement. For example:

    S ⇐ 'CHECK :#'

where CHECK is a sub-routine of arbitrary size and complexity. Given that conventional data validation is code of arbitrary size and complexity, it can be seen that data validation code can in effect be combined with the data type statement. This should not surprise us as both have the same purpose, namely to constrain the values that a variable may take to those that are permissible. Indeed it is a benefit that these two disparate methods can be unified in one kind of statement. It means that data typing/validation code can be kept quite separate from other code. (If desirable, libraries of standard validation routines could be kept and combined appropriately into a sub-routine called in a data typing statement.)

Since a variable can be repeatedly assigned values, by our design principles it ought to be possible repeatedly to assign types to it. This fits in naturally with the philosophy of APL. However, in a conventional language which requires all variables to be declared with a type once only at the beginning, our design principle requiring consistency with the host language would suggest that the same should apply to domain data typing. In fact the domain data typing concept is independent of which choice is made. The decision should be made on the basis of which option best harmonizes it to the programming language in question. In the case of APL, the decision was to allow repeated data typing, but this need not be the decision for every language.

Allied to this is the question of whether the use of domain data typing should be optional or mandatory in a language that possesses it. Again the choice is independent of the concept and should make domain data typing harmonize with the host language. So in the case of our APL implementation it was optional; but for most languages its use (or that of some other form of typing) would be mandatory.

There are dangers in optional and repeated data typing. Variables may receive no type when they should. A variable may receive a type that is inconsistent with a value that it already has. (Of course, some system software check of the program could flag such errors.) However, the following examples illustrate that if used properly they can bring some advantages.

• It can aid debugging by inserting new type statements in places that will check for particular values that should result from particular test data.

• Since domain typing can involve very complex data validation, it can be useful to 'switch off' a type at a later stage for efficiency reasons. For instance, suppose on the first value assignment to a variable a sophisticated domain typing check is applied. If subsequent value assignments to it are derived from the original value by an algorithm that is considered to be correct--by reason of proofs carried out, comprehensive testing, or whatever is considered acceptable--further (time-consuming) typing checks may be pointless. It would be an engineering design decision as to whether to remove the typing for efficiency or retain it as 'redundant structure' for reliability.

• It can be used to control changes in value, which can be important in practical data validation. In the following example S is constrained to receive only a value that is less than its current value.

    S ⇐ '(S > :#)'

The typing exploits the fact that a variable's name can appear in its own type statement, in which case it refers to its current value, while :# always refers to the future value to be assigned. Note that although a variable now has both an actual value and a type value associated with it, using the variable name retrieves only the actual value according to the usual conventions. (Retrieving the type value is looked at later.)

For example, the type used to constrain a variable's initial value (say from an input) could thereafter be changed to constrain subsequent changes to the value brought about by the program's processing of the variable.

• One can ensure that after a variable (X say) has received a value, it retains it as a constant, by using the type assignment (in APL)

    X ⇐ '0'

after X has received the value. This is because such a type assignment means that any value assignment thereafter would generate a TYPE ERROR.
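The last two uses of repeated typing, controlling changes in value and freezing a variable as a constant, can be sketched together in Python. The class `Var` and its methods are hypothetical; the test is given both the variable (for its current value) and the proposed value, mirroring the distinction drawn above between the variable's name and the proposed-value symbol.

```python
class Var:
    """A variable with an actual value and a replaceable type test."""

    def __init__(self):
        self.value = None
        self.test = None

    def type_assign(self, test):
        # A later type assignment simply supersedes the earlier one.
        self.test = test

    def value_assign(self, proposed):
        if self.test is not None and not self.test(self, proposed):
            raise ValueError("TYPE ERROR")
        self.value = proposed


def try_assign(var, proposed):
    try:
        var.value_assign(proposed)
        return True
    except ValueError:
        return False


s = Var()
s.value_assign(10)                                    # untyped so far
s.type_assign(lambda var, new: var.value > new)       # may only decrease
log = [try_assign(s, 7), try_assign(s, 9)]            # 7 accepted, 9 rejected

s.type_assign(lambda var, new: False)                 # constant from now on
log.append(try_assign(s, 1))                          # rejected
```

A test that always yields FALSE corresponds to the constant-making type assignment shown above: no later value assignment can succeed.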

So far, all the examples have had character string constants on the right of ⇐ that directly represent the APL code of the types' test expressions. These expressions have then been invoked and executed later by the APL interpreter as necessary to test any putative value assignments. However, in APL, expressions to the right of ⇐ could be any kind of expression which, when evaluated, yielded a character string representation in APL code of the test expression. (For clarity, the expression to the right of ⇐ is called the type specification, whereas what it evaluates to is called the type definition.) The following example illustrates this:

    A ← '(:# < 10)'

A character string (between speech marks) of executable code is value assigned to variable A.

    B ← '(:# > 1)'

A character string of different executable code is value assigned to variable B.

    S ⇐ A, '∧', B

Using commas, the two variables are concatenated together with the '∧' character between them, and the result is then type assigned to S.

Thus the domain type actually assigned to S was:

    '(:# < 10)∧(:# > 1)'

Hence domain types can be generated under program control in APL. Whether they could be so generated in other languages depends on the possibility of types being determined at program run time, or whether they must be completely determined earlier, typically at compile time for most languages.

Since a variable now has an actual value and a type value, we need to be able to access both. The actual value is accessed by using the variable name in the normal way. To access the type value a monadic retrieval function was developed in APL, a monadic version of ⇐. Its argument can be any variable. The result is a character string comprising the type definition (compare the type specification). If the argument does not have a type, a null result (i.e. a zero length character vector) is returned. The result can be manipulated like any other character string value. For example, it can be stored in a variable (as its value--compare this with the previous example) or used in specifying the types of other variables. The latter allows the type of a variable to be specified in terms of the types of others. (It could be used to simulate the inheritance mechanism of OO programming 8.) The following example illustrates this:

    Z ← ⇐V

V's type is retrieved and stored as a character string value in Z.

    X ⇐ Z, '∧', ⇐Y

Y's type is retrieved, concatenated with the value of Z with the '∧' character between them, and then the complete character string is type assigned to X.

Monadic ⇐ does not have a counterpart with ← because the name of a variable is used to retrieve a variable's value. However it is akin to Gfeller's proposed 'Parts' function for retrieving attributes of an array 9, or APL90's 'Dot Notation' for accessing properties of an object 10. Whether monadic ⇐ would be implemented in another language depends in practice on the same factors as govern the creation of domain types under program control, since there is little point in having it if all types must be known before run time.
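Retrieval of a type value, and its reuse in another variable's type, might be sketched in Python as follows. Everything here is an assumption made for illustration: type definitions are kept as strings in a dictionary, `v` stands in for the proposed-value symbol, `type_of` plays the part of the monadic retrieval function, and the two example definitions for V and Y are invented.

```python
types = {}


def type_assign(name, definition):
    types[name] = definition


def type_of(name):
    # An untyped variable yields a zero-length string.
    return types.get(name, "")


def permits(name, v):
    definition = type_of(name)
    if definition == "":
        return True
    return bool(eval(definition, {"__builtins__": {}}, {"v": v}))


type_assign("V", "(v > 0)")          # hypothetical type for V
type_assign("Y", "(v % 2 == 0)")     # hypothetical type for Y

Z = type_of("V")                     # V's type, stored as an ordinary value
type_assign("X", Z + " and " + type_of("Y"))
```

X's type is thus the conjunction of the types of V and Y, which is the kind of composition that could serve to imitate inheritance.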

Handling arrays

Although the structure of arrays is conceptually different from the values that populate that structure, because of the interactive nature of APL, it contains primitive functions to retrieve and manipulate array structures (i.e. the shape of arrays, and the nesting of arrays within each other). Since a key idea of domain data typing is to use the computational part of the language to define domain types, it was of interest to investigate whether it could be used directly to control permissible structures, as well as to investigate the interaction with controlling the permissible values of array elements.

The array structure primitives can be used directly in a type statement. For example, in APL, an array V could be typed to be a one-dimensional array of three elements:

    V ⇐ '(∧/3 = ⍴:#) ∧ (1 = ⍴⍴:#)'

The function '⍴' yields the shape of the variable it is applied to, i.e. the size of each of the variable's dimensions. Thus ⍴:# must be equal to 3. However suppose :# had more than one dimension. Then in APL, '3 = ⍴:#' would yield a vector of ones and zeros (TRUEs and FALSEs), one value per dimension. So distributed AND, '∧/', is applied to the result to ensure that however many dimensions :# possesses, the final result will be a single zero or one. The second constraint, '1 = ⍴⍴:#', ensures that V only has one dimension; as (⍴:#) is itself a one-dimensional variable with a size value for each of :#'s dimensions, when ⍴ is applied to (⍴:#), it gives the size of each dimension of (⍴:#), which must be one for a one-dimensional :#.
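The two-part shape constraint can be followed step by step in a Python sketch. Python has no shape primitive for nested lists, so the helper `shape` below is a hypothetical stand-in that assumes rectangular data; `is_three_vector` then mirrors the two constraints in turn.

```python
def shape(a):
    """Size of each dimension of a rectangular nested list; () for a scalar."""
    dims = []
    while isinstance(a, list):
        dims.append(len(a))
        a = a[0] if a else None
    return tuple(dims)


def is_three_vector(proposed):
    dims = shape(proposed)
    every_dimension_is_three = all(d == 3 for d in dims)   # distributed AND
    exactly_one_dimension = len(dims) == 1                 # shape of the shape
    return every_dimension_is_three and exactly_one_dimension


checks = [
    is_three_vector([1, 2, 3]),           # a 3-element vector
    is_three_vector([[1, 2, 3]] * 3),     # 3 by 3: passes the first part only
    is_three_vector([1, 2]),              # wrong length
    is_three_vector(5),                   # a scalar has no dimensions
]
```

The 3 by 3 case shows why the second constraint is needed: every dimension is 3, yet the value is not a vector.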

If the type definition applies to an array, then :# represents that array. The elements of :# are manipulable in just the same way as those of any other array. This allows any subset of the array to be selected and typed. The following example illustrates how in APL the first element of V may be constrained to have the value 6:

    V ⇐ ':#[1] = 6'

To maintain the syntactic analogy between value and type assignment, any expression that is valid on the left of ← can appear on the left of ⇐ as well. Thus in APL indexed arrays are allowed on the left of ⇐. In that case :# would represent the array of elements selected by the index expression on the left of ⇐. Thus to type the first element of V one could also write:

    V[1] ⇐ ':# = 6'

Here :# represents V[1]. Arguably the latter form provides a higher level of readability as the typed element is clearly visible on the left of ⇐.

Both the above forms can be combined in a single type statement. For example:

PART[3,6] ⇐ '~(π[1] ∊ ⊃π[2])'

π represents two elements of PART, viz PART[3,6]. So π[1] refers to PART[3], and π[2] refers to PART[6]. PART is assumed to be an array containing the attributes of a manufactured part, and the purpose of the typing is to say that a part cannot be a member of the set of parts that make it up. The third element is the part's identity code, and the sixth element is a nested vector of the identity codes of component parts making up this one. '⊃' unnests the single composite value π[2] into a set of values so that ∊ can be applied to it. '~' negates the result of the membership test to achieve the purpose of the constraint.
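The PART constraint can be restated as a short Python sketch, offered as an assumption-laden illustration rather than as the implemented mechanism. The names `part_type` and `assign_part_fields` are hypothetical, and the 0-based indices 2 and 5 correspond to the third and sixth elements of PART.

```python
def part_type(selected):
    """Domain test for the pair [identity_code, component_codes]."""
    identity_code, component_codes = selected
    # A part's own code must not occur among its component codes.
    return identity_code not in component_codes


def assign_part_fields(part, identity_code, component_codes):
    """Assign both typed elements only if the domain test accepts them."""
    if not part_type([identity_code, component_codes]):
        raise ValueError("a part cannot be a component of itself")
    part[2] = identity_code
    part[5] = list(component_codes)
    return part
```

Because both elements are typed in relation to each other, the sketch checks them together, as the single type statement does.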

As with value assignment, a new type assignment to an array variable simply supersedes any existing type. However, there is a major difference in re-typing array variables as opposed to scalar variables. In the case of arrays, it is possible for a subsequent type assignment to cover only a subset of the elements typed in the previous assignment. Consider the following example:

MATRIX ⇐ '(2 = ⍴⍴π) ∧ (∧/2 2 = 2↑⍴π) ∧ (π[1;1] = π[2;2])'

The typing constraint is the logical AND of three separate parts. The first part says that MATRIX must be a two-dimensional array. The second part says that each of the two dimensions must be two elements long. ('2↑x' takes the first two values of 'x', and again the distributed AND protects against arrays with dimensions other than two.) The third part says that the element in the first column of the first row must have the same value as that in the second column of the second row. Now let the following type assignment be made:

MATRIX[;2] ⇐ '(∧/0 > π)'

This says that every element in the second column of MATRIX must be less than zero.

The combined effect of these two statements must be clearly defined. One scheme might be to replace the old type of each array element by its new type, with elements that are not re-typed retaining their existing type. Unfortunately this scheme is unable to deal with all possible situations. For instance, MATRIX[2;2] is typed in relation to MATRIX[1;1] in the first type statement. It is then re-typed in the second statement as part of the second column, but MATRIX[1;1] is not. The replacement of the type of MATRIX[2;2] is not acceptable because it removes what is also the type of MATRIX[1;1].

In other words, the problem with this scheme is that when an array element is typed in relation to another element of the same array, re-typing the former would remove the type of the latter.

The scheme implemented instead is for the type of an array (or part thereof) to be replaced only if exactly the same array (or part thereof) appears on the left of ⇐. So the type assignment to MATRIX[;2] would not replace any part of MATRIX's existing type. Instead it would add a further constraint on the second column of MATRIX to what already existed.

This technique is particularly useful when a complicated set of types is to be applied to the various elements of an array. Each set of logically related elements can be typed separately in its own type statement. Furthermore, each type statement can be redefined without interfering with others.

An important consequence of this scheme is that an array element can accrue several types in a sequence of different type statements. Hence a value assignment to that element invokes each type definition covering that element. Each must evaluate to TRUE if the element is to be assigned the value. This is the equivalent of logically ANDing the results of all the type evaluations together. Such a situation applies to elements of the second column of MATRIX, so a value assignment to element MATRIX[1;2] would invoke both type definitions. Therefore the assignment:

MATRIX[1;2] ← 5

would fail because the type definition of the second column of MATRIX would evaluate to FALSE (zero in APL).

Some of these aspects of array typing could be applied in other languages. However, limitations would arise where a language did not have the equivalent of APL's '⍴', and many languages do not possess it since their static arrays mean they have no requirement for it.
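The accumulation rule for array types, in which a type assignment replaces an earlier one only when exactly the same target appears on its left, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the APL implementation: the class `TypedMatrix`, the exception `DomainTypeError`, the string labels used as targets and the `covers` callbacks are all hypothetical, and indices are 0-based.

```python
class DomainTypeError(Exception):
    """Raised when a putative value fails a domain test."""


class TypedMatrix:
    """A 2-D list whose type statements are keyed by the target they name."""

    def __init__(self, rows):
        self.rows = [list(row) for row in rows]
        self.types = {}  # target label -> (covers, test)

    def set_type(self, target, covers, test):
        """Type-assign: the same target replaces, a new target accumulates.

        covers(r, c) says whether the statement applies to element (r, c);
        test(matrix) receives the whole putative matrix.
        """
        self.types[target] = (covers, test)

    def assign(self, r, c, value):
        """Value-assign one element, checking every type that covers it."""
        putative = [list(row) for row in self.rows]
        putative[r][c] = value
        for target, (covers, test) in self.types.items():
            if covers(r, c) and not test(putative):
                raise DomainTypeError(f"{target} rejects {value!r} at ({r}, {c})")
        self.rows = putative


def build_example():
    """Recreate the two MATRIX type statements from the text."""
    m = TypedMatrix([[-1, -2], [-3, -1]])
    m.set_type(
        "MATRIX",
        lambda r, c: True,
        lambda a: len(a) == 2 and all(len(row) == 2 for row in a)
        and a[0][0] == a[1][1],
    )
    m.set_type(
        "MATRIX[;2]",
        lambda r, c: c == 1,
        lambda a: all(row[1] < 0 for row in a),
    )
    return m
```

Assigning 5 to the element in the first row and second column is rejected by the second statement, while a change to the last element is still checked against the first statement, so neither type has displaced the other.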

Functions and sub-routines

In principle, domain data typing should be applicable to the arguments and result of functions and sub-routines, with the purpose of controlling the permissible values that are passed to and from them. This was investigated with user-defined functions in APL, which correspond to user-written functions and sub-routines in other programming languages.

The usual 'black box' principle was assumed for functions/sub-routines. When arguments are passed to the black box, they can only enter it if they are of the type defined for the function/sub-routine. Likewise a result can only be passed out of the box if it is of the defined type. The act of passing parameters in and results out is regarded as the equivalent of the value assignment and hence is when type checks are applied. The black box principle also means that argument and result types can be specified in terms of variables and functions in the environment of the function/sub-routine, but cannot involve those within its body.

Because of the black box principle, the original intention was to define argument and result types within the header of a function, as is normal in programming languages. The function header, in APL as in other languages, determines the interface between the function and its environment, and the body determines what goes on inside the function.

However, the format of function headers in APL meant that the header line could become very long and complex, even if (say) a specific parenthesized format were used to clarify it. Furthermore, to re-format the APL header line (which is no problem in principle) would require the amendment of a lot of other software in the APL environment and there was not time in the project to carry it out. Thus it was decided that the argument(s) and result would not be typed in the header line, but instead in independent type statements similar to those used for variables. For most programming languages this problem would not arise and domain types could be put in the header.

With hindsight, this does not appear to be a disadvantage. The aim of domain typing is to be able to specify type domains in terms of individual applications. That being so, function types can be expected to be different in each application, perhaps even for every function invocation in an application (which can be achieved via repeated type assignment). The black box principle supports this. Constraints on input and output for the purpose of ensuring the function terminates properly, and does not crash or return incorrect values, should be an integral part of the function itself; they should not be dependent on constraints outside the function/sub-routine which the programmer may or may not have remembered to provide. Only I/O which conformed to both sets of domain types would be acceptable.

Two different type statements must be provided for a function/sub-routine, one for the input argument(s), the other for the result. The reason for this is that at the moment of calling the function/sub-routine, it has self-evidently not yet produced a result; therefore if the two type statements were combined into one, it would halt with an error whenever it reached a reference to the non-existing result. With regard to the result type statement, there is no reason why it may not also refer to the argument(s), which will have the values they had on entering the function/sub-routine. Indeed this may be very useful as a result's permissible values will often depend on the argument values.

Consider an APL example where the arguments and result of the dyadic function FN are typed through the following statements respectively (notice their order could have been reversed if desired):

⍺ FN ⍵ ⇐ '((⍴⍴⍺) = (⍴⍴⍵)) ∧ (∧/(⍴⍺) = (⍴A))'
FN ⇐ '∧/(⍴π) = (⍴⍺)'

The left and right arguments of a dyadic function are represented by ⍺ and ⍵ respectively. (In APL, a monadic function would have its one argument, ⍵, to the right of the function name.) They are used in the same way as π for variables. They also appear in the phrase on the left of ⇐ when it is an argument type statement, to distinguish it from a result type statement, which contains neither. Otherwise the one statement would supersede the other according to the type re-definition rules already established. π represents the result of the function, and was chosen because a function's result corresponds to a variable's value; a niladic function is equivalent to a variable. Hence the type statement for a function result is syntactically identical to that for a variable, except for the possible inclusion of ⍺ and/or ⍵ in the type specification to the right of ⇐.

The above example specifies the following:

• Argument type statement: firstly the two arguments must have the same number of dimensions; secondly the shape of the left argument (i.e. number and size of each dimension) must be the same as that of the variable A.

• Result type statement: the result must have the same shape as the left argument.

The example demonstrates that as well as being able to specify types in terms of external variables and constants, the function result and arguments can also be typed in relation to each other.
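The separation into an argument type statement and a result type statement can be sketched in Python with a decorator. This is only an analogy under stated assumptions: `typed`, `DomainTypeError`, `shape` and the example function `fn` are hypothetical names, the external variable `A` mirrors the variable A of the APL example, and a decorator binds the types at definition time, whereas the approach described here allows them to be re-assigned independently of the function.

```python
import functools


class DomainTypeError(Exception):
    """Raised when an argument or a result falls outside its domain."""


def shape(value):
    """Sizes of the dimensions of a rectangular nested list."""
    dims = []
    while isinstance(value, list):
        dims.append(len(value))
        if not value:
            break
        value = value[0]
    return tuple(dims)


def typed(argument_test=None, result_test=None):
    """Attach separate domain tests to a function's arguments and result."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args):
            # Arguments are checked on entry, before any result exists.
            if argument_test is not None and not argument_test(*args):
                raise DomainTypeError(f"arguments rejected for {fn.__name__}")
            result = fn(*args)
            # The result test may also refer to the argument values.
            if result_test is not None and not result_test(result, *args):
                raise DomainTypeError(f"result rejected for {fn.__name__}")
            return result
        return wrapper
    return decorate


A = [0, 0, 0]  # external variable referred to by the argument test


@typed(
    argument_test=lambda left, right: len(shape(left)) == len(shape(right))
    and shape(left) == shape(A),
    result_test=lambda result, left, right: shape(result) == shape(left),
)
def fn(left, right):
    """Element-wise sum, truncated to the shorter argument."""
    return [l + r for l, r in zip(left, right)]
```

Calling `fn([1, 2, 3], [1, 2])` passes the argument test but fails the result test, because the result no longer has the shape of the left argument.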

The following points may be stated concerning the valence of functions (i.e. whether they are dyadic, monadic or niladic, the only possibilities in APL):

• A dyadic/monadic function is specified by the presence of ⍺ and ⍵ / just ⍵ to the left of ⇐ in a function's argument type statement. In the case of a dyadic function, both ⍺ and ⍵ should appear to the left of ⇐ even if only one of them is type checked.

• For a niladic function of course, no argument type statement can exist.

• The presence of an ⍺ in the type definition of a monadic function argument or result is signalled with a VALUE ERROR when the typing check is carried out. The case of a niladic function is analogous.

• To provide maximum flexibility in defining and editing functions and their types, the correspondence between the valence given by the function header-line and that given by the argument type statement is only checked at function call time in APL.

The general principles applied to function type statements would be relevant with any host language. However, with most languages, their implementation would have to differ at the very least to take account of the use of a compiler instead of an interpreter (used by APL), and to allow more than two input arguments and more than one output result. Some languages do allow a function/sub-routine header statement to be separate from the body; this could facilitate a similar approach in these languages to that used in APL.

In APL, the types assigned to arguments and results can be obtained via monadic ⇐ as they can be with variables. The types of arguments are retrieved by applying ⇐ to "⍺ FnName ⍵" or "FnName ⍵", depending on the function's valence. The type of the result is retrieved by applying ⇐ to the function name. Using the above function FN, this is exemplified by:

      ⇐ ⍺ FN ⍵
((⍴⍴⍺) = (⍴⍴⍵)) ∧ (∧/(⍴⍺) = (⍴A))
      ⇐ FN
∧/(⍴π) = (⍴⍺)

Consequences of domain data typing

Conventionally, the permissible values that a variable could contain have been specified in two places. (Note that although the following discussion is couched in terms of variables, it applies equally well to a function result or a function argument.) Firstly they are specified in data type statements, then in validation code. The contention here is that these two tasks are really one and the same task, but it has had to be split up into these two because of the limitations of conventional programming languages.

Domain data types are specified in one place and are much more precise, powerful and flexible. Among other things they can provide:

• dynamic typing, rather than just a static set of values;

• referential integrity, as in a relational database¹¹: i.e. the permissible values of one variable depend on what another variable contains or is allowed to contain;

• control over changes in values.
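The three capabilities listed above can be illustrated with a small Python sketch. It is not the APL mechanism itself; the class `Workspace`, the exception `DomainTypeError` and the variable names are hypothetical, and each domain test is simply given the putative value together with the current contents of all variables.

```python
class DomainTypeError(Exception):
    """Raised when a putative value falls outside a variable's domain."""


class Workspace:
    """Variables plus at most one domain test per typed variable."""

    def __init__(self):
        self.values = {}
        self.tests = {}

    def set_type(self, name, test):
        """Type-assign: test(putative, values) must return True or False."""
        self.tests[name] = test

    def assign(self, name, putative):
        """Value-assign, evaluating the domain test at this moment."""
        test = self.tests.get(name)
        if test is not None and not test(putative, self.values):
            raise DomainTypeError(f"{putative!r} is outside the domain of {name}")
        self.values[name] = putative


def build_workspace():
    ws = Workspace()
    ws.assign("departments", {"D1", "D2"})
    # Referential integrity: the domain depends on another variable.
    ws.set_type("employee_dept", lambda new, env: new in env["departments"])
    # Control over changes: the new value may never be below the old one.
    ws.set_type("counter", lambda new, env: new >= env.get("counter", new))
    return ws
```

Because the test is evaluated at each assignment, adding a department to `departments` immediately enlarges the domain of `employee_dept`, which is the dynamic aspect.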

Their impact on program design is through a better co-ordination and separation of concerns. Domain data typing provides data-typing-cum-validation all in one place rather than two separate places. Furthermore, because the computational part of the language is used for this, the same approach is used throughout the program. More modular code is facilitated because anything to do with checking the content of a variable (as opposed to processing it) can be separated out and placed in a type statement. If the data checking is complex, it can be embedded in a separate function which is called by the type definition. Thus the program can be clearly separated into typing and processing modules. This conceptual separation is very supportive of good design.

There is a similarity to the relational database philosophy that one can separate the specification of a relation from the use to which it is put. SQL or any relational algebra or calculus are languages for writing expressions which evaluate to a relation, which can then be used as the scope of a retrieval, view, access rights, etc.¹² Here the imperative part of a programming language is used to write expressions that will evaluate to values, not only for use in the functional part of a program but also for use in data types.

However, in reality the problem is more complex than just combining conventional data typing with data validation. The requirements of a domain type vary during the flow of execution of a program. Where data enters the program, a thorough check is usually required to ensure that inappropriate values are not accepted into the receiving variable. When values are assigned thereafter on the basis of valid input, the purpose of any type is to make the program more robust by providing a redundant structure to cope with data errors caused by faults occurring elsewhere in the program, i.e. faults in the algorithm or input checks. Since such redundant structure always causes inefficiency, it is an engineering design decision as to what the balance should be between robustness and efficiency. This aspect of program design is not always properly considered; conventional types do not offer much scope here. A consequence of this is that a variable's desired type could change considerably during the execution of a program, implying the need to re-type it.

The use of sub-routines and library routines is also affected. To what extent should a routine's types be heavily engineered to handle any circumstances, or tailored to suit each occasion on which it is used?

Having considered the ideas of domain data typing, it is useful now to formalize them. For each domain data type, a test must be specified which will determine whether a putative value may be value assigned or not. In general terms, such a test is formally defined as follows. Let 'Test_x' be a function that represents the test expression of the domain type of variable x, and let π be its putative new value; let vars be the set of all other variables in scope at the same time as x. Let X be the base type of π and x. Each variable in vars will have some base type V_i drawn from the set of all available base types in the programming language, BASE_TYPES. Note that for complete generality, while these variables are available, it is not mandatory for any particular test to refer to any of them if they are not required; they are all optional, although one would typically expect at least π in it (but see the example earlier where a constant value is enforced on a variable).

The type of Test_x is:

$$\mathrm{Test}_x : X \times X \times V_0 \times \dots \times V_n \to \mathrm{Bool}$$

where $0 \le i \le n$, $n \in \mathbb{N}$, and $V_i, X \in \mathrm{BASE\_TYPES}$.

Then:

$$\mathrm{Test}_x(\pi, x, \mathit{vars}) = \begin{cases} \mathrm{TRUE} & \text{if the test determines that } \pi \text{ is a permissible value, i.e. falls within the domain, and should be value assigned,} \\ \mathrm{FALSE} & \text{otherwise.} \end{cases}$$

The domain itself comprises all possible values of π which at that run-time point satisfy the test.

Let 'Assign' be a function representing the value assignment to a variable.

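$$\mathrm{Assign} : \mathrm{Bool} \times X \times X \to X$$

Then:

$$\mathrm{Assign}(\mathrm{Test}_x(\pi, x, \mathit{vars}), \pi, x)$$

is given by:

$$\mathrm{Assign}(\mathrm{TRUE}, \pi, x) \Rightarrow x := \pi$$
$$\mathrm{Assign}(\mathrm{FALSE}, \pi, x) \Rightarrow x := x$$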

Corresponding formal definitions could be written for function/sub-routine arguments and results.

Note that x, and any other variables involved, have some primitive base type. In APL these base types are numbers and characters. (For any number, the APL interpreter automatically decides on the best physical storage size and encoding from several it has available.) Thus a domain is a sub-set of values taken from a base type. The base type is important in that it determines what functions and operations may be applied to a variable; it encapsulates the view of types as being defined by the permissible functions and operations that are applicable. Base types also constrain the functions and operations that can be used to derive a domain type.

APL's weak typing would also permit a domain to be a mixture of its two base types, although one would not expect that to arise often in practice. In other languages, if we assume that their current types correspond to base types, the extent to which base types could be mixed in adding domain types would depend entirely on the rules of the language, involving such factors as the possible coercions from one type to another and the degree of polymorphism permitted.

Having a domain which is a mixture of base types creates potential problems. Semantically it may not always be possible to combine two variables that both have that same domain type! For example, suppose a domain comprised a set of numbers, some drawn from a numeric base type and some expressed via a character base type. Then adding together two variables drawn from that domain would be impossible if one held a 'numeric' number and the other a 'character' number. Since the main purpose of types is to impose constraints that enforce correctness, it is safer to always constrain domain types to be subsets of base types.
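The difficulty with a mixed-base-type domain is easy to reproduce in a language that, like Python, does not coerce characters to numbers. In this sketch the domain test `in_mixed_domain` and the function `add_values` are hypothetical names chosen for the illustration.

```python
def in_mixed_domain(value):
    """A domain mixing base types: the integers 1..9 or the characters '1'..'9'."""
    if isinstance(value, bool):
        return False
    if isinstance(value, int):
        return 1 <= value <= 9
    if isinstance(value, str):
        return len(value) == 1 and value in "123456789"
    return False


def add_values(a, b):
    """Add two values that are both assumed to lie in the same domain."""
    return a + b
```

Both `2` and `'3'` satisfy the domain test, yet `add_values(2, '3')` raises Python's built-in `TypeError`, so membership of the same domain does not guarantee that two values can be combined.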

This raises the question as to whether a large range of conventional 3GL types (which would be base types if domain typing were added to such a language) is really desirable, particularly a range of numeric types, of which there can be many. From the user and programmer perspective they are usually of little interest. Their choice is normally a technical chore which programmers repeatedly have to do for every program they write, and which allows scope for error. If domain types were added, it would be preferable if only numeric and character base types existed, plus perhaps any specialist types like date-time, and the compiler/interpreter used the domain type statements to derive the most suitable machine code, including size, encoding, and choice of implementation routines. (The language ML already has some ability to do this¹³.) It would reduce the size of programming languages overall while increasing their power. It would also increase programmer productivity.

In practice it is not sufficient merely to detect an error; one must trap it and carry out an appropriate recovery procedure. Although it was not the purpose of the project to investigate this aspect, a few words as to how domain data typing relates to it are appropriate.

Whatever kind of data typing is involved, there are always conceptually the following steps:

(1) Permissible values are specified.

(2) A check is made to see if the putative assignment value lies within the domain.

(3) If the check is satisfactory, make the assignment; if not raise a type exception.

(4) (In the case of a type exception only) carry out some appropriate action to deal with the error.

In conventional data typing, step 1 is provided by the type definition, steps 2 and 3 are carried out by code inserted by the compiler with an exception or interrupt being raised in the case of an error, and step 4 is the output of a standard type error message.

In data validation code, steps 1, 2 and 3 are often combined in that typically a test condition built from IF or CASE statements implicitly determines the permissible values, carries out the check, and as a result determines what action to take next. Step 4 can sometimes be extensive and comprehensive, particularly if it is in support of (say) a flexible, user-friendly input module.

In domain data typing, steps 1 to 3 are carried out analogously to conventional typing, as explained earlier. In the APL implementation of it, step 3 was carried out in the conventional manner by generating a type exception and step 4 by trapping the exception and using the trap to invoke suitable functions (which could be extensive and comprehensive if required, exploiting the full power of APL). The APL interpreter used as the basis for the implementation already possessed a sophisticated exception or interrupt trapping mechanism, and TYPE ERROR was added to the interpreter as another exception. (The exception trapping mechanism is similar in principle and scope to the exception trapping facilities provided in ADA, PL/I, ML and C). A TYPE ERROR trap and recovery function(s) could be added to a data type and become a part of it. In this way it was possible to make error recovery specific to each variable, as well as being specific to a TYPE ERROR on that variable.
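The four steps, with a recovery function attached to an individual variable, might look as follows in Python. This is a sketch under stated assumptions rather than a description of the APL interpreter's trap mechanism: `DomainTypeError` stands in for TYPE ERROR, and `TypedVariable` and `clamp_to_percent` are hypothetical names.

```python
class DomainTypeError(Exception):
    """Stand-in for the TYPE ERROR exception."""

    def __init__(self, name, value):
        super().__init__(f"{value!r} is outside the domain of {name}")
        self.name = name
        self.value = value


class TypedVariable:
    """A variable carrying its domain test and an optional recovery function."""

    def __init__(self, name, test, recover=None):
        self.name = name
        self.test = test        # step 1: the permissible values
        self.recover = recover  # recovery specific to this variable
        self.value = None

    def _checked_assign(self, putative):
        if not self.test(putative):                     # step 2: the check
            raise DomainTypeError(self.name, putative)  # step 3: exception
        self.value = putative                           # step 3: assignment

    def assign(self, putative):
        try:
            self._checked_assign(putative)
        except DomainTypeError as error:
            if self.recover is None:
                raise
            self.recover(self, error)                   # step 4: recovery


def clamp_to_percent(variable, error):
    """Hypothetical recovery: replace the rejected number by the nearest bound."""
    variable._checked_assign(max(0, min(100, error.value)))
```

A variable created with `recover=clamp_to_percent` turns an attempted assignment of 150 into 100, while a variable without a recovery function simply propagates the exception to its caller.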

The foregoing raises the general question, should the error recovery procedure in principle be a part of the domain data type or not. One could answer that it should not, because this provides a better separation of concerns; the purpose of the data typing is to establish whether the putative assignment value is in the domain or not, and if it is established that it is not, it is a separate task to deal with the problem. Thus one could, for example, assign error recovery code to a variable as a kind of third value⁹,¹⁰. On the other hand, a domain type can consist of a comprehensive range of validation checks, and each check might call for a different recovery procedure. This would argue for relating recovery procedures and domain types closely together. It seems likely that the ideal solution would involve being able to associate different recovery modules with different parts of the domain typing statement. (As recovery can involve considerable investigation into the nature of the false value so as to decide on what action is appropriate, say to support a user-friendly data input facility or to provide very reliable code, recovery code could be a major portion of the system, and as such would be better modularized.) This would require a suitable exception trapping and handling facility, with a 'Throw and Catch' type structured design¹⁴,¹⁵. Bearing in mind that one would want to use it for program components other than domain data typing, the project did not attempt to resolve this question.

Conclusions

A domain data typing facility has been described which provides a coherent replacement for conventional data typing and the inevitable concomitant data validation code in real-life programs. Because it uses the computational part of the programming language to define the domain data types:

• the same philosophy is applied throughout any program;

• there is a single, simple addition to the programming language, not a variety of constructs;

• the only limit to the power of domain types is that inherent in the computational component of the language to manipulate data.

Domain data types consist of purely application-oriented values. The implication of this is that the physical storage of values should be handled by the language compiler/interpreter or as a separate issue by the programmer.

In terms of program design using domain data types, there can be a better separation and co-ordination of concerns, with everything to do with the enforcement of valid data separated from the use of that data to achieve the program's purpose, particularly if domain data typing is accompanied by a suitable exception or interrupt handling facility.

Although APL has been used to develop, implement and test domain data typing, there is no reason why domain data typing should not be incorporated into other programming languages. It may well look somewhat different in detail, because of the design principle 'the philosophy, design and implementation of the data typing extension should be consistent with those of the host programming language'. Thus, among other things, differences might arise due to:



• the language being compiled rather than interpreted, with all the consequent differences in constraints that this implies compared to an interpreted language like APL;

• the availability of TRUE and FALSE as Boolean values, rather than having to use zero and one;

• the nature of function/sub-routine declarations;

• the ability of a function/sub-routine to have more than two parameters and one output;

• the nature of arrays, particularly whether they can be dynamic or are constrained to be static;

• whether the interactive nature of type retrieval would cause problems;

• the nature of the TYPE ERROR exception/interrupt facility.

A further reason for pursuing the idea of domain data types is that they can form a correspondence at the program code level of pre- and post-conditions in formal methods, and hence facilitate a smooth transition from a detailed design in a specification language to the design of the program code.

In formal notations like Z Schema Calculus¹⁶ and VDM¹⁷, a formal model of the required system is defined. These use pre-conditions to state what predicate must be true before operations commence, and post-conditions to state what must be true after they finish. Mathematical sets form the basis of the logic used. By comparison, the domain data types of function arguments together comprise the pre-conditions of functions, and the domain data types of function results are the functions' post-conditions. The domain data type of a variable is its pre-condition for assignment (and its post-condition, since no processing occurs in a variable). Hence the use of domain data types would facilitate a transition (including the use of proofs) between the specification and program code design. Baber's approach¹⁸ of applying pre- and post-conditions at the level of individual program statements illustrates how the end result of formal design might appear at the program level. It can be seen that domain data types would be supportive of this, although their use there would be very different in style from the use of conventional data types.

Finally, the relational database ideas of domains and referential integrity relate to domain data typing. The use of the latter in an application program would help reduce the conceptual mismatch between the program and any relational database it utilized.

Acknowledgements

It is a pleasure to make the following acknowledgements:

• my research student Hamid Gharib (now of British Telecom) who carried out a PhD project on this subject;

• the University of Northumbria at Newcastle for financially supporting this research;

• John Scholes (Dyadic Systems Ltd) and Paul Barnetson (IBM UK) whose critiques and suggestions considerably improved the original ideas;

• Dyadic Systems Ltd for the source code of an APL interpreter to use as a basis for implementation, and their technical support through John Scholes and Geoff Streeter in carrying out the implementation;

• colleagues who have offered their critique of drafts of this paper;

• the two anonymous referees whose helpful comments enabled the clarity of the paper to be improved.

References

1 Date, C J 'What is a domain?' in Relational database writings 1985-1989 Addison-Wesley (1990)

2 House, R T 'A proposal for an extended form of type checking of expressions' Computer J. Vol 26 No 4 (1983)

3 Harrison, W J A programmer's guide to COBOL Chapter 3, Van Nostrand Reinhold (1980)

4 Fischer, A E and Grodzinsky, F S The anatomy of programming languages Chapter 5, Prentice-Hall (1993)

5 Cardelli, L and Wegner, P 'On understanding types, data abstraction and polymorphism' Comput. Surv. Vol 17 No 4 (December 1985)

6 Pyle, J C The ADA programming language (2nd edn) Chapters 2 and 12, Prentice-Hall (1985)

7 Weinberg, G M The psychology of computer programming Chapters 11 and 12, Van Nostrand Reinhold (1971)

8 Snyder, A 'Encapsulation and inheritance in object oriented programming languages' OOPSLA Proc., Sigplan Notices Vol 21, No 11 (1986)

9 Gfeller, M 'Parts of arrays--an introduction' APL88 Conf. Proc. (1988) p 166

10 Girardot, J-J 'The APL 90 Project: new directions in APL interpreter technology' APL85 Conf. Proc. (1985) p 12

11 Date, C J An introduction to database systems Vol 1 (5th edn), Chapter 12, Addison-Wesley (1990)

12 Date, C J An introduction to database systems Vol 1 (5th edn), Chapter 13, Addison-Wesley (1990)

13 Paulson, L C ML for Working Programmers Chapter 2, Cambridge University Press.

14 Barnes, J G P Programming in ADA (3rd edn) Chapter 10, Addison-Wesley (1989)

15 Terribile, M A Practical C++ Chapter 9, McGraw-Hill (1994)

16 Wordsworth, J B Software development with Z Addison-Wesley (1992)

17 Jones, C B Systematic software development using VDM (2nd edn) Prentice-Hall (1990)

18 Baber, R L Error-free software: know-how and know-why of program correctness John Wiley (1991)

