22C:18, Lecture 23, Fall 1996

Douglas W. Jones
University of Iowa Department of Computer Science

Object Oriented Programming

In modern high-level languages that support an object oriented programming model, programmers are discouraged from writing in terms of pointers to records and instead encouraged to declare classes, class instances and methods. Consider the following binary tree node, declared in Simula 67, the original object oriented programming language:

	class TREE;
	begin
	    ref (TREE) left;
	    ref (TREE) right;
	    int value;

	    procedure traverse;
	    begin
		if not(left = none) then left.traverse;
		print(value);
		if not(right = none) then right.traverse;
	    end traverse;

	    procedure setval(v)
	    int v;
	    begin
		value := v;
	    end setval;

	    comment - initialization code;
	    left :- none;
	    right :- none;
	    v := 0;
	end TREE;
Note that in Simula 67, class declarations resemble record (or structure) declarations, with fields of the record becoming instance variables of the class, and with procedures (or functions) declared inside the record becoming methods of the class. Executable statements in the class declaration make up the code of the initializer for instances of that class. The operator := is assignment, while :- is the pointer assignment operation (this may be the biggest syntactic flaw in Simula 67, but it is no worse than == vs = in the C family of languages).

Given the above declarations, the following bit of code would be legal:

	ref (TREE) a;
	comment - a is an uninitialized pointer to a tree;
	a :- new TREE;
	comment - allocate and initialize an instance;
	a.setval(5);
	a.traverse;
Calls to methods of a class instance are written as if they were references to components of a record. Thus, a.setval(5) calls the setval method of instance a of class TREE.

De-objectification

The problem we face is how to translate this kind of code to machine code! For simple object oriented programming, the answer is straightforward. Each class is just a record, and the methods and initializers of that class are procedures and functions. As with other high level programming problems, the first step to reducing object oriented code to assembly language has nothing to do with assembly language, but instead, it involves rewriting the high level language program in that languagei, or at least, in a language at a similar level, to avoid the features that don't translate easily to assembly language.

So, the first step is to "de-objectify" the object oriented code, converting it to simple use of procedures and functions. It is convenient, but not strictly necessary, to adopt a systematic naming scheme for the resulting new types and routines. The basic rules we will follow are:

  1. Remove all methods and initializers from the class, converting the class to a pure data object, equivalent to a Pascal record or C struct. If the class name was C, a systematic naming rule would be to name the record or structure type something like Crep, for C representation.

  2. Convert the initializer for the class to a function that returns an instance of the record or struct. The body of this routine begins with a call to malloc (in C) or new (in Pascal), then initializes the data part of the class. A good name for the initializer function for instances of class C would be Cinit.

  3. Convert the methods of the class into procedures and functions. Each such procedure or function has the same return-type as the original method, and each takes the same parameters as the original method, plus one new parameter, the class instance that the method should be applied to. If class C has method M, the name for the corresponding procedure or function might be CM.

  4. Convert all references to methods of a class to procedure or function calls, passing the instance of the representation type for the class as a parameter to the procedure or function that handles that method. If I is an instance of class C, and M is a method of class C, the method reference I.M(x) would be rewritten CM(I,x).
Applying the first rule above to the Simula 67 code given earlier gives the following result:
	class TREErep;
	begin
	    ref (TREErep) left;
	    ref (TREErep) right;
	    int value;
        end TREErep;
This declares a structure that is exactly analogous to a Pascal record or C struct, and that conveys no more information about the use of this structure than the corresponding C or Pascal declarations would. The remaining rules above allow us to replace the methods of the class with the following procedures and functions:
	procedure TREEtraverse(t)
	ref (TREErep) t;
	begin
	    if not(t.left = none) then TREEtraverse(t.left);
	    print(value);
	    if not(t.right = none) then TREEtraverse(t.right);
	end TREEtraverse;

	procedure TREEsetval(t,v)
	ref (TREErep) t;
	int v;
	begin
	    t.value := v;
	end TREEsetval;

        ref (TREErep) procedure TREEinit;
	begin
	    ref (TREErep) n;
	    n :- new TREErep;
	    n.left :- none;
	    n.right :- none;
	    n.v := 0;
	    TREEinit :- n;
	end TREEinit;
Note that in Simula 67, as in Pascal, values are returned from functions by assignment to the function name. Thus, the final statement above, TREEinit:-n, would be written as return(n) in C or FORTRAN.

Given the above rewritten declarations, the example bits of code used to illustrate references to the original Simula 67 fragment would be rewritten as follows:

	ref (TREErep) a;
	comment - a is an uninitialized pointer to a tree;
	a :- TREEinit;
	comment - allocate and initialize an instance;
	TREEsetval(a,5);
	TREEtraverse(a);

The rules for deobjectification given here are completely sufficient for converting object oriented programs into conventional programs so long as the different classes involved are unrelated. In such cases, there is nothing special about object orientation. In fact, some say that the object structures illustrated above are nothing but syntactic sugar, since there is nothing that can be written in a language incorporating simple classes and methods that cannot be rewritten in terms of procedures, functions, records and pointers to records.

More generally, the term syntactic sugar refers to any language feature that is there to help make programs easier to read but contributes nothing in particular to the expressive power of the programming language. Clearly, comments are syntactic sugar, but so are for loops in C and Pascal! After all, there is essentially nothing that can be programmed with a for loop that can't also be programmed with a while and a few auxiliary statements.

Class Hierarchies

Class hierarchies are not mere syntactic sugar! Programs containing class hierarchies cannot be trivially converted to conventional procedural programs; For example, such programs cannot be converted trivially to Pascal without the need to introduce significant extra code! Such programs can, however, be converted to both C and assembly language without adding executable code.

To demonstrate this, we need an example. Consider the problem of representing parse trees of expressions. The parse tree of an expression is a tree with leaf nodes for each operand and internal nodes for each operator, where the tree structure is used in liu of parentheses to indicate which operators are applied in what order. This is illustrated by the following example:

        An expression:  (1 + 2 * 3) / 4

        The parse tree:      (/)
                             / \
                           (+)  4
                           / \
                          1  (*)
                             / \
                            2   3
There are clearly two kinds of nodes in a parse tree, leaf nodes, each with a value, and internal nodes, each of which carries an operator, a left pointer, and a right pointer. Both external and internal nodes are kinds of parse trees, so we would like to define them as kinds of parse tree nodes. Thus, we say that parse trees nodes are an example of a polymorphic (many shaped) type.

Furthermore, we would like to be able to perform operations on parse trees; specifically, we would like to be able to print the expression corresponding to a tree, and we would like to be able to evaluate the expression corresponding to a tree. To print a leaf node, we output the value of the node. To print an internal node, we print a left paren, print the left subtree, print the operator, and print the right subtree. Thus, depending on the node type, the print operations will be very different.

To handle polymorphic types, Simula 67 and other object oriented programming languages allow classes to be derived from other classes. We start by declaring the base class:

	class NODE;
	begin
	    virtual: procedure print;
	    virtual: integer procedure evaluate;
	end NODE;
The keyword virtual before a procedure declaration for a method of a class says that no definition for that method will be given, but that such a method must be given if the class is ever to be used. Thus above code says that all nodes have methods allowing them to print themselves and a method allowing them to evaluate themselves. This declaration does not provide any code for these methods, but just says that all instances of nodes must provide such code. We must introduce new classes that expand on NODE to actually define these methods. The two we need are given below:
	NODE class LEAF;
	begin
	    int value;

	    procedure print;
	    begin
		write(value);
	    end print;

	    procedure evaluate;
	    begin
		evaluate := value;
	    end evaluate;

	end LEAF;

	NODE class INTERNAL
	begin
	    ref (NODE) left;
	    char op;
	    ref (NODE) right;

	    procedure print;
	    begin
		write('(');
		left.print;
		write(op);
		right.print;
		write(')');
	    end print;

	    procedure evaluate;
	    begin
		if op = '+' then
		    evaluate := left.evaluate + right.evaluate
		else if op = '-';
		    evaluate := left.evaluate - right.evaluate
		else if op = '*';
		    evaluate := left.evaluate * right.evaluate
		else if op = '/';
		    evaluate := left.evaluate * right.evaluate
		else error;
	    end evaluate;

	end INTERNAL;
Given such a set of declarations, how can they be translated to assembly language? Again, we will seek a high level language solution as an intermediate step, but first, we must seek a conceptual solution.

Consider the code fragment right.evaluate used in the expressions in the second version of evaluate above. The name right refers to a variable which is a pointer to a NODE, but there are two kinds of node, INTERNAL and EXTERNAL. This means that, in a de-objectified version of this code, right.evaluate must determine which of two procedures is to be called. In Pascal, perhaps the best that can be imagined is replacing the call to right.evaluate with something like this:

	if right.kind = internal then
	    INTERNALevaluate(right)
	else
	    EXTERNALevaluate(right);
This assumes that NODErep is a variant record (a record with multiple representations) with an added tag field, kind, used to determine which representation, is being used. The problem is, this makes something that is easy to write into something expensive to implement.

In C or assembly language, another option is open. We can replace each "virtual method" of a class with a pointer to the procedure being used. Thus, we would declare three structures:

	type struct NODErep {
		void (*print)(NODErep *);
		int (*evaluate)(NODErep *);
	} NODErep;

	type struct LEAFrep {
		/* this is a subclass of NODErep */
		void (*print)(NODErep *);
		int (*evaluate)(NODErep *);

		/* added fields */
		int value;
	} LEAFrep;

	type struct INTERNALrep {
		/* this is a subclass of NODErep */
		void (*print)(NODErep *);
		int (*evaluate)(NODErep *);

		/* added fields */
		NODErep * left;
		char op;
		NODErep * right;
	} INTERNALrep;

	type union NODErep {
		LEAFrep;
		INTERNALrep;
	} NODErep;
Note, here, that we have been very careful to replicate all fields of NODErep exactly in the structures used to represent each subclass! The only fields common to all subclasses are the pointers to the print and evaluate functions, each of which takes a single instance as a parameter.

Given these routines, we can apply the same de-objectification rules discussed earlier, giving the following procedures:

	void LEAFprint(NODErep * self)
	{
		printf( "%d", ((LEAFrep *)self)->value );
	}

	int LEAFevaluate(NODErep * self)
	{
		return ((LEAFrep *)self)->value;
	}

	NODErep * LEAFinit()
	{	
		NODErep * self = (NODErep *)malloc(sizeof(LEAFrep));
		self->print = LEAFprint;
		self->evaluate = LEAFevaluate;
		return self;
	}

	void INTERNALprint(NODErep * self)
	{
		printf( "(" );
		(*((INTERNALrep *)self)->left->print)
			(((INTERNALrep *)self)->left));
		printf( "%c", ((INTERNALrep *)self)->op );
		(*((INTERNALrep *)self)->right->print)
			(((INTERNALrep *)self)->right));
		printf( ")" );
	}

	int INTERNALevaluate(NODErep * self)
	{
		int leftval = (*((INTERNALrep *)self)->left->evaluate)
			(((INTERNALrep *)self)->left));
		int rightval = (*((INTERNALrep *)self)->right->evaluate)
			(((INTERNALrep *)self)->right));
		return leftval + rightval;
	}

	NODErep * INTERNALinit()
	{	
		NODErep * self = (NODErep *)malloc(sizeof(LEAFrep));
		self->print = INTERNALprint;
		self->evaluate = INTERNALevaluate;
		return self;
	}
The worst thing about translations of object oriented code to C is the extensive use of casting needed! Casting in C is indicated by the prefix of a parenthesized type name. For example, the (NODErep *) used as a prefix to the call to malloc in INTERNALinit forces the pointer returned to be treated as a pointer to a structure of type NODErep, conforming to the type of the variable self and to the type to be returned by the function.

Formally, such casting is needed everywhere that one of our pointers changes type, and the type changes every time we move from, for example, a parameter of type NODErep, as required by the procedures type declarations, to a value of type LEAFrep or INTERNALrep, as needed to reference the fields of some specific item.

We can make the code slightly easier to read by introducing an extra variable. For example, we can rewrite the evaluate procedure for internal node evaluation as follows:

	int INTERNALevaluate(NODErep * self)
	{
		INTERNALrep * myself = (INTERNALrep *) self;
		int leftval = (*myself->left->evaluate)(myself->left);
		int rightval = (*myself->right->evaluate)(myself->right);
		return leftval + rightval;
	}
In assembly language, all of the casts dissapear, and there is no need for an extra variable to improve readability. The only thing we need to worry about is how to initialize a field of a record to point to a procedure and how to follow that pointer to get to a particular procedure.

Consider a procedure with the label INTERNALprint. If R3 is a pointer to a record with a field called print that should point to INTERNALprint, we can make it do so as follows:

	LEA	R5,INTERNALprint
	STORE	R5,R3,print
Given that R3 points to a record containing a field called print that points to a procedure, that procedure can be called as follows:
	LOAD	R1,R4,print
	JSR	R1,R1
If the print procedure is intended as a method of a class, and if the record pointed to by R3 is the representation of an object of that class, the pointer to the record must be passed as a parameter, so using R3 is an excellent choice!

Historical note: When Bjorn Stroustrup set out to design the C++ programming language, his initial idea was to write a set of C preprocessor macros (#define statements) that would add Simula 67 notions of object oriented programming to the C programming language. Much of the C++ notation is still compatable with this idea, but the C preprocessor proved to be insufficient for this job, so he moved on to using more sophisticated tools. Nonetheless, Stroustrup's implementation of C++, still used by AT&T, is merely a translator from C++ to C, and it de-objectifies programs using methods very similar to those discussed above.