Generation of Executable Code from Source Program Files


Source Program Files

"Source" program files contain text (characters) which describe the data elements and the processing required to perform some particular information processing task.

The form that this textual description takes depends upon the specific "language" being used. In general such "languages" can be divided into two major catagories: "high-level" languages and "low-level" (or "assembler") languages. In either case, the textual description must be converted, from text (characters), into the binary patterns actually used by the computer's internal circuits to specify and control the activities required. A "compiler" is a program which translates "high-level" source program files into these binary patterns (often called "machine-level" code; a "compiler" translates single "high-level" statements into several (often, many) "machine-level" instructions. An "assembler" is a similar, but much simpler, translating program which translates each "low-level" instruction into a single "machine-level" instruction; each "machine-level" action is represented by a single, simple "low-level" textual code, called a "mnemonic" code. Mnemonic coding provides a reasonable way of controlling the computer at the machine level, although it would be a tedious method for developing very large programs or systems.

Understanding the "assembler" process is necessary as a starting point for understanding the translation process of higher level languages (using "compilers") into instructions the computer can process,

In addition to the use of mnemonic codes to represent machine-level actions, assembler (or mnemonic) coding typically uses symbolic labels to represent addresses (including data addresses) which at the machine-level are identified as (numeric) bit patterns. When a specific memory location is required for use in an instruction, a symbolic reference to the corresponding label is used. A label can only correspond to a single (low-level) memory address, however, there may be many symbolic references to this label.

Translation of "mnemonic code" using symbolic addresses into machine-level instruction requires that the mnemonic (source) code file be read twice:
- once to determine the memory address to associate with each symbolic reference, and
- a second time to replace sybolic references with the associated addresses (and translate operation codes)

Steps in Assembler to Machine Language Translation



Modularization of Program Development

Even once mnemonic code (or high level language code) has been translated into machine level instructions, there is generally a lot of work that may need to be done before we have a working program.

Most programs are developed in a "modular" fashion and are composed of routines (or "subroutines") which have been assembled or compiled into (separate) independent "object" files. Some of these subroutine object files may have been developed as integral components of the specific program being developed, but others may be "general purpose" routines (prehaps even supplied with the "assembler" or "compiler") which may be stored in composite library file or object subroutines.

All these files need to be combined into one complete program. Addresses will not match where they were assumed to belong by the "assembler" (or "compiler") since there would be no way for these programs to tell the size of other routines which would need to appear in memory locations ahead of any specific subroutine. All such addresses, therefore, will need to be modified to compensate for where the routines have actually been loaded into real memory.

In order to more fully explore this process, we will first need to consider how the "Little Man" Computer could be expanded to permit handling of "subroutines".

Subroutine Management

The original "Little Man" Computer posed some significant restrictions in implementing an "extension" which would permit management of subroutines. Specificly:

  1. Only one decimal digit (0) was not used by the original "Little Man" as a machine-level "operation code". This meant that, while it was possible to add an new instruction to "call" a subroutine, there were no numeric code values left to add a "return" instruction.
  2. Establishing a method for saving the "return address" (the location where control should resume after completion of the subroutine) is difficult. The "Little Man" Computer supplies only a single "working" register, the Accumulator, and no easy way to use the contents of this register as an address. Furthermore, if the Accumulator were used to hold the "return address" when a call were made to a subroutine, this would elimate the possibility of using the Accumulator to pass values to or return values from the subroutine.
Implementation Method:
The method selected for the extended LMC (called the "sonOfLMC") is to use the numeric code 0xy as the CALL xy instruction (in a method very similar to the use of 9xy as the JMP xy instruction). The difference between the CALL and the is that the CALL first stores a copy of the numeric code for a JMP to the location physically after the CALL (that is to the address in the Counter) in the memory location which immediately precedes the location being CALL'ed / JMP'ed to. Each subroutine must be coded with a mailbox which is left unused immediately before the first instruction to be executed in the subroutine. When a subroutine has completed its task and wishes to return to the "calling" routine, it must JMP to the mailbox immediately preceding its first instruction (that is to the mailbox containing another JMP which will cause control to return to the statement physically following the original CALL.

The method for passing parameter values to a subroutine is left unspecified by the structure of this extended version of the LMC. Parameter passing is a protocol to be decided upon between the two routines involved, but, in general, information sharing (passing argument values to the subroutine and returning result values to the calling routine) is normally achieved through the use of "global" labelled memory locations. This "normal" method assumes that, if separately assembled (or compiled modules (routines) are used, there will be some method for identifying shared, labelled memory locations.

Separate Module Assembly (to Object Files)

As has been already noted, modules (or "subroutines") are generally assembled (or compiled) from "source" files into separate "object" files. This results in several problems:
  1. Calls to External Routines - Calls to routines not included with the source can not have their actual memory addresses filled in since those addresses will not be known at the time the translation is done.
  2. Variable Address References - Addresses of instructions or data which were translated based on the assumption that the routine would be loaded starting at a specific address, will be invalid (and will need to be modified) if the routine is loaded at a different address.
  3. Modifications to Object File for Calling Modules - The "object" file for routines which call other (external) routines must include the name of the "called" routine and the location within the "calling" routine where the "called" routine is called.
  4. Modifications to Object File for Called Modules - "Called" routines (as a result of not being the "main" routine) can not rely on any fixed address references as being valid; the "object" file must contain a table of all locations containing addresses that must be "reloacted" to reflect the actual load address used for the routine.
As a result of these problems, "object" files (typically) contain three tables, in addition to the numeric code translations:
- a public label table containing a list of all labels defined within the module (and their corresponding mailbox addresses) which should be made available to other modules.
- an external reference table containing a list of all mailbox addresses which make references to "public" labels defined in other modules (plus the actual label being referenced).
- a relocation address table containing a list of all the addresses containing instructions which include address references to addresses within the module (based on the false assumption that the module would be loaded starting at address 00); all such address references will need to be adjusted to reflect the "true" address values, once the "true" starting address of the module has been established.

Example:
Suppose we want a simple program which outputs pairs of numbers; the first number of each pair will be a "count" value starting at 001 and going up to 010; the second number of each pair will be double the value of the first number. In order to allow the user time to read the numbers output, the program will pause after each output for a certain amount of time (performing some time wasting loop).

Suppose this has been coded as 3 separate "source" files (using mnemonic assembler code):
Mnemonic Code Object Code
// main program
Repeat  LDA Count
        OUT
        CALL Pause
        LDA  Count
        STA  Arg
        CALL Dble
        LDA  Count
        ADD  One
        STA  Count
        SUB  Max
        SKP
        JMP  Repeat
        HLT
Count   DAT  000
One     DAT  001
Max     DAT  010
Pause   EXT
Arg     EXT
Dble    EXT

00: 113
01: 600
02: 000
03: 113
04: 201
05: 000
06: 113
07: 314
08: 213
09: 415
10: 802
11: 900
12: 700
13: 000
14: 001
15: 010

public labellocation
(none) 

external referencelocation
Pause02
Arg04
Dble05

relocatable address locations
(none)

// Pause subroutine
        DAT
Pause   LDA  Count
        SUB  One
        STA  Count
        SKZ
        JMP  Pause
        JMP  Pause-1
Count   DAT  000
One     DAT  001
        PUB  Pause

00: 000
01: 107
02: 408
03: 207
04: 801
05: 901
06: 900
07: 000
08: 001

public labellocation
Pause01

external referencelocation
(none) 

relocatable address locations
01, 02, 03, 05, 06

// Dble subroutine
        DAT
Dble    JMP  overArg
Arg     DAT
overArg LDA  Arg
        ADD  Arg
        OUT
        CALL Pause
        JMP  Dble-1
        PUB  Dble
        PUB  Arg
Pause   EXT

00: 000
01: 903
02: 000
03: 102
04: 302
05: 600
06: 000
07: 900

public labellocation
Dble01
Arg02

external referencelocation
Pause06

relocatable address locations
01, 03, 04, 07


Link Processing

Combining multiple "object" files together to form a single "executeable" program (often stored as a file) is the job of a program called a Linker.

    The link process can be viewed as a series of steps:
  1. copy the numeric codes from all object files into sequentially contiguous locations, keeping track of the address used for each routine's "entry point".
  2. adjust the location specifications in the "external reference" & "relocatable address locations" tables by adding the entry point address of the routine to each location specification.
  3. adjust all external reference locations by adding the actual entry point address of the externally referenced routine to the address portion of instruction at that location.
  4. adjust all relocatable address locations by adding the entry point of the corresponding routine (i.e. the routine containing the specific "relocatable address locations" table) to the address operand at each location.
  5. copy the modified numeric code to an "executable" file.

Running through this process with the previously generated object files:

  1. The Linker copies the numeric instruction codes from each of the object files into contiguous (mailbox) memory locations and builds a table indicating the actual location where each subroutine was loaded, plus a table of public labels with their locations modified by the addition of the beginning load address of the subroutine which contains the definition of each public entry:
    00: 113
    01: 600
    02: 000
    03: 113
    04: 201
    05: 000
    06: 113
    07: 314
    08: 213
    09: 415
    
    10: 802
    11: 900
    12: 700
    13: 000
    14: 001
    15: 010
    16: 000
    17: 107
    18: 408
    19: 207
    
    20: 801
    21: 901
    22: 900
    23: 000
    24: 001
    25: 000
    26: 903
    27: 000
    28: 102
    29: 302
    
    30: 600
    31: 000
    32: 900
    

    routineloaded at:
    main routine00
    Pause subroutine16
    Dble subroutine25
    public labellocation
    Pause01 17
    Dble01 26
    Arg02 27
    Dble01 26

  2. using the external reference table from the main object file:
    external referencelocation
    Pause02
    Arg04
    Dble05

    the instructions in (mailbox) addresses 02, 04, and 05 are changed by adding the location address from the (modified) public label table to the current contents:
    02: 000 017
    03: 113
    04: 200 227
    05: 000 026
    
    CALL Pause
    LDA  Count
    STO  Arg
    CALL Dble
    

  3. Since the Pause object file contains no external references, no modifications are required to this portion of the code to "fill in" the addresses for external references

  4. Similarly the external reference table from the Dble object file is used to modify the contents of the location within the Dble subroutine which CALLs the Pause subroutine. Note that the locations in the external reference table must be modified (by the addition of the Dble subroutine's load address) before they (actually, in this case, "it") can be used.
    external referencelocation
    Pause06 31

    Since the (modified) public label table identifies mailbox address 17 as corresponding to the public label "Pause", then the contents of mailbox 31 (above) must be modified by the addition of 17.

    31: 000 017
    
    CALL Pause
    

  5. Finally within the (mailbox) addresses for each subroutine, the instructions at the locations identified by the relocatable address locations table from each object file, must be adjusted by the actual location where that specific subroutine was loaded.
    Note that there are two address adjustments required for each subroutine:
    1. The relocatable address locations table must be adjusted to compensate for the actual load address of the subroutine:
      For the Pause subroutine (loaded starting at 16)
      relocatable address locations
      01 17,   02 18,   03 19,   05 21,   06 22  

      For the Dble subroutine (loaded starting at 25)
      relocatable address locations
      01 26,   03 28,   04 29,   07 32  

    2. The contents of memory at the addresses specified in the (modified) relocation address locations table must be modified to compensate for the actual load address of the subroutine:

      For the range of mailboxes of the Pause subroutine
      values from the
      Pause object file
      modified instructions
      base on load at 16
      16: 000
      17: 107
      18: 408
      19: 207
      20: 801
      21: 901
      22: 900
      23: 000
      24: 001
      
      16: 000
      17: 123
      18: 424
      19: 223
      20: 801
      21: 917
      22: 916
      23: 000
      24: 001
      

      and, similarly, the relocatable addresses within the Dble subroutine range will be updated.

      For the range of mailboxes of the Dble subroutine
      values from the
      Dble object file
      modified instructions
      base on load at 25
      25: 000
      26: 903
      27: 000
      28: 102
      29: 302
      30: 600
      31: 000
      32: 900
      
      25: 000
      26: 928
      27: 000
      28: 127
      29: 327
      30: 600
      31: 000
      32: 925
      

    At this point all address operands should be correct and the combined block of code properly executable.



Link Process Summary



Load and Execute Operating System Functions