APCS introduction

(ARM Procedure Call Standard)

Introduction

APCS, or ARM Procedure Call Standard, provides a mechanism for writing tightly defined routines which may be interwoven with other routines. The most notable point about this is that it does not matter where these routines come from. Some may be compiled C, some compiled Pascal, and yet others written in assembler.

The APCS defines:

restrictions on the use of registers
conventions for using the stack
passing/returning arguments between function calls
the format of a stack-based structure which may be 'backtraced' to provide a list of functions (and parameters given) from the failure point backwards to the program entry

The APCS is not a single given standard, but is a collection of standards which are similar but differ in certain situations. For example, APCS-R (used on 26 bit versions of RISC OS) says that flags set on function entry should be reset on function exit. Under the 32 bit definition, it is not always possible to know the entry flags (there is no USR_CPSR) so you do not need to restore them. As you may expect, there is no compatibility between the versions. Code which expects the flags to be restored is likely to misbehave if they are not restored...
The newest versions of SharedCLibrary (v5.43 etc) can recognise and work with APCS-R (26 bit, restores flags), and APCS-3/32 (32 bit, doesn't restore flags). So long as the version of APCS is the same within your entire application, it can differ between applications on older machines.
For an example, David Pilling's OvationPro and my OvHTML are both compiled to be 26/32 neutral using the newer APCS specification. Meanwhile, Edit v1.54 sits on the iconbar. It is compiled to APCS-R.
The situation is slightly different on the Iyonix. There, APCS-R programs simply won't work (at all, no way José) because the processor does not support the required things. The way we get older programs working on this system is to set up a 'fake' environment that looks a lot like the older machines, patch the application so that certain things are done instead of the flag restoring and mucking with R14 to set/clear flags (but, to the application, it looks like it really happened), and then run the program in this safe environment. The software that does this is called Aemulor.

If you are developing an ARM based system (from scratch), then there is no requirement to implement APCS. It is recommended, as it is not difficult to implement, and it allows for a variety of benefits.
But, the here and now. APCS must be used if you are writing assembler code to hook into compiled C. The compiler will expect certain conditions, and these must be met in your add-in code. A good example is that APCS defines that a1 to a4 may be corrupted, but v1 to v6 must be preserved.
By now, I'm sure you are scratching your head and saying 'a-what? v-what?'. So here is the APCS-R register definition...

Registers

APCS defines the registers using different names to our usual R0 to R14. With the power of the assembler pre-processor, you can define R0 etc, but it is just as well to learn the APCS names in case you are modifying code written by others.

Register names
`Reg #`	`APCS`	Meaning
`R0`	`a1`	Working registers
`R1`	`a2`	"
`R2`	`a3`	"
`R3`	`a4`	"
`R4`	`v1`	Must be preserved
`R5`	`v2`	"
`R6`	`v3`	"
`R7`	`v4`	"
`R8`	`v5`	"
`R9`	`v6`	"
`R10`	`sl`	Stack Limit
`R11`	`fp`	Frame Pointer
`R12`	`ip`
`R13`	`sp`	Stack Pointer
`R14`	`lr`	Link Register
`R15`	`pc`	Program Counter

These names are not defined as standard in Acorn's objasm (version 2.00), though later versions of objasm, and other assemblers (such as Nick Roberts' ASM) define them for you. Some assemblers may use a different command; refer to your documentation.
To define a register name, you typically use the RN directive, at the very start of your program:

a1     RN      0
a2     RN      1
a3     RN      2
    ...etc...

r13    RN      13
sp     RN      13
r14    RN      14
lr     RN      r14
pc     RN      15

That example shows us two important things:

That registers can be multiply defined - you can have both 'r13' and 'sp'.
That registers can be defined from previously defined registers - 'lr' was defined from the setting of 'r14'.
(this is correct for objasm, other assemblers may not do this)

Design criteria

Function calling should be fast, small, and easy to optimise (by compilers)
It should be able to cope with multiple stacks
It should be easy to write re-entrant and relocatable code; primarily with the writable data separated from the code
But above all, it should be simple so that assembler programmers may use its facilities, and debuggers may be able to trace through the program fairly easily

Conformity

A part of a program which conforms to APCS while making a call to an external function is known as "conforming".
A program which conforms to the APCS at all times during its execution (typically, a program generated by a compiler) is known as "strictly conforming".
The protocol would indicate that, provided you observe the correct entry and exit parameters, you may do whatever you need within the confines of your own function, and still remain conformant. This, sometimes, is necessary, such as when writing SWI veneers that utilise a large number of registers for the actual SWI call.

The stack

The stack is a linked list of 'frames' which are linked through what is known as a 'backtrace structure'. This structure is stored at the high end of each frame.
Each block of the stack is allocated in descending address order. The register sp will always point to the lowest used address in the most recent frame. This fits in with the tradition of a fully descending stack.
In APCS-R, the register sl refers to a stack limit, below which you cannot decrement sp.
The memory that resides between the current stack point, and the current stack limit, should contain nothing that is to be relied upon as another APCS function, when called, may well set up a stack block for itself.

There may be multiple stack chunks. These may be located at any address in memory; there is no convention here. This, typically, would be used to provide multiple stacks for the same code which is executing in a re-entrant manner; an analogy here is FileCore which provides its services to the currently available FileCore filing systems (ADFS, RAMFS, IDEFS, SCSIFS, etc) by simply setting up 'state' information and calling the same pieces of code as is required.

Backtrace

The register fp (frame pointer) should be zero, or it should point to the last in a list of stack backtrace structures which will provide a means of 'unwinding' the program to trace backwards through the functions called.

The structure is:

   save code pointer       [fp]        fp points here
   return link value       [fp, #-4]
   return sp value         [fp, #-8]
   return fp value         [fp, #-12]  points to next structure
   [saved v7]
   [saved v6]
   [saved v5]
   [saved v4]
   [saved v3]
   [saved v2]
   [saved v1]
   [saved a4]
   [saved a3]
   [saved a2]
   [saved a1]
   [saved f7]                          three words
   [saved f6]                          three words
   [saved f5]                          three words
   [saved f4]                          three words

The structure contains between four and twenty-seven words, those in square brackets being optional values. The only thing that can be said is that if they do exist, then they exist in the given order (ie, saved f4 will be lower in memory than saved a3, but a2-f5 might not exist).
The floating point values are stored in 'internal format' and are three words (12 bytes).
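
The arithmetic behind "between four and twenty-seven words" can be checked with a small C sketch. It assumes the layout listed above: four mandatory words, up to eleven optional integer registers (v7 to v1, a4 to a1), and up to four optional FP registers of three words each. The function name backtrace_words is made up for this illustration.

```c
#include <assert.h>

/* Hypothetical helper: size, in 32-bit words, of one backtrace structure.
   int_regs = how many of v7..v1, a4..a1 were saved (0 to 11)
   fp_regs  = how many of f7..f4 were saved (0 to 4), three words each */
int backtrace_words(int int_regs, int fp_regs)
{
    assert(int_regs >= 0 && int_regs <= 11);
    assert(fp_regs >= 0 && fp_regs <= 4);
    return 4 + int_regs + 3 * fp_regs;
}
```

With nothing optional saved this gives 4, and with everything saved it gives 4 + 11 + 12 = 27.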

The fp register points to the stack backtrace structure for the currently executing function. The return fp value should be zero, or a pointer to a stack backtrace structure created by the function which called the current function. The return fp value in this structure is a pointer to the stack backtrace structure for the function that called the function that called the current function; and so on back until the first function.

The return link value, return sp value, and return fp value are reloaded into pc, sp, and fp when the function exits.

  #include <stdio.h>

  void one(void);
  void two(void);
  void zero(void);

  int main(void)
  {
     one();
     return 0;
  }

  void one(void)
  {
     zero();
     two();
     return;
  }

  void two(void)
  {
     printf("main...one...two\n");
     return;
  }

  void zero(void)
  {
     return;
  }

  At the point of printing a message on the screen,
  our example APCS backtrace structure would be:

      fp ----> two_structure
               return link
               return sp
               return fp  ----> one_structure
               ...              return link
                                return sp
                                return fp  ----> main_structure
                                ...              return link
                                                 return sp
                                                 return fp  ----> 0
                                                 ...

Therefore, we can examine fp and see the structure for function 'two', which would point to the structure for function 'one', which would point to the structure for 'main', which points to zero to end. In this way, we can wind our way backward through the program and determine how we came to be at our current crash point.
It is worth pointing out the 'zero' function, as that has executed and been done with by the time we do our printing, so it was in the backtrace structure once, but it is no longer.
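
To make the unwinding concrete, here is a minimal sketch in C of walking the idealised chain drawn above. It is not taken from any real debugger: memory is modelled as a small array of words, and the names read_word, walk_backtrace and MEM_WORDS are invented for the example. The only facts it relies on are the offsets given in the structure (return link at [fp, #-4], return fp at [fp, #-12]) and that a return fp of zero ends the list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MEM_WORDS 64u   /* size of the toy memory, in 32-bit words */

/* Read the word at a word-aligned byte address in the toy memory. */
static uint32_t read_word(const uint32_t *mem, uint32_t addr)
{
    assert(addr % 4u == 0 && addr / 4u < MEM_WORDS);
    return mem[addr / 4u];
}

/* Follow the chain of 'return fp' values starting at fp.
   Stores each frame's return link value in links[] (at most max entries)
   and returns the number of frames visited. */
size_t walk_backtrace(const uint32_t *mem, uint32_t fp,
                      uint32_t *links, size_t max)
{
    size_t n = 0;

    while (fp != 0 && n < max) {
        links[n++] = read_word(mem, fp - 4);   /* return link value [fp, #-4]  */
        fp = read_word(mem, fp - 12);          /* return fp value   [fp, #-12] */
    }
    return n;
}
```

Called with fp pointing at the structure for 'two', this would report three frames: 'two', then 'one', then 'main'.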

It is also worth pointing out that an APCS structure as the above is unlikely ever to be generated for the code given. The reason for this is that functions which do not call any other functions do not require full APCS headers.
For your perusal, this is the code generated by Norcroft C v4.00 for the above code...

        AREA |C$$code|, CODE, READONLY

        IMPORT  |__main|
|x$codeseg|
        B       |__main|

        DCB     &6d,&61,&69,&6e
        DCB     &00,&00,&00,&00
        DCD     &ff000008

        IMPORT  |x$stack_overflow|
        EXPORT  one
        EXPORT  main
main
        MOV     ip, sp
        STMFD   sp!, {fp,ip,lr,pc}
        SUB     fp, ip, #4
        CMPS    sp, sl
        BLLT    |x$stack_overflow|
        BL      one
        MOV     a1, #0
        LDMEA   fp, {fp,sp,pc}^

        DCB     &6f,&6e,&65,&00
        DCD     &ff000004

        EXPORT  zero
        EXPORT  two
one
        MOV     ip, sp
        STMFD   sp!, {fp,ip,lr,pc}
        SUB     fp, ip, #4
        CMPS    sp, sl
        BLLT    |x$stack_overflow|
        BL      zero
        LDMEA   fp, {fp,sp,lr}
        B       two

        IMPORT  |_printf|
two
        ADD     a1, pc, #L000060-.-8
        B       |_printf|
L000060
        DCB     &6d,&61,&69,&6e
        DCB     &2e,&2e,&2e,&6f
        DCB     &6e,&65,&2e,&2e
        DCB     &2e,&74,&77,&6f
        DCB     &0a,&00,&00,&00

zero
        MOVS    pc, lr

        AREA |C$$data|

|x$dataseg|

        END

This example code is not 32 bit compliant. However the APCS-32 specification simply states that flags need not be preserved. Thus, remove the '^' on the LDMs, and remove the 'S' from the MOVS in zero. Then the code is pretty much the same as that generated by a 32-bit aware compiler.

The save code pointer points to a location twelve bytes beyond the start of the code which set up that backtrace structure. You can see this in the example. Remember, you will need to strip off the PSR for 26-bit code.
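
As a rough sketch of that calculation in C: the helper below takes a save code pointer and steps back twelve bytes, stripping the PSR first for 26-bit code. The mask assumes the usual 26-bit R15 layout, with the address in bits 2 to 25 and the flags and mode bits around it. The names frame_setup_address and PC_MASK_26 are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* In a 26-bit R15, bits 2..25 hold the address; the rest is PSR. */
#define PC_MASK_26 0x03FFFFFCu

/* Hypothetical helper: given a save code pointer, return the address of
   the code that set up the backtrace structure (twelve bytes earlier).
   is_26bit says whether the PSR bits must be stripped first. */
uint32_t frame_setup_address(uint32_t save_code_pointer, int is_26bit)
{
    uint32_t pc = is_26bit ? (save_code_pointer & PC_MASK_26)
                           : save_code_pointer;

    assert(pc >= 12u);
    return pc - 12u;
}
```

For instance, a 26-bit save code pointer of &60008013 (Z and C set, SVC mode) strips down to &8010, so the structure was set up by code at &8004.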

So now we turn to our function, 'two'. As soon as execution enters 'two':

pc contains the location of the next instruction(s) to be executed, as always
lr contains the value to load into pc to exit (as always). This will also contain the PSR in 26-bit code.
sp points to the current stack chunk limit, or above it. This is the place you can dump temporary data into, registers and the like. Under RISC OS, you have at least 256 bytes with the option to extend it.
fp is either zero, or it points to the most recent part of the backtrace structure.
Function arguments are arranged as described (below).

Arguments

The layout of records, arrays, and the like is not defined by APCS. Thus, a language is free to define how it performs these activities. However making your own implementation is not really in the spirit of APCS as it would not permit code from your compiler to be linked with code from another compiler. Typically, the C language conventions are utilised.

The first four integer arguments (or less, if less!) are loaded into a1 - a4.
The first four floating point arguments (or less, if less) are loaded into f0 - f3.
There does not seem to always be an obvious pattern to this - sometimes when passing FP values from C to assembler, they'll be in Fx and sometimes they'll either be split across ARM registers or stacked.
Anything else (if anything) is stored in memory, pointed to by the words immediately above the value of sp on entry. In other words, the remaining arguments have been pushed onto the stack. It seems, therefore, that optimisation may be made simply by defining functions to receive four or fewer parameters.
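
The integer case can be summarised in a few lines of C. This is only a sketch under simplifying assumptions: every argument is a single word, there are no floating point arguments, and the fifth argument sits at the address held in sp on entry with later ones at ascending addresses. The type arg_location and the function locate_int_arg are invented names.

```c
#include <assert.h>

/* Where a word-sized integer argument lives on function entry. */
typedef struct {
    int in_register;   /* 1: in a1..a4, 0: on the stack                   */
    int reg;           /* 0..3 for a1..a4 when in_register, otherwise -1  */
    int sp_offset;     /* byte offset from the entry sp, otherwise -1     */
} arg_location;

arg_location locate_int_arg(int index)   /* index counts from 0 */
{
    arg_location loc = { 0, -1, -1 };

    assert(index >= 0);
    if (index < 4) {
        loc.in_register = 1;
        loc.reg = index;                 /* a1 is R0, a2 is R1, and so on */
    } else {
        loc.sp_offset = (index - 4) * 4; /* fifth argument is at [sp, #0] */
    }
    return loc;
}
```

So a function taking six integers finds the first four in a1 to a4 and the last two at [sp, #0] and [sp, #4] on entry.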

Leaving the function

The return link value is moved into the program counter to exit the function, and:

If the function returns a value that is a word or less in size, that value is to be present in a1.
If the function returns a floating point value, then it is to be present in f0.
(I think...?)
sp, fp, sl, v1-v6, and f4-f7 shall be restored (if altered) to contain the values that were present on entry.
I have tested corrupting the registers, intentionally, and can report that the results could be the most unexpected and bizarre glitches (often in totally different parts of the program), as well as the expected 'uh-oh!'.
ip, lr, a2-a4, and f1-f3 and those arguments that were stacked may be corrupted.

In 32bit modes, the PSR flags do not need to be preserved across a function call. In 26bit modes they should be, and would be implicitly restored by moving the entirety of lr into pc (MOVS, or LDMFD xxx^).
The N, Z, C, and V must be reloaded from lr, it is not enough to preserve the flags across the function.

APCSs

Globally, there are several versions of APCS (16, in fact). We are, however, only going to concern ourselves with those you may encounter on RISC OS.

APCS-A
This is APCS-Arthur; and was defined in the dark days of Arthur. You may come across it (unlikely, though), or references to it, so it is worth knowing it exists. It has been deprecated and due to differing register definitions (that seem somehow alien to a seasoned RISC OS coder), it should not be used.
It was for Arthur applications running in USR mode.
sl = R13, fp = R10, ip = R11, sp = R12, lr = R14, pc = R15.
The PRM (p4-411) says "Use of r12 as sp, rather than the architecturally more natural r13, is historical and predates both Arthur and RISC OS."
The stack is segmented and is extended on demand.
26-bit program counter.
No passing of floating point arguments in FP registers.
Non-reentrant. Flags must be restored.

APCS-R
This is APCS-RISC OS. It is for (old) RISC OS applications operating in USR mode; or modules/handlers in SVC mode.
sl = R10, fp = R11, ip = R12, sp = R13, lr = R14, pc = R15.
This is the single most common APCS version, as all (older) compiled C programs will have used APCS-R.
Explicit stack limit checking
26-bit program counter.
No passing of floating point arguments in FP registers.
Non-reentrant. Flags must be restored.
It is worth noting that I have seen 'cc' version 5 (26 bit) generate code to put an FP value into an FP register - even though APCS-R says that FP values are not passed in FP registers, so I'm not sure exactly which context caused this to occur.

APCS-U
This is APCS-Unix, used in Acorn's RISCiX. It is for RISCiX applications (USR mode) or the kernel (SVC mode).
sl = R10, fp = R11, ip = R12, sp = R13, lr = R14, pc = R15.
Implicit stack limit checking (with sl)
26-bit program counter.
No passing of floating point arguments in FP registers.
Non-reentrant. Flags must be restored.

APCS-32
This is an extension of APCS-2 (-R and -U) which allows for a 32bit program counter, and for flags to not be restored on exit from a function executing in USR mode.
Other things as for APCS-R.

APCS variants

APCS	PC width	Stack-limit checking	FP arguments	Reentrancy	Notes
APCS-U (RISCiX)	26 bits	Implicit	Not in FP registers	Non-reentrant
APCS-R (older RISC OS software)	26 bits	Explicit	Not in FP registers	Non-reentrant	Flags must be restored
	26 bits	Implicit	Via FP registers	Non-reentrant
	26 bits	Explicit	Via FP registers	Non-reentrant
	26 bits	Implicit	Not in FP registers	Reentrant
	26 bits	Explicit	Not in FP registers	Reentrant
	26 bits	Implicit	Via FP registers	Reentrant
	26 bits	Explicit	Via FP registers	Reentrant
	32 bits	Implicit	Not in FP registers	Non-reentrant
APCS-32 (new RISC OS software)	32 bits	Explicit	Not in FP registers	Non-reentrant	Flags cannot be restored
	32 bits	Implicit	Via FP registers	Non-reentrant
	32 bits	Explicit	Via FP registers	Non-reentrant
	32 bits	Implicit	Not in FP registers	Reentrant
	32 bits	Explicit	Not in FP registers	Reentrant
	32 bits	Implicit	Via FP registers	Reentrant
	32 bits	Explicit	Via FP registers	Reentrant

Creating a stack backtrace structure

For simple functions (fixed number of parameters, non-reentrant), you can create a stack backtrace structure in a few instructions:

function_name_label
        MOV     ip, sp
        STMFD   sp!, {fp,ip,lr,pc}
        SUB     fp, ip, #4

That snippet (from the aforementioned compiled program) is the most basic form. If you intend to corrupt some of the non-corruptible registers, then you should include that register in the STMFD command.

Your next task is to check the stack space. If you don't need much space (less than 256 bytes) then you can use:

        CMPS    sp, sl
        BLLT    |__rt_stkovf_split_small|
        SUB     sp, sp, #<size of local variables>

That is the 'new' C version way of handling stack overflows. In earlier versions (v4.00 or previous), you will want to call |x$stack_overflow| instead.
Note that, following a call to __rt_stkovf_split_small (or x$stack_overflow), sp may point to a different stack chunk, so you should access stacked arguments with offsets from fp, not offsets from sp.

Then you do your stuff...

Exiting (when no FP registers need to be restored) is performed by:

        LDMEA   fp, {fp,sp,pc}

(LDMDB is the same as LDMEA - you do not use LDMFD to exit an APCS function)

Again, if you stacked other registers, then reload them here.
The exit mechanism was chosen because it is easier and saner to simply LDM... to exit a function than to branch to a special function exit handler.
For APCS-R (26 bit), suffix the LDM instruction with '^'.
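
If it helps to see what those entry and exit instructions do to memory, here is a toy model in C. It is a sketch, not an emulator: the machine struct, the 64-word memory, and the names apcs_entry and apcs_exit are all invented, and the value stored for pc is simply passed in as a parameter. It relies on STMFD storing the lowest numbered register at the lowest address, and on LDMEA reading the words just below fp.

```c
#include <assert.h>
#include <stdint.h>

#define MEM_WORDS 64u

typedef struct {
    uint32_t fp, ip, sp, lr, pc;
    uint32_t mem[MEM_WORDS];      /* toy memory, byte address = index * 4 */
} machine;

static void store(machine *m, uint32_t addr, uint32_t value)
{
    assert(addr % 4u == 0 && addr / 4u < MEM_WORDS);
    m->mem[addr / 4u] = value;
}

static uint32_t load(const machine *m, uint32_t addr)
{
    assert(addr % 4u == 0 && addr / 4u < MEM_WORDS);
    return m->mem[addr / 4u];
}

/* MOV ip, sp / STMFD sp!, {fp,ip,lr,pc} / SUB fp, ip, #4 */
void apcs_entry(machine *m, uint32_t stored_pc)
{
    m->ip = m->sp;
    m->sp -= 16;                       /* full descending: decrement first */
    store(m, m->sp + 0,  m->fp);       /* return fp value                  */
    store(m, m->sp + 4,  m->ip);       /* return sp value                  */
    store(m, m->sp + 8,  m->lr);       /* return link value                */
    store(m, m->sp + 12, stored_pc);   /* save code pointer                */
    m->fp = m->ip - 4;                 /* fp points at save code pointer   */
}

/* LDMEA fp, {fp,sp,pc} */
void apcs_exit(machine *m)
{
    uint32_t base = m->fp;

    m->fp = load(m, base - 12);
    m->sp = load(m, base - 8);
    m->pc = load(m, base - 4);
}
```

Running apcs_entry and then apcs_exit leaves sp and fp exactly as they were on entry, with pc holding the old lr, which is the whole point of the structure.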

An extension to the protocol, used in backtracing, is to embed the function name into the code.
Immediately before the function (and the MOV ip, sp), you should have the following:

        DCD     &FF0000xx

Where 'xx' is the length of the function name string (including padding and terminator). This string is word-aligned, tail-padded, and should be placed directly before the DCD &FF....
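
A backtrace tool could find the name along these lines. The C below is a sketch under a few assumptions: the code is available as a little-endian byte image, the marker word sits in the four bytes immediately before the function's first instruction, and the low byte of the marker is the padded string length. The names word_at and embedded_name are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a little-endian 32-bit word from a byte image. */
static uint32_t word_at(const unsigned char *image, size_t offset)
{
    return (uint32_t)image[offset]
         | (uint32_t)image[offset + 1] << 8
         | (uint32_t)image[offset + 2] << 16
         | (uint32_t)image[offset + 3] << 24;
}

/* Hypothetical helper: return a pointer to the embedded name of the
   function whose first instruction is at image[entry], or NULL if no
   &FF0000xx marker precedes it. */
const char *embedded_name(const unsigned char *image, size_t entry)
{
    uint32_t marker, length;

    assert(entry % 4u == 0);
    if (entry < 4)
        return NULL;
    marker = word_at(image, entry - 4);
    if ((marker & 0xFFFFFF00u) != 0xFF000000u)
        return NULL;
    length = marker & 0xFFu;          /* includes terminator and padding */
    if (length == 0 || length > entry - 4)
        return NULL;
    return (const char *)image + (entry - 4 - length);
}
```

For the compiled 'one' function shown earlier, the four bytes "one" plus terminator are followed by &FF000004, so the name starts eight bytes before the function's entry point.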

So, your complete stack backtrace code (<256 bytes of stack required) would look like:

        DCB     "my_function_name", 0, 0, 0, 0
        DCD     &FF000014
my_function_name
        MOV     ip, sp
        STMFD   sp!, {fp, ip, lr, pc}
        SUB     fp, ip, #4

        CMPS    sp, sl                    ; this may be omitted if you
        BLLT    |__rt_stkovf_split_small| ; won't be using stack...
        SUB     sp, sp, #<size of local variables>

        ...process...

        LDMEA   fp, {fp, sp, pc}          ; <-- append '^' for APCS-R

If you use no stack, and you don't need to save any registers, and you don't call anything, then setting up an APCS block is unnecessary (but might be useful to track down problems during the debug stage).
In this case, you could:

my_simple_function

        ...process...

        MOV     pc, lr

Use MOVS pc, lr in APCS-R.

One thing to consider is the case when we require more than 256 bytes. In this case, our code is:

        ; create the stack backtrace structure
        MOV     ip, sp
        STMFD   sp!, {fp, ip, lr, pc}
        SUB     fp, ip, #4

        SUB     ip, sp, #<maximum frame size>
        CMPS    ip, sl
        BLLT    |__rt_stkovf_split_big|
        SUB     sp, sp, #<initial frame size>

        ...process...

        LDMEA   fp, {fp, sp, pc}          ; <-- append '^' for APCS-R
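
The two stack checks differ only in what gets compared against sl. As a sketch in C (function names invented, addresses assumed to be below 2GB so that the signed LT condition behaves like a plain comparison):

```c
#include <assert.h>
#include <stdint.h>

/* Model of: CMPS sp, sl / BLLT handler   (frame of less than 256 bytes).
   Returns 1 if the overflow handler would be called. */
int small_frame_needs_extension(uint32_t sp, uint32_t sl)
{
    assert(sp < 0x80000000u && sl < 0x80000000u);
    return sp < sl;
}

/* Model of: SUB ip, sp, #max / CMPS ip, sl / BLLT handler.
   Returns 1 if the overflow handler would be called. */
int big_frame_needs_extension(uint32_t sp, uint32_t sl, uint32_t max_frame)
{
    assert(sp < 0x80000000u && sl < 0x80000000u);
    assert(max_frame <= sp);     /* keep the toy subtraction from wrapping */
    return (sp - max_frame) < sl;
}
```

The small version can compare sp directly because, as noted earlier, there are at least 256 bytes available below the limit; the big version has to test where sp would end up after the whole frame has been claimed.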

To finish up, we'll look at an example function, and the code that is generated.

void c_lowercase(char string[])
{
   int  i = 0;
   while ( string[i] )
   {
      string[i] = tolower(string[i]);
      i++;
   }
   return;
}

Since this uses only one local variable, which can be kept in a register, the stack checking in the assembler snippet below is probably 'boilerplate' code - the stack is not used so we should be able to get away without any stack checks...

        =       "c_lowercase", 0
        DCD     &FF00000C
c_lowercase
        MOV      ip,sp
        STMDB    sp!,{a1,v1,v2,fp,ip,lr,pc}
        SUB      fp,ip,#4
        CMP      sp,sl
        BLLT     __rt_stkovf_split_small
        MOV      v1,a1
        MOV      v2,#0
        LDRB     a1,[a1,#0]
        CMP      a1,#0
        LDMEQDB  fp,{v1,v2,fp,sp,pc}
|L0002cc.J4.c_lowercase|
        LDRB     a1,[v1,v2]
        BL       tolower
        STRB     a1,[v1,v2]
        ADD      v2,v2,#1
        LDRB     a1,[v1,v2]
        CMP      a1,#0
        BNE      |L0002cc.J4.c_lowercase|
        LDMDB    fp,{v1,v2,fp,sp,pc}

Useful codey things

The first thing to consider is that dratted 26/32 bit issue. Put simply, there is absolutely no way in hell that the same general-purpose code can be assembled for both versions of APCS, without some hairy and devious tricks.
But, frankly, this isn't an issue. We know that your APCS standard isn't going to suddenly change. We also know that a 32bit version of RISC OS is not going to transmogrify itself when you pop out to brew a cuppa.
We also know that, with the newer SharedCLibrary, you can run APCS-32 code alongside APCS-R. So long as you don't start using processor-specific instructions (UMULL and MRS, for example), the same code should work across the entire range of RISC OS machines, save those poor sods who are still using RISC OS 2!

Many existing APIs don't actually require flags to be preserved. So in our 32bit version we can get away by changing MOVS PC,... to MOV PC,..., and LDM {...}^ to LDM {...}, and rebuilding.
The objasm assembler (v3.00 or later) has a {CONFIG} variable which will be either 26 or 32. Using this, it is possible to build macros...

my_function_name
        MOV     ip, sp
        STMFD   sp!, {fp, ip, lr, pc}
        SUB     fp, ip, #4

        ...process...

        [ {CONFIG} = 26
          LDMEA   fp, {fp, sp, pc}^
        |
          LDMEA   fp, {fp, sp, pc}
        ]

I've not tested this code. It (or something like it) is likely to be the best way to stay compatible with both versions of APCS; unless you decide to simply require your users to have the newer SharedCLib, in which case you simply write 26/32 neutral code and use APCS-32. The 'power' applications (like OvationPro) require the new CLib, there's no reason why you can't expect the same.

Testing for 32bit?
If you require your code to be adaptive, there is a simple test to determine the processor PC state. From this, you can determine:

26bit PC, may be APCS-R or APCS-32.
32bit PC, will never be APCS-R. All 26-bit code (TEQP etc) doomed to failure!

   TEQ     R0, R0     ; sets Z, so at least one flag is set
   TEQ     PC, PC     ; EQ for 32bit; NE for 26bit

The first test ensures some flags are set, so that the second test will work correctly.
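
Why a flag has to be set first is easier to see with a little model of the second test. The C below is a sketch based on my understanding of 26-bit behaviour, namely that R15 used as the first operand of a data processing instruction reads as the address alone, while R15 used as the second operand reads with the PSR bits included. The function name teq_pc_pc_is_eq is invented.

```c
#include <assert.h>
#include <stdint.h>

/* Model of "TEQ PC, PC": returns 1 if the result is EQ.
   pc       : address bits of the program counter (word aligned)
   psr_bits : flag and mode bits that share R15 in a 26-bit mode
   is_26bit : 1 for a 26-bit mode, 0 for a 32-bit mode */
int teq_pc_pc_is_eq(uint32_t pc, uint32_t psr_bits, int is_26bit)
{
    uint32_t first, second;

    /* In 26-bit mode the address must fit in bits 2..25. */
    assert(!is_26bit || (pc & ~0x03FFFFFCu) == 0);
    /* PSR bits live outside the address field. */
    assert((psr_bits & 0x03FFFFFCu) == 0);

    first  = pc;                                 /* address only        */
    second = is_26bit ? (pc | psr_bits) : pc;    /* whole of R15        */
    return (first ^ second) == 0;
}
```

In 26-bit USR mode with every flag clear, the two operands would be identical and the test would wrongly report EQ. With any flag set beforehand, the 26-bit case always gives NE.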

First case optimisation
Let's say we have a function like:

  int getbytefromcache(ptr)
  {
     /* ptr is pointer to cache value 0...xxxx */

     int __ptr = ptr;

     if (__ptr > __cachebase)
     {
        __ptr -= __cachebase;

        if (__ptr < __cachelimit)
           return (int)__cache[__ptr];
     }

     /* flush the cache, reload wanted block, return value */

     ...

It's a crappy example I devised off the top of my head. You have a pointer which can point into a total area of memory, and a cache of a small part.
If you look at it, the lead tests are pretty simple and quick. You could perform these without the APCS overheads. Something like:

getbytefromcache
        LDR     a4, __cachebase
        CMP     a1, a4
        BLT     getbytefromcache_entry
        SUB     a2, a1, a4
        LDR     a4, __cachelimit
        CMP     a2, a4
        LDRLTB  a1, [a2]
        MOVLT   pc, lr
        ; fall through if not LT

getbytefromcache_entry
        MOV     ip, sp
        STMFD   sp!, {fp, ip, lr, pc}
        SUB     fp, ip, #4

        ... stuff ...

        LDRB    a1, [a#]

        [ {CONFIG} = 26
          LDMEA   fp, {fp, sp, pc}^
        |
          LDMEA   fp, {fp, sp, pc}
        ]

That example is, again, off of the top of my head so don't blindly copy the code. :-)
The point, though, is that if there is something quick and simple that can be done in a few instructions, which can skip that which is done in many instructions, then it may be a worthwhile optimisation to make. A suggested rule of thumb: if it would take less time to execute than twice the time taken for the APCS stack structure creation, then try to optimise it.
(I work visually, counting the numbers of lines and taking STM as being '2'. This is a lot quicker than actually working out how many nanoseconds it would take by counting cycles!)
