From ARMwiki
Revision as of 00:43, 12 December 2011 by Admin (Talk | contribs)

Instruction:  STM[IA¦IB¦DA¦DB (¦FD¦ED¦FA¦EA)]
Function:     Store Multiple
Category:     Load and Store
ARM family:   All
Notes:        -



STM, for Store Multiple, writes multiple registers to memory in a single instruction. It is useful both as a way to quickly stack registers on entry to a subroutine, and in combination with Load Multiple as a way to produce block copies capable of saturating the memory bus (in other words, about as fast as the hardware can manage).

STM also includes (optional) writeback to the base address, so all stack or store offset calculations can be performed automatically.

STM can store any subset, from one to all, of the general purpose registers R0-R15. This is normally the register set visible to the current processor mode; however, using the ^ suffix, it is possible to access the banked (User) registers while the processor is in a privileged mode.

The registers are stored in sequence, with the lowest numbered register being written to the lowest memory address. This happens regardless of the order in which they were specified, so the following attempt to reverse the contents of R0-R3 wouldn't work:

  STMFD  R13!, {R0, R1, R2, R3}  ; no reversal takes place, as a load/store
  LDMFD  R13!, {R3, R2, R1, R0}  ; instruction only keeps a bitmask of the
                                 ; registers, and not their ordering
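
A load/store multiple instruction holds its register list as a 16-bit mask, which is why the ordering is lost. The Python sketch below is only an illustrative model under that assumption: the names `reg_mask`, `stm_fd` and `ldm_fd` are invented for this page, memory is a plain dictionary, and a Full Descending stack is assumed.

```python
def reg_mask(regs):
    """Collapse a register list into the 16-bit mask the instruction holds."""
    mask = 0
    for r in regs:
        mask |= 1 << r
    return mask


def listed(regs):
    """Registers in the order the hardware transfers them: lowest first."""
    mask = reg_mask(regs)
    return [r for r in range(16) if mask & (1 << r)]


def stm_fd(memory, sp, regs, regfile):
    """Model of STMFD sp!, {regs}; returns the written-back stack pointer."""
    numbers = listed(regs)
    sp -= 4 * len(numbers)
    for i, r in enumerate(numbers):
        memory[sp + 4 * i] = regfile[r]
    return sp


def ldm_fd(memory, sp, regs, regfile):
    """Model of LDMFD sp!, {regs}; returns the written-back stack pointer."""
    numbers = listed(regs)
    for i, r in enumerate(numbers):
        regfile[r] = memory[sp + 4 * i]
    return sp + 4 * len(numbers)


regfile = {r: 100 + r for r in range(16)}   # arbitrary test values
memory = {}
sp = stm_fd(memory, 0x8000, [0, 1, 2, 3], regfile)
sp = ldm_fd(memory, sp, [3, 2, 1, 0], regfile)
print([regfile[r] for r in range(4)])       # R0-R3 come back unchanged
```

Both spellings of the list produce the same mask, so the load simply puts every value back where it came from.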


  STM[cond][type]  Rn[!], {<regs>}[^]

Where '!' specifies to write back the updated base pointer to the base register,
and '^' specifies to access the banked registers (32 bit).

The <regs> can be a comma-separated list, or a dashed range, or a mixture. For example:

  R0, R1, R2, R3, R4, R7, R8
  R0-R4, R7, R8
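
As a rough illustration of how both spellings describe the same set of registers, the sketch below turns a register list into a 16-bit mask. `parse_reglist` is a hypothetical helper written for this page, not part of any assembler, and it only understands plain `Rn` names (no `SP`, `LR` or `PC` aliases).

```python
import re


def parse_reglist(text):
    """Turn 'R0-R4, R7, R8' into a 16-bit mask (bit n set for Rn)."""
    mask = 0
    for item in text.split(","):
        item = item.strip().upper()
        m = re.fullmatch(r"R(\d+)(?:\s*-\s*R(\d+))?", item)
        if m is None:
            raise ValueError("not a register or range: %r" % item)
        first = int(m.group(1))
        last = int(m.group(2)) if m.group(2) is not None else first
        if not 0 <= first <= last <= 15:
            raise ValueError("bad register range: %r" % item)
        for r in range(first, last + 1):
            mask |= 1 << r
    return mask


print(hex(parse_reglist("R0, R1, R2, R3, R4, R7, R8")))
print(hex(parse_reglist("R0-R4, R7, R8")))
```

Both calls give the same mask, with bits 0 to 4, 7 and 8 set.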


 Store multiple registers to memory.

Types of memory access

STM provides four different types of access.

Increment After
The first address is that of the base register (Rn). After each register is written, the address is incremented by four, except after the last one. The final address is therefore that of the last register written.

  STM[cond]IA  Rn[!], {<registers>}[^]

Increment Before
The first address is that of the base register (Rn) plus four. The address is incremented by four before each register is written. The final address is that of the last register written, namely the base register plus four times the number of registers.

  STM[cond]IB  Rn[!], {<registers>}[^]

Decrement After
The first address is that of the base register (Rn), minus four times the number of registers to be written, plus four. For subsequent addresses, the address is incremented by four. The final address is the value of the base register.

  STM[cond]DA  Rn[!], {<registers>}[^]

Decrement Before
The first address is that of the base register (Rn), minus four times the number of registers to be written. For subsequent addresses, the address is incremented by four. The final address is the value of the base register, minus four.

  STM[cond]DB  Rn[!], {<registers>}[^]

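In each of the above cases, the "final address" is the highest address written in memory, not the writeback value. The writeback value is calculated separately: the base register plus four times the number of registers for the incrementing types, or minus that amount for the decrementing types. This apparent disparity arises because the ARM writes the lowest numbered register to the lowest address, regardless of the addressing mode used.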
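
The four descriptions above boil down to a little arithmetic. The following Python sketch is a model of that arithmetic only; the function name `stm_addresses` is made up for illustration and nothing here touches real memory.

```python
def stm_addresses(mode, base, count):
    """Return (first address, final address, writeback value) for an
    STM of `count` registers using access type IA, IB, DA or DB."""
    size = 4 * count
    if mode == "IA":
        start, writeback = base, base + size
    elif mode == "IB":
        start, writeback = base + 4, base + size
    elif mode == "DA":
        start, writeback = base - size + 4, base - size
    elif mode == "DB":
        start, writeback = base - size, base - size
    else:
        raise ValueError("unknown access type: %r" % mode)
    final = start + size - 4        # registers always go upwards from start
    return start, final, writeback


for mode in ("IA", "IB", "DA", "DB"):
    start, final, writeback = stm_addresses(mode, 0x1000, 4)
    print(mode, hex(start), hex(final), hex(writeback))
```

With a base of &1000 and four registers, IA covers &1000 to &100C, IB covers &1004 to &1010, DA covers &FF4 to &1000, and DB covers &FF0 to &FFC. The writeback value is &1010 for both incrementing types and &FF0 for both decrementing types.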

Alternative memory access names

The names given above suit block transfer operations, where the load and the store are likely to use the same name. They are less convenient for stack based operations, where the load and the store need different names, as in this example:

    STMDB  R13!, {R0-R3, R14}
    ...some code...
    LDMIA  R13!, {R0-R3, PC}

It would be quite unpleasant to mis-remember which access name to use at which time. Therefore, there are alternative names for stack operations, based upon two criteria:

  • Stack type:
    • Full - the stack pointer points to the last used location.
    • Empty - the stack pointer points to the next unused location.
  • Stack direction:
    • Descending - the stack grows downwards in memory, starting at the highest address
    • Ascending - the stack grows upwards, starting at the lowest address

Therefore, the stack names are based upon this, and are FD, ED, FA, and EA. Their exact interpretation differs depending on whether it is a load or store, but you don't need to worry about this, only that using the same name (ie FD) in both cases will result in expected and consistent behaviour.

By way of comparison, the 6502 stack pointer counts downwards from &FF and the pointer itself is the next free location, thus it would be an "Empty Descending" stack.

Both RISC OS and ARM Linux (ie Android, etc), by convention, use a Full Descending stack.

For the sake of completeness:

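  Stack name             Storing  Loading
  FD (Full Descending)   STMDB    LDMIA
  ED (Empty Descending)  STMDA    LDMIB
  FA (Full Ascending)    STMIB    LDMDA
  EA (Empty Ascending)   STMIA    LDMDB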
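
The stack names are only aliases for the four access types. The Python sketch below assumes the usual correspondence (stores: FD=DB, ED=DA, FA=IB, EA=IA; loads: FD=IA, ED=IB, FA=DA, EA=DB) and checks that, for every stack type, a pop revisits exactly the addresses the push wrote and puts the stack pointer back. The helper names `transfer` and `push_pop` are invented for this illustration.

```python
# Assumed mapping from stack names to the underlying access types.
STORE_AS = {"FD": "DB", "ED": "DA", "FA": "IB", "EA": "IA"}
LOAD_AS = {"FD": "IA", "ED": "IB", "FA": "DA", "EA": "DB"}


def transfer(mode, base, count):
    """Return (addresses touched, writeback value) for one access type."""
    size = 4 * count
    start = {"IA": base, "IB": base + 4,
             "DA": base - size + 4, "DB": base - size}[mode]
    writeback = base + size if mode[0] == "I" else base - size
    return [start + 4 * i for i in range(count)], writeback


def push_pop(stack, sp, count):
    """Push then pop `count` registers with writeback on both."""
    pushed, sp_after_push = transfer(STORE_AS[stack], sp, count)
    popped, sp_after_pop = transfer(LOAD_AS[stack], sp_after_push, count)
    return pushed, popped, sp_after_pop


for name in ("FD", "ED", "FA", "EA"):
    pushed, popped, sp = push_pop(name, 0x8000, 5)
    print(name, pushed == popped, hex(sp))   # True 0x8000 for every name
```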


Given R0-R3 as scratch registers, R4-R6 corrupted in our procedure, and R13 as a stack pointer with R14 as the return address; a procedure can be wrapped as follows:

  STMFD  R13!, {R4-R6, R14}
  ...procedure code here...
  LDMFD  R13!, {R4-R6, PC}

The final instruction will restore the three registers we don't want corrupted, load the return address into PC (to exit the procedure), and set the stack pointer to what it was on entry.

A zippy block copy function, which copies eleven words (44 bytes) on each pass through the loop.
On entry, R0 points to the start of the source block, R1 points to the end of the source block, and R2 points to the destination block. R13 is the stack pointer. All registers are preserved.

  STMFD  R13!, {R0-R12, R14}
  ; shift registers so we have the following:
  ;   R0-R10 = available for use
  ;   R11    = source start
  ;   R12    = source end
  ;   R13    = stack pointer
  ;   R14    = destination start
  MOV    R11, R0
  MOV    R12, R1
  MOV    R14, R2
  copyloop
  LDMIA  R11!, {R0-R10}  ; load data
  STMIA  R14!, {R0-R10}  ; write data
  CMP    R11, R12        ; reached end?
  BLO    copyloop        ; unsigned compare, as these are addresses
  ; tidy up and exit
  LDMFD  R13!, {R0-R12, PC}
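
The loop always moves whole eleven word chunks, so it is worth seeing what happens when the source length is not a multiple of 44 bytes. The Python sketch below is a loose model of the loop only (the name `block_copy` and the dictionary standing in for memory are assumptions made for illustration); it reports how many passes were made and how many bytes past the source end were read.

```python
WORDS_PER_PASS = 11                     # R0-R10


def block_copy(memory, src_start, src_end, dest):
    """Model of the copy loop; memory maps word addresses to values.
    Returns (passes made, bytes copied beyond src_end)."""
    src, dst = src_start, dest
    passes = 0
    while True:
        for i in range(WORDS_PER_PASS):             # LDMIA then STMIA
            memory[dst + 4 * i] = memory.get(src + 4 * i, 0)
        src += 4 * WORDS_PER_PASS                   # writeback on R11
        dst += 4 * WORDS_PER_PASS                   # writeback on R14
        passes += 1
        if src >= src_end:                          # CMP R11, R12
            break
    return passes, src - src_end


memory = {0x1000 + 4 * i: i for i in range(22)}
print(block_copy(memory, 0x1000, 0x1000 + 88, 0x2000))   # exact multiple
print(block_copy(memory, 0x1000, 0x1000 + 48, 0x3000))   # 48 byte source
```

An 88 byte source takes two passes with nothing left over. A 48 byte source also takes two passes, so 40 bytes beyond the end of the source are copied too.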

Unlike the Simple Block Copy printed in the ARM Architecture Reference Manual, this version sacrifices one word of capacity (ie we copy 44 bytes, not 48) for the flexibility of being a self-contained function. If you look at the ARM version, you will notice that both R13 and R14 are corrupted, so you would need to save one of them (most likely R13) somewhere it can be reloaded from before the function can exit.

While 44 bytes might seem an odd number, with 32 bytes (or 8 registers) being a better value (8 passes = 256 bytes, 32 passes = a kilobyte, etc), it really depends on what you will be copying. Take, for example, a megabyte. That would require 32768 loops of 32 byte transfers, or 23831 loops of 44 byte transfers with special handling for the remaining 12 bytes. The special handling will likely take a lot less time than nearly nine thousand additional passes through the loop.
To put this into perspective, I put together a little BASIC program which ran the copy loop on a megabyte of data 1024 times (thus copying a gigabyte). Under emulation, with an ARM710 clocking some 700MHz, and pushing some 600MiB/sec memory accesses, the 44 byte copy took 380cs (a mite under four seconds), while the 32 byte copy took 428cs.
Reality is likely to be quite different, depending on the physical hardware in use - for example the (ancient) RiscPC's lethargic memory bus means you'll probably only see figures in the order of 10-12MiB/sec running flat out, meaning a gigabyte would transfer in around a minute and a half. In optimal conditions. The differences could be significant.
Thankfully things are much nicer on more recent hardware, any Android phone or iPod for instance.
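
The loop counts quoted above for a megabyte are easy to check. This small Python sketch only does the arithmetic; the name `loop_plan` is made up for illustration.

```python
MEGABYTE = 1024 * 1024


def loop_plan(total_bytes, chunk_bytes):
    """Return (whole passes through the loop, bytes left over)."""
    return divmod(total_bytes, chunk_bytes)


loops32, left32 = loop_plan(MEGABYTE, 32)
loops44, left44 = loop_plan(MEGABYTE, 44)
print(loops32, left32)          # 32768 0
print(loops44, left44)          # 23831 12
print(loops32 - loops44)        # 8937 fewer passes with 44 byte chunks
```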


  • The S bit, controlled by the ^ suffix, determines whether privileged modes will store the banked (User mode) registers instead of those applicable to the current mode. Accordingly, doing this in User (or System) mode is unpredictable.
  • R15 should not be specified, anywhere, in an STM instruction. As a base register, it would be unpredictable, and as a stored register it is implementation defined.
  • If the base register is specified in the register list, and writeback is enabled, then things could go bang! The original value of the base is only stored if it is the lowest numbered register in the list; otherwise the stored value is unpredictable. Just don't.
  • Addresses should be word-aligned.


The instruction bit pattern is as follows:

  31-28      27-25  24  23  22  21  20  19-16      15-0
  condition  1 0 0  P   U   S   W   0   Rn (base)  Bitmask of registers


  • P specifies if the address is adjusted before each register is written (P=1, the IB and DB types) or after (P=0, the IA and DA types).
  • U specifies if the address is ascending (U=1) or descending (U=0).
  • S specifies if banked register access should occur when in privileged modes.
  • W specifies if the base register address should be updated after the data transfer.
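
To tie the fields together, the Python sketch below assembles the bit pattern by hand. `encode_stm` is a hypothetical helper written for this page, not an assembler API; the condition defaults to 14 (binary 1110, "always"), and bit 20 is left clear because this is a store.

```python
def encode_stm(rn, reg_mask, p, u, w, s=0, cond=0xE):
    """Build the 32-bit STM instruction word from its fields."""
    return (((cond & 0xF) << 28)
            | (0b100 << 25)             # bits 27-25 identify block transfer
            | ((p & 1) << 24)           # P: adjust before (1) or after (0)
            | ((u & 1) << 23)           # U: ascending (1) or descending (0)
            | ((s & 1) << 22)           # S: the ^ suffix
            | ((w & 1) << 21)           # W: the ! writeback suffix
            | (0 << 20)                 # bit 20 is 0 for a store
            | ((rn & 0xF) << 16)
            | (reg_mask & 0xFFFF))


# STMFD R13!, {R4-R6, R14} is STMDB with writeback: P=1, U=0, W=1
mask = (1 << 4) | (1 << 5) | (1 << 6) | (1 << 14)
print(hex(encode_stm(13, mask, p=1, u=0, w=1)))     # 0xe92d4070
```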