STM

From ARMwiki
Revision as of 07:07, 22 December 2011

Instruction  STM[IA¦IB¦DA¦DB (¦FD¦ED¦FA¦EA)]
Function     Store Multiple
Category     Load and Store
ARM family   All
Notes        -

STM, for Store Multiple, writes multiple registers to memory in a single instruction. It is useful both as a way to quickly stack registers on entry to a subroutine, and in combination with Load Multiple as a way to produce block copies capable of saturating the memory bus (in other words, about as fast as the hardware can manage).

STM also includes (optional) writeback to the base address, so all stack or store offset calculations can be performed automatically.

STM will store any subset of the general purpose registers R0-R15, from a single register to all of them. This is normally the register set visible to the current processor mode, however (using the ^ suffix) it is possible to access the banked (User) registers while the processor is in a privileged mode.

The registers are stored in sequence, with the lowest numbered register being written to the lowest memory address. This happens regardless of the order in which the registers were specified, thus the following will not reverse the contents of the registers:

  STMFD  R13!, {R0, R1, R2, R3}  ; this does not swap anything, as a
  LDMFD  R13!, {R3, R2, R1, R0}  ; load/store instruction only keeps a
                                 ; bitmask of registers, not their ordering

Syntax

  STM[cond][type]  Rn[!], {<regs>}[^]

Where '!' requests that the updated address is written back to the base register, and '^' requests access to the banked (User) registers from a privileged mode (32 bit).

The <regs> can be a comma-separated list, or a dashed range, or a mixture. For example:

  R0, R1, R2, R3, R4, R7, R8
  R0-R4, R7, R8
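To see why the order of the list does not matter, here is a minimal Python sketch of how a register list collapses into a bitmask. The function names `parse_reglist` and `registers_in_mask` are made up for this illustration, and the parser only understands plain `Rn` names (no `PC`, `LR` or `SP` aliases).

```python
def parse_reglist(text):
    """Return the 16 bit register mask for a list such as 'R0-R4, R7, R8'.

    Hypothetical helper: only plain Rn names are understood.
    """
    mask = 0
    for item in text.replace(" ", "").upper().split(","):
        first, _, last = item.partition("-")
        low = int(first[1:])
        high = int(last[1:]) if last else low
        for reg in range(low, high + 1):
            mask |= 1 << reg          # one bit per register, order is lost
    return mask


def registers_in_mask(mask):
    """List the registers in the order the ARM transfers them (lowest first)."""
    return [f"R{n}" for n in range(16) if mask & (1 << n)]


if __name__ == "__main__":
    for text in ("R0, R1, R2, R3, R4, R7, R8", "R0-R4, R7, R8", "R3, R2, R1, R0"):
        mask = parse_reglist(text)
        print(f"{text:28} -> &{mask:04X} -> {registers_in_mask(mask)}")
```

Both ways of writing the example list give the same mask, and a list written in reverse gives the same mask as the same list written forwards.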

Function

 Store multiple registers to memory.

Types of memory access

STM provides four different types of access. In all cases, the ARM writes the lowest register specified to the lowest memory address (in other words, if R0 and R2 are to be written, R0 will go to the lowest address word and R2 will go to the next word up). This is absolute.

Increment After
The base register (Rn) points to the first address. The lowest register of the set to write is written to this address. The address is then incremented by four bytes. This sequence repeats until all of the registers have been written. If writeback has been selected, the base register will be updated to point to the address following the last register written.

  STM[cond]IA  Rn[!], {<registers>}[^]

Increment Before
This is functionally similar to Increment After, only this time the base address is incremented before the first register is written. For each register, the address increments, then the register is written. Hence, if writeback is in use, the base register will be updated to point to the address of the last register written.

  STM[cond]IB  Rn[!], {<registers>}[^]

Decrement After
This is more complicated. The first address is that of the base register (Rn) minus four times the number of registers to be written, plus four. The registers are then written in ascending order, the address incrementing by four after each one.

It is perhaps better to think of this in logical terms rather than actual, therefore:
At the address specified in Rn, the final (highest numbered) register to be written is stored. The address then decrements by four bytes (a word). This repeats until all of the registers have been written, in descending order so the lowest register goes to the lowest address. If writeback is enabled, the base register is updated to point to the word below (lower in memory than) the registers written.

  STM[cond]DA  Rn[!], {<registers>}[^]

Decrement Before
Thinking logically again (internally, the ARM works differently): The address specified in the base register is decremented by four bytes, then the highest numbered register is written there. This repeats, in descending register order, until all of the registers have been written. If writeback is enabled, the base register is updated to point to the lowest address used, which holds the lowest numbered register written.

  STM[cond]DB  Rn[!], {<registers>}[^]

In other words...
I understand that the above descriptions may seem a little complicated, so here is a diagram. What we have is a column representing memory locations. Each box is a 32 bit word, and higher addresses are up, lower addresses down. This column is repeated for each way of accessing memory.
With R13 pointing into the middle of this wodge of memory, we write R0 to R3 in each of the available memory access options, showing where the data is actually written, the order it is written in, and (if writeback is in operation) where R13 would be updated to point to.

STMxx Memory.png
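The same information can be worked out numerically. What follows is a small Python sketch (not how the ARM does it internally) that, for each of the four access types, returns the address each register is written to and the value the base register would receive if writeback is in operation. The function name `stm_addresses` and the example base address &8000 are assumptions made for the illustration.

```python
def stm_addresses(mode, base, registers):
    """Return ([(register, address), ...], writeback) for one STM.

    'registers' is any collection of register numbers; as on the ARM,
    the lowest numbered register always goes to the lowest address.
    """
    regs = sorted(set(registers))
    size = 4 * len(regs)
    if mode == "IA":
        start, writeback = base, base + size
    elif mode == "IB":
        start, writeback = base + 4, base + size
    elif mode == "DA":
        start, writeback = base - size + 4, base - size
    elif mode == "DB":
        start, writeback = base - size, base - size
    else:
        raise ValueError(f"unknown access type {mode!r}")
    stores = [(reg, start + 4 * i) for i, reg in enumerate(regs)]
    return stores, writeback


if __name__ == "__main__":
    base = 0x8000                      # example value for R13
    for mode in ("IA", "IB", "DA", "DB"):
        stores, writeback = stm_addresses(mode, base, [0, 1, 2, 3])
        where = ", ".join(f"R{r} at &{a:04X}" for r, a in stores)
        print(f"STM{mode}: {where}; writeback R13 = &{writeback:04X}")
```

Note that DA and DB leave the same writeback value, as do IA and IB; the difference in each pair is only which four words are used.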

Alternative memory access names

While the names given above are useful for block transfer operations, where it is likely that both load and store will use the same name, they are less convenient for stack based operations, where the load must use the opposite name to the store, as in the following example:

  .my_function
    STMDB  R13!, {R0-R3, R14}
    ...some code...
    LDMIA  R13!, {R0-R3, PC}

It would be quite unpleasant to mis-remember which access name to use at which time. Therefore, there are alternative names for stack operations, based upon two criteria:

  • Stack type:
    • Full - the stack pointer points to the last used location.
    • Empty - the stack pointer points to the next unused location.
  • Stack direction:
    • Descending - the stack grows downwards in memory, starting at the highest address
    • Ascending - the stack grows upwards, starting at the lowest address

Therefore, the stack names are based upon this, and are FD, ED, FA, and EA. Their exact interpretation differs depending on whether it is a load or store, but you don't need to worry about this, only that using the same name (ie FD) in both cases will result in expected and consistent behaviour.

By way of comparison, the 6502 stack pointer counts downwards from &FF and the pointer itself is the next free location, thus it would be an "Empty Descending" stack.

Both RISC OS and ArmLinux (ie Android, etc), by convention, use a Full Descending stack.

For the sake of completeness:

  Stack name  Storing  Loading
  FD          DB       IA
  FA          IB       DA
  ED          DA       IB
  EA          IA       DB
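The table need not be memorised, as it follows from the two criteria given above. Here is a short Python sketch of that reasoning; the function name `block_mode` is invented for this example.

```python
def block_mode(stack, is_load):
    """Translate a stack name (FD, FA, ED, EA) into IA, IB, DA or DB."""
    full = stack[0] == "F"
    ascending = stack[1] == "A"
    # A store moves the pointer in the direction the stack grows,
    # a load moves it back the other way.
    increment = ascending != is_load
    # On a Full stack the pointer sits on used data, so a store must move
    # first ("before") while a load reads first and moves "after".
    # An Empty stack is the other way around.
    before = full != is_load
    return ("I" if increment else "D") + ("B" if before else "A")


if __name__ == "__main__":
    for name in ("FD", "FA", "ED", "EA"):
        print(name, block_mode(name, is_load=False), block_mode(name, is_load=True))
```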

Single register stacking

On the ARM, an STM to store one register is a two cycle instruction which takes one Sequential cycle plus one Internal cycle on the instruction bus, followed by one Non-sequential plus one Internal cycle on the data bus.
An STR, on the other hand, is a single-cycle instruction, taking only one Sequential cycle (instruction) and one Non-sequential cycle (data).
It is similar for loading, only this takes more cycles in both cases.

Therefore, the following observations can be made from the instruction timings: Firstly, for multiple register transfer, STM/LDM is a win, no doubt. However for single register transfer, STR and LDR may be preferable.

This has significance to you if you only wish to stack R14 (return address) on entry to your function, or stack a single register around a system call that would corrupt it.

You can implement single-register stack writes with the following:

  STR    R14, [R13, #-4]!    ; equivalent to STMFD R13!, {R14}

And you can read it back with:

  LDR    R14, [R13], #4      ; equivalent to LDMFD R13!, {R14}

Example

Given R0-R3 as scratch registers, R4-R6 corrupted in our procedure, and R13 as a stack pointer with R14 as the return address; a procedure can be wrapped as follows:

  STMFD  R13!, {R4-R6, R14}
  ...procedure code here...
  LDMFD  R13!, {R4-R6, PC}

The final instruction will restore the three registers we don't want corrupted, load the return address into PC (to exit the procedure), and set the stack pointer to what it was on entry.


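A zippy block copy function, which copies eleven words (44 bytes) on each pass through the loop.
On entry R0 points to the source block start, R1 points to the source block end, and R2 points to the destination block. R13 is stack pointer. All registers are preserved. The loop only ever copies whole 44 byte chunks, so if the block length is not a multiple of 44 bytes the final pass will copy past the end of the source.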

  STMFD  R13!, {R0-R12,R14}
  ; shift registers so we have the following:
  ;   R0-R10 = available for use
  ;   R11    = source start
  ;   R12    = source end
  ;   R13    = stack pointer
  ;   R14    = destination start
  MOV    R11, R0
  MOV    R12, R1
  MOV    R14, R2
  
 .copyloop
  LDMIA  R11!, {R0-R10}  ; load data
  STMIA  R14!, {R0-R10}  ; write data
  CMP    R11, R12        ; reached end?
  BCC    copyloop        ; no - unsigned test, as these are addresses
  
  ; tidy up and exit
  LDMFD  R13!, {R0-R12, PC}

Unlike the Simple Block Copy printed in the ARM Architecture Reference Manual, this version sacrifices one word capacity (ie we copy 44 bytes, not 48) for the flexibility of being a self-contained function. If you observe the ARM version, you will notice both R13 and R14 are corrupted, thus meaning you would need to do something with one (most likely R13) to allow it to be reloaded to permit the function to be exited.
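The loop in the listing only moves whole eleven word chunks. One possible way of dealing with a block that is not a multiple of 44 bytes is sketched below in Python, as a model rather than as ARM code: copy whole chunks while a full chunk remains, then mop up the leftover words one at a time (as single LDR/STR would). The name `block_copy`, and the use of a Python list of words indexed by word number instead of byte addresses, are assumptions for the illustration.

```python
CHUNK_WORDS = 11   # R0-R10, as in the assembly listing


def block_copy(memory, src_start, src_end, dest):
    """Copy the words memory[src_start:src_end] to memory[dest:].

    'memory' is a list of words and all positions are word indices.
    Returns the position following the last word written.
    """
    src = src_start
    # Whole chunks: the equivalent of the LDMIA/STMIA pair.
    while src_end - src >= CHUNK_WORDS:
        chunk = memory[src:src + CHUNK_WORDS]
        memory[dest:dest + CHUNK_WORDS] = chunk
        src += CHUNK_WORDS
        dest += CHUNK_WORDS
    # Leftover words, one at a time.
    while src < src_end:
        memory[dest] = memory[src]
        src += 1
        dest += 1
    return dest


if __name__ == "__main__":
    memory = list(range(25)) + [0] * 35
    end = block_copy(memory, 0, 25, 30)   # two chunks plus three odd words
    print(end, memory[30:end])
```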

While 44 bytes might seem an odd number, with 32 bytes (or 8 registers) being a better value (8 loops = 256 bytes, 32 loops = a kilobyte, etc), it really depends on what you will be copying. Take, for example, a megabyte. That would require 32768 loops of 32 byte transfers, or 23831 loops of 44 byte transfers with special handling for the remaining 12 bytes. The special handling will likely take a lot less time than nearly nine thousand additional passes through the loop.
To put this into perspective, I put together a little BASIC program which ran the copy loop on a megabyte of data 1024 times (thus copying a gigabyte). Under emulation, with an ARM710 clocking some 700MHz, and pushing some 600MiB/sec memory accesses (phew!), the 44 byte copy took 380cs (a mite under four seconds), while the 32 byte copy took 428cs.
Reality is likely to be quite different, depending on the physical hardware in use - for example the (ancient) RiscPC's lethargic memory bus means you'll probably only see figures in the order of 10-12MiB/sec running flat out, meaning a gigabyte would transfer in around a minute and a half. In optimal conditions. The difference between 32 and 44 bytes could be significant.
Thankfully things are much nicer on more recent hardware, any Android phone or iPod for instance.
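The loop counts quoted above are easy to check. This little Python sketch works out the number of whole passes and the leftover bytes for a megabyte at each chunk size; the helper name `passes` is just for this example.

```python
MEGABYTE = 1024 * 1024


def passes(total_bytes, chunk_bytes):
    """Return (whole passes through the loop, bytes left over)."""
    return divmod(total_bytes, chunk_bytes)


if __name__ == "__main__":
    for chunk in (32, 44):
        loops, leftover = passes(MEGABYTE, chunk)
        print(f"{chunk} byte chunks: {loops} loops, {leftover} bytes left over")
    saved = passes(MEGABYTE, 32)[0] - passes(MEGABYTE, 44)[0]
    print(f"passes saved by the 44 byte version: {saved}")
```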

Notes

  • The S bit, set by the ^ suffix, controls whether or not privileged modes will store the banked (User mode) registers instead of those applicable to the current mode. Accordingly, using it in User (or System) mode is unpredictable.
  • R15 should not be specified, anywhere, in an STM instruction. As a base register, it would be unpredictable, and as a stored register it is implementation defined.
  • If the base register is specified in the register list, and writeback is enabled, then things could go bang! Just don't.
  • Addresses should be word-aligned.

Technical

The instruction bit pattern is as follows:

  31-28      27-25  24  23  22  21  20  19-16      15-0
  condition  1 0 0  P   U   S   W   0   Rn (base)  Bitmask of registers

Where:

  • P specifies if the address is adjusted (incremented or decremented, according to U) before each word is written (P=1) or after (P=0).
  • U specifies if the address is ascending (U=1) or descending (U=0).
  • S specifies if banked register access should occur when in privileged modes.
  • W specifies if the base register address should be updated after the data transfer.
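As a worked example of the bit pattern, the following Python sketch assembles an STM by hand. The function name `encode_stm` and its parameters are invented for this illustration; the default condition value of 14 is the "always" condition, which is what you get when no condition is given.

```python
# (P, U) for each of the four access types
MODES = {"IA": (0, 1), "IB": (1, 1), "DA": (0, 0), "DB": (1, 0)}


def encode_stm(mode, rn, registers, writeback=False, user_bank=False, cond=14):
    """Build the 32 bit instruction word for an STM.

    mode      - "IA", "IB", "DA" or "DB"
    rn        - base register number
    registers - register numbers to store (order is irrelevant)
    writeback - True for the '!' suffix (W bit)
    user_bank - True for the '^' suffix (S bit)
    """
    p, u = MODES[mode]
    mask = 0
    for reg in registers:
        mask |= 1 << reg
    return ((cond << 28) | (0b100 << 25) | (p << 24) | (u << 23)
            | (int(user_bank) << 22) | (int(writeback) << 21)
            | (rn << 16) | mask)            # bit 20 stays 0 for a store


if __name__ == "__main__":
    # STMFD R13!, {R4-R6, R14} is STMDB with writeback
    word = encode_stm("DB", 13, [4, 5, 6, 14], writeback=True)
    print(f"&{word:08X}")
```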