Branch and Call Sequences Explained

In this post, I will explain the various branch and call instructions available in the ARM and Thumb instruction sets, and why the variants exist. Finally, I will provide a JavaScript tool that can help you find a typical branch sequence matching your requirements.

What Does a Branch Do?

A branch, quite simply, is a break in the sequential flow of instructions that the processor is executing. Some other architectures call them jumps, but they're essentially the same thing. The following is a trivial, and hopefully familiar, example of a branch:

entry_point:
    mov     r0, #0      @ Set r0 to 0.
    b       target      @ Jump forward to 'target'.
    mov     r0, #1      @ Set r0 to 1.
target:
    ...                 @ At this point, r0 holds the value 0.
    ...                 @ The second mov instruction did not execute.
Example of branch execution.

There are several variants of branches in the ARM and Thumb instruction sets. Several of these variants are common to many other CPU architectures, but a few are specific to ARM. Each variant is explained in detail below.

Relative and Absolute Branch Targets

A relative branch is one where the target address is calculated based on the value of the current pc (program counter). Given the example above, an assembler would work out that the target label is eight bytes ahead of the b target instruction (in ARM code) and then generate a relative branch which means 'jump forward by eight bytes'. Relative branches are essential for position-independent code, which is expected to run correctly at any location in memory. The most common relative branches on ARM are single instructions and tend to be the most efficient branches available, though they have limited range.
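
To make the offset arithmetic concrete, here is a minimal Python sketch of how an assembler might fill in the offset field of an ARM-state b instruction. The function name and the addresses are illustrative assumptions, not part of any real toolchain. The sketch assumes the usual ARM encoding for an unconditional b: a signed 24-bit word offset measured from the branch's own address plus 8.

```python
def encode_arm_b(instr_addr, target_addr):
    """Encode an unconditional ARM-state 'b' (hypothetical helper)."""
    # In ARM state, the pc reads as the branch's own address plus 8.
    offset = target_addr - (instr_addr + 8)
    if offset % 4 != 0:
        raise ValueError("ARM branch targets must be 4-byte aligned")
    imm24 = offset >> 2
    if not -(1 << 23) <= imm24 < (1 << 23):
        raise ValueError("target is out of range for a single b instruction")
    # 0xEA000000 is the 'always' condition plus the b opcode bits;
    # the low 24 bits hold the word offset in two's complement.
    return 0xEA000000 | (imm24 & 0xFFFFFF)


# The example above, assuming entry_point is placed at address 0x8000:
# 'b target' then sits at 0x8004 and 'target' at 0x800C, eight bytes ahead.
print(hex(encode_arm_b(0x8004, 0x800C)))
```

Under these assumptions the encoded offset field is zero even though the target is eight bytes ahead of the branch, because the offset is measured from the apparent pc rather than from the branch instruction itself.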

An absolute branch will always jump to the specified address, regardless of the current pc. Absolute branches are used when the address of the target is provided as a function pointer, for example. However, because an absolute branch requires a full 32-bit target address, absolute branches usually require a load or some other constant-loading mechanism in addition to the branch instruction itself.

In many cases, the programmer (or compiler) may not actually care whether a branch is relative or absolute, and might just use whichever is most efficient on a case-by-case basis.

Branch Range

Because the ARM instruction set is fixed-width at 32 bits (and Thumb instructions are either 16 or 32 bits wide), it is not possible to encode a full 32-bit branch offset in a single instruction. Relative branches can be encoded using a limited-range offset from the current pc. In assembly code, this is usually written as a branch to a label (as in the example above). The assembler will work out the required offset.

The range available varies between ARM and Thumb (and in a few cases also between instruction variants) but is usually very large and quite sufficient for most branches within a program. By using various combinations of additional instructions and literal pool loads, it is also possible to construct arbitrary full-range branches in case the single-instruction range is not sufficient. All practical absolute branches are necessarily full-range, since a 32-bit target address needs to be loaded.
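
As a rough illustration of what "limited range" means, the following Python sketch works out which targets a single ARM-state b or bl can reach, assuming the usual signed 24-bit word offset measured from the apparent pc. The helper names and the addresses are hypothetical, and the Thumb encodings (which have different field widths) are not modelled.

```python
def arm_b_reach(instr_addr):
    """Return (lowest, highest) target reachable by one ARM-state b or bl."""
    apparent_pc = instr_addr + 8                  # pc reads 8 bytes ahead
    lowest = apparent_pc - (1 << 23) * 4          # most negative word offset
    highest = apparent_pc + ((1 << 23) - 1) * 4   # most positive word offset
    return lowest, highest


def needs_long_branch(instr_addr, target_addr):
    """True if the target cannot be reached with a single b or bl."""
    lowest, highest = arm_b_reach(instr_addr)
    return not lowest <= target_addr <= highest


lowest, highest = arm_b_reach(0x04000000)
print(hex(lowest), hex(highest))
print(needs_long_branch(0x04000000, 0x08000000))
```

A linker or assembler that finds the target outside this window has to fall back on one of the multi-instruction sequences, such as a literal pool load followed by bx.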

Function Calls

Almost every modern programming language has some concept of functions. Any given function can (in general) be called from any part of a program, so processor architectures need some way to store the address of the caller. On ARM processors, this return address is stored in lr (the link register). Branch instructions with an l suffix — like bl and blx — work just like a standard b or bx branch, but also store a return address in lr.
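
The value that a call leaves in lr can be modelled in a few lines of Python. This is only a sketch: the function name and parameters are assumptions made for illustration. It relies on two facts: the return address is the address of the instruction following the call, and a call made from Thumb state also sets bit 0 of lr so that an interworking return resumes in Thumb state (interworking is covered in the next section).

```python
def lr_after_call(instr_addr, instr_size=4, thumb=False):
    """Value written to lr by a bl or blx at instr_addr (hypothetical model)."""
    # The return address is the instruction that follows the call.
    return_addr = instr_addr + instr_size
    # A call from Thumb state marks the return address as Thumb code by
    # setting bit 0.
    return (return_addr | 1) if thumb else return_addr


print(hex(lr_after_call(0x8000)))                            # ARM-state bl
print(hex(lr_after_call(0x8000, instr_size=4, thumb=True)))  # Thumb bl
print(hex(lr_after_call(0x8000, instr_size=2, thumb=True)))  # Thumb blx rA
```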

If a function does not modify lr, then the return sequence can (and should) be a simple "bx lr". Otherwise, the lr can be pushed onto the stack at function entry. From here, the best return sequence is usually to pop directly into pc, though a number of other options are possible depending on the situation.

Interworking Branches (Between ARM and Thumb Code)

Programs on ARM processors can use either the ARM or Thumb instruction set, or both. Whilst ARM and Thumb instructions cannot be directly interleaved, it is possible to switch (or interwork) between ARM and Thumb states at run-time. This interworking is most notably achieved using special branch instructions with an x suffix, like bx and blx. Several other branch mechanisms are also capable of interworking. For example, the return sequence which writes to the pc using pop (or any other memory access) can interwork, and will always return in the appropriate state.

Note that most of the interworking instructions were added with ARMv5T, and that the only interworking branch available to ARMv4T is bx.

Branch instructions fall into three classes: instructions that never change state (like "b label"), instructions that always change state (like "blx label"), and instructions that automatically change state based on the target address (like "bx register").

Address-based interworking uses the lowest bit of the address to determine the instruction set at the target. If the lowest bit is 1, the branch will switch to Thumb state. If the lowest bit is 0, the branch will switch to ARM state. Note that the lowest bit is never actually used as part of the address, as all instructions are either 4-byte aligned (as in ARM) or 2-byte aligned (as in Thumb).
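
The bit-0 rule can be sketched in Python as follows. The function name is hypothetical, and the sketch models only the rule described above, not the behaviour of any particular processor when given a misaligned ARM address.

```python
def interworking_target(address):
    """Split an interworking branch address into (state, fetch address)."""
    if address & 1:
        # Bit 0 set: switch to Thumb state and fetch from the address
        # with bit 0 cleared.
        return "Thumb", address & ~1
    # Bit 0 clear: switch to (or stay in) ARM state.
    return "ARM", address


print(interworking_target(0x8001))
print(interworking_target(0x8000))
```

Both calls resolve to the same fetch address, 0x8000; only the resulting instruction set state differs.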

ARM Branch Instructions

The following table lists the branch instructions commonly used on ARM processors:

Instruction         Relativity  Linkage             Interworking
b label             Relative    Simple (none)       Never
bx register         Absolute    Simple (none)       Address-based
bl label            Relative    Function call (lr)  Never
blx label           Relative    Function call (lr)  Always
blx register        Absolute    Function call (lr)  Address-based
pop {..., pc}       Absolute    Simple (none)       Address-based (since ARMv5T)
ldr pc, =address    Absolute    Simple (none)       Address-based (since ARMv5T)

Notes:
bl label, blx label: Note that assemblers will generally select between bl label and blx label automatically, regardless of which instruction you use.
pop {..., pc}: A common return sequence in cases where lr has been pushed onto the stack at the start of the function.
ldr pc, =address: Load from a literal pool directly into pc.

It is also possible to write into the pc using arithmetic instructions, but this is useful only in specific cases [1], and use of the normal branch instructions is advisable where possible.

Most of the interworking branches were added on ARMv5T. The only way to interwork on ARMv4T was to use the bx instruction. ARMv4T interworking branch sequences are often much less efficient than the ARMv5T versions, so it's best to use ARMv5T branches unless you really need ARMv4T compatibility.

Using More Complex Branches

To encode more complex branches than those listed above, a combination of instructions must be used. In cases like this, where the target address must be calculated in advance of the branch instruction, normal methods for loading and calculating values are used. Arithmetic might be used for long-range relative branches, for example, and a constant pool load might be used for an absolute branch.

(Mostly) Typical Branch Sequences

If you have JavaScript enabled, you will be able to use the following tool to see some suggested branch sequences for specific circumstances. (If you don't have JavaScript enabled, the filter options won't work and you'll just see a big list.)

The sequences can be filtered by Relativity, Linkage, Interworking, Architecture Version and Branch Range.

Relative branch.

    b       label
The available range (in bytes) is approximately [3] ±32MB for ARM instructions and ±16MB for Thumb, though the range in Thumb is further limited if you require a conditional branch or a narrow (16-bit) instruction.

Relative function call.

    bl      label
The available range (in bytes) is approximately [3] ±32MB for ARM instructions and ±16MB for Thumb, though the range in Thumb is further limited if you require a conditional branch or a narrow (16-bit) instruction.

Relative function call with unconditional interworking.

    blx     label
The available range (in bytes) is approximately [3] ±32MB for ARM instructions and ±16MB for Thumb, though the range in Thumb is further limited if you require a conditional branch or a narrow (16-bit) instruction.
Note that blx label is one of the few instructions that cannot be conditional in ARM state.

Full-range absolute branch with address-based interworking (since ARMv5T).

    ldr     pc, =address
On ARMv5T and above, this instruction interworks based on the target address [4].
On ARMv4T, this instruction does not interwork.

Full-range absolute branch with address-based interworking.

    ldr     rA, =address       @ Load used for illustrative purposes.
    bx      rA
Any normal method can be used to load an address into rA. A literal pool load is used above for illustrative purposes. rA may be any general-purpose register except (of course) pc.
Note that since ARMv5T, a load directly into pc will also interwork. However, using a temporary register (rA) allows the address to be constructed by other means, perhaps by calculation from some base register or using the movw and movt instructions introduced in ARMv6T2.

Return sequence using stacked return address, with address-based interworking since ARMv5T.

    push    {..., lr}       @ Function entry (prologue).
    ...
    pop     {..., pc}       @ Return branch.
This is a common return sequence in functions that store the return address (lr) on the stack on entry. A push instruction has been shown to illustrate this.
On ARMv5T and above, the pop instruction interworks based on the target address [4], and thus always returns correctly, regardless of the instruction set used in the calling code.
On ARMv4T, the pop instruction does not interwork. An interworking return on ARMv4T must pop the return address into lr and then use bx lr to return.

Simple return sequence, with address-based interworking.

    bx      lr
This is a common return sequence in functions that either save and restore lr before returning, or do not write to lr at all. It is therefore common in simple leaf functions (which don't themselves call any other functions).
This sequence will interwork based on the address [4].

Full-range absolute function call with address-based interworking.

    ldr     rA, =address       @ Load used for illustrative purposes.
    blx     rA
Any normal method can be used to load an address into rA. A literal pool load is used above for illustrative purposes. rA may be any general-purpose register except (of course) pc.

Full-range absolute function call with address-based interworking for ARMv4T.

    ldr     rA, =address       @ Load used for illustrative purposes.
    mov     lr, pc
    bx      rA
Any normal method can be used to load an address into rA. A literal pool load is used above for illustrative purposes. rA may be any general-purpose register except (of course) pc or lr.
This sequence will work on all architecture versions, but it will not perform well on most recent processors. Use of blx or bl is preferred for all function calls on recent processors.

Full-range relative branch with address-based interworking.

    ldr     rA, =offset     @ Load used for illustrative purposes.
    add     rA, rA, pc      @ ← offset is measured from the apparent pc here.
    bx      rA
Any normal method can be used to load an offset into rA. A literal pool load is used above for illustrative purposes. rA may be any general-purpose register except (of course) pc.
Note that the offset used is actually an offset from the apparent value of pc, which is 8 bytes ahead in ARM code and 4 bytes ahead in Thumb code. The offset must be adjusted accordingly to compensate.
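
The adjustment described above can be sketched in Python. The helper below is hypothetical; it computes the 32-bit constant that would be placed in the literal pool so that the add produces the desired branch address. It assumes add_addr is the address of the add instruction, and that a Thumb target needs bit 0 set in the final value because the sequence ends in bx.

```python
def literal_offset(add_addr, target_addr, thumb_code=False, thumb_target=False):
    """Offset to load into rA before 'add rA, rA, pc' (hypothetical helper)."""
    # The pc operand of the add reads 8 bytes ahead in ARM code and
    # 4 bytes ahead in Thumb code.
    apparent_pc = add_addr + (4 if thumb_code else 8)
    # bx selects the instruction set from bit 0 of the final address.
    branch_value = (target_addr | 1) if thumb_target else target_addr
    # Wrap to 32 bits so that backward branches become two's complement.
    return (branch_value - apparent_pc) & 0xFFFFFFFF


print(hex(literal_offset(0x8004, 0x10000)))                     # ARM to ARM
print(hex(literal_offset(0x8004, 0x10000, thumb_target=True)))  # ARM to Thumb
print(hex(literal_offset(0x8004, 0x8000)))                      # backward
```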

Full-range relative branch with no interworking.

    ldr     rA, =offset     @ Load used for illustrative purposes.
    add     rA, rA, pc      @ ← offset is measured from the apparent pc here.
    mov     pc, rA
Any normal method can be used to load an offset into rA. A literal pool load is used above for illustrative purposes. rA may be any general-purpose register except (of course) pc.
Note that the offset used is actually an offset from the apparent value of pc, which is 8 bytes ahead in ARM code and 4 bytes ahead in Thumb code. The offset must be adjusted accordingly to compensate.

Full-range relative function call with address-based interworking.

    ldr     rA, =offset     @ Load used for illustrative purposes.
    add     rA, rA, pc      @ ← offset is measured from the apparent pc here.
    blx     rA
Any normal method can be used to load an offset into rA. A literal pool load is used above for illustrative purposes. rA may be any general-purpose register except (of course) pc.
Note that the offset used is actually an offset from the apparent value of pc, which is 8 bytes ahead in ARM code and 4 bytes ahead in Thumb code. The offset must be adjusted accordingly to compensate.

Full-range relative function call with address-based interworking, for ARMv4T.

    ldr     rA, =offset     @ Load used for illustrative purposes.
    add     rA, rA, pc      @ ← offset is measured from the apparent pc here.
    mov     lr, pc
    bx      rA
Any normal method can be used to load an offset into rA. A literal pool load is used above for illustrative purposes. rA may be any general-purpose register except (of course) pc or lr.
Note that the offset used is actually an offset from the apparent value of pc, which is 8 bytes ahead in ARM code and 4 bytes ahead in Thumb code. The offset must be adjusted accordingly to compensate.
This sequence will work on all architecture versions, but it will not perform well on most recent processors. Use of blx or bl is preferred for all function calls on recent processors.

Full-range relative function call with no interworking, for ARMv4T.

    ldr     rA, =offset     @ Load used for illustrative purposes.
    add     rA, rA, pc      @ ← offset is measured from the apparent pc here.
    mov     lr, pc
    mov     pc, rA          @ ← Interworks on ARMv7 if this is ARM code.
Any normal method can be used to load an offset into rA. A literal pool load is used above for illustrative purposes. rA may be any general-purpose register except (of course) pc or lr.
Note that the offset used is actually an offset from the apparent value of pc, which is 8 bytes ahead in ARM code and 4 bytes ahead in Thumb code. The offset must be adjusted accordingly to compensate.
This sequence will work on all architecture versions, but it will not perform well on most recent processors. Use of blx or bl is preferred for all function calls on recent processors.

Full-range relative branch, with interworking in ARMv7 ARM code only.

    ldr     rA, =offset     @ Load used for illustrative purposes.
    add     pc, pc, rA
Any normal method can be used to load an offset into rA. A literal pool load is used above for illustrative purposes. rA may be any general-purpose register except (of course) pc.
Note that the offset used is actually an offset from the apparent value of pc, which is 8 bytes ahead in ARM code and 4 bytes ahead in Thumb code. The offset must be adjusted accordingly to compensate.
This sequence will interwork based on the target address [4] on ARMv7, but only if written in ARM code.

Limited-range relative branch, with interworking in ARMv7 ARM code only.

    adr     pc, label
Short-range (or very specific long-range) branches can be constructed using Operand 2 offsets. Any arithmetic instructions may be used, but adr is shown as a typical example. In general, a b instruction would be a better choice.
This sequence will interwork based on the target address [4] on ARMv7, but only if written in ARM code.

Limited-range relative branch with address-based interworking.

    adr     rA, label
    bx      rA
Short-range (or very specific long-range) interworking branches can be constructed using Operand 2 offsets. Any arithmetic instructions may be used, but adr is shown as a typical example.

Thumb-2 Special-Purpose Branches

Finally, there are a few branches available specifically in the Thumb-2 instruction set that are designed for specific use-cases. These are not available to the ARM instruction set (or to the old Thumb-1 instruction set), and so I will give them only a brief mention, but if you're writing Thumb-2 code they can be very useful. For further details, refer to the ARMv7-A/R Architecture Reference Manual.

For each special-purpose branch, I will also give a roughly equivalent ARM implementation. The ARM implementations have different limitations (such as branch range) and have other side effects (such as requiring a scratch register). Nevertheless, they should serve to clarify the behaviour of the Thumb-2 instructions.

cbnz and cbz

The cbnz (compare, branch on non-zero) and cbz (compare, branch on zero) instructions are useful for very short-range forward branches, such as loop terminations, that would otherwise require two or more instructions. The two-instruction version is still available, of course, and may be useful if more range is required, or if a more complicated comparison is required.

ARM Implementation              Thumb-2 Implementation

    cmp     rA, #0                  cbz     rA, label
    beq     label

    cmp     rA, #0                  cbnz    rA, label
    bne     label
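
To show how short "very short-range" is, here is a small Python sketch of the reach of cbz and cbnz. The function name is hypothetical, and the limits are an assumption based on the usual Thumb-2 encoding: a forward-only, even offset of 0 to 126 bytes, measured from the apparent pc (the instruction's address plus 4).

```python
def cbz_can_reach(instr_addr, target_addr):
    """True if a cbz or cbnz at instr_addr can branch to target_addr."""
    # In Thumb state, the pc reads 4 bytes ahead of the instruction.
    offset = target_addr - (instr_addr + 4)
    # The offset is unsigned (forward only), even, and at most 126 bytes.
    return 0 <= offset <= 126 and offset % 2 == 0


print(cbz_can_reach(0x8000, 0x8004))   # nearest reachable target
print(cbz_can_reach(0x8000, 0x8082))   # farthest reachable target
print(cbz_can_reach(0x8000, 0x7FF0))   # backward branches are not possible
```

When the check fails, the two-instruction cmp and conditional branch form is the fallback.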

tbb and tbh

The tbb (table branch byte) and tbh (table branch halfword) instructions are useful for the implementation of jump tables. One argument register is a base pointer to a table, and the second argument is an index into the table. The value loaded from the table is then doubled and added to the pc.

ARM Implementation                  Thumb-2 Implementation

    ldrb    ip, [rA, rB]                tbb     [rA, rB]
    add     pc, pc, ip, lsl #1

    add     ip, rA, rB, lsl #1          tbh     [rA, rB, lsl #1]
    ldrh    ip, [ip]                    @ ARM ldrh has no scaled register offset.
    add     pc, pc, ip, lsl #1
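
The target calculation performed by tbb and tbh can be sketched in Python. The function name, the table contents and the addresses are hypothetical. The sketch assumes that the doubled table entry is added to the apparent pc, which is the address of the table branch instruction plus 4 in Thumb state.

```python
def table_branch_target(instr_addr, table, index):
    """Target of a tbb or tbh at instr_addr (hypothetical model)."""
    # In Thumb state, the pc reads 4 bytes ahead of the instruction.
    apparent_pc = instr_addr + 4
    # Entries are unsigned halfword counts: bytes for tbb, halfwords for tbh.
    return apparent_pc + 2 * table[index]


# A three-entry table for a tbb placed at 0x8000.
offsets = [2, 6, 10]
for index in range(len(offsets)):
    print(index, hex(table_branch_target(0x8000, offsets, index)))
```

Because the entries are unsigned, every case in the table has to lie after the table branch instruction.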

[1] A typical example of where arithmetic-based branches are useful is in the implementation of jump tables, but they are occasionally useful in other cases.

[2] Old architecture variants that do not have Thumb at all support all of the ARMv4 branches that cannot interwork. (Filter by Interworking=Never and Architecture=ARMv4T to see these.)

[3] There is some asymmetry in the ranges available to many branch instructions, for two reasons. Firstly, the pc reads ahead by 8 bytes in ARM mode, or 4 bytes in Thumb mode. Secondly, the offset field is encoded as a simple signed integer (with width varying from instruction to instruction), so overall the branch range is offset slightly. In practice, this rarely matters.

[4] Bit 0 of the address indicates the instruction set of the target. If 1, the target is Thumb. If 0, the target is ARM.



Jacob Bramley, Embedded Software Engineer, ARM. Jacob is interested in most technical subjects, but has particular interests in code generation and hand-optimization of assembly. He also has a fascination with hardware and its interactions with software, and will happily (if inefficiently) spend hours staring at pipeline diagrams in order to save one or two cycles here and there.
