In this post, I will explain the various branch and call instructions available in the ARM and Thumb instruction sets, and why the variants exist. Finally, I will provide a JavaScript tool that can help you find a typical branch sequence matching your requirements.
What Does a Branch Do?
A branch, quite simply, is a break in the sequential flow of instructions that the processor is executing. Some other architectures call them jumps, but they're essentially the same thing. The following is a trivial, and hopefully familiar, example of a branch:
entry_point:
mov r0, #0 @ Set r0 to 0.
b target @ Jump forward to 'target'.
mov r0, #1 @ Set r0 to 1.
target:
... @ At this point, r0 holds the value 0.
... @ The second mov instruction did not execute.
There are several variants of branches in the ARM and Thumb instruction sets. Several of these variants are shared with many other CPU architectures, but a few are specific to ARM. Each variant is explained in detail below:
Relative and Absolute Branch Targets
A relative branch is one where the target address is
calculated based on the value of the current pc
(program counter). Given the example above, an assembler would
work out that the target label is eight bytes
ahead of the b target instruction (in ARM
code) and then generate a relative branch which means 'jump
forward by eight bytes'. Relative branches are essential for
position-independent code, which is expected to run correctly
at any location in memory. The most common relative branches
on ARM are single instructions and tend to be the most
efficient branches available, though they have limited range.
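As a rough illustration of the arithmetic an assembler performs here, the following Python sketch works out the offsets for the example above. The addresses (0x8004, 0x800C) and the function name are invented for illustration. The sketch assumes ARM state, where the pc reads 8 bytes ahead of the current instruction and the offset field of b counts 4-byte words.

```python
ARM_PC_READ_AHEAD = 8  # In ARM state, pc reads as the instruction address + 8.

def arm_branch_offsets(branch_addr, target_addr):
    """Return (distance from the branch, offset from pc, offset in words)."""
    distance = target_addr - branch_addr
    pc_offset = target_addr - (branch_addr + ARM_PC_READ_AHEAD)
    if pc_offset % 4 != 0:
        raise ValueError("ARM branch targets must be 4-byte aligned")
    return distance, pc_offset, pc_offset // 4

# Hypothetical layout: 'b target' sits at 0x8004, the second mov at 0x8008,
# and the 'target' label at 0x800C.
print(arm_branch_offsets(0x8004, 0x800C))  # (8, 0, 0)
```

The target is eight bytes ahead of the branch, but because the pc already reads eight bytes ahead, the offset relative to the pc in this particular case is zero.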
An absolute branch will always jump to the specified
address, regardless of the current pc. Absolute
branches are used when the address of the target is provided as
a function pointer, for example. However, because an absolute
branch requires a full 32-bit target address, absolute branches
usually require a load or some other constant-loading mechanism
in addition to the branch instruction itself.
In many cases, the programmer (or compiler) may not actually care whether a branch is relative or absolute, and might just use whichever is most efficient on a case-by-case basis.
Branch Range
Because the ARM instruction set is fixed-width at 32 bits (and
Thumb instructions are either 16 or 32 bits wide), it is not possible to encode a
full 32-bit branch offset in a single instruction. Relative
branches can be encoded using a limited-range offset from the
current pc. In assembly code, this is usually
written as a branch to a label (as in the example above). The
assembler will work out the required offset.
The range available varies between ARM and Thumb (and in a few cases also between instruction variants) but is usually very large and quite sufficient for most branches within a program. By using various combinations of additional instructions and literal pool loads, it is also possible to construct arbitrary full-range branches in case the single-instruction range is not sufficient. All practical absolute branches are necessarily full-range, since a 32-bit target address needs to be loaded.
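To give a feel for the numbers, here is a minimal Python sketch that derives the reach of a branch from the width of its offset field. The helper name and its parameters are my own. The two field descriptions used below (a 24-bit word offset for ARM-state b and bl, and an 11-bit halfword offset for the 16-bit Thumb unconditional b) are taken from the instruction encodings. Other variants have different widths.

```python
def branch_reach(field_bits, unit_bytes, pc_read_ahead):
    """Return (lowest, highest) target, in bytes relative to the branch itself.

    field_bits:    width of the signed offset field in the encoding
    unit_bytes:    how many bytes one unit of the field represents
    pc_read_ahead: how far ahead of the instruction the pc reads
    """
    lowest = -(1 << (field_bits - 1)) * unit_bytes + pc_read_ahead
    highest = ((1 << (field_bits - 1)) - 1) * unit_bytes + pc_read_ahead
    return lowest, highest

# ARM-state b and bl: 24-bit field counting words, pc reads 8 bytes ahead.
print(branch_reach(24, 4, 8))   # (-33554424, 33554436)
# 16-bit Thumb unconditional b: 11-bit field counting halfwords, pc + 4.
print(branch_reach(11, 2, 4))   # (-2044, 2050)
```

So an ARM-state b reaches roughly 32MB in either direction, which is why the single-instruction form is sufficient for most programs.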
Function Calls
Almost every modern programming language has some concept of
functions.
Any given function can (in general) be called from any part of
a program, so processor architectures need some way to store
the address of the caller. On ARM processors, this return
address is stored in lr (the link register).
Branch instructions with an l suffix — like
bl and blx — work just like a
standard b or bx branch, but also
store a return address in lr.
If a function does not modify lr, then the return
sequence can (and should) be a simple
"bx lr". Otherwise, the lr
can be pushed onto the stack at function entry. From here,
the best return sequence is usually to pop
directly into pc, though a number of other options
are possible depending on the situation.
Interworking Branches (Between ARM and Thumb Code)
Programs on ARM processors can use either the ARM or Thumb
instruction set, or both. Whilst ARM and Thumb instructions
cannot be directly interleaved, it is possible to switch (or
interwork) between ARM and Thumb states at run-time.
This interworking is most notably achieved using special branch
instructions with an x suffix, like
bx and blx. Several other branch
mechanisms are also capable of interworking. For example, the
return sequence which writes to the pc using
pop (or any other memory access) can interwork,
and will always return in the appropriate state.
Note that most of the interworking instructions were added with
ARMv5T, and that the only interworking branch available to
ARMv4T is bx.
Branch instructions fall into three classes: instructions that
never change state (like "b label"),
instructions that always change state (like
"blx label"), and instructions that
automatically change state based on the target address (like
"bx register").
Address-based interworking uses the lowest bit of the address to determine the instruction set at the target. If the lowest bit is 1, the branch will switch to Thumb state. If the lowest bit is 0, the branch will switch to ARM state. Note that the lowest bit is never actually used as part of the address, as all instructions are either 4-byte aligned (as in ARM) or 2-byte aligned (as in Thumb).
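The rule can be modelled in a few lines of Python. This is only a sketch of the behaviour described above, with invented function names and addresses. The second helper reflects the fact that a bl or blx executed in Thumb state leaves bit 0 set in lr, which is what allows an interworking return to land back in Thumb state.

```python
def interworking_target(address):
    """Split a branch target into (instruction set, address actually fetched)."""
    if address & 1:
        return "Thumb", address & ~1   # Bit 0 selects Thumb and is then dropped.
    return "ARM", address

def thumb_return_address(next_instruction_addr):
    """Value a Thumb-state bl or blx leaves in lr: bit 0 is set."""
    return next_instruction_addr | 1

print(interworking_target(0x8000))   # ('ARM', 32768)
print(interworking_target(0x9001))   # ('Thumb', 36864)
print(interworking_target(thumb_return_address(0x9004)))  # ('Thumb', 36868)
```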
ARM Branch Instructions
The following table lists the branch instructions commonly used on ARM processors:
| Instruction | Relativity | Linkage | Interworking | Notes |
|---|---|---|---|---|
| b label | Relative | Simple (none) | Never | |
| bx register | Absolute | Simple (none) | Address-based | |
| bl label | Relative | Function call (lr) | Never | Note that assemblers will generally select between bl label and blx label automatically, regardless of which instruction you use. |
| blx label | Relative | Function call (lr) | Always | Note that assemblers will generally select between bl label and blx label automatically, regardless of which instruction you use. |
| blx register | Absolute | Function call (lr) | Address-based | |
| pop {..., pc} | Absolute | Simple (none) | Address-based (since ARMv5T) | A common return sequence in cases where lr has been pushed onto the stack at the start of the function. |
| ldr pc, =address | Absolute | Simple (none) | Address-based (since ARMv5T) | Load from a literal pool directly into pc. |
It is also possible to write into the pc using
arithmetic instructions, but this is useful only in specific
cases [1], and use
of the normal branch instructions is advisable where possible.
Most of the interworking branches were added on ARMv5T. The
only way to interwork on ARMv4T was to use the
bx instruction. ARMv4T interworking branch
sequences are often much less efficient than the ARMv5T
versions, so it's best to use ARMv5T branches unless you really
need ARMv4T compatibility.
Using More Complex Branches
To encode more complex branches than those listed above, a combination of instructions must be used. In cases like this, where the target address must be calculated in advance of the branch instruction, the normal methods for loading and calculating values are used. Arithmetic might be used for long-range relative branches, for example, and a constant pool load might be used for an absolute branch.
(Mostly) Typical Branch Sequences
If you have JavaScript enabled, you will be able to use the following tool to see some suggested branch sequences for specific circumstances. (If you don't have JavaScript enabled, the filter options won't work and you'll just see a big list.)
Relative branch.
b label
Relative function call.
bl label
Relative function call with unconditional interworking.
blx label
blx label is one
of the few instructions that cannot be conditional in
ARM state.
Full-range absolute branch with address-based interworking (since ARMv5T).
ldr pc, =address
Full-range absolute branch with address-based interworking.
ldr rA, =address @ Load used for illustrative purposes.
bx rA
The target address is first placed in rA. A literal pool load is used above for illustrative purposes. rA may be any general-purpose register except (of course) pc.
On ARMv5T and later, a load directly into pc will also interwork. However, using a temporary register (rA) allows the address to be constructed by other means, perhaps by calculation from some base register or using the movw and movt instructions introduced in ARMv6T2.
Return sequence using stacked return address, with address-based interworking since ARMv5T.
push {..., lr} @ Function entry (prologue).
...
pop {..., pc} @ Return branch.
This sequence requires that the function stores the return address (lr) on the stack on entry. A push instruction has been shown to illustrate this.
On ARMv5T and later, the pop instruction interworks based on the target address [4], and thus always returns correctly, regardless of the instruction set used in the calling code.
On ARMv4T, the pop instruction does not interwork. An interworking return on ARMv4T must pop the return address into lr and then use bx lr to return.
Simple return sequence, with address-based interworking.
bx lr
This sequence is suitable for functions that restore lr before returning, or do not write to lr at all. It is therefore common in simple leaf functions (which don't themselves call any other functions).
Full-range absolute function call with address-based interworking.
ldr rA, =address @ Load used for illustrative purposes.
blx rA
The target address is first placed in rA. A literal pool load is used above for illustrative purposes. rA may be any general-purpose register except (of course) pc.
Full-range absolute function call with address-based interworking for ARMv4T.
ldr rA, =address @ Load used for illustrative purposes.
mov lr, pc
bx rA
The target address is first placed in rA. A literal pool load is used above for illustrative purposes. rA may be any general-purpose register except (of course) pc or lr.
Using blx or bl is preferred for all function calls on recent processors.
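The reason "mov lr, pc" produces a usable return address can be checked with a little arithmetic. The Python sketch below assumes the sequence runs in ARM state (4-byte instructions, pc reading 8 bytes ahead). The addresses and names are hypothetical.

```python
ARM_PC_READ_AHEAD = 8
ARM_INSTRUCTION_SIZE = 4

def armv4t_call_return_address(mov_lr_addr):
    """Value that 'mov lr, pc' writes to lr when executed in ARM state."""
    return mov_lr_addr + ARM_PC_READ_AHEAD

mov_lr_addr = 0x8004                           # mov lr, pc
bx_addr = mov_lr_addr + ARM_INSTRUCTION_SIZE   # bx rA
after_call = bx_addr + ARM_INSTRUCTION_SIZE    # first instruction after the call

print(armv4t_call_return_address(mov_lr_addr) == after_call)  # True
```

Because the pc reads two instructions ahead, lr ends up pointing just past the bx, which is exactly where the callee should return to.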
Full-range relative branch with address-based interworking.
ldr rA, =offset @ Load used for illustrative purposes.
add rA, rA, pc @ ← offset is measured from the apparent pc here.
bx rA
The offset is first placed in rA. A literal pool load is used above for illustrative purposes. rA may be any general-purpose register except (of course) pc.
The offset is measured from the apparent pc, which is 8 bytes ahead in ARM code and 4 bytes ahead in Thumb code. The offset must be adjusted accordingly to compensate.
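As a worked example of that adjustment, the following Python sketch computes the value that would need to go in the literal pool. The function name and the addresses are made up for illustration. The read-ahead values are the ones quoted above.

```python
PC_READ_AHEAD = {"ARM": 8, "Thumb": 4}

def offset_literal(add_addr, target_addr, state="ARM"):
    """Offset to load so that 'add rA, rA, pc' at add_addr yields target_addr."""
    apparent_pc = add_addr + PC_READ_AHEAD[state]
    return target_addr - apparent_pc

# Hypothetical ARM-state layout: the add is at 0x8004, the target at 0x02000000.
offset = offset_literal(0x8004, 0x02000000, "ARM")
print(hex(offset))               # 0x1ff7ff4
print(hex(0x8004 + 8 + offset))  # 0x2000000
```

In practice the assembler can usually compute this for you from label arithmetic, so the sketch is only meant to show where the 8 (or 4) comes from.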
Full-range relative branch with no interworking.
ldr rA, =offset @ Load used for illustrative purposes.
add rA, rA, pc @ ← offset is measured from the apparent pc here.
mov pc, rA
The offset is first placed in rA. A literal pool load is used above for illustrative purposes. rA may be any general-purpose register except (of course) pc.
The offset is measured from the apparent pc, which is 8 bytes ahead in ARM code and 4 bytes ahead in Thumb code. The offset must be adjusted accordingly to compensate.
Full-range relative function call with address-based interworking.
ldr rA, =offset @ Load used for illustrative purposes.
add rA, rA, pc @ ← offset is measured from the apparent pc here.
blx rA
The offset is first placed in rA. A literal pool load is used above for illustrative purposes. rA may be any general-purpose register except (of course) pc.
The offset is measured from the apparent pc, which is 8 bytes ahead in ARM code and 4 bytes ahead in Thumb code. The offset must be adjusted accordingly to compensate.
Full-range relative function call with address-based interworking, for ARMv4T.
ldr rA, =offset @ Load used for illustrative purposes.
add rA, rA, pc @ ← offset is measured from the apparent pc here.
mov lr, pc
bx rA
The offset is first placed in rA. A literal pool load is used above for illustrative purposes. rA may be any general-purpose register except (of course) pc or lr.
The offset is measured from the apparent pc, which is 8 bytes ahead in ARM code and 4 bytes ahead in Thumb code. The offset must be adjusted accordingly to compensate.
Using blx or bl is preferred for all function calls on recent processors.
Full-range relative function call with no interworking, for ARMv4T.
ldr rA, =offset @ Load used for illustrative purposes.
add rA, rA, pc @ ← offset is measured from the apparent pc here.
mov lr, pc
mov pc, rA @ ← Interworks on ARMv7 if this is ARM code.
The offset is first placed in rA. A literal pool load is used above for illustrative purposes. rA may be any general-purpose register except (of course) pc or lr.
The offset is measured from the apparent pc, which is 8 bytes ahead in ARM code and 4 bytes ahead in Thumb code. The offset must be adjusted accordingly to compensate.
Using blx or bl is preferred for all function calls on recent processors.
Full-range relative branch, with interworking in ARMv7 ARM code only.
ldr rA, =offset @ Load used for illustrative purposes.
add pc, pc, rA
The offset is first placed in rA. A literal pool load is used above for illustrative purposes. rA may be any general-purpose register except (of course) pc.
The offset is measured from the apparent pc, which is 8 bytes ahead in ARM code and 4 bytes ahead in Thumb code. The offset must be adjusted accordingly to compensate.
Limited-range relative branch, with interworking in ARMv7 ARM code only.
adr pc, label
adr is shown as a typical example. In
general, a b instruction would be a better
choice.
Limited-range relative branch with address-based interworking.
adr rA, label
bx rA
adr is shown as a typical example.
Thumb-2 Special-Purpose Branches
Finally, there are a few branches available specifically in the Thumb-2 instruction set that are designed for specific use-cases. These are not available to the ARM instruction set (or to the old Thumb-1 instruction set), and so I will give them only a brief mention, but if you're writing Thumb-2 code they can be very useful. For further details, refer to the ARMv7-A/R Architecture Reference Manual.
For each special-purpose branch, I will also give a roughly equivalent ARM implementation. The ARM implementations have different limitations (such as branch range) and have other side effects (such as requiring a scratch register). Nevertheless, they should serve to clarify the behaviour of the Thumb-2 instructions.
cbnz and cbz
The cbnz (compare, branch on non-zero) and
cbz (compare, branch on zero) instructions are
useful for very short-range forward branches, such as loop
terminations, that would otherwise require two or more
instructions. The two-instruction version is still available,
of course, and may be useful if more range is required, or if a
more complicated comparison is required.
| ARM Implementation | Thumb-2 Implementation |
|---|---|
|
|
|
|
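To make "very short-range forward branches" concrete, here is a small Python sketch that checks whether a cbz or cbnz can reach a given target. It assumes the Thumb-2 encoding, in which the offset is an unsigned 6-bit count of halfwords (so 0 to 126 bytes) measured from the apparent pc, 4 bytes ahead of the instruction. The function name and addresses are hypothetical.

```python
THUMB_PC_READ_AHEAD = 4
CBZ_MAX_OFFSET = 126  # 6-bit unsigned field counting halfwords: 0, 2, ..., 126.

def cbz_can_reach(cbz_addr, target_addr):
    """True if a cbz/cbnz at cbz_addr can branch directly to target_addr."""
    offset = target_addr - (cbz_addr + THUMB_PC_READ_AHEAD)
    return 0 <= offset <= CBZ_MAX_OFFSET and offset % 2 == 0

print(cbz_can_reach(0x8000, 0x8004))  # True: the nearest reachable target.
print(cbz_can_reach(0x8000, 0x8082))  # True: the furthest reachable target.
print(cbz_can_reach(0x8000, 0x8084))  # False: just out of range.
print(cbz_can_reach(0x8000, 0x7FF0))  # False: backward branches are not possible.
```

When the check fails, the two-instruction compare-and-branch form is the fallback.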
tbb and tbh
The tbb (table branch byte) and tbh
(table branch halfword) instructions are useful for the
implementation of jump tables. One argument register is a base
pointer to a table, and the second argument is an index into
the table. The value loaded from the table is then doubled
and added to the pc.
| ARM Implementation | Thumb-2 Implementation |
|---|---|
|
|
|
|
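The target calculation performed by tbb and tbh can be modelled in Python as follows. This is a sketch only: the function names, table contents and addresses are invented, the table is assumed to be little-endian data, and the pc is taken to read 4 bytes ahead of the instruction, as it does elsewhere in Thumb code.

```python
import struct

THUMB_PC_READ_AHEAD = 4

def tbb_target(instr_addr, table, index):
    """Target of 'tbb [rA, rB]': table is the bytes at rA, index is rB."""
    return instr_addr + THUMB_PC_READ_AHEAD + 2 * table[index]

def tbh_target(instr_addr, table, index):
    """Target of 'tbh [rA, rB, lsl #1]': entries are little-endian halfwords."""
    (entry,) = struct.unpack_from("<H", table, 2 * index)
    return instr_addr + THUMB_PC_READ_AHEAD + 2 * entry

byte_table = bytes([2, 5, 9])
half_table = struct.pack("<3H", 2, 300, 1000)
print(hex(tbb_target(0x8000, byte_table, 1)))  # 0x800e
print(hex(tbh_target(0x8000, half_table, 2)))  # 0x87d4
```

Since the table entries are unsigned, both instructions can only branch forwards, with tbh trading a larger table for a longer reach.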
[1] A typical example of where arithmetic-based branches are useful is in the implementation of jump tables, but they are occasionally useful in other cases.
[2] Old architecture variants that do not have Thumb at all support all of the ARMv4 branches that cannot interwork. (Filter by Interworking=Never and Architecture=ARMv4T to see these.)
[3] There is some asymmetry in the ranges available to many
branch instructions, for two reasons. Firstly, the
pc reads ahead by 8 bytes in ARM state, or 4
bytes in Thumb state. Secondly, the offset field is encoded
as a simple signed integer (with width varying from
instruction to instruction), so overall the branch range is
offset slightly. In practice, this rarely matters.
[4] Bit 0 of the address indicates the instruction set of the target. If 1, the target is Thumb. If 0, the target is ARM.
Jacob Bramley, Embedded Software Engineer, ARM. Jacob is interested in most technical subjects, but has particular interests in code generation and hand-optimization of assembly. He also has a fascination with hardware and its interactions with software, and will happily (if inefficiently) spend hours staring at pipeline diagrams in order to save one or two cycles here and there.












