12.3 Optimisation

Code Optimisation, Profiling, and Power

Writing code that runs is only the start. On a microcontroller, where you may have only a few hundred kilobytes of flash and a few dozen kilobytes of RAM, writing efficient code is just as important as writing correct code. Rust offers both traditional C/C++-style optimisation controls and its own unique advantages through the type system, compiler, and async model. To understand how to make the most of the ESP32, we need to look at how Rust and LLVM optimise code, how to profile and measure performance, and how to manage power in a world where battery life often matters more than raw speed.

LLVM (https://llvm.org/), originally short for Low-Level Virtual Machine, is a modern compiler infrastructure used by both Clang (for C/C++) and rustc (for Rust). It provides a powerful, modular optimisation pipeline and a common intermediate representation (IR) that allows languages to share the same back-end code generation technology. Its importance lies in the fact that both C and Rust benefit from the same optimisation passes (such as inlining, loop unrolling, constant folding, dead-code elimination, and link-time optimisation), resulting in highly efficient machine code across many architectures. For embedded developers, LLVM ensures that Rust can achieve performance and binary sizes comparable to well-optimised C, and that improvements to LLVM automatically benefit both languages without requiring language-specific reinvention.

How Rust Produces Efficient Code: Compiler Optimisations

Because Rust uses the LLVM backend, it shares the same optimisation engine as modern C and C++. This puts Rust on equal footing with well-optimised C for embedded development, which is not true for many other languages. You control optimisation levels through your project’s Cargo.toml, typically in the [profile.release] section:

opt-level = 3 — Maximum speed. LLVM aggressively inlines, vectorises, and unrolls loops. Perfect for performance-critical paths like DSP, cryptography, or tight sensor loops.
opt-level = "z" — Optimise for smallest binary size. Essential on microcontrollers with limited flash storage.
lto = true — LTO (Link-Time Optimisation). A powerful optimisation where LLVM treats your entire program (including dependencies) as one unit. This allows cross-crate inlining, dead-code elimination, and constant-folding across library boundaries. Enabling LTO often shrinks embedded binaries dramatically while improving execution speed.

A concrete Cargo.toml configuration for an embedded release build combines a size-oriented opt-level with LTO. Three additional options are also commonly applied: codegen-units, strip, and panic.

[profile.release]
opt-level     = "z"   # minimise binary size; switch to 3 for max speed
lto           = true  # whole-program optimisation across crate boundaries
codegen-units = 1     # one codegen unit: slower builds, widest optimisation scope alongside LTO
strip         = true  # remove debug symbols from the final binary
panic         = "abort" # abort immediately on panic instead of unwinding
                        # (saves the unwinding machinery, typically ~10-20 KB)

panic = "abort" is almost always appropriate for embedded targets: the no_std panic handler (e.g., panic-halt) already halts the device on panic anyway, so the unwinding machinery serves no purpose and only wastes flash space.

Profiling and Benchmarking on Embedded Systems

Optimisation without measurement is guesswork. Embedded developers must be able to see where their code spends time, and Rust provides several tools to do this safely and clearly.

Rust-side profiling tools: defmt logging with timestamps. defmt supports extremely compact binary logs and timestamping, which lets you measure the interval between events with minimal overhead. This is useful for checking the latency of async tasks, measuring the duration of network operations, and verifying timing-sensitive protocols.
Hardware-side profiling: For deeper timing analysis, CPU cycle counters provide cycle-accurate measurements directly from the core. On ARM Cortex-M devices this is the DWT (Data Watchpoint and Trace) unit; the Xtensa and RISC-V cores in the ESP32 family are not ARM cores and expose their own cycle counters instead. We can also use GPIO toggling with a logic analyser, a classic trick that is still widely used: toggle a pin in your code and measure it externally. The software overhead is a single register write, the resolution is limited mainly by the analyser's sample rate (often nanoseconds), and it works even when debugging tools are unavailable.

Embedded C developers might use gprof, vendor IDE tools, FreeRTOS trace hooks, or semihosting timers. Comparable options exist in Rust, which adds structured, low-overhead logging via defmt and ergonomic access to hardware timers.
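The interval-measurement idea above can be sketched on a host machine. This is a minimal illustration that assumes std::time::Instant as a stand-in for a hardware timer or cycle counter; on a device you would read the HAL's timer or rely on defmt timestamps instead. The names measure, checksum, and time_checksum are hypothetical.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Runs `work` once and returns its result together with the elapsed time.
fn measure<R>(work: impl FnOnce() -> R) -> (R, Duration) {
    let start = Instant::now();
    let result = work();
    (result, start.elapsed())
}

/// Hypothetical workload: a simple wrapping checksum over a buffer.
fn checksum(data: &[u8]) -> u32 {
    data.iter()
        .fold(0u32, |acc, &b| acc.wrapping_mul(31).wrapping_add(u32::from(b)))
}

/// Times the checksum; `black_box` discourages the optimiser from
/// precomputing the result and removing the work being measured.
fn time_checksum(data: &[u8]) -> (u32, Duration) {
    measure(|| checksum(black_box(data)))
}
```

Calling time_checksum(&buffer) returns the checksum and a Duration. A single run is noisy, so in practice you would repeat the measurement and look at the minimum or median.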

Minimising Heap Usage: Predictability and Reliability

A deterministic memory footprint is essential in embedded systems. Rust offers three distinct strategies depending on the development path:

std + ESP-IDF: Uses the FreeRTOS heap. This is convenient but can fragment over time.
no_std: No allocator by default. You rely on stack-allocated structures or static buffers.
no_std with alloc: Add a custom allocator only when needed (often for async executors). Still far more controlled than a general-purpose heap.

The recommended tool is the heapless crate, which provides heapless::Vec, heapless::String, and heapless::spsc::Queue. These all use fixed-capacity buffers, which avoids dynamic allocation, fragmentation, and unpredictable failure modes. This gives you C-style determinism with Rust-style safety.
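To make the fixed-capacity idea concrete, here is a minimal sketch of such a container written with only the standard library. FixedVec is a hypothetical type, not the heapless API; it only illustrates the principle that storage is sized at compile time and that a full buffer is reported as an ordinary error value, which is also how heapless::Vec::push behaves.

```rust
/// Hypothetical fixed-capacity vector: the storage is an inline array
/// (on the stack or in a static), so no allocator is involved.
struct FixedVec<T: Copy + Default, const N: usize> {
    items: [T; N],
    len: usize,
}

impl<T: Copy + Default, const N: usize> FixedVec<T, N> {
    /// Creates an empty vector with capacity `N`.
    fn new() -> Self {
        Self { items: [T::default(); N], len: 0 }
    }

    /// Appends a value, or hands it back as `Err` when the buffer is full.
    fn push(&mut self, value: T) -> Result<(), T> {
        if self.len == N {
            return Err(value);
        }
        self.items[self.len] = value;
        self.len += 1;
        Ok(())
    }

    /// Returns the elements stored so far.
    fn as_slice(&self) -> &[T] {
        &self.items[..self.len]
    }
}
```

Because a failed push is a Result rather than an allocation failure at some later point, the caller has to decide up front what happens when the buffer is full.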

Inline Functions, Loop Unrolling, and High-Level Iterator Optimisation

Inlining removes the overhead of a function call by placing the function’s code directly at the call site, while loop unrolling reduces the number of branch instructions inside a loop; both techniques reduce instruction count and control-flow overhead, which can significantly improve performance on embedded systems with limited processing power.

C/C++ developers often reach for inline or #pragma unroll to shape performance. Rust provides similar tools:

#[inline] or #[inline(always)] for encouraging the compiler to inline hot functions.
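LLVM unrolls loops automatically at speed-oriented levels such as opt-level 3; at the size-oriented levels ("s" and "z") it largely avoids unrolling, because unrolling increases code size.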

More importantly, Rust’s iterator chains, for example:

    values.iter().map(|v| v * 2).filter(|v| *v > 10)

are often compiled into a single, highly optimised loop with no allocation, no temporary objects, and no function calls. Thanks to monomorphisation and LLVM analysis, Rust’s high-level iterator code typically matches equivalent hand-written C loops, and can occasionally beat indexed loops because the iterator form lets the compiler remove bounds checks.
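As a small illustration, the chain above can be given a terminal operation and compared with a hand-written loop. The function names are hypothetical, and the claim here is only that both produce the same result; whether they compile to the same machine code depends on the optimisation level and target.

```rust
/// Iterator version: the adaptors are lazy, so no intermediate collection
/// is built; all the work happens while `sum` drives the chain.
fn sum_doubled_over_ten(values: &[u32]) -> u32 {
    values.iter().map(|v| v * 2).filter(|v| *v > 10).sum()
}

/// Hand-written equivalent in the style of an indexed C loop.
/// `#[inline]` is a hint that the compiler may inline this at call sites.
#[inline]
fn sum_doubled_over_ten_loop(values: &[u32]) -> u32 {
    let mut total = 0;
    for i in 0..values.len() {
        let doubled = values[i] * 2;
        if doubled > 10 {
            total += doubled;
        }
    }
    total
}
```

For the input [1, 5, 6, 10], the doubled values are 2, 10, 12, and 20; only 12 and 20 pass the filter, so both functions return 32.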

Inline Assembly for Rare, Low-Level Control

Both C/C++ and Rust allow inline assembly when absolutely necessary. For example:

C: asm volatile(...)
Rust: core::arch::asm!

Typical uses:

precise timing loops (nop, or no-operation, delays)
reading/writing special registers
extremely optimised bit-banging

This should be used sparingly: inline assembly bypasses Rust’s safety and portability guarantees. Most HALs already provide safe wrappers for such operations.
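A minimal sketch of a nop delay with core::arch::asm! follows. It assumes a target on which the nop mnemonic and stable inline assembly are available (the cfg list below is an assumption covering common hosts and RISC-V); on any other target it falls back to core::hint::spin_loop. The helper names nop and delay_nops are hypothetical, and a real driver would normally use the HAL's delay routines.

```rust
/// Executes one `nop` on architectures where stable inline assembly
/// and the `nop` mnemonic are both available.
#[cfg(any(
    target_arch = "x86_64",
    target_arch = "aarch64",
    target_arch = "riscv32",
    target_arch = "riscv64"
))]
#[inline(always)]
fn nop() {
    // SAFETY: `nop` reads and writes no memory, uses no stack, and leaves flags alone.
    unsafe {
        core::arch::asm!("nop", options(nomem, nostack, preserves_flags));
    }
}

/// Fallback for other targets: a spin-loop hint instead of inline assembly.
#[cfg(not(any(
    target_arch = "x86_64",
    target_arch = "aarch64",
    target_arch = "riscv32",
    target_arch = "riscv64"
)))]
#[inline(always)]
fn nop() {
    core::hint::spin_loop();
}

/// Hypothetical busy-wait: issues `count` nops and returns how many were issued.
/// The real delay depends on clock speed, pipeline, and loop overhead.
fn delay_nops(count: u32) -> u32 {
    for _ in 0..count {
        nop();
    }
    count
}
```

The unsafe block is confined to one small function with a stated safety argument, which is the same pattern HAL crates use when they wrap such instructions.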

Cross-Compiling: The Rust Workflow Advantage

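Cross-compiling means building code on one machine (the host) to run on another (the target). I have spent more hours than I care to remember cross-compiling C/C++ code, and the effort is usually justified: building the Linux kernel for a Raspberry Pi on a desktop Linux PC, rather than on the Pi itself, reduces the task from days to hours.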

In Rust, the target triple is a string that identifies the architecture, vendor, operating system, and ABI for which your code is being compiled. It tells the compiler exactly which instructions, libraries, and calling conventions to use, so choosing the correct triple is essential when cross-compiling: it ensures the generated machine code matches the CPU and environment of your embedded device. For an ESP32-C3, which uses a RISC-V (RV32IMC) core, the typical target triple is riscv32imc-unknown-none-elf; this is the standard target for no_std Rust applications on that chip. For an ESP32-S3, which uses the Xtensa LX7 rather than RISC-V, it is xtensa-esp32s3-none-elf. Xtensa targets cannot be installed with rustup target add; they currently require Espressif's toolchain, installed with espup.
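To show how the fields of a triple line up, here is a small sketch that splits a four-field triple into the parts named above. Triple and parse_triple are hypothetical helpers written for illustration; they are not part of rustc or Cargo, and real triples do not always have four fields (thumbv7em-none-eabihf, for example, has three), which this sketch simply rejects.

```rust
/// The four fields of a target triple, borrowed from the input string.
#[derive(Debug, PartialEq)]
struct Triple<'a> {
    arch: &'a str,
    vendor: &'a str,
    os: &'a str,
    abi: &'a str,
}

/// Splits a triple of the form arch-vendor-os-abi.
/// Returns `None` for any other number of fields.
fn parse_triple(triple: &str) -> Option<Triple<'_>> {
    let mut parts = triple.split('-');
    let parsed = Triple {
        arch: parts.next()?,
        vendor: parts.next()?,
        os: parts.next()?,
        abi: parts.next()?,
    };
    // Reject triples with extra fields rather than guessing their meaning.
    if parts.next().is_some() {
        return None;
    }
    Some(parsed)
}
```

For riscv32imc-unknown-none-elf this yields the architecture riscv32imc, the vendor unknown, the operating system none (bare metal), and elf as the final field.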

Cross-compiling in Rust is intentionally simple:

Step 1: Install a target:

    rustup target add riscv32imc-unknown-none-elf

Step 2: Specify a .cargo/config.toml that defines:

the target triple
linker script
runner (e.g., probe-rs run)

Step 3: Build:

    cargo build --release --target=...

Cargo handles dependency resolution, target configuration, and LTO automatically, which is something that often requires substantial manual work in CMake or vendor IDEs.

Power Management: The Ultimate Form of Optimisation

On embedded devices, power efficiency is often more important than raw performance. Rust’s concurrency model lets your code naturally enter low-power states.

FreeRTOS Approach (C/C++)

Use vTaskDelay() or xTaskNotifyWait()
The OS enters tickless idle if nothing is runnable
CPU executes a low-power sleep instruction until next interrupt

Works well, but requires careful task design and tuning.

Embassy’s Async Approach

Embassy’s async model provides automatic, fine-grained power saving:

When every async task is awaiting something, the executor knows the system has nothing to do.
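It safely issues the core's wait-for-interrupt instruction (WFI on ARM and RISC-V, WAITI on Xtensa).
The CPU idles in a low-power state until the next interrupt; microamp-level currents additionally require the chip's deeper sleep modes to be configured.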
A timer, GPIO, or peripheral interrupt wakes it.
The executor resumes exactly the task that needs attention.

This is often simpler than manual sleep management in C/C++, because idle detection falls out of the executor's design rather than being configured and tuned separately.
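The sequence above can be sketched on a host machine with only the standard library. This is not Embassy's executor: it is a minimal single-future loop in which thread::park stands in for the wait-for-interrupt instruction and a background thread stands in for a hardware interrupt. ThreadWaker, Interrupt, and block_on are hypothetical names, and the 10 ms delay is arbitrary.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};
use std::time::Duration;

/// Wakes the executor thread; plays the role of an interrupt ending the sleep.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Hypothetical event source: completes once a simulated interrupt has fired.
struct Interrupt {
    fired: Arc<AtomicBool>,
    armed: bool,
}

impl Interrupt {
    fn new() -> Self {
        Self { fired: Arc::new(AtomicBool::new(false)), armed: false }
    }
}

impl Future for Interrupt {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.fired.load(Ordering::Acquire) {
            return Poll::Ready(());
        }
        if !self.armed {
            // Arm the simulated interrupt on first poll only.
            self.armed = true;
            let fired = Arc::clone(&self.fired);
            let waker = cx.waker().clone();
            thread::spawn(move || {
                thread::sleep(Duration::from_millis(10));
                fired.store(true, Ordering::Release);
                waker.wake();
            });
        }
        Poll::Pending
    }
}

/// Polls the future; whenever it is pending, sleeps until woken.
/// Returns the output and how many times the executor went idle.
fn block_on<F: Future>(future: F) -> (F::Output, u32) {
    let mut future = Box::pin(future);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    let mut idle_count = 0;
    loop {
        match future.as_mut().poll(&mut cx) {
            Poll::Ready(output) => return (output, idle_count),
            Poll::Pending => {
                idle_count += 1;
                thread::park(); // stand-in for the wait-for-interrupt instruction
            }
        }
    }
}
```

The executor never polls in a busy loop: it only runs again after a wake-up, which is the property that lets a real embedded executor put the core to sleep between events. Box::pin is used here for brevity and would need an allocator on a no_std target.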

🧩 Knowledge Check

derekmolloy.ie

Why is `panic = 'abort'` recommended in the release profile for `no_std` embedded targets?

What does `codegen-units = 1` do in a Cargo release profile, and why is it needed alongside `lto = true`?
