12.3 Optimisation

Code Optimisation, Profiling, and Power
Writing code that runs is only the start. On a microcontroller, where you may have only a few hundred kilobytes of flash and a few dozen kilobytes of RAM, writing efficient code is just as important as writing correct code. Rust offers both traditional C/C++-style optimisation controls and its own unique advantages through the type system, compiler, and async model. To understand how to make the most of the ESP32, we need to look at how Rust and LLVM optimise code, how to profile and measure performance, and how to manage power in a world where battery life often matters more than raw speed.
How Rust Produces Efficient Code: Compiler Optimisations
Because Rust uses the LLVM backend, it shares the same optimisation engine as modern C and C++. This puts Rust on equal footing with well-optimised C for embedded development, which is not true for many other languages. You control optimisation levels through your project’s Cargo.toml, typically in the [profile.release] section:
- `opt-level = 3`: Maximum speed. LLVM aggressively inlines, vectorises, and unrolls loops. Perfect for performance-critical paths like DSP, cryptography, or tight sensor loops.
- `opt-level = "z"`: Optimise for smallest binary size. Essential on microcontrollers with limited flash storage.
- `lto = true`: LTO (Link-Time Optimisation). A powerful optimisation where LLVM treats your entire program (including dependencies) as one unit. This allows cross-crate inlining, dead-code elimination, and constant-folding across library boundaries. Enabling LTO often shrinks embedded binaries dramatically while improving execution speed.
A concrete Cargo.toml configuration for an embedded release build pairs a size-oriented opt-level with LTO. Three further options are also commonly applied:
```toml
[profile.release]
opt-level = "z"     # minimise binary size; switch to 3 for max speed
lto = true          # whole-program optimisation across crate boundaries
codegen-units = 1   # single unit required for effective LTO
strip = true        # remove debug symbols from the final binary
panic = "abort"     # abort immediately on panic instead of unwinding
                    # (saves the unwinding machinery, typically ~10-20 KB)
```

`panic = "abort"` is almost always appropriate for embedded targets: the no_std panic handler (e.g., panic-halt) already halts the device on panic anyway, so the unwinding machinery serves no purpose and only wastes flash space.
Profiling and Benchmarking on Embedded Systems
Optimisation without measurement is guesswork. Embedded developers must be able to see where their code spends time, and Rust provides several tools to do this safely and clearly.
- Rust-side Profiling Tools: `defmt` logging with timestamps. `defmt` supports extremely compact binary logs and timestamping, which lets you measure time intervals between events with minimal overhead. This is useful for checking the latency of async tasks, measuring network operation durations, and verifying timing-sensitive protocols.
- Hardware-side Profiling: For deeper timing analysis, a hardware cycle counter gives cycle-accurate measurements directly from the CPU: ARM Cortex-M parts provide DWT (Data Watchpoint and Trace) counters, and the Xtensa cores in the ESP32 expose a comparable CCOUNT register. We can also use GPIO toggling with a logic analyser, a classic trick still widely used: toggle a pin in your code and measure it externally. This costs only a register write per edge, is as accurate as the analyser’s sample rate allows (typically tens of nanoseconds or better), and works even when debugging tools are unavailable.
Embedded C developers might use gprof, vendor IDE tools, FreeRTOS trace hooks, or semihosting timers. Rust projects can use most of these too, and add structured, low-overhead logging via defmt and ergonomic access to hardware timers.
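As a rough sketch of counter-based timing, the helper below wraps a piece of work between two reads of a free-running counter. The `CycleCounter` trait, the `measure` function, and the `FakeCounter` type are hypothetical names introduced here for illustration; on real hardware you would implement the trait on top of whatever cycle counter or timer your HAL exposes.

```rust
use core::cell::Cell;

/// Hypothetical abstraction over a free-running hardware counter
/// (for example a CPU cycle counter or a timer peripheral).
pub trait CycleCounter {
    /// Current counter value; expected to wrap around on overflow.
    fn now(&self) -> u32;
}

/// Runs `work` once and returns its result together with the elapsed ticks.
/// `wrapping_sub` keeps the difference correct across a single counter wrap.
pub fn measure<C: CycleCounter, R>(counter: &C, work: impl FnOnce() -> R) -> (R, u32) {
    let start = counter.now();
    let result = work();
    let elapsed = counter.now().wrapping_sub(start);
    (result, elapsed)
}

/// Host-side stand-in for demonstration only: every read advances by `step`.
pub struct FakeCounter {
    value: Cell<u32>,
    step: u32,
}

impl FakeCounter {
    pub fn new(start: u32, step: u32) -> Self {
        Self { value: Cell::new(start), step }
    }
}

impl CycleCounter for FakeCounter {
    fn now(&self) -> u32 {
        let current = self.value.get();
        self.value.set(current.wrapping_add(self.step));
        current
    }
}
```

Because the subtraction wraps, a measurement that straddles a counter overflow still yields the right tick count, provided the measured interval is shorter than one full counter period.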
Minimising Heap Usage: Predictability and Reliability
A deterministic memory footprint is essential in embedded systems. Rust offers three distinct strategies depending on the development path:
- `std` + ESP-IDF: Uses the FreeRTOS heap. This is convenient but can fragment over time.
- `no_std`: No allocator by default. You rely on stack-allocated structures or static buffers.
- `no_std` with `alloc`: Add a custom allocator only when needed (often for async executors). Still far more controlled than a general-purpose heap.
The recommended tool is the heapless crate, which provides `heapless::Vec`, `heapless::String`, and `heapless::spsc::Queue`. These all use fixed-capacity buffers that avoid dynamic allocation, fragmentation, and unpredictable failure modes. This gives you C-style determinism with Rust-style safety.
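To make the idea of a fixed-capacity buffer concrete, here is a minimal sketch written against plain Rust rather than the heapless crate itself. `FixedVec` is a hypothetical type for illustration only, and is far simpler than the real heapless containers; the point is that the capacity is part of the type, so a full buffer is reported as an ordinary `Err` instead of triggering an allocation.

```rust
/// Hypothetical fixed-capacity buffer of `u16` samples, for illustration only.
/// Storage is an inline array, so its size is known at compile time.
pub struct FixedVec<const N: usize> {
    items: [u16; N],
    len: usize,
}

impl<const N: usize> FixedVec<N> {
    /// Creates an empty buffer; usable in `static` or on the stack.
    pub const fn new() -> Self {
        Self { items: [0; N], len: 0 }
    }

    /// Appends a value, or hands it back in `Err` when the buffer is full.
    pub fn push(&mut self, value: u16) -> Result<(), u16> {
        if self.len == N {
            return Err(value);
        }
        self.items[self.len] = value;
        self.len += 1;
        Ok(())
    }

    /// The elements stored so far.
    pub fn as_slice(&self) -> &[u16] {
        &self.items[..self.len]
    }

    /// Maximum number of elements, fixed by the type parameter.
    pub fn capacity(&self) -> usize {
        N
    }
}
```

A caller declares the capacity up front, for example `FixedVec::<16>::new()`, and must decide explicitly what to do when `push` returns `Err`, which is exactly the kind of failure a heap would otherwise hide until run time.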
Inline Functions, Loop Unrolling, and High-Level Iterator Optimisation
Inlining removes the overhead of a function call by placing the function’s code directly at the call site, while loop unrolling reduces the number of branch instructions executed by a loop. Both techniques cut control-flow overhead, usually at the cost of larger code, and can significantly improve performance on embedded systems with limited processing power.
C/C++ developers often reach for inline or #pragma unroll to shape performance. Rust provides similar tools:
- `#[inline]` or `#[inline(always)]` for encouraging the compiler to inline hot functions.
- LLVM unrolls loops automatically at optimisation level `3`; at `z` it generally unrolls only where doing so does not grow the code.
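A small sketch of the inlining attribute in use is shown below. The function names, the 3300 mV reference, and the 12-bit ADC range are all assumptions chosen for illustration; the attribute is a strong hint to the compiler, and whether it pays off should be confirmed by measurement.

```rust
/// Converts a raw 12-bit ADC reading to millivolts.
/// The 3300 mV reference and 12-bit range are illustrative assumptions.
#[inline(always)]
pub fn adc_to_millivolts(raw: u16) -> u32 {
    (raw as u32 * 3300) / 4095
}

/// Hot loop over a sample buffer. With the conversion inlined, the loop body
/// contains the arithmetic directly instead of a call per sample.
pub fn average_millivolts(samples: &[u16]) -> u32 {
    if samples.is_empty() {
        return 0;
    }
    let total: u32 = samples.iter().map(|&s| adc_to_millivolts(s)).sum();
    total / samples.len() as u32
}
```

For small helpers like this one, the optimiser will usually inline without being asked; the attribute matters most across crate boundaries when LTO is off.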
More importantly, Rust’s iterator chains, for example:

```rust
values.iter().map(|v| v * 2).filter(|v| *v > 10)
```

are often compiled into a single, highly optimised loop with no allocation, no temporary objects, and no function calls left after inlining. Thanks to monomorphisation and LLVM analysis, Rust’s high-level iterator code typically matches, and occasionally beats, an equivalent hand-written loop.
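To see what such a chain stands for, the sketch below places an iterator version next to a hand-written indexed loop that computes the same value. The function names are hypothetical, and a `sum` step is added so that each function returns a single number that can be compared.

```rust
/// Iterator version: double each value, keep results above 10, then sum them.
pub fn sum_iter(values: &[i32]) -> i32 {
    values.iter().map(|v| v * 2).filter(|v| *v > 10).sum()
}

/// Hand-written indexed loop that computes the same result.
pub fn sum_loop(values: &[i32]) -> i32 {
    let mut total = 0;
    for i in 0..values.len() {
        let doubled = values[i] * 2;
        if doubled > 10 {
            total += doubled;
        }
    }
    total
}
```

Both functions walk the slice once and allocate nothing. The iterator form also avoids the indexed access `values[i]`, so there is no per-element bounds check for the optimiser to remove.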
Inline Assembly for Rare, Low-Level Control
Both C/C++ and Rust allow inline assembly when absolutely necessary. For example:
- C: `asm volatile(...)`
- Rust: `core::arch::asm!`
Typical uses:
- precise timing loops (no-operation `nop` delays)
- reading/writing special registers
- extremely optimised bit-banging
This should be used sparingly: inline assembly bypasses Rust’s safety and portability guarantees. Most HALs already provide safe wrappers for such operations.
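As a minimal sketch of the `nop` delay case, the function below issues a fixed number of `nop` instructions through `core::arch::asm!`. The name `delay_nops` is hypothetical, and the example assumes a target where `asm!` is stable and `nop` is a valid mnemonic (for example x86-64, ARM, or RISC-V).

```rust
use core::arch::asm;

/// Executes `count` `nop` instructions and returns how many were issued.
/// This is a crude busy-wait: the real delay depends on the clock speed,
/// the pipeline, and the loop overhead around each instruction.
pub fn delay_nops(count: u32) -> u32 {
    let mut issued = 0;
    for _ in 0..count {
        // SAFETY: `nop` reads and writes no memory, uses no stack,
        // and leaves the flags untouched.
        unsafe {
            asm!("nop", options(nomem, nostack, preserves_flags));
        }
        issued += 1;
    }
    issued
}
```

Note the `unsafe` block: the compiler cannot check what the assembly does, so the programmer takes responsibility for it. For real delays, a HAL-provided timer is nearly always the better choice.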
Cross-Compiling: The Rust Workflow Advantage
I have spent more hours than I care to remember on cross-compiling C/C++ code. The payoff can be large: for example, building the Linux kernel for a Raspberry Pi on a desktop Linux PC, rather than on the Pi itself, reduces the task from days to hours.
Cross-compiling in Rust is intentionally simple:
Step 1: Install a target:

```
rustup target add riscv32imc-unknown-none-elf
```

Step 2: Specify a .cargo/config.toml that defines:
- the target triple
- linker script
- runner (e.g., `probe-rs run`)

Step 3: Build:

```
cargo build --release --target=...
```

Cargo handles dependency resolution, target configuration, and LTO automatically, which is something that often requires substantial manual work in CMake or vendor IDEs.
Power Management: The Ultimate Form of Optimisation
On embedded devices, power efficiency is often more important than raw performance. Rust’s concurrency model lets your code naturally enter low-power states.
FreeRTOS Approach (C/C++)
- Use `vTaskDelay()` or `xTaskNotifyWait()`
- The OS enters tickless idle if nothing is runnable
- CPU executes a low-power sleep instruction until next interrupt
Works well, but requires careful task design and tuning.
Embassy’s Async Approach
Embassy’s async model provides automatic, fine-grained power saving:
- When every async task is awaiting something, the executor knows the system has nothing to do.
- It safely issues a WFI (Wait For Interrupt) instruction.
- The CPU core halts in a low-power wait state; how much current this saves depends on the chip and on which clocks and peripherals stay powered.
- A timer, GPIO, or peripheral interrupt wakes it.
- The executor resumes exactly the task that needs attention.
This is significantly simpler and more efficient than manual sleep management in C/C++.
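The sleep-when-idle loop described above can be sketched on a host machine with nothing but the standard library. This is not Embassy’s API or implementation: `TaskFlag`, `run_until_done`, `WaitForEvents`, and the `sleep` callback are hypothetical names, and the callback stands in for the WFI instruction plus the interrupt that would wake the task on real hardware.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Set when the task has work to do. On a real chip an interrupt handler
/// would trigger this through the task's waker.
struct TaskFlag {
    ready: AtomicBool,
}

impl Wake for TaskFlag {
    fn wake(self: Arc<Self>) {
        self.ready.store(true, Ordering::SeqCst);
    }
}

/// Drives one future to completion and counts how often the loop went idle.
/// `sleep` is a stand-in for WFI: it is called only when nothing is ready.
pub fn run_until_done<F, S>(mut task: Pin<Box<F>>, mut sleep: S) -> (F::Output, u32)
where
    F: Future,
    S: FnMut(&Waker),
{
    // Start as "ready" so the task is polled once before any sleep.
    let flag = Arc::new(TaskFlag { ready: AtomicBool::new(true) });
    let waker = Waker::from(flag.clone());
    let mut cx = Context::from_waker(&waker);
    let mut sleeps = 0;
    loop {
        if flag.ready.swap(false, Ordering::SeqCst) {
            // The task was woken: poll it exactly once.
            if let Poll::Ready(output) = task.as_mut().poll(&mut cx) {
                return (output, sleeps);
            }
        } else {
            // Nothing to do: this is where a real executor would sleep.
            sleeps += 1;
            sleep(&waker);
        }
    }
}

/// Demo future that needs `remaining` external events before it completes.
pub struct WaitForEvents {
    pub remaining: u32,
}

impl Future for WaitForEvents {
    type Output = u32;

    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<u32> {
        if self.remaining == 0 {
            Poll::Ready(42)
        } else {
            self.remaining -= 1;
            Poll::Pending
        }
    }
}
```

Calling `run_until_done(Box::pin(WaitForEvents { remaining: 3 }), |w| w.wake_by_ref())` simulates three interrupts: the loop goes idle three times and polls the task only when it has been woken. A production executor must also close the race between checking the flag and going to sleep, which this sketch leaves out.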
🧩 Knowledge Check
Why is `panic = 'abort'` recommended in the release profile for `no_std` embedded targets?
What does `codegen-units = 1` do in a Cargo release profile, and why is it needed alongside `lto = true`?
© 2026 Derek Molloy, Dublin City University. All rights reserved.