2.2 State-of-the-art Hardware

The landscape of edge AI hardware is characterised by a diverse array of specialised processors, each engineered to address the unique computational and power requirements of on-device AI.

Specialised Processors: GPUs, NPUs, FPGAs, and ASICs

Graphics Processing Units (GPUs): While traditionally known for graphics rendering, GPUs are widely utilised in edge AI for their parallel processing capabilities, which are highly effective for deep learning and video analytics workloads. Platforms such as the NVIDIA Jetson series are flagship edge AI computers designed for demanding applications like autonomous robotics, complex computer vision systems, and even generative AI, delivering significant AI performance measured in TOPS (trillions of operations per second). See Figure 3.

Figure 3. An example NVIDIA development platform: the Jetson Orin Nano Super Developer Kit. See: https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/

Neural Processing Units (NPUs): These are dedicated co-processors specifically designed to accelerate AI tasks, particularly matrix multiplication and tensor operations essential for neural networks, directly on the processor with high efficiency and low power consumption. Examples include Intel AI Boost and AMD XDNA, and specialised chips like the Hailo-8.
Field-Programmable Gate Arrays (FPGAs): FPGAs offer a blend of flexibility, in-field upgradability, and parallel processing capabilities, making them a valuable tool for contextual edge AI. Their reprogrammable logic fabric can be configured into many parallel datapaths, which maps well onto the structure of neural networks. FPGAs can be configured to perform specific AI tasks, allowing developers to adapt applications and optimise for efficiency and reliability. They are also effective in accelerating data ingestion and overcoming I/O bottlenecks. Toolkits such as AMD (formerly Xilinx) Vitis AI support FPGA-based edge computing.
Application-Specific Integrated Circuits (ASICs): ASICs are custom-made for particular AI workloads, offering superior price-to-performance ratios and exceptional power efficiency for specific, stable inference tasks. They are increasingly gaining market share for AI inference, particularly for large language models (LLMs) at the edge, where they can be simpler, cheaper, and consume less power than general-purpose GPUs for fixed workloads.
Digital Signal Processors (DSPs): Modern DSPs are evolving to integrate neural network accelerators, offering significant performance-per-watt advantages for real-time signal processing tasks such as audio analysis, image recognition, and sensor fusion. These ultra-efficient SoCs are ideal for applications requiring a balance of performance, latency, and energy efficiency without cloud reliance. Boards like the BeagleY-AI, built around a Texas Instruments SoC, integrate DSPs alongside Arm cores, a GPU and a vision accelerator. See Figure 4.

Figure 4. The BeagleY-AI Single Board Computer (SBC) is a low-cost, open-source, community-supported development platform for developers and hobbyists in a form factor compatible with accessories available for other popular SBCs. It can run AI applications on a dedicated 4 TOPS co-processor alongside real-time I/O tasks on a dedicated 800 MHz microcontroller. See: https://www.beagleboard.org/boards/beagley-ai
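To give a feel for what a TOPS rating means in practice, the following minimal sketch converts an operation count into a best-case latency. It assumes the accelerator sustains its peak rate, which real devices rarely do, and it counts one multiply-accumulate (MAC) as two operations. The function names and the 1024-by-1024 dense layer are hypothetical and chosen only for illustration.

```python
def dense_layer_macs(inputs: int, outputs: int) -> int:
    """Multiply-accumulate operations needed by one fully connected layer."""
    return inputs * outputs


def ideal_latency_seconds(macs: int, tops: float) -> float:
    """Best-case latency, assuming the peak TOPS rate is sustained.

    One MAC is counted as two operations (a multiply and an add).
    """
    ops = 2 * macs
    return ops / (tops * 1e12)


# Hypothetical workload: a single 1024-input, 1024-output dense layer.
macs = dense_layer_macs(1024, 1024)
for label, tops in [("4 TOPS device", 4.0), ("275 TOPS device", 275.0)]:
    microseconds = ideal_latency_seconds(macs, tops) * 1e6
    print(f"{label}: {microseconds:.3f} microseconds (ideal)")
```

Measured latency is usually well above this bound, because memory bandwidth, data movement, and operator support tend to dominate on real hardware.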

Emerging Architectures: Neuromorphic Computing

Inspired by the biological brain, neuromorphic computing represents a paradigm shift in AI hardware. Chips such as Intel’s Loihi 2 employ event-driven architectures and spiking neural networks (SNNs). Unlike conventional processors, which process data continuously at fixed intervals, neuromorphic chips activate only when specific “spikes” or events occur (see Figure 5 for an example of a neuromorphic camera), leading to exceptional energy efficiency (consuming as little as 1% to 10% of the power used by traditional processors). This asynchronous processing enables real-time responsiveness, with latencies in the range of tens of microseconds, making these chips well suited to dynamic, latency-sensitive applications like robotics, autonomous systems, and next-generation IoT devices. Furthermore, neuromorphic systems integrate memory and processing in a single architecture, minimising the energy lost to constant data movement, a significant advantage for battery-powered edge devices.
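The event-driven behaviour described above can be illustrated with a leaky integrate-and-fire (LIF) neuron, a common textbook model for spiking neural networks. This is a minimal software sketch only; it is not how Loihi 2 is programmed, and the function name, threshold, and leak factor are assumptions chosen for illustration.

```python
def lif_run(input_current, threshold=1.0, leak=0.9):
    """Simulate one leaky integrate-and-fire neuron.

    Returns the time steps at which the neuron emitted a spike.
    """
    potential = 0.0
    spikes = []
    for step, current in enumerate(input_current):
        # The membrane potential decays each step and integrates new input.
        potential = leak * potential + current
        if potential >= threshold:
            spikes.append(step)
            potential = 0.0  # reset after firing
    return spikes


quiet = [0.0] * 10
burst = [0.0, 0.0, 0.6, 0.6, 0.0, 0.0, 0.0, 0.6, 0.6, 0.6]
print(lif_run(quiet))  # no input, so no spikes and no downstream work
print(lif_run(burst))  # spikes only where the input accumulates
```

With no input the neuron produces no spikes, so nothing downstream needs to run. That absence of work during quiet periods is the source of the energy savings.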

Figure 5. A simple side-by-side demo of a conventional and neuromorphic camera. A neuromorphic camera, also known as an event camera, silicon retina, or dynamic vision sensor (DVS), is a bio-inspired sensor that captures changes in light intensity rather than full frames at fixed intervals. It mimics the human eye’s ability to detect changes in luminance, rather than capturing static scenes. Please see the following link for a very clear video demonstration: https://www.youtube.com/watch?v=W5mUwBitFtg
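The difference between frame-based and event-based sensing can be sketched in a few lines. The example below compares two small greyscale frames and emits an event only for pixels whose log intensity changed by more than a threshold. This is a simplification: a real event camera reports changes asynchronously per pixel rather than by differencing frames, and the function name and threshold value here are hypothetical.

```python
import math


def frame_to_events(prev_frame, next_frame, threshold=0.2):
    """Return (row, col, polarity) for pixels whose log intensity changed.

    Polarity is +1 for a brightness increase and -1 for a decrease.
    """
    events = []
    for r, (prev_row, next_row) in enumerate(zip(prev_frame, next_frame)):
        for c, (p, n) in enumerate(zip(prev_row, next_row)):
            # Adding 1.0 avoids taking the logarithm of zero for dark pixels.
            delta = math.log(n + 1.0) - math.log(p + 1.0)
            if abs(delta) >= threshold:
                events.append((r, c, 1 if delta > 0 else -1))
    return events


prev_frame = [[10, 10, 10], [10, 10, 10]]
next_frame = [[10, 40, 10], [10, 10, 2]]
events = frame_to_events(prev_frame, next_frame)
print(events)  # only the two changed pixels produce output
```

A static scene yields an empty event list, whereas a conventional camera would still transmit every pixel of every frame.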

Leading Edge AI Hardware Platforms and Their Capabilities

The market for edge AI hardware spans a diverse range of platforms, each optimised for distinct power, performance, and form factor requirements. This proliferation of highly specialised AI hardware signifies a strategic shift from general-purpose computing to domain-specific architectures at the edge. Specialisation is crucial for overcoming the inherent resource constraints of edge devices and for achieving the performance and energy efficiency that diverse edge AI applications demand.

Hardware selection is therefore no longer a simple choice but an engineering decision that must be made jointly with the specific AI workload, power budget, latency requirements, and desired flexibility for future updates. For instance, a high-volume, fixed-function application might prioritise an ASIC for its cost and power advantages, whereas a rapidly evolving or multi-functional system may find FPGAs or powerful GPU-based SoCs more suitable due to their flexibility and adaptability. This trend drives the need for close hardware-software co-design from the earliest stages of development. The emergence of specialised chips capable of handling generative AI and large language models at the edge further underscores the move towards highly tailored hardware for new AI paradigms.

Table 1 below provides a comparative overview of some leading edge AI hardware platforms, illustrating their varied capabilities and target applications. Three further platforms that appear frequently in edge AI deployments are worth introducing:

Google Coral Dev Board: Features Google’s Edge TPU, a purpose-built ASIC co-processor that executes TensorFlow Lite models at up to 4 TOPS while consuming under 2 W. Its tight integration with the TensorFlow Lite ecosystem and compact form factor make it a popular choice for low-power vision tasks such as real-time image classification and object detection in portable devices and smart cameras.
Intel Neural Compute Stick 2 (NCS2): A USB 3.0 device housing Intel’s Movidius Myriad X VPU (Vision Processing Unit). It enables any Linux host — including a Raspberry Pi — to offload neural network inference via the OpenVINO toolkit, delivering approximately 1.2 TOPS of inference performance with USB-bus power only. It is primarily used for rapid prototyping and for adding inference capability to existing devices without hardware redesign.
Qualcomm Robotics RB5: A high-performance robotics and automation platform built around Qualcomm’s QRB5165 SoC, which integrates an octa-core Kryo 585 CPU, Adreno 650 GPU, and Hexagon 698 Tensor Accelerator. At 15 TOPS, it targets demanding applications such as multi-camera autonomous navigation, drone guidance, and industrial inspection that require sustained edge inference alongside real-time video processing.

Table 1: Comparison of Some Modern Edge AI Hardware Platforms

Platform Name	Primary Processor Type	Performance (TOPS/TFLOPS/GOPS)	Power Consumption	Key Use Cases/Strengths	Notable Features
NVIDIA Jetson AGX Orin	GPU	Up to 275 TOPS	Varies (e.g., 15-60W)	High-end robotics, autonomous vehicles, advanced computer vision, GenAI	12-core Arm CPU, 2048-core NVIDIA Ampere GPU, 64 Tensor Cores
Google Coral Dev Board	ASIC (Edge TPU)	4 TOPS	Low (about 2 W for the Edge TPU at peak)	Low-power vision-based IoT, smart cameras, portable ML devices	Optimised for TensorFlow Lite, small form factor
Intel Neural Compute Stick 2	VPU (Movidius Myriad X)	~1.2 TOPS	Low (USB powered)	Prototyping Edge AI on PCs/Raspberry Pi, offloading inference tasks	Plug-and-play USB 3.0 device
AMD Xilinx Kria K26 SOM	FPGA	Varies by configuration	Moderate	Computer vision in industrial/smart city, automated optical inspection	Adaptive, in-field reconfigurability, low latency
Qualcomm Robotics RB5	SoC (CPU+GPU+AI Engine)	15 TOPS	Moderate	High-performance robotics, drones, multi-camera setups	Octa-core Kryo 585 CPU, Adreno 650 GPU, Hexagon Tensor Accelerator
Intel Loihi 2	Neuromorphic	Not rated in TOPS; reported as up to 50x faster than ANNs	Extremely low (~1 W)	Event-driven AI, robotics, autonomous systems, brain-inspired computing	Combines processing and memory, asynchronous operation
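One way to compare the platforms above is performance per watt. The sketch below uses two rows for which both a TOPS figure and a power figure are available: 275 TOPS at the 60 W top of the quoted range for the Jetson AGX Orin, and 4 TOPS at the 2 W upper bound given earlier for the Edge TPU. Pairing peak TOPS with peak power is an assumption, and the helper names are hypothetical.

```python
def tops_per_watt(tops: float, watts: float) -> float:
    """Peak throughput divided by power draw."""
    if watts <= 0:
        raise ValueError("power must be positive")
    return tops / watts


def within_budget(platforms: dict, max_watts: float) -> list:
    """Names of platforms whose power draw fits the budget, most efficient first."""
    fitting = [(name, tops_per_watt(tops, watts))
               for name, (tops, watts) in platforms.items()
               if watts <= max_watts]
    fitting.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in fitting]


# (peak TOPS, assumed peak power in watts)
platforms = {
    "NVIDIA Jetson AGX Orin": (275.0, 60.0),
    "Google Coral Dev Board": (4.0, 2.0),
}

for name, (tops, watts) in platforms.items():
    print(f"{name}: {tops_per_watt(tops, watts):.2f} TOPS/W")
print(within_budget(platforms, max_watts=5.0))
```

Peak ratings from different vendors are measured under different conditions (numeric precision, sparsity, power mode), so ratios like these are a first filter rather than a benchmark. The power budget is often the harder constraint: a 5 W budget rules out the Jetson regardless of its efficiency.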

Microcontroller-Class Devices: The ESP32

The platforms discussed above represent the high-performance tier of the edge hardware landscape. This book focuses on a category of device that is equally pervasive but far more resource-constrained: the microcontroller. Microcontrollers are single-chip systems integrating a CPU, memory, and programmable I/O peripherals on one die, designed to be embedded directly into products with minimal supporting circuitry. They sit at the endpoint device tier of the three-layer architecture introduced in Section 2.1.

The ESP32, manufactured by Espressif Systems, is the target hardware for this book. It is a 32-bit, dual-core microcontroller running at up to 240 MHz with 520 KB of SRAM and typically 4 MB of SPI flash storage. Integrated Wi-Fi (802.11 b/g/n) and Bluetooth (v4.2 Classic and Low Energy, with Bluetooth 5 LE on newer variants such as the ESP32-S3 and ESP32-C3) on a single chip make it one of the most capable and cost-effective platforms for connected IoT and edge applications. The ESP32 family has expanded into several variants optimised for different use cases:

Variant	Architecture	Key Additions
ESP32 (Classic)	Dual-core Xtensa LX6, 240 MHz	Wi-Fi + BT Classic + BLE
ESP32-S3	Dual-core Xtensa LX7, 240 MHz	AI vector instructions, USB OTG
ESP32-C3	Single-core RISC-V, 160 MHz	Lower cost; Wi-Fi + BLE
ESP32-H2	RISC-V, 96 MHz	Thread/Zigbee/BLE; no Wi-Fi

The ESP32-S3 is the variant most relevant to AIoT development: its Xtensa LX7 cores include vector processing extensions that accelerate the integer arithmetic used in TinyML inference, delivering up to five times the neural network throughput of the original ESP32 for quantised models.
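The importance of quantised models on this class of device becomes clear from a little arithmetic. The sketch below compares the weight storage of a hypothetical 200,000-parameter model in 32-bit floating point and in 8-bit integer form against the 520 KB of SRAM quoted above, and shows a simple symmetric int8 quantisation of a few example weights. The parameter count, weights, and function names are illustrative assumptions, and this is not the scheme of any particular TinyML framework.

```python
ESP32_SRAM_BYTES = 520 * 1024  # 520 KB of on-chip SRAM


def model_bytes(parameters: int, bytes_per_weight: int) -> int:
    """Storage needed for the weights alone."""
    return parameters * bytes_per_weight


def quantise_int8(weights):
    """Symmetric int8 quantisation: returns (integer weights, scale)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantised = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantised, scale


parameters = 200_000  # hypothetical model size
for label, width in [("float32", 4), ("int8", 1)]:
    size = model_bytes(parameters, width)
    verdict = "fits in" if size <= ESP32_SRAM_BYTES else "exceeds"
    print(f"{label}: {size} bytes, {verdict} SRAM ({ESP32_SRAM_BYTES} bytes)")

q, scale = quantise_int8([0.4, -1.0, 0.2, 0.0])
print(q, scale)
```

In practice, weights usually live in flash, and SRAM is shared with activations, the network stack, and application code, so the usable budget is smaller than this comparison suggests.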

From a programming perspective, the ESP32 is supported by Espressif’s ESP-IDF (IoT Development Framework) in C/C++, the Arduino framework, and — the focus of the later chapters of this book — Rust via esp-idf-hal (using the ESP-IDF runtime) and the bare-metal esp-hal. Both paths are explored in Chapters 12 and 13.

🧩Knowledge Check

Concept Match: Match Edge AI Hardware Concepts

Match each concept to its definition.

Concepts

GPU

NPU

FPGA

Neuromorphic

Definition Pool

Reprogrammable logic fabric offering in-field reconfigurability and low-latency performance.

Dedicated co-processor specifically designed for matrix multiplication and tensor operations.

Event-driven architecture using spiking neural networks, inspired by the biological brain.

Parallel processing powerhouse for deep learning and video analytics (e.g., NVIDIA Jetson).