Feature: Embedded design
Figure 7: A minimal CMSIS Stream compute graph, showing buffers and data types
The third library is Scikit-Learn, which offers a wide range of ML algorithms and evaluation tools. Although deep learning frameworks are often used for final model deployment, Scikit-Learn is extremely useful for prototyping and validating concepts. Together, these three libraries form a powerful toolkit for transforming raw sensor recordings into meaningful training datasets.
Designing ML models
Once the dataset is prepared, the next stage is model design and training. Again, a Jupyter notebook with Python is the go-to environment. Two major frameworks dominate this space:
TensorFlow
TensorFlow has become one of the most widely used machine learning frameworks. Embedded developers typically interact with TensorFlow through Keras, a high-level Python API that simplifies neural network design. For embedded deployment, models can be converted into formats compatible with lightweight runtime systems.
TensorFlow Lite for Microcontrollers
TensorFlow Lite for Microcontrollers (now branded LiteRT for Microcontrollers) is the runtime that enables TensorFlow models to execute efficiently on embedded devices. It supports CMSIS-NN optimised kernels and Arm Ethos NPU acceleration, allowing trained models to run directly on Cortex-M microcontrollers.
PyTorch
Another increasingly popular framework is PyTorch, developed by Meta AI. It has gained rapid adoption thanks to its flexible programming model, dynamic computation graphs and intuitive debugging workflow. For engineers exploring new algorithm designs, this flexibility can be extremely valuable: it allows models to evolve iteratively whilst being tested against real datasets. Historically, however, deploying PyTorch models directly to small embedded devices required additional tooling. This is where the emerging ExecuTorch framework enters the picture.
ExecuTorch
ExecuTorch is a lightweight runtime designed to bring PyTorch models to edge and embedded platforms. ExecuTorch forms part of the broader PyTorch ecosystem and focuses specifically on executing trained models efficiently on
constrained hardware such as IoT devices and microcontrollers. The typical workflow begins with model development using PyTorch in Python. Once the model architecture has been defined and trained, it can be exported into an intermediate representation suitable for embedded inference.
Efficient implementation on Cortex-M
Once a model has been trained, the next challenge is deploying it efficiently on a microcontroller. Here, specialised libraries and accelerators become essential.
CMSIS-NN
CMSIS-NN is a collection of highly optimised neural network kernels for Cortex-M processors. The kernels support int8/int16 quantisation and are optimised for hardware extensions (SIMD, Helium, Ethos NPU) within the Cortex-M processors. These functions implement common neural network operations such as convolution, pooling, fully connected layers and Softmax. Because they are carefully tuned for Arm processors, CMSIS-NN can deliver up to a five-fold performance improvement compared with standard C implementations.
Ethos drivers
When an Ethos NPU is present, the neural network workload can be offloaded to the accelerator. The Ethos driver manages communication between the Cortex-M processor and the NPU. Typical performance figures include ~512 GOPS for the Ethos-U55, ~1 TOPS for the Ethos-U65 and ~4 TOPS for the Ethos-U85. These accelerators enable real-time inference even for relatively complex neural networks.
Vela compiler
To run a model on an Ethos NPU, the network must first be compiled into an optimised representation. Arm provides the Vela compiler for this purpose; see Figure 6. It analyses the neural network and partitions it between the NPU accelerator and the Cortex-M processor. This ensures that each part of the model runs on the most appropriate hardware resource. While this may sound like a difficult transition from software-only to software-plus-NPU, it is a remarkably easy process, and software-only and Ethos models can be generated in the same workflow.
Building the DSP and ML pipeline
ML models rarely operate directly on raw sensor data. Instead,