Security & Monitoring
Detecting silent failures and improving runtime security monitoring in embedded systems
By Andreas Lifvendahl, CEO at Percepio
Why are non-crashing failures becoming more common in modern embedded designs?
Embedded systems have become far more complex over the last decade. We now see multicore processors, RTOS-based concurrency, third-party middleware, connectivity stacks, and increasingly AI or data-driven workloads running alongside traditional control software. In these systems, many faults do not lead to an immediate crash or exception. Instead, they show up as timing degradation, stalled threads, priority inversions, or subtle deadlocks that leave the device unresponsive but technically “alive”. As systems become more software-defined and tightly integrated, these non-crashing failure modes naturally become more common.
Why should engineers view silent failures as a security and integrity issue, not just a reliability problem?
A system that freezes or behaves unpredictably without triggering a fault is not only unreliable – it is also untrustworthy. From a security and safety perspective, silent failures break assumptions about system state and control flow. They can mask denial-of-service conditions, allow compromised components to go unnoticed, or prevent security mechanisms from executing as intended. If you cannot detect that software has deviated from its expected runtime behaviour, you cannot be confident in its integrity, regardless of whether the root cause is a bug, a design flaw, or malicious interference.
Why don’t watchdogs, logging, and post-mortem analysis catch these issues?
Traditional mechanisms are designed for crashes and hard faults. Watchdogs only help if the system fully stalls, and logs often stop just before the most interesting behaviour occurs. Post-mortem analysis depends on having a clear failure event, which silent failures frequently lack. By the time a watchdog resets the system, the valuable runtime context – which thread was misbehaving, what timing constraints were violated – is already gone. These tools are necessary, but they were never designed to explain complex runtime anomalies in modern embedded systems.
[Figure: Dashboard view in Percepio Detect showing runtime alerts with associated forensic payloads.]
[Figure: High-level runtime observability with Percepio Detect – designed to surface silent failures in complex, multicore embedded systems before they escalate.]
20 March 2026, Components in Electronics, www.cieonline.co.uk
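To make the watchdog limitation concrete: a single watchdog that is serviced by one healthy thread stays satisfied even when another thread has stopped, and a reset says nothing about which thread was at fault. The following is a minimal host-side sketch in Python of a per-task check-in monitor that can name the stalled task. The class name, task names, and time limits are hypothetical, and timestamps are plain millisecond values supplied by the caller; it is an illustration of the idea, not a description of how Percepio Detect or any particular watchdog works.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TaskLivenessMonitor:
    """Tracks per-task check-ins so a stall can be attributed to a task.

    Hypothetical illustration: times are caller-supplied milliseconds,
    not values read from a real clock or RTOS.
    """
    max_silence_ms: Dict[str, int]  # task name -> longest allowed gap
    last_checkin_ms: Dict[str, int] = field(default_factory=dict)

    def check_in(self, task: str, now_ms: int) -> None:
        # Each task would call this once per loop iteration.
        self.last_checkin_ms[task] = now_ms

    def stalled_tasks(self, now_ms: int) -> List[str]:
        # A task that has never checked in is measured from time zero.
        return [
            task
            for task, limit in self.max_silence_ms.items()
            if now_ms - self.last_checkin_ms.get(task, 0) > limit
        ]


monitor = TaskLivenessMonitor({"control": 20, "comms": 100})

# Simulate 300 ms: "control" keeps running, "comms" stalls after 50 ms.
for now in range(0, 301, 10):
    monitor.check_in("control", now)
    if now <= 50:
        monitor.check_in("comms", now)

print(monitor.stalled_tasks(300))
```

In this simulation the device as a whole still looks alive because the control task keeps checking in, yet the monitor reports the communications task as stalled. On a real target the same bookkeeping would typically be written in C and the result recorded before any reset, so that the context survives.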
What kinds of runtime behaviours are most useful to monitor for early warning signs?
Timing, scheduling, and execution patterns are especially revealing. Changes in per-thread CPU usage, missed deadlines, unexpected execution gaps, or growing latency between events often appear long before a system becomes unusable. Another important indicator is runtime variability. A healthy embedded system behaves consistently. When execution times or response latencies start to fluctuate more than expected, it is often a sign that the system is approaching an unstable or unsafe state.
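As a rough illustration of watching runtime variability, the sketch below keeps a running mean and standard deviation of execution times using Welford's online algorithm and flags samples that fall far outside the learned range. It assumes execution times in microseconds are already being collected for one thread; the class name, the four-sigma threshold, and the warm-up length are hypothetical choices made for the example, not values taken from the article or from any specific tool.

```python
import math


class JitterMonitor:
    """Running mean and variance of execution times (Welford's algorithm)."""

    def __init__(self, threshold_sigma: float = 4.0, warmup: int = 20):
        self.threshold_sigma = threshold_sigma  # assumed alert threshold
        self.warmup = warmup                    # samples before alerting starts
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the running mean

    def stddev(self) -> float:
        # Sample standard deviation; zero until two samples are available.
        if self.count < 2:
            return 0.0
        return math.sqrt(self._m2 / (self.count - 1))

    def observe(self, exec_time_us: float) -> bool:
        """Record one sample and return True if it looks anomalous."""
        anomalous = False
        if self.count >= self.warmup:
            limit = self.threshold_sigma * self.stddev()
            anomalous = abs(exec_time_us - self.mean) > limit
        # Welford update; note that flagged samples are still included.
        self.count += 1
        delta = exec_time_us - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (exec_time_us - self.mean)
        return anomalous


jitter = JitterMonitor()
# Fifty samples with small, regular jitter, followed by one large outlier.
samples = [1000 + (i % 5) for i in range(50)] + [1400]
flags = [jitter.observe(s) for s in samples]
print([i for i, flagged in enumerate(flags) if flagged])
```

The online update needs only three stored values per monitored thread, which is why this style of statistic is a plausible fit for constrained devices. A production monitor would also need to decide whether flagged samples should be excluded from the baseline.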
Why is per-thread execution data more revealing than system-wide metrics?
System-wide metrics like total CPU load can look perfectly normal while individual threads are misbehaving. In RTOS-based systems, failures often originate from one