ECD April 2011

Computer-On-Modules with ECC support for safety-critical systems

By Franz Fischer

Franz explains why Computer-On-Modules (COMs) with Error Checking and Correcting (ECC) support especially suit safety-critical applications and describes how ECC modules perform key tasks.

With memory playing an increasingly important role in today’s industrial applications, using Error Checking and Correcting (ECC) modules is the right way to avoid running into problems and reduce errors to a minimum. Now ECC support is also available for embedded modules.

Unlike standard RAM modules, ECC modules store additional check bits alongside the data, so each word can be verified when it is read. This allows an ECC module to correct single-bit errors and to detect double-bit errors. It is an improvement over a simple parity bit, which can identify but not correct an error. Therefore, Computer-On-Modules (COMs) with ECC support closely match the needs of safety-critical applications, such as those found in server, medical, and aeronautic systems.
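The limitation of a parity bit can be illustrated with a minimal sketch. The 8-bit word, the even-parity convention, and the function names below are assumptions chosen for illustration, not details of any particular memory module.

```python
def parity_bit(word: int) -> int:
    """Even-parity bit for an 8-bit word: 1 if the number of set bits is odd."""
    return bin(word & 0xFF).count("1") % 2


def parity_ok(word: int, stored_parity: int) -> bool:
    """True if the word still matches the parity bit stored with it."""
    return parity_bit(word) == stored_parity


data = 0b10110010            # hypothetical stored word
stored = parity_bit(data)    # parity bit written alongside it

# Flip each of the 8 bits in turn: every single-bit error is detected ...
single_flips = [data ^ (1 << i) for i in range(8)]
detected = [not parity_ok(w, stored) for w in single_flips]

# ... but the check only says "mismatch", never which bit flipped,
# so the word cannot be repaired. A double-bit error goes unnoticed.
double_flip = data ^ 0b00000011
missed = parity_ok(double_flip, stored)
```

Here every entry of `detected` is true, yet the result carries no position information, and `missed` is true because two flipped bits cancel out in the parity sum.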

ECC technology: how it works and why it suits embedded systems

Initially only found in server hardware for use in data centers, ECC has made significant inroads into the smaller, conventional server market. With support by JEDEC and major DDR3 memory controllers, the technology is now being officially standardized and launched into the embedded market.

ECC is not necessarily the most appropriate abbreviation when it comes to fail-safe storage subsystems. The correct designation would be Error Detection and Correction (EDC) or, better yet, Single Error Correction, Double Error Detection (SECDED). Nevertheless, I will stick to the familiar abbreviation for the purposes of this article.


Alpha particles often disrupted the individual memory cells of large servers, spurring the need for error correction. This alpha radiation was triggered by the decay of tiny amounts of radioactive components in the housing material of the DRAM devices. Because the distance between the memory cells and the radiation-emitting housing material is very small, the alpha particles can cause a data loss in the charged cells. The frequency of these so-called “soft errors” was several orders of magnitude higher than the number of DRAM failures resulting from “hard errors.” Control measures proved ineffective in reducing the frequency of errors, because data loss occurred completely randomly and was therefore unpredictable.

Without error correction, the MTBF of large servers with a large active cell area in the DRAM chips would fall to unacceptably low levels.
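The dependence on memory size can be made explicit with a simple model. As an assumption for illustration (not a figure from any measurement), suppose each of $N$ memory cells suffers soft errors independently at a constant rate $\lambda$. Without correction, any single upset counts as a failure, so the rates add up:

```latex
\lambda_{\text{system}} = N \lambda
\qquad\Longrightarrow\qquad
\text{MTBF}_{\text{system}} = \frac{1}{N \lambda}
```

Under this model, doubling the amount of unprotected memory halves the MTBF, which is why large memory configurations were the first to need error correction.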

Today, the soft-error rate of DRAM modules is often no longer determined by the housing material but by cosmic rays, because with the transition to BGA packaging much of the problematic housing material is no longer needed. However, DRAM modules are still prone to errors due to cosmic ray events. The transport of finished products in aircraft or the use of systems at high altitude can lead to a significant number of temporary failures of individual cells caused by protons penetrating a transistor’s insulation layer. Implementing a detection and correction mechanism helps to solve the problem of single-bit errors. The affected DQ data can be identified and corrected, which greatly reduces the risk that corrupted data remains undetected, whether it was caused by technical faults or by radiation.
