What is a soft error?
Soft errors are an old-new problem in the semiconductor industry. They are caused by cosmic particles, mostly protons coming from the sun, that penetrate the Earth's magnetic field and atmosphere and hit electronic chips while they are in operation.
A soft error is an error in a signal or data value that is caused by an external source (e.g. cosmic rays) rather than by the design itself.
How does it happen?
A Single Event Upset (SEU) is the event in which a storage circuit or a signal is hit by a cosmic particle that carries enough charge or energy to flip its value.
This flipped value can propagate through the logic and corrupt important data (such as a state machine, a program instruction in memory, the program counter, etc.), and it sometimes causes a whole-system crash. When an SEU propagates and causes a system-level problem, it is called a soft error.
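As a rough illustration of how a single flipped bit can corrupt state, the sketch below models an SEU as an XOR on one bit of a register. The 8-bit one-hot state register and the helper names `flip_bit` and `is_one_hot` are assumptions made for this example only; they do not come from any particular design.

```python
def flip_bit(value: int, bit: int) -> int:
    """Return `value` with one bit inverted, modelling a single event upset."""
    return value ^ (1 << bit)


def is_one_hot(value: int) -> bool:
    """True if exactly one bit is set, i.e. a legal one-hot state code."""
    return value != 0 and (value & (value - 1)) == 0


# Hypothetical 8-bit one-hot state register: only bit 2 is set.
state = 0b00000100

# A particle strike flips bit 6 of the register.
upset = flip_bit(state, 6)

print(f"before: {state:08b} legal={is_one_hot(state)}")
print(f"after:  {upset:08b} legal={is_one_hot(upset)}")
```

After the upset, two bits are set, so the state machine sits in a code that the design never intended to reach. What happens next depends entirely on how the surrounding logic treats illegal states.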
Is it recoverable?
A soft error does not damage the system’s hardware, and after a cold boot the system returns to its normal state. There is no implication that the system is any less reliable than before.
However, stopping the system for a reboot, or having the system crash, hang, or simply miscalculate, can be catastrophic in many cases, especially in mission-critical systems, where the MTBF (Mean Time Between Failures) requirement is very high.
Do we need to solve it?
The soft-error problem, also known as the SEU (Single Event Upset) problem, is becoming more and more critical in the industry because of the smaller transistor dimensions in advanced technology nodes: a particle’s charge or energy is now enough to flip an element’s value. The problem is seen at the 90 nm node and below.
Is there any way to mitigate the problem?
In the past, it was enough to protect the memories inside the chip (usually with ECC). With current technology nodes, protection is also needed for flip-flops.
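To make the idea of ECC protection concrete, here is a minimal sketch of a Hamming(7,4) code, one of the simplest codes that corrects a single flipped bit. The choice of Hamming(7,4) is an assumption made for brevity: the text does not say which ECC scheme a given memory uses, and real memories typically use wider codes. The function names are illustrative only.

```python
def hamming74_encode(data: list[int]) -> list[int]:
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword: list[int]) -> list[int]:
    """Correct up to one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based position of the bad bit, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


word = [1, 0, 1, 1]
stored = hamming74_encode(word)
stored[4] ^= 1  # a single event upset flips one stored bit
recovered = hamming74_decode(stored)
print(word, recovered)
```

The decoder recomputes the three parity checks; read together they give the position of the flipped bit, which is then inverted back before the data bits are extracted.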
There are several methods for protecting flip-flops, such as TMR (Triple Modular Redundancy) and DMR (Dual Modular Redundancy), among others.
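The difference between the two schemes can be sketched in a few lines. The model below treats each redundant flip-flop copy as an integer and is only a behavioural illustration; the helper names `tmr_vote` and `dmr_mismatch` are assumptions for this example, and real hardening is done in hardware, not software.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three copies: each output bit follows at least two inputs."""
    return (a & b) | (a & c) | (b & c)


def dmr_mismatch(a: int, b: int) -> bool:
    """With two copies an upset can be detected, but not corrected."""
    return a != b


value = 0b1011

# TMR: three copies of the same register, one of which is hit by an SEU.
copies = [value, value, value]
copies[1] ^= 0b0100  # flip one bit in the second copy
voted = tmr_vote(*copies)

# DMR: two copies, one of which is hit.
pair = [value, value ^ 0b0100]
detected = dmr_mismatch(*pair)

print(f"TMR output: {voted:04b}, DMR mismatch detected: {detected}")
```

TMR masks the upset because the two unaffected copies outvote the corrupted one. DMR only reports that the copies disagree, so the system still needs some recovery action.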
The new problem
Protecting all the flip-flops in a chip can cost as much as 30% more silicon, which translates roughly into a 30% increase in per-unit production cost and power consumption, a lower chip frequency (due to timing), and a higher NRE cost.
Hence, there is a need to determine which flops in the chip are sensitive to soft errors and which are not, so that only a small portion of the flops is hardened (protected) rather than all of them. The goal is to achieve high reliability at minimal cost in silicon, power, NRE, and performance. This problem is considered the number one unsolved problem in soft-error mitigation.
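The trade-off can be illustrated with a toy calculation. Everything in the sketch below is hypothetical: the flop names, the sensitivity scores, the greedy selection rule, and the assumption that area overhead scales linearly with the fraction of flops hardened. Only the 30% figure for hardening every flop is taken from the text above. It is not a description of how any real tool selects flops.

```python
FULL_HARDENING_OVERHEAD = 0.30  # from the text: up to about 30% extra silicon


def select_flops(sensitivity: dict[str, float], target: float) -> list[str]:
    """Greedily pick the most sensitive flops until `target` (0..1) of the
    total sensitivity score is covered."""
    total = sum(sensitivity.values())
    chosen, covered = [], 0.0
    for name, score in sorted(sensitivity.items(), key=lambda kv: kv[1], reverse=True):
        if covered >= target * total:
            break
        chosen.append(name)
        covered += score
    return chosen


# Hypothetical scores: each flop's share of the system-level failures it causes.
scores = {"pc_reg": 40, "fsm_state": 30, "dma_addr": 15, "debug_cnt": 10, "scratch": 5}

hardened = select_flops(scores, target=0.80)

# Assumed linear cost model: overhead is proportional to the share of flops hardened.
overhead = FULL_HARDENING_OVERHEAD * len(hardened) / len(scores)

print(hardened)
print(f"estimated area overhead: {overhead:.0%}")
```

In this made-up case, hardening three of the five flops covers 85% of the sensitivity score at roughly 18% overhead instead of 30%. In a real design the benefit depends on how unevenly sensitivity is distributed across the flops.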
Optima-DA provides an EDA solution for this problem. The tool, named CosmicASICs, reads in the design in RTL or gate-level format and recommends to the user which flops to harden and which to leave unprotected.
See these introductory videos about Soft-error mitigation by Jamil Mazzawi: