Optima’s Approach Consistent with Findings of a Major Research Paper

A major research paper published last year represents a milestone in the error-resilience literature. The research, entitled CLEAR (for “Cross-Layer Exploration for Architecting Resilience”), was carried out by an impressive list of authors, including Prof. Subhasish Mitra from Stanford, Prof. Jacob A. Abraham from UT Austin, and Pradip Bose – a Distinguished Research Staff Member at IBM. Parts of the work were presented at the 2016 Design Automation Conference (DAC), the 2016 12th Workshop on Silicon Errors in Logic – System Effects (SELSE), the 2016 21st IEEE European Test Symposium (ETS), the 2015 1st International Workshop on Resiliency in Embedded Electronic Systems (REES), and elsewhere; see arXiv for the full paper.

The goal of the CLEAR project is to combine hardware and software techniques to tolerate errors, with a focus on soft errors in processor cores. The paper reports very good results when searching for an optimal combined solution whose components span all levels of abstraction, such as a combination of circuit-level hardening, logic-level parity checking, and micro-architectural recovery. These highly effective results are possible largely thanks to the LEAP hardening technique, which addresses both SEUs (single-event upsets) and SEMUs (single-event multiple upsets).

One of the paper’s conclusions is that such combinations are not always strictly required. It shows that “Selective circuit-level hardening [alone – AR], guided by a thorough analysis of the effects of soft errors on application benchmarks, provides a highly effective soft error resilience approach. For example, a 50× SDC [Silent Data Corruption – AR] improvement is achieved at 3.1% and 7.3% energy costs for the OoO- and InO-cores, respectively”. In fact, this energy cost was just one percentage point higher than that of the best combination they found, while the approach itself is much simpler.

The paper goes on to investigate how sensitive the results are to the specific software used in the error-injection simulations. In general, the authors state that “the above conclusions about cost-effective soft error resilience techniques largely hold across various application characteristics (e.g., latency constraints despite errors in soft real-time applications)”. They go further and note that “The most cost-effective resilience techniques rely on selective circuit hardening / parity checking guided by error injection using application benchmarks. This raises the question: what happens when the applications in the field do not match application benchmarks?”.

The results of their research show that “the most vulnerable 10% of flip-flops (i.e., the flip-flops that result in the most SDCs [Silent Data Corruptions – AR] or DUEs [Detected but Uncorrected Errors – AR]) are consistent across benchmarks. Since the number of errors resulting in SDC or DUE is not uniformly distributed among flip-flops, protecting these top 10% of flip-flops will result in the ~10× SDC improvement regardless of the benchmark considered”. However, to achieve very high degrees of resilience, identifying the critical flip-flops based on only a few applications would not be sufficient.
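To make the arithmetic behind this point concrete, here is a minimal sketch (not taken from the paper) of how one can estimate the SDC improvement obtained by hardening only the most vulnerable flip-flops, given per-flip-flop error-injection counts. The function name, the toy distribution, and all numbers are hypothetical; the CLEAR study relies on detailed error injection into actual OoO/InO core designs rather than this simplified model.

```python
# Hypothetical sketch: estimate the SDC improvement from hardening only
# the most vulnerable flip-flops, given per-flip-flop error-injection results.
# All numbers below are illustrative, not taken from the CLEAR paper.

def sdc_improvement(sdc_counts, protect_fraction):
    """sdc_counts: SDC count per flip-flop from error injection.
    protect_fraction: fraction of flip-flops to harden (e.g. 0.10).
    Returns baseline_SDCs / remaining_SDCs, assuming hardened
    flip-flops no longer contribute any SDCs."""
    ranked = sorted(sdc_counts, reverse=True)      # most vulnerable first
    n_protected = int(len(ranked) * protect_fraction)
    baseline = sum(ranked)
    remaining = sum(ranked[n_protected:])          # SDCs from unprotected flip-flops
    return baseline / remaining if remaining else float("inf")

# Toy distribution: 1,000 flip-flops where SDCs are heavily concentrated
# in a small subset (roughly the situation the paper describes).
sdc_counts = [900] * 100 + [10] * 900   # top 10% cause ~91% of all SDCs

print(f"~{sdc_improvement(sdc_counts, 0.10):.1f}x SDC improvement "
      "from hardening the top 10% of flip-flops")
# Prints ~11.0x; with a uniform distribution the same 10% would give only ~1.1x.
```

The sketch also illustrates the caveat above: the achievable improvement is bounded by the SDC contribution of the unprotected flip-flops, so very high resilience targets depend on how accurately the vulnerable set matches the software actually run in the field.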

We believe that Optima’s approach to soft error prevention is consistent with these findings:
1. Selective hardening, guided by a thorough analysis of the effects of soft errors on application benchmarks, provides a highly effective soft error resilience approach.
2. The results are not critically dependent on the benchmark selected (as long as it is a reasonable benchmark), since the most vulnerable 10% of flip-flops are consistent across benchmarks, although very high accuracy requires exact knowledge of the software to be run in the field.