
Nov 8, 2012

Accelerating Microprocessor Silicon Validation by Exposing ISA Diversity


Nikos Foutris, Dimitris Gizopoulos, Mihalis Psarakis, Xavier Vera, and Antonio Gonzalez. 2011. Accelerating microprocessor silicon validation by exposing ISA diversity. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-44 '11). ACM, New York, NY, USA, 386-397
SUMMARY
Self-checking methods have flourished in recent post-silicon validation work, and this paper improves the self-checking mechanism by exploiting ISA diversity. It sets, and meets, three major goals: (1) self-checking for consistency without the need for a golden response, (2) digesting the validation data into more refined and useful information, and (3) reducing the effect of blocking bugs. The RIT-ERIT methodology finds inconsistencies between two executions, which lets the authors point to the likely buggy locations on the logic paths. One of the biggest advantages of the proposed methodology over others is its novel way of bypassing failing instructions: the failing segment is replayed with its equivalent, which prevents corrupted data from propagating (that is, it bypasses the blocking bug) and allows many more bugs to be detected in a single run. In addition, post-processing passes only succinct data to the engineers, which reduces the manual effort needed.
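
To make the consistency check concrete, here is a minimal sketch of the idea as I understand it from the summary above. It is not the authors' implementation: the four-instruction toy ISA, the injected bug, and all names (`execute`, `run_segment`, `check_consistency`, `PROGRAM`, `EQUIVALENT`) are my own assumptions. Each store-terminated segment is run alongside an equivalent segment, the stored values are compared, and on a mismatch execution continues from the equivalent's state.

```python
# Toy model only: the mini-ISA, the bug injection, and every name below are
# hypothetical and chosen for illustration; none of it comes from the paper.
MASK = 0xFFFFFFFF

def execute(instr, regs, buggy_ops):
    """Run one instruction; return the stored value for a store, else None."""
    op, *args = instr
    if op == "store":
        return regs[args[0]]
    dst, a, b = args
    if op == "add":
        result = regs[a] + regs[b]
    elif op == "sub":
        result = regs[a] - regs[b]
    elif op == "xor":
        result = regs[a] ^ regs[b]
    elif op == "shl":            # here b is an immediate shift amount
        result = regs[a] << b
    else:
        raise ValueError(f"unknown op: {op}")
    if op in buggy_ops:
        result ^= 1              # injected "silicon bug": low bit flipped
    regs[dst] = result & MASK
    return None

def run_segment(segment, regs, buggy_ops):
    """Execute one store-terminated segment on a copy of the register state."""
    regs = dict(regs)
    stored = None
    for instr in segment:
        value = execute(instr, regs, buggy_ops)
        if value is not None:
            stored = value
    return regs, stored

def check_consistency(program, equivalent, regs, buggy_ops):
    """Compare each segment with its equivalent; return failing indices and final state."""
    failing = []
    for index, (seg, eq_seg) in enumerate(zip(program, equivalent)):
        regs_orig, out_orig = run_segment(seg, regs, buggy_ops)
        regs_eq, out_eq = run_segment(eq_seg, regs, buggy_ops)
        if out_orig != out_eq:
            failing.append(index)
            regs = regs_eq       # assumption: the equivalent segment is bug-free
        else:
            regs = regs_orig
    return failing, regs

# Segment 0: r1 = 2 * r0, via a shift or via an add.
# Segment 1: r2 = 0, via a subtract or via an xor.
PROGRAM = [
    [("shl", "r1", "r0", 1), ("store", "r1")],
    [("sub", "r2", "r1", "r1"), ("store", "r2")],
]
EQUIVALENT = [
    [("add", "r1", "r0", "r0"), ("store", "r1")],
    [("xor", "r2", "r1", "r1"), ("store", "r2")],
]
```

Calling `check_consistency(PROGRAM, EQUIVALENT, {"r0": 3}, {"shl"})` flags segment 0 and continues with the uncorrupted `r1 = 6`. If the injected bug is in `add` instead, segment 0 is still flagged, but the sketch then continues from the corrupted state, because a mismatch alone does not say which of the two executions is the faulty one.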

COMMENTS
The idea in this paper is novel and useful. As described above, it greatly reduces redundant information and can detect more bugs in one test by replaying a failing segment so that normal operation resumes. However, the method has several drawbacks:
(1) The first step, generating the ISA diversity database, needs a thorough understanding of the underlying architecture, which requires a large amount of human effort and can be both error-prone and time-consuming.
(2) The assumption that the equivalent of an instruction is bug-free is very strong. The equivalent instruction segment may exercise some of the same logic paths as its counterpart, so a bug in the shared part of the circuit affects both executions and could go undetected.
(3) Like IFRA, this method suffers from a large number of candidate signals. In particular, because it identifies only failing "segments of instructions", which are delimited by store instructions, a segment may contain most types of instruction and thus leave the search space irreducible.
As for the paper itself, it does not compare "bugs detected per test", which it claims as an important improvement, against other self-checking methods. Nor does it describe its hardware overhead in detail. Last but not least, the claimed result of 100% bug detection is unconvincing, because bugs could exist in instructions that have no equivalence class. The paper declares its results superior to Reversi's because the latter may suffer from the portion of instructions that lack a reverse counterpart, but it should acknowledge that the same situation can arise for the proposed method.

DISCUSSION POINTS
1. Recognizing equivalence classes of instruction sequences requires thorough knowledge of the architecture and a lot of human effort. This process is time-consuming, incomplete, and error-prone. Would it be possible to automate this step by limiting the length of candidate sequences to a small number and then applying formal methods to find all possible sequences?
2. An instruction and its equivalent might each have a bug in its own logic path. How likely is this, and how can such a situation be handled?
3. Bugs could exist on the logic paths of instructions that have no equivalence class. According to Figure 1, there is at least an 8% and up to a 20% chance that such an instruction is executed. How can bugs be detected in this scenario?
4. When the failing segment of the program binary is replayed, how is the previous processor state restored?
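
Discussion point 1 can be pictured with a small sketch. This is an assumption-laden toy, not a formal-methods tool and not anything from the paper: it uses a hypothetical 8-bit, single-register ISA (the `OPS` table), enumerates every sequence up to a small length, and groups sequences that produce identical outputs on all 256 inputs. Exhaustive simulation is exact here only because the domain is tiny; a real ISA would need a solver or equivalence checker in its place.

```python
from itertools import product

# Hypothetical 8-bit accumulator ISA, invented for this illustration.
WIDTH = 8
MASK = (1 << WIDTH) - 1

OPS = {
    "inc": lambda x: (x + 1) & MASK,
    "dec": lambda x: (x - 1) & MASK,
    "dbl": lambda x: (x * 2) & MASK,
    "shl1": lambda x: (x << 1) & MASK,
    "neg": lambda x: (-x) & MASK,
    "not": lambda x: (~x) & MASK,
}

def run(seq, x):
    """Apply a sequence of op names to the accumulator value x."""
    for name in seq:
        x = OPS[name](x)
    return x

def equivalence_classes(max_len):
    """Group all sequences of length 1..max_len by their full input/output table."""
    classes = {}
    for length in range(1, max_len + 1):
        for seq in product(OPS, repeat=length):
            signature = tuple(run(seq, x) for x in range(MASK + 1))
            classes.setdefault(signature, []).append(seq)
    # Keep only classes that offer at least one alternative sequence.
    return [group for group in classes.values() if len(group) > 1]

def equivalents_of(seq, max_len=2):
    """Return the other sequences that behave exactly like seq."""
    for group in equivalence_classes(max_len):
        if seq in group:
            return [other for other in group if other != seq]
    return []
```

For example, `equivalents_of(("neg",))` includes `("not", "inc")`, the familiar two's-complement identity, and `equivalents_of(("dbl",))` includes `("shl1",)`. The number of candidates grows exponentially with sequence length, which is why the question proposes bounding the length.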




NOTES
  • The experimental results show a snapshot taken when the proposed method reaches 100% bug coverage.
  • Area overhead: the Store/Estore address buffer can be implemented using the L1 cache.
  • Performance degradation: the mechanism can be shut down in the final tape-out.
