Accelerating Microprocessor Silicon Validation by Exposing ISA Diversity
Nikos Foutris, Dimitris Gizopoulos, Mihalis Psarakis, Xavier Vera, and Antonio Gonzalez. 2011. Accelerating microprocessor silicon validation by exposing ISA diversity. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-44 '11). ACM, New York, NY, USA, 386-397
SUMMARY
Self-checking methods are flourishing in recent post-silicon validation work,
and this paper offers an insightful way to improve the self-checking mechanism
with ISA diversity. It pursues, and achieves, three major goals: (1) checking
consistency without the need for a golden response, (2) digesting the
validation data into more refined and useful information, and (3) reducing
the effect of blocking bugs. The RIT-ERIT methodology finds inconsistencies
between two executions, which lets the authors point to the likely buggy
locations on the logic paths. One of the biggest advantages of the proposed
methodology over others is its novel way of bypassing failing instructions.
It replays the failing segment with its equivalent class, which prevents the
corrupted data from propagating (that is, it bypasses the blocking bug) and
allows many more bugs to be detected in a single run. In addition, its
post-processing passes only succinct data to the engineers, which reduces
the manual effort needed.
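The self-checking idea summarized above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the toy ISA, the injected `sub` bug, and the names `make_cpu`, `run_segment`, and `self_check` are all hypothetical. Continuing from the equivalent segment's state after a mismatch is also a simplifying assumption about how the bypass works.

```python
MASK = 0xFFFFFFFF  # toy 32-bit registers


def make_cpu(buggy_sub):
    """Return an execute(instr, regs) function for a hypothetical toy ISA."""
    def execute(instr, regs):
        op, dst = instr[0], instr[1]
        if op == "add":
            regs[dst] = (regs[instr[2]] + regs[instr[3]]) & MASK
        elif op == "sub":
            result = (regs[instr[2]] - regs[instr[3]]) & MASK
            if buggy_sub:
                result ^= 1  # injected design bug: flips the lowest bit
            regs[dst] = result
        elif op == "not":
            regs[dst] = ~regs[instr[2]] & MASK
        elif op == "addi":
            regs[dst] = (regs[instr[2]] + instr[3]) & MASK
        else:
            raise ValueError(f"unknown opcode {op}")
    return execute


def run_segment(execute, segment, regs):
    """Run one store-delimited segment and return the value it would store."""
    instrs, store_reg = segment
    for instr in instrs:
        execute(instr, regs)
    return regs[store_reg]


def self_check(execute, rit, erit, init):
    """Compare each original segment with its equivalent; no golden model needed."""
    regs = dict(init)
    failing = []
    for idx, (seg, eq_seg) in enumerate(zip(rit, erit)):
        regs_orig, regs_eq = dict(regs), dict(regs)
        stored_orig = run_segment(execute, seg, regs_orig)
        stored_eq = run_segment(execute, eq_seg, regs_eq)
        if stored_orig != stored_eq:
            failing.append(idx)
            regs = regs_eq  # assumption: bypass by continuing from the equivalent run
        else:
            regs = regs_orig
    return failing


init = {"r1": 10, "r2": 3, "r3": 0, "r4": 0, "t": 0}
# Original test: three segments, each ending in a store of one register.
rit = [
    ([("add", "r3", "r1", "r2")], "r3"),
    ([("sub", "r4", "r3", "r2")], "r4"),
    ([("add", "r1", "r4", "r2")], "r1"),
]
# Equivalent test: a - b is rewritten as a + ~b + 1; add uses swapped operands.
erit = [
    ([("add", "r3", "r2", "r1")], "r3"),
    ([("not", "t", "r2"), ("add", "r4", "r3", "t"), ("addi", "r4", "r4", 1)], "r4"),
    ([("add", "r1", "r2", "r4")], "r1"),
]

print(self_check(make_cpu(buggy_sub=True), rit, erit, init))
```

In this toy setup only the segment containing `sub` is flagged, and the later segment still runs on uncorrupted data. The sketch also makes the review's concern visible: if the equivalent sequence itself relied on the buggy logic, both runs could agree on a wrong value.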
COMMENTS
The idea in this paper is novel and useful. As described above, it greatly
reduces redundant information and can detect more bugs in one test by using
replay to resume normal operation. However, the method has several drawbacks:
(1) The first step, generating the ISA diversity database, needs a thorough
understanding of the underlying architecture, which requires a large amount
of human effort and can be both error-prone and time-consuming.
(2) The assumption that the equivalent class of an instruction is bug-free is
very strong. Since the equivalent instruction segment may exercise some of the
same logic paths as its counterpart, a bug in the shared part of the circuit
could affect both executions in the same way and go undetected.
(3) As with IFRA, this method also suffers from a large number of candidate
signals. In particular, since it identifies only the failing “segments of
instructions” delimited by store instructions, such segments may contain most
instruction types and thus leave the search space irreducible.
As for the paper itself, it doesn’t compare “bugs detected per test”, which it
claims as an important improvement, against other self-checking methods. It
also doesn’t describe its hardware overhead in detail. Last but not least, the
claimed result of 100% bug detection looks unconvincing because bugs could
exist in instructions that have no equivalence class. The paper declares its
result superior to Reversi’s because the latter may suffer from instructions
that lack a reverse counterpart, but it should also note that the same
limitation applies to the proposed method.
DISCUSSION POINTS
1. Recognizing equivalent classes of instruction sequences needs thorough
knowledge of the architecture and a lot of human effort. This process is
time-consuming, incomplete, and even error-prone. Would it be possible to
automate this step by limiting the length of candidate sequences to a small
number and then applying formal methods to find all the possible sequences?
2. An instruction and its equivalent might each have a bug in their respective
logic paths. How likely is this? How should such a situation be handled?
3. Bugs could exist on the logic path of instructions that have no equivalent
class. According to Figure 1, there is at least an 8% and up to a 20% chance
that such an instruction is executed. How can bugs be detected in this
scenario?
4. When replaying the failing segment of the program binary, how is the
previous processor state restored?
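Discussion point 1 can be made concrete with a small sketch of a bounded search. This is only an illustration under stated assumptions: the three-instruction toy ISA, the 4-bit register width, and the names `run`, `equivalent_to_sub`, and `find_equivalents` are hypothetical, and exhaustive simulation over all 4-bit inputs stands in for a real formal equivalence check.

```python
from itertools import product

WIDTH = 4                      # assumption: tiny registers keep exhaustive checking cheap
MASK = (1 << WIDTH) - 1
SOURCES = ("a", "b", "d")      # two inputs plus the destination register d

# Hypothetical toy ISA: every instruction writes its result to d.
INSTRS = (
    [("add", s1, s2) for s1 in SOURCES for s2 in SOURCES]
    + [("not", s) for s in SOURCES]
    + [("inc",)]
)


def run(seq, a, b, d):
    """Execute a candidate sequence and return the final value of d."""
    regs = {"a": a, "b": b, "d": d}
    for instr in seq:
        op = instr[0]
        if op == "add":
            regs["d"] = (regs[instr[1]] + regs[instr[2]]) & MASK
        elif op == "not":
            regs["d"] = ~regs[instr[1]] & MASK
        else:  # "inc"
            regs["d"] = (regs["d"] + 1) & MASK
    return regs["d"]


def equivalent_to_sub(seq):
    """True if seq computes d = a - b for every input and every initial d."""
    return all(
        run(seq, a, b, d) == ((a - b) & MASK)
        for a, b, d in product(range(MASK + 1), repeat=3)
    )


def find_equivalents(max_len):
    """Enumerate all sequences up to max_len and keep those matching sub."""
    return [
        seq
        for n in range(1, max_len + 1)
        for seq in product(INSTRS, repeat=n)
        if equivalent_to_sub(seq)
    ]


found = find_equivalents(3)
for seq in found:
    print(seq)
```

Among the results are the familiar two's-complement rewrites `a + ~b + 1` and `~(~a + b)`. The search space grows exponentially with sequence length and ISA size, which is why the length bound in the question matters.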
NOTES
- The experimental results show a snapshot taken when the proposed method reaches 100% bug coverage.
- Area overhead: the Store/Estore address buffer can be implemented using the L1 cache.
- Performance degradation: the mechanism can be shut down in the final tape-out.