
Nov 11, 2012

TAB-BackSpace: Unlimited-Length Trace Buffers with Zero Additional On-Chip Overhead


Flavio M. de Paula, Amir Nahir, Ziv Nevo, Avigail Orni, and Alan J. Hu. 2011. TAB-BackSpace: unlimited-length trace buffers with zero additional on-chip overhead. In Proceedings of the 48th Design Automation Conference (DAC '11). ACM, New York, NY, USA, 411-416. 

Summary
BackSpace is a promising formal root-cause backtracing technique. It uses pre-image computation to emulate an unlimited-length trace buffer. However, because it manipulates concrete state bits, it is extremely time consuming. Abstract BackSpace was then proposed to reduce the runtime by focusing only on the “visible” state bits, without producing any false negatives. While this appears to be a large improvement, it still needs a formal engine to compute the pre-images. In addition, the abstraction introduces false positives from two sources: (1) spurious transitions in the trace and (2) false matches (breaking at the wrong time, which occurs because one abstract state corresponds to many concrete states).
Against this background, TAB-BackSpace is proposed to both reduce the runtime and minimize the risk of spurious traces. Like BackSpace, it needs to rerun the test many times to extend the length of the back trace, and it needs a configurable breakpoint mechanism to link consecutive runs. The improvements are: (1) TAB-BackSpace can trace back many cycles of history at once, whereas BackSpace can compute only one pre-image step at a time; (2) TAB-BackSpace relies entirely on the contents recorded in the trace buffer, which is the actual history of the chip execution, so it does not spend time on the expensive pre-image computation; and (3) because TAB-BackSpace finds trace overlap by directly matching trace-buffer contents without pre-image computation, one source of spurious traces is eliminated entirely. The other source, false matches, is also suppressed, because matching multiple cycles of history makes the probability of a false match very small. According to the experimental results, this approach reveals the root cause of a real bug in the IBM POWER7 within 10 iterations.
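To make the overlap idea concrete, here is a minimal sketch of how I picture the iteration. It is not the authors' implementation: the chip is replaced by a fixed list of abstract states, and the names `run_until`, `stitch`, and `tab_backspace`, as well as the `depth` and `overlap` parameters, are hypothetical. It also assumes the rerun is fully deterministic, which the real flow cannot take for granted.

```python
from typing import List, Optional, Sequence

State = int  # hypothetical abstract state: the visible signals of one cycle


def run_until(execution: Sequence[State], breakpoint_state: State,
              depth: int) -> Optional[List[State]]:
    """Stand-in for one chip run: stop at the first cycle whose state equals
    the breakpoint and return the last `depth` states (the trace buffer)."""
    for cycle, state in enumerate(execution):
        if state == breakpoint_state:
            return list(execution[max(0, cycle - depth + 1):cycle + 1])
    return None  # the breakpoint never triggered


def stitch(dump: List[State], known: List[State],
           overlap: int) -> Optional[List[State]]:
    """Join a new buffer dump in front of the known trace, but only if the
    last `overlap` states of the dump equal the first `overlap` states of
    the known trace. Assumes overlap >= 1."""
    if len(dump) <= overlap or len(known) < overlap:
        return None  # nothing new to add, or not enough states to compare
    if dump[-overlap:] != known[:overlap]:
        return None  # false match: the breakpoint fired at the wrong time
    return dump[:-overlap] + known


def tab_backspace(execution: Sequence[State], crash_state: State,
                  depth: int, overlap: int, max_iters: int) -> List[State]:
    """Extend the trace backwards by rerunning with earlier breakpoints."""
    known = run_until(execution, crash_state, depth)
    if known is None:
        return []
    for _ in range(max_iters):
        if len(known) < overlap:
            break
        # Break on a state near the start of the known trace, so that the
        # new dump shares `overlap` cycles with what we already have.
        dump = run_until(execution, known[overlap - 1], depth)
        if dump is None:
            break
        longer = stitch(dump, known, overlap)
        if longer is None:
            break  # a real flow would retry here instead of giving up
        known = longer
    return known
```

In this toy model each successful iteration adds `depth - overlap` cycles of history, so a buffer of 8 cycles with an overlap of 3 gains 5 cycles per rerun. The multi-cycle comparison in `stitch` is what rejects a dump whose final state happens to equal the breakpoint but whose preceding history differs.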


Comments:
In this paper, the authors are inspired by BackSpace, a technique that achieves unlimited-length backtracing by rerunning the tests and formally computing the pre-image of the trace, and propose a more efficient backtracing technique: TAB-BackSpace. One advantage of TAB-BackSpace over BackSpace is that it does not need time-consuming formal techniques to compute pre-images. The other is that it needs only basic debugging hardware, namely trace buffers and breakpoint logic, which many state-of-the-art post-silicon validation techniques already require, so it adds no extra on-chip area overhead.
While I consider this approach a very effective improvement over BackSpace, I still have the following concerns:
(1)     This approach cannot guarantee that the traces it finds are short. Shortcuts may exist from the root cause to the breakpoint or the crash state; however, because the approach is simulation-like in nature, the search in each step is incomplete and depends heavily on the test patterns or test programs. Thus, TAB-BackSpace may take a detour and fail to trace back to the root cause before reaching the runtime or iteration limit.
(2)     Although the authors claim that they can handle non-determinism/randomness to some extent, the approach still cannot cope with electrical errors, because most of these errors may not be reproducible.

Discussion
1.       In the experiment, the authors demonstrate that TAB-BackSpace can reach the root cause of the bug by backtracing about 2000 cycles within 10 iterations. Is this distance (from the root cause to the crash state) typical? Can the distance for some bugs be several million cycles or more?
2.       Is this technique practical for NoC or multicore processor validation? Although the authors claim that they can handle non-determinism/randomness, it is possible that TAB-BackSpace has to run many extra iterations, or even needs luck to reach the same partial trace again.
3.       The claim “zero additional on-chip overhead” looks like a marketing slogan. The paper criticizes BackSpace for using too much area on signature storage and breakpoint circuitry. However, TAB-BackSpace also needs area for the trace buffer and breakpoint circuitry. Does it provide any improvement in area?
