Philippe Luc, Director of Verification at Codasip, shares his view on what bugs verification engineers should pay attention to.
Did you know that between 1,000 and 2,000 bugs can appear during the design of a complex processor core? Really, a thousand bugs? Well, that’s what experience showed us. And not all bugs were born equal: their importance and consequences can vary significantly. Let’s go through 4 categories of CPU bugs, how to find them, and what the consequences would be for the user if we did not find them.
Type 1: the processor bug that verification engineers can easily find
“Oh, I forgot the semicolon”. Yes, that is one bug. It is very easy to detect: you typically find it directly at compile time. Apart from keeping your eyes wide open, there is nothing else to do to avoid these.
“Oh, it turns out that a part of the specification has not been implemented”. That is another easy CPU bug for you to find with any decent testbench – provided that an explicit test exists. In this scenario, the first simple test exercising the feature will fail. What does your processor verification team need to do? Make sure you have exhaustive tests. The design team, on the other hand, needs to read the specification carefully and follow any changes to it during development.
In other words, the easy bug is one that is found simply by running a test that exercises the feature. Its (bad) behavior is systematic and does not depend on timing. Being exhaustive in your verification is the key to finding such CPU bugs. Code coverage will help you but is definitely not enough. If a feature is not coded in the RTL, how can coverage report that it is missing? A code review – with the specification at hand – definitely helps.
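To make the "exhaustive tests" point concrete, here is a minimal sketch of a specification-to-test traceability check. It is only an illustration: the feature names, the test names and the `untested_features` helper are hypothetical, and a real flow would take them from a verification plan rather than a hand-written dictionary.

```python
def untested_features(spec_features, tests):
    """Return the spec features that no test claims to exercise, in spec order."""
    exercised = set()
    for features in tests.values():
        exercised.update(features)
    return [feature for feature in spec_features if feature not in exercised]


# Hypothetical feature list, as it might be extracted from a specification.
SPEC_FEATURES = ["add", "sub", "mul", "div", "misaligned_load_trap"]

# Hypothetical tests, each tagged with the features it exercises.
TESTS = {
    "test_alu_basic": {"add", "sub"},
    "test_muldiv": {"mul", "div"},
}

missing = untested_features(SPEC_FEATURES, TESTS)
print(missing)  # ['misaligned_load_trap']
```

Unlike code coverage, this check starts from the specification, so it can flag a feature that has no test even when the feature was never written in the RTL.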
Type 2: the corner case that verification teams like to find
A corner case CPU bug is more complex to find and requires a powerful testbench. The simple test cases that exercise the feature pass correctly, even with random delays. Quite often, you find these bugs when asynchronous events join the party. For example, an interrupt arrives just between 2 instructions, at a precise timing. Or a line in the cache gets evicted just when the store buffer wants to merge into it. To reach these bugs, you need a testbench that juggles the instructions, the parameters and the delays so that all the possible interleavings of instructions and events are exercised. Obviously, a good checker should spot any deviation from what is expected.
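As a rough illustration of what "all the possible interleavings" means, the sketch below enumerates every way a stream of asynchronous events can be merged into a stream of instructions while keeping the order inside each stream. It is an abstract model, not a testbench: the `interleavings` function and the instruction and event names are assumptions made for the example, and a real environment would work at cycle level.

```python
def interleavings(instructions, events):
    """Yield every merge of the two sequences that preserves the internal
    order of each sequence."""
    if not instructions:
        yield tuple(events)
        return
    if not events:
        yield tuple(instructions)
        return
    # Either the next instruction goes first...
    for rest in interleavings(instructions[1:], events):
        yield (instructions[0],) + rest
    # ...or the next asynchronous event does.
    for rest in interleavings(instructions, events[1:]):
        yield (events[0],) + rest


# Hypothetical program fragment and asynchronous events.
program = ("load", "store", "add")
async_events = ("interrupt", "cache_evict")

scenarios = list(interleavings(program, async_events))
print(len(scenarios))  # 10
for scenario in scenarios[:3]:
    print(scenario)
```

Even this tiny case produces 10 scenarios, and the count grows combinatorially with longer sequences, which is why the text asks for a powerful testbench rather than hand-written tests.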
Does code coverage help in that case? Unfortunately not, simply because the condition of the bug is a combination of several events that are already covered individually. Here, condition coverage or branch coverage might be helpful. But it is painful to analyze and it is rarely beneficial in the end.
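The gap between "each event covered" and "the combination covered" can be shown with a small sketch. It uses a simple cross of events seen in the same cycle, which is a different technique from the code, condition and branch coverage discussed above and is not presented here as the author's method. The event names and the recorded cycles are invented for the example.

```python
from itertools import product


def individually_covered(cycles):
    """Return every event seen at least once, regardless of timing."""
    seen = set()
    for events in cycles:
        seen |= events
    return seen


def cross_covered(cycles, group_a, group_b):
    """Return the (a, b) pairs that were seen together in the same cycle."""
    hit = set()
    for events in cycles:
        for a, b in product(group_a, group_b):
            if a in events and b in events:
                hit.add((a, b))
    return hit


# Hypothetical event names.
ASYNC_EVENTS = ["interrupt", "cache_evict"]
PIPELINE_EVENTS = ["store_buffer_merge"]

# Hypothetical trace: the set of events observed in each cycle.
CYCLES = [
    {"interrupt"},
    {"cache_evict"},
    {"store_buffer_merge"},
    {"interrupt", "store_buffer_merge"},
]

seen = individually_covered(CYCLES)
hit = cross_covered(CYCLES, ASYNC_EVENTS, PIPELINE_EVENTS)
missed = set(product(ASYNC_EVENTS, PIPELINE_EVENTS)) - hit

print(sorted(seen))    # all three events were hit individually
print(sorted(missed))  # [('cache_evict', 'store_buffer_merge')]
```

Every event shows up as covered on its own, yet the eviction never coincided with a store buffer merge, which is exactly the kind of combination where a corner case can hide.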
Type 3: The hidden CPU bug found by accident – or by customers
The hidden bugs are found by customers (which is bad), or by chance (internally, before release). In both cases, it means that the verification methodology was not able to find them.
If you use different testbenches or environments, you could find other cases just because the stimuli are different. Fair enough. Then, what do we mean by “found by chance”? This is where random testbench methodology reaches its limit.
With random stimuli, the testbench usually generates the “same” thing. If you roll a die to get a random number, there is very little chance of getting the number 6 ten times in a row: one chance in about 60 million (6¹⁰), to be accurate. With a RISC-V CPU that has 100 different instructions, an equiprobable random instruction generator has only 1 chance in 10²⁰ of generating a given instruction 10 times in a row. That is just over twice the number of different positions of a Rubik’s Cube… Yet on a 10-stage pipeline processor, it is not unreasonable to want to test it with the same instruction present in all pipeline stages. Good luck if you don’t tune your random constraints…
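The numbers above, and one possible way of tuning the constraints, can be sketched in a few lines. This is an assumption-laden toy, not a real instruction generator: opcodes are plain integers, and the `burst_prob` knob, `generate` and `longest_run` are hypothetical names introduced for the example.

```python
import random

DIE_FACES = 6
NUM_INSTRUCTIONS = 100  # size of the instruction set used in the text
PIPELINE_DEPTH = 10     # 10-stage pipeline used in the text

# Odds quoted in the text: one chance in ...
die_odds = DIE_FACES ** PIPELINE_DEPTH             # 60_466_176
uniform_odds = NUM_INSTRUCTIONS ** PIPELINE_DEPTH  # 10**20


def generate(length, burst_prob, rng):
    """Return a list of opcode indices. With probability burst_prob, the
    chosen opcode is repeated PIPELINE_DEPTH times instead of once."""
    stream = []
    while len(stream) < length:
        opcode = rng.randrange(NUM_INSTRUCTIONS)
        repeat = PIPELINE_DEPTH if rng.random() < burst_prob else 1
        stream.extend([opcode] * repeat)
    return stream[:length]


def longest_run(stream):
    """Return the length of the longest run of identical opcodes."""
    best = current = 0
    previous = None
    for opcode in stream:
        current = current + 1 if opcode == previous else 1
        previous = opcode
        best = max(best, current)
    return best


rng = random.Random(1234)  # fixed seed so the experiment is repeatable
plain = generate(10_000, burst_prob=0.0, rng=rng)
tuned = generate(10_000, burst_prob=0.05, rng=rng)

print(die_odds, uniform_odds)
print(longest_run(plain), longest_run(tuned))
```

With `burst_prob=0.0` the generator is purely uniform and a run that fills the pipeline is practically out of reach. A small burst probability makes such runs routine while the rest of the stream stays random.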
Type 4: The silly bug that would not happen in real life
You can take looking for corner cases and hidden cases too far and end up creating tests that are simply too silly.
Changing the endianness back and forth every cycle while connecting the debugger is probably something that will never ever happen on a consumer product. If the consequences of a CPU bug are never visible to a customer, then it is not really a bug. If you deliberately unplug your USB stick while you copy a file, and the file is corrupted, I do not consider this a bug. If some operation causes the USB controller to hang, then yes, that is a bug.
Beware of extending the scope of the verification. When silly cases are found, you are probably investing engineering effort in the wrong place.
There are different verification techniques you can apply to efficiently find CPU bugs before your customers do. At Codasip, we use multiple component testbenches, various random test generators, random irritators, and several other techniques to verify our products. As the project evolves, we develop these techniques to have a robust verification methodology. Learn more in our blog post where we explain how we continuously improve our verification methodology.