Building a Swiss cheese model approach for processor verification
April 29, 2022
All processors have high quality requirements, and their reliability is the main concern of processor verification teams. Providing best-in-class products requires a strategic, diligent, and thorough approach. Processor verification therefore plays a major role, and it takes a combination of all industry-standard techniques – much like in a Swiss cheese model.
You’ve heard me say this before: processor verification is a subtle art. We need to take uncertainty into account, which means opening up the scope of our verification while optimizing resources. On one hand, we want to find all critical bugs before final production; on the other hand, we must have an efficient verification strategy to meet time-to-market requirements. Doing smart processor verification means finding meaningful bugs as efficiently and as early as possible during the development of the product. One way of achieving this is to combine all industry-standard verification techniques. It is by creating redundancy that we find all critical bugs.
There are different types of bugs, and each bug has a complexity – or bug score – that depends on the number and types of events required to trigger it. Some might be found with coverage, others with formal proofs, etc. Imagine the Swiss cheese model applied to processor verification. Each slice of cheese is a verification technique with specific strengths for catching certain categories of bugs. The risk of a bug escaping into the end product is mitigated by stacking different types of verification one behind the other.
In a Swiss cheese model applied to processor verification, the principle is similar to the aviation industry: if there is a direct path through all the slices, then there is a risk of a plane crash. That is why the aviation industry is strict about procedures, checklists, and redundant systems. The objective is to add more slices and reduce the size of the holes in each slice so that, in the end, no hole goes all the way through and we deliver a quality processor.
By using several slices of cheese, or verification methods:
A hole in a slice is a hole in the verification methodology. The more holes, and the bigger the holes, the more bugs can escape. If the same area of the design (overlapping holes between cheese slices) is not covered and tested by any of the verification techniques, then the bug will make it through and end up in the final deliverables.
A good verification methodology must present as few holes as possible, as small as possible, on each slice. A solid strategy, experience, and efficient communication are important factors to deliver quality products.
When we find a bug, or a hole in a slice, during verification, we always fix it and check other slices for similar holes. Every slice should find the holes in the previous one and address them before progressing. Sanity checks are an efficient way to achieve this, for example by comparing our design with industry standard models such as Spike or Imperas.
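To make the idea of such a sanity check concrete, here is a minimal sketch of a per-instruction trace comparison against a reference model, in the spirit of comparing a design with Spike or Imperas. The trace fields and function names are illustrative assumptions, not a description of Codasip's actual flow.

```python
# Minimal sketch of a trace-compare sanity check against a reference model
# such as Spike. The field names and trace format are illustrative only.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class RetiredInstr:
    pc: int                     # program counter of the retired instruction
    insn: int                   # raw instruction encoding
    rd: Optional[int]           # destination register, if any
    rd_value: Optional[int]     # value written back, if any

def compare_traces(dut_trace, ref_trace):
    """Compare two retirement traces and report the first divergence."""
    for i, (dut, ref) in enumerate(zip(dut_trace, ref_trace)):
        if dut != ref:
            return f"divergence at retirement #{i}: DUT={dut} REF={ref}"
    if len(dut_trace) != len(ref_trace):
        return "traces have different lengths"
    return "traces match"
```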
In the Swiss cheese model applied to processor verification, if one technique is strengthened – an improved testbench, new assertions, etc. – the bug is found and fixed before the product goes into production. All processor verification techniques are important and it is the combination of all of them that makes each of them more efficient.
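As a back-of-the-envelope illustration of why this redundancy pays off, consider a toy model in which each verification layer independently misses a given bug with some probability; the bug escapes only if every layer misses it. The per-layer numbers below are purely illustrative.

```python
# Toy model of the Swiss cheese idea: a bug escapes only if every layer misses it.
# The per-layer miss probabilities are illustrative, not measured data.
def escape_probability(miss_probs):
    """Probability that a bug slips through all layers, assuming independence."""
    p = 1.0
    for miss in miss_probs:
        p *= miss
    return p

print(escape_probability([0.05]))                 # one strong technique alone: 0.05
print(escape_probability([0.2, 0.1, 0.3, 0.15]))  # four imperfect techniques stacked: 0.0009
```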
A single verification technique cannot do everything by itself; it is the combined action of all of them that improves the overall quality of the verification and the processor design. There can be unexpected changes or factors during the development of a product – external actions that can impact the efficiency of a technique. For example, a change in the design not communicated to the verification team, or a difficult Friday afternoon leading to human mistakes. These factors can increase the size of a hole in a slice, hence the importance of having more than one – and the importance of keeping engineering specifications up to date and communicating regularly between designers and verification engineers. Code reviews conducted by other team members are one efficient solution to achieve this, and that is what we do at Codasip.
At Codasip, we use verification technologies and techniques that allow us to create redundancy, preventing holes from going all the way through the stack of cheese slices, and to deliver best-in-class RISC-V processors.
April 4, 2022
I am often asked the question “When is the processor verification done?” or, in other words, “How do I measure the efficiency of my testbench and how can I be confident in the quality of the verification?”. There is no easy answer. There are several common indicators used in the industry, such as coverage and the bug curve. While they are absolutely necessary, they are not enough to reach the highest possible quality. Indeed, such indicators do not really reveal the ability of a verification methodology to find the last bugs. With experience, I learned that measuring the complexity of processor bugs is an excellent indicator to use throughout the development of the project.
Experience taught me that we can define the complexity of a bug by counting the number of independent events or conditions that are required to hit the bug.
Let’s take a simple example. A typical bug is found in the caches, when a required hazard is missing. Data corruption can occur when:
1. A dirty line is evicted from the cache, triggering a write-back to external memory.
2. A read request is issued to the same address shortly afterwards.
3. Delays on the external buses make the read fast and the write-back slow, so the read overtakes the eviction.
4. External memory returns the previous data because the most recent data from the eviction got lost, causing data corruption.
In this example, 4 events – or conditions – are required to hit the bug. These 4 events give the bug a score of 4, or in other words a complexity of 4.
To measure the complexity of a bug, we can come up with a classification that will be used by the entire processor verification team. In a previous blog post, we discussed 4 types of bugs and explained how we use these categories to improve the quality of our testbench and verification. Let’s go one step further and combine this method with bug complexity.
An easy bug can require between 1 and 3 events to be triggered: the first simple test that exercises the feature fails. A corner case is going to need 4 or more events.
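As a sketch of how such a classification could be applied mechanically in a bug-tracking flow, the easy/corner boundary below follows the definitions above, while the boundary for a hidden case is only an illustrative assumption.

```python
def classify_bug(num_events: int) -> str:
    """Map a bug score (number of independent triggering events) to a category.
    The easy/corner boundary follows the text above; the corner/hidden boundary
    (8 events) is only an illustrative assumption."""
    if num_events <= 3:
        return "easy"
    if num_events < 8:
        return "corner case"
    return "hidden case"

print(classify_bug(4))   # "corner case" -- the cache hazard example above
```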
Going back to our example above, we have a bug with a score of 4. If one of the four conditions is not present, then the bug is not hit.
A constrained random testbench will need several features to be able to hit the example above. The sequence of addresses should be smart enough to reuse addresses from previous requests, and delays on external buses should be sufficiently atypical to produce fast Reads and slow-enough Writes.
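As an illustration, here is a simplified sketch of those two testbench features – address reuse and skewed bus delays. The class and parameter names are hypothetical; a real constrained-random environment would express the same constraints in its own framework.

```python
import random

class SmartStimulus:
    """Sketch of two constrained-random features: reusing addresses from previous
    requests, and skewing bus delays so reads can overtake slow write-backs."""

    def __init__(self, reuse_prob=0.4, read_delay=(1, 3), write_delay=(10, 40)):
        self.reuse_prob = reuse_prob    # chance of re-targeting a recent address
        self.read_delay = read_delay    # fast reads (min/max cycles)
        self.write_delay = write_delay  # slow write-backs (min/max cycles)
        self.history = []

    def next_address(self):
        # Reuse a previously issued address often enough to create hazards.
        if self.history and random.random() < self.reuse_prob:
            return random.choice(self.history)
        addr = random.randrange(0, 1 << 32, 64)   # new cache-line-aligned address
        self.history.append(addr)
        return addr

    def next_delays(self):
        # Atypical delays: reads return quickly, evicted lines drain slowly.
        return (random.randint(*self.read_delay),
                random.randint(*self.write_delay))
```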
A hidden case will need even more events. Perhaps a more subtle bug has the same conditions as our example, but it only happens when an ECC error is discovered in the cache, at the exact same time as an interrupt arrives, and only when the core finishes an FPU operation that results in a divide-by-zero error. With typical random testbenches, the probability of all these conditions occurring together is extremely low, making it a “hidden” bug.
Making these hidden bugs more reachable in the testbench improves the quality of verification: it turns hidden cases into corner cases.
This classification does not have any limit. Experience has shown me that a testbench capable of finding bugs with a score of 8 or 9 is a strong simulation testbench and is key to delivering quality RTL. From what I have seen, today the most advanced simulation testbenches can find bugs with a complexity level up to 10. Fortunately, the use of formal verification makes it much easier to find bugs that have an even higher complexity, paving the way to even better design, and giving clues about what to improve in simulation.
This classification and methodology are useful only if they are applied from the moment verification starts and throughout the project’s development, for 2 reasons:
Finally, by combining this approach with our methodology of hunting bugs that fly in squadrons, we ensure high-quality verification that gives us confidence we are going beyond verification sign-off criteria.
March 14, 2022
Creating a quality RISC-V processor requires a verification methodology that enforces the highest standards. In this article, Philippe Luc, Director of Verification at Codasip, explains the methodology that is adopted at Codasip to bring processor verification to the next level.
After analyzing bugs on several generations of CPUs, I came to the conclusion that “bugs fly in squadrons”. In other words, when a bug is found in a given area of the design, the probability that there are other bugs with similar conditions, in the same area of the design, is quite high.
Finding a CPU bug is always satisfying, however it should not be an end in itself. If we consider that bugs do not fly alone but rather fly in groups – or squadrons – finding one bug should be a hint for the processor verification team to search for more of them, in the same area.
Here is a scenario. A random test found a bug after thousands of hours of testing. We could ask ourselves: How did it find this bug? The answer is likely to be a combination of events that had not been encountered before. Another question could be: Why did the random test find this bug? It would most likely be due to an external modification: a change in parameter in the test, an RTL modification, or a simulator modification for example.
With this new, rare bug found, we know that we have a more capable testbench that can now test a new area of the design. However, we also learn that, before the testbench was improved, that area of the design was not being stressed. If we consider that bugs fly in squadrons, it means we have a new area of the design to explore further to find more bugs. How are we going to improve our verification methodology?
To improve our testbench and hit these bugs, we can add checkers and assertions, and we can add tests. Let’s focus on testing.
To enlarge the scope so that we are confident we will hit these bugs, we use smart-random testing. When reproducing this bug with a directed testing approach, only the exact same bug is hit. However, we said that bugs fly in groups and the probability that there are other bugs in the same area, with similar conditions, is high. The idea is then to enlarge our scope. Random testing will not be as useful in this case, because we have an idea of what we want to target, following the squadron pattern.
Let’s assume that the bug was found on a particular RISC-V instruction. Can we improve our testing by increasing the probability of having this instruction tested? At first glance, probably, because statistically you get more failures exposing the same bug. However, most bugs are found with a combination of rare events: a stalled pipeline, a full FIFO, or some other microarchitectural implementation detail. Standard testbenches can easily tune the probability of an instruction by simply changing a test parameter. But making a FIFO full is not directly accessible from a test parameter. It is a combination of other independent parameters (such as delays) that makes the FIFO full more often.
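A toy simulation illustrates the point: “FIFO full” is not a knob you can set directly; it emerges from a combination of independent parameters. The stall probabilities below stand in for the delay parameters of a test and are purely illustrative.

```python
import random

def fifo_full_rate(depth, producer_stall, consumer_stall, cycles=100_000):
    """Count how often a FIFO of the given depth is full. The stall probabilities
    stand in for independent test parameters such as delays; fullness is an
    emergent effect, not something set directly."""
    occupancy, full_cycles = 0, 0
    for _ in range(cycles):
        if occupancy < depth and random.random() > producer_stall:
            occupancy += 1                    # producer pushes
        if occupancy > 0 and random.random() > consumer_stall:
            occupancy -= 1                    # consumer pops
        if occupancy == depth:
            full_cycles += 1
    return full_cycles / cycles

print(fifo_full_rate(8, producer_stall=0.5, consumer_stall=0.2))  # fast consumer: rarely full
print(fifo_full_rate(8, producer_stall=0.2, consumer_stall=0.6))  # slow consumer: full far more often
```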
Using smart-random testing in our verification methodology allows us to be both targeted and broad enough to efficiently find more bugs in this newly discovered area. It consists in tuning the test to activate more often the other events that trigger the bug. In other words, it means adjusting several parameters of the test, and not just one. It may seem more time consuming, but this methodology is really efficient in terms of improving the quality of our testing.
Improving testbenches by following bug squadrons, and killing each of them during the product development is key. This is exactly what the Codasip verification teams do to offer best-in-class quality RISC-V processors to our customers.
March 7, 2022
Philippe Luc, Director of Verification at Codasip, shares his view on what bugs verification engineers should pay attention to.
Did you know that between 1,000 and 2,000 bugs can appear during the design of a complex processor core? Really, a thousand bugs? Well, that’s what experience has shown us. And not all bugs are born equal: their importance and consequences can vary significantly. Let’s go through 4 categories of CPU bugs, how to find them, and what the consequences would be for the user if we did not find them.
“Oh, I forgot the semicolon”. Yes, that is one bug. Very easy to detect, it is typically one you find directly at compile time. Apart from having your eyes wide-open, there is nothing else to do to avoid these.
“Oh, it turns out that a part of the specification has not been implemented”. That is another easy CPU bug for you to find with any decent testbench – provided that an explicit test exists. In this scenario, the first simple test exercising the feature will fail. What does your processor verification team need to do? Make sure you have exhaustive tests. The design team, on the other hand, needs to make an effort to carefully read the specifications, and follow any changes in the specification during the development.
In other words, the easy bug is one that is found simply by running a test that exercises the feature. Its (bad) behavior is systematic, not a timing condition. Being exhaustive in your verification is the key to finding such CPU bugs. Code coverage will help you but is definitely not enough. If a feature is not coded in the RTL, how can coverage report that it is missing? A code review – with the specification at hand – definitely helps.
A corner case CPU bug is more complex to find and requires a powerful testbench. The simple test cases that exercise the feature pass correctly, even with random delays. Quite often, you find these bugs when asynchronous events join the party. For example, an interrupt arriving just between 2 instructions, at a precise timing. Or a line in the cache being evicted just when the store buffer wants to merge into it. To reach these bugs, you need a testbench that juggles the instructions, the parameters, and the delays so that all the possible interleavings of instructions and events have been exercised. Obviously, a good checker should spot any deviation from what is expected.
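To illustrate the kind of sweep involved, here is a skeleton that fires an interrupt at every possible cycle of a short instruction sequence and compares the resulting state against a reference model. The model handles and their methods are placeholders for real simulator bindings, not an actual API.

```python
def run_sequence(instrs, interrupt_cycle, model):
    """Run a short instruction sequence, firing an interrupt at a chosen cycle.
    'model' is a placeholder for a DUT or reference-model handle."""
    for cycle, instr in enumerate(instrs):
        if cycle == interrupt_cycle:
            model.raise_interrupt()
        model.step(instr)
    return model.architectural_state()

def sweep_interrupt_timing(instrs, make_dut, make_ref):
    """Exercise every interleaving of one asynchronous event with the sequence,
    checking each outcome against the reference model."""
    for cycle in range(len(instrs) + 1):
        dut_state = run_sequence(instrs, cycle, make_dut())
        ref_state = run_sequence(instrs, cycle, make_ref())
        assert dut_state == ref_state, f"mismatch with interrupt at cycle {cycle}"
```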
Does code coverage help in that case? Unfortunately not. Simply because the condition of the bug is a combination of several events that are already covered individually. Here, condition coverage or branch coverage might be helpful. But it is painful to analyze and it is rarely beneficial in the end.
The hidden bugs are found by customers (which is bad), or by chance (internally, before release). In both cases, it means that the verification methodology was not able to find them.
If you use different testbenches or environments, you could find other cases just because the stimuli are different. Fair enough. Then, what do we mean by “found by chance”? Here comes the limit of random testbench methodology.
With random stimuli, the testbench usually generates the “same” thing. If you roll a die, there is very little chance of getting the number 6 ten times in a row – one chance in 60 million, to be precise. With a RISC-V CPU that has 100 different instructions, an (equiprobable) random instruction generator has only 1 chance in 10²⁰ of generating a given instruction 10 times in a row. That is roughly twice the number of distinct positions of a Rubik’s Cube… Yet on a 10-stage pipeline processor, it is not unreasonable to test it with the same instruction present on all pipeline stages. Good luck if you don’t tune your random constraints…
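For the curious, the arithmetic behind those figures (the Rubik’s Cube count is the well-known ~4.3 × 10¹⁹ figure):

```python
# The chance that ten dice rolls in a row all show 6 is 1 in 6**10.
dice_odds = 6 ** 10
print(f"{dice_odds:,}")             # 60,466,176  (~1 in 60 million)

# An equiprobable generator over 100 instructions emits a given instruction
# ten times in a row with probability 1 in 100**10.
instr_odds = 100 ** 10
print(f"{instr_odds:.1e}")          # 1.0e+20

rubiks_positions = 43_252_003_274_489_856_000   # number of Rubik's Cube positions
print(instr_odds / rubiks_positions)            # ~2.3
```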
You can take the search for corner cases and hidden cases too far and end up creating tests that are simply silly.
Changing the endianness back and forth every cycle while connecting the debugger is probably something that will never happen in a consumer product. If the consequences of a CPU bug are never visible to a customer, then it is not really a bug. If you deliberately unplug your USB stick while you copy a file, and the file is corrupted, I consider this not a bug. If some operation causes the USB controller to hang, then yes, that is a bug.
Beware of extending the scope of verification too far. When silly cases are being found, you are probably investing engineering effort in the wrong place.
There are different verification techniques you can apply to efficiently find CPU bugs before your customers do. At Codasip, we use multiple component testbenches, various random test generators, random irritators, and several other techniques to verify our products. As the project evolves, we develop these techniques to have a robust verification methodology. Learn more in our blog post where we explain how we continuously improve our verification methodology.
February 28, 2022
Finding a hardware bug in silicon has consequences. The severity of these consequences for the end user can depend on the use case. For the product manufacturer, fixing a bug once a design is in mass production can incur a significant cost. Investing in processor verification is therefore fundamental to ensure quality. This is something we care passionately about at Codasip – here is why you should too.
Luckily for the semiconductor industry, there are statistically more bugs in software than in hardware, and in processors in particular. However, software can easily be upgraded over the air, directly in the end-products used by consumers. With hardware, on the other hand, this is not as straightforward and a hardware issue can have severe consequences. The quality of our deliverables, which will end up in real silicon, seriously matters.
Processors are ubiquitous. They control the flash memory in your laptop, the braking system of your car or the chip on your credit card. These CPUs have different performance requirements but also different security and safety requirements. In other words, different quality requirements.
Is it a major issue if the Wi-Fi chip in your laptop is missing a few frames? The Wi-Fi protocol retransmits the packet and it goes largely unnoticed. If your laptop’s SSD controller drops a few packets and corrupts the document you have been working on all day, it will be a serious disruption to your work; there may be some shouting, but you will recover. It’s a bug that you might be able to accept.
Other hardware failures have much more severe consequences: What if your car’s braking system fails because of a hardware issue? Or the fly-by-wire communication in a plane fails? Or what if a satellite falls to earth because its orbit control fails? Some bugs and hardware failures are simply not acceptable.
Processor quality and therefore its reliability is the main concern of processor verification teams. And processor verification is a subtle art.
Processor verification requires strategy, diligence and completeness.
Verifying a processor means taking uncertainty into account. What software will run on the end product? What will be the use cases? What asynchronous events could occur? These unknowns mean significantly opening the verification scope. However, it is impossible to cover the entire processor state space, and it is not something to aim for.
Processor quality must be ensured while making the best use of time and resources. At the end of the day, the ROI must be positive. Nobody wants to find costly bugs after the product release, and nobody wants to delay a project because of an inefficient verification strategy. Doing smart processor verification means finding relevant bugs efficiently and as early as possible in the product development.
In other words, processor verification must open the scope wide enough to account for the unknowns, while finding the relevant bugs as efficiently and as early as possible in the product development.
Processor quality is fundamental. The art of verifying a processor is a subtle one that is evolving as the industry is changing and new requirements arise. At Codasip, we put in place verification methodologies that allow us to deliver high-quality RISC-V customizable processors. With Codasip Studio and associated tools, we provide our customers with the best technology that helps them follow up and verify their specific processor customization.