We had the pleasure of sponsoring and hosting a hackathon at the RISC-V Summit North America 2024 in Santa Clara, California. As part of our commitment to advancing CPU design, we invited participants to take on a challenging technical mission that tested their skills with Codasip Studio, our advanced design tool. This tool, along with CodAL, our unique C-like language, enabled participants to create sophisticated custom CPU logic that could handle one of the most unforgiving environments: deep space.
Laying the groundwork at RISC-V Summit Europe
In Munich, at the RISC-V Summit Europe, we introduced hackathon participants to Codasip Studio with an initial exploration challenge. There, we provided access to a base RISC-V core and tasked participants with extending it by adding custom hardware instructions. Their additions optimized a neural network application, specifically a multilayer perceptron, by customizing the hardware to accelerate specific software algorithms. Codasip Studio’s integrated workflow automatically detected and integrated these new instructions into the compiler, enabling efficient testing and rapid iteration. The results showed how powerful RISC-V’s customizable architecture can be, particularly in AI applications that benefit from custom hardware-accelerated operations. Learn more from the participants.
The challenge in North America: Designing for space-resilience
At the RISC-V Summit North America, we set an even more ambitious goal for the hackathon: to create custom error correction logic for CPU register files, a critical component of CPU design in space environments. The theme for the event—Codasip Cosmic Compute—envisioned a near-future mission to build space-ready CPUs that could withstand radiation-induced errors. The challenge was inspired by the needs of space-bound computing systems, where exposure to cosmic radiation poses significant risks to system reliability and mission success.
Understanding the risks: Soft errors and Single Event Upsets
In deep space, without the Earth’s protective atmosphere, high-energy particles such as cosmic rays can strike a CPU and cause soft errors: temporary disruptions that do not physically harm the hardware but may corrupt data stored in the CPU’s register file. Single Event Upsets (SEUs) are a common manifestation of these soft errors, in which a single particle flips a bit, leading to incorrect calculations and, potentially, system failure.
Given these stakes, designing systems that can detect and correct such errors is crucial for any mission’s success. We provided participants with an overview of techniques for radiation resilience, such as Error Correction Codes (ECC), redundant systems, and system scrubbing, which participants could implement directly within Codasip Studio.
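To make the redundancy idea concrete, here is a minimal Python sketch of triple modular redundancy with a bitwise majority vote. It is an illustration only, not the hackathon's CodAL code; the function names `tmr_encode` and `tmr_decode` are hypothetical, and it assumes 32-bit register words.

```python
# Triple modular redundancy (TMR) sketch: hypothetical helper names,
# assuming 32-bit register words.
WORD_MASK = 0xFFFFFFFF

def tmr_encode(word):
    """Store three identical copies of a 32-bit word."""
    word &= WORD_MASK
    return (word, word, word)

def tmr_decode(copies):
    """Bitwise majority vote: each output bit takes the value
    held by at least two of the three copies."""
    a, b, c = copies
    return (a & b) | (a & c) | (b & c)

stored = tmr_encode(0xDEADBEEF)
# Flip bit 7 in one copy to mimic a single event upset.
corrupted = (stored[0] ^ (1 << 7), stored[1], stored[2])
recovered = tmr_decode(corrupted)
```

The vote recovers the original word as long as no bit position is corrupted in more than one copy, at the cost of 64 extra bits per 32-bit register.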
Codasip Studio and CodAL: The tools for innovation
Codasip Studio’s high-level design capabilities, combined with the CodAL language, gave participants the tools they needed to implement and test error-correcting logic in the RISC-V base core. The integration of hardware and software workflows allowed participants to write new error-correcting hardware logic in CodAL and test it with ease. This integration was invaluable, given the complexity of the error-correction logic and the need to iterate rapidly on designs to ensure system reliability in a simulated space environment.
Technical details
The CPU designed for this hackathon runs a basic software algorithm: bubble sort. The purpose of this sort operation is twofold: it not only tests CPU performance under simulated cosmic radiation but also serves as a mechanism to check data integrity and CPU operation accuracy.
- Input Data: The program has two static vectors of elements—one sorted and the other unsorted. The unsorted array undergoes a bubble sort, and upon completion, it is compared to the pre-sorted array.
- Integrity Check: If the sorted result matches the reference array, the program executes a special instruction called watchdog_kick. This instruction resets an internal hardware counter back to zero. This check serves as an essential safeguard against execution errors, signaling that the CPU is operating correctly.
- Error Detection with Cycle Monitoring: The bubble sort takes approximately 2000 cycles to complete under normal conditions. If the watchdog_kick instruction does not execute within 3000 cycles, it indicates that the CPU may be compromised due to a failure in the execution flow. In such a case, the internal counter is not reset, causing the simulation to halt and signaling a failure in error correction.
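The sort, compare, and kick flow described above can be modeled in a few lines of Python. This is a behavioral sketch under stated assumptions, not the actual test program or hardware: the `Watchdog` class and `run_iteration` helper are hypothetical, and each pass is charged a flat 2000 cycles as an approximation of the figure quoted above.

```python
WATCHDOG_LIMIT = 3000  # cycles allowed between kicks, per the text
SORT_CYCLES = 2000     # approximate cost of one bubble sort, per the text

class Watchdog:
    """Hypothetical model of the internal hardware counter."""
    def __init__(self, limit=WATCHDOG_LIMIT):
        self.limit = limit
        self.counter = 0
        self.kicks = 0

    def tick(self, cycles=1):
        """Advance the counter; return False once the limit is reached."""
        self.counter += cycles
        return self.counter < self.limit

    def kick(self):
        """Model of watchdog_kick: reset the counter to zero."""
        self.counter = 0
        self.kicks += 1

def bubble_sort(values):
    data = list(values)
    for end in range(len(data) - 1, 0, -1):
        for i in range(end):
            if data[i] > data[i + 1]:
                data[i], data[i + 1] = data[i + 1], data[i]
    return data

def run_iteration(watchdog, unsorted, reference, cycles_used=SORT_CYCLES):
    """One pass: sort, compare with the reference, kick on a match.
    Returns False when the watchdog limit has been reached."""
    alive = watchdog.tick(cycles_used)
    if alive and bubble_sort(unsorted) == reference:
        watchdog.kick()
    return alive
```

In this model, a single failed comparison leaves the counter at 2000, so the next pass pushes it past 3000 and the run halts.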
Simulated Radiation and Soft Error Injection
The hardware simulation emulates the effects of radiation by randomly introducing errors into the CPU’s register file. Here’s how this challenging environment was set up:
- Error Injection in the Register File: Bit flips are introduced randomly within the hardware register file, simulating the effect of cosmic radiation. These bit flips can range from single-bit to multi-bit errors, adding to the complexity of correction.
- Increasing Error Frequency: After every 50 watchdog_kick resets, the probability of register file corruption increases. This escalation tests the robustness and scalability of participants’ error correction mechanisms, simulating the accumulation of radiation damage over time.
- System Failure: If the watchdog_kick does not reset the counter within 3000 cycles due to persistent errors in the register file, the simulation will halt. Without any ECC, the simulation encounters a catastrophic failure after 336 cycles, which represents the limit of radiation tolerance in the absence of effective correction mechanisms.
Error detection and correction: The hardware pipeline
To ensure the CPU can function correctly despite bit-flip errors, participants worked on two critical modules within the hardware pipeline: decode_and_fix and encode_and_insert.
- decode_and_fix Module: This combinational module is responsible for detecting and correcting errors within the register file. Each register signal is read by this module, where it is examined for errors, decoded if needed, and corrected before proceeding back to the register file. Because this module directly affects data integrity, all corrections must be accurate; any error that passes through risks corrupting the program’s execution flow. This module also routes the corrected signals to the appropriate locations in the CPU pipeline.
- encode_and_insert Module: The role of this module is to manage write-back operations. When the WB (write-back) stage writes new data to the register file, encode_and_insert ensures that the data is encoded or otherwise fortified before it’s saved. This prevents further corruption of data as it circulates within the CPU.
The goal was to keep the CPU running for as long as possible with the smallest number of added register bits!
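One classic way to trade a small number of added bits for correction capability is a single-error-correcting Hamming code, which needs 6 parity bits per 32-bit word. The Python sketch below borrows the names encode_and_insert and decode_and_fix purely for illustration; it is not the CodAL implementation of those modules, and nothing here implies that participants or the internal test run used this particular scheme.

```python
DATA_BITS = 32
PARITY_POSITIONS = (1, 2, 4, 8, 16, 32)        # 1-based, powers of two
CODE_BITS = DATA_BITS + len(PARITY_POSITIONS)  # 38 bits per stored word
DATA_POSITIONS = [p for p in range(1, CODE_BITS + 1)
                  if p not in PARITY_POSITIONS]

def syndrome(code):
    """XOR of the 1-based positions of all set bits.
    It is zero for a valid codeword."""
    s = 0
    for pos in range(1, CODE_BITS + 1):
        if (code >> (pos - 1)) & 1:
            s ^= pos
    return s

def encode_and_insert(word):
    """Spread the data bits over the non-parity positions, then set the
    parity bits so that the syndrome of the codeword becomes zero."""
    code = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (word >> i) & 1:
            code |= 1 << (pos - 1)
    # Setting the parity bit at position 2**k toggles bit k of the syndrome.
    s = syndrome(code)
    for pos in PARITY_POSITIONS:
        if s & pos:
            code |= 1 << (pos - 1)
    return code

def decode_and_fix(code):
    """Correct at most one flipped bit, then extract the 32 data bits."""
    s = syndrome(code)
    if 1 <= s <= CODE_BITS:
        code ^= 1 << (s - 1)  # the syndrome names the flipped position
    word = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (code >> (pos - 1)) & 1:
            word |= 1 << i
    return word

stored = encode_and_insert(0xCAFEBABE)
recovered = decode_and_fix(stored ^ (1 << 20))  # one simulated bit flip
```

This code corrects any single flipped bit per stored word, but two or more flips in the same word can be miscorrected. Detecting double errors would take one more overall parity bit.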
As an internal test run, we implemented a few simple error-correcting schemes and plotted them on two charts: bits vs. resets and SER (soft error rate) vs. resets. The second plot clearly shows how the severity of radiation increases exponentially (the plot uses a logarithmic scale) as we travel through interstellar space. For example, an SER of 0.1 means that, on average, 0.1 bits per 32-bit word are corrupted per cycle.
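To get a feel for what such a rate means, the short Python calculation below converts an SER value into per-bit and per-sort figures. It assumes that flips are independent and spread uniformly over the 32 bits of a word, which the text does not state, and it reuses the approximate 2000-cycle sort duration mentioned earlier.

```python
WORD_BITS = 32
SORT_CYCLES = 2000  # approximate bubble sort duration from the text

def per_bit_probability(ser):
    """Chance that one particular bit flips in one cycle (uniform assumption)."""
    return ser / WORD_BITS

def expected_flips_per_word(ser, cycles):
    """Average number of flips that hit one word over a number of cycles."""
    return ser * cycles

def prob_word_clean_one_cycle(ser):
    """Chance that a word sees no flip at all in a single cycle
    (independence assumption)."""
    return (1.0 - per_bit_probability(ser)) ** WORD_BITS

ser = 0.1
p_bit = per_bit_probability(ser)
flips_per_sort = expected_flips_per_word(ser, SORT_CYCLES)
p_clean = prob_word_clean_one_cycle(ser)
```

Under these assumptions, an SER of 0.1 corresponds to roughly 200 expected flips per word over a single sort, which suggests why uncorrected registers cannot survive the later stages of the simulation.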
The job of the participants was to beat or match these algorithms.
Conclusion
For participants, the hackathon provided invaluable experience in designing error-tolerant systems and a deeper understanding of RISC-V’s versatility in supporting highly customized logic for specific applications.
The RISC-V Summit North America hackathon was an inspiring showcase of innovation and technical skill, pushing participants to apply Codasip Studio and CodAL to solve complex, real-world challenges in CPU design.
Codasip is committed to supporting the RISC-V community in developing versatile, robust, and innovative CPU architectures. With tools like Codasip Studio, we’re enabling engineers to push the boundaries of what’s possible in computing. For academia and students, we also have a thriving University program where interested parties can get access to our solutions to use in research and teaching.
We would like to extend our thanks to the RISC-V International team responsible for organizing the summit and to Megan Lehn and Camille Calichon for driving the hackathon events.