May 21, 2021
May 21, 2021
For about fifty years, IC designers have been relying on different types of semiconductor scaling to achieve gains in performance. Best known is Moore’s Law which predicted that the number of transistors in a given silicon area and clock frequency would double every two years. This was combined with Dennard scaling which predicted that with silicon geometries and supply voltages shrinking, the power density would remain the same from generation to generation, meaning that power would remain proportional to silicon area. Combining these effects, the industry became used to processor performance per watt doubling approximately every 18 months. With successively smaller geometries, designers could use similar processor architectures but rely on more transistors and higher clock frequencies to deliver improved performance.
48 Years of Microprocessor Trend Data. Source K. Rupp.
Since about 2005, we have seen the breakdown of these predictions. Firstly, Dennard scaling ended with leakage current rather than transistor switching being the dominant component of chip power consumption. Increased power consumption means that a chip is at the risk of thermal runaway. This has also led to maximum clock frequencies levelling out over the last decade.
Secondly, the improvements in transistor density have fallen short of Moore’s Law. It has been estimated that by 2019, actual improvements were 15× lower than predicted by Moore in 1975. Additionally, Moore predicted that improvements in transistor density would be accompanied by the same cost. This part of his prediction has been contradicted by the exponential increases in building wafer fabs for newer geometries. It has been estimated that only Intel, Samsung, and TSMC can afford to manufacture in the next generation of process nodes.
With the old certainties of scaling silicon geometries gone forever, the industry is already changing. As shown in the chart above, the number of cores has been increasing and complex SoCs, such as mobile phone processors, will combine application processors, GPUs, DSPs, and microcontrollers in different subsystems.
However, in a post-Dennard, post-Moore world, further processor specialization will be needed to achieve performance improvements. Emerging applications such as artificial intelligence are demanding heavy computational performance that cannot be met by conventional architectures. The good news is that for a fixed task or limited range of tasks, energy scaling works better than for a wide range of tasks. This inevitably leads to creating special purpose, domain-specific accelerators.
This is a great opportunity for the industry.
A domain-specific accelerator (DSA) is a processor or set of processors that are optimized to perform a narrow range of computations. They are tailored to meet the needs of the algorithms required for their domain. For example, for audio processing, a processor might have a set of instructions to optimally implement algorithms for echo-cancelling. In another example, an AI accelerator might have an array of elements including multiply-accumulate functionality in order to efficiently undertake matrix operations.
Accelerators should also match their wordlength to the needs of their domain. The optimal wordlength might not match common ones (like 32-bits or 64-bits) encountered with general-purpose cores. Commonly used formats, such as IEEE 754 which is widely used, may be overkill in a domain-specific accelerator.
Also, accelerators can vary considerably in their specialization. While some domain-specific cores may be similar to or derived from an existing embedded core, others might have limited programmability and seem closer to hardwired logic. More specialized cores will be more efficient in terms of silicon area and power consumption.
With many and varied DSAs, the challenge will be how to define them efficiently and cost-effectively.
February 11, 2021
Using the open RISC-V ISA is a great starting point for creating a domain-specific processor that combines application-specific capabilities and access to portable software. But how do you create an optimized ISA, profiling software and experimenting with adding/removing instructions, in a smart and easy way? In other words, how do you customize an existing RISC-V processor efficiently?
The industry will generally offer you two approaches. Either you do it manually or you automate the process as much as possible.
The old-fashioned way to modify the instruction set would be to:
This requires an extensive amount of manual work with associated technical risks. The resulting SDK will almost certainly need to make any custom instructions available as intrinsics or as inline assembler code. The alternative of modifying and verifying the compiler is costly in effort, but the end result is much better for the software developers. Similarly, if a processor is extended, traditionally it would be necessary to modify the microarchitecture by editing the RTL and then verify it against the ISS as the golden reference.
In contrast, describing the ISA in a processor description language like CodAL significantly improves the efficiency of this process using design automation tools. Codasip Studio can automatically generate both the ISS and a new compiler for the modified ISA, making the processor customization process more straightforward. But the CodAL processor description language is not just limited to creating instruction-accurate descriptions – it can also be used to describe microarchitecture (cycle-accurate). The consistency of the two descriptions can be checked within the Studio environment using static analysis.
A far easier approach is to not just to start with the RISC-V ISA, but with a complete RISC-V processor core design described in CodAL.
Codasip’s RISC-V processors are all designed in Codasip Studio using the CodAL language. The range of cores spans simple 32-bit embedded cores to 64-bit Linux-capable application processors with multi-core capabilities. Thus, you can choose a processor that meets known baseline requirements such as pipeline depth and/or OS support, and then focus on creating custom extensions to improve performance. Starting with an already-proven processor design means that the microarchitecture for any new instructions is incremental, saving time and significantly reducing risk.
Codasip Studio can be used to generate the HDK including the RTL, a testbench, EDA scripts, and a UVM environment. The UVM environment enables the all-important verification of the new processor RTL against its golden ISS reference. The generated UVM environment includes default cover points and assertions for key areas of functionality such as register files, bus protocols, memories, and caches. The third-party RTL simulator can measure the functional and RTL code coverage achieved.
After extending the microarchitecture, Codasip Studio profiler provides coverage analysis tools to assess code coverage of the CodAL, including line, condition, and expression coverage. In order to achieve acceptable code coverage, Codasip Studio provides random assembler generators to exercise the code very thoroughly. In difficult corner cases, it may be necessary to write directed tests.
All-in-all a Codasip RISC-V processor licensed as CodAL source code can be efficiently modified in the Codasip Studio environment and thoroughly verified. This is a cost-effective and super-efficient approach to creating domain-specific processors.
Learn more in our white paper “Creating Domain-Specific Processors with RISC-V custom ISA instructions”.
January 29, 2021
If you are going to create a domain-specific processor, one of the key activities is to choose an instruction set architecture (ISA) that matches your software needs. So where do you start?
Some companies have created their instruction sets from scratch, but if you have such an ISA, a penalty may be the costs of porting software. Today, the RISC-V open ISA can provide you with an excellent starting point and a software ecosystem. Depending on what you need, there are several obvious starting points. In case of a 32-bit processor, if you start with RISC‑V, the base ISA (RV32I) is just 47 instructions. Using this base set is easier than creating proprietary instructions with similar functionality, as well as meaning that software is already available from the RISC-V ecosystem.
|Starting point||Number of instructions|
Many use cases require multiplication suggesting that [M] extensions would be useful, and it is sensible to take advantage of the 16-bit compressed [C] instructions for code density, so it is commonplace to use the RV32IMC set which amount to 101 instructions. Using RISC-V as a starting point will ensure that it is straightforward to use common software such as an RTOS or protocol stack. If you additionally require floating point computation, then the RV32GC (G=IMAFD) instructions may be appropriate, additionally including atomic [A], single-precision floating point [F], and double-precision floating point [D] extensions. Even RV32GC only has 164 instructions.
The RISC-V ISA is designed in a modular way that allows processor designers to add not only any of the standard extensions, but also to create their own custom instructions while keeping full RISC-V-compliance. The standard extensions are a convenient option thanks to being readily available; however, some may substantially increase the instruction set complexity. For example, the complete set of packed SIMD extensions [P] adds 331 additional instructions. In many cases, sufficient gains for a particular application can be made with custom instructions with potentially a lower overhead in silicon area and power.
Having chosen the starting point for your domain-specific processor, it is then necessary to work out what special instructions are needed to meet your computational requirements. This requires a careful analysis of the software that you need to run on your processor core. A profiling tool allows computational hotspots to be identified. Once such hotspots are known, a designer can create custom instructions to address them. Therefore, a designer can iterate by experimenting with adding or deleting instructions, then profiling the software again and assessing whether the changes have achieved their objectives.
However, while this iterative process is logical, how would you actually do it in practice? You might be able to access an open-source instruction set simulator and toolchains such as GNU or LLVM, but modifying these by hand is something for toolchain specialists and is time-consuming.
The alternative is to describe the instruction set using a processor description language. In Codasip Studio, an instruction accurate (IA) model of a processor can be created using the CodAL processor description language. An SDK including compiler, instruction set simulator (ISS), debugger, and profiler can be automatically generated from the IA description.
By describing the ISA at a high level and automatically generating the SDK, it is possible to rapidly iterate experiments in extending the instruction set. In this way, it is possible to choose a well-optimized ISA for a domain-specific processor, sometimes known as an Application Specific Instruction Processor (ASIP). Generating the SDK automatically is not only faster, but less prone to errors than manual changes, meaning that the design process is cheaper and more predictable, avoiding unnecessary risk and roadmap disruptions.
November 12, 2020
The area of any part of a processor design contributes both to the silicon cost and to the power consumption. A simplistic following of the “A” in a processor IP vendor’s PPA numbers can be misleading. A processor is never used in isolation but is part of a subsystem additionally including instruction memory, data memory, and peripherals. In most cases, instruction memory will be dominant and the processor area is much less important.
The size of the instruction memory will be influenced by the target instruction set, the compiler and the compiler switches used. In the case of RISC-V, the choice of optional standard extensions and custom extensions can greatly influence the codesize.
To illustrate this, the following table shows the effect of adding extensions to both core and codesize.
|ISA||Core size (kgates)||Increase over base||Codesize (kbytes)||Decrease over base|
|RV32I (base)||16.0||× 1.0||232||× 1.0|
|RV32IM||26.2||× 1.6||148||× 1.6|
|RV32IM+DSP||38.7||× 2.4||64||× 3.6|
Source: Implementing RISC-V for IoT applications, Dan Ganousis & Vijay Subramaniam, DAC 2017
In this example, Microsemi used a Codasip RISC-V L31 processor to implement an audio processing application. Starting with just the 32-bit base instruction set, they had an unacceptably high codesize and cycle count. Some improvement was achieved by adding multiplication [M] extensions, but the breakthrough was using custom DSP instructions. These led to a 3.6× reduction in codesize at the price of a 2.4× increase in core size compared with the base core. With instruction memory dominating the area, this was a good trade-off; furthermore, the performance goals were readily achieved.
With typical vendor PPA data, synthetic benchmarks such as CoreMark/MHz are often quoted with a complex set of compiler switches – this is something we discussed in our article dedicated to processor performance. But in practice, embedded software is probably going to be compiled using common switches such as ‑Os or ‑O3.
Consider compiling the CoreMark benchmark with different switches using the common GCC compiler. In this case, the target was a Codasip RV32IMC RISC-V core with a 3-stage pipeline.
The chart below shows CoreMark/MHz and codesize measures for different compiler settings. The last example is one that is typical of vendor performance data where many switches are used for CoreMark (CM = “-O3 -flto -fno-common -funroll-loops -finline-functions -falign-functions=16 -falign-jumps=8 -falign-loops=8 -finline-limit=1000 -fno-if-conversion2 -fselective-scheduling -fno-tree-dominator-opts -fno-reg-struct-return -fno-rename-registers –param case-values-threshold=8 -fno-crossjumping -freorder-blocks-and-partition -fno-tree-loop-if-convert -fno-tree-sink -fgcse-sm -fgcse-las -fno-strict-overflow”).
In this example, the CoreMark/MHz score grows as the switches change from left to right. However, it is interesting to note that the most complex set of switches increases the codesize by 40 % over ‘‑O3’ while the performance only improves by 14 %.
Not every example will behave in this way, but compiler switches influence both performance and codesize. It is important to be realistic about what compiler switches you would like to use, and to ensure that the switches for any performance benchmark data matches those you would use for assessing codesize.
PPA numbers will inevitably be used to compare processors. However, these indicators need context to be representative of your actual needs, and not be misleading. Some key considerations, including OS support and ISA choices, will also influence your processor choice. To find out more, read our white paper on “What you should consider when choosing a processor IP core”.
October 15, 2020
For each embedded product, software developers need to consider whether they need an operating system; and if so, what type of embedded OS. Operating systems vary considerably, from real-time operating systems with a very small memory footprint to general-purpose OSes such as Linux with a rich set of features.
Choosing a proper type of operating system for your product – and consequently working out the required features of the embedded processor – depends significantly on whether you face a hard real-time requirement. Safety-critical and industrial systems such as an anti-lock braking system or motor control will have hard maximum response times. At the other end of the spectrum, consumer systems such as audio or gaming devices may be able to tolerate buffering, as long as the average performance is adequate. Such systems are said to have soft real-time requirements.
A hard real-time requirement can be achieved by writing so-called bare-metal software that directly controls the underlying hardware. Bare-metal programming is typically used when the processor resources are very limited, the software is simple enough, and/or the real-time requirements are so tight that introduction of a further abstraction layer would complicate meeting these hard real-time requirements. The disadvantage to this approach is that such bare-metal software needs to be written as a single task (plus interrupt routines), making it difficult for programmers to maintain the software as its complexity grows.
When dealing with more complex embedded software, it is often advantageous to employ a Real-Time Operating System (RTOS). It allows the programmer to split the embedded software into multiple threads whose execution is managed by the small, low-overhead “kernel” of the RTOS. The use of the multi-threaded paradigm enables developers to create and maintain more complex software while still allowing for sufficient reactivity.
RTOSes typically operate with a concept of “priority” assigned to individual threads. The RTOS can then “pre-empt” (temporarily halt) lower-priority threads in favour of those with higher priority, so that the required real-time constraints can be met. The use of an RTOS often becomes necessary when adopting complex libraries or protocol stacks (such as TCP/IP or Bluetooth) as this third-party software normally consists of multiple threads already.
The embedded processor requirements of a simple RTOS, such as FreeRTOS or Zephyr, are truly modest. It is sufficient to have a RISC-V processor with just machine mode (M) and a timer peripheral. However, rigorous software development is needed as machine mode offers unconstrained access to all memory and peripherals with associated risks. Extra protection is possible through a specialized RTOS such as those developed for functional safety, like SAFERTOS, or for security.
If a processor core supports both machine (M) and user (U) privilege modes and has physical memory protection (PMP), it is possible to establish separation between trusted code (with unconstrained access) and other application code. With PMP, the trusted code sets up rules for each portion of the application code, saying which parts of memory (or peripherals) it is allowed to access. PMP can for instance be used to prevent third-party code from interfering with the data of the rest of the application, or to detect stack overflows. Employing PMP therefore increases the safety and security of a system, but at the cost of additional hardware required for its support.
We also discuss embedded OS support in this video!
For applications requiring a more advanced user interface, sophisticated I/O and networking, such as in set-top boxes or entertainment systems, an RTOS is likely to be too simplistic. The same applies if there are complex computations, requirements for a full process isolation and multitasking, filesystem & storage support, or a full separation of application code from hardware via device drivers. Systems like these generally have soft real-time requirements and can be best served by a general-purpose rich operating system such as Linux. As mentioned in our blog post dedicated to processor complexity, Linux requires multiple RISC-V privilege modes – machine, supervisor, and user modes (M, S, U) – as well as a memory management unit (MMU) for virtual-to-physical address translation. Also, the memory footprint of such a system is significantly larger compared to a simple RTOS.
Finally, for embedded systems that require both hard real-time responses and features of a rich OS like Linux, it is common to design them with two communicating processor subsystems, one supporting an RTOS and the other running Linux.
Choosing the appropriate embedded OS for your product and identifying the features required for your embedded processor depends heavily on the type of real-time requirements you face. Together with processor performance and complexity, among other key considerations, the OS support you need should be taken into account when choosing a processor. To find out more, read our white paper on “What you should consider when choosing a processor IP core”.
September 10, 2020
The more complex a processor core, the larger the area and power consumption. But increasing complexity is not a single dimension as processors can be more complex in different ways. In selecting a processor IP core, it is important to choose the right sort of processor complexity for your project.
There are different ways of thinking about processor complexity. Word length, execution units, privilege modes, virtual memory and security features are important considerations that will make your processor core more complex. It is important to understand what you really need for your project.
Generally, the smaller the word length, the smaller the core and the lower the power, however this is not always the case. An 8-bit core, such as the 8051, is comparable in gate count to the smallest 32-bit cores, but power consumption is usually worse. An 8-bit core requires more memory accesses due to less computation per clock cycle requiring more cycles. The net impact is that it requires more power to complete a computation.
Processor cores vary considerably in the complexity of their execution units. The simplest are basic single ALUs requiring many common operations to be implemented by the simple instructions – for example using shift and add to implement a multiplication. It is therefore commonplace for cores to have a hardware multiplier and divider. In the event of needing good floating-point performance, adding a hardware Floating Point Unit (FPU) will provide significantly better performance. This option is available for Codasip’s Low-Power (L) and High-Performance (H) Embedded RISC-V processor cores but at the price of roughly doubling the core size.
So far, we have assumed a single computational thread and scalar processing units which execute one instruction at a time. Superscalar architectures have instruction-level parallelism able to fetch multiple instructions and dispatch them to different execution units. A dual-issue core processing one thread can theoretically have up to double the performance of a single-issue core. However, a thread can stall making both execution units temporarily inactive. If there are two hardware threads (harts), then if one thread stalls, the other can continue execution.
Processors can vary considerably in pipeline depth and there is a direct relationship between this depth and latency. Some applications can tolerate high latency, with the consequence being slower response to interrupts, in return for high clock frequencies and throughput. Other applications require rapid responses to interrupts so need shorter pipelines.
We also discuss processor complexity in this video!
Another area of complexity is privilege modes. The more modes, the more complex the core logic. Many embedded applications run in machine mode, which means that the code has full access to the core – like root privilege in Linux. Such code must be completely trusted to avoid negative consequences. In more sophisticated applications, a range of privileges such as machine, supervisor and user may be offered. Normal applications will run in user mode with the greatest amount of protection and some software requiring greater privilege will use supervisor mode.
Virtual memory also requires additional processor resources such as a memory management unit (MMU) and translation lookaside buffer (TLB) to handle translating virtual memory addresses to physical addresses. This brings additional costs in terms of area and power dissipation without improving processor throughput. Nevertheless, virtual memory is necessary for using rich operating systems such as Linux which enable more complex software to be used.
So, when choosing a processor core, work out what sort of execution units, memory management, privilege and security you need. That combination will determine the complexity of the core.
So, when choosing a processor core, work out what sort of execution units, memory management, privilege and security you need. That combination will determine the complexity of the core. But that’s not all. If PPA numbers are typically considered when looking at the wide choice of processor IP cores on the market, that’s not enough. Processor complexity is one element, but processor performance, software requirements and the ISA, among others, are key considerations to investigate. We cover these in our white paper “What you should consider when choosing a processor IP core”.
August 20, 2020
Looking at any processor IP, you will find that their vendors emphasize PPA (performance, power & area) numbers. In theory, they should provide a level playing field for comparing different processor IP cores, but in reality, the situation is more complex. Let us consider processor performance.
The first thing to think about is what aspect of performance you care about. Do you care more about the absolute throughput that you want (performance per second), or the performance per MHz? In an application such as machine vision, which is continuously running and requiring the use of complex algorithms, it is likely that you will care about the absolute throughput. However, if you have a wireless sensor node with a low duty cycle, when the node wakes up, you will want it to be active for as few clock cycles as possible. This means you will care about how much computation you achieve per MHz.
About 40 years ago, computers were compared on the basis of MIPS (millions of instructions per second) although the problem is – what is an instruction? Instructions vary considerably in complexity and from one architecture to another, thus an operation will generally require less cycles in a CISC processor than a RISC one. MIPS were only helpful when comparing products with similar architectures and were called “meaningless indices of performance” by some!
Another thing to think about is the type of computation that you expect to care most about. Is it integer operations – and if so, which ones – or, say, floating-point computations? In the past, MFLOPS (million floating point operations per second) was a popular measure. But again, what is an ‘operation’?
Today, synthetic benchmarks are universally used with processor IP cores. They have the following characteristics:
A benchmark that has been popular for the last 36 years is the Dhrystone benchmark. Its name is a play on words comparing it with the once-popular Whetstone benchmark. While Whetstone focused on floating point operations, Dhrystone focused on integer and string operations. The Dhrystone benchmark results are generally quoted as DMIPS (the Dhrystone score divided by that of a nominally 1 MIPS machine). The benchmark has been criticized because modern compilers can optimize away parts of the work, meaning that it partly tests compiler rather than processor performance.
For floating point, Whetstone is rarely used at present and it is more likely that LINPACK would be used. LINPACK involves LU decomposition of a matrix using floating point numbers. The result is expressed in MFLOPS.
Another popular synthetic benchmark for embedded applications has been EEMBC’s CoreMark® which aims to undertake operations that are representative of embedded integer processing needs. These include list processing, matrix operations, finite state machines, and CRC.
Find more details and some tips to measure processor performance according to your needs in this video!
There are various benchmark systems out there, each suited for measuring a slightly different type of performance. So how do you assess performance when choosing processor IP for your project?
If your embedded software has similar operations to a synthetic benchmark, then that benchmark may give you useful initial guidance quickly and simply. However, normally such benchmarks are quoted per MHz, for example CoreMark/MHz. The per MHz figure is normally a good indication for a low-power application where you are looking for good results per cycle. However, if you are looking for high absolute performance, this may be misleading. Instead you should consider, say, the CoreMarks achievable at your target clock frequency.
If your main issue is floating-point performance, bear in mind that DMIPS and CoreMark are integer benchmarks. You would be better comparing cores on the basis of a floating-point benchmark such as LINPACK.
Ultimately, it always makes sense to invest the time in running realistic software on a processor core to assess whether the core gives you the performance you need. If you are looking at RISC-V, then profiling your software to understand where the computational bottlenecks are can also lead to assessing whether adding custom instructions can give you improvements in performance.
In this article we have looked at processor performance, but that is only one aspect of PPA and one factor to consider when choosing a processor. PPA numbers are always about balance and all of them matter when choosing an IP for a project, among other key considerations. The ISA, processor complexity, processor memory or even the licensing model will impact your choice. Find out more in our white paper “What you should consider when choosing a processor IP core“.