
Domain-Specific Accelerators


May 21, 2021

Semiconductor scaling has fundamentally changed

For about fifty years, IC designers relied on different forms of semiconductor scaling to achieve gains in performance. Best known is Moore’s Law, which predicted that the number of transistors in a given silicon area would double roughly every two years. This was combined with Dennard scaling, which predicted that as silicon geometries and supply voltages shrank, power density would stay constant from generation to generation, meaning that power would remain proportional to silicon area. Combining these effects, the industry became used to processor performance per watt doubling approximately every 18 months. With each successively smaller geometry, designers could use similar processor architectures but rely on more transistors and higher clock frequencies to deliver improved performance.

48 Years of Microprocessor Trend Data (source: K. Rupp).

Since about 2005, these predictions have broken down. Firstly, Dennard scaling ended as leakage current, rather than transistor switching, became the dominant component of chip power consumption. Increased power density puts a chip at risk of thermal runaway, which has also led to maximum clock frequencies levelling out over the last decade.

Secondly, the improvements in transistor density have fallen short of Moore’s Law. It has been estimated that by 2019, actual improvements were 15× lower than predicted by Moore in 1975. Additionally, Moore predicted that improvements in transistor density would come at constant cost. This part of his prediction has been contradicted by the exponential increase in the cost of building wafer fabs for newer geometries. It has been estimated that only Intel, Samsung, and TSMC can afford to manufacture in the next generation of process nodes.

Change in design is inevitable

With the old certainties of scaling silicon geometries gone forever, the industry is already changing. As shown in the chart above, the number of cores has been increasing, and complex SoCs, such as mobile phone processors, combine application processors, GPUs, DSPs, and microcontrollers in different subsystems.

However, in a post-Dennard, post-Moore world, further processor specialization will be needed to achieve performance improvements. Emerging applications such as artificial intelligence demand heavy computational performance that cannot be met by conventional architectures. The good news is that energy efficiency scales far better for a fixed task, or a limited range of tasks, than for a wide range of tasks. This inevitably leads to creating special-purpose, domain-specific accelerators.
This is a great opportunity for the industry.

What is a domain-specific accelerator?

A domain-specific accelerator (DSA) is a processor, or set of processors, optimized to perform a narrow range of computations. It is tailored to the needs of the algorithms required by its domain. For example, an audio processor might have a set of instructions to optimally implement echo-cancellation algorithms. As another example, an AI accelerator might have an array of multiply-accumulate (MAC) elements in order to efficiently undertake matrix operations.

Accelerators should also match their wordlength to the needs of their domain. The optimal wordlength might not match the common ones (such as 32 or 64 bits) found in general-purpose cores, and standard formats such as IEEE 754 floating point may be overkill in a domain-specific accelerator.
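
To make the multiply-accumulate point concrete, below is a minimal C sketch of the kind of kernel such an accelerator targets. The 8-bit operands and 32-bit accumulator are illustrative assumptions rather than the parameters of any particular product; the point is that a general-purpose core executes this loop one multiply and add at a time, whereas a DSA with an array of MAC elements, and word lengths matched to the data, can perform many of these operations per cycle at far lower energy per operation.

    #include <stdint.h>

    /* A minimal sketch of the inner loop a matrix-oriented AI accelerator
     * targets. The 8-bit operands and 32-bit accumulator are illustrative
     * assumptions, not the parameters of any particular DSA. */
    void matmul_int8(const int8_t *a, const int8_t *b, int32_t *c,
                     int m, int n, int k)
    {
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                int32_t acc = 0;  /* narrow accumulator: no IEEE 754 needed here */
                for (int p = 0; p < k; p++) {
                    /* the multiply-accumulate that a MAC array performs in hardware */
                    acc += (int32_t)a[i * k + p] * (int32_t)b[p * n + j];
                }
                c[i * n + j] = acc;
            }
        }
    }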

Accelerators can also vary considerably in their degree of specialization. While some domain-specific cores may be similar to, or derived from, an existing embedded core, others might have limited programmability and be closer to hardwired logic. More specialized cores will be more efficient in terms of silicon area and power consumption.
With many and varied DSAs, the challenge will be how to define them efficiently and cost-effectively.

Roddy Urquhart


Understanding the Performance of Processor IP Cores


August 20, 2020

Looking at any processor IP, you will find that its vendor emphasizes PPA (performance, power & area) numbers. In theory, these should provide a level playing field for comparing different processor IP cores, but in reality the situation is more complex. Let us consider processor performance.

What does processor performance mean?

The first thing to think about is which aspect of performance you care about. Do you care more about absolute throughput (performance per second) or about performance per MHz? In an application such as machine vision, which runs continuously and uses complex algorithms, you are likely to care about absolute throughput. However, if you have a wireless sensor node with a low duty cycle, you will want the node to be active for as few clock cycles as possible each time it wakes up. This means you will care about how much computation you achieve per MHz.

About 40 years ago, computers were compared on the basis of MIPS (millions of instructions per second), although the problem is: what is an instruction? Instructions vary considerably in complexity from one architecture to another, so an operation will generally require fewer instructions on a CISC processor than on a RISC one. MIPS figures were only helpful when comparing products with similar architectures and were called “meaningless indices of performance” by some!

Another thing to think about is the type of computation you expect to care most about. Is it integer operations – and if so, which ones – or, say, floating-point computations? In the past, MFLOPS (millions of floating-point operations per second) was a popular measure. But again, what is an ‘operation’?

Popular synthetic benchmarks

Today, synthetic benchmarks are universally used with processor IP cores. They have the following characteristics:

  • They are relatively small and portable.
  • They are representative of commonly used relevant applications.
  • They are reproducible and transparent.
  • They can be applied to a range of processors fairly.
  • They express the benchmark result as a single number.

Dhrystone

A benchmark that has been popular for the last 36 years is the Dhrystone benchmark. Its name is a play on words comparing it with the once-popular Whetstone benchmark. While Whetstone focused on floating point operations, Dhrystone focused on integer and string operations. The Dhrystone benchmark results are generally quoted as DMIPS (the Dhrystone score divided by that of a nominally 1 MIPS machine). The benchmark has been criticized because modern compilers can optimize away parts of the work, meaning that it partly tests compiler rather than processor performance.
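
To illustrate the normalization: the nominal 1 MIPS reference machine is the VAX 11/780, which scored 1757 Dhrystones per second, so a measured score is divided by 1757 to obtain DMIPS. A minimal C sketch of the arithmetic follows; the measured score and clock frequency are made-up figures for illustration only.

    #include <stdio.h>

    #define VAX_11_780_DHRYSTONES_PER_SEC 1757.0  /* nominal 1 MIPS reference machine */

    int main(void)
    {
        /* Illustrative figures only -- substitute your own measured results. */
        double dhrystones_per_sec = 220000.0;  /* measured Dhrystone score */
        double clock_mhz = 100.0;              /* clock frequency used for the run */

        double dmips = dhrystones_per_sec / VAX_11_780_DHRYSTONES_PER_SEC;
        printf("DMIPS     = %.1f\n", dmips);              /* ~125.2 */
        printf("DMIPS/MHz = %.2f\n", dmips / clock_mhz);  /* ~1.25  */
        return 0;
    }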

For floating point, Whetstone is rarely used at present and it is more likely that LINPACK would be used. LINPACK involves LU decomposition of a matrix using floating point numbers. The result is expressed in MFLOPS.

CoreMark

Another popular synthetic benchmark for embedded applications has been EEMBC’s CoreMark®, which aims to undertake operations that are representative of embedded integer processing needs. These include list processing, matrix operations, finite state machines, and CRC.


Assessing performance when choosing a processor

There are various benchmark systems out there, each suited for measuring a slightly different type of performance. So how do you assess performance when choosing processor IP for your project?

If your embedded software performs operations similar to a synthetic benchmark, then that benchmark may give you useful initial guidance quickly and simply. Such benchmarks are normally quoted per MHz, for example CoreMark/MHz. The per-MHz figure is a good indication for a low-power application where you are looking for good results per cycle. However, if you are looking for high absolute performance, it may be misleading; instead you should consider, say, the CoreMark score achievable at your target clock frequency.
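
As a made-up illustration of why the distinction matters, consider two hypothetical cores: the one quoted at a higher CoreMark/MHz can still deliver a lower absolute score if it closes timing at a lower clock frequency.

    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical cores -- all figures are invented for illustration. */
        double core_a_cm_per_mhz = 4.0, core_a_fmax_mhz = 300.0;
        double core_b_cm_per_mhz = 3.0, core_b_fmax_mhz = 500.0;

        /* Absolute throughput = per-MHz score x achievable clock frequency. */
        printf("Core A: %.0f CoreMark\n", core_a_cm_per_mhz * core_a_fmax_mhz); /* 1200 */
        printf("Core B: %.0f CoreMark\n", core_b_cm_per_mhz * core_b_fmax_mhz); /* 1500 */
        return 0;
    }

Core A wins on the per-MHz metric, but Core B delivers more absolute throughput, which is what matters for a performance-bound design.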

If your main concern is floating-point performance, bear in mind that DMIPS and CoreMark are integer benchmarks. You would be better off comparing cores on the basis of a floating-point benchmark such as LINPACK.

Ultimately, it always makes sense to invest the time in running realistic software on a processor core to assess whether it gives you the performance you need. If you are looking at RISC-V, profiling your software to find the computational bottlenecks can also help you assess whether adding custom instructions would improve performance.
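
As a sketch of that flow, suppose profiling shows a saturating multiply-accumulate dominating a DSP loop. The plain C below is the reference implementation; the intrinsic mentioned in the comment is purely hypothetical, standing in for whatever mechanism your RISC-V toolchain offers for emitting a custom instruction.

    #include <stdint.h>

    /* Reference C for a saturating multiply-accumulate -- the kind of hot spot
     * profiling might reveal. With a custom RISC-V instruction, the body could
     * collapse to a single intrinsic call, e.g. __builtin_custom_macsat(acc, a, b)
     * (a hypothetical name, not a real toolchain builtin). */
    static inline int32_t mac_sat(int32_t acc, int16_t a, int16_t b)
    {
        int64_t r = (int64_t)acc + (int64_t)a * (int64_t)b;
        if (r > INT32_MAX) r = INT32_MAX;
        if (r < INT32_MIN) r = INT32_MIN;
        return (int32_t)r;
    }

    /* The loop a profiler would flag; each iteration is a candidate for the
     * custom instruction above. */
    int32_t dot_sat(const int16_t *x, const int16_t *y, int n)
    {
        int32_t acc = 0;
        for (int i = 0; i < n; i++)
            acc = mac_sat(acc, x[i], y[i]);
        return acc;
    }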

It is not just about processor performance and scores

In this article we have looked at processor performance, but that is only one aspect of PPA and one factor to consider when choosing a processor. PPA numbers are always about balance, and all of them matter when choosing an IP for a project, alongside other key considerations: the ISA, processor complexity, processor memory, and even the licensing model will impact your choice. Find out more in our white paper “What you should consider when choosing a processor IP core”.

Roddy Urquhart
