Meet us at the RISC-V Summit in San Jose, CA, on December 13-14, 2022! 

Domain-Specific Accelerators

May 21, 2021

Semiconductor scaling has fundamentally changed

For about fifty years, IC designers have been relying on different types of semiconductor scaling to achieve gains in performance. Best known is Moore’s Law which predicted that the number of transistors in a given silicon area and clock frequency would double every two years. This was combined with Dennard scaling which predicted that with silicon geometries and supply voltages shrinking, the power density would remain the same from generation to generation, meaning that power would remain proportional to silicon area. Combining these effects, the industry became used to processor performance per watt doubling approximately every 18 months. With successively smaller geometries, designers could use similar processor architectures but rely on more transistors and higher clock frequencies to deliver improved performance.

48 Years of Microprocessor Trend Data. Source K. Rupp.

Since about 2005, we have seen the breakdown of these predictions. Firstly, Dennard scaling ended with leakage current rather than transistor switching being the dominant component of chip power consumption. Increased power consumption means that a chip is at the risk of thermal runaway. This has also led to maximum clock frequencies levelling out over the last decade.

Secondly, the improvements in transistor density have fallen short of Moore’s Law. It has been estimated that by 2019, actual improvements were 15× lower than predicted by Moore in 1975. Additionally, Moore predicted that improvements in transistor density would be accompanied by the same cost. This part of his prediction has been contradicted by the exponential increases in building wafer fabs for newer geometries. It has been estimated that only Intel, Samsung, and TSMC can afford to manufacture in the next generation of process nodes.

Change in design is inevitable

With the old certainties of scaling silicon geometries gone forever, the industry is already changing. As shown in the chart above, the number of cores has been increasing and complex SoCs, such as mobile phone processors, will combine application processors, GPUs, DSPs, and microcontrollers in different subsystems.

However, in a post-Dennard, post-Moore world, further processor specialization will be needed to achieve performance improvements. Emerging applications such as artificial intelligence are demanding heavy computational performance that cannot be met by conventional architectures. The good news is that for a fixed task or limited range of tasks, energy scaling works better than for a wide range of tasks. This inevitably leads to creating special purpose, domain-specific accelerators.
This is a great opportunity for the industry.

What is a domain-specific accelerator?

A domain-specific accelerator (DSA) is a processor or set of processors that are optimized to perform a narrow range of computations. They are tailored to meet the needs of the algorithms required for their domain. For example, for audio processing, a processor might have a set of instructions to optimally implement algorithms for echo-cancelling. In another example, an AI accelerator might have an array of elements including multiply-accumulate functionality in order to efficiently undertake matrix operations.

Accelerators should also match their wordlength to the needs of their domain. The optimal wordlength might not match common ones (like 32-bits or 64-bits) encountered with general-purpose cores. Commonly used formats, such as IEEE 754 which is widely used, may be overkill in a domain-specific accelerator.

Also, accelerators can vary considerably in their specialization. While some domain-specific cores may be similar to or derived from an existing embedded core, others might have limited programmability and seem closer to hardwired logic.  More specialized cores will be more efficient in terms of silicon area and power consumption.
With many and varied DSAs, the challenge will be how to define them efficiently and cost-effectively.

Roddy Urquhart

Roddy Urquhart

Related Posts

Check out the latest news, posts, papers, videos, and more!

Building the highway to automotive innovation

May 5, 2022
By Jamie Broome

RISC-V Summit 2021

December 17, 2021
By Rupert Baines

Why and How to Customize a Processor

October 1, 2021
By Roddy Urquhart

More than Moore with Domain-Specific Processors

May 6, 2020

Last month was the 55th anniversary of Gordon Moore’s famous paper Cramming more components onto integrated circuits. He took a long-term view of the trends in integrated circuits being implemented using successively smaller feature sizes in silicon. Since that paper, integrated circuit developers have been relying on three of his predictions:

  1. The number of transistors per chip increasing exponentially over time.
  2. The clock speed increasing exponentially.
  3. The cost per transistor decreasing exponentially.

These predictions have largely held true for almost half a century, enabling successive generations of processors to achieve higher computational performance through greater processor complexity and higher clock speeds. These improvements were mainly delivered through general-purpose processors implemented in new technology nodes.

From about 2005, the improvements in clock frequency began to level off, leading to a levelling-off of single thread performance. Since then, using multiple cores on a single die has become commonplace, but again these cores were mainly general-purpose ones, whether application processors or MCUs.

If you are designing a chip with some performance challenges, do you simply follow Moore’s law and move into a smaller silicon geometry and use general-purpose processor cores? That could be a costly approach since mask-making costs are higher in small geometries. Also, you may not achieve your performance in the most efficient way.

Many embedded applications involve cryptography, DSP, encoding/decoding or machine learning. Each of these operations typically run inefficiently on a general-purpose core. For example, Galois fields are commonly used in cryptography, but multiplication operations take many clock cycles.

So, what can be done to deal with computational bottlenecks? In extreme cases, like dealing with real time video data, it may make sense to create dedicated computational units to handle a narrow range of computationally intensive operations. However, this may not be the best trade-off.

It will often be desirable to run both computationally intensive operations and other embedded functions on the same processor core.  Ideally, the most silicon-efficient processor implementation will be one that is tuned for your particular application.

This can be achieved by creating a processor core that has custom instructions targeted to address the bottlenecks. Adding custom instructions does not have to be expensive in silicon resources. For example, Microsemi created custom DSP instructions for their RISC-V-based audio processor products: Their custom Bk RISC-V processor delivered 4.24× the performance of the original RV32IMC core but required only a 48% increase in silicon area. Furthermore, the code size shrunk to 43% of the original size1.

So how can you efficiently implement custom instructions? And equally importantly, how do you verify that they are implemented correctly? We will be visiting these topics on future posts.

1 Dan Ganousis & Vijay Subramaniam, Implementing RISC-V for IoT Applications, Design Automation Conference 2017

Roddy Urquhart

Roddy Urquhart

Related Posts

Check out the latest news, posts, papers, videos, and more!

Processor design automation to drive innovation and foster differentiation

July 7, 2022
By Lauranne Choquin

Single unified toolchain empowering processor research

May 23, 2022
By Keith Graham

Design for differentiation: architecture licenses in RISC‑V

May 16, 2022
By Rupert Baines