Understanding the Performance of Processor IP Cores

Blog, News & Docs

Looking at any processor IP, you will find that their vendors emphasise PPA (performance, power & area) numbers. In theory, they should provide a level playing field for comparing different processor IP cores, but in reality, the situation is more complex. Let us consider performance.

The first thing to think about is what aspect of performance you care about. Do you care more about the absolute throughput that you want (performance per second), or the performance per MHz? In an application such as machine vision, which is continuously running and requiring the use of complex algorithms, it is likely that you will care about the absolute throughput. However, if you have a wireless sensor node with a low duty cycle, when the node wakes up, you will want it to be active for as few clock cycles as possible. This means you will care about how much computation you achieve per MHz.

About 40 years ago, computers were compared on the basis of MIPS (millions of instructions per second) although the problem is – what is an instruction? Instructions vary considerably in complexity and from one architecture to another, thus an operation will generally require less cycles in a CISC processor than a RISC one. MIPS were only helpful when comparing products with similar architectures and were called “meaningless indices of performance” by some!

Another thing to think about is the type of computation that you expect to care most about. Is it integer operations – and if so, which ones – or, say, floating point computations? In the past, MFLOPS (million floating point operations per second) was a popular measure. But again, what is an ‘operation’?

Today, synthetic benchmarks are universally used with processor IP cores. They have the following characteristics:

  1. They are relatively small and portable,
  2. They are representative of commonly used relevant applications,
  3. They are reproducible and transparent,
  4. They can be applied to a range of processors fairly,
  5. They express the benchmark result as a single number.

A benchmark that has been popular for the last 36 years is the Dhrystone benchmark. Its name is a play on words comparing it with the once-popular Whetstone benchmark. While Whetstone focused on floating point operations, Dhrystone focused on integer and string operations. The Dhrystone benchmark results are generally quoted as DMIPS (the Dhrystone score divided by that of a nominally 1 MIPS machine). The benchmark has been criticised because modern compilers can optimize away parts of the work, meaning that it partly tests compiler rather than processor performance.

For floating point, Whetstone is rarely used at present and it is more likely that LINPACK would be used. LINPACK involves LU decomposition of a matrix using floating point numbers. The result is expressed in MFLOPS.

Another popular synthetic benchmark for embedded applications has been EEMBC’s CoreMark® which aims to undertake operations that are representative of embedded integer processing needs. These include list processing, matrix operations, finite state machines, and CRC.

As you can see, there are various benchmark systems out there, each suited for measuring a slightly different type of performance. So how do you assess performance when choosing processor IP for your project? If your embedded software has similar operations to a synthetic benchmark, then that benchmark may give you useful initial guidance quickly and simply. However, normally such benchmarks are quoted per MHz, for example CoreMark/MHz. The per MHz figure is normally a good indication for a low-power application where you are looking for good results per cycle. However, if you are looking for high absolute performance, this may be misleading. Instead you should consider, say, the CoreMarks achievable at your target clock frequency.

If your main issue is floating point performance, bear in mind that DMIPS and CoreMark are integer benchmarks. You would be better comparing cores on the basis of a floating-point benchmark such as LINPACK.

Ultimately, it always makes sense to invest the time in running realistic software on a processor core to assess whether the core gives you the performance you need. If you are looking at RISC-V, then profiling your software to understand where the computational bottlenecks are can also lead to assessing whether adding custom instructions can give you improvements in performance.

Roddy Urquhart