ZPUino vs Arduino


monsonite


I have been using Alvie's ZPUino for a few weeks now, and wanted to find a method of comparing it to the standard 16MHz Arduino based on the ATmega328.

 

I thought it might be possible to run a few benchmark programs to try to characterise its performance.

 

I have written this up in my blog - reproduced for you below.

 

http://sustburbia.blogspot.co.uk/2015/05/benchmarking-arduino-and-his-chums.html

 

I cannot guarantee that my method is ideal, but it asks some questions about the relative performance of Arduino and ZPUino, and shows that there is scope for improvement.

 

ZPUino is a stack based processor, and it may well be that the C compiler fails to exploit this architecture efficiently.

 

I'm looking forward to discussion, so we might better understand the ZPUino architecture and ways to better use it to full advantage.

 

 

regards

 

Ken

 

Benchmarking Arduino - and his Chums.

 
Background

The standard Arduino based on the ATmega328 is an 8-bit device with a 16MHz clock frequency and 2K bytes of RAM.

I have for some time been exploring more powerful alternatives to the Arduino - especially the 32 bit STM32Fxxx ARM Cortex M4 range of microcontrollers, and some softcore processors implemented in an FPGA.

These processors can all be programmed using the Arduino IDE - so in theory, code written for the Arduino will run on all devices, almost without modification. The flavour of C++ used by Arduino has become a kind of lingua franca for these widely varying processors, allowing access to a vast knowledge base and range of libraries that permit the easy interfacing of hardware devices. In truth, if you wish to use an integrated device or sensor, then someone will already have created an Arduino library for it.

Since the earliest days of commercial computers, both manufacturers and users have had a strong interest in their computing performance. Computers were expensive, and computing time was equally expensive. Any way of increasing performance and reducing programming costs was sought after. As memory technologies improved, processor cycle times reduced to match the shorter access time of the memory.

When launched in 1965, the PDP-8 was capable of 312,500 12-bit additions per second.  How does the Arduino compare with that figure?

One solution is to use standard benchmark code, of which there are several well documented programmes, designed to test the various performance aspects of the processor. These include:

Dhrystone  - an integer arithmetic benchmark
Whetstone  - a floating point benchmark
CoreMark   - an EEMBC benchmark aimed at embedded processor cores
LINPACK    - a floating point linear algebra benchmark

The Dhrystone benchmark code - suitable for small microcontrollers - is available here. However, converting it to an Arduino compatible format is causing some difficulties.

The Whetstone test code adapted for Arduino by Thomas Kirchner is here.
When run on a standard Arduino 16MHz Duemilanove the Whetstone produced the following result:

Starting Whetstone benchmark...
Loops: 1000 Iterations: 1 Duration: 81740 millisec.
C Converted Double Precision Whetstones: 1.22 MIPS

On the STM32F103 board with a 72MHz clock

Starting Whetstone benchmark...
Loops: 1000 Iterations: 1 Duration: 19691 millisec.

C Converted Double Precision Whetstones: 5.08 MIPS

So the STM32F103 appears to be running at approximately four times the speed of the Arduino.  This speed increase comes mainly from the faster clock on the STM32F103, rather than from the ARM processor executing the compiled code any more efficiently than the AVR.
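
As a rough cross-check of the two results above, the printed ratings are consistent with the usual C Whetstone formula, KIPS = 100 x loops x iterations / seconds. The sketch below assumes that formula; the function names are illustrative and are not taken from the benchmark source. It also divides the rating by the clock frequency, which is one way to separate clock speed from per-cycle efficiency.

```cpp
#include <assert.h>

// Whetstone rating in MIPS, assuming the conventional C Whetstone formula:
// KIPS = 100 * loops * iterations / elapsed_seconds.
double whetstoneMips(long loops, long iterations, double durationMs) {
    assert(durationMs > 0.0);
    const double seconds = durationMs / 1000.0;
    const double kips = 100.0 * loops * iterations / seconds;
    return kips / 1000.0;
}

// Rating divided by clock frequency (MIPS per MHz).
double mipsPerMhz(double mips, double clockMhz) {
    assert(clockMhz > 0.0);
    return mips / clockMhz;
}
```

With the durations quoted above this gives about 1.22 and 5.08 MIPS, a ratio of roughly 4.15 against a clock ratio of 4.5. Per MHz the two boards come out at roughly 0.076 (Arduino) and 0.071 (STM32F103).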

Whilst an indication of processor performance, the benchmarks are a somewhat artificial test, and the actual performance of one processor compared to another will depend on the application.  Additionally, the manner in which the compiler interprets the C source code and efficiently converts it into the native machine language of the processor has an effect on the overall processing speed.

Into Practice

The Arduino is a great platform for trying things out.  Whilst not the fastest board available, its resources are easily accessible, and the millis() and micros() timer functions allow simple benchmarking to be done.

Remembering the claimed performance for the PDP-8, I decided to set up a simple addition test for Arduino. First I formed an array of 16 bit integers - remembering that in the Arduino there is only sufficient RAM space for about 500 16-bit words. Exceeding this gives a risk of overwriting some of the stack, heap and system variables.

I then loaded up the array with random integers 

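unsigned int m[500];   // assumed declaration: the 500-word array described above

void setup()
{
  Serial.begin(115200);

  // Fill every element of the array with a random 16-bit value
  for (int i = 0; i < 500; i++)
  {
    m[i] = random(0, 65535);
  }
}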

The main routine would then add two of the memory locations together, working its way through the array. The time taken for the 500-iteration loop was calculated using the micros() function.
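
The timing method described above can be sketched as follows. This is not the code from the Gist: the helper names, the volatile array and the injectable clock are all assumptions made so that the same routine can be checked on a PC with a fake clock and then, on an Arduino core, be handed micros as the clock.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Average time per iteration in microseconds. `nowUs` is any callable that
// returns a free-running 32-bit microsecond count (micros on Arduino cores).
template <typename Clock, typename Body>
double microsPerIteration(Clock nowUs, size_t iterations, Body body) {
    assert(iterations > 0);
    const uint32_t start = nowUs();
    for (size_t i = 0; i < iterations; ++i) {
        body(i);
    }
    // Unsigned subtraction still gives the right answer across one counter wrap.
    const uint32_t elapsed = nowUs() - start;
    return static_cast<double>(elapsed) / static_cast<double>(iterations);
}

// Hypothetical test data: 500 words plus two spare so m[i + 2] stays in range.
constexpr size_t kIterations = 500;
volatile uint16_t m[kIterations + 2];

// Case 3 from the results: add two memory locations, store into a third.
// volatile stops the compiler from optimising the accesses away.
inline void addAndStore(size_t i) {
    m[i + 2] = static_cast<uint16_t>(m[i] + m[i + 1]);
}
```

On an Arduino the call would be along the lines of microsPerIteration(micros, kIterations, addAndStore). The figure it returns includes the loop overhead as well as the addition itself.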

Results were as follows:

1. Adding a constant to memory    1uS
2. Adding contents of two memory locations into a variable   1.4uS
3. Adding  contents of two memory locations and storing back into a third memory location 1.6uS

So based on this, the Arduino is performing addition of memory located operands at between 2 and 3 times the speed of the 1965 PDP-8.

However, we should bear in mind that at 16MHz the Arduino has 16 clock cycles per microsecond.  Whilst an add is a single cycle instruction, by the time we have made it a 16-bit add and involved a memory access, the Arduino is taking roughly a microsecond, or about 16 clock cycles, to achieve a common operation in a typical program.  This goes some way to explaining why the Arduino comes in at 1.22MIPS in the Whetstone benchmark.
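
The comparison with the PDP-8 is simple arithmetic, sketched here with illustrative function names: 312,500 additions per second corresponds to 3.2uS per addition, which is then divided by each measured Arduino time.

```cpp
#include <assert.h>

// Time for one addition in microseconds, given additions per second.
double microsPerAdd(double addsPerSecond) {
    assert(addsPerSecond > 0.0);
    return 1.0e6 / addsPerSecond;
}

// How many times faster a measured time is than a reference time.
double speedup(double referenceUs, double measuredUs) {
    assert(measuredUs > 0.0);
    return referenceUs / measuredUs;
}
```

Against the PDP-8's 3.2uS, the three measured times of 1uS, 1.4uS and 1.6uS give ratios of 3.2, about 2.3 and 2.0 respectively.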

I then conducted the same test on the 72MHz STM32F103 board programmed using Arduino_STM32.

1. Adding a constant to memory    0.156uS    (6.4X faster)
2. Adding contents of two memory locations into a variable  0.294uS  (4.76X faster)
3. Adding  contents of two memory locations and storing back into a third memory location 0.32uS (5X faster).

Next it was the turn of the 96MHz ZPUino - a softcore running in a Xilinx Spartan 6 - on a Papilio Duo FPGA board.

1. Adding a constant to memory    0.58uS    (1.72X faster than Arduino)
2. Adding contents of two memory locations into a variable  0.708uS  (1.98X faster)
3. Adding  contents of two memory locations and storing back into a third memory location 0.706uS (2.27X faster).

The results for the ZPUino were a little disappointing - bearing in mind it is being clocked at 6X the speed of the Arduino. However it is a stack based processor, and probably C does not compile efficiently to its stack based architecture.
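
One way to take the clock frequency out of the comparison is to convert each measured time into clock cycles per operation. The helper below is only an illustration of that arithmetic; its name is made up here.

```cpp
#include <assert.h>

// Clock cycles spent per operation: time in microseconds times clock in MHz.
double cyclesPerOp(double microsPerOp, double clockMhz) {
    assert(microsPerOp > 0.0 && clockMhz > 0.0);
    return microsPerOp * clockMhz;
}
```

Using the times quoted above, this works out at 16 to 25.6 cycles on the 16MHz Arduino, about 11.2 to 23.0 cycles on the 72MHz STM32F103, and about 55.7 to 68.0 cycles on the 96MHz ZPUino. The measured times presumably include loop overhead, so these are upper bounds on the cost of the addition itself.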

Conclusions

The Arduino can execute real code at around 1 million instructions per second.
Moving up to a 72MHz STM32F103 ARM will give about a 5X speed advantage over the Arduino.
Soft core processors are interesting, and it must be possible to improve their performance.

The skeleton Arduino code for these benchmarks is on this Github Gist

 

Hmm your numbers do not make any sense.

 

STM32 is, by far, faster than AVR8. ZPU is not that much faster, but still fast enough.

 

Plus, you forget a couple of things: Both AVR and STM32 use internal memory, ZPUino uses external SDRAM in most cases, which makes it slower.

Also, "int" is 16-bit on AVR8, and 32-bit on the other two.

 

ZPUino also does not have D-Cache, so all memory reads/writes need to be performed directly with RAM.

 

I wonder if I can run your tests in my new CPU (XThunderCore). Even without DCACHE, it should perform very well.


No, I did not see it as a criticism :)

 

Actually, what I find odd is how STM32 performs. It should be more performant. Note that you should not use the clock speed as a measurement indicator - all platforms behave differently.

This means that a 5MHz CPU is not necessarily 5x faster than a 1MHz CPU, not even with the same CPU core. Other factors exist that may limit or improve the overall speed.

 

What we would like to know is how much work you can perform per MHz, and how much power you dissipate per MHz.

 

Easy for microcontrollers/small CPUs, not that easy for big ones. TLB misses in Intel architectures can take more than 800 clock cycles to complete (refetching TLB, refetching cache).

http://www.7-cpu.com/cpu/IvyBridge.html

 

* RAM Latency = 30 cycles + 53 ns

 

If TLB needs to fetch from RAM, at 3.4GHz (clock period 0.294ns), you have 180 clocks just for latency. Add 3x time latency (one fetch for TLB pointer [may be miss], one fetch for TLB itself [may be miss], one fetch for Cache contents [miss]) .. add transaction delay.... it's huge.
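
The cycle counts above follow from converting the "cycles + nanoseconds" latency figure into core clock cycles. A small sketch of that conversion, with an illustrative function name:

```cpp
#include <assert.h>

// Convert a latency quoted as "cycles + nanoseconds" into core clock cycles.
double latencyCycles(double fixedCycles, double extraNs, double clockGhz) {
    assert(clockGhz > 0.0);
    return fixedCycles + extraNs * clockGhz;   // ns * GHz = cycles
}
```

At 3.4GHz the 53 ns part alone is about 180 cycles, and about 210 cycles once the fixed 30 cycles are included. Three dependent fetches in a row therefore come to roughly 630 cycles before any transaction delay is added.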

 

We don't realise how slow these things are....

 

Alvie


As a reference, I ran the add_test_4 on Arduino ported to the free microblaze_mcs soft processor running at 100 MHz and got 60 micros/loop or 60/500 = 0.12 uS per add.

 

Edit:  I also ran it on a system using the full (i.e. license required) Microblaze core, also at 100 MHz and running in BRAM, and got 55 micros/loop.

 

Magnus


Magnus,

 

Thanks for your feedback.

 

I am not overly familiar with the Microblaze architecture. Is it using internal RAM?  The external RAM on the Papilio Duo may well be the bottleneck - especially if we have to do four byte reads to get a 32 bit integer.

 

I hope that none of my comments have been taken as criticism. I just want to open up discussion - so that we can all have a better understanding of how memory bandwidth and other architectural constraints can limit the performance of a cpu.

 

If you wish - please try the Whetstone test on the microblaze_mcs

 

https://developer.mbed.org/users/kirchnet/code/Nucleo_vs_Arduino_Speed_Test/

 

 

regards

 

Ken


Microblaze_mcs is the free version of the 32-bit RISC Microblaze soft core.  It uses a 3-stage pipeline and is restricted to run in internal ram (BRAM) at max 64 kB.  I can meet timing in a Spartan6 -3 speed grade at 120 MHz.

 

The full Microblaze core can have a 5-stage pipeline and can run in internal RAM as well as external DRAM with optional Icache and dcache (write-back or write-through).

 

I did not run the test using the full Microblaze core with program and data in DRAM since that's really a test of the DRAM/icache/dcache system and not the processor core itself.

 

Magnus


Monsonite,

Very good timing. I was just running some math timings (multiply and divide for different data types) on ZPUino for performance testing last night on my Duo.

 

The project I am working on does lots of math, so I just wanted to get an idea of the timeframes for calculations.

 

 

Alvie,

Is there a good breakdown of all the different ZPUino versions and the functionality they have? I was trolling through old posts the other day and saw mention of different ZPUino versions (ZPUino, ZPUino 1.0, ZPUino 2.0, ZPUino Vanilla, a version with F16MUL instruction, etc). Also, any idea which functionality is in all of them? Not sure if there are some posts I missed describing them; if so, a pointer in that direction would help.

 

If I can get some details I will volunteer to put together a document with all the data for the versions, so everyone can use it for reference and maintain it going forward.

 

Thanks,
Chris


 


 

Well, most designs use ZPUino Extreme Core, which is used by 1.0 and 2.0.

 

FMUL16 is quite a dedicated instruction, which helps if you do fixed-point 16.16, but needs to be instantiated in asm (and cannot be done from within "C" due to how GCC interacts with ZPU). FMUL16 behaves as:

uint32_t fmul16(uint32_t lhs, uint32_t rhs)
{
    uint64_t result = (uint64_t)lhs * (uint64_t)rhs;
    return result >> 16;
}

Quite useful for some DSP functions (like FFT).
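
To make the 16.16 behaviour concrete, here is a small C model that can be run on a PC. The fmul16 function repeats the behaviour described above; toFixed and fromFixed are hypothetical helpers written for this example (they are not part of ZPUino) and only handle unsigned values.

```cpp
#include <assert.h>
#include <stdint.h>

// C model of FMUL16 as described above: 32x32 -> 64-bit multiply, then >> 16.
uint32_t fmul16(uint32_t lhs, uint32_t rhs) {
    uint64_t result = (uint64_t)lhs * (uint64_t)rhs;
    return (uint32_t)(result >> 16);
}

// Hypothetical helper: convert a non-negative value to unsigned 16.16 fixed point.
uint32_t toFixed(double value) {
    assert(value >= 0.0 && value <= 65535.0);
    return (uint32_t)(value * 65536.0 + 0.5);   // round to nearest
}

// Hypothetical helper: convert unsigned 16.16 fixed point back to a double.
double fromFixed(uint32_t fixed) {
    return (double)fixed / 65536.0;
}
```

For example, fmul16(toFixed(1.5), toFixed(2.25)) yields the 16.16 representation of 3.375. The upper bits of the 64-bit product are discarded, so results of 65536.0 or more will wrap.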


Below is a link to a zip file with an Arduino version of the Dhrystone test.  I took out a lot of printing to save memory but it still requires more than 2K of RAM (i.e. it won't run on an ATmega328).

 

On an Arduino Mega board (ATmega1280@16MHz) I get 12711.22 Dhrystones/sec and 7.23 VAX MIPS using Arduino 1.5.2.

 

You might have to increase the value of Numbers_Of_Runs on a faster processor in order for the test to take at least 2 sec.

 

http://www.saanlima.com/download/dhry21a.zip

 

 

Magnus


Hi Magnus

 

Your Dhrystone benchmark was just what I needed.  It ran straight out of the tin on my 72MHz STM32F103 board using Arduino_STM32 

 

Results were

 

Dhrystone Benchmark, Version 2.1 (Language: C)
Execution starts, 300000 runs through Dhrystone
 
Execution ends
Microseconds for one run through Dhrystone: 11.66
Dhrystones per Second: 85762.68
VAX MIPS rating = 48.81
 
 
I then ported it to ZPUino2.0  - and after a little fiddling got the following:
 
ZPUINO
CP
Loaded, starting...
Dhrystone Benchmark, Version 2.1 (Language: C)
Execution starts, 300000 runs through Dhrystone
 
Execution ends
Microseconds for one run through Dhrystone: 46.91
Dhrystones per Second: 21319.04
VAX MIPS rating = 12.13
 
So without modification, the ZPUino runs at about 1.67 times the speed of your 16MHz ATmega1280.
 
This is probably explained by the fact that the ZPUino needs to do four external 8 bit SRAM accesses.
 
I am using ZPUino 2.0 - but it's the variant used for 800x600 VGA and the Adafruit GFX library. I will do further tests to see if this is perhaps an influence on the speed.
 
Edit.
 
I ran it with the plain vanilla LogicStartShield circuit and got the following result
 
Microseconds for one run through Dhrystone: 37.95
Dhrystones per Second: 26351.79
VAX MIPS rating = 15.00
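
The VAX MIPS figures in these outputs are consistent with dividing Dhrystones per second by 1757, the score of the VAX 11/780 reference machine. The sketch below shows that conversion and a per-MHz version of it; the function names are illustrative, and the 96MHz clock assumed for the ZPUino is taken from the earlier addition test rather than from these runs.

```cpp
#include <assert.h>

// Dhrystones/second of the VAX 11/780 reference machine (1 "VAX MIPS").
const double kVaxDhrystonesPerSecond = 1757.0;

double vaxMips(double dhrystonesPerSecond) {
    return dhrystonesPerSecond / kVaxDhrystonesPerSecond;
}

// VAX MIPS per MHz of clock, to compare cores independently of clock speed.
double vaxMipsPerMhz(double dhrystonesPerSecond, double clockMhz) {
    assert(clockMhz > 0.0);
    return vaxMips(dhrystonesPerSecond) / clockMhz;
}
```

On that basis the results quoted in this thread work out at about 0.45 VAX MIPS per MHz for the ATmega1280, 0.68 for the STM32F103, and 0.13 (VGA variant) and 0.16 (plain variant) for the ZPUino.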
 
 
 
If anyone else wants to try this for independent confirmation - much appreciated.
 
 
 
Ken

ZPU is a stack processor, and stack processors are slow :)

 

One thing they are bad at is branching, because it can never be predicted (it will be a word in stack, but you don't know which until you execute). Also bad at reloading the stack pointer.

 

Another thing is they are bad for normal compiler integration, since compilers expect registers to exist (we emulate this behaviour by placing vregs in stack offsets). Adding "1" to a variable (which needs to be in stack) requires quite a few ops:

loadsp [variable offset]
im 1
add
storesp [variable offset]

In most register-based architectures, if your variable is a register, it takes a single instruction.
Some others also allow adding to a memory location directly.
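
The four-operation sequence can be illustrated with a toy interpreter. This is only a model written for this example: it is not the real ZPU encoding, and it keeps local variables in a separate frame array instead of real stack-pointer-relative slots.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Toy model only: four ZPU-style operations acting on a small operand stack.
enum Op { LOADSP, IM, ADD, STORESP };

struct Instr {
    Op op;
    int32_t arg;   // frame slot for LOADSP/STORESP, constant for IM, unused for ADD
};

// Executes `count` instructions against the local-variable array `frame`.
void run(const Instr* program, size_t count, int32_t* frame) {
    int32_t stack[16];
    size_t depth = 0;
    for (size_t pc = 0; pc < count; ++pc) {
        const Instr& in = program[pc];
        switch (in.op) {
            case LOADSP:  assert(depth < 16); stack[depth++] = frame[in.arg]; break;
            case IM:      assert(depth < 16); stack[depth++] = in.arg; break;
            case ADD:     assert(depth >= 2); stack[depth - 2] += stack[depth - 1]; --depth; break;
            case STORESP: assert(depth >= 1); frame[in.arg] = stack[--depth]; break;
        }
    }
}

// "variable += 1" as the four-operation sequence quoted above.
const Instr kIncrement[] = {
    {LOADSP, 0}, {IM, 1}, {ADD, 0}, {STORESP, 0},
};
const size_t kIncrementLength = sizeof(kIncrement) / sizeof(kIncrement[0]);
```

Running kIncrement against a frame holding 41 leaves 42 in the same slot after four operations, where a register machine with the variable already in a register would need one.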

 

Alvie


I just wanted to pipe in real quick here, but I think the point of the ZPUino processor is not speed, performance, or Dhrystone results. It is flexibility: when you need speed, you want to take that functionality out of the soft processor and implement it as hardware on the Wishbone bus. ZPUino makes controlling hardware on an FPGA much, much easier, and it does so without eating up all of your FPGA fabric. Other soft processors, like MicroBlaze and the LatticeMico processor, perform better, but they eat up more resources too.

 

With an FPGA you want to have an easy and convenient to use processor so you aren't stuck doing simple control functionality in state machines. But your heavy lifting should be done with Wishbone peripherals connected to the soft processor. This is where the ZPUino excels: you have to look at the overall picture of what you can accomplish rather than the benchmarking results of the soft processor.

 

For example, with the RGB LED panels, an Arduino can barely drive those panels. You can only draw simple lines and geometric shapes. Looking at your benchmark results for the ZPUino you would think it can't do much better, and if you used it in the same way as an Arduino it probably can't. But when you make a Wishbone controller for the RGB panel that drives all the control lines and writes to the RGB panel directly from a memory location shared with the ZPUino soft processor, then you can accomplish something the Arduino can't. You just use the ZPUino for simple control functionality, such as copying data from an SD card to the memory space for the RGB panel controller. That would be hard to do in VHDL but is simple on a soft processor. Overall you have the flexibility to drive exotic new hardware at any speed necessary and can accomplish things that are a reach for an AVR or STM chip, even though the ZPUino is not drastically more powerful than either of them. It's comparing apples to oranges in my opinion. The AVR chip and STM chip will never have the flexibility to do these things.

 

Jack.


Jack, Alvie,

 

I agree - we don't want to get too hung up about absolute speed here.  Both the Whetstone and the Dhrystone are artificial benchmarks, and generally not over representative of the sort of applications we commonly deal with.

 

As Alvie suggests, a stack based processor is not a good fit to a C compiler - which has been optimised for processors with a rich register set.

 

I discovered on Friday that the ZPUino can shunt lots of data around memory very quickly, when I moved a 64K x16 block of memory to another area in RAM in just over 1uS per word.

 

This ability to manipulate data with the flexibility of using unmodified Arduino code, running at more than twice the speed - and the massive 2MB RAM address space on the Papilio Duo - makes ZPUino a valuable asset.

 

Understanding what it excels at is all part of the process of learning how to use it effectively.

 

 

 

Ken


Absolutely, I had to write that post quickly so I hope the tone was not harsh or anything. I just wanted to get a couple of points down, but overall I think this exploration of the ZPUino's capabilities has been very useful and I wanted to thank you for starting this topic.

 

Thanks!

Jack.


Jack,

I think those are some great points. Correct me if I am off base here, but what I see is the flexibility to add hardware accelerators, which does not exist in microcontrollers (aside from improving an algorithm). So if you had a project with lots of integer multiplies or divides or floating point, you could implement some of those functions in hardware and call them from ZPUino, right?

 

Now an example, even a simple one, of how that is done would be a great addition to the knowledge base. Would they be Wishbone peripherals, or would they have to be put in a custom ZPUino?

 

Just my thoughts.

 

Thanks,
Chris


  • 4 weeks later...

Hello,

I'm in the process of learning FPGA using a DUO, but what really caught my eye was the comment from monsonite about "using unmodified Arduino code".

 

My goal is to learn FPGA basics and also be able to run unmodified Arduino code at the same time (to test with some shields).

 

But I have two questions:

#1) What version of ZPUino is needed to be able to run unmodified arduino code?

#2) Can I use the DUO 512KB for this purpose? Or had I better get a DUO 2MB?

 

Thanks for the info

 


  • 1 month later...

I am a newbie to this board, but a long time embedded FPGA softcore programmer ..

 

Relating to Jack's latest post, I think that the comparison is dependent on the task that is being accomplished.

 

My comparison benchmark was to implement an SD card reader for jpg files on the Papilio Pro and show them in sequence for animation. I used the gfx library and the jpeg decoder to store the images in RGB arrays and then sequenced through them. The animations are quite fast with no apparent jitter.

 

There would be no equivalent comparison with a regular Arduino, because the chip could not connect to the address and data buses of the RAM except through GPIO registers, and if implemented, the results would be rather dismal.

 

If there were a way to get fast graphics/animations on a DUO without using the ZPUino, I would be all ears.
