Thomas Hornschuh

RISC-V on Papilio Pro

12 posts in this topic

Hi all,

Over the last half year I have implemented a processor and surrounding SoC that brings the RISC-V ISA (http://riscv.org) to the Papilio Pro. It implements the 32-bit integer subset (RV32IM).

The project is hosted on GitHub (https://github.com/bonfireprocessor). It still needs some additional documentation, cleanup and ready-to-run ISE projects to make it easily reproducible for others. But I am posting the link now to find out whether anybody is interested in my work. I will soon also post a bitstream here so anybody with access to a Papilio Pro can play with it.

I have also ported eLua (http://www.eluaproject.net) to it.

@Jack: If you like I can also present the project in the GadgetFactory blog.

 

Regards

Thomas

 


Hello Thomas,

This looks very cool. I've been doing a contract job that is eating up most of my time, but hopefully I can give this a spin this weekend. Thank you for posting it; Dhia is going to get it up on the blog and social media tomorrow.

Thanks!

Jack.


Do you have clang+llvm working for the platform? Last time I looked it seemed like a work in progress. Or do they use gcc?

What's the current clock rate for the system? Can it go past 50 MHz?

Alvie


Currently I'm using gcc. There is an LLVM port in progress by Alex Bradbury from lowRISC (http://www.lowrisc.org/); I recently spoke with Alex at an event in Munich. They have made great progress: LLVM is able to pass 90% of the gcc torture tests right now. They are also in the process of upstreaming both the gcc and the LLVM ports. The LLVM port will now also support RV32; the preliminary port on the riscv.org website only supports RV64.

My design currently meets timing at 100 MHz. I think I can quite easily reach about 130 MHz; currently the limitations are more in some suboptimal code in the SoC (e.g. the 32KB BRAM for the bootloader is organized as 16*2K*32Bit blocks, which is not the best way to organize it, but it helps to run the same setup in simulation and on hardware quickly...).

The whole system (with UART, SPI interface and DRAM controller) uses 60% of the LX9 slices; the CPU itself uses 743 slices. It can go down to less than 500 slices if the M extension (mul and div) and some of the privilege mode features (e.g. the 64-bit cycle counters...) are removed. The RISC-V privilege mode is not very FPGA friendly, because the CSR registers are allocated in a sparse 12-bit address space, which consumes a lot of comparators and muxes to implement.

Running completely in Block RAM I reach 0.67 DMIPS/MHz; in DRAM it reaches only 0.35 DMIPS/MHz. The main reason is that I don't have a data cache implemented yet, only an instruction cache.

I will upload the bitstream and the binaries for eLua and dhrystone soon so you can easily test it :-)

 

Thomas

17 minutes ago, Thomas Hornschuh said:

My design currently qualifies for 100 MHz. I think I can quite easily reach about ~130 MHz

Not if you use the embedded multipliers. Those are slow (never managed to get a 32x32 to work above ~105MHz or so).

 

18 minutes ago, Thomas Hornschuh said:

Main reason is that I don't have a data cache implemented yet, only instruction cache. 

I have a data cache I wrote for ZPUino (not published, it's two-way associative). Let me know if you want to take a look.

Regarding bitfiles: how do you program the design afterwards, or do you have to embed the code inside the bitfile?

We can try porting the ZPUino bootloader for your new platform, should be pretty much trivial.

Alvie


Another question (note that I have not had much time to look at your implementation): why are you snooping the data bus cyc in the instruction cache? I assume it's your change, since it has a TH comment on it.

Alvie


Hi

 

4 hours ago, alvieboy said:

 

4 hours ago, Thomas Hornschuh said:

My design currently qualifies for 100 MHz. I think I can quite easily reach about ~130 MHz

Not if you use the embedded multipliers. Those are slow (never managed to get a 32x32 to work above ~105MHz or so).

 

I hope the clock can go higher with the 4-stage multiplier:

https://github.com/bonfireprocessor/bonfire-cpu/blob/riscv/rtl/lxp32_mulsp6.vhd

Of course the mult instructions now take 4 clocks instead of 2. It also consumes fewer LUTs than the original design.

4 hours ago, alvieboy said:

I have a data cache I wrote for ZPUino (not published, it's a two-way associative). Let me know if you want to take a look.

Definitely :-) A data cache is the hard part compared to an instruction cache.

 

4 hours ago, alvieboy said:

Regarding bitfiles: how you program the design afterwards - or do you have to embed the code inside the bitfile ?

The boot monitor is in the bitfile, added with data2mem. Currently I have 32 KB for it; the final version should be smaller.

The second stage is then loaded from a fixed address in flash to DRAM or downloaded with XModem. The second stage should implement a file system (e.g. SPIFFS).

Currently my boot monitor also has a flash write command. So the flash is initialised by first doing an XModem download and then writing it to flash. It automatically writes the downloaded number of sectors to flash, together with a small header containing information about the size.

To simplify testing, the boot loader implements a small subset of Linux-type syscalls. It uses the same ABI as the RISC-V Spike simulator (proxy kernel), so I can execute programs compiled for Spike.

 

4 hours ago, alvieboy said:

Another question - note that I have had not much time to look at your implementation - why are you snooping the data bus cyc in the instruction cache (I assume it's your change, has a TH comment on it) ?

Alvie

I think you mean lxp32_icache.vhd?

Basically this is outdated. The original lxp32 design (which I use as the base for Bonfire) has no real cache; it is more of a 256-byte prefetch buffer. When used with large prefetch_size values it has a very negative impact on data access performance with single-port RAMs like external SDRAM: it blocks the bus until the prefetch is finished.

I tried to solve this by monitoring dbus_cyc and aborting the prefetch. It didn't have a noticeable effect. Finally I decided to build a real direct-mapped cache:

https://github.com/bonfireprocessor/bonfire-cpu/blob/riscv/rtl/bonfire_dm_icache.vhd

It still contains the dbus_cyc signal, but it is not used. Actually I like this cache because it is clean, easy to understand and only consumes 20 slices + RAM.

It also has a few drawbacks:

  • When the cache line to be accessed changes, there is a one-clock penalty because of the tag RAM access.
  • The tag RAM is only updated when the full cache line has been read, so the cache miss latency is always the time for reading the full cache line.

The second point is something I would like to change at some point, but it is not a high priority yet. I think adding a data cache and branch prediction will help more...
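To make the direct-mapped behaviour concrete, here is a minimal Python model of how such a cache splits an address into tag, index and word offset, and why a miss always costs a full line refill. The geometry (64 lines of 8 words) and the class name are assumptions chosen for illustration; they are not taken from bonfire_dm_icache.vhd.

```python
class DirectMappedICache:
    """Toy model of a direct-mapped instruction cache (geometry is assumed)."""

    def __init__(self, num_lines=64, words_per_line=8):
        self.num_lines = num_lines
        self.words_per_line = words_per_line
        self.tags = [None] * num_lines  # one tag entry per cache line
        self.words_fetched = 0          # bus words read for refills

    def split(self, addr):
        """Split a byte address into (tag, line index, word offset)."""
        word = addr >> 2                      # 32-bit words
        offset = word % self.words_per_line
        line = word // self.words_per_line
        return line // self.num_lines, line % self.num_lines, offset

    def access(self, addr):
        """Return True on a hit; on a miss refill the whole line first."""
        tag, index, _ = self.split(addr)
        if self.tags[index] == tag:
            return True
        # The tag is only written after the complete line has been read,
        # so every miss costs words_per_line bus reads.
        self.words_fetched += self.words_per_line
        self.tags[index] = tag
        return False


cache = DirectMappedICache()
print(cache.access(0x100))  # False: cold miss, full line refill
print(cache.access(0x104))  # True: same line
print(cache.access(0x900))  # False: same index, different tag
print(cache.access(0x100))  # False: the line was evicted by 0x900
```

The last two accesses show the usual weakness of a direct-mapped design: two addresses that share an index keep evicting each other, whatever else is in the cache.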

The repo still needs some cleanup: there are unused files, and I also changed the name from wildfire to bonfire because I saw more potentially conflicting users of the wildfire name than of bonfire. The old name is still used in places.

Thomas

 

11 hours ago, Thomas Hornschuh said:

The tag RAM is only updated when the full cache line is read, therefore the cache miss latency is always the time for reading the full cache line

Implementing an IWF cache (Important Word First) is quite complex. I did it for xThundercore (which is another CPU I am developing), but it ended up quite big, and to be honest I did not see any spectacular performance improvement.

 

11 hours ago, Thomas Hornschuh said:

branch prediction will help more...

One technique (which is simple, but may require compiler awareness) is to assume all forward branches to be a miss, and all backward branches to be a hit.
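As a rough illustration of that static rule, here is a small Python sketch that treats "hit" as "predicted taken" and "miss" as "predicted not taken". The function names and the example trace are made up for illustration and do not come from any of the cores discussed here.

```python
def predict_taken(branch_pc, target_pc):
    """Static rule: a backward branch (target below the branch) is predicted taken."""
    return target_pc < branch_pc


def prediction_accuracy(trace):
    """trace: iterable of (branch_pc, target_pc, actually_taken) tuples."""
    trace = list(trace)
    correct = sum(predict_taken(pc, target) == taken for pc, target, taken in trace)
    return correct / len(trace)


# Hypothetical trace: a loop-closing backward branch at 0x120 -> 0x100,
# taken 9 times and falling through once when the loop exits.
loop_trace = [(0x120, 0x100, True)] * 9 + [(0x120, 0x100, False)]
print(prediction_accuracy(loop_trace))  # 0.9
```

The rule works well for loops because the closing branch is taken on every iteration except the last, which is also why it only pays off if the compiler lays out code accordingly.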

I'll send you my dcache by private message (and the write buffer).

Alvie

On 7.4.2017 at 11:16 AM, alvieboy said:

Implementing an IWF cache (Important Word First) is quite complex. I did it for xThundercore (which is another CPU I am developing), but it ended up quite big, and to be honest I did not see any spectacular performance improvement.

Good to hear that it may not be worth the effort. My idea to ease the implementation was to switch from Wishbone "incrementing burst" to e.g. "Wrap-8" mode and just start with the offset of the access triggering the miss.

So if, for example, the initial miss is at offset 4, the burst will be 4-5-6-7-0-1-2-3. The line offset counter would wrap around automatically anyway. Nevertheless, the hit determination would need additional logic to determine validity for single words in the cache line.
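The wrap-around ordering and the per-word validity it requires can be sketched in a few lines of Python. This is only a model of the idea, not Wishbone bus code; the function names are invented for illustration.

```python
def wrap_burst_order(miss_offset, burst_len=8):
    """Word offsets in the order a wrapping burst delivers them."""
    return [(miss_offset + beat) % burst_len for beat in range(burst_len)]


def valid_words_after(miss_offset, beats_done, burst_len=8):
    """Per-word valid flags once `beats_done` beats of the refill have arrived."""
    valid = [False] * burst_len
    for word in wrap_burst_order(miss_offset, burst_len)[:beats_done]:
        valid[word] = True
    return valid


print(wrap_burst_order(4))      # [4, 5, 6, 7, 0, 1, 2, 3]
print(valid_words_after(4, 2))  # only words 4 and 5 are valid so far
```

The second function is the extra logic mentioned above: during the refill a hit can only be signalled for words whose valid flag is already set, so one tag per line is no longer enough.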

 

On 7.4.2017 at 11:16 AM, alvieboy said:

One technique (which is simple, but may require compiler awareness) is to assume all forward branches to be a miss, and all backward branches to be a hit.

I'll send you my dcache by private message (and the write buffer).

Indeed, the RISC-V ISA spec describes exactly this approach as the simplest way of branch prediction. The code RISC-V gcc generates also seems to obey this rule. The RISC-V spec itself tries to be micro-architecture agnostic, but the code generator of a compiler of course cannot be. For example, the code generator assumes that the processor has a barrel shifter and shifts are cheap: masking the upper bits of a word (e.g. converting int to char) is done with a shift left/shift right pair with the number of bits to shift.
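The shift pair can be modelled in Python to show why it needs no mask constant. The helpers below emulate 32-bit sll/srl/sra on Python integers; the helper names are chosen for illustration and this is not actual compiler output.

```python
XLEN = 32
MASK = (1 << XLEN) - 1


def sll(x, n):
    """Logical shift left on a 32-bit register value."""
    return (x << n) & MASK


def srl(x, n):
    """Logical shift right on a 32-bit register value."""
    return (x & MASK) >> n


def sra(x, n):
    """Arithmetic shift right: the sign bit is replicated."""
    x &= MASK
    if x & (1 << (XLEN - 1)):
        x -= 1 << XLEN
    return (x >> n) & MASK


def zero_extend_low(x, bits):
    """Keep the low `bits` bits using a shift-left/shift-right pair."""
    return srl(sll(x, XLEN - bits), XLEN - bits)


def sign_extend_low(x, bits):
    """Same pair with an arithmetic right shift, e.g. int to signed char."""
    return sra(sll(x, XLEN - bits), XLEN - bits)


print(hex(zero_extend_low(0x12345678, 8)))  # 0x78
print(hex(sign_extend_low(0x000000F0, 8)))  # 0xfffffff0
```

Both conversions shift by 24 bits in each direction, which is cheap with a barrel shifter but costs many cycles on a core that shifts one bit per clock.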

These micro-architecture assumptions have already been discussed at some of the RISC-V workshops and presentations. The RISC-V inventors at UCB focus mainly on designing a Linux-capable 64-bit processor comparable with ARM Cortex-A series designs (without the "bloat", of course). In the community there are more designs focused on microcontroller-class processors. One example is PicoRV.
 

Thomas

Hi all, sorry for the long delay since my last post. I was distracted by a few other things; in addition, it took the German Telekom two weeks to get the upgrade of my internet connection to VDSL working. I now finally have 50/10 Mbit instead of 3/0.5 Mbit :D, so it was worth the trouble.

Attached to this post is a bitstream with the working Bonfire SoC for the Papilio Pro. It boots into a monitor program, which allows some basic operation of the board. The connection speed is 500000 baud by default. If this is a problem, I can also provide bitstreams with other default baud rates.

It should print a message like this:

Bonfire Boot Monitor 0.2d
MIMPID: 0001000e
MISA: 40001100
UART Divisor: 11
UART Revision 00000012
Uptime 0 sec
SPI Flash JEDEC ID: 001720c2
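The MISA value in this banner can be decoded with a short Python sketch, assuming the standard RISC-V misa layout for a 32-bit register: the base width field (MXL) sits in bits 31:30, and bit n in the low 26 bits stands for the extension with the n-th letter of the alphabet. The function name is made up for illustration.

```python
def decode_misa(misa):
    """Decode a 32-bit misa value into (XLEN, extension letters)."""
    mxl = (misa >> 30) & 0x3
    xlen = {1: 32, 2: 64, 3: 128}.get(mxl, 0)  # 0 means "not reported"
    extensions = "".join(
        chr(ord("A") + bit) for bit in range(26) if (misa >> bit) & 1
    )
    return xlen, extensions


print(decode_misa(0x40001100))  # (32, 'IM')
```

So the banner value 40001100 reports a 32-bit core with the I and M extensions, which matches RV32IM.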

The monitor supports the following commands:

  • D <address>: dump memory. It always dumps 64 32-bit words, starting by default at address 1000000. If no address is entered, the dump command automatically dumps the next 64 words.
  • X <load adr> <max size in hex>: download a file with the xmodem-crc protocol and load it to <load adr>. The default load address is 100000. When no size is specified, it will load the whole file provided it fits into the DRAM. Normally it is sufficient to just enter x without arguments. It has been tested to work with minicom under Linux.
  • G <address>: jump to <address> (the default is again hex 1000000 when omitted). Can be used to start a program downloaded with the X command.
  • E: print xmodem error status. Shows the status of the last xmodem download.
  • T: test DRAM. Makes a simple (destructive) pattern test of the DRAM. When running the bitstream for the first time it is best to use this command to check that everything is fine.
  • B: change baud rate. The user will be prompted for the new baud rate. Every value between 300 and 500000 is allowed; no further check is done, so it is possible to enter baud rates like 2423 :-)
  • I: re-display the boot message with some system info.
  • W: write boot image. Writes the image downloaded with the X command to the flash ROM. It will write a 4KB header to flash offset 512KB, and then the image data directly behind it. The command can only be executed directly after an X command, because it takes the size of the downloaded file to determine the size of the image. In addition, the X command sets the heap "sbrk" address to the first free address after the downloaded code; this address is also written to the flash header.
  • R: run boot image. Will run an image written with the W command.
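The flash layout used by the W command can be worked out with a bit of arithmetic. The Python sketch below takes the 512KB header offset and the 4KB header size from the description above; the 4KB sector size is an assumption, and the image length in the example is the 339968 bytes that the R command reports in the boot log further down. The function and key names are made up for illustration.

```python
FLASH_HEADER_OFFSET = 512 * 1024  # from the W command description
HEADER_SIZE = 4 * 1024            # from the W command description
SECTOR_SIZE = 4 * 1024            # assumption: 4KB erase sectors


def boot_image_layout(image_len):
    """Return the flash offsets used for a boot image of `image_len` bytes."""
    sectors = -(-image_len // SECTOR_SIZE)  # round up to whole sectors
    image_start = FLASH_HEADER_OFFSET + HEADER_SIZE
    return {
        "header": FLASH_HEADER_OFFSET,
        "image_start": image_start,
        "sectors": sectors,
        "image_end": image_start + sectors * SECTOR_SIZE,
    }


layout = boot_image_layout(339968)
print({key: hex(value) for key, value in layout.items()})
# {'header': '0x80000', 'image_start': '0x81000', 'sectors': '0x53', 'image_end': '0xd4000'}
```

Under these assumptions the eLua image occupies exactly 83 sectors directly behind the header.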

The second attachment is a compiled binary of my eLua (http://www.eluaproject.net/) implementation for RISC-V (the source is at https://github.com/ThomasHornschuh/elua).
To run it, download it into RAM with the X command and start it with G (both commands work with their default parameters).

 

To permanently add it to flash, do the following:

  • Reset the Papilio Pro with the reset button (or reload the bitstream, in case you don't want to program the bitstream to flash)
  • Download with the X command
  • Write to flash with the W command

From now on you can start eLua after boot with just the "R" command:

>r
Reading Header
...OK
Boot Image found, length 339968 Bytes, Break Address: 00063b00
...OK
Heap: 00062eb0 .. 007ef7ff
eLua for Bonfire SoC 1.0a
__virt_timer_period 1920000

eLua v0.9_bonfire_RV32IM-7-g7996f83  Copyright (C) 2007-2013 www.eluaproject.net
eLua# 

You can enter help to get a command help...

Tip: From the eLua# prompt, run:

 lua /rom/life.lua

for a demo of the Game of Life in Lua. It runs 50 iterations and then prints the runtime:

---------O---------------O------
----------O---------------------
-----OO---------O-O-------------
----OO---------OO-OO------------
---OO--O-OO-OOO---O-O----O------
--OO--OO-O---O------O---O-O-----
---O---OOO---O-----O----O-O-----
----OOO------O-O---------O------
-----OOO----OO-OO---------------
------------O-O--------------OO-
-------OO-------------------O--O
---------OO------------------OO-
--------O--O--------------------
---------O-O-------------O------
----------O-------------O-O-----
------------------------O-O-----
Life - generation 50, mem 22.8 kB
Execution time 16.903 sec (16903.39) ms
eLua# 

 

Enjoy and please give me feedback if you like it. 

Regards

Thomas

monitor.bit

elua_lua_bonfire_papilio_pro.bin


Thanks for sharing. I still have to ship your stuff, will do so during this week. Also quite busy over here, but definitely have no connectivity issues :)

I will try your design later, but I am more interested in the HDL design than in the demo. There's been quite some hype around RISC-V, but to be honest I believe it won't live long: there are some issues with the ISA, and some extensions may not play well with others. Also, MIPS is for sale AFAIK; if they drop the patents it may well be a more serious contender to ARM (I think at least Cavium may have some interest in buying MIPS from Imagination, and Imagination is eager to sell everything due to recent contract changes with Apple).

Alvie
