Thomas Hornschuh

RISC-V on Papilio Pro

10 posts in this topic

Hi all,

Over the last six months I have implemented a processor and surrounding SoC that bring the RISC-V ISA (http://riscv.org) to the Papilio Pro. It implements the 32-bit integer base ISA plus the multiply/divide extension (RV32IM).

The project is hosted on GitHub (https://github.com/bonfireprocessor). It still needs some additional documentation, cleanup and ready-to-run ISE projects to make it easily reproducible for others. I am posting the link now to find out whether anybody is interested in my work. I will soon also post a bitstream here so that anybody with access to a Papilio Pro can play with it.

I have also ported eLua (http://www.eluaproject.net) to it.

@Jack: If you like, I can also present the project on the GadgetFactory blog.

 

Regards

Thomas

 


Hello Thomas,

This looks very cool. I've been doing a contract job that is eating up most of my time, but hopefully I can give this a spin this weekend. Thank you for posting it; Dhia is going to get it up on the blog and social media tomorrow.

Thanks!

Jack.


Do you have clang+llvm working for the platform? Last time I looked it seemed like a work in progress. Or do they use gcc?

What's the current clock rate for the system? Can it go past 50 MHz?

Alvie


Currently I'm using gcc. There is an LLVM port in progress by Alex Bradbury from lowRISC (http://www.lowrisc.org/); I recently spoke with Alex at an event in Munich. They have made great progress: LLVM is able to pass 90% of the gcc torture tests right now. They are also in the process of upstreaming both the gcc and the LLVM ports. The LLVM port will now also support RV32; the preliminary port on the riscv.org website only supports RV64.

My design currently meets timing at 100 MHz. I think I can quite easily reach about 130 MHz; currently the limitations are more in some suboptimal code in the SoC (e.g. the 32KB BRAM for the bootloader is organized as 16*2K*32Bit blocks, which is not the best way to organize it, but it helps to run the same setup in simulation and on hardware quickly...).

The whole system (with UART, SPI interface and DRAM controller) uses 60% of the LX9 slices; the CPU itself uses 743 slices. It can go down to fewer than 500 slices if the M extension (multiply and divide) and some of the privileged-mode features (e.g. the 64-bit cycle counters...) are removed. The RISC-V privileged architecture is not very FPGA friendly, because the CSRs are allocated in a sparse 12-bit address space, which takes a lot of comparators and muxes to implement.

Running completely from block RAM I reach 0.67 DMIPS/MHz; from DRAM it reaches only 0.35 DMIPS/MHz. The main reason is that I don't have a data cache implemented yet, only an instruction cache.
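As a rough worked example of what these scores mean in absolute terms, here is a small calculation. It assumes the 100 MHz clock mentioned above; the thread does not say at which clock the Dhrystone numbers were actually measured, so treat the absolute values as illustrative only.

```python
# Assumption: 100 MHz system clock, taken from the timing figure quoted earlier.
CLOCK_MHZ = 100


def dmips(dmips_per_mhz: float, clock_mhz: float = CLOCK_MHZ) -> float:
    """Absolute Dhrystone MIPS for a given per-MHz score and clock frequency."""
    return dmips_per_mhz * clock_mhz


bram = dmips(0.67)  # program running from block RAM
dram = dmips(0.35)  # program running from external DRAM, no data cache

print(f"Block RAM: {bram:.0f} DMIPS")
print(f"DRAM:      {dram:.0f} DMIPS")
print(f"DRAM reaches {dram / bram:.0%} of the block RAM score")
```

The ratio is independent of the assumed clock, so the DRAM configuration runs at roughly half the block RAM speed whatever the final frequency turns out to be.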

I will upload the bitstream and the binaries for eLua and dhrystone soon so you can easily test it :-)

 

Thomas

17 minutes ago, Thomas Hornschuh said:

My design currently meets timing at 100 MHz. I think I can quite easily reach about 130 MHz

Not if you use the embedded multipliers. Those are slow (I never managed to get a 32x32 multiplier to work above about 105 MHz).

 

18 minutes ago, Thomas Hornschuh said:

The main reason is that I don't have a data cache implemented yet, only an instruction cache.

I have a data cache I wrote for ZPUino (not published; it is two-way set associative). Let me know if you want to take a look.

Regarding bitfiles: how do you program the design afterwards, or do you have to embed the code inside the bitfile?

We can try porting the ZPUino bootloader to your new platform; it should be pretty much trivial.

Alvie


Another question (note that I have not had much time to look at your implementation): why are you snooping the data bus cyc in the instruction cache? I assume it's your change, since it has a TH comment on it.

Alvie


Hi

 

4 hours ago, alvieboy said:

 

4 hours ago, Thomas Hornschuh said:

My design currently meets timing at 100 MHz. I think I can quite easily reach about 130 MHz

Not if you use the embedded multipliers. Those are slow (I never managed to get a 32x32 multiplier to work above about 105 MHz).

 

I hope that with the 4-stage multiplier

https://github.com/bonfireprocessor/bonfire-cpu/blob/riscv/rtl/lxp32_mulsp6.vhd

the clock can be higher. Of course the multiply instructions now take 4 clocks instead of 2.

It also consumes fewer LUTs than the original design.

4 hours ago, alvieboy said:

I have a data cache I wrote for ZPUino (not published; it is two-way set associative). Let me know if you want to take a look.

Definitely :-) The data cache is the hard part compared to the instruction cache.

 

4 hours ago, alvieboy said:

Regarding bitfiles: how do you program the design afterwards, or do you have to embed the code inside the bitfile?

The boot monitor is in the bitfile, added with data2mem. Currently I have 32 KB for it; the final version should be smaller.

The second stage is then loaded from a fixed address in flash to DRAM or downloaded with XModem. The second stage should implement a file system (e.g. SPIFFS).

Currently my boot monitor also has a flash write command, so the flash is initialised by first downloading an image with XModem and then writing it to flash. The monitor automatically writes the downloaded number of sectors to flash, together with a small header containing information about the size.
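To illustrate the idea of a size header in front of the image, here is a minimal sketch. The actual header layout used by the boot monitor is not given in this thread, so the magic value, the field order and the 4 KiB sector size below are all assumptions made up for the example.

```python
import struct

SECTOR_SIZE = 4096              # assumed flash sector size
MAGIC = 0x424F4E46              # hypothetical marker value, not from the project
HEADER = struct.Struct("<III")  # magic, payload length in bytes, sector count


def build_flash_image(payload: bytes) -> bytes:
    """Prefix the payload with a size header and pad to whole sectors."""
    total = HEADER.size + len(payload)
    sectors = -(-total // SECTOR_SIZE)  # ceiling division
    header = HEADER.pack(MAGIC, len(payload), sectors)
    # Pad with 0xFF, the value of erased flash cells.
    return (header + payload).ljust(sectors * SECTOR_SIZE, b"\xff")


def read_payload(image: bytes) -> bytes:
    """What a loader could do: validate the header, then return the payload."""
    magic, length, _sectors = HEADER.unpack_from(image)
    if magic != MAGIC:
        raise ValueError("no valid image header found")
    return image[HEADER.size:HEADER.size + length]


# 3000 RISC-V NOP instructions (addi x0, x0, 0) as a dummy payload.
image = build_flash_image(b"\x13\x00\x00\x00" * 3000)
print(len(image) // SECTOR_SIZE, "sectors written")
```

With a header like this the loader knows how many bytes to copy to DRAM without scanning the flash for the end of the image.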

To simplify testing, the boot loader implements a small subset of Linux-type syscalls. It uses the same ABI as the RISC-V Spike simulator (proxy kernel), so I can execute programs compiled for Spike.

 

4 hours ago, alvieboy said:

Another question (note that I have not had much time to look at your implementation): why are you snooping the data bus cyc in the instruction cache? I assume it's your change, since it has a TH comment on it.

Alvie

I think you mean lxp32_icache.vhd?

Basically this is outdated. The original lxp32 design (which I use as the base for Bonfire) has no real cache; it is more of a 256-byte prefetch buffer. When used with large prefetch_size values it has a very negative impact on data access performance with single-port RAMs like external SDRAM: it blocks the bus until the prefetch is finished.

I tried to solve this by monitoring dbus_cyc and aborting the prefetch. It didn't have a noticeable effect. Finally I decided to build a real direct-mapped cache:

https://github.com/bonfireprocessor/bonfire-cpu/blob/riscv/rtl/bonfire_dm_icache.vhd

It still contains the dbus_cyc signal, but it is not used. Actually I like this cache because it is clean, easy to understand and only consumes 20 slices + RAM.

It also has a few drawbacks:

  • When the cache line being accessed changes, there is a one-clock penalty because of the tag RAM access.
  • The tag RAM is only updated when the full cache line has been read, so the cache miss latency is always the time needed to read the full cache line.

The second point is something I would like to change at some point, but it is not a high priority yet. I think adding a data cache and branch prediction will help more...
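For readers who want to see the behaviour in miniature, here is a small software model of a direct-mapped instruction cache with the second drawback built in: the tag is only written after the whole line has been fetched. This is not a translation of bonfire_dm_icache.vhd; the line size, the number of lines and all names are assumptions chosen for the example.

```python
LINE_WORDS = 8   # assumed number of 32-bit words per cache line
NUM_LINES = 64   # assumed number of cache lines


class DirectMappedICache:
    def __init__(self, memory):
        self.memory = memory                # backing store, a list of words
        self.tags = [None] * NUM_LINES      # tag RAM; None means invalid
        self.data = [[0] * LINE_WORDS for _ in range(NUM_LINES)]
        self.words_fetched = 0              # bus traffic caused by misses

    def read(self, word_addr):
        """Return (word, hit) for a word address."""
        offset = word_addr % LINE_WORDS
        index = (word_addr // LINE_WORDS) % NUM_LINES
        tag = word_addr // (LINE_WORDS * NUM_LINES)
        hit = self.tags[index] == tag
        if not hit:
            # A miss always fetches the complete line ...
            base = word_addr - offset
            self.data[index] = self.memory[base:base + LINE_WORDS]
            self.words_fetched += LINE_WORDS
            # ... and only then is the tag updated.
            self.tags[index] = tag
        return self.data[index][offset], hit


mem = list(range(2048))  # dummy memory, each word holds its own address
cache = DirectMappedICache(mem)
print(cache.read(5))                           # first access: miss
print(cache.read(6))                           # same line: hit
print(cache.read(5 + LINE_WORDS * NUM_LINES))  # same index, other tag: miss
print(cache.words_fetched, "words fetched from memory")
```

Two addresses that are LINE_WORDS * NUM_LINES words apart map to the same line and evict each other, which is the price of the simple one-comparator lookup.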

The repo still needs some cleanup: there are unused files, and I changed the name from wildfire to bonfire because I saw more potentially conflicting users of the wildfire name than of bonfire. The old name is still used in places.

Thomas

 

11 hours ago, Thomas Hornschuh said:

The tag RAM is only updated when the full cache line is read, therefore the cache miss latency is always the time for reading the full cache line

Implementing an IWF cache (Important Word First) is quite complex. I did it for xThundercore (which is another CPU I am developing), but it ended up quite big, and to be honest I did not see any spectacular performance improvement.

 

11 hours ago, Thomas Hornschuh said:

branch prediction will help more...

One technique (which is simple, but may require compiler awareness) is to assume all forward branches to be a miss, and all backward branches to be a hit.
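A minimal sketch of this rule, reading "hit" as "predicted taken" and "miss" as "predicted not taken". The branch trace is invented for the example (a loop-closing backward branch plus a forward branch that is never taken); it is not measured data.

```python
def predict_taken(branch_offset: int) -> bool:
    """Static prediction: backward branches (negative offset) are assumed taken."""
    return branch_offset < 0


# Hypothetical trace of (byte offset relative to the branch, actual outcome):
# a backward loop branch taken 9 times and then falling through once,
# followed by a forward error-check branch that is never taken.
trace = [(-16, True)] * 9 + [(-16, False)] + [(24, False)] * 10

correct = sum(predict_taken(offset) == taken for offset, taken in trace)
print(f"{correct}/{len(trace)} branches predicted correctly")
```

The only misprediction in this trace is the final loop exit, which is why the scheme works well when the compiler lays out loops with backward branches and rare paths with forward ones.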

I'll send you my dcache by private message (and the write buffer).

Alvie

On 7.4.2017 at 11:16 AM, alvieboy said:

Implementing an IWF cache (Important Word First) is quite complex. I did it for xThundercore (which is another CPU I am developing), but it ended up quite big, and to be honest I did not see any spectacular performance improvement.

Good to hear that it may not be worth the effort. My idea to ease the implementation was to switch from Wishbone "incrementing burst" to e.g. "Wrap-8" mode and just start with the offset of the access that triggered the miss.

So if, for example, the initial miss is at offset 4, the burst will be 4-5-6-7-0-1-2-3. The line offset counter would wrap around automatically anyway. Nevertheless, the hit determination would need additional logic to track validity for single words in the cache line.
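The wrap-around order and the per-word valid bits it would require can be sketched in a few lines. This only models the address sequence, not the Wishbone signalling, and the function names are made up for the example.

```python
def wrap_burst_order(miss_offset, line_words=8):
    """Word offsets in the order a wrapping burst starting at miss_offset visits them."""
    return [(miss_offset + i) % line_words for i in range(line_words)]


def fill_line(miss_offset, line_words=8):
    """Yield (beat, word_offset, valid_bits) while the line is being filled."""
    valid = [False] * line_words
    for beat, word in enumerate(wrap_burst_order(miss_offset, line_words)):
        valid[word] = True  # this word could already be served to the CPU
        yield beat, word, list(valid)


print(wrap_burst_order(4))  # [4, 5, 6, 7, 0, 1, 2, 3]

for beat, word, valid in fill_line(4):
    print(beat, word, "".join("1" if v else "0" for v in valid))
```

The requested word is valid after the first beat instead of the fifth, but the cache now needs one valid bit per word rather than one tag update per line.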

 

On 7.4.2017 at 11:16 AM, alvieboy said:

One technique (which is simple, but may require compiler awareness) is to assume all forward branches to be a miss, and all backward branches to be a hit.

I'll send you my dcache by private message (and the write buffer).

Indeed, the RISC-V ISA spec describes exactly this approach as the simplest way of branch prediction. The code that RISC-V gcc generates also seems to obey this rule. The RISC-V spec itself tries to be micro-architecture agnostic, but the code generator of a compiler of course cannot be. For example, the code generator assumes that the processor has a barrel shifter and that shifts are cheap: masking the upper bits of a word (e.g. converting int to char) is done with a shift-left/shift-right pair by the number of bits to be masked.
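The shift-pair idiom can be modelled on 32-bit values as follows. This illustrates the technique only; the exact instruction sequences gcc emits are not shown in this thread, and the helper names are made up for the example.

```python
MASK32 = 0xFFFFFFFF


def sll(x, n):
    """Shift left logical on a 32-bit register."""
    return (x << n) & MASK32


def srl(x, n):
    """Shift right logical on a 32-bit register."""
    return (x & MASK32) >> n


def sra(x, n):
    """Shift right arithmetic; the result is returned as a 32-bit pattern."""
    x &= MASK32
    if x & 0x80000000:
        x -= 1 << 32  # reinterpret as a negative number
    return (x >> n) & MASK32


def to_unsigned_char(x):
    return srl(sll(x, 24), 24)  # shift out the upper 24 bits, zero-extend


def to_signed_char(x):
    return sra(sll(x, 24), 24)  # shift out the upper 24 bits, sign-extend


print(hex(to_unsigned_char(0x12345680)))  # 0x80
print(hex(to_signed_char(0x12345680)))    # 0xffffff80
```

On a core with a barrel shifter each conversion costs two cycles; on a core that shifts one bit per cycle the same pair would cost about 48, which is why the assumption matters for small implementations.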

This has already been discussed at some of the RISC-V workshops/presentations. The RISC-V inventors at UCB focus mainly on designing a Linux-capable 64-bit processor comparable with ARM Cortex-A series designs (without the "bloat" of course). In the community there are more designs that focus on microcontroller-class processors. One example is PicoRV32.
 

Thomas