XThunderCore is taking shape


alvieboy

Hey guys,

 

Finally, XThunderCore is taking shape. For those who missed the earlier discussion, XTC is a 32-bit CPU whose design resembles all other known CPUs placed in a cocktail shaker, shaken heavily, and then poured onto the silicon fabric :P kidding. But yes, it takes bits from a lot of designs in order to implement a small, performant CPU with compact code.

 

It's a RISC, but with variable-sized instructions that can be intermixed. This is accomplished with a postfix: the base instruction is always present and is decoded as it would be on a plain RISC. The postfix may add extra registers, immediate values, and condition codes. Most of the ISA is now working, and it runs happily at 100MHz on a Spartan-6; benchmarks are better overall than MicroBlaze MCS (though slower than the full MicroBlaze).
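To make the postfix idea concrete, here's a minimal decode sketch in C. The real XTC encoding isn't documented in this thread, so the 16-bit word size and the "postfix follows" flag bit are purely illustrative assumptions:

#include <stdint.h>

/* Hypothetical sketch of postfix decoding. The real XTC encoding is not
   documented here; assume 16-bit words where the top bit of the base
   instruction flags "a postfix word follows". */
typedef struct {
    uint16_t base;        /* always present: opcode + registers, RISC-style */
    uint16_t postfix;     /* optional: extra registers / immediate / condition */
    int      has_postfix;
} xtc_insn_t;

static xtc_insn_t decode(const uint16_t *stream)
{
    xtc_insn_t in = { stream[0], 0, 0 };
    if (in.base & 0x8000) {       /* hypothetical "postfix follows" flag */
        in.postfix = stream[1];
        in.has_postfix = 1;
    }
    return in;
}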

 

It has a 5-stage pipeline with result forwarding from the ALU. Jumps and branches include a delay slot (hello, SPARC!). Almost all instructions can be conditionally executed (hello, x86 CMOV!). Memory loads only stall the pipeline when the destination register is actually needed. It includes support for 4 coprocessors with dedicated instructions (hello, ARM!), which are meant to control the various aspects of its environment.
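As a hedged illustration of why conditional execution pays off (assuming the compiler predicates short if/else bodies, which this thread doesn't spell out), a snippet like this can compile to straight-line code with no branch and no delay slot to fill:

#include <stdint.h>

/* With conditional execution, the compiler can turn this into a compare
   followed by a predicated move: straight-line code, no branch, no delay
   slot to fill. */
uint32_t select_max(uint32_t a, uint32_t b)
{
    uint32_t r = a;
    if (b > a)        /* becomes a conditionally-executed move, not a branch */
        r = b;
    return r;
}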

 

Two designs are implemented so far: a simple BRAM-based one, and an SDRAM-based one with an instruction cache. The data cache is not there yet, but will be soon.

 

And it will include an MMU.

 

I am working on the MMU right now, and have yet to define how the MMU and the caches will interact. The main problem is that I want to build an SMP-capable system, so cache coherency is mandatory. And this is really complex.

 

Since the design is Harvard, I have two TLBs (iTLB and dTLB). The first version will not have a hardware page-table walker; instead, the coprocessor interface is used to refill the TLBs on a TLB miss.
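A software refill through the coprocessor interface could look roughly like the sketch below. The accessor names, coprocessor number and register numbers are all made up for illustration; XTC's real ones aren't specified here:

#include <stdint.h>

/* Hypothetical accessors and register numbers - XTC's real coprocessor
   interface is not documented in this thread. */
extern uint32_t cop_read(int cop, int reg);
extern void     cop_write(int cop, int reg, uint32_t val);
extern uint32_t software_page_table_lookup(uint32_t vaddr); /* OS structure */

#define COP_SYS         0   /* coprocessor 0: the system cop */
#define REG_FAULT_VADDR 1   /* virtual address that missed */
#define REG_TLB_ENTRY   2   /* write here to insert a translation */

/* Called from the TLB-miss fault handler: look the page up in an
   OS-maintained table and push the translation into the TLB. */
void tlb_miss_handler(void)
{
    uint32_t vaddr = cop_read(COP_SYS, REG_FAULT_VADDR);
    uint32_t pte   = software_page_table_lookup(vaddr);
    cop_write(COP_SYS, REG_TLB_ENTRY, pte);
    /* returning from the exception restarts the faulting access */
}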

 

My idea is to make both the I-cache and D-cache PIPT (Physically Indexed, Physically Tagged). This allows a simple cache-snooping mechanism to support SMP. However, it requires the TLB/MMU to sit in front of the caches, which can increase the latency of memory and instruction accesses. To let the MMU/TLB lookup happen in parallel with the cache access, I'd have to use another cache organization (PIVT/VIPT/VIVT), but that would make SMP support much harder, if not impossible.
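For concreteness, here is the usual arithmetic behind that trade-off, with assumed (not XTC's actual) sizes: with 4KB pages the low 12 address bits are untranslated, so as long as the cache's index and line-offset bits fit inside them, a virtual index selects the same set a physical index would, and only the tag compare needs the TLB output:

#include <stdint.h>

#define PAGE_BITS 12   /* 4KB pages: bits [11:0] are untranslated */
#define LINE_BITS  4   /* 16-byte cache lines (assumed) */
#define SET_BITS   8   /* 256 sets (assumed): a 4KB direct-mapped cache */

/* SET_BITS + LINE_BITS <= PAGE_BITS, so the virtual index below equals
   the physical index: VIPT behaves like PIPT (no aliasing) while the
   TLB translates the upper bits in parallel with the set lookup. */
static inline uint32_t cache_index(uint32_t vaddr)
{
    return (vaddr >> LINE_BITS) & ((1u << SET_BITS) - 1);
}

/* The tag compare still uses the physical address from the TLB. */
static inline uint32_t cache_tag(uint32_t paddr)
{
    return paddr >> (SET_BITS + LINE_BITS);
}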

 

What are your suggestions on this? Note also that we need to maintain the clock speed at roughly 100MHz on a Spartan-6 speed-grade -2 device, so I cannot afford large combinational circuits.

 

Still to be defined for the dcache is whether to support Write-Back + Write-Allocate only, Write-Through, or Write-Combine for some areas (e.g., video buffers).

 

Perhaps our friend etchedpixels can help here :)

 

Alvie

 


OK, I think I've settled on a VIPT instruction cache. This allows the MMU/TLB lookup to be done in parallel. Tests show that this works, and the implementation is able to maintain the desired clock speed.

 

Let me add the dcache now, and the dTLB, and see if it all works well.


Hi,

 

I tried to build LLVM (./configure; make), but got this:

make[1]: Entering directory `/home/gk/Downloads/xtc-llvm-master/llvm-3.3.src/tools/llvm-config'
llvm[1]: Constructing LLVMBuild project information.
llvm-build: fatal error: missing LLVMBuild.txt file at: '/home/gk/Downloads/xtc-llvm-master/llvm-3.3.src/projects/LLVMBuild.txt'
llvm[1]: Building llvm-config BuildVariables.inc file.
llvm[1]: Compiling llvm-config.cpp for Release+Asserts build
llvm-config.cpp:45:35: error: LibraryDependencies.inc: No such file or directory
llvm-config.cpp:55: error: ‘AvailableComponent’ was not declared in this scope
llvm-config.cpp:55: error: template argument 1 is invalid

 

It sounds like an interesting project. I wonder how hard it would be to build and run a complete bare-bones hello-world example.

 

edit:

Oops, I thought this was the C compiler (SDCC) but I think it's not.


That looks like an internal "problem" in LLVM itself.

 

Let me try a clean checkout+build.

 

 


 

I'm testing the beast with a couple of examples, like AES FIPS and Dhrystone MIPS.

 

You will also need to compile+install binutils first.

 

I think I can provide some README for that; gimme until Monday.


Side note: don't build LLVM in the source directory. Create a separate build directory, and then try something like this:

../llvm-3.3.src/configure --enable-debug-runtime --enable-debug-symbols --enable-keep-symbols \
  --enable-shared --prefix=/usr/local/xtc/ --target=xtc-elf --enable-optimized \
  --with-binutils-include=/home/alvieboy/xtc/binutils-2.23.1/include/ \
  --with-gcc-toolchain=/usr/local/xtc/bin/ target_alias=xtc-elf CC=gcc CXX=g++ \
  --no-create --no-recursion

thanks... tried that but "make" doesn't work in either directory.

I'm missing the "/usr/local/xtc" toolchain in any case. But, no need to hurry - I can have a look at it later.

 

BTW, do you have any opinion on the OpenRISC processor, for example minsoc? Is it too big for the LX9?


I tried OpenRISC once, and performance (clock speed) was very poor on a Spartan-3, so I gave up on it.

 

The "/usr/local/xtc" toolchain is binutils.

 

Are you using 32- or 64-bit? I can get you a snapshot of everything already built (binutils + llvm + clang + compiler-rt + newlib).


Features planned for XTC are:

 

32x32->64 multiplier (pipelined; the operation stalls the pipeline for now). It uses a special-purpose register for the upper 32 bits of the result, though this might change in the future - see the sketch after this list.

single-clock shifter (left, logical right and arithmetic right).
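In C terms, the multiply semantics look like the sketch below. How the special-purpose "hi" register is read back is not specified in this thread, so the struct return is purely illustrative:

#include <stdint.h>

/* Sketch of the 32x32->64 multiply semantics: the low half goes to the
   destination GPR, the high half to a special-purpose register. */
typedef struct { uint32_t lo; uint32_t hi; } mul64_t;

static mul64_t mul32x32(uint32_t a, uint32_t b)
{
    uint64_t full = (uint64_t)a * (uint64_t)b;
    mul64_t r;
    r.lo = (uint32_t)full;          /* lands in the destination register */
    r.hi = (uint32_t)(full >> 32);  /* lands in the special-purpose register */
    return r;
}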

 

You can, however, use one of the coprocessor interfaces for faster algorithm operation. XTC supports up to 4 coprocessors, and each can implement up to 16 registers directly accessible through the instruction set. The first coprocessor is reserved for the system cop, which handles caches, faults, the MMU and so on; the others are not implemented by default.


That... depends :)

 

The current multiplier forces the pipeline to stall right now, in order to simplify the writeback unit. For comparison, memory reads do not stall unless you attempt to use the destination register before the value is available. The same could be done for the multiplier, but it adds some complexity. The advantage is that you can assume a 1-clock multiply instruction if you don't need its result in the next 4 or 5 cycles.
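As a sketch of what "not needing the result for 4 or 5 cycles" means for code (the cycle count is taken from the paragraph above; the scheduling itself is up to the compiler or programmer):

#include <stdint.h>

/* In the naive loop each multiply result is consumed on the very next
   instruction, so even a pipelined multiplier would stall on the use.
   Unrolling with separate accumulators keeps several independent
   multiplies in flight, so each result is needed only several
   instructions later. */
uint32_t dot_naive(const uint32_t *a, const uint32_t *b, int n)
{
    uint32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += a[i] * b[i];        /* result used immediately */
    return acc;
}

uint32_t dot_unrolled(const uint32_t *a, const uint32_t *b, int n)
{
    uint32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    uint32_t acc = s0 + s1 + s2 + s3;
    for (; i < n; i++)             /* leftover elements */
        acc += a[i] * b[i];
    return acc;
}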

 

This depends heavily on your workload. Some workloads can easily be written to take this into account; others cannot.

 

Anyway, you're doing audio, correct? You can probably live with a single-clock multiplier (18x18->36/sext48) and also take advantage of the accumulator and pre-adder.
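The pre-adder + multiplier + accumulator combination maps naturally onto a symmetric FIR filter, the classic audio case. A sketch, with operand types chosen to fit 18x18 multiplies (the mapping to the DSP slice is assumed, not guaranteed by any tool):

#include <stdint.h>

/* Symmetric FIR: the pre-adder sums the two samples that share a
   coefficient, so one 18x18 multiply covers two taps, and the running
   sum lives in the accumulator: acc += (x[i] + x[N-1-i]) * coef[i].
   int64_t stands in for the hardware accumulator. */
int64_t fir_symmetric(const int16_t *x, const int16_t *coef, int taps)
{
    int64_t acc = 0;
    for (int i = 0; i < taps / 2; i++) {
        int32_t pre = (int32_t)x[i] + x[taps - 1 - i];  /* pre-adder */
        acc += (int64_t)pre * coef[i];                  /* 18x18 multiply */
    }
    if (taps & 1)   /* middle tap of an odd-length filter */
        acc += (int64_t)x[taps / 2] * coef[taps / 2];
    return acc;
}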

 

I could help you more if I knew exactly what you are attempting to implement :)

 

Alvie


Quick update on the caches:

 

The D-cache still has a bug somewhere that has been hard to track down. Hopefully I will nail it soon - it looks like a corner case, so it depends on the exact instruction layout.

 

However, there are already some interesting figures regarding the implementations.

 

Right now we have two base "cores", one based on BRAM and another on SDRAM. Each can be built with or without instruction/data caches (except the SDRAM one, which requires the icache).

 

I am running two tests: one is an AES benchmark, the other is Dhrystone MIPS.

 

Starting with AES: the BRAM and BRAM+icache versions have the same speed once the icache is filled (the 2nd and subsequent rounds are as fast as without the icache). This is good; it proves the icache adds no overhead.

 

For the SDRAM version, the design with only the icache takes 2.5x longer to complete. Remember the SDRAM has a 16-bit interface, so things really slow down. With the dcache, the slowdown is only about 2% (1.02x slower), and although there's still a bug, the flow seems to be correct. This is good news for the dcache.

 

Regarding Dhrystone MIPS, we're getting 34.41 DMIPS @ 100MHz for the BRAM implementation, without significant optimizations; most delays are due to branching. For the SDRAM icache-only version we get about 21 DMIPS, and with the dcache 33.19 DMIPS. Again, the dcache impact is very significant.

 

Let's see if I can find that ugly bug, so we can start running real software.

 

Alvie


well... 18x18 beats 16x16 any time...

But still, the more I think about it, the less it makes sense to use one large CPU on an FPGA. For example, one of the main advantages is the crazy memory bandwidth I can reach by doing things in parallel on many small block RAMs. A CPU-centric design gives up that advantage. So I guess I'll use it mainly for control purposes.


yeah, but then why bother with coprocessors when I can design dedicated hardware units that talk to each other, without the big central square where everybody is waiting for the bus :)

 

Still, a faster processor would be very useful for me. And in the long term, distributing algorithms across RTL and C code makes the design hard to reuse.


Thanks, this is interesting.

Two former colleagues on the publication list... will definitely have a look.

 

Do you have any experience with how this scales to larger programs, once you start worrying about code density?

One of the main selling points of the ZPU is that I can run it off block RAM, which recovers some of the lost speed.


Perhaps we should start a new thread if this discussion continues.

 

I have not tried it with larger programs, but the instruction size scales directly with the number of transport buses and the width of their short immediates, since the slot for a bus has to be wide enough for the short immediate that can be used on it. If you use an immediate unit and long immediates with a few buses, you can shorten the short immediates while keeping the ability to use 32-bit immediate values. Aside from the instruction width, I guess the issue is mostly how many instructions you can fit in the available block RAMs, and whether your program can use the processor efficiently, resulting in fewer instructions overall. Also be careful: the transport-bus connection logic is usually the most resource-consuming part, so you don't want every bus connected to every FU in the processor.
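A back-of-envelope width estimate, with all field sizes assumed purely for illustration:

#include <stdio.h>

/* Back-of-envelope TTA instruction width: one move slot per transport
   bus, each slot carrying a source, a destination and an optional short
   immediate. Every field width below is an assumption for illustration. */
int main(void)
{
    const int buses     = 3;  /* transport buses = move slots per instruction */
    const int src_bits  = 6;  /* source socket id */
    const int dst_bits  = 6;  /* destination socket id */
    const int simm_bits = 8;  /* short immediate carried in the slot */

    int slot_bits = src_bits + dst_bits + simm_bits;
    int insn_bits = buses * slot_bits;

    printf("slot = %d bits, instruction = %d bits\n", slot_bits, insn_bits);
    /* -> slot = 20, instruction = 60: shrinking simm_bits or the bus
       count shrinks every instruction in the program. */
    return 0;
}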


offroad:

 

I don't think I will be making this (the tools for XTC) available to you yet:

alvieboy@paddie:/usr/local/xtc$ du -sh
7.9G    .

LLVM/Clang also install the development libraries (.a), which are painfully large.

 

Let me see if I can remove them and strip all the other components. Note: I build LLVM and all its companions with full debug; I need that at this time.

 

Alvie


You don't need hardware cache coherency for SMP. At minimum you need an inter-processor interrupt scheme and cache invalidate/cache flush instructions. The rest can be done in software. x86 has hardware coherency, but many other platforms do not and implement it in software via the MMU (catching write faults).
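In code, that software discipline looks roughly like the sketch below; the flush/invalidate/IPI primitives are hypothetical names, since XTC's real ones don't exist yet:

#include <stddef.h>

/* Hypothetical primitives - XTC's actual cache-maintenance and IPI
   mechanisms are not defined yet. */
extern void dcache_flush_range(void *addr, size_t len);      /* write back to RAM */
extern void dcache_invalidate_range(void *addr, size_t len); /* discard cached copy */
extern void ipi_send(int cpu);

/* Producer: write the buffer, push it out to memory, poke the other CPU. */
void share_buffer(int target_cpu, void *buf, size_t len)
{
    dcache_flush_range(buf, len);  /* make the data visible in RAM */
    ipi_send(target_cpu);          /* tell the consumer it's ready */
}

/* Consumer, in its IPI handler: drop any stale cached copy first. */
void on_ipi_receive(void *buf, size_t len)
{
    dcache_invalidate_range(buf, len);
    /* subsequent reads of buf fetch fresh data from RAM */
}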

 

There are lots of reasons to do it in hardware (including not having all the programmers hating you!), and in many cases you can do the L1 cache lookup (private local cache), MMU walk and the coherency protocol in parallel.

 

The dual-ported RAM on the Spartans always looked ideal for doing a BFI (Brute Force & Ignorance) shared cache implementation, and Will did use it that way for one of his SocZ80 experiments.

 

There are some other ugly cases to watch - processors modifying each other's instruction stream as you decode, processors updating shared MMU tables, and so on.



 

The MMU walk I am planning to do in software, but cache coherency may be trickier - it just takes too long to flush/invalidate the caches.

 

The dual-port RAMs are even more important for avoiding input multiplexing, and hence getting a faster and smaller design - use one port for the CPU, and the other to perform line fills.

 

Anyway, let's see how it goes performance-wise. I can always implement some cache snooping mechanism.

 

Alvie

