Need a new name for a new CPU


alvieboy


Hey guys,

 

I am developing a new CPU (for fun and beyond), which aims to replace the slow ZPU we have been using so far. The new CPU design is coming along very well, and should match and eventually outperform the Xilinx Microblaze in program size, performance (MHz) and implementation size (well, perhaps this one not, let's see).

 

The CPU is 32-bit, RISC-like, with 31 general purpose registers, a zero register, and a few special registers. It's a hybrid of well-known CPUs, like Microblaze, ARM, SPARC, and others. All instructions are 16-bit, and can be extended for immediate values. It has 2 to 5 asymmetric ALUs, which in certain scenarios allow the CPU to execute two (or more) instructions at the same time. All normal addressing modes are supported. The design uses 3 to 6 pipeline stages, depending on configuration. All branch instructions have delay slots.

 

The objective is to have a fast (somewhere between 100 MHz and 166 MHz), superscalar CPU that fits nicely on a PPro/Papilio One while using the same Wishbone interface as ZPUino does.

 

The current state is: it works in simulation, and an assembler/linker is already working. The C/C++ compiler (LLVM) is still missing.

 

Now... I really need to name it. And this is where I need your advice and help.

 

The best name I have found so far is "XThunderCore", or abbreviated, "XTC".

 

What are your ideas? Can you come up with a better name for it?

 

Best,

Alvie


Hmmmm, this sounds really exciting. :)

 

I like the name ThunderCore, and it is also important to have a nice abbreviation. I'm going to put my thinking cap on for the next couple days, but honestly, I don't think I can top that name. 

 

ThunderCore! I process like the storm! Or, this processor brings the thunder to your code!

 

I like ThunderCore. :)

 

Jack.


Actually the "X" there ended up dual-purpose. First, nothing shows up on google for it (which is a good thing). Second, the sound of "XThunder" seems to emphasize the "Thunder" part, like a very very impressive one. Also, the "XT" part might resemble "eXTreme" (or eXTended), which is also a good thing.

 

You know, right now and all along the implementation, I've been calling it "newcpu". Such a dumb name...

 

And all vendors seem to adopt names that resemble something odd. "SPARC" (Scalable Processor ARChitecture). "SuperH". "Blackfin". "S+core". "Microblaze". "Picoblaze". "XGate". "XStormy". "XTensa". "Dragonfly". We need something as powerful as these.

 

:)


Man, I need to start watching, it sounds exciting!

 

Jack.

 

I think you missed the boat, so to speak ;) It's all over now, maybe next time. I imagined they would have made a big deal about it in the news over there. Front page news material, unless it was only a big deal in San Francisco.


Guys,

 

I am happy to announce that the first implementation of XTC, running a simple assembly program that does something unusual (it prints "Hello World!" through the serial port :P ), actually works on a Papilio Pro!!!!

 

(Attached image: xtc-helloworld.png)

 

The assembly program is very simple. I'm posting it here so you can see some of the XTC assembly instructions:

.text
.globl _start
_start:
        limr    0x80000000, r3    /* Load IO base address into r3 */
        limr    55, r6            /* 104MHz. Baud rate: 115200, 16x oversample,
                                      gives 55 for baud divider */
        copy    r4, r3            /* r4 <- r3 */
        addi    4, r4             /* Add 4 for the UART control register. */
        stw     r4, r6            /* Store baud rate divider in UART control reg */
.endless:
        limr    mystring, r2      /* Load mystring offset into r2 */
        call    putstring, r0     /* Call putstring */
        nop
        call    delay, r0         /* Delay a few clock cycles */
        nop
        bri     .endless          /* Repeat */
        nop
        .global delay
delay:
        limr    0x400000, r2      /* 0x400000 cycles */
.wait:
        or      r2, r2            /* is r2 zero ? */
        brine   .wait             /* No, jump into .wait ... */
        addi    -1, r2            /* .. and decrement r2 (this is delay slot) */
        ret
        nop
.global putstring
.type putstring, @function
putstring:
        limr    2, r5               /* Load 2 into r5 */
.waitready:
        ldw     r4, r1              /* Load the UART control register */
        and     r1, r5              /* Check if bit 1 is set (and with 2) */
        brine   .waitready          /* No, jump into wait ready, UART is still busy */
        nop
        ldb+    r2, r1              /* Load a char from string (at r2) into r1, increment r2 */
        or      r1, r1              /* Is a null char ? */
        brine   .waitready          /* No, not a null char, jump ... */
        stw     r3, r1              /* But store it in UART transmit register (this is delay slot) */
        ret                         /* Return from subroutine and ... */
        limr  0, r1                 /* set r1 to zero (the subroutine return value (this is delay slot) */
.data
        .global mystring
mystring:
        .string "Hello World!\r\n\0"  /* Our string! */
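For anyone wondering where the 55 in the baud-rate comment comes from, here is a small Python sketch of the arithmetic. The formula (clock divided by oversample times baud, minus one) is an assumption on my part; it is simply the convention that reproduces the 55 quoted in the comment, and the function names are made up for illustration.

```python
def baud_divider(clock_hz, baud, oversample=16):
    """Divider value under the assumed 'clock / (oversample * baud) - 1' convention."""
    return clock_hz // (oversample * baud) - 1


def actual_baud(clock_hz, divider, oversample=16):
    """Baud rate actually produced by a given divider under the same assumption."""
    return clock_hz / (oversample * (divider + 1))


divider = baud_divider(104_000_000, 115_200)
error = abs(actual_baud(104_000_000, divider) - 115_200) / 115_200
print(divider, f"{error:.2%}")
```

Under that assumption the divider comes out as 55, and the resulting baud rate is within 1% of 115200, which a UART normally tolerates.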

Still a few things to tune. But seeing it working made me feel very happy :)

 

Alvie


The nop (or other instructions) after branching instructions sit in what are called "delay slots". These instructions are executed even if the branch is taken, and are used to recover throughput that would otherwise be lost to pipeline latency.
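To make the delay-slot behaviour concrete, here is a toy interpreter in Python. It is only a sketch: the three-instruction mini-ISA ("brnz", "addi", "halt") is hypothetical and is not the XTC instruction set, it just mimics the shape of the delay loop in the Hello World program.

```python
def run(program, regs, max_steps=1000):
    """Toy interpreter with one branch delay slot (hypothetical mini-ISA)."""
    pc = 0
    pending = None   # target of a taken branch whose delay slot has not run yet
    trace = []       # program counters in execution order
    for _ in range(max_steps):
        op, *args = program[pc]
        trace.append(pc)
        # If the previous instruction was a taken branch, this one is its
        # delay slot: it still executes, and only then do we jump.
        target, pending = pending, None
        if op == "halt":
            break
        if op == "addi":
            imm, reg = args
            regs[reg] += imm
        elif op == "brnz":            # branch if the register is not zero
            reg, dest = args
            if regs[reg] != 0:
                pending = dest
        pc = target if target is not None else pc + 1
    return regs, trace


program = [
    ("brnz", "r2", 0),    # 0: loop while r2 != 0
    ("addi", -1, "r2"),   # 1: delay slot, runs whether or not the branch is taken
    ("halt",),            # 2
]
regs, trace = run(program, {"r2": 3})
print(regs, trace)
```

With r2 starting at 3, the delay-slot "addi" runs four times in this model (three taken branches plus the final fall-through), so r2 ends at -1.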

 

http://en.wikipedia.org/wiki/Delay_slot

 

This has almost no impact on interrupts. If the delay slot is being executed when the interrupt occurs, the interrupt is delayed until the next cycle.

 

The only thing that impacts interrupts a bit more are the load/stores, multiplications and immediate loading.

 

Since the architecture specifies 16-bit instructions, loading a 32-bit value into a register, for example, might take more than one instruction. This is accomplished by using an internal register called "immreg", which can be filled in chunks. So, taking a look at the very first instruction of the program:

 

   limr 0x80000000, r3

 

This will be expanded into 3 assembly instructions:

 

   0:   8800            imm     0x800  // Load 0x800 into the lower 12 bits of immreg
   2:   8000            imm     0x000  // Shift immreg left by 12, set lower 12 bits to 0
   4:   e00f            limr    0x00, r15 // Shift immreg left by 8, set lower 8 bits to zero. Load immreg into r15. Immreg now has 12+12+8 == 32 bits.
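The way the chunks accumulate can be modelled in a few lines of Python. This is a sketch under assumptions: the helper names are invented, and I am assuming that the first chunk is sign-extended and that the result is truncated to 32 bits, which is consistent with the listing above but not confirmed anywhere in this thread.

```python
MASK32 = 0xFFFFFFFF


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return ((value & ((1 << bits) - 1)) ^ sign) - sign


def expand(imm_chunks, final8):
    """Value produced by zero or more 12-bit imm prefixes plus an 8-bit immediate."""
    if not imm_chunks:
        return sign_extend(final8, 8) & MASK32
    immreg = sign_extend(imm_chunks[0], 12)          # assumed: first chunk is sign-extended
    for chunk in imm_chunks[1:]:
        immreg = (immreg << 12) | (chunk & 0xFFF)    # shift left by 12, fill low 12 bits
    return ((immreg << 8) | (final8 & 0xFF)) & MASK32  # shift left by 8, fill low 8 bits


print(hex(expand([0x800, 0x000], 0x00)))
```

Feeding it the two imm chunks and the 8-bit immediate from the listing gives back 0x80000000, the value in the original limr.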
 

However, not all values need those three instructions. The extra IMM instructions only need to be emitted if the value does not fit into the 8-bit immediate field we have in the instructions. For unknown values (like symbol addresses) we do emit the two extra IMM instructions, but they are "relaxed" afterwards by the linker. Relaxation is done once all the symbols are resolved, and the extra instructions are removed if not needed. One example:

 

   imm 0xFFF

   imm 0xFFF

   limr  0xFE, r1

 

The value to be set is -2 (0xFFFFFFFE). Since immediate loads are sign-extended, only the last instruction is actually needed.

 

As a rule of thumb, the number of extra instructions for an immediate load is:

 

zero: if the immediate is between -128 and 127 (8-bit signed),

one: if the immediate is between -524288 and 524287 (20-bit signed),

two: all other cases.
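The rule of thumb can be written down as a small encoder. This Python sketch is my own illustration, not the actual assembler or linker code; it assumes 12-bit imm chunks and an 8-bit final immediate, as in the 12+12+8 expansion shown earlier, which means at most two imm prefixes are ever needed for a 32-bit value.

```python
def to_signed32(value):
    """Map any integer onto the signed 32-bit range."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def fits_signed(value, bits):
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


def encode_immediate(value):
    """Return (list of 12-bit imm chunks, 8-bit immediate) for a 32-bit value."""
    v = to_signed32(value)
    low8 = v & 0xFF
    if fits_signed(v, 8):
        return [], low8                                   # no prefix needed
    if fits_signed(v, 20):
        return [(v >> 8) & 0xFFF], low8                   # one imm prefix
    return [(v >> 20) & 0xFFF, (v >> 8) & 0xFFF], low8    # two imm prefixes


for value in (-2, 55, 200, 0x400000, 0x80000000):
    chunks, low8 = encode_immediate(value)
    print(hex(value & 0xFFFFFFFF), [hex(c) for c in chunks], hex(low8))
```

This is essentially what relaxation decides once a symbol's value is known: -2 needs no prefix at all, while 0x80000000 still needs both.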

 

This might affect interrupts: since there is no way to read the immreg, we need to disable interrupts before processing the first "imm" instruction and keep them disabled until we reach the actual instruction (in this case, limr (Load IMmediate into Register)).
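A rough Python model of that rule, just to show where an interrupt can and cannot be taken. The function and the instruction list are hypothetical; the only thing taken from the description is that an interrupt is held off while an imm prefix is waiting for the instruction that consumes it.

```python
def interrupt_accept_index(instrs, request_at):
    """Index of the instruction before which a pending interrupt is accepted.

    instrs is a list of mnemonics; request_at is the index at which the
    interrupt request arrives. The request is held off while the previous
    instruction was an 'imm', because immreg then holds a partial value
    that cannot be saved and restored.
    """
    in_imm_sequence = False
    for i, name in enumerate(instrs):
        if i >= request_at and not in_imm_sequence:
            return i
        in_imm_sequence = (name == "imm")
    return len(instrs)


code = ["copy", "imm", "imm", "limr", "addi"]
for request_at in range(len(code)):
    print(request_at, "->", interrupt_accept_index(code, request_at))
```

A request arriving anywhere inside the imm/imm/limr group is only accepted before the following addi, so the worst-case extra latency in this model is the length of one immediate sequence.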

 

This immediate technique is used in ZPU and Microblaze. For ZPU, the imm size is 7 bits; for Microblaze it's 16 (Microblaze has 32-bit instructions).


 

This might affect interrupts: since there is no way to read the immreg, we need to disable interrupts before processing the first "imm" instruction and keep them disabled until we reach the actual instruction (in this case, limr (Load IMmediate into Register)).

 

This immediate technique is used in ZPU and Microblaze. For ZPU, the imm size is 7 bits; for Microblaze it's 16 (Microblaze has 32-bit instructions).

 

Could immediate loads be achieved through PC-relative addressing?

 

E.g., at the end of the function's code, have a table of constants, then access them with something like

 

   regX <= mem[PC+offset]

 

(not knowing the syntax for the assembler you are using).

 

That way constants could be shared, and maybe linking could be achieved just by plugging values into the table of constants?
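To illustrate the idea, here is a small Python sketch of a shared constant table placed after the code and read with PC-relative loads. Everything in it is hypothetical: the "ldpc" mnemonic, the word-addressed memory, and the helper names are made up, and none of it reflects how the XTC toolchain works.

```python
def assemble_with_pool(loads):
    """loads: list of (register, constant). Returns (memory, pool_base).

    Each load occupies one memory slot; the deduplicated constant table
    follows the code, so identical constants are stored only once.
    """
    pool, index = [], {}
    for _, const in loads:
        if const not in index:
            index[const] = len(pool)
            pool.append(const)
    pool_base = len(loads)
    code = []
    for pc, (reg, const) in enumerate(loads):
        offset = pool_base + index[const] - pc   # PC-relative distance to the constant
        code.append(("ldpc", reg, offset))
    return code + pool, pool_base


def execute(memory, pool_base):
    regs = {}
    for pc in range(pool_base):
        _, reg, offset = memory[pc]
        regs[reg] = memory[pc + offset]          # regX <= mem[PC+offset]
    return regs


memory, pool_base = assemble_with_pool(
    [("r3", 0x80000000), ("r2", 0x400000), ("r4", 0x80000000)]
)
regs = execute(memory, pool_base)
print(memory, regs)
```

In this model the linker would only have to patch entries in the table, and the two loads of 0x80000000 share a single entry.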


It's an option, indeed. But that introduces other problems, like memory access latency stalling the pipeline, and it would only be possible with 8-bit PC-relative addressing: 4 bits for the opcode, 4 for the register index, and 8 for the offset. Loading values from memory would have to stall the pipeline completely at this point. One improvement would be to mark only the destination register as "dirty" and allow all other operations that do not use that register to proceed.
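The 4/8/4 field budget can be sketched as a pack/unpack pair in Python. The field order used here (opcode in the top four bits, the 8-bit immediate in the middle, the register in the low four bits) is an assumption read off the "e00f  limr 0x00, r15" line of the disassembly posted earlier, and the function names are made up.

```python
def pack(opcode, imm8, reg):
    """Pack a 16-bit word as opcode[15:12] | imm8[11:4] | reg[3:0] (assumed layout)."""
    if not (0 <= opcode < 16 and 0 <= reg < 16 and -128 <= imm8 <= 255):
        raise ValueError("field out of range")
    return (opcode << 12) | ((imm8 & 0xFF) << 4) | reg


def unpack(word):
    """Split a 16-bit word back into (opcode, signed imm8, reg)."""
    opcode = (word >> 12) & 0xF
    raw = (word >> 4) & 0xFF
    imm8 = raw - 256 if raw & 0x80 else raw   # 8-bit field read as signed: -128..127
    return opcode, imm8, word & 0xF


print(hex(pack(0xE, 0x00, 15)), unpack(0xEFE1))
```

With only 8 bits for the offset, a PC-relative load could reach at most 128 units backwards or 127 forwards from the instruction, which is why the constant table would have to sit very close to the code that uses it.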

 

Note, however, that XTC is meant to be superscalar in some situations. One of those situations might be automatically executing "imm+ otherinst" in a single clock cycle, improving the performance.

 

It's a tradeoff between performance and code size.


Another example: "strcpy":

.global strcpy
.type strcpy, @function
strcpy:
        /* Source string in r2, destination string in r3  */
        copy    r1, r3    /* Save destination pointer in r1 (result from function) */
.nextchar:
        ldb+    r2, r4    /* Load char into r4, increment source pointer */
        or      r4, r4    /* NULL (terminating) char ? */
        brine   .nextchar /* No, fetch next char ... */
        stb+    r3, r4    /* But store it into dest, increment dest pointer */
        ret               /* return from function */
        nop               /* Delay slot */

This is a byte-by-byte strcpy. It uses only 7 instructions, 18 bytes of code. There is nothing useful to put in the delay slot of the "ret" on this one, but we use the delay slot after the "brine" (BRanch Indirect if Not Equal) to store the char into the destination and increment the destination pointer.
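Here is the same loop modelled in Python, to show why putting the store in the delay slot still copies the terminating NUL. It is only a behavioural sketch of the listing above; the function name and the use of bytes objects in place of pointers are my own simplifications.

```python
def strcpy_model(src: bytes) -> bytes:
    """Behavioural model of the assembly loop; src must contain a NUL byte."""
    dest = bytearray()
    i = 0
    while True:
        ch = src[i]          # ldb+ r2, r4: load a char, then increment the source pointer
        i += 1
        taken = ch != 0      # or r4, r4 / brine .nextchar: branch if the char is not NUL
        dest.append(ch)      # stb+ r3, r4: delay slot, runs whether or not the branch is taken
        if not taken:
            break            # fall through to ret
    return bytes(dest)


print(strcpy_model(b"Hi\0junk"))
```

Because the delay slot executes on the final, not-taken pass as well, the NUL is stored before the function returns and nothing after it is touched.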


  • 1 month later...

I've set up a domain and a mailing list for XTC.

 

If you're interested in participating, send an email to:

 

   majordomo <at> xthundercore.com

 

With a line (not in the subject!) containing

 

  subscribe dev

 

Then follow the instructions you'll receive by email in order to subscribe to the mailing list.

 

Best  :)

 

Alvie

