alvieboy

XThunderCore is taking shape

47 posts in this topic

First real HW boot of XTC..... and it almost worked :P

 

Stupid me for two small mistakes (one related to carry add, the other connecting the wrong clock line to the SDRAM).

 

And it boots now, and tests memory OK (whole 8MB of Papilio Pro).

 

[Image: first-xtc-boot.png]

 

I'm happy. This shows both assembly code and "C" code working. I have data cache disabled on this test, let's see if dcache also works.

 

Alvie


:)

D-cache also works as expected. Now, moving on to traps/exceptions. And something came up, perhaps Alan can help here, it's a tricky issue.

 

XTC uses delay slots for all branches. A delay slot is basically an instruction that *follows* the branch instruction, and it's executed even if the branch is taken. It's used to recover the performance lost to branch delays - SPARC is one of the architectures that make use of it, but others do too (MicroBlaze for example, where delay slots can be disabled on each branch instruction if necessary).
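To make the delay-slot behaviour concrete, here is a minimal sketch in C of a toy machine. It is not the XTC instruction set: the `insn_t` and `cpu_t` types, the two opcodes and the `step` function are all made up for illustration, and a branch sitting in a delay slot is simply not modelled.

```c
#include <stdint.h>

/* Hypothetical toy encoding, NOT the real XTC ISA. */
typedef enum { OP_ADDI, OP_BRANCH } opcode_t;

typedef struct {
    opcode_t op;
    int32_t  arg;   /* OP_ADDI: value to add; OP_BRANCH: target index */
} insn_t;

typedef struct {
    uint32_t pc;    /* index of the next instruction to execute */
    int32_t  acc;   /* single accumulator, enough for the demo */
} cpu_t;

/* Execute one instruction. A branch also executes the instruction that
 * follows it (the delay slot) before control moves to the target. */
void step(cpu_t *cpu, const insn_t *prog)
{
    insn_t in = prog[cpu->pc];

    if (in.op == OP_BRANCH) {
        insn_t slot = prog[cpu->pc + 1];   /* delay slot always runs */
        if (slot.op == OP_ADDI)
            cpu->acc += slot.arg;
        cpu->pc = (uint32_t)in.arg;        /* then the branch takes effect */
    } else {
        cpu->acc += in.arg;
        cpu->pc += 1;
    }
}
```

With a program like `ADDI 1; BRANCH 4; ADDI 10; ADDI 100; ADDI 1000`, the `ADDI 10` in the delay slot still executes, while the `ADDI 100` after it is skipped.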

 

The problem is: what happens if the delay slot raises a trap/exception? Presumably the PC at the time of the trap must point not to the delay slot, but to the branch instruction itself. Plus: how should the fetch unit handle a double branch (the real one from the branch instruction and the one from the trap caused by the instruction in the delay slot)? It will probably have to drop the first branch request, so that it can serve the trap handler and then return to the branch instruction. Another problem: the branch instruction *will* modify registers (if it's a function call), and I see no easy way to recover from that register write. And what if both the branch and the delay slot generate traps/exceptions (like branching to an invalid address + the delay slot messing with the processor status register)?

 

Open to ideas from all of you.

 

Alvie


Most of the processors that do this dump enough internal state onto the stack for the trap and then throw the lot at the OS and say "you clean it up". For some processors this was *evil* but usually consisted of a chunk of nasty to understand assembler the manufacturer provided.

 

I would think if your instructions are restartable then you probably only need to know PC of trapping instruction and delay slot flag. At that point you can reconstruct and resume execution (you might need to know if the jump was taken).

 

So I think I'd push

        condition codes

        flags [delayslot, etc]

 

onto the trap stack or even push both a trap pc and a resume pc (the same except for delay slots, when the resume pc is the jump)

 

It becomes something like

                       restartaddr = stack[trap_pc];
                       if (stack[flags] & DELAY_SLOT) {
                          /* trap hit a delay slot: restart at the branch instead */
                          restartaddr -= JUMP_SIZE;
                          stack[trap_pc] = restartaddr;
                       }
                       ret

 

(restores condition code, continues in usermode. The branch will be re-executed and go the same way as before)

 

 

 

 

Probably even hideable in hardware. One thing that's nice about hiding it is that you keep compatibility if your behaviour has to change in future processors (e.g. x86 hides all sorts of parallelism in the real processor when it comes to throwing exceptions).


Nice, looking forward to trying it.

Now I can only speak for myself but I hope you manage to keep it simple and stupid. Complexity is the enemy :)


As simple as possible, but these things are never simple anyway....

 

I believe I am still missing something on the design - I have to think about it for a couple of days before I decide on the approach. The design lacks a proper trap/interrupt register "window", and this makes things very difficult because you cannot use the registers in interrupt/trap mode without disturbing the original values.

 

Two main approaches here:

a) Simplest one: add another SPR (Special Purpose Register) as a scratch register, so you can use it to temporarily store one of the GPRs while you set up a proper stack.

b) Add a full/partial register window (either all 16 registers or a subset of 8). The advantage is better use of the register banks, but it requires a more complex/larger approach for the dirty validation. This would also need a way to switch between both modes, and to fetch registers from the other mode (ARM uses something similar, for both IRQ and FIQ).

 

For a), upon entering the trap/interrupt handler, we could do something like this:

  sr <= r15; // Save R15 (stack pointer) onto the SPR SR (Scratch Register).

  r15 <= interruptstacklocation;

  *r15-- <= r1; // Save R1

  (...) /* Save all other regs up to r14 */

  r1 <= sr; // Reload SR into r1 (old R15, stack)

  *r15-- <= r1; // Save it in stack too.

  // If needed, we can now also load SPR TPC (Trap PC) and store it on the stack as well.

 

 

  .... (normal interrupt code)

 

  sr <= *r15++; // Reload old stack pointer into SR

  // Optionally set up TPC (Trap PC) in order to return to wherever we please with the RETE instruction. We can also mess with SPSR (Saved PSR).

 

  r14<=*r15++; // Restore r14

  (...) /* Restore all other regs down to r1*/

 

  r15 <= sr; // Restore stack pointer.

 

  rete; // Return from exception - restores PSR (Processor Status Register) and jumps to SPR TPC (Trap PC)

 

For b) things would be a bit more complex, and I won't describe the flow here for now.


Ok, things are now working, it seems. Thanks Alan for:

 

 

I would think if your instructions are restartable

 

This idea happens to have solved a complex issue with temporary registers (temporary immediate values, as in ZPU and MicroBlaze), just by pointing the restart address to the first instruction of the sequence. XTC is now able to deal with traps and interrupts in the same clock cycle as they appear. I will push these changes to ZPUino too: it's a cleaner approach, makes the design faster, and the overhead is negative (fewer LUTs, a few more flip-flops). So, as long as the Trap PC points to the first instruction loading the immediate, all goes well.
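The rule "Trap PC points to the first instruction loading the immediate" can be sketched in C roughly as below. This is only a behavioural model under my own assumptions, not the XTC HDL: `restart_tracker_t`, `track` and the `is_imm_prefix` flag are hypothetical names, and the model assumes a chain of immediate-prefix instructions is always followed by the instruction that consumes the immediate.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t restart_pc;   /* where execution would resume after a trap */
    bool     in_prefix;    /* the previous instruction was an immediate prefix */
} restart_tracker_t;

/* Call once per issued instruction, in program order.
 * While we are inside an immediate-prefix chain the restart address stays
 * latched at the first prefix, so a trap taken anywhere in the chain
 * (or on the consuming instruction) re-executes the whole sequence. */
void track(restart_tracker_t *t, uint32_t pc, bool is_imm_prefix)
{
    if (!t->in_prefix)
        t->restart_pc = pc;        /* a new sequence starts here */
    t->in_prefix = is_imm_prefix;
}
```

The point of the design is that no temporary immediate state needs to be saved or restored by the trap handler, since re-running the prefix instructions rebuilds it.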

XThunderC
Exception caught at address 0x8000007E
R1 : 00000068 R2 : 90000000 R3 : 00000000 R4 : 00000000
R5 : 00000000 R6 : 8000002C R7 : 00000000 R8 : 8000084B
R9 : 00000000 R10: 00000000 R11: 00000000 R12: 00000000
R13: 8000005A R14: 00000000 R15: 00010000

It's working as expected.

 

Now, Alan / EtchedPixels, I still have a problem. Whenever the MMU faults due to a missing TLB entry, and since we don't have a hardware walker for TLB misses, what should the state of the MMU be? A fully physical<->physical mapping? Since we need to "fix" the missing TLB line, shall the MMU be disabled while we do it? Most kernels have their own virt<->phys mapping, so having the MMU disabled forces the OS to compute the physical address itself - this may lead to more complex HDL, and to complex trap contexts.

 

Hoping to have made myself clear. This is definitely non-trivial (let's say we are at a requirement capture phase here).


It's nastier than that if you are not very careful. Consider the sequence

 

 

TLB miss

fetch TLB miss handler instruction, oh bugger it's not there -> BOOM

 

and

TLB miss

fetch instruction

save old stack pointer, oh bugger -> BOOM

 

(and that's a general trap handling issue with TLB misses - where do you put the trap vector and restart data that won't itself cause a TLB miss)

 

 

I would vote for running the TLB miss handler physically mapped. In fact if you don't have many TLB entries, or your TLB entries don't have a size field I'd vote for running "supervisor mode" code physically mapped always.

 

If you've only got fixed size say 4K TLBs then you have to take hits executing kernel code, which is stupid, and you have some other horrible cases (the infamous one is drawing a vertical line on a frame buffer)

 

On x86 we try and do things like map the Linux kernel and its view of physical RAM using large pages, because even with a hardware TLB fetcher and a big TLB the TLB misses hurt.

 

Another approach used by some processors is to in effect sacrifice a couple of bits of virtual address space to "direct mapped" and "uncached" and things like that.

 

Then it becomes: address[31] set = physically mapped at "0" & address[30 downto 0], else go through the TLB.

 

If you do that then I think you are probably ok providing the user makes sure their TLB trap vector, code and the like is all in physical space.
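A minimal C sketch of that address split, assuming that a set bit 31 selects the direct-mapped window (the polarity is my reading of the line above). The `translate` function, the `tlb_lookup_fn` hook and the `tlb_always_miss` stub are hypothetical names used only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define DIRECT_MAP_BIT 0x80000000u   /* assumed: address[31] = direct mapped */

/* Hypothetical TLB hook: returns true and fills *phys on a hit. */
typedef bool (*tlb_lookup_fn)(uint32_t vaddr, uint32_t *phys);

/* Stub TLB that never hits, handy for exercising the direct-mapped path. */
bool tlb_always_miss(uint32_t vaddr, uint32_t *phys)
{
    (void)vaddr;
    (void)phys;
    return false;
}

/* Returns true when the address was translated; false stands for
 * "raise a TLB miss trap" in this model. */
bool translate(uint32_t vaddr, tlb_lookup_fn tlb, uint32_t *phys)
{
    if (vaddr & DIRECT_MAP_BIT) {
        *phys = vaddr & ~DIRECT_MAP_BIT;   /* "0" & address[30 downto 0] */
        return true;                       /* never touches the TLB */
    }
    return tlb(vaddr, phys);
}
```

As long as the trap vector, the miss handler and its stack are all reached through the direct-mapped window, the handler itself cannot take a TLB miss.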


I am still defining the TLB architectures in terms of entries and page sizes. The main constraint seems to be area.

 

So, for a full (32-bit address space) TLB with 8 entries, two page sizes (4K and 256K) and a 6-bit Context/ASID, I have 546 FFs and 262 LUTs. Not sure these FFs can be mapped to block RAM, but I will investigate.

With 8 entries and four page sizes (4K, 256K, 1M and 16M), same context/ASID size, it eats 627 FFs and 342 LUTs. The increase might pay off, and I have both versions available.
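For reference, a software model of the lookup such a TLB performs might look like the C sketch below. It is only an illustration under my own assumptions: the field names and the `tlb_lookup` function are hypothetical, the loop stands in for what the hardware does in parallel, and cache attributes are left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 8

typedef struct {
    bool     valid;
    uint8_t  asid;        /* 6-bit context/ASID */
    uint8_t  page_shift;  /* 12 = 4K, 18 = 256K, 20 = 1M, 24 = 16M */
    uint32_t vbase;       /* virtual base, aligned to the page size */
    uint32_t pbase;       /* physical base, aligned to the page size */
} tlb_entry_t;

/* Returns true and fills *paddr on a hit; false models a TLB miss. */
bool tlb_lookup(const tlb_entry_t tlb[TLB_ENTRIES], uint8_t asid,
                uint32_t vaddr, uint32_t *paddr)
{
    for (int i = 0; i < TLB_ENTRIES; i++) {
        const tlb_entry_t *e = &tlb[i];
        /* Low page_shift bits are the offset, the rest is the tag. */
        uint32_t offset_mask = (UINT32_C(1) << e->page_shift) - 1u;

        if (e->valid && e->asid == asid &&
            (vaddr & ~offset_mask) == e->vbase) {
            *paddr = e->pbase | (vaddr & offset_mask);
            return true;
        }
    }
    return false;
}
```

Each extra page size widens the per-entry mask logic, which is in line with the FF/LUT increase between the two variants.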

 

I might reduce the context ID to a single bit (supervisor/user).

 

Each page includes cache information, like uncacheable, WT, WB_WA. Buffered might also be useful here. With the MMU disabled, all IO is done as uncacheable, and memory as WT. Still to be defined: possible bits for the iTLB/iCache.

 

So, you see no problem in using only physical mappings for supervisor mode?

 

And thanks a lot for your insight. My experience with MMU/TLB is not big, and mostly for ARM and SPARC (Leon).


I can only speak for the Linux case, but I think Linux will be quite happy with physical mappings in supervisor mode. Do you need to force a page size or can you match on a base/mask pair as the 68010/68451 pair did? Physical without proper caching would be bad though. I guess with 16MB TLB entries for the kernel it wouldn't be too bad.

 

I guess the other alternative is segment based addressing 8) There are reasons a lot of the earlier microprocessors with memory protection used segments even if it made programming them less fun in some cases (x86 due to the 16bit size). Does make full virtual memory harder but it makes the MMU architecture much simpler because you cache the entry with the segment register. Would limit you to ucLinux but with protection (although in theory with a bit of core kernel hacking you could also get fork() etc working) or perhaps a retrobsd/2BSD.

 

Would going to 8 or 16K pages help? It seems like it would also help performance, especially if your code isn't very compact. There's definitely going to be a trade-off between how much time you spend reloading TLB entries and efficiency of memory use. 16K pages isn't that unreasonable, and x86 is really only 4K nowadays because of compatibility. 16K ought to mean fewer misses and two fewer match bits to worry about in the CAM.
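The "two fewer match bits" figure follows from the tag width being 32 minus log2(page size). A tiny C helper (the name `tag_bits` is made up here, and it assumes a 32-bit virtual address and a power-of-two page size no larger than 2^31) shows the arithmetic:

```c
#include <stdint.h>

/* Number of virtual-address bits the CAM has to compare for a given
 * page size, assuming a 32-bit address space. page_size must be a
 * power of two and at most 2^31. */
unsigned tag_bits(uint32_t page_size)
{
    unsigned shift = 0;

    while ((UINT32_C(1) << shift) < page_size)
        shift++;                 /* shift = log2(page_size) */
    return 32u - shift;
}
```

So 4K pages need 20 match bits per entry, 8K pages 19 and 16K pages 18.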

 

Other trick is to ignore some bits of the virtual address space for now (and support it later as needed). Some 64bit cpus do this today.

 

Not sure I'd bother with a context/ASID. If you only have 8 entries then it'll be cheap to save/reload them on a task switch, and if that lets you have more TLB entries, I imagine that would be a bigger win?


Basics, back to basics....

 

Bootloader, loads app from SPI flash, which in turn uses SD card and bootstraps application from there.

XThunderCore Boot Loader v0.1 (C) 2014 Alvaro Lopes
Testing memory: OK, connecting to SPI flash
SPI Flash Identification: 0x00BF258D
Program size: 0x000052C7, CRC 0x0000
Signature: 0x310AFAD5
Target board: 0x00000000
Loading: Checksum: 0xDCB4446F, mem 0xDCB4446F done
Starting application.
Registered console dev:serial0
STDIO base registered in console dev:serial0
Starting....
SD Card initialised
Application found, size 445480
Application read (445480 bytes), starting....
Registered console dev:serial0
STDIO base registered in console dev:serial0
French version
                         DOOM 2: Hell on Earth v1.10
V_Init: allocate screens.
M_LoadDefaults: Load system defaults.
Z_Init: Init zone memory allocation daemon.
W_Init: Init WADfiles.
 couldn't open
Error: W_InitFiles: no files found

Now, need to port SD library to work with the application/zposix interface. Should be fairly easy.

 

Alvie


Soo many lessons learned :)

 

First, Arduino's SDFat library seems to be broken. So, don't "seek" on your file. That made me switch to another implementation, which is working very well up to now.

 

Back to debugging. Porting DOOM has helped me a lot in finding bugs in the core, in the compiler, in binutils, newlib, and so on. But even with proper traps in place, it's very hard to find exactly what is happening. It's not running, and I suspect some memory corruption (software) somewhere.

M_LoadDefaults: Load system defaults.
Z_Init: Init zone memory allocation daemon.
W_Init: Init WADfiles.
SD: Opening file 'doom1.wad'
Exception caught at address 0x0003B974
PSR : 20000031
SPSR: 80000003
R1 : B9FBA9B2 R2 : 00054D36 R3 : 00000000 R4 : 00054D38
R5 : 005AEF08 R6 : 000723A8 R7 : FD3DDDBD R8 : 00000169
R9 : 00000000 R10: 000723A0 R11: 000000F1 R12: 00000000
R13: 0003B818 R14: 000774FF R15: 001652F4

Trap is "invalid memory access", and comes from newlib's "_free_r" (the reentrant version of the free() implementation). This is after reading the WAD file (which is read OK). It's dereferencing the R7 register, which holds an invalid pointer.

 

This will take days to debug.... :(


EtchedPixels: your expertise in debugging might help me here, so I beg for some advice.

 

I have quite a few options to debug this CPU, I may implement them all or only partially.

 

First one is to use a soft debugger inside the IRQ/Exception handler, with its own stack, and a command line parser to inspect memory. Possibly implementing the GDB protocol.

Second is to add a trace buffer, but due to the complexity of the data I am not sure what to put there except the program counter and possibly the two ALU operands (as I do for the simulation trace file).

Third, JTAG or similar debug unit, OOB. Can be used in conjunction with the trace buffer.

Fourth: slowish UART trace for all instructions (even at high speeds). May not work properly as of now, CPU is not able to be "halted" or "stalled" at will.

 

In your opinion, and since you had to do a lot of hardcore debugging with only basic tools, what would be a fantastic, spectacular way to debug a CPU/OS/application ?

 

Note: MMU is not up, so only basic invalid IO accesses are caught, as well as external NMI (pin-based).

 

Alvie


Ok, I added an instruction tracebuffer (128-depth), and a small memory tracebuffer (16-depth) with filtering.
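In software terms, the instruction tracebuffer behaves like a fixed-depth ring buffer that always keeps the most recent entries. The C sketch below is only a behavioural model under my own assumptions (the names `trace_buf_t`, `trace_push` and `trace_get` are hypothetical, and it records just the PC), not the actual HDL:

```c
#include <stdint.h>

#define TRACE_DEPTH 128u   /* matches the 128-deep instruction tracebuffer */

typedef struct {
    uint32_t pc[TRACE_DEPTH];
    uint32_t head;    /* next slot to overwrite */
    uint32_t count;   /* valid entries, saturates at TRACE_DEPTH */
} trace_buf_t;

/* Record one executed PC, overwriting the oldest entry when full. */
void trace_push(trace_buf_t *t, uint32_t pc)
{
    t->pc[t->head] = pc;
    t->head = (t->head + 1u) % TRACE_DEPTH;
    if (t->count < TRACE_DEPTH)
        t->count++;
}

/* Read back an entry for the dump: age 0 is the most recent PC.
 * The caller must keep age < t->count. */
uint32_t trace_get(const trace_buf_t *t, uint32_t age)
{
    return t->pc[(t->head + TRACE_DEPTH - 1u - age) % TRACE_DEPTH];
}
```

A dump routine would walk `age` from `count - 1` down to 0 to print the instructions in execution order, ending at the one that trapped.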

 

Things are getting interesting. There's still an ugly bug somewhere, and looks software-related (perhaps compiler).

 

The interesting thing is: with this ISR handler and tracebuffers, I can:

a) Inject a "swi" instruction anywhere, which will cause a full register dump, as well as the tracebuffers.

b) Use NMI (which is mapped to a pin) to trigger the dump.

c) Catch invalid accesses which in turn call the dump.

 

The "swi" approach is good, since I have no debugger. It is supposed to be non-intrusive, and so it looks. I also plan to implement an "interactive" mode that will allow me to inspect memory without a debugger per se. The ISR has its own stack, so it does not impact the "running" system.

 

Alvie


Ok, I spent a few days working on QEMU emulation for XThunderCore.

 

Works very well so far. Here's a demo of DOOM running inside QEMU. XTC is implemented with just a handful of peripherals (UART, SPI for flash, SPI for SD card, VGA).

[Image: xtc-qemu-doom-demo.png]

 

I can now play with some optimizations at compiler level.

 

Alvie


Well, not yet, but it's not an XTC issue. I have the wrong WAD file (I may be able to locate the original one).

Demo is from a different game version (109 versus 108)!

But it will actually be too fast: QEMU generates host code (x86_64 in this case) and it will perform quiiite fast.

I'll keep you posted. I was not planning to fix this (I'm using DOOM as a really CPU-stressing application, in order to catch issues), but I can give it a quick attempt. I had this working once (I actually disabled this check), but it eventually crashed in the middle of the demo, I suspect due to some incompatibilities between the two versions.

 

Alvie


Well, this is a nice step, can't wait until you unleash the XThunderCore on the world!

 

Jack.


XThunderCore now seems to boot on Pipistrello too, but only @80MHz.
 

XThunderCore Boot Loader v0.1 (C) 2014 Alvaro Lopes
Testing 0x04000000 bytes of memory: OK
Connecting to SPI flash
SPI Flash Identification: 0x0020BA18
Program size: 0x0CF52B0A, CRC 0x310A
Signature: 0xFADEBA01 - INVALID

Man, the LPDDR is sooo slow. Even when doing burst reads, sometimes I see 45 clock cycles before the first burst read starts outputting data. The best latency I observe is around 12 cycles. This makes it much slower on average than regular DDR (since our speed is lower than the memory clock).

 

And D-cache seems to be working so far. Let's see if DOOM runs OK on this platform, and with D-cache on.

 

Alvie


Ok, back @100MHz now, but I had to register inputs and outputs to MCB - the cmdfull and rdempty lines are painfully slow (more than 5ns from output to first LUT input sometimes).

 

Magnus: did you experience similar issues with other designs? Maybe I am not using a proper MCB wrapper or clock settings.


Looks similar to your settings, at least clock settings are same (200MHz DDR, 100MHz user clock).

 

Can you see the latency for a 1-word read, from when we send the command until we receive data (rd_empty becomes '0')?

 

[Image: mcb-pipistrello.png]

 

And here's the trace for no-registered inputs/outputs (i.e., two extra pipeline stages, mostly to deal with rd_empty and cmd_full):

    Location             Delay type         Delay(ns)  Physical Resource
                                                       Logical Resource(s)
    -------------------------------------------------  -------------------
    MCB_X0Y1.P0RDEMPTY   Tmcbcko_RDEMPTY       2.270   cpu/mcbctrl_inst/ctrl/memc3_wrapper_inst/memc3_mcb_raw_wrapper_inst/samc_0
                                                       cpu/mcbctrl_inst/ctrl/memc3_wrapper_inst/memc3_mcb_raw_wrapper_inst/samc_0
    SLICE_X25Y59.B6      net (fanout=12)       1.644   cpu/mcbctrl_inst/rd_empty
    SLICE_X25Y59.B       Tilo                  0.259   cpu/data_mux_io/qtag<4>
                                                       cpu/boot_sdram_mux/Mmux_m_wbo_ack11
    SLICE_X25Y60.A6      net (fanout=9)        0.323   cpu/sdramorbootwbi_ack

It has improved since the last run (placement?). But you can still see that at least 4.17ns (2.270 + 1.644 + 0.259) are needed to reach the first LUT (including Tilo, but not the net from the LUT output).

 

[Edit: added timings]

