alvieboy Posted August 22, 2014 Author

First real HW boot of XTC..... and it almost worked.

Stupid me for two small mistakes (one related to carry add, and one connecting the wrong clock line to the SDRAM). It boots now, and tests memory OK (the whole 8MB of the Papilio Pro). I'm happy.

This shows both assembly code and "C" code working. I have the data cache disabled in this test; let's see if the dcache also works.

Alvie
hamster Posted August 23, 2014

Awesome!
alvieboy Posted August 23, 2014 Author

D-cache also works as expected. Now, moving on to traps/exceptions.

And something came up; perhaps Alan can help here, it's a tricky issue.

XTC uses delay slots for all branches. A delay slot is basically an instruction that *follows* the branch instruction, and it's executed even if the branch is taken. It's used to improve performance due to branch delays - SPARC is one of the architectures that makes use of it, but others do too (MicroBlaze for example, but there they can be disabled on each branch instruction if necessary).

The problem is: what happens if the delay slot raises a trap/exception? In the end, the PC at the time of the trap must point not to the delay slot, but to the branch instruction itself. Plus: how should the fetch unit handle a dual branch (the real one from the branch instruction and the one from the trap caused by the insn in the delay slot)? It will probably have to drop the first branch request, so that it can serve the trap handler, and then return to the branch instruction.

Another problem: the branch instruction *will* modify registers (if it's a function call), and I see no easy way to recover from that register write.

And what if both the branch and the delay slot generate traps/exceptions (like branching to an invalid address + the delay slot messing with the processor status register)?

Open to ideas from all of you.

Alvie
EtchedPixels Posted August 23, 2014

Most of the processors that do this dump enough internal state onto the stack for the trap and then throw the lot at the OS and say "you clean it up". For some processors this was *evil*, but it usually consisted of a chunk of nasty-to-understand assembler that the manufacturer provided.

I would think if your instructions are restartable then you probably only need to know the PC of the trapping instruction and a delay slot flag. At that point you can reconstruct and resume execution (you might need to know if the jump was taken).

So I think I'd push condition codes and flags [delayslot, etc] onto the trap stack, or even push both a trap pc and a resume pc (the same except for delay slots, when the resume pc is the jump).

It becomes something like

    restartaddr = stack[trap_pc];
    if (stack[flags] & DELAY_SLOT) {
        restartaddr -= JUMP_SIZE;
        stack[trap_pc] = restartaddr;
    }
    ret

(the ret restores the condition codes and continues in user mode. The branch will be re-executed and go the same way as before)

Probably even hideable in hardware. One thing that's nice about hiding it is that you keep compatibility if your behaviour has to change in future processors (e.g. x86 hides all sorts of parallelism in the real processor when it comes to throwing exceptions).
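A minimal C sketch of the fix-up described above, written as a software model rather than as XTC's actual trap handler. The `trap_frame` layout, the `TRAP_FLAG_DELAY_SLOT` bit and the `jump_size` parameter are all assumptions made for illustration; the thread does not define them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bit: the trapping instruction sat in a delay slot. */
#define TRAP_FLAG_DELAY_SLOT 0x1u

/* Hypothetical layout of what the hardware pushes on a trap. */
struct trap_frame {
    uint32_t trap_pc; /* PC of the instruction that trapped */
    uint32_t flags;   /* status flags captured at trap time */
};

/*
 * Compute the address to resume at. If the trap came from a delay slot,
 * step back by the size of the branch so the branch is re-executed and
 * the delay slot runs again after it. The frame is updated in place so a
 * plain "return from exception" picks up the corrected PC.
 */
static uint32_t trap_fixup_restart(struct trap_frame *frame, uint32_t jump_size)
{
    assert(frame != NULL);

    uint32_t restart = frame->trap_pc;
    if (frame->flags & TRAP_FLAG_DELAY_SLOT) {
        restart -= jump_size;
        frame->trap_pc = restart;
    }
    return restart;
}
```

This only works if re-executing the branch is harmless, which is exactly the "restartable instructions" condition from the post: a call that has already written its link register must write the same value again the second time.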
offroad Posted August 24, 2014

Nice, looking forward to trying it.

Now, I can only speak for myself, but I hope you manage to keep it simple and stupid. Complexity is the enemy.
alvieboy Posted August 24, 2014 Author

As simple as possible, but these things are never simple anyway....

I believe I am still missing something in the design - I have to think about it for a couple of days before I decide on the approach.

The design lacks a proper trap/interrupt register "window", and this makes things very difficult because you cannot use the registers in interrupt/trap mode without disturbing the original values.

Two main approaches here:

a) Simplest one: add another SPR (Special Purpose Register) as a scratch register, so you can use it to temporarily store one of the GPRs while you set up a proper stack.

b) Add a full/partial register window (either all 16 registers or a subset of 8). The advantage is better use of the register banks, but it requires a more complex/larger approach for the dirty validation. This would also need a way to switch between both modes, and to fetch registers from the other mode (ARM uses something similar, for both IRQ and FIQ).

For a), upon entering the trap/interrupt handler, we could do something like this:

    sr <= r15;                      // Save R15 (stack pointer) into the SPR SR (Scratch Register).
    r15 <= interruptstacklocation;
    *r15-- <= r1;                   // Save R1
    (...)                           /* Save all other regs up to r14 */
    r1 <= sr;                       // Reload SR into r1 (old R15, stack)
    *r15-- <= r1;                   // Save it on the stack too.
    // We can now eventually load SPR TPC (Trap PC) and store it on the stack as well.
    .... (normal interrupt code)
    sr <= *r15++;                   // Reload old stack pointer into SR
    // Eventually set up TPC (Trap PC) in order to return to wherever we please with the RETE instruction.
    // We can also mess with SPSR (Saved PSR)
    r14 <= *r15++;                  // Restore r14
    (...)                           /* Restore all other regs down to r1 */
    r15 <= sr;                      // Restore stack pointer.
    rete;                           // Return from exception - restores PSR (Processor Status Register) and jumps to SPR TPC (Trap PC)

For b) things would be a bit more complex, and I won't describe the flow here for now.
alvieboy Posted August 30, 2014 Author

Ok, things are now working, it seems. Thanks Alan for:

    I would think if your instructions are restartable

This idea happens to have solved a complex issue of having temporary registers (temporary immediate values, as in ZPU and MicroBlaze), just by pointing the restart address to the first instruction. XTC is now able to deal with traps and interrupts in the same clock cycle as they appear. I will push these to ZPUino too; it's a cleaner approach, makes the design faster, and the overhead is negative (fewer LUTs, a few more flip-flops).

So, as long as the Trap PC points to the first instruction loading the immediate, all goes well.

    XThunderC
    Exception caught at address 0x8000007E
    R1 : 00000068 R2 : 90000000 R3 : 00000000 R4 : 00000000
    R5 : 00000000 R6 : 8000002C R7 : 00000000 R8 : 8000084B
    R9 : 00000000 R10: 00000000 R11: 00000000 R12: 00000000
    R13: 8000005A R14: 00000000 R15: 00010000

It's working as expected.

Now, Alan / EtchedPixels, I still have a problem. Whenever the MMU faults due to a missing TLB entry, and since we don't have an MMU TLB miss walk in hardware, what should the state of the MMU be? Fully physical<->physical mapping? Since we need to "fix" the missing TLB line, shall the MMU be disabled while we do it? Most kernels do have their own virt<->phys mapping; having the MMU disabled forces the OS to compute the phys address itself - this may lead to more complex HDL, and to complex trap contexts.

Hoping to have made myself clear. This is definitely non-trivial (let's say we are at a requirements capture phase here).
EtchedPixels Posted August 31, 2014

It's nastier than that if you are not very careful. Consider the sequence

    TLB miss
    fetch TLB miss handler instruction, oh bugger it's not there -> BOOM

and

    TLB miss
    fetch instruction
    save old stack pointer, oh bugger -> BOOM

(and that's a general trap handling issue with TLB misses - where do you put the trap vector and restart data so that they won't themselves cause a TLB miss)

I would vote for running the TLB miss handler physically mapped. In fact, if you don't have many TLB entries, or your TLB entries don't have a size field, I'd vote for running "supervisor mode" code physically mapped always. If you've only got fixed size, say 4K, TLBs then you have to take hits executing kernel code, which is stupid, and you have some other horrible cases (the infamous one is drawing a vertical line on a frame buffer).

On x86 we try and do things like map the Linux kernel and its view of physical RAM using large pages, because even with a hardware TLB fetcher and a big TLB the TLB misses hurt.

Another approach used by some processors is to in effect sacrifice a couple of bits of virtual address space to "direct mapped" and "uncached" and things like that. Then it becomes

    address[31] = physical mapped "0"&address[30 downto 0]
    else TLB

If you do that then I think you are probably OK, providing the user makes sure their TLB trap vector, code and the like are all in physical space.
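The "sacrifice a bit of virtual address space" idea can be sketched in C as a small software model. This assumes one reading of the rule above: bit 31 set means the access bypasses the TLB and the physical address is the low 31 bits with a zero on top. The names `DIRECT_MAP_BIT`, `struct xlate` and `classify_address` are hypothetical and are not part of XTC.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed polarity: bit 31 set selects the direct-mapped window. */
#define DIRECT_MAP_BIT 0x80000000u

enum xlate_kind {
    XLATE_DIRECT, /* physical address known immediately, no TLB involved */
    XLATE_TLB     /* address must go through the TLB (and may miss)      */
};

struct xlate {
    enum xlate_kind kind;
    uint32_t paddr; /* only meaningful when kind == XLATE_DIRECT */
};

/* Decide how a virtual address is translated under the one-bit scheme. */
static struct xlate classify_address(uint32_t vaddr)
{
    struct xlate r;

    if (vaddr & DIRECT_MAP_BIT) {
        r.kind = XLATE_DIRECT;
        r.paddr = vaddr & ~DIRECT_MAP_BIT; /* "0" & address[30 downto 0] */
        /* A direct-mapped physical address never has bit 31 set. */
        assert((r.paddr & DIRECT_MAP_BIT) == 0u);
    } else {
        r.kind = XLATE_TLB;
        r.paddr = 0u;
    }
    return r;
}
```

With this split, a trap vector, handler code and handler stack that all live at addresses with bit 31 set can never themselves take a TLB miss, which is what avoids the two BOOM sequences.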
alvieboy Posted September 1, 2014 Author

I am still defining the TLB architecture in terms of entries and page sizes. The main constraint seems to be area.

So, for a full (32-bit address space) TLB with 8 entries, two page sizes (4K and 256K) and a 6-bit Context/ASID, I have 546 FF and 262 LUTs. Not sure these FFs can be mapped to block RAM, but I will investigate.

With 8 entries and four page sizes (4K, 256K, 1M and 16M), and the same context/ASID size, it eats 627 FF and 342 LUTs. The increase might pay off, and I have both versions available. I might reduce the context ID to a single bit (supervisor/user).

Each page includes cache information, like uncacheable, WT, WB_WA. Buffered might also be useful here. With the MMU disabled, all IO is done as uncacheable, and memory as WT.

Still to define: eventual bits for the iTLB/iCache.

So, you see no problem in using only physical mappings for supervisor mode?

And thanks a lot for your insight. My experience with MMU/TLB is not big, and mostly with ARM and SPARC (Leon).
EtchedPixels Posted September 2, 2014

I can only speak for the Linux case, but I think Linux will be quite happy with physical mappings in supervisor mode. Do you need to force a page size, or can you match on a base/mask pair as the 68010/68451 pair did? Physical without proper caching would be bad though. I guess with 16MB TLBs for the kernel it wouldn't be too bad.

I guess the other alternative is segment based addressing 8) There are reasons a lot of the earlier microprocessors with memory protection used segments, even if it made programming them less fun in some cases (x86 due to the 16-bit size). It does make full virtual memory harder, but it makes the MMU architecture much simpler because you cache the entry with the segment register. It would limit you to uClinux but with protection (although in theory with a bit of core kernel hacking you could also get fork() etc working) or perhaps a retrobsd/2BSD.

Would going to 8 or 16K pages help - it seems like it would also help performance, especially if your code isn't very compact. There's definitely going to be a trade-off between how much time you spend reloading TLB entries and efficiency of memory use. 16K pages isn't that unreasonable, and x86 is really only 4K nowadays because of compatibility. 16K ought to mean fewer misses and two fewer match bits to worry about in the CAM.

The other trick is to ignore some bits of the virtual address space for now (and support them later as needed). Some 64-bit CPUs do this today.

Not sure I'd bother with a context/ASID. If you only have 8 entries then it'll be cheap to save/reload them on a task switch, and if that lets you have more TLBs then I imagine that would be a bigger win?
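A small C sketch of matching on a base/mask pair instead of a fixed page size, plus the arithmetic behind "two fewer match bits". It is a software model only: the `tlb_entry` fields and both function names are invented for illustration, and the real CAM would compare all entries in parallel rather than in a loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical TLB entry: the mask selects which address bits are compared. */
struct tlb_entry {
    uint32_t base;  /* virtual base, already masked (base & mask == base) */
    uint32_t mask;  /* 1 bits are compared, 0 bits are the page offset    */
    uint32_t phys;  /* physical base, aligned to the page size            */
    int      valid;
};

/* Returns 1 and fills *paddr on a hit, 0 on a miss. */
static int tlb_translate(const struct tlb_entry *tlb, size_t n,
                         uint32_t vaddr, uint32_t *paddr)
{
    for (size_t i = 0; i < n; i++) {
        if (tlb[i].valid && (vaddr & tlb[i].mask) == tlb[i].base) {
            *paddr = tlb[i].phys | (vaddr & ~tlb[i].mask);
            return 1;
        }
    }
    return 0;
}

/* Number of address bits the CAM must compare for a 32-bit address. */
static unsigned tag_bits(uint32_t page_size)
{
    /* Page size must be a non-zero power of two. */
    assert(page_size != 0u && (page_size & (page_size - 1u)) == 0u);

    unsigned offset_bits = 0;
    while (page_size > 1u) {
        page_size >>= 1;
        offset_bits++;
    }
    return 32u - offset_bits;
}
```

For 4K pages `tag_bits` gives 20 compare bits per entry, and for 16K pages it gives 18, which is where the two saved bits come from.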
alvieboy Posted November 19, 2014 Author

Basics, back to basics....

The bootloader loads an app from SPI flash, which in turn uses the SD card and bootstraps the application from there.

    XThunderCore Boot Loader v0.1 (C) 2014 Alvaro Lopes
    Testing memory: OK, connecting to SPI flash
    SPI Flash Identification: 0x00BF258D
    Program size: 0x000052C7, CRC 0x0000
    Signature: 0x310AFAD5
    Target board: 0x00000000
    Loading: Checksum: 0xDCB4446F, mem 0xDCB4446F done
    Starting application.
    Registered console dev:serial0
    STDIO base registered in console dev:serial0
    Starting....
    SD Card initialised
    Application found, size 445480
    Application read (445480 bytes), starting....
    Registered console dev:serial0
    STDIO base registered in console dev:serial0
    French version DOOM 2: Hell on Earth v1.10
    V_Init: allocate screens.
    M_LoadDefaults: Load system defaults.
    Z_Init: Init zone memory allocation daemon.
    W_Init: Init WADfiles.
     couldn't open
    Error: W_InitFiles: no files found

Now I need to port the SD library to work with the application/zposix interface. Should be fairly easy.

Alvie
alvieboy Posted November 24, 2014 Author

Soo many lessons learned.

First, Arduino's SDFat library seems to be broken. So, don't "seek" on your file. That made me switch to another implementation, which has been working very well up to now.

Back to debugging. Porting DOOM has helped me a lot in finding bugs in the core, in the compiler, in binutils, newlib, and so on. But even with proper traps in place, it's very hard to find exactly what is happening. It's not running, and I suspect some memory corruption (software) somewhere.

    M_LoadDefaults: Load system defaults.
    Z_Init: Init zone memory allocation daemon.
    W_Init: Init WADfiles.
    SD: Opening file 'doom1.wad'
    Exception caught at address 0x0003B974
    PSR : 20000031
    SPSR: 80000003
    R1 : B9FBA9B2 R2 : 00054D36 R3 : 00000000 R4 : 00054D38
    R5 : 005AEF08 R6 : 000723A8 R7 : FD3DDDBD R8 : 00000169
    R9 : 00000000 R10: 000723A0 R11: 000000F1 R12: 00000000
    R13: 0003B818 R14: 000774FF R15: 001652F4

The trap is "invalid memory access", and comes from newlib's "_free_r" (the reentrant version of the free() implementation). This happens after reading the WAD file (which reads OK). It's dereferencing the R7 register, which holds an invalid pointer.

This will take days to debug....
alvieboy Posted November 24, 2014 Author

EtchedPixels: your expertise in debugging might help me here, so I beg for some advice.

I have quite a few options to debug this CPU; I may implement them all or only some of them.

First: use a soft debugger inside the IRQ/Exception handler, with its own stack, and a command line parser to inspect memory. Eventually implementing the GDB protocol.

Second: add a trace buffer, but due to the complexity of the data I am not sure what to put there except the program counter and eventually the two ALU operands (as I do for the simulation trace file).

Third: JTAG or a similar debug unit, OOB. Can be used in conjunction with the trace buffer.

Fourth: slowish UART trace for all instructions (even at high speeds). May not work properly as of now, since the CPU cannot be "halted" or "stalled" at will.

In your opinion, and since you had to do a lot of hardcore debugging with only basic tools, what would be a fantastic, spectacular way to debug a CPU/OS/application?

Note: the MMU is not up, so only basic invalid IO accesses are caught, as well as the external NMI (pin-based).

Alvie
alvieboy Posted November 25, 2014 Author

Ok, I added an instruction tracebuffer (128-depth), and a small memory tracebuffer (16-depth) with filtering. Things are getting interesting. There's still an ugly bug somewhere, and it looks software-related (perhaps the compiler).

The interesting thing is: with this ISR handler and the tracebuffers, I can:

a) Inject a "swi" instruction anywhere, which will cause a full register dump, as well as a dump of the tracebuffers.
b) Use the NMI (which is mapped to a pin) to trigger the dump.
c) Catch invalid accesses, which in turn call the dump.

The "swi" approach is good, since I have no debugger. It is supposed to be non-intrusive, and so it looks. I also plan to implement an "interactive" mode that will allow me to inspect memory without a debugger per se.

The ISR has its own stack, so it does not impact the "running" system.

Alvie
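A 128-deep instruction tracebuffer behaves like a ring buffer that always holds the most recent program counters. The C sketch below is a software model of that behaviour, not the HDL used in XTC; `trace_buf`, `trace_record` and `trace_get` are hypothetical names, and it records only the PC.

```c
#include <assert.h>
#include <stdint.h>

#define TRACE_DEPTH 128u /* depth taken from the post above */

struct trace_buf {
    uint32_t pc[TRACE_DEPTH];
    unsigned head;  /* slot the next record will be written to */
    unsigned count; /* number of valid records, at most TRACE_DEPTH */
};

/* Record one executed PC, overwriting the oldest entry once full. */
static void trace_record(struct trace_buf *t, uint32_t pc)
{
    t->pc[t->head] = pc;
    t->head = (t->head + 1u) % TRACE_DEPTH;
    if (t->count < TRACE_DEPTH)
        t->count++;
}

/* Fetch a record by age: 0 is the most recent PC, count-1 the oldest. */
static uint32_t trace_get(const struct trace_buf *t, unsigned age)
{
    assert(age < t->count);
    return t->pc[(t->head + TRACE_DEPTH - 1u - age) % TRACE_DEPTH];
}
```

A dump routine called from the "swi", NMI or invalid-access handler would then walk `age` from `count - 1` down to 0 to print the instructions in execution order, ending at the one that trapped.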
alvieboy Posted May 4, 2015 Author

Ok, I spent a few days working on emulation of XThunderCore with QEMU. Works very well so far.

Here's a demo of DOOM running inside QEMU. XTC is implemented with just a bunch of peripherals (UART, SPI for flash, SPI for SD card, VGA).

I can now play with some optimizations at the compiler level.

Alvie
Jack Gassett Posted May 4, 2015

Cool, is it working at full frame rate etc?

Jack.
alvieboy Posted May 4, 2015 Author

Well, not yet, but that's not an XTC issue. I have the wrong WAD file (I may be able to locate the original one). The demo is from a different game version (109 versus 108)!

But it will be too fast actually: QEMU generates host code (x86_64 in this case) and it will perform quiiite fast.

I'll keep you posted. I was not going to fix this (I'm using DOOM as a really CPU-stressing application, in order to catch issues), but I can give it a quick attempt. I had this working once (I actually disabled this check), but it eventually crashed in the middle of the demo, I suspect due to some incompatibilities between the two versions.

Alvie
Jack Gassett Posted May 4, 2015

Well, this is a nice step, can't wait until you unleash the XThunderCore on the world!

Jack.
alvieboy Posted July 18, 2015 Author

XThunderCore now seems to boot on the Pipistrello too, but only @80MHz.

    XThunderCore Boot Loader v0.1 (C) 2014 Alvaro Lopes
    Testing 0x04000000 bytes of memory: OK
    Connecting to SPI flash
    SPI Flash Identification: 0x0020BA18
    Program size: 0x0CF52B0A, CRC 0x310A
    Signature: 0xFADEBA01 - INVALID

Man, the LPDDR is sooo slow. Even when doing burst reads, sometimes I see 45 clock cycles before the first burst read starts outputting data. The best latency I observe is around 12 cycles. This makes it much slower on average than regular DDR (since our speed is lower than the memory clock).

And the D-cache seems to be working so far. Let's see if DOOM runs OK on this platform, and with the D-cache on.

Alvie
alvieboy Posted July 19, 2015 Author

Ok, back @100MHz now, but I had to register the inputs and outputs to the MCB - the cmdfull and rdempty lines are painfully slow (more than 5ns from output to first LUT input sometimes).

Magnus: did you experience similar issues with other designs? Maybe I am not using a proper MCB wrapper or clock settings.
mkarlsson Posted July 19, 2015

I can take a look at your MCB wrapper settings if you are willing to share, here or via email.

This post basically explains how I set it up: http://saanlima.com/forum/viewtopic.php?f=12&t=1234

Magnus
alvieboy Posted July 19, 2015 Author

Looks similar to your settings; at least the clock settings are the same (200MHz DDR, 100MHz user clock).

Can you check the latency for a 1-word read, from when we send the command until we receive data (rd_empty becomes '0')?

And here's the trace for the non-registered inputs/outputs (i.e., without the two extra pipeline stages, which are mostly there to deal with rd_empty and cmd_full):

    Location             Delay type         Delay(ns)  Physical Resource
                                                       Logical Resource(s)
    -------------------------------------------------  -------------------
    MCB_X0Y1.P0RDEMPTY   Tmcbcko_RDEMPTY    2.270      cpu/mcbctrl_inst/ctrl/memc3_wrapper_inst/memc3_mcb_raw_wrapper_inst/samc_0
                                                       cpu/mcbctrl_inst/ctrl/memc3_wrapper_inst/memc3_mcb_raw_wrapper_inst/samc_0
    SLICE_X25Y59.B6      net (fanout=12)    1.644      cpu/mcbctrl_inst/rd_empty
    SLICE_X25Y59.B       Tilo               0.259      cpu/data_mux_io/qtag<4>
                                                       cpu/boot_sdram_mux/Mmux_m_wbo_ack11
    SLICE_X25Y60.A6      net (fanout=9)     0.323      cpu/sdramorbootwbi_ack

It has improved since the last run (placement?). But you can still see that at least 4.17ns are needed to get through the first LUT (Tilo included, the net at the LUT output not included).

[Edit: added timings]
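The 4.17ns figure is the sum of the first three rows of the trace (2.270 + 1.644 + 0.259 = 4.173ns). A short C sketch of that arithmetic, in integer picoseconds to avoid rounding; the helper names are made up here, and comparing against the 10ns period of the 100MHz user clock assumes that clock is the one constraining this path.

```c
#include <assert.h>
#include <stdint.h>

/* Sum a list of path delays given in picoseconds. */
static uint32_t path_delay_ps(const uint32_t *delays_ps, unsigned n)
{
    uint32_t total = 0u;
    for (unsigned i = 0; i < n; i++)
        total += delays_ps[i];
    return total;
}

/* Clock period in picoseconds for a frequency given in MHz. */
static uint32_t clock_period_ps(uint32_t freq_mhz)
{
    assert(freq_mhz != 0u);
    return 1000000u / freq_mhz;
}
```

Feeding in 2270, 1644 and 259 ps gives 4173 ps, so on a 10000 ps period roughly 5.8ns remain for everything after the first LUT, including the 0.323ns net that follows it.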
This topic is now archived and is closed to further replies.