alvieboy

Linux and ZPUino


Maybe you can have a look at the Amber ARMv2 core on OpenCores. There's a cache in there, although I haven't really looked into it, so it might just be rubbish for you.


None of the caches I have found so far fit the design, unfortunately. The good thing is that I now have a clear idea of how the cache will work and how to adapt the whole system to use the caches. This is complex, because the CPU has four external interfaces (two to RAM, one to ROM, and one to IO), and all of them have concurrent accesses (R/W). Add the DMA engine to this, and you'll see how much stress is put on the RAM chip.

Plus, add three pipeline stages that can each go "busy" individually, and then add cache misses, IO delays and other things that cause the CPU to wait for data - it's a pain to keep everything synchronized. Add an extra delay for cache writes (a write takes 2 clock cycles, if successful), and note that almost all CPU operations cause writes - yet we still need to get everything done with close to a 1-cycle delay per instruction. Man, believe me, this is complex...

The ROM is also complex, because it is not shadowed into RAM, so we have to do that copy in software (doing it in hardware is also possible, but would kill timing).

I hope to have some more time this week and next to work on this.


Cache design is often as complex as the processor itself, and sometimes even more so.

On most CPUs, a cache miss simply freezes the core until the cache gets the data.

Doing so preserves all internal synchronization in the pipeline until the cache line is filled with the required data, without additional complexity.
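As a rough illustration of the freeze-on-miss approach described above, here is a minimal Python sketch that counts cycles for a core that stalls completely on every miss. The one-cycle hit cost and the fixed `miss_penalty` are assumptions chosen for illustration, not figures from any real core.

```python
def cycles_with_freeze(hits, miss_penalty):
    """Count cycles for a core that freezes entirely on each cache miss.

    hits: iterable of booleans, one per instruction (True = cache hit).
    miss_penalty: assumed number of extra cycles needed to refill a line.
    """
    cycles = 0
    for hit in hits:
        cycles += 1                 # the instruction itself takes one cycle
        if not hit:
            cycles += miss_penalty  # every stage waits while the line refills
    return cycles


# Four instructions, one of which misses, with an assumed 8-cycle refill.
print(cycles_with_freeze([True, True, False, True], miss_penalty=8))
```

Because every stage stops together, no stage ever sees a neighbour in a different state. That is the simplicity being argued for here, and it comes at the price of stalling work that did not depend on the missing data.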

My 2c :)

Thomas


OK, the dcache design is moving along well; I'm working on cache flush right now. So far, "hitting" a cache line gives you a 1-cycle delay read on both cache ports, and a 0-cycle delay for writes (only one port is writable, though).

Tb_: well, simply freezing the pipeline is not a good approach. Sometimes theory and practice do not entirely cooperate :) Freezing the pipeline is only possible if the pipeline is no-delay, which is not the case - every pipeline stage can go busy individually. One example is an instruction cache miss, where the instruction fetch unit is refilling the instruction cache: there's no logic halting this process because we had a write miss two stages later. Plus, some of these processes assume constant delays in some operations, and delaying those becomes problematic in some scenarios.

Alvie


OK, for the first time Linux is actually booting on the Papilio Pro. There's no "init" there yet (init is the first application Linux runs), but getting this far also means that the Linux kernel is working perfectly. Note that this is a multi-stage boot: the first "application" run is "ZbFLT", an Arduino sketch that loads FLAT binaries. The second application (ZLinux loader) is also a sketch, which is converted to FLAT and loaded by the first one. The ZLinux loader then loads the Linux kernel from an SD card. The kernel starts and then fails to boot completely because there are no "userspace" applications on the SD card yet.
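To give an idea of what a FLAT loader has to read first, here is a minimal Python sketch that parses the 64-byte header used by the uClinux bFLT format (magic `bFLT`, followed by big-endian 32-bit fields). Whether ZbFLT expects exactly this standard layout is an assumption, and `parse_bflt_header` is a hypothetical helper, not code from the ZPUino sources.

```python
import struct

# Standard uClinux bFLT header: 4-byte magic plus fifteen big-endian
# 32-bit words (ten named fields and five reserved filler words).
BFLT_HEADER = struct.Struct(">4s15I")
FIELD_NAMES = ("rev", "entry", "data_start", "data_end", "bss_end",
               "stack_size", "reloc_start", "reloc_count", "flags",
               "build_date")


def parse_bflt_header(blob):
    """Return the named header fields of a bFLT image as a dict."""
    if len(blob) < BFLT_HEADER.size:
        raise ValueError("file too short for a bFLT header")
    fields = BFLT_HEADER.unpack_from(blob)
    if fields[0] != b"bFLT":
        raise ValueError("bad magic, not a bFLT file")
    header = dict(zip(FIELD_NAMES, fields[1:11]))
    # Segment sizes a loader needs in order to place .data and clear .bss.
    header["data_size"] = header["data_end"] - header["data_start"]
    header["bss_size"] = header["bss_end"] - header["data_end"]
    return header


# A made-up header, only to exercise the parser.
sample = struct.pack(">4s15I", b"bFLT", 4, 0x40, 0x1000, 0x1400, 0x1800,
                     0x1000, 0x1400, 3, 0, 0, 0, 0, 0, 0, 0)
print(parse_bflt_header(sample))
```

A loader would then copy text and data to RAM, zero the bss, apply the relocations and jump to the entry point. The console output of the actual boot follows.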

ZbFLT loader v1.0, © 2012 Alvaro Lopes

Loading 'loader.bflt'...

Loaded .text=0x007f2c00 .data=0x007fd150 .bss=0x007fdd84, starting...

ZLinux loader v1.0, © 2012 Alvaro Lopes

Loading linux: ............................................... OK.

Starting the kernel, sp 0x007ffff3, pc 0x00001008

Linux version 3.4.0-uc0 (alvieboy@della) (gcc version 3.4.2) #718 PREEMPT Sun Dec 16 16:00:50 WET 2012

bootconsole [earlyconsole0] enabled

ZPU: setting fast paths for interrupt and syscall

CPU: ZPUino Running at 96.000 MHz.

Physical memory:

00000000-00800000

Reserved memory:

00000000-00000fff: Bootloader

00001000-001265ba: Kernel code

001265bb-001c3827: Kernel data

Node 0: start_pfn = 0x0, low = 0x800

Node 0: mem_map starts at 001c5000

Built 1 zonelists in Zone order, mobility grouping off. Total pages: 2032

Kernel command line: root=/dev/mmcblk0p1 rootwait console=ttySZ0 init=/bin/sh

PID hash table entries: 1024 (order: 0, 4096 bytes)

Dentry cache hash table entries: 1024 (order: 0, 4096 bytes)

Inode-cache hash table entries: 1024 (order: 0, 4096 bytes)

Memory: 6264k/6264k available (1109k kernel code, 1928k reserved, 78k data, 64k init)

SLUB: Genslabs=13, HWalign=32, Order=0-3, MinObjects=0, CPUs=1, Nodes=1

NR_IRQS:32

ZPU: setup timer OK

zpu_clockevent: irq 0, 96.000 MHz

Calibrating delay using timer specific routine.. 197.86 BogoMIPS (lpj=989322)

pid_max: default: 4096 minimum: 301

Mount-cache hash table entries: 512

bio: create slab <bio-0> at 0

Switching to clocksource zpuino_counter

msgmni has been set to 16

Block layer SCSI generic (bsg) driver version 0.4 loaded (major 254)

io scheduler noop registered

io scheduler deadline registered (default)

gpiochip_add: registered GPIOs 0 to 127 on device: zpuino_gpio.1

ZPUino GPIO driver registered, 128 pins

Serial: ZPUINO UART driver

zpuino_uart.2: ttySZ0 at MMIO 0x8800000 (irq = 1) is a ZPUino UART

console [ttySZ0] enabled, bootconsole disabled

console [ttySZ0] enabled, bootconsole disabled

ZPUINO: UART at 0x8800000, irq 1

brd: module loaded

loop: module loaded

Registering ZPUino SPI driver

ZPUino: probing for SPI controller

zpuino_spi zpuino_spi.0: at 0x0B000000

ZPUino. SPI controller initialized 004a4000

mousedev: PS/2 mouse device common for all mice

mmc_spi spi0.0: SD/MMC host mmc0, no DMA, cd polling

Waiting for root device /dev/mmcblk0p1...

mmc0: new SDHC card on SPI

mmcblk0: mmc0:0000 SD4GB 3.67 GiB

mmcblk0: p1

VFS: Mounted root (vfat filesystem) readonly on device 179:1.

Freeing init memory: 64K (1000 - 11000)

Failed to execute /bin/sh. Attempting defaults...

Kernel panic - not syncing: No init found. Try passing init= option to kernel. See Linux Documentation/init.txt for guidance.

Call trace:

[<00014b6f>] panic+0x63/0x13c

[<000110c1>] match_dev_by_uuid+0x0/0x2f

[<00001996>] kernel_init+0x93/0x9c

[<00017a97>] do_exit+0x0/0x251
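A few of the numbers in this log can be cross-checked with simple arithmetic. The sketch below assumes a tick rate of HZ=100, which is inferred from the figures rather than stated in the log, and uses the same integer formula the kernel uses to print BogoMIPS from loops-per-jiffy.

```python
def bogomips_str(lpj, hz):
    """Format BogoMIPS from loops-per-jiffy the way the kernel prints it."""
    whole = lpj // (500000 // hz)
    frac = (lpj // (5000 // hz)) % 100
    return "%d.%02d" % (whole, frac)


PAGE_SIZE = 4096          # matches the "order: 0, 4096 bytes" lines
RAM_BYTES = 0x00800000    # physical memory range 00000000-00800000

print(bogomips_str(989322, hz=100))   # HZ=100 is an assumption
print(hex(RAM_BYTES // PAGE_SIZE))    # page frames in RAM, compare "low = 0x800"
print(6264 + 1928, 8 * 1024)          # available + reserved KiB vs. 8 MiB
```

With these assumptions the values agree with the log: 8 MiB of RAM is 0x800 page frames, and the available plus reserved memory adds up to the full 8192k.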


Wow well done Alvie, it looks like you're 99% there. This will be quite epic when you finally get a login prompt :) Is this running on the early P pro with the SRAM or the current release version P pro with SDRAM?


Man! This is awesome!!! It's so, so close; it's going to be an amazing feeling once you see that login prompt after all of that hard work you've put into this!

Alex, this is on the Papilio Pro board with 64Mb SDRAM. The 1Mb SRAM on the Papilio Plus was not enough to do much with linux.

I'm thinking that the MegaWing I made that was meant to be the second version of the Arcade MegaWing is going to be the perfect companion to this. I need to build up a couple more boards, one for Alvie and one for a manufacturing prototype. Then we can get the ball rolling on a nice MegaWing to go along with linux!

[Two image attachments]


Have you thought about an HDMI MegaWing instead of VGA? Most people would have a junk monitor supporting VGA but we can't be far behind having junk monitors with HDMI inputs :)


Quote: "Wow well done Alvie, it looks like you're 99% there. This will be quite epic when you finally get a login prompt :) Is this running on the early P pro with the SRAM or the current release version P pro with SDRAM?"

Hi alex,

It's running on the current version with SDRAM.

I hope to get busybox running today (it was already running in the simulator, but I need to check if the cache flush is working properly for userspace applications).

Alvie


Quote: "Have you thought about an HDMI MegaWing instead of VGA? Most people would have a junk monitor supporting VGA but we can't be far behind having junk monitors with HDMI inputs :)"

Yes, but it will take several months to make a prototype and test it. The existing design is tested and ready for manufacturing.

Jack.


OK, guys, I'm not a big fan of copy/paste, but here's a copy of what I just posted to the ZPU mailing list:

http://mail.zylin.com/pipermail/zylin-zpu_zylin.com/2012-December/001913.html

Hi guys,

Since it's almost Christmas it's perhaps time to get you all updated about ZPUino, what has been done and accomplished so far, what is being done right now, and what the future holds.

The ZPUino project started back in 2010 and published its first alpha release in December of the same year. The objective of the project was to implement an Arduino (Wiring) compatible platform, but running on a ZPU core and with devices similar to those present on Arduino AVR devices. The project developed in several phases, with several hardware versions for each phase. It started with a simple SoC using the traditional ZPU core and some basic devices like UART and SPI. A software bootloader/programmer was also implemented, using the standard serial port and a variant (very variant) of the HDLC protocol for communication with programmer devices - ZPUino was designed to bootstrap its "sketches" from an external SPI flash, and the logic for programming those flash devices was split between the host programmer (which is now known to run on major operating systems, like Microsoft Windows, Linux and MacOS) and the device programmer.

Everything was set up to allow almost seamless migration of Arduino code into ZPUino code.

During this first phase the Arduino IDE/Wiring library was adapted to support ZPUino, and a new compiler mode was implemented, since the IDE did not support multiple platforms back then (as of now, it does, but I still keep the "make" approach I designed at the time).

The second phase focused on hardware design. A new core was implemented (ZPUino Premium), which had a full 3-stage pipeline and was able to execute most basic instructions in one clock.

Some new core devices were also added, like Audio (sigma-delta) and complex PWM-able timers. The main IO interface is wishbone compliant, so any wishbone compliant device should work with the design (I've tested a few, like the OpenCores I2C, and they work like a charm). A few design variants were written, like memory mapped VGA, DMA VGA (such as the ZX Spectrum version), audio synthesis, and many more. But only internal RAM (BRAM) was supported.

There was one particular variant of this design which actually implemented a new instruction (which I called FMUL16) that could perform a 16.16 fixed point multiplication and speed up some operations. This variant was used in the SoundPuddle project.
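For readers unfamiliar with 16.16 fixed point, here is a minimal Python sketch of what such a multiplication computes. The rounding (truncation toward negative infinity) and the signed interpretation are assumptions; the post does not specify the exact FMUL16 semantics, and `fmul16`, `to_fixed` and `to_float` are illustrative names only.

```python
FRAC_BITS = 16  # 16 integer bits, 16 fractional bits


def to_fixed(x):
    """Convert a float to a 16.16 fixed-point integer."""
    return int(round(x * (1 << FRAC_BITS)))


def to_float(f):
    """Convert a 16.16 fixed-point integer back to a float."""
    return f / (1 << FRAC_BITS)


def fmul16(a, b):
    """Multiply two 16.16 values and return a 16.16 result.

    The full product has 32 fractional bits, so it is shifted right by 16
    to get back to 16 fractional bits. Overflow is not modelled here.
    """
    return (a * b) >> FRAC_BITS


print(to_float(fmul16(to_fixed(1.5), to_fixed(2.0))))
print(to_float(fmul16(to_fixed(-0.5), to_fixed(0.5))))
```

Without a dedicated instruction, the wide product and the shift have to be built from several narrower operations, which is why a single-instruction version helps in multiply-heavy code such as an FFT.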

Let me now tell you about the SoundPuddle project.

Back in April this year (2012), I was contacted by John English from Colorado, US, asking if ZPUino could do real time signal analysis for a project he wanted to show at Apogaea 2012.

After some initial analysis I said it was feasible, and so we moved to implement the thing on ZPUino on an S3E500 board (Papilio One) from Gadget Factory. It was indeed feasible, and it was a huge success. It was improved and shown at the Burning Man festival the same year. Feedback was awesome.

For some low-level details on this one:

A 1024-point FFT was implemented in software, with its inputs coming from an external ADC. The FFT code was written entirely in assembly (a whopping 177 bytes!), using the FMUL16 instruction. This was fast enough for what the project needed (actually, it ended up being too fast, and we had to add some delays). The real constraint here was the amount of memory available on the device. The system ran with around 40KB. Tough, but possible.

Intro video for Kickstarter is here: http://kck.st/MAu7oQ

At almost the same time, Jack Gassett (from Gadget Factory) started the Retrocade Synth project:

http://www.kickstarter.com/projects/13588168/retrocade-synth-one-chiptune-board-to-rule-them-al . This now uses the Extreme core, as described below.

Both projects were successfully funded, and are now shipping to their supporters.

Back to the design:

The core, due to its pipelined design, required fast memory, since it needed to simultaneously read the instruction stream, read stack values and write back the stack. And we were very limited on block RAM, so it was time to move to another design.

ZPUino Extreme was then born.

ZPUino Extreme took another approach - it used block RAM for the stack (which was fixed, 4KB or 8KB), and used external memory for the program area and data. In order to do so, we designed memory interfaces (SRAM, SDRAM and DDR-SDRAM), all working in wishbone pipelined mode, and added a simple, direct-mapped instruction cache. This allowed us to run larger codebases, and access more memory than usual. This is still the fastest core if you need large code/data, and can live with the limited, non-switchable stack. For most single-task applications, this is indeed the core you need.
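The idea behind a direct-mapped instruction cache can be sketched in a few lines of Python: each address maps to exactly one cache line, and a stored tag tells whether that line currently holds the requested address. The line size and line count below are hypothetical and are not the parameters of the ZPUino cache.

```python
LINE_BYTES = 32   # hypothetical line size
NUM_LINES = 64    # hypothetical number of lines


def split_address(addr):
    """Split a byte address into (tag, line index, offset within line)."""
    offset = addr % LINE_BYTES
    index = (addr // LINE_BYTES) % NUM_LINES
    tag = addr // (LINE_BYTES * NUM_LINES)
    return tag, index, offset


class DirectMappedICache:
    """Tracks tags only; enough to tell hits from misses."""

    def __init__(self):
        self.tags = [None] * NUM_LINES
        self.hits = 0
        self.misses = 0

    def fetch(self, addr):
        """Return True on a hit; on a miss, refill the line and return False."""
        tag, index, _ = split_address(addr)
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.tags[index] = tag   # the refill replaces whatever was there
        self.misses += 1
        return False


cache = DirectMappedICache()
for addr in (0x1008, 0x100C, 0x1008 + LINE_BYTES * NUM_LINES, 0x1008):
    print(hex(addr), cache.fetch(addr))
```

The last two fetches show the weakness of direct mapping: two addresses that share a line index keep evicting each other even though the rest of the cache is empty.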

But for complex designs this was still not enough. The fixed, limited stack prevented us from running more complex applications. At first a simple write-back-stack, read-new-stack approach was tried, but was somewhat complex, and very slow.

So, ZCoreV3 was born :)

Yes, I decided to change the name of the core. I was running out of acronyms :P - now, seriously, I thought a lot about the naming of ZPUino cores, and the old names wouldn't cope with further development improvements, so I went radical.

First of all, ZCoreV3 is not yet in production, although it's considered (by me) stable. Its stability will be proven over the next months, although I'm feeling confident. A few improvements are also being thought of, so it might take a while before a first stable version is available to you all.

So, what's so different about ZCoreV3 ? Well, something simple, but something very complex: the stack is no longer fixed.

Although this might look like a simple thing, it's indeed the most complex thing I did in hardware!!!

ZCoreV3 shares the same pipeline and instruction cache as ZPUino Extreme, and adds a data cache (direct-mapped, i.e. one-way associative, dual-ported, write-back), which can in "hit" scenarios attain a 1-clock read delay and a 0-clock write delay. Only one of the ports is writable, though. Conflicts (r/w) are handled by the cache itself, so the core does not need to address that. The core is also slightly different, featuring not only TOS (top-of-stack) caching, but also NOS (next-on-stack) caching (but TOS is always written back for stack push operations). Further improvements are to identify "hot" cache lines (those being accessed as stack) and perform write-through for some memory accesses (or eventually convert it to a two-way associative cache).
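What "write-back" means in practice, and why a cache flush is needed before anything else reads RAM directly, can be shown with a small Python model. This is a sketch under assumptions: word addressing, tiny hypothetical sizes, and a write-allocate policy that the post does not confirm. It does not model the dual ports or the timing.

```python
class WriteBackCache:
    """Direct-mapped, write-back cache in front of a list used as RAM."""

    def __init__(self, memory, num_lines=4, line_words=4):
        self.memory = memory
        self.num_lines = num_lines
        self.line_words = line_words
        self.lines = [None] * num_lines   # each: {"tag", "data", "dirty"}
        self.writebacks = 0

    def _locate(self, addr):
        line_no = addr // self.line_words
        return (line_no // self.num_lines, line_no % self.num_lines,
                addr % self.line_words)

    def _base(self, tag, index):
        return (tag * self.num_lines + index) * self.line_words

    def _write_back(self, index):
        line = self.lines[index]
        if line is not None and line["dirty"]:
            base = self._base(line["tag"], index)
            self.memory[base:base + self.line_words] = line["data"]
            line["dirty"] = False
            self.writebacks += 1

    def _line_for(self, addr):
        tag, index, offset = self._locate(addr)
        line = self.lines[index]
        if line is None or line["tag"] != tag:      # miss: evict, then refill
            self._write_back(index)
            base = self._base(tag, index)
            line = {"tag": tag, "dirty": False,
                    "data": self.memory[base:base + self.line_words]}
            self.lines[index] = line
        return line, offset

    def read(self, addr):
        line, offset = self._line_for(addr)
        return line["data"][offset]

    def write(self, addr, value):
        line, offset = self._line_for(addr)
        line["data"][offset] = value
        line["dirty"] = True          # RAM is stale until write-back

    def flush(self):
        for index in range(self.num_lines):
            self._write_back(index)


ram = [0] * 64
cache = WriteBackCache(ram)
cache.write(5, 99)
print(ram[5], cache.read(5))   # RAM still holds the old value
cache.flush()
print(ram[5])                  # now RAM is up to date
```

Until the dirty line is written back, anything that bypasses the cache (a DMA engine, for instance) would see the old contents of RAM, which is the coherency problem a flush has to solve.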

So, since the ZCoreV3 design is able to address a lot of memory, with few restrictions (if any) on its use, we can probably put it to some real work....

... and it now runs Linux (MMU-less version)!

There are still some things needing implementation on Linux side (and uClibc), and a few stability issues, but things now look very promising.

I'm uploading a small video of it running on Gadget Factory Papilio Pro board (S6LX9), with 8MB SDRAM, and a real SD card. You can see it here:

http://youtu.be/WXhLxfztSZo

A few things are still to be addressed. Some stability issues need fixing (all of those are software, possibly related to the kernel stack switch), some functions (memcpy, memset, string functions) need optimizing (i.e., assembler versions; memcpy already has one), the SPI controller is limited to 8-bit, which makes it very slow (as you can see from the video, it takes some time to exec the first application), and there are some more, which I'll address. First, make it run stable, then optimize.

I'm hoping to get this to run on the S3ESK soon, at the same speed (96MHz), so you guys can also help (I know some of you have this board at home).

Plans for the future: oh, well, first, get Linux and other operating systems running stable, get DMA to work properly with the dcache, some new VGA adaptors, what else....

Let's hope 2013 is a good year for ZPU and ZPUino.

A few thank-yous:

- To all ZPU and ZPUino users, we're doing this for you, thank you !

- My family, for their support (although they don't know what I'm doing! :P )

- Jack Gassett, and Gadget Factory, for their support with hardware and ideas! Thanks Jack!

- John English, the SoundPuddle Engineer, for the real-world use of ZPUino and a lot more!

- All those who helped with ZPUino, they are so many I won't risk forgetting anyone, so you're all included!

- All ZPU fans!

As always, any doubts, questions, opinions, so on, are very very welcome!

And have a merry Christmas!

Alvie

PS: I'm not explaining something here - it's a challenge to your intellect and HDL knowledge :P I'll just say "data cache", hopefully someone will question how it is possible. lol!

And merry Christmas to you all :)

Alvie


Alvie,

That was a great read, it's amazing to go over all that you have accomplished with the ZPUino. I'm very grateful, the RetroCade would not have been remotely possible without the ZPUino and the prospect of Linux soon is very exciting! Let's make 2013 the year that we get ZPUino in the hands of many more users. :)

Jack.


Hello all,

I was very impressed seeing the Linux kernel booting on the Papilio Pro in your video. I downloaded the GitHub repo with Linux3.7-zpu, but I still have big problems getting it compiled, and I also couldn't find the source of zbflt-loader and zlinux-loader to get it running. Maybe someone has some useful tips or a short tutorial.

Thanks in advance
