ben

real time 3d display


Hi all,

I started pondering a 3d engine for the Papilio, using VGA output of a TV wing, rendering "on the fly" rather than into a bitmap buffer (Nintendo DS style, for those who know about it), to avoid the need for external memory.

A bit of thinking got me to believe that flat polygons with z-buffering were within reach, but that texturing seemed very ambitious, as it requires a division for each drawn pixel (and a fast memory to store the texture: one access per pixel as well). As a first prototype, I also excluded alpha blending, lighting, backface culling, and z-culling (although the last one is easy). And, of course, forget about custom shaders :)

So far I've implemented and tested the "pre-process" phase: projection of the 3d points on the 2d screen, and precomputing stuff to make the line by line rendering of triangles easier and faster (this is actually the hardest part). I also implemented all the parts of the rendering pipeline, but I have not tested it yet, and have not started the (hard) work of integrating it and synchronizing it with the display.

A few technical details:

- target resolution is 256*240 (double lines in 640*480 VGA) ;

- all computations are done in 18bit fixed point arithmetic, to fit the DSP48A1 blocks. 8 bits for the fractional part seem to give good results ;

- maximum pipelining in the preprocess phase allows a throughput of 1 point every 3 cycles and 1 triangle every 6 cycles (I might slow it down a bit to save space) ;

- if nothing goes wrong, drawing should be 1 pixel every cycle: with a 100MHz drawing clock (which seems reasonable), we'd get a theoretical maximum of 100MHz/15kHz ~ 6600 pixels per line.

Only real tests will confirm it, but the last bit makes me believe this will be able to display several hundreds of triangles in real-time. The hard part will be updating the input data in real-time too.

I've written an HTML5 implementation (attached file, tested on Firefox, should work on other browsers), to check that the 18-bit fixed point arithmetic gives decent results (it does), and to serve as a reference for the HDL implementation.
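As a rough illustration of the number format mentioned above (18-bit signed words with 8 fractional bits), here is a minimal Python sketch. It is not taken from the HTML5 reference or the VHDL; the helper names (`wrap18`, `to_fixed`, `fx_mul`) and the choice to truncate after a multiply are assumptions made for the example.

```python
FRAC_BITS = 8    # fractional bits, as stated in the post
WORD_BITS = 18   # operand width of a DSP48A1 multiplier input


def wrap18(v):
    """Wrap an integer to a signed 18-bit two's-complement value."""
    v &= (1 << WORD_BITS) - 1
    if v & (1 << (WORD_BITS - 1)):
        v -= 1 << WORD_BITS
    return v


def to_fixed(x):
    """Convert a float to 10.8 fixed point (assumed rounding: nearest)."""
    return wrap18(int(round(x * (1 << FRAC_BITS))))


def to_float(f):
    """Convert a 10.8 fixed-point value back to a float."""
    return f / (1 << FRAC_BITS)


def fx_mul(a, b):
    """18x18 multiply, then drop FRAC_BITS to get back to 10.8 format.

    Assumption: the low bits are simply truncated (arithmetic shift).
    """
    return wrap18((a * b) >> FRAC_BITS)


if __name__ == "__main__":
    a, b = to_fixed(1.5), to_fixed(-2.25)
    print(to_float(fx_mul(a, b)))
```

With 8 fractional bits the representable range is roughly -512 to +512 in steps of 1/256, which is why a 256*240 screen fits comfortably while very flat triangles can still suffer from rounding.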

Ben

EDIT: VHDL code at https://github.com/ben0109/Papilio-3d


Ben,

Awesome! So we've been kicking around an idea at my local hackerspace (Solid State Depot) to work on a hand-held Papilio PDA. The idea was to use the GWEN GUI toolkit along with the ladyada tft library to make nice looking and easy to build GUIs for the $10 TFT touchscreen I got running on the Papilio.

I took a first stab at getting the GWEN library to work on the ZPUino last weekend but ran into some missing libraries. Everyone agreed that to really make this library look awesome we needed some kind of GPU/3d engine.

Then boom, here you are posting about that very thing in the forum! If you get this working and we can integrate it into the ZPUino and wrap it with the GWEN GUI toolkit, so that people can write simple sketches to build up nice GUIs, I think we would have something totally amazing! :)

And even if we never get that far, I think this 3d engine you are working on is going to be sweet!

Jack.


I was thinking that a ray casting implementation of a terrain / height map viewer (think Comanche's Voxel Space) would be possible in real time on a Papilio - but only if the display is in portrait mode. Would be a neat hack though.

A Wolf 3D style 2.5D environment is very possible - you only need to hold one row of data to display (so 1920 entries for full hi-def). Wall top Y (10 bits), Wall Bottom Y (10 bits), Wall colour (12 bits) = 32 bits per entry, or 7680 bytes of storage. You might even be able to squeeze in textures.
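The storage figure above can be checked with a small Python sketch. The bit layout (top Y in the high bits, colour in the low bits) and the function names are assumptions for illustration only; the post gives just the field widths.

```python
COLUMNS = 1920                    # one entry per screen column (full hi-def)
BITS_PER_COLUMN = 10 + 10 + 12    # top Y + bottom Y + colour

BUFFER_BYTES = COLUMNS * BITS_PER_COLUMN // 8


def pack_column(top_y, bottom_y, colour):
    """Pack one column record into a 32-bit word (assumed field order)."""
    assert 0 <= top_y < 1024 and 0 <= bottom_y < 1024 and 0 <= colour < 4096
    return (top_y << 22) | (bottom_y << 12) | colour


def unpack_column(word):
    """Return (top_y, bottom_y, colour) from a packed word."""
    return (word >> 22) & 0x3FF, (word >> 12) & 0x3FF, word & 0xFFF


def pixel(word, y, background=0):
    """Colour of the pixel at row y for this column, generated on the fly."""
    top_y, bottom_y, colour = unpack_column(word)
    return colour if top_y <= y <= bottom_y else background


if __name__ == "__main__":
    print(BUFFER_BYTES)
```

Because each pixel only needs a comparison against the two Y bounds of its column, the picture can be generated while the beam scans, without a frame buffer.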

I have also looked at using external memory for holding display lists, but due to memory bandwidth the result isn't that great - even going flat out at 200MB/s you only get enough time to access 8 bytes per pixel when running at 640x480. You would need to hold a sorted list of triangles in external RAM, and would then be limited to how many actively displaying triangles you could hold in the FPGA.


Jack : I have to admit that I cannot see how a 3d engine would help with GWEN (especially one without texturing). But I must be missing your point. Still, the PDA project looks promising.

Hamster :

Voxel space, that would indeed be very cool. I always wondered how that worked, but never took the time to look into it.

I also thought about 2.5d, with texturing: cheating a bit (like doom does) would make it very efficient. But that is a whole different project.

Actually, as I have a 'free' divider (i need it in the process phase and not so much in the rendering phase), texturing is really limited by memory bandwidth, as you point out.


Ben: I must work out how to walk a ray over a grid and find the intersections. It can't be much harder than using a DDA to draw a line one pixel at a time... Each column of pixels on the screen requires walking the height map, so for 640x480 you need to cast 640 rays per frame. If the horizon is 256 units away then you need at most 512 cycles per column per frame, just under 20,000,000 tests per second at 60Hz.

If only monitors scanned columns not rows (e.g. Vsync was in kHz and Hsync was in Hz) then you could do it all on the fly without any memory... you would have 25.6us per column or 2560 cycles with a 100MHz clock...
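For reference, walking a ray over a unit grid can be sketched as below. This is a generic grid-traversal DDA written for illustration, not the poster's code; `walk_grid` and its parameters are hypothetical names, and the direction `(dx, dy)` is assumed to be a unit vector.

```python
import math

COLUMNS = 640
MAX_STEPS_PER_COLUMN = 512   # figure quoted above for a 256-unit horizon
FRAME_RATE = 60
TESTS_PER_SECOND = COLUMNS * MAX_STEPS_PER_COLUMN * FRAME_RATE


def first_crossing(cell, pos, d):
    """Distance along the ray to the first grid line on this axis."""
    if d > 0:
        return (cell + 1 - pos) / d
    if d < 0:
        return (cell - pos) / d
    return math.inf


def walk_grid(x0, y0, dx, dy, max_dist):
    """Yield (cell_x, cell_y, t) for each cell the ray enters, t <= max_dist."""
    cx, cy = math.floor(x0), math.floor(y0)
    step_x = 1 if dx > 0 else -1
    step_y = 1 if dy > 0 else -1
    # Distance along the ray between two successive grid lines per axis.
    t_delta_x = abs(1.0 / dx) if dx != 0 else math.inf
    t_delta_y = abs(1.0 / dy) if dy != 0 else math.inf
    t_max_x = first_crossing(cx, x0, dx)
    t_max_y = first_crossing(cy, y0, dy)
    while True:
        # Step across whichever grid line is closer: one add and one compare.
        if t_max_x < t_max_y:
            t, cx, t_max_x = t_max_x, cx + step_x, t_max_x + t_delta_x
        else:
            t, cy, t_max_y = t_max_y, cy + step_y, t_max_y + t_delta_y
        if t > max_dist:
            return
        yield cx, cy, t


if __name__ == "__main__":
    d = math.sqrt(0.5)
    print(len(list(walk_grid(0.5, 0.5, d, d, 256))), TESTS_PER_SECOND)
```

Each step is one comparison and one addition, which is what makes it comparable to a line-drawing DDA; a height-map viewer would sample the map at every yielded cell and keep track of the highest projected point so far.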


Can't you just rotate the picture 90 deg sort of like all the current FPGA arcade games?

Admittedly you'll have to then rotate your monitor in portrait mode, but it's a start.


Ok, I just completed a first working implementation. There are still some z-buffer glitches, but the display looks like a cube. Colors are really ugly, but that's the VGA wing :)

I'd like to attach a bit file (for papilio plus with a vga wing in AL), but that seems impossible at the moment -- it's on github.

to do :

- fix the z-buffer issues

- use a better clock for vga (16MHz distorts the image a lot)

- double check math rounding, to solve "black edges" issues (javascript version looks better)

- investigate the upper clock bound for the gpu -- it's 16MHz in the attached bit file, but I could build a bit file with a 96MHz clock.

- find something more impressive than a still cube to show off -- ideas are welcome !


Great !

Two remarks : division is not so bad if you use maximum pipelining (1 division per clock).

Texturing of the walls should be fairly easy, and you can use mode 7 style texturing for the floor and ceiling (keep a table of matrix coefficients, multiply it at the beginning of the line by the rotation matrix, and walk along the line, with two adds per pixel: look for gba tonc for a detailed howto)


Ok, I just completed a first working implementation. There are still some z-buffer glitches, but the display looks like a cube. Colors are really ugly, but that's the VGA wing :)

Do you have an Arcade MegaWing? That does more colors.

I'd like to attach a bit file (for papilio plus with a vga wing in AL), but that seems impossible at the moment -- it's on github.

Is there an issue with attaching a bit file to a forum post? I thought I added the MIME type, but maybe that was just for downloads... In the meantime, do you have a link to your github repository?

Jack.


Texturing of the walls should be fairly easy,...

Well I've got the texturing of the walls sorted (updated 'C'code and image on http://hamsterworks.co.nz/mediawiki/index.php/Ray_cast).

It was a lot easier than expected - got rid of another division. In a -250 part I should be able to fit a full 256x256 maze and display at 1280x720 / 60Hz.

Now time to start on the HDL


Cool. Again, 'mode 7-like' texturing of the floor and ceiling should be really easy, and would make a great addition. Once you've got the appropriate (constant) z-value of each line, it's only 1 fetch + 2 additions for every pixel, with a small per-line overhead (a few add/sub and 4 mult).
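The per-line setup and per-pixel walk described above can be sketched in Python as follows. This follows the usual raycaster floor-casting formulation (camera position, view direction and camera plane vector), which is an assumption on my part; the names `floor_scanline`, `row_dist`, `dir_` and `plane` are hypothetical.

```python
import math


def floor_scanline(tex, tex_size, width, row_dist, pos, dir_, plane):
    """Texture one floor scanline.

    row_dist is the (constant) distance of this screen line from the camera.
    pos, dir_ and plane are 2d (x, y) tuples: camera position, view direction
    and half-width vector of the camera plane.
    """
    # Ray directions for the leftmost and rightmost screen columns.
    r0x, r0y = dir_[0] - plane[0], dir_[1] - plane[1]
    r1x, r1y = dir_[0] + plane[0], dir_[1] + plane[1]
    # Per-line setup: four multiplies by row_dist, plus a few add/sub.
    u = pos[0] + row_dist * r0x
    v = pos[1] + row_dist * r0y
    du = row_dist * (r1x - r0x) / width
    dv = row_dist * (r1y - r0y) / width
    out = []
    for _ in range(width):
        # Per pixel: one texture fetch and two additions.
        out.append(tex[math.floor(v) % tex_size][math.floor(u) % tex_size])
        u += du
        v += dv
    return out


if __name__ == "__main__":
    checker = [[0, 1], [1, 0]]
    print(floor_scanline(checker, 2, 8, 4.0, (0.0, 0.0), (0.0, 1.0), (1.0, 0.0)))
```

In hardware the division by `width` would normally be a shift (for a power-of-two width) or folded into precomputed constants, so the inner loop stays at one fetch and two adds.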

In the meantime, I've made only little progress on my 3d gpu. I reimplemented the division (to get rid of the fat xilinx core, that took 20 minutes in p&r phase...) and started working with ram blocks to upload the data through data2mem -- debugging this is really a pain.

I'm aiming at a simplified Stanford bunny (see http://www.cs.mun.ca...frameBunny.html), with 256 vertices and 512 faces. I'm still pondering how best to choose the face colors to make it look good, though.

For Jack: it seems the forum refuses all uploads: 'You can upload up to Uploading is not allowed of files (Max. single file size: 500MB)'

The bit file is on github w/ the source code, at https://github.com/ben0109/Papilio-3d


Ok, I have a 0.99 version that displays a cube in 256x240x60Hz with only small glitches (from the fixed point arithmetic) and a BRAM-stored model, easily changeable through data2mem. The gpu pipeline runs at 32 MHz.

I still have to make the various parameters (number of points, number of triangles, parameters of the projection matrix) configurable as well, and do some pipelining optimization : right now, the core won't run at 64 MHz, and I'm targeting ~100MHz to display complex models @ 512x480x60 Hz.

All code and bit file (for P+, vga on AL) on https://github.com/ben0109/Papilio-3d


It works ! Running at 100.5 MHz (4x the vga pixel clock), the core displays a rotating 512 triangle-rabbit at 256*240*60Hz -- I'd love to attach a picture, but it is rotating too fast, and all i get is a blurred image.

To be totally honest, p&r chokes on some paths, and there are some glitches: the most triangle-intensive lines show some "holes" (you can see through the surface because some triangles were not drawn), and some line edges spill outside the triangles. The latter might be caused by fixed point arithmetic (for very flat triangles) or the failing timing constraints.

Again, I cannot attach the bit file, but it's on github (final.bit)

This post has been promoted to an article


Can you give a brief rundown on the method of operation?

I've had a look at the code, and I see lots of things that are familiar, but am missing the 'aha!' moment when I see how it hangs together...

I see that the points and triangles are stored in different blocks. How are they tied together during the display phase?

Do you transform each point on each row, or transform them during the blanking interval, and then only compute triangles during the active display time?

Do you double buffer the rows in RAM, having an active row that is displayed and cleared while the next one is being drawn? Or do you just buffer one row and draw it really quickly?


You've got it right.

512 points are stored in x,y,z format (3*18 bits), and 512 triangles are stored with the indices of vertices A, B, C, and a 9 bit color D (3*9+9=36 bits).

At the end of the screen, the preprocess phase runs: it transforms the points from 3d world coordinates to screen coordinates, using a 4*4 matrix and a division (the resulting z coordinate is used for z-buffering). Then, it turns every triangle into 0, 1 or 2 screen triangles with a horizontal edge, and these are stored in (direction,ymin,ymax,x,z,dxl,dzl,dxr,dzr,color) format: direction is up or down, x,z are the coordinates of the vertex, and dxl,dzl and dxr,dzr are the slopes of the left and right edges. It stops when it has processed the 512 triangles, or at the first "empty" triangle (one where A=B).

All of that is implemented in the "transform_pipeline" module.
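To make the preprocess phase more concrete, here is a floating-point Python sketch of the two steps: projecting a point with a 4*4 matrix and a division, then splitting a screen triangle into 0, 1 or 2 triangles with a horizontal edge. It is not a transcription of "transform_pipeline": the function names, the meaning I give to 'up'/'down' (apex below or above the horizontal edge), and the choice to store slopes per scanline moving away from the apex are all assumptions.

```python
def project(m, p):
    """Apply a 4x4 matrix (list of rows) to (x, y, z, 1) and divide by w."""
    x, y, z = p
    h = [m[r][0] * x + m[r][1] * y + m[r][2] * z + m[r][3] for r in range(4)]
    return (h[0] / h[3], h[1] / h[3], h[2] / h[3])


def flat_triangle(apex, a, b, direction, color):
    """Build one record for a triangle whose edge a-b is horizontal."""
    if a[0] > b[0]:
        a, b = b, a                      # a is the left end, b the right end
    height = abs(a[1] - apex[1])         # number of scanlines from the apex
    dxl, dzl = (a[0] - apex[0]) / height, (a[2] - apex[2]) / height
    dxr, dzr = (b[0] - apex[0]) / height, (b[2] - apex[2]) / height
    ymin, ymax = min(apex[1], a[1]), max(apex[1], a[1])
    return (direction, ymin, ymax, apex[0], apex[2], dxl, dzl, dxr, dzr, color)


def split_triangle(a, b, c, color):
    """Return 0, 1 or 2 records with a horizontal edge (screen-space input)."""
    top, mid, bot = sorted((a, b, c), key=lambda p: p[1])
    if top[1] == bot[1]:
        return []                        # degenerate: no height at all
    if top[1] == mid[1]:
        return [flat_triangle(bot, top, mid, "up", color)]
    if mid[1] == bot[1]:
        return [flat_triangle(top, mid, bot, "down", color)]
    # General case: cut the long edge top-bot at the height of mid.
    t = (mid[1] - top[1]) / (bot[1] - top[1])
    cut = (top[0] + t * (bot[0] - top[0]), mid[1], top[2] + t * (bot[2] - top[2]))
    return [flat_triangle(top, mid, cut, "down", color),
            flat_triangle(bot, mid, cut, "up", color)]


if __name__ == "__main__":
    for rec in split_triangle((0, 0, 1), (4, 4, 1), (0, 8, 1), 7):
        print(rec)
```

The divisions by `height` are where a hardware divider earns its keep in this phase; once the slopes are stored, the per-line work reduces to multiplies and adds.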

Then, for every line, it performs the draw phase: it clears the z-buffer and the color buffer, and finds the intersecting triangles (checking the ymin, ymax bounds), then multiplies dxl,dxr,dzl,dzr by the y offset to get xl,zl,xr,zr, and divides (zr-zl) by (xr-xl) to compute the z-slope. For each of these triangles, it draws a segment of the right color and fills the z-buffer using linear interpolation. Note that the 'find' and 'draw' processes run in parallel to optimize bandwidth.

This is implemented in the "draw_pipeline" module
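The draw phase for a single line can be sketched sequentially in Python as below. This is only an illustration of the steps listed above, not the "draw_pipeline" code: it assumes records of the form (direction, ymin, ymax, x, z, dxl, dzl, dxr, dzr, color) with slopes expressed per scanline away from the apex, 'down' meaning the apex is at ymin, and smaller z meaning closer to the viewer. In hardware the 'find' and 'draw' steps overlap instead of running one after the other.

```python
import math


def draw_line(triangles, y, width, background=0):
    """Render one scanline with a per-line z-buffer; returns the color buffer."""
    zbuf = [math.inf] * width            # cleared for every line
    cbuf = [background] * width
    for direction, ymin, ymax, x, z, dxl, dzl, dxr, dzr, color in triangles:
        # 'find' step: skip triangles that do not cross this line.
        if not (ymin <= y <= ymax):
            continue
        # Distance from the apex, in scanlines (assumed convention).
        offset = (y - ymin) if direction == "down" else (ymax - y)
        xl, zl = x + dxl * offset, z + dzl * offset
        xr, zr = x + dxr * offset, z + dzr * offset
        if xr <= xl:
            continue                     # empty span (e.g. at the apex itself)
        dz = (zr - zl) / (xr - xl)       # z-slope: the per-triangle division
        start = max(0, math.ceil(xl))
        end = min(width, math.ceil(xr))
        zz = zl + dz * (start - xl)
        # 'draw' step: one z test and at most one write per pixel.
        for px in range(start, end):
            if zz < zbuf[px]:
                zbuf[px] = zz
                cbuf[px] = color
            zz += dz
    return cbuf


if __name__ == "__main__":
    far = ("down", 0, 4, 0, 5.0, 0.0, 0.0, 2.0, 0.0, 1)
    near = ("down", 0, 4, 2, 1.0, 0.0, 0.0, 1.0, 0.0, 2)
    print(draw_line([far, near], 2, 6))
```

Since the buffers only cover one line, they fit easily in block RAM, and swapping the color buffer with the screen buffer at the end of the line gives the double buffering asked about earlier.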

At the end of the line, it swaps the color buffer with the screen buffer, and starts a new draw phase (see gpu.vhd)

For the demo, I used BRAM held matrices to rotate the model, but the matrix would more typically be set through programmatic means (using AVR8 or ZPUino)

I'm planning on adding a normal vector for every triangle, to do back face culling, and perhaps lighting (once I find my Arcade MegaWing, to have several bits per RGB channel).

Texturing is much harder, because linear interpolation of u,v gives bad results (see http://en.wikipedia.org/wiki/Texture_mapping#Perspective_correctness). Gouraud shading is basically the same as texture mapping, but linear interpolation might be decent enough there.
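A small Python sketch shows why plain linear interpolation of u,v looks wrong and where the per-pixel division comes from. The function names are hypothetical; the perspective-correct version uses the standard approach of interpolating u/z and 1/z linearly in screen space and dividing at each pixel.

```python
def affine_u(u0, u1, t):
    """Screen-space linear interpolation of a texture coordinate."""
    return u0 + (u1 - u0) * t


def perspective_u(u0, z0, u1, z1, t):
    """Perspective-correct interpolation: u/z and 1/z are linear on screen."""
    inv_z = 1.0 / z0 + (1.0 / z1 - 1.0 / z0) * t
    u_over_z = u0 / z0 + (u1 / z1 - u0 / z0) * t
    return u_over_z / inv_z              # the division needed at every pixel


if __name__ == "__main__":
    # A span whose right end is three times further away than its left end.
    print(affine_u(0.0, 1.0, 0.5), perspective_u(0.0, 1.0, 1.0, 3.0, 0.5))
```

Halfway across the span on screen, the correct texture coordinate is only a quarter of the way along the texture, while the affine version says one half; that mismatch is what produces the familiar warping on walls seen at an angle.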

I'm also thinking about how to parallelize pixel drawing, as the BRAM max width (36) should allow 4 writes at once, but this would require serious bookkeeping. I'm also looking at a better fifo for triangle lookup, as the 30 clock cycles latency of the division makes it a probable bottleneck.


Excellent work! I loaded the bit file onto my Papilio Plus board and captured a video of your 3d engine in action! I'm embedding the video into your last message and will promote to the showcase. :)

Jack.


I just looked at the timing report, to check which paths were not routed within spec. It appears these are paths from BRAM to a 9 bit compare, because I use an inverted clock for the BRAM and a regular clock for the registers that use the compare results -- so the constraint for the output path is 200MHz instead of 100MHz.

Any ideas about how to keep the constraint low without putting a register out of the bram ? (As this is in a critical loop, I really need to test a value at *every* cycle.)

I tried using an in-phase clock for the BRAM, but this gives unstable data, so I switched to a 180° phased clock. Using a 90° phased clock makes it a little better -- that is, p&r is still failing but the offset is not that bad.

Note that the thing seems to work perfectly in real life: the Xilinx toolchain timings must be tighter than reality (which is sensible)


Awesome - I for one welcome our blue cyber-bunny overlords!

Are you double-timing the BRAM to make it look like it is async RAM (the data is valid before the next rising edge, so you can set the address and read the data in the same cycle)? That is a neat trick... can I borrow that?

Jack needs to get you a 3D Papilio logo for you to take for a spin.

My maze design is slowly progressing - I've got the output side of things working - I get a textured picture from pre-computed results stored in BRAM, and am slowly getting the calculation side of things working. I'm thinking of adding a second wall texture so you can play find-the-butterfly graffiti, giving it a reason to exist...


Are you double-timing the BRAM to make it look like it is async RAM (the data is valid before the next rising edge, so you can set the address and read the data in the same cycle)? That is a neat trick... can I borrow that?

Be my guest ! To be accurate : I do not double the ram clock, I used an inverted clock, so the rising edge of the ram clock happens right in the middle of the logic clock cycle (at the falling edge)

I actually thought that this was the right way to use BRAMs, but I must have got it wrong : I obviously need to dig into the Xilinx manuals...

EDIT : checking on the manuals, you actually need to set the address (and data for write) some time before clk and keep it stable for some time afterwards. And the BRAM has a built-in clock inverter. So this is a right way to use BRAM, if not *the* right way for single clock accesses.

BUT, against all general rules about clocks, you should use a *gated clock* ("not clk") instead of a global inverted clock, as it seems the synthesis tools use the built-in inverter, and that makes the clock skew much better, and the clock uncertainty a little better : reverting to gated clocks changed the worst slack from -2ns to -0.2ns... (which is still wrong, but not that bad)


Who would be interested in trying to get a doom engine running on a papilio plus ? some info on the rendering process at http://fabiensanglard.net/doomIphone/doomClassicRenderer.php

Walls should be "easy", but floors and ceilings look way more complicated (I'm concerned about the flood-fill algorithm used in the original code).

Plenty of challenges there :

- simultaneous access to ram for data and texture access, back buffer write and screen display -- Atari ST style multiplexing might be the key, maybe a wishbone-like arbiter ;

- complex data structures handling in VHDL (with pipelining if we aim at a 60 Hz refresh) ;

- 16.16 fixed point computations -- that does not fit in the 18 bits of the DSP block...

A (not so) quick check lets me believe that there is enough space in the sram chip for :

- 2 vga screen buffers (128kB - 64kB each) if we keep the HRES a bit low 256*240

- the level structures (~50kB for the first level) -- maybe we can put some of it in block ram

- the textures (~100kB for the same level)

- some space for intermediate structures (floors and ceilings, 65kB in the original code)

I'm more concerned about the size of precomputed tables -- can they fit in the ram blocks ?

Right now, I'm looking into a Java implementation, to see how all this actually fits together.


Hi all,

I started pondering over a 3d engine for the papilio, using VGA output of a TV wing, rendering "on the fly" rather than in a bitmap buffer (nintendo ds style, for those who know about it), to avoid the need for external memory.

.................................

What are the current clock frequencies?

papilio_clk = ? MHZ

gpu_clk = ? MHZ

vga_clk = ? MHZ

Thank you.

Why not rabbit?

post-36405-0-89546900-1354368135.gif
