Guest essele

Memory options?

Guest essele

Hi,

I know this probably isn't the best section to post a generic memory question, but it does seem relevant to the platform (given the C/RAM wing and the Spartan 6 projects.)

I have a project that I'm working on where I need to run a multi-layer framebuffer for a 320x240 LCD screen ... I'm using the Papilio One to get the LCD working but will have to look at different options when I come to implement.

Basically I need to build up the video signal by reading (and processing) the four different layers, each of them is 320x240x8 (or at least the bits I'll need to access) ... therefore, with a 60Hz refresh rate, by my calculations I'll need to read 18,432,000 bytes per second just to refresh.
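As a quick sanity check on that figure, here is a small Python sketch. The variable names are illustrative, and the 50MHz comparison assumes one 8-bit read per clock, which is an idealisation.

```python
# Refresh bandwidth for the multi-layer framebuffer described above.
WIDTH, HEIGHT = 320, 240   # LCD resolution in pixels
BYTES_PER_PIXEL = 1        # 8 bits per pixel per layer
LAYERS = 4
REFRESH_HZ = 60


def refresh_bytes_per_second(width, height, bytes_per_pixel, layers, refresh_hz):
    """Bytes that must be read every second to rebuild each frame."""
    return width * height * bytes_per_pixel * layers * refresh_hz


required = refresh_bytes_per_second(WIDTH, HEIGHT, BYTES_PER_PIXEL, LAYERS, REFRESH_HZ)

# Assumption: an 8-bit memory delivering one byte per 50MHz clock cycle.
available_at_50mhz = 50_000_000
utilisation = required / available_at_50mhz

print(required)               # 18432000
print(round(utilisation, 3))  # 0.369
```

Under that assumption the refresh alone uses a bit over a third of the 50MHz read bandwidth, leaving the rest for updates.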

The problem is that the nature of the processing I'm doing means that the memory won't be read sequentially (not even within the layers) so I need good performing memory that can deal with random access.

So my current thinking is to use 8bit wide 10ns SRAM, latency isn't an issue as I'll need to heavily pipeline the processing anyway, so it's bandwidth that's my concern. With 10ns async I would expect to be able to get to 100MHz using the setup & read on consecutive clocks (assuming I understand this correctly) ... in any case 50MHz would actually be ok.

That should give me plenty of time to be able to handle updates etc (with an appropriate multiplexer on the memory.)

So that seems ok, however ultimately I'd like to either use higher resolution screens or add some anti-aliasing to the system ... this will need potentially 4-times the read bandwidth, and I'd still need to be able to make updates.

So what are my options here?  I'd still like to stay with a Spartan 3E, but could potentially use a bigger (208) footprint to get more I/O's.

1. I could use wider memory (16 or 32bit), though my accesses are all 8bit and for the most part are random so I'm confused as to how I could really make use of this.

2. I could use multiple memories and use them in parallel - one for each layer ... simple, but would take up loads of I/O's.

3. Is there any benefit to using DDR/DDR2 ... it looks like you can get better throughput, but non-consecutive reads, and intermingled writes sound like a problem, not to mention that the DDR controllers seem to be pretty complex.

Any help appreciated.

Lee.


Hi,

let me see if I understood correctly.

The problem is that the nature of the processing I'm doing means that  the memory won't be read sequentially (not even within the layers) so I  need good performing memory that can deal with random access.

Are your memory accesses predictable in any way ?

So my current thinking is to use 8bit wide 10ns SRAM, latency isn't an  issue as I'll need to heavily pipeline the processing anyway, so it's  bandwidth that's my concern. With 10ns async I would expect to be able  to get to 100MHz using the setup & read on consecutive clocks  (assuming I understand this correctly) ... in any case 50MHz would  actually be ok.

100MHz is tricky, almost impossible to meet. 50MHz is fine though, even with sequential reads. Expect about 3ns offset out (assuming you are driving from a FF) and 3ns offset in. If you include propagation delays, and note that this SRAM has only a 2ns hold time, things get a bit complex.

1. I could use wider memory (16 or 32bit), though my accesses are all  8bit and for the most part are random so I'm confused as to how I could  really make use of this.

Use a 16-bit memory and use the UB/LB masks. For reads, just use an 8-bit mux. Note that I think LB/UB will not be connected on the Papilio Plus, so to write 8 bits you will need to read the 16-bit word, modify it, and then write the whole value back.
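The byte mux and the read-modify-write sequence can be sketched in Python as a behavioural model. This is only an illustration: the `Sram16` class and the lane mapping (odd byte address selects the upper byte) are assumptions, not something taken from the Papilio design.

```python
class Sram16:
    """Toy model of a 16-bit-wide SRAM with no byte-lane (UB/LB) control."""

    def __init__(self, words):
        self.mem = [0] * words

    def read_word(self, word_addr):
        return self.mem[word_addr]

    def write_word(self, word_addr, value):
        self.mem[word_addr] = value & 0xFFFF


def read_byte(sram, byte_addr):
    """8-bit read: fetch the whole word, then mux out one byte."""
    word = sram.read_word(byte_addr >> 1)
    # Assumed mapping: address bit 0 selects the upper (1) or lower (0) byte.
    return (word >> 8) & 0xFF if byte_addr & 1 else word & 0xFF


def write_byte(sram, byte_addr, value):
    """8-bit write without UB/LB: read the word, modify one byte, write back."""
    word_addr = byte_addr >> 1
    word = sram.read_word(word_addr)                    # read
    if byte_addr & 1:
        word = (word & 0x00FF) | ((value & 0xFF) << 8)  # replace upper byte
    else:
        word = (word & 0xFF00) | (value & 0xFF)         # replace lower byte
    sram.write_word(word_addr, word)                    # write back


sram = Sram16(words=4)
write_byte(sram, 0, 0x12)
write_byte(sram, 1, 0x34)
print(hex(sram.read_word(0)))  # 0x3412
```

Note that each byte write costs two memory cycles in this scheme, which matters when write bandwidth is tight.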

And stay away from DDR, at least with Spartan3. It's a very complex matter. Spartan6 seems to make things easier, but I have not tried.

Alvie

Guest essele

Hi Alvie,

Thanks for the response.

Are the reads predictable?  Not really, some of the layers will be being rotated and so the memory accesses used during the video-out phase will be dependent on the angle ... and this can change each frame.

I'm not sure I entirely follow your timing comments on 100MHz. I think what you're saying is that there is a 3ns delay to get in and out of the FPGA. So, assuming the address is set at clock 0, it won't actually get set until 3ns after, the data won't actually be available at the second clock, etc.

If that's what you mean, then it makes sense.

I guess the answer to this is synchronous memory then?? For example IDT71V3578S133PFG? Still looks a little more complex, but should get up to 100MHz? In this case I could use each half of the data for different layers (not improving performance, but I can't find any 8bit versions of this.)

This device shows a 1.5ns "clock high to data change" time, does this mean I'll have the same problem, i.e. the valid data won't be available for long enough for me to get it??

Anyway, I think my first solution will be async SRAM at around 50MHz ... once I've got that working I'll consider other options. The easiest one certainly looks like multiple SRAMs in parallel ... but that does give pin, space & cost issues.

Thanks,

Lee.

I'm not sure I entirely follow your timing comments on 100MHz. I think  what you're saying is that there is a 3ns delay to get in and out of the  fpga. So therefore assuming the address is set at clock 0, it won't  actually get set until 3ns after, the data wont actually be available at  the second clock etc.

Let me try explaining this:

Let's say you have an internal FPGA clock named CLK. Let's say you are using some FFs to drive the address of the SRAM (meaning the FF outputs are mapped to output pins). I'll

When your clock rises, the FF output changes after time T1. The output on the pin will be available after T2 due to output buffer delays. The address will then be available at the SRAM after Tpd (propagation delay).

SRAM data will be present at its output, and will take Tpd to reach the FPGA pin. Due to input buffer delays, this signal will be available inside the FPGA after T3. It will then be sampled on the FPGA clock, either directly or using a latch.

So, let's assume this:

a) SRAM data output for address A is valid 10ns after address changes to A;

b) SRAM data output for address A remains valid for only 2ns after the address changes from A to B (hold time)

Let's say CLK rises at time 0. At this time, we have presented the address on the FF. This address will reach the SRAM at

  T1+T2+Tpd

The SRAM will react, and data will be available at time:

  (T1+T2+Tpd) + 10ns

It will be available inside FPGA at time

  (T1+T2+Tpd) + 10ns + (Tpd + T3)

This is a simple computation, right? The problem is that you want to change the address in the meantime, so as to fetch a new value. If you imagine a 100MHz clock (10ns period) you can see that you must issue the new address before the data for the previous address has been retrieved. Tricky, right? We have to pipeline these operations. But this is not all:

When you change the address, the output will only remain stable for 2ns.

A timing diagram helps understanding the problem:

sram_timing.png

As you can see, not only do you have a tiny 2ns window, you also don't have a clock edge which can be used to sample the data. I used delays of about 3ns to 3.5ns. These vary a lot from device to device, and from routing to routing; they are almost never the same.
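The same reasoning can be put into numbers with a small Python sketch. It assumes a new address is launched on every clock edge, lumps T1+T2+Tpd into a single outbound delay and Tpd+T3 into a single inbound delay, and uses a round 3ns for each; the function names are made up for the illustration.

```python
def data_valid_window(period_ns, t_out_ns, t_in_ns, t_access_ns=10.0, t_hold_ns=2.0):
    """Interval (measured from the launching clock edge) during which the
    data for one address is valid inside the FPGA.

    t_out_ns lumps T1 + T2 + Tpd (FF, output buffer, trace to the SRAM).
    t_in_ns  lumps Tpd + T3 (trace back, input buffer).
    """
    # Data becomes valid t_access after the address reaches the SRAM.
    start = t_out_ns + t_access_ns + t_in_ns
    # The next address is launched one period later; the old data holds
    # for t_hold after that address arrives at the SRAM.
    end = period_ns + t_out_ns + t_hold_ns + t_in_ns
    return start, end


def sampling_edges(period_ns, start, end):
    """Unshifted clock edges (multiples of the period) inside the window."""
    edges = []
    k = 1
    while k * period_ns <= end:
        if k * period_ns >= start:
            edges.append(k * period_ns)
        k += 1
    return edges


for period in (10.0, 20.0):  # 100MHz and 50MHz
    start, end = data_valid_window(period, t_out_ns=3.0, t_in_ns=3.0)
    print(period, (start, end), sampling_edges(period, start, end))
# 10.0 (16.0, 18.0) []
# 20.0 (16.0, 28.0) [20.0]
```

With these assumed delays the 100MHz case leaves a 2ns window that no unshifted clock edge falls into, while at 50MHz the next rising edge lands comfortably inside a 12ns window.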

Alvie


Alvie answered this much better than I could have.  :)

One thing I would also recommend is to stay away from DDR with the Spartan 3E. I made an attempt years ago and I was never able to get it to work correctly. I think it should be no problem to implement SDRAM, but you will need an SDRAM controller core to use it.

The Spartan 6 makes DDR easier if you are using one of the larger chips that include a DDR controller. You have to use fixed pins if I'm not mistaken, and anyway the LX4 and LX9 chips do not include the DDR controller.

Now Artix is a different matter, with Artix they are making it very easy to implement DDR memory on every single size chip, and you won't have fixed pin locations. I'm very much looking forward to the Artix chips. :)

If you want I can send you a C/RAM Wing to experiment with.

Keep us posted,

Jack.

Guest essele

Wow ... what a great explanation. Thanks.

So providing you can ensure that the data has reached the FPGA by the time the clock ticks then it's fine ... so if we assume (for argument's sake) 3ns total delay each way, then we need to add 6ns to the 10ns cycle ... giving us 16ns, and a theoretical max (for these fake numbers) of about 62MHz.
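That estimate as a couple of lines of Python, assuming each read must fully complete (address out, 10ns access, data back in) within one clock period; the function name is just for illustration.

```python
def max_clock_mhz(t_access_ns, t_out_ns, t_in_ns):
    """Highest clock rate if a whole read round trip must fit in one period."""
    cycle_ns = t_out_ns + t_access_ns + t_in_ns
    return 1000.0 / cycle_ns  # 1000 / period in ns gives MHz


print(max_clock_mhz(t_access_ns=10, t_out_ns=3, t_in_ns=3))  # 62.5
```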

What about the synchronous option? Is that another alternative? (I assume that's clocked and hence you don't have the data-valid window problem??)

Jack - I'd love to experiment with the C/RAM wing, I'm in the UK ... happy to pay though!  It would be great if I could get everything done with the Papilio!

Lee.

So providing you can assure that the data has reached the FPGA by the  time the clock ticks then it's fine ... so if we assume (for arguments  sake) 3ns total delay each way, then we need to add 6ns to the 10ns  cycle ... giving us 16ns, and a theoretical max (for these fake numbers)  of 62Mhz.

Truth is you can actually run this at 100MHz, by increasing the input delay so that the rising clock edge falls about halfway through that 2ns window. This is, however, tricky. It will also add latency, but if you use bursts you should not be affected.

I'm running the SRAM at 50MHz now, and it works pretty well.

I forgot to cover writes: writing to those memories at 100MHz is painful. The WE signal is required to be '0' for at least 8ns, and it has to go back to '1' for the memory write to actually be performed. Again, a 2ns thing. I tried to find a way to do this, and I might have a solution, but it's pretty awkward.

Why is this signal so complex to generate? Because it has a 10ns period with an 80% duty cycle. The usual clocking options (clk90 [adds 2.5ns], clk180 [adds 5ns], clk270 [adds 7.5ns]) do not give the required timing. The only option I see is to use a FF with asynchronous set/reset connected to CLK270, with added delay provided by a few LUTs. Tricky, tricky.
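The numbers behind this can be checked with a short Python sketch. It assumes WE falls on the rising edge of CLK and is released by the edge of a phase-shifted clock, so the low time equals the phase delay; that is one reading of the scheme described above, not a verified circuit.

```python
PERIOD_NS = 10.0       # 100MHz clock
WE_LOW_MIN_NS = 8.0    # minimum time WE must be held at '0'


def phase_delay_ns(phase_deg, period_ns=PERIOD_NS):
    """Delay of a phase-shifted clock edge relative to the main clock."""
    return period_ns * phase_deg / 360.0


# Delay contributed by each standard phase-shifted clock output.
delays = {deg: phase_delay_ns(deg) for deg in (90, 180, 270)}

# Phases that would hold WE low long enough on their own.
usable = [deg for deg, delay in delays.items() if delay >= WE_LOW_MIN_NS]

# Extra delay that would have to be added after the clk270 edge.
extra_needed_ns = WE_LOW_MIN_NS - delays[270]

print(delays)                     # {90: 2.5, 180: 5.0, 270: 7.5}
print(usable)                     # []
print(extra_needed_ns)            # 0.5
print(WE_LOW_MIN_NS / PERIOD_NS)  # 0.8
```

None of the standard phases reaches 8ns, which is why at least another 0.5ns of delay (for example from a few LUTs) has to be added after the clk270 edge, while still leaving time for WE to return to '1' before the next cycle.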

What about the synchronous option? Is that another alternative? (I  assume that's clocked and hence you don't have the data-valid window  problem??)

Source-synchronous designs use another technique: a separate clock, which is not phase aligned with the main clock. The phase is set either by feeding the output clock back into the FPGA, or by programming it manually. This is also tricky sometimes (that's why it's so difficult to get DDR to work with the S3E). DDR also poses another problem: [most of] those devices require at least a 75MHz clock, so you cannot clock them at lower speeds in case you don't meet timings.

Here's a paper from Altera that shows what's involved: http://www.altera.com/literature/an/an433.pdf

And one from Xilinx: http://www.eng.utah.edu/~cs3710/xilinx-docs/XAPP768c.pdf

These things are so sensitive that a PCB trace length error of a few mm is sometimes enough for your design not to work properly. That's also why you sometimes see a few weird-looking traces in DDR connections: they are there so that all signals have the same Tpd (by making all traces the same length, with the same impedance).

Alvie


I would like to use the C/RAM wing also with the Papilio One + board.  Let me know the details as to payment so that it can possibly ship with a Papilio One + board when available.
