SDRAM controller with consistent access time?


Matthew Hagerty


I have been spoiled with the simple SRAM interface for sure. :-)  I have also used PSRAM, a.k.a. "Cellular RAM" (SDRAM with a built-in SRAM interface and consistent access time) and it was just as easy as SRAM.  I'm now trying to move a design I'm working on to a devboard that only has SDRAM and I'm finding it a big pain to use.

 

What I *think* I'd like to find is an SDRAM controller that provides a consistent access time, similar to what PSRAM offers.  Does anyone know of such a controller or have any suggestions on how I might modify an existing one?  The access I need is 8-bit and completely random, i.e. burst access of multiple linear bytes in memory is useless.

 

Micron made the PSRAM I used and lists this main feature:

 

"CellularRAM products incorporate a transparent self refresh mechanism. The hidden refresh requires no additional support from the system memory controller and has no significant impact on device read/write performance."

 

It is that second part that is critical: "has no significant impact on device read/write performance."  I'm wondering what it would take to get an HDL-based SDRAM controller to provide that same feature: 70ns access with "hidden" refresh.

 

I would like to work on this with someone if anyone is interested.

Link to comment
Share on other sites

To be honest, I don't find this useful at all. It drastically reduces the available bandwidth, and although it might make sense for real HW parts (because they need to be drop-in replacements), it makes little sense from an FPGA point of view.

 

If your system requires 1-cycle-delay reads and writes, I'd suggest halting the main clock while the request is being processed. This can easily be done with a BUFE or similar entity.
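Something like this (a rough VHDL sketch of the idea using Xilinx's BUFGCE clock buffer; the entity and signal names are made up):

   library ieee;
   use ieee.std_logic_1164.all;
   library unisim;
   use unisim.vcomponents.all;

   entity cpu_clock_gate is
      port ( sys_clk  : in  std_logic;
             mem_busy : in  std_logic;  -- high while a memory request is in flight
             cpu_clk  : out std_logic );
   end cpu_clock_gate;

   architecture rtl of cpu_clock_gate is
      signal cpu_clk_en : std_logic;
   begin
      cpu_clk_en <= not mem_busy;  -- freeze the CPU until the access completes

      u_gate : BUFGCE
         port map ( O => cpu_clk, CE => cpu_clk_en, I => sys_clk );
   end rtl;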

 

Do you have a real use-case for this, one that cannot be accomplished with other techniques like the one described above?

 

Alvie

Link to comment
Share on other sites

My memory controller is pretty simple, and doesn't activate multiple banks at once.

 

It occurs to me that when moving data within the same bank, if the controller has multiple bank support it might actually be a lot quicker to first move the data to a different bank (which can be done relatively quickly) and then move it back into the original bank, rather than activating and precharging the same bank for every set of reads and writes.

 

I would even go so far as to think about implementing the frame buffer as a grid of 16x16 cells, each mapping back to one row in the SDRAM.

 

That way drawing a 16x16 bitmap would only need to activate at most four rows in the frame buffer, plus one for the row that is the source of the 16x16 words of data - maybe 512 cycles + 5 * 12 = 572 cycles.

 

This would be far quicker than the 32 or maybe even 48 rows that would be activated if you implemented the frame buffer as a sequential memory space, which could be at best maybe 512 + 33 * 12 = 908 cycles. With the current controller, performing single-word read/writes, it would take maybe 512*(12+1) = 6,656 cycles!

 

So the performance is there - just got to work out how to make the best of it!

Link to comment
Share on other sites

Just had an afterthought. If you arranged the memory in 16x16 tiles, and then 'checkerboarded' the mapping of tiles to banks like this:

 

  00 01 00 01 ...

  10 11 10 11 ...

  00 01 00 01 ...

  10 11 10 11 ...

 

(where each 00, 01, 10, 11 is a 16x16 tile of pixels)

 

AND if you performed your bit-blits based on tiles of at most 16x16, first bringing the source into one of the FPGA's BRAMs and then writing it back out,

 

THEN all four banks could be opened for the entire read/write transfer, effectively giving near single-cycle access to a 32x32 area of the screen.
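The mapping itself is just address slicing. A rough VHDL sketch, assuming a made-up geometry of a 1024x1024 pixel space (10-bit x/y), 16x16-pixel tiles, and one SDRAM row per tile:

   library ieee;
   use ieee.std_logic_1164.all;

   entity tile_map is
      port ( x, y : in  std_logic_vector(9 downto 0);   -- pixel coordinates
             bank : out std_logic_vector(1 downto 0);   -- SDRAM bank address
             row  : out std_logic_vector(9 downto 0);   -- SDRAM row address
             col  : out std_logic_vector(7 downto 0) ); -- SDRAM column address
   end tile_map;

   architecture rtl of tile_map is
   begin
      bank <= y(4) & x(4);                    -- checkerboard: adjacent tiles differ in bank
      row  <= y(9 downto 5) & x(9 downto 5);  -- which tile within the bank
      col  <= y(3 downto 0) & x(3 downto 0);  -- pixel offset inside the 16x16 tile
   end rtl;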

 

That would allow a 256 pixel copy or move in 512 + 8*2 = 528 cycles (as you can open all the banks at the start), with a command sequence something like:

 

  ACTIVATE source bank0,

  ACTIVATE source bank1, 

  ACTIVATE source bank2, 

  ACTIVATE source bank3, 

  IDLE,

  IDLE,

  READ source bank0 addrX.... 

  READ source bank0 addrX+1

  READ source bank1 addrX+2

  READ source bank1 addrX+3

 

  ....

  PRECHARGE all banks,

 

  ACTIVATE dest bank0, 

  ACTIVATE dest bank1, 

  ACTIVATE dest bank2, 

  ACTIVATE dest bank3, 

  IDLE, 

  IDLE, 

  WRITE dest bank0 addrY.... 

  WRITE dest bank0 addrY+1

  WRITE dest bank1 addrY+2

  WRITE dest bank1 addrY+3

  ....

  PRECHARGE all banks,

 

That is only 3% more cycles than using pure SRAM (and much faster than PSRAM could perform).

 

However, streaming access would get only about 50% of the total memory bandwidth due to swapping banks every 16 pixels, unless you hide the opening/closing of one bank while another is being accessed, in which case nearly 100% of the bandwidth would be available.


It would also speed up any line drawing routine, as the worst case is one pair of ACTIVATE / PRECHARGE commands every four pixels, compared with a worst case of one ACTIVATE / PRECHARGE every pixel if the direction of the line doesn't match the native flow of memory addresses. The best case is one ACT/PRE every 16 pixels.

 

Humm, so a high performance SDRAM frame buffer with bit-blits and accelerated line draws is possible..... humm.

 

:-)

Link to comment
Share on other sites

To be honest, I don't find this useful at all. It drastically reduces the available bandwidth...

Yes and no. It depends. It reduces bandwidth when using SDRAM that is significantly slower than the rest of the system (all modern computers) and when there is no benefit to being able to take advantage of burst reads/writes. If your memory access is true random byte access, then modern SDRAM is awful. Not every use of RAM is for video or audio streaming, or convenient configurations of large blobs of data to be processed in linear chunks.

SDRAM's only advantage is the cost per bit compared to other memory.

If your system requires 1-cycle-delay reads and writes, I'd suggest halting the main clock while the request is being processed. This can easily be done with a BUFE or similar entity.

It is not always feasible to do this, especially when the rest of the system is designed (and expects) to work within a specified period. Modern computer systems may be designed for this kind of waiting on RAM, but old systems are not.

Do you have a real use-case for this...

Yes I do, otherwise I would not actually be trying to use SDRAM. Like I said, I'm trying to move an SoC design that is *already working* on a devboard that has SRAM and/or PSRAM. Since SDRAM is cheap and you find it on many devboards, I wanted to see if I could move the design over to it. However, I'm finding the task difficult; using SDRAM is awful unless you have a full-blown memory controller coupled with a CPU and tasks that require large gobs of linear data.

@hamster:

Thanks for the suggestions; however, the system is already designed, I'm just re-engineering it in an FPGA. I don't need any kind of streaming access to the data, i.e. no modern video or audio processing going on here. The total memory use is 48K (3 banks of 16K) and the layout is predefined: 4 bits per pixel (two pixels per byte). The CPU is 1MHz and only needs one memory access every microsecond (two actually, due to a read-before-write situation), and the video circuit needs six bytes every microsecond. For the video access I can take advantage of the SDRAM's 16-bit word size, so really I just need five memory accesses every microsecond, which gives 200ns per access. I figured an SDRAM, even with the refresh cycles, activations, precharges, etc., could manage that.

Ideally a controller would allow the host system to issue a refresh so a refresh cycle does not randomly affect a memory access that is expected to finish in a certain amount of time. An ideal memory access for this case would be as follows:

1. CPU access read-before-write, 80ns max

2. Possible write to same memory address if CPU is writing to memory, 80ns max

3. Issue a refresh cycle to the SDRAM, 200ns max

4. Video read 16-bit word (0K to 16K address range) 200ns max

5. Video read 16-bit word (16K to 32K address range) 200ns max

6. Video read 16-bit word (32K to 48K address range) 200ns max

This cycle repeats every microsecond, which would more than cover the minimum SDRAM refresh requirement of about one refresh every 7.8us. The only guarantee needed is on the CPU access times; the rest of the cycles have a lot of room to complete.
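The slot scheduling itself would be little more than a counter and a decoder. A rough VHDL sketch, assuming a 100MHz controller clock (100 cycles per 1us frame); the slot phases and signal names are made up:

   library ieee;
   use ieee.std_logic_1164.all;

   entity slot_sched is
      port ( clk_100 : in  std_logic;
             cpu_we  : in  std_logic;  -- the CPU wants a write this frame
             start_cpu_read, start_cpu_write, start_refresh,
             start_vid0, start_vid1, start_vid2 : out std_logic );
   end slot_sched;

   architecture rtl of slot_sched is
      signal divider : integer range 0 to 99 := 0;
   begin
      process(clk_100)
      begin
         if rising_edge(clk_100) then
            -- default: no start strobes this cycle
            start_cpu_read <= '0'; start_cpu_write <= '0'; start_refresh <= '0';
            start_vid0 <= '0'; start_vid1 <= '0'; start_vid2 <= '0';

            if divider = 99 then divider <= 0; else divider <= divider + 1; end if;

            -- kick off each access at a fixed phase of the 1us frame
            case divider is
               when  0 => start_cpu_read  <= '1';
               when 10 => start_cpu_write <= cpu_we;  -- only if the CPU is writing
               when 20 => start_refresh   <= '1';
               when 40 => start_vid0      <= '1';
               when 60 => start_vid1      <= '1';
               when 80 => start_vid2      <= '1';
               when others => null;
            end case;
         end if;
      end process;
   end rtl;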

Link to comment
Share on other sites

Hi Matthew,
 
I haven't double-checked it, but that can all be done in well under 0.4us - here's the sequence of SDRAM commands (it relies on the 16K pages being in different banks):
 
1. CPU access read-before-write, 80ns max
   ACTIVATE
   NOP
   READ <addr>
   NOP
   NOP
   NOP (with data)  (70ns)
 
2. Possible write to same memory address if CPU is writing to memory, 80ns max
   NOP (for bus turnaround)
   WRITE <addr> 
   NOP
   PRECHARGE
   NOP 
   NOP (60ns)
 
3. Issue a refresh cycle to the SDRAM, 200ns max
   REFRESH
   NOP
   NOP 
   NOP
   NOP
   NOP (70ns)
 
4. Video read 16-bit word (0K to 16K address range) 200ns max
5. Video read 16-bit word (16K to 32K address range) 200ns max
6. Video read 16-bit word (32K to 48K address range) 200ns max
   ACTIVATE BANK 0 
   ACTIVATE BANK 1
   ACTIVATE BANK 2
   ACTIVATE BANK 3
   READ BANK 0
   READ BANK 1
   READ BANK 2 
   READ BANK 3 (data from bank 0)
   NOP (data from bank 1)
   PRECHARGE ALL BANKS (data from bank 2)

 

   NOP (data from bank 3) (120ns total) 
   NOP
 
If you can't exploit the use of the banks to speed up the last step, then it works out to be 6 cycles per block (240ns total)
   ACTIVATE
   NOP
   READ <addr>
   NOP
   PRECHARGE
   NOP (with data)
   NOP
 
   ACTIVATE
   NOP
   READ <addr>
   NOP
   PRECHARGE
   NOP (with data) 
   NOP
 
   ACTIVATE
   NOP
   READ <addr>
   NOP
   PRECHARGE
   NOP (with data) 
   NOP
 
Link to comment
Share on other sites

Hmm, that seems pretty doable. Now I just have to *do it*. ;-) Having each 16K page in a separate bank would be possible, but it becomes a hassle due to the way the CPU accesses that memory vs. the video circuits. In this case it would be more trouble than it is worth and I don't need the efficiency; 240ns is well within the 600ns or so that I have for that part of the design.

Link to comment
Share on other sites

If you have 600ns between accesses you could use the DRAM MCB built into the FPGA (assuming you use a BGA LX9 part) and simply assume that after several hundred ns your data is available. I believe with the MCB your worst-case delay is about 13 cycles anyway, so the data will certainly be available after 500-600ns; the same goes for writing.

 

If the FPGA you have doesn't have an MCB, then Hamster has written an SDRAM controller that could be used similarly to give you constant-time accesses provided you're willing to go slow enough; then you don't have to worry about wait states.

 

Someone can correct me if I'm wrong, but the MCB runs at 200MHz (giving 400MHz DDR access to the DRAM), so even assuming a worst-case delay of 13 cycles at 200MHz to account for refresh, you can have 15MHz constant-time access to the DRAM of about 70ns like you said you wanted before. You could even make it 100ns to be on the safe side and still be well inside your 600ns requirement. This seems definitely doable.
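As a rough sketch of the idea (the 20-cycle window at 200MHz = 100ns is just the safety figure from above; the signal names are made up):

   -- declarations assumed elsewhere in the architecture:
   --   signal busy, ready : std_logic := '0';
   --   signal wait_cnt    : integer range 0 to 19 := 0;

   -- start a request, then count out a fixed window before declaring the
   -- data valid, regardless of when the MCB actually answered
   process(clk_200)
   begin
      if rising_edge(clk_200) then
         ready <= '0';
         if start = '1' then
            busy     <= '1';
            wait_cnt <= 0;
         elsif busy = '1' then
            if wait_cnt = 19 then  -- 20 cycles at 200MHz = 100ns
               busy  <= '0';
               ready <= '1';       -- one-cycle strobe: data assumed settled
            else
               wait_cnt <= wait_cnt + 1;
            end if;
         end if;
      end if;
   end process;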

Link to comment
Share on other sites

Isn't Google amazing....

 

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.4.5108&rep=rep1&type=pdf

 

"An important problem in extracting maximum benets from an SDRAM-based architecture is to exploit data locality at the page granularity. Frequent switches between data pages can increase memory latency and have an impact on energy

consumption."
 
And 
 
http://ics.kaist.ac.kr/intjpapers/High-Performance%20and%20Low-Power%20Memory%20Interface%20Architecture.pdf - High-Performance and Low-Power Memory-Interface Architecture for Video Processing Applications
 
Damn. And I thought that I had an original idea!
Link to comment
Share on other sites

Nice resources! SDRAM can certainly be used to great effect when you need a lot of bulk data, but in situations where you are working with random-access 8-bit (or even 16-bit) data, SDRAM is going to be difficult. Knowing your data and access patterns can help improve bandwidth, but you don't always have that information.

Damn. And I thought that I had an original idea!

I remember reading somewhere that if you are thinking about something or have an idea, there are at least ten other people somewhere having similar thoughts. I don't know if that is true, but it would not surprise me if it were.

Link to comment
Share on other sites

Someone can correct me if I'm wrong, but the MCB runs at 200MHz (giving 400MHz DDR access to the DRAM), so even assuming a worst-case delay of 13 cycles at 200MHz to account for refresh, you can have 15MHz constant-time access to the DRAM of about 70ns like you said you wanted before. You could even make it 100ns to be on the safe side and still be well inside your 600ns requirement. This seems definitely doable.

 

You might be wrong - given that the MCB is a multi-port design, and different ports have different priorities.

 

Looking at a Micron datasheet, it takes 9 cycles to issue the ACTIVATE and READ commands before the data comes back.

 

13 cycles (75ns) sounds right for how long it takes an MCB to access the first byte of a burst at any given address on an idle DDR chip, but it definitely does not sound right when a transaction is already in flight on a different address, or if refreshes are pending. Refreshes alone are 75ns (13 cycles).

Link to comment
Share on other sites

Sorry, the unstated assumption was that only one port is used. Like I said, I'll find some time and do some real-world tests on a Pipistrello with a pseudorandom address spread for read/write and measure the worst-case timing.

 

EDIT: just had a look in the MCB User Guide; page 58, table 4-4 gives a few read latency timings (no corresponding write timing table though).

The following is the worst case "read from new row" for a 400MHz memory clock.

  Outbound Command Path      12.5
  Precharge/Activate         12.0
  Memory CAS Latency (CL)     5.0
  Inbound Read Datapath       4.5
  --------------------------------
  Total Latency in Cycles    34.0  (85 ns)

I'm unsure how the MCB controller deals with refresh cycles.

 

Link to comment
Share on other sites

TL;DR Simulation experiments on a Spartan 6 MCB with LPDDR memory at a 400MHz DDR clock reveal the practical latency times for single, non-pipelined read and write operations.
The worst-case latency (when a clash occurs with a refresh cycle) is 260ns (104 cycles) for read and 220ns (88 cycles) for write.
The next best case (no clash with refresh) is 160ns (64 cycles) for read and 140ns (56 cycles) for write (these times include 20ns penalties for opening new rows).
 
I did some tests in the simulator to observe the behaviour and latency of the MCB under different conditions. I know I said I might break out the Pipistrello hardware but really, you can't beat the simulator for accuracy and ease of use.
If anyone is keen I can make the entire zipped up project available [Project files added below] (very small size), you don't even need any FPGA hardware, simply run it in the ISIM and you can examine all the timings in the sim window and zoom in and out to see all the waveforms.
A word of warning: simulation might run very slowly if you have a free WebPack license (they purposely slow it down for non-paying users).
 
Here are the gory details for the various test cases to non-consecutive addresses.
 
MCB System clock: p0_cmd_clk = p0_rd_clk = p0_wr_clk = 50MHz
DRAM clock: pll_ce_0 = 200MHz
DDR clock: sysclk_2x = 400MHz
Auto-refresh cycles occur every 7.72us
 
Write test case
 
 
Write with refresh clash: a write command is issued shortly before an auto-refresh cycle is about to start (worst-case latency scenario). The write command is delayed while the refresh is executed, then the write command is executed.
The falling edge of p0_cmd_empty at time 9.39us indicates a command is in the FIFO, but it is too late to execute this command as an auto-refresh cycle is about to start. At 9.435us mcb3_dram_a becomes 0x0400 (meaning A10 is high to signal a refresh command to the DRAM). At 9.61us the rising edge of p0_wr_empty indicates the write FIFO is empty and the data has been written to DRAM.
Transaction addr(0x0badbabe)=0x12345678 at time  9.39us - a refresh cycle gets in the way, so the total latency for this cycle from falling edge of p0_wr_empty to rising edge of p0_wr_empty is 220ns
 
Write without refresh clash - two back-to-back write transactions are executed
Transaction addr(0x0000dead)=0xfedcba98 at time  9.67us - no refresh cycle gets in the way, so the total latency for this cycle from falling edge of p0_wr_empty to rising edge of p0_wr_empty is 120ns
Transaction addr(0x05adface)=0x02468ace at time  9.83us - no refresh cycle gets in the way, so the total latency for this cycle from falling edge of p0_wr_empty to rising edge of p0_wr_empty is 120ns
 
Write followed by forced refresh - three "write then refresh" transactions are executed back-to-back
Transaction addr(0x0badbabe)=0x12345678 at time  9.99us - total latency for this cycle from falling edge of p0_wr_empty to rising edge of p0_wr_empty is 120ns
Transaction addr(0x0000dead)=0xfedcba98 at time 10.15us - total latency for this cycle from falling edge of p0_wr_empty to rising edge of p0_wr_empty is 140ns
Transaction addr(0x05adface)=0x02468ace at time 10.31us - total latency for this cycle from falling edge of p0_wr_empty to rising edge of p0_wr_empty is 140ns
 
 
Read Test Case
 

 
Read with refresh clash: a read command is issued shortly before an auto-refresh cycle is about to start (worst-case latency scenario). The read command is delayed while the refresh is executed, then the read command is executed.
The falling edge of p0_cmd_empty at time 18.13us indicates a command is in the FIFO but it is too late to execute this command as an auto-refresh cycle is about to start. At 18.18us mcb3_dram_a = 0x0400 (meaning A10 is high to signal a refresh command to the DRAM). At 18.39us p0_rd_empty goes low to indicate data has been read from DRAM and is available in the FIFO. Note that the correct values written in the earlier write tests are being recovered from the various addresses.
Transaction read addr(0x0badbabe) at time 18.13us - correct value 0x12345678 is recovered; a refresh cycle gets in the way, so the total latency for this cycle from falling edge of p0_cmd_empty to falling edge of p0_rd_empty is 260ns
 
Read without refresh clash - two back-to-back read transactions are executed
Transaction read addr(0x0000dead) at time 18.45us - correct data 0xfedcba98 is recovered; no refresh cycle gets in the way, so the total latency for this cycle from falling edge of p0_cmd_empty to falling edge of p0_rd_empty is 160ns
Transaction read addr(0x05adface) at time 18.67us - correct data 0x02468ace is recovered; no refresh cycle gets in the way, so the total latency for this cycle from falling edge of p0_cmd_empty to falling edge of p0_rd_empty is 160ns
 
Read followed by forced refresh - three "read then refresh" transactions are executed back-to-back. The read executes first, then the refresh immediately after.
Transaction read addr(0x0badbabe) at time 18.89us - correct data 0x12345678 is recovered; total latency for this cycle from falling edge of p0_cmd_empty to falling edge of p0_rd_empty is 160ns
Transaction read addr(0x0000dead) at time 19.11us - correct data 0xfedcba98 is recovered; total latency for this cycle from falling edge of p0_cmd_empty to falling edge of p0_rd_empty is 140ns
Transaction read addr(0x05adface) at time 19.33us - correct data 0x02468ace is recovered; total latency for this cycle from falling edge of p0_cmd_empty to falling edge of p0_rd_empty is 140ns
 
To convert the above times into DRAM cycles use 2.5ns cycle time (400MHz clock), so for example 200ns = 80 cycles.
 
You may ask why force a refresh cycle at all? The auto-refresh normally happens every 7.72us; however, by forcing a refresh cycle you cause the MCB to reset its refresh counter and guarantee that another refresh will not happen for at least another 7.7us. Therefore, instead of the auto-refresh catching you by surprise (if you run on a clock that is not evenly divisible by the 7.72us period), you control the refresh and ensure it occurs at a time suitable to you. It is also clear from the simulation that time savings can be gained by doing a manual refresh on the back of your read or write transaction. Note that not every single read or write needs a refresh; you only need to insert one refresh every 7.7us on a read or write transaction (if you miss it, the MCB controller will do it for you, with the associated time penalty).
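For anyone wanting to try it, queueing the manual refresh is just another command FIFO write. A rough sketch (the p0_* names are the usual MIG-generated MCB port-0 signals; instruction code "100" is the MCB refresh command if I'm reading UG388 right; queue_refresh is a made-up strobe from your own refresh timer):

   process(p0_cmd_clk)
   begin
      if rising_edge(p0_cmd_clk) then
         p0_cmd_en <= '0';
         if queue_refresh = '1' then               -- once per ~7.7us window
            p0_cmd_instr     <= "100";             -- MCB refresh command
            p0_cmd_bl        <= (others => '0');   -- burst length ignored for refresh
            p0_cmd_byte_addr <= (others => '0');   -- address ignored for refresh
            p0_cmd_en        <= '1';               -- one-cycle strobe into the FIFO
         end if;
      end if;
   end process;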
 
So in conclusion, as I see it, if you were to write a wrapper for the MCB to simulate an SRAM, you could take a simplistic approach and take the absolute worst case of 260ns and be fairly well guaranteed that any random read or write would have completed in that time. This would let you use this "SRAM" with an old-school CPU like a Z80 or 6502, etc. at up to a 3.8MHz clock, allowing a read or a write on every single clock cycle. However, given the experiments with the forced refresh, it is clear that time savings can be made, giving a worst-case latency of only 160ns. This translates to a 6.25MHz clock, assuming the CPU accesses the SRAM on every cycle.
 
Source files added:
Link to comment
Share on other sites

Alex, thanks for the research! Very good information. Hopefully I can get slightly better results with a simple controller dedicated to the task. I'm working with plain SDR SDRAM, i.e. not DDR, and a 100MHz clock. Based on the datasheet, the Trc (ACTIVE-to-ACTIVE command period on the same bank) is 65ns min, so in theory I can get 70ns access per read or write. I'm going to control when the auto-refresh command is issued so it won't get in the way of a read or write.

Link to comment
Share on other sites

As I'm getting close to trying my controller out on the real SDRAM, I noticed that the two controllers I was using as a guide (hamster's being one of them) always do their register transfers on the rising edge of the clock. While this is the norm in HDL, it is the same clock the SDRAM is using, and the SDRAM *samples* its inputs on the rising edge. That says to me that the HDL controller should be doing its register transfers on the falling edge of the clock, so the signals to the SDRAM are nice and stable across the rising edge.

The other controllers work, but maybe just barely? The setup and hold times for the SDRAM I'm using are 1.5ns and 1.0ns respectively, which is fast, but might not always be fast enough as the FPGA clock speed increases.

I changed my design to work with the falling edge of the clock; the registers are now nice and stable through the rising edge, and the simulation timing diagrams look much more like the datasheet's. Am I missing something?
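Concretely, the change amounts to something like this (the signal names are mine, not from any particular controller):

   -- drive the SDRAM pins on the falling edge so they are stable
   -- around the SDRAM's rising-edge sample point
   process(clk_100)
   begin
      if falling_edge(clk_100) then
         sdram_cmd  <= next_cmd;   -- CS/RAS/CAS/WE encoded command
         sdram_addr <= next_addr;
         sdram_ba   <= next_bank;
         sdram_dq   <= next_data;  -- plus the usual tri-state handling
      end if;
   end process;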

Link to comment
Share on other sites

What is typically done is to use a DCM to introduce a controlled phase shift/delay in the clock used by the output registers before sending it out of the FPGA as the SDRAM clock. This DCM is fine-tuned to match the delays to get the proper setup and hold times.
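One common Xilinx idiom for the forwarding half of this (a rough sketch, not necessarily the exact method meant above) is to run the DCM's phase-shifted clock out through an ODDR2 so the SDRAM clock leaves via a proper output register:

   -- clk_shifted / clk_shifted_180 would be e.g. the DCM's CLK0/CLK180
   -- outputs after the fine phase shift; ODDR2 comes from unisim.vcomponents
   u_clkfwd : ODDR2
      port map ( Q  => sdram_clk_pin,  -- straight to the SDRAM CLK pad
                 C0 => clk_shifted,
                 C1 => clk_shifted_180,
                 CE => '1',
                 D0 => '1',            -- emit '1' on the rising edge...
                 D1 => '0',            -- ...and '0' on the falling edge
                 R  => '0',
                 S  => '0' );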

Link to comment
Share on other sites

I do signal capture on the falling edge of the clock.

 

You might be missing:

 

 1) FPGA Input delays

 2) FPGA output delays (including clock)

 

Check your design after p&r for values of these. For output delays, expect something around 2ns. For input delays, 1.8ns is usual. For hi-z transitions, this is usually higher, around 8ns.

 

And make sure that at least the output FFs are placed in the IOBs.

Link to comment
Share on other sites

Thank you everyone for the feedback and information. Once I had most of the pieces in place I tested my SDRAM controller in simulation to verify and tweak the timing. I implemented my register transfers on the falling edge of the clock and it really worked out nicely for the SDRAM at 100MHz. After things looked good in simulation I wired it in to the rest of the system, generated the bit stream, loaded it into the FPGA, and...

It worked the first time!! I was beside myself. I could not believe it. It was a lonely triumph though, no one to high-five, or call on the phone, or have a small celebration with. That seems to be the way it is in the hobby FPGA scene though.

My design lets me read or write a word (16-bits in this case) every 70ns, guaranteed. The host is responsible for issuing refresh cycles though, which also take 70ns, but having control of when those refresh cycles happen allows me to keep them from interfering with the other accesses (which is what I needed). Since the host is running at 1MHz in this case, I just issue a refresh after the CPU has had its access, and just prior to the video circuits. So one refresh every microsecond, which is well below the SDRAM minimum of one refresh every 7.8us (i.e. 8192 refreshes needed every 64ms).

I also now realize that SDRAM is not so bad once you get over the mystique and decouple the interface from the data access patterns. Controlling the I/O and commands of an SDRAM is not so bad; designing an efficient "memory controller" to extract maximum bandwidth from SDRAM is the hard part, IMO. I think most SDRAM controllers try to smash those two components together, which makes them complicated. In my case I did not need maximum bandwidth, so I traded it for simplicity.

Link to comment
Share on other sites

Archived

This topic is now archived and is closed to further replies.