nilrods 2 Report post Posted April 27, 2015 I had a question for the FPGA experts. I want to build a lookup table to be used by multiple components(here I mean the generic use of the term not any vhdl term). I have a function that I have modelled in excel and seems to give me the correct values to do what I want. Input 32 bit fixed point, output 32 fixed point. Function is basically 1-e^(x3). My thought since the e(exponential) function is probably not easily calculated directly in hardware(at least from what I read) that a lookup table would be more efficient. So based on what I have read I think I can come up with a lookup table. I am assuming since it is 32 bit fixed point that BRAM would be best place to store? Thoughts? But if I want to have say somewhere between 2 and 10 components running the same function in parallel and taking action based on the result at the same time(in the 10's of kHz range) I am struggling to figure out how to share the lookup table across all of them. I believe I could just build the lookup table into each component and then they could all process in parallel, but depending on the size of the lookup table this could waste quite a bit of BRAM, I think. The only other thing I can think would be to process them for each compenent 1 at a time, if the time to do each is able to fit the timing constraints. Any thoughts would be appreciated. Thanks,Chris Share this post Link to post Share on other sites

Jaxartes 7 Report post Posted April 28, 2015 I think you're basically right. The BRAM has two read ports, so you could have two components accessing the lookup table at the same time. Any more, and they'd have to take turns. The logic for taking turns needn't be terribly complicated, but you'd have to make each component capable of waiting (for its turn). Another option, if you're planning on ten identical components using this lookup table, running in parallel at 100kHz, is: Redesign with a single component, running on ten different sets of inputs/outputs/state variables, at 1MHz. Of course, that's not an option if the logic of the components is quite different, or if you need higher speed or more components. (Also, when I tried something like this, the small fast memories needed for all those inputs/outputs/variables took up a lot more space on the FPGA than I'd expected.) And you'd have to make sure you're always working on the right set of inputs/outputs/state variables at any given time. I'll also point out, just using a BRAM, even without sharing it, leads to a couple of inconveniences:Each BRAM read takes a clock cycle: You get your result one cycle after your input went in. So that adds one pipeline stage to your logic.If your lookup table's input is 32 bits, and its output is 32 bits, then the total BRAM needed for the lookup table is 32*2^32 bits, that is 16 gigabytes! (Not 36 as I initially stated, oops.) This FPGA doesn't have nearly enough space.I have only vague thoughts about what you could do about the latter:If you can get away with reduced precision, at least on the input side, then it can fit: A lookup table with 14 bits input and 32 bits output would fit into the BRAM on the Papilio Pro or Duo, just barely. One with 11 bits input and 16 bits output would take up a single 2kB BRAM.On a small scale, smooth functions resemble straight lines. So you might have a small lookup table, and then interpolate between values in the lookup table. Linear interpolation would just require an addition or two and a bit shift. Quadratic interpolation would add multiplication into the mix, meaning you'd have to use one of the DSP blocks for it, but would get a higher-precision result out of a small lookup table.You might also exploit any patterns in your input. You'd know those, I wouldn't. But if each input is related to the previous input, that might also mathematically relate each output to the previous. Example: to synthesize an approximate sine wave without a lookup table, use the relation that sin'(x) = cos(x) and cos'(x) = -sin(x) (and do something about the continual loss of precision). Share this post Link to post Share on other sites

nilrods 2 Report post Posted April 28, 2015 Jaxartes,Thank you for the excellent insight. I was hoping there was some slick trick out there I just didn't know about, being new to fpga's. I was already working on the high level state machine. I was trying to build it as a component that the user could chose how many they want to run, but sounds like that may be adding complexity to the design. I will have to look into seeing if I can possibly build it like you said as 1 component working on multiple input/output/state variables. Nice tip on the BRAM and timing. I will have to remember the 1 clock cycle delay. I apologize for the excluded details on the lookup table requirements. I re-read my post and can see how it was lacking some info. Most of the literature I had read about lookup tables for non-trivial, continuous functions(like e^x) mention the idea of reduction(splitting the input range into fixed number of intervals) and approximation (what you mention as interpolation above) so I assumed that was standard practice, sorry about that. Even though I listed the input as 32 bit fixed, using the reduction/approximation method I was only hoping for somewhere between 2048 and 4096 actual entries in the lookup table, but I can see what your saying about the 11-bit/16bit lining up and using less BRAM. I need to look again over my precision and see if that will work for me. I have already started an approximation function, but still needs some work for accuracy. Again. Thanks for the info/help. Chris Share this post Link to post Share on other sites

Jaxartes 7 Report post Posted April 29, 2015 I'm fairly new to FPGAs myself (it's been 7 months since I got my Papilio Pro), so there may well be some tricks I'm missing. This issue seems like something lots of people will have dealt with. A generic solution might be presented as something quite different, and perhaps not as a matter of FPGA logic: It might be presented as digital logic design (since the same issue would apply on ASICs as well as on FPGAs). You might also try looking at SDRAM controller cores (such as Hamster's): their interface, and the structure of simple cores that use them (if any are simple). The SDRAM, like your lookup function, is a fixed, single resource with (sometimes) multiple uses. So they could provide an example of dealing with complex multiplexing. I'm curious, what is this function for? Share this post Link to post Share on other sites

hamster 46 Report post Posted April 29, 2015 An old trick is you can clock a BRAM with "not clk" and the result of a lookup is ready before the next rising clock edge... However your designs Fmax might halve so it isn't suited for high performance designs. Share this post Link to post Share on other sites

nilrods 2 Report post Posted April 29, 2015 Jaxartes,Thanks for the idea. I may have a look at the sdram controller to see if I can get any ideas from it. Actually the function is used for trajectory planning and integrating. The nice thing about the function is that since it is based on e^x all derivatives are also continuous and smooth also(since derivative of e^x is e^x). This helps for things like smooth accelerations and jerk(1st and 2nd derivatives of velocity). Most existing algorithms have sharp changes as you get to the jerk values(and some even acceleration values) which of course impacts things like vibration and stress on the machine being run. hamster,I like that trick! I will definitely have to remember that one. I will see if I can use it. Thanks,Chris Share this post Link to post Share on other sites

Jaxartes 7 Report post Posted April 30, 2015 Ah, I see. So I would assume that the input to your function, increases smoothly (even linearly) with time? Then you might also consider the following: A simple low pass filter will smooth out a function. Two low pass filters will smooth out the function and its derivative -- and so on. And it should be fairly easy to make (though I haven't tried it on FPGA):Each filter has a state variable. The state variable might need extra precision.Every clock cycle, the state variable is updated: subtract some fraction of its difference from the inputMake that fraction of a power of two, so that "division" becomes easy.You would have to adjust the number of filters, and the exact fraction to subtract, to suit your application. Too many, or too-low frequency, and it becomes sluggish. Too few, or too-high frequency, and you still have sharp changes. But with such filters you could take a roughly approximate function and "smooth it out" to be good enough. Or have the filters do all the work, and use as your function the step function (it's here, then suddenly it's there in zero time). The calculation, at each iteration, is linear, even though the function it ends up producing, over time, is exponential. See also: low-pass filter, Butterworth filter. Share this post Link to post Share on other sites

nilrods 2 Report post Posted April 30, 2015 Yeah the inputs are linear with time. I remember some of those filters from college. I may need to take a look at that again and see if that might help out. Thanks,Chris Share this post Link to post Share on other sites