Real Hardware Does Not Match Simulation


jagboy


I've got a very simple design running on a PapilioOne 500K, implementing multiple quadrature encoder interfaces and PWMs.  It all works flawlessly in simulation, but in the actual hardware it is flaky.  For a given build, the flakiness will manifest in a very consistent manner.  I can make some inane change to the logic, and the flakiness will move to a completely different part of the circuit.  I've examined the generated gate-level logic, and it is exactly what I would expect (I've been doing chip design for 25+ years).  At first, I thought I just got a bad FPGA, so I got a replacement board, but I get the exact same results.  I can do synthesis at 2X the target clock rate, and still have LOTS of margin on all paths.  

 

As an example, I've had two quadrature encoders and two PWMs operating for over a week now without a single problem.  Yesterday, I added two more PWMs, and now only one of the quadrature encoder interfaces is working.  The other is completely ignoring the encoder inputs or, more likely, I can no longer read its encoder counter.  But that same interface has worked perfectly for the last week and I have made NO changes to the RTL, except adding the two additional PWMs.  All the PWMs work fine, as does the other identical encoder interface.

 

I'm quite confident the RTL is correct, and it ALWAYS simulates correctly, so about all I can figure is I am doing something wrong in the configuration of the tools, and it's perhaps not building for the right chip or something like that.  I have it configured for a xc3s500e-4vq100, and 32MHz clock at LOC P89.  The clock is defined in the UCF file as:

 

NET "CLK" TNM_NET = CLK;
TIMESPEC TS_CLKSPEC = PERIOD "CLK" 31.25ns HIGH 15.125 ns;

 

That is correct, right?  Is there anything else I could be doing wrong that could explain this odd behavior?

 

I did have one helluva time getting the Xilinx tools to install properly, as I'm running Win 8.1, but except for this problem, they now seem to be working fine.

 

Regards,

Ray L.


No motors or any other noise sources right now - just the FPGA, an Arduino, and the encoders.  Noise would be more random.  This is a very consistent, hard failure.  The Encoder 0, 2, and 3 logic works perfectly, only Encoder 1 is funky.  The logic is identical - a single module instantiated 4 times.  The decoding is also identical, except for two address bits.

 

Regards,

Ray L.


Here's a brief list of what I can think of... ask if you want any expanded on!

 

- Check all your build warnings. Is there anything unexplained? Especially messages about sensitivity lists or inferred latches. I've spent hours tracing things down only to discover that a warning had already told me what was going on.

 

- Do all your "human scale" signals get registered at least once before they are acted on? (without this different LUTs see the signal at different points in time)

 

- Do all fast signals get registered twice before they are acted on? (prevents metastability errors - the one in a million error when flip-flops don't settle down fast enough if the setup/hold times are violated)

 

- Have you defined your clock pin correctly? Can your design blink a LED?

 

- Actually pin mappings in general - look at the pin-out report and see what is and isn't located, and all the desired signals are on the desired pins.

 

- Async resets don't work very well unless you precondition the reset signal. If you don't do this, different parts of the design can be released from reset during different clock cycles. (Async assert, sync release.)

 

- Have you used the full/empty signals of a FIFO in the wrong clock domain? ("Full"s are on the input side, "Empty"s are on the read side).
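
The "registered twice" item in the list above can be sketched as a two-flop synchronizer. This is only a minimal illustration, not code from the design under discussion: the entity name sync_2ff and the port names async_in and sync_out are hypothetical, and it assumes a single clock domain and that the synthesis tool honors the signal initial values.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical two-stage synchronizer for one asynchronous input.
-- All names here are illustrative and do not come from the thread's design.
entity sync_2ff is
  port (
    clk      : in  std_logic;
    async_in : in  std_logic;  -- raw pin, may change at any time
    sync_out : out std_logic   -- intended for use inside the clk domain
  );
end entity sync_2ff;

architecture rtl of sync_2ff is
  -- Assumption: the tool maps these initializers to power-up values.
  signal stage1 : std_logic := '0';
  signal stage2 : std_logic := '0';
begin
  process (clk)
  begin
    if rising_edge(clk) then
      stage1 <= async_in;  -- may go metastable; only stage2 reads it
      stage2 <= stage1;    -- stage1 gets a full clock period to settle
    end if;
  end process;

  sync_out <= stage2;
end architecture rtl;
```

The point of the structure is that every consumer sees sync_out, never async_in, so all downstream logic sees the same value in the same clock cycle.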

 

And then the head-scratching really starts!


Can you send me the synthesis report? Can I also have access to your design?

 

Remember: simulation almost never "simulates" the real world. Are you modeling the bouncing properly for the encoders?

 

Alvie

 

 

Synthesis report is attached.  I don't see anything unexpected in there, certainly nothing that would explain the flaky behavior.

 

I'll see if I can zip up the whole design.

 

Regards,

Ray L.

synthesis report.txt


Here's a brief list of what I can think of... ask if you want any expanded on!

 

- Check all your build warnings. Is there anything unexplained? Especially messages about sensitivity lists or inferred latches. I've spent hours tracing things down only to discover that a warning had already told me what was going on.

No warnings that could possibly explain what I'm seeing.  No sensitivity list problems, NO latches.

 

- Do all your "human scale" signals get registered at least once before they are acted on? (without this different LUTs see the signal at different points in time)

Yes, the "important" inputs are synchronized and de-bounced - CS, Wr, Rd, encoder inputs.

 

- Do all fast signals get registered twice before they are acted on? (prevents metastability errors - the one in a million error when flip-flops don't settle down fast enough if the setup/hold times are violated)

Yes, typically through a chain of 3 flops.

 

- Have you defined your clock pin correctly? Can your design blink a LED?

It can do FAR more than that.  95% of the logic works flawlessly.  The remaining 5% is in completely different locations in different builds.  In the current build, there are four identical encoder interfaces, three of which work perfectly, the fourth does not work at all.  In previous builds the faulty logic was in completely different, and unrelated areas.  The encoder interface logic has not been changed in any way.  The encoder interfaces and PWMs have been running a complete servo system for the last week with zero problems.  It's only after I made a minor change to the PWM logic that the one encoder interface went wonky.

 

When I first designed it, the faulty logic was in the bus interface.  I had a VERY simple state machine that supported word reads/writes through a byte-wise bus interface.  It generated signals that kept track of whether the current access was an LSB or MSB access.  We're talking truly trivial logic here - I've personally designed multi-million gate ASICs, so this is not rocket science.  But, no matter what I did in the logic, the MSB signal ended up getting set prematurely.  I examined the gate-level logic after synthesis, and saw EXACTLY what I expected to see.  Nonetheless, it did not work when loaded into the chip.

 

I finally, out of frustration, expanded the bus interface so I passed the LSB/MSB signals in explicitly from the CPU on dedicated lines - it was the only way I could make the bloody thing work at all.  But, it has worked perfectly for the last week or more.  Yesterday I made a trivial change to the PWM logic, and now the one encoder interface no longer works.  I believe the problem is almost certainly that I cannot reliably read the registers for that one interface, rather than that the interface itself does not work.  But, I haven't yet proven that theory.  In any case, the encoder interface logic is identical to the other three (it's a single block instantiated four times), and it uses all the exact same interface logic, except for using a different decoding of the four address bits.  There is simply no logical reason for that one block to behave any differently from the other three, but it clearly does.

 

- Actually pin mappings in general - look at the pin-out report and see what is and isn't located, and all the desired signals are on the desired pins.

Everything is controlled by a common bus interface.  If there were a problem in the pins, it would affect ALL of the interfaces, not just one.  I obviously can't verify that the A/B inputs to that one encoder interface are correct, but even if they were not, I would still be able to read and write the registers, but I can't.

 

- Async resets don't work very well unless you precondition the reset signal. If you don't do this, different parts of the design can be released from reset during different clock cycles. (Async assert, sync release.)

I don't currently do a global synchronization on the reset, though you're right that I should.  However, the whole chip powers up in a disabled state, and does not actually do anything until it's programmed by the CPU.  So, what I see can't be explained by a reset issue.  The behavior is absolutely consistent for a given build, though different for different builds.  The behavior is as if there is a small block of defective logic in the chip that gets mapped to different parts of the logic on each build.  But, I don't believe that's really the problem, since it behaves the same on two different boards.  That's why I suspect either a synthesis or mapping error.

 

- Have you used the full/empty signals of a FIFO in the wrong clock domain? ("Full"s are on the input side, "Empty"s are on the read side).

There is only one clock domain - 32 MHz.

 

And then the head-scratching really starts!

I've been scratching my head for over a week now....


This thing is going to be the death of me!  In the process of trying to debug and figure out exactly WHAT was broken, I added some debug logic.  It didn't give me any useful information, so I removed it, and re-built.  Everything is now working properly, even though, in theory, nothing has changed.  That would seem to me to support my theory that there is some bizarre synthesis error happening.  Due to the Monte Carlo methods used in generating the logic and mapping, it is entirely possible for different builds of the same source code to generate different gates and mappings for functionally equivalent logic.  I can't see any other explanation for what's happening to me here.  It doesn't give me a warm and fuzzy feeling about the Xilinx tools....

 

Are others using Win 8.1 with ISE?  I had a really hard time getting a working install, so I have to wonder if there is some Win8 problem with the tools.

 

Regards,

Ray L.



 

I see...
 
Asynchronous Control Signals Information:
----------------------------------------
-----------------------------------------------------------+--------------------------+-------+
Control Signal                                             | Buffer(FF name)          | Load  |
-----------------------------------------------------------+--------------------------+-------+
Glue/DMux/RST_inv(Glue/WrStbGen/Rst_inv1_INV_0:O)          | NONE(Glue/DMux/DMux_0)   | 282   |
Glue/RdStbGen/EnaD1_and0000(Glue/RdStbGen/EnaD1_and00001:O)| NONE(Glue/RdStbGen/EnaD1)| 3     |
Glue/RdStbGen/EnaD1_and0001(Glue/RdStbGen/EnaD1_and00011:O)| NONE(Glue/RdStbGen/EnaD1)| 3     |
Glue/WrStbGen/EnaD1_and0000(Glue/WrStbGen/EnaD1_and00001:O)| NONE(Glue/WrStbGen/EnaD1)| 3     |
Glue/WrStbGen/EnaD1_and0001(Glue/WrStbGen/EnaD1_and00011:O)| NONE(Glue/WrStbGen/EnaD1)| 3     |
-----------------------------------------------------------+--------------------------+-------+

You want it to look more like this:

Clock Information:
------------------
-----------------------------------+------------------------+-------+
Clock Signal                       | Clock buffer(FF name)  | Load  |
-----------------------------------+------------------------+-------+
clk                                | BUFGP                  | 16    |
-----------------------------------+------------------------+-------+

Asynchronous Control Signals Information:
----------------------------------------
No asynchronous control signals found in this design

So you need to synchronise 4 more inputs, and maybe the reset too!


 


 

 

I don't really see a problem there.  The first is the asynchronous reset signal, which is only asserted when the logic is essentially quiescent.  After it's released, nothing happens until the CPU comes along and writes some registers.  While I agree reset should (and will) be synchronized, it can't explain the problems I'm seeing.

 

The other four are the first stages of the synchronizers on the Wr and Rd inputs.

 

Regards,

Ray L.


>> I've examined the generated gate-level logic, and it is exactly what I would expect (I've been doing chip design for 25+ years).

 

Hi,

 

Given the above, this seems unlikely, but I'll ask anyway (it's a typical beginner mistake):

Did you register all your off-chip inputs? Otherwise the same signal can appear differently at two points within the circuit. FPGAs are unforgiving towards this mistake - state machines derail, stars drop from the sky and generally bad things happen.

 

BTW, I've been debugging similar problems with my first UART... part of the FPGA experience :)



 

 

Yes, the off-chip signals are synchronized, other than ones, like the data bus, that have enormous setup/hold times.

 

The frustrating thing is it's now all working perfectly, even though I didn't change anything....

 

Regards,

Ray L.


Oops, just noticed this had already been checked.

 

I'd get rid of the reset, even though we're lacking a good theory of how exactly it would cause failure.

FPGAs have a defined initial state so it's perfectly acceptable, probably even more efficient, to omit it completely (took me some time to get used to that...).
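
As a rough sketch of what omitting the reset can look like: the counter below has no reset port and relies on a signal initializer for its power-up value. The entity and signal names (free_counter, count_r) are made up for illustration, and the sketch assumes the synthesis tool maps the initializer onto the register's configuration-time state, which is worth confirming in the synthesis report for a given tool and device.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical free-running counter with no reset input.
entity free_counter is
  port (
    clk   : in  std_logic;
    count : out std_logic_vector(7 downto 0)
  );
end entity free_counter;

architecture rtl of free_counter is
  -- Assumption: the initializer becomes the power-up value after configuration.
  signal count_r : unsigned(7 downto 0) := (others => '0');
begin
  process (clk)
  begin
    if rising_edge(clk) then
      count_r <= count_r + 1;  -- wraps from 255 back to 0
    end if;
  end process;

  count <= std_logic_vector(count_r);
end architecture rtl;
```

The trade-off is that the logic can only be returned to its initial state by reconfiguring the FPGA, so this fits registers that never need a run-time reset.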



 

Well, I do see many things unexpected:

    Register <LastA> equivalent to <A> has been removed
    Register <LastB> equivalent to <B> has been removed

According to the naming, I assume that LastA should not be A.
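
For comparison, here is a minimal sketch of the relationship the names suggest, where the "last" register trails the current value by one clock. It is not taken from the design in question: edge_detect, a_sync, last_a and a_change are hypothetical names, and a_sync is assumed to be already synchronized to clk. If two registers are instead loaded from the same source on the same clock edge, they always hold the same value and the synthesizer is free to merge them, which is what the message above reports.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical change detector for one already-synchronized input.
entity edge_detect is
  port (
    clk      : in  std_logic;
    a_sync   : in  std_logic;  -- assumed synchronous to clk
    a_change : out std_logic   -- high for one clock after a_sync changes
  );
end entity edge_detect;

architecture rtl of edge_detect is
  signal last_a : std_logic := '0';
begin
  process (clk)
  begin
    if rising_edge(clk) then
      last_a <= a_sync;  -- last_a holds the previous cycle's a_sync
    end if;
  end process;

  -- Differs from zero only while last_a has not yet caught up with a_sync.
  a_change <= a_sync xor last_a;
end architecture rtl;
```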

 

You also seem to be using asynchronous resets.

Asynchronous Control Signals Information:
----------------------------------------
-----------------------------------------------------------+--------------------------+-------+
Control Signal                                             | Buffer(FF name)          | Load  |
-----------------------------------------------------------+--------------------------+-------+
Glue/DMux/RST_inv(Glue/WrStbGen/Rst_inv1_INV_0:O)          | NONE(Glue/DMux/DMux_0)   | 282   |
Glue/RdStbGen/EnaD1_and0000(Glue/RdStbGen/EnaD1_and00001:O)| NONE(Glue/RdStbGen/EnaD1)| 3     |
Glue/RdStbGen/EnaD1_and0001(Glue/RdStbGen/EnaD1_and00011:O)| NONE(Glue/RdStbGen/EnaD1)| 3     |
Glue/WrStbGen/EnaD1_and0000(Glue/WrStbGen/EnaD1_and00001:O)| NONE(Glue/WrStbGen/EnaD1)| 3     |
Glue/WrStbGen/EnaD1_and0001(Glue/WrStbGen/EnaD1_and00011:O)| NONE(Glue/WrStbGen/EnaD1)| 3     |
-----------------------------------------------------------+--------------------------+-------+

Don't do this - make sure that at least the resets are synchronously de-asserted. Here's an example of how to do that:

signal rstin, rstout: std_logic;

process(clock, rstin)
begin
  if rstin='1' then
    rstout<='1';
  elsif rising_edge(clock) then
    rstout<='0';
  end if;
end process;

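
A commonly used variation on the same idea adds a second flip-flop, so that the released reset has a full clock period to settle before the rest of the design sees it. The following is a sketch under assumptions rather than a drop-in fix: the entity name reset_sync and its ports are hypothetical, and the reset is assumed to be active high.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical two-stage reset synchronizer:
-- asynchronous assert, synchronous release.
entity reset_sync is
  port (
    clk     : in  std_logic;
    rst_in  : in  std_logic;  -- asynchronous, active high (assumed)
    rst_out : out std_logic   -- asserts at once, releases on a clk edge
  );
end entity reset_sync;

architecture rtl of reset_sync is
  signal stage1 : std_logic := '1';
  signal stage2 : std_logic := '1';
begin
  process (clk, rst_in)
  begin
    if rst_in = '1' then
      stage1 <= '1';       -- assert both stages immediately
      stage2 <= '1';
    elsif rising_edge(clk) then
      stage1 <= '0';       -- release starts here
      stage2 <= stage1;    -- and reaches the output one clock later
    end if;
  end process;

  rst_out <= stage2;
end architecture rtl;
```

Every flip-flop driven by rst_out then leaves reset relative to the same clock edge, as long as rst_out meets timing like any other synchronous signal.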