So, another day, another few hours of my life forever wasted. Today’s ADD bug was particularly sinister: the seven-segment display was showing all the digits on top of each other, as I noted previously. The trick with this quad-digit display is that it only has one set of segment inputs; you select which digits, if any, are lit at any given moment via 4 enable lines. Now, this is a pretty standard thing on all the Xilinx boards we’ve used to date. We’ve done this before, twice.
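For anyone who hasn’t met these displays, the idea is to scan the digits fast enough that persistence of vision fills in the rest. Something roughly like the following sketch, written here in VHDL; the entity, port names, widths and polarities are all just my illustration, not our actual code:

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity display_scan is
        port (
            clk_scan : in  std_logic;                      -- slow scan clock, ~1 kHz or so
            digits   : in  std_logic_vector(15 downto 0);  -- four 4-bit values to show
            segments : out std_logic_vector(6 downto 0);   -- shared segment lines, g downto a
            enables  : out std_logic_vector(3 downto 0)    -- one enable per digit
        );
    end entity display_scan;

    architecture rtl of display_scan is
        signal sel    : unsigned(1 downto 0) := (others => '0');
        signal nibble : std_logic_vector(3 downto 0);
    begin
        -- Step through the four digits; each one gets a quarter of the scan period.
        process (clk_scan)
        begin
            if rising_edge(clk_scan) then
                sel <= sel + 1;
            end if;
        end process;

        -- Route the selected digit's value onto the single, shared set of segment inputs.
        with sel select nibble <=
            digits(3 downto 0)   when "00",
            digits(7 downto 4)   when "01",
            digits(11 downto 8)  when "10",
            digits(15 downto 12) when others;

        -- One-hot enable for the digit currently being driven
        -- (active-high here, which, as it turns out, is exactly the wrong assumption on this board).
        with sel select enables <=
            "0001" when "00",
            "0010" when "01",
            "0100" when "10",
            "1000" when others;

        -- Plain hex-to-seven-segment decode, segments active-high, bit 0 = segment a.
        with nibble select segments <=
            "0111111" when "0000",  -- 0
            "0000110" when "0001",  -- 1
            "1011011" when "0010",  -- 2
            "1001111" when "0011",  -- 3
            "1100110" when "0100",  -- 4
            "1101101" when "0101",  -- 5
            "1111101" when "0110",  -- 6
            "0000111" when "0111",  -- 7
            "1111111" when "1000",  -- 8
            "1101111" when "1001",  -- 9
            "0000000" when others;  -- blank anything non-decimal
    end architecture rtl;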
Hence my surprise that I was evidently screwing up the display code. Even weirder was that it was virtually identical to the code that worked fine for the ADD assignment naught but a month or so ago. Perplexed, I spent half the afternoon trying everything to figure out where I’d messed up. But it just didn’t make sense.
I eventually started to consider that perhaps it wasn’t an error on my part at all. My first instinct was to try removing the pin assignment for one of the enable lines. Lo and behold, absolutely nothing changed! So I removed all the enable lines… still no change. I then went off on a tangent for half an hour trying to figure out why the brain-dead Xilinx compiler was retaining stale code (which, I’ll add, it often does). But no, even that didn’t seem to be the case. I still have no idea how it worked; I can only presume it randomly guessed which pin should be assigned to each enable line, and managed to get all four right (hey, it’s only about a 1 in 4 billion chance).
Anyway, after finding that unassigning the system clock had the satisfactory result of disabling the whole board, as expected, I moved on to alternative lines of inquiry. After some reading I discovered that, for reasons completely incomprehensible to the mortal mind, Digilent (makers of the board we’re using) had decided to invert the enable lines. But of course, how stupid of me; why should I think they would implement a straightforward, trivial and extremely common circuit in exactly the same way as everyone else? Bah, what silliness that would be.
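In terms of the sketch earlier, the fix is a one-liner: if the board’s digit enables are active-low, as these apparently are, the one-hot pattern just gets inverted, so a '0' lights a digit and a '1' blanks it. Again, purely illustrative:

    -- Enables are active-low on this board: drive '0' to light a digit, '1' to blank it.
    with sel select enables <=
        "1110" when "00",
        "1101" when "01",
        "1011" when "10",
        "0111" when others;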
And all that after spending forever simulating the display driver, too. ModelSim is disturbingly slow… Rob & I took regular breaks while my simulations and his compiles ran.
So, it turns out my display code was flawless right from the very start. Gah!
At least that let me get onto the next problem: why 99.9% of packets were being ignored, and why, of those that did get through, many were corrupt… despite the fact we use a 16-bit CRC over the 32-bit payload. Surely in this case my CRC calculation must be going haywire, right?
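For reference, the general shape of a 16-bit CRC over a 32-bit payload is just a bit-serial shift-and-XOR. The sketch below uses the CCITT polynomial (x^16 + x^12 + x^5 + 1, i.e. 0x1021) with a zero seed purely as an example; I’m not claiming it’s our exact polynomial or code, and of course the transmitter and receiver have to agree on every one of those details or every packet fails the check:

    library ieee;
    use ieee.std_logic_1164.all;

    entity crc16_calc is
        port (
            payload : in  std_logic_vector(31 downto 0);
            crc     : out std_logic_vector(15 downto 0)
        );
    end entity crc16_calc;

    architecture comb of crc16_calc is
        -- Bit-serial CRC-16 update: MSB first, zero initial value, polynomial 0x1021.
        function crc16_of (data : std_logic_vector) return std_logic_vector is
            variable reg : std_logic_vector(15 downto 0) := (others => '0');
            variable fb  : std_logic;
        begin
            for i in data'high downto data'low loop
                fb  := reg(15) xor data(i);
                reg := reg(14 downto 0) & '0';
                if fb = '1' then
                    reg := reg xor x"1021";
                end if;
            end loop;
            return reg;
        end function;
    begin
        crc <= crc16_of(payload);
    end architecture comb;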
Well, long story short, I don’t know yet, but I’m pretty sure that’s not the case. The first CRO I hooked the board up to, to see what the real serial output looked like, was working great until the computer I was using fried itself. Literally. That burnt-electronics smell ambushed the whole room, and all the student engineers in it started scampering around trying to figure out what had blown up and, most importantly, who was to blame (an interesting application of the travelling salesman problem, where the answer is never yourself). I never thought to check the computer; I was thinking I might have shorted something on the board with the CRO probes. No such luck, luckily. No, probably just a blown power supply. That computer’s not going to be doing much for a while. That did cheer me up, a little.
Anyway, so, on to the next CRO… this one purported to be a digital sampling CRO, which I think is some kind of code for even-crappier-than-usual-CRO-with-pixellated-output. Hooray. It didn’t have an LCD; instead it seemed to be digitising its inputs and rendering them on the normal CRT. Yikes.
And of course it didn’t work properly. I could get the transmission up on the display (after much cursing and frustration), but I couldn’t get it to trigger properly, even after giving it a dedicated trigger output from the FPGA. So I couldn’t see clearly what was going on. There was also the issue of scale: trying to cram 50 bits of data into 10 cm of CRO screen is hopeless at best. Thus, failure. That wasted a good few hours of my time. But I was on such a roll at that point, it seemed only fitting to keep throwing them away.
So, back to the politically correct method for debugging an FPGA: randomly modifying your code and enduring the 10-minute turnaround time for each and every test change. Hooray.
Luckily I guessed pretty well; disabling the CRC validation showed immediately that packets were flowing as expected, but were being received as garbage. Great. It’s unlikely to be a noise issue, but I’ll have to investigate that tomorrow anyway, since at this point I don’t have many other ideas. In simulation my receiver works flawlessly. I went to great effort to stress-test it in simulation specifically so I wouldn’t be in this situation now.
Nonetheless, I started simulating again… trying to test if perhaps my algorithm was sensitive to skew – I’m pretty sure it isn’t, since of course I designed it with skew in mind as the biggest problem – but that showed no problems either. I then verified the algorithm by hand, and still can’t fault it.
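For the curious, “testing for skew” in simulation just means driving the receiver’s serial input with a bit period that’s deliberately a few percent off nominal and checking whether the sampling point drifts out of the bit cell before the packet ends. A rough testbench sketch along those lines; the bit rate, packet length and names here are made-up illustrations, not our real testbench:

    library ieee;
    use ieee.std_logic_1164.all;

    entity rx_skew_tb is
    end entity rx_skew_tb;

    architecture sim of rx_skew_tb is
        -- All numbers here are illustrative, not our real line rate or frame format.
        constant NOMINAL_BIT : time    := 1 us;                    -- ideal bit period
        constant SKEWED_BIT  : time    := NOMINAL_BIT * 103 / 100; -- transmitter running 3% slow
        constant PACKET_BITS : integer := 50;                      -- roughly a whole frame
        signal   rx_line     : std_logic := '1';                   -- idle high
    begin
        -- The receiver under test would be instantiated here, running off its own
        -- nominal-rate clock and sampling rx_line. Omitted for brevity.

        stimulus : process
        begin
            wait for 5 * NOMINAL_BIT;            -- a little idle time first
            for i in 0 to PACKET_BITS - 1 loop
                -- Arbitrary data pattern; the interesting part is the timing, not the bits.
                if (i mod 3) = 0 then
                    rx_line <= '0';
                else
                    rx_line <= '1';
                end if;
                wait for SKEWED_BIT;             -- every bit cell stretched by the skew
            end loop;
            rx_line <= '1';                      -- back to idle
            wait;
        end process stimulus;
    end architecture sim;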
I would normally assume it’s a problem in my code, but given the track record on this project thus far, the odds of that are surprisingly slim. Tomorrow morning I get to jump up all jolly and happy that it’s fifteen below and try yet again to make sense of it. Hooray.