So, ADD… well, let’s conclude first: it’s working now. Pretty flawlessly, too. And yes, as expected, all my frustrated random changes were fruitless and pointless; the code was more or less correct days ago.
So, let’s rehash today. I slept in as much as I could… I did plan on getting up at 9, but I couldn’t handle the cold at that point, so I settled for turning my heater on and going back to sleep. Thus, I actually got out of bed an hour later, once the ice had melted out of my room, and started on my way. I still had a bit of a headache from yesterday, so I took it easy… I didn’t get into uni until nearly two hours later, which is pretty lax even by my standards. Tony beat me there, but Rob did not – he rocked up nearly two hours later.
Anyway, I set about fighting with the damn board once more. I managed to snag the only real digital storage CRO in the room, which did wonders; I could actually get useful information out of this one, not just frustration. As it turns out, too, the CRO I was using yesterday was broken; it wasn’t just me. Jim Whittington (the lecturer) and one of the tech guys were fiddling with it for a while, before concluding it was bjorked.
So, with the decent CRO I was able to sample a random packet and decode it manually off the CRO. I then piped that packet’s data into a transceiver testbench in ModelSim, to verify that it was correct. It was – the CRC generated in simulation matched that in my real world sample, and the simulated receiver recognised the packet as valid. So the transmitter was working fine. That was progress, at least.
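In case that sounds fancier than it is: the testbench just replays the hand-decoded line states into the receiver. Something like this (a made-up sketch of the idea rather than our actual testbench – the sample values and timing here are placeholders):

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Replays a line capture, sampled off the CRO, into a simulated receiver.
entity tb_captured_packet is
end entity tb_captured_packet;

architecture sim of tb_captured_packet is
    -- Hand-decoded line states off the CRO (values here are made up; the
    -- real capture was a whole packet, start bits and CRC included)
    constant CAPTURE  : std_logic_vector(0 to 15) := "0110100110010110";
    constant HALF_BIT : time := 4 us;  -- half-bit period; match the capture rate
    signal   rx_line  : std_logic := '0';
begin
    -- The receiver under test would be instantiated here, with its serial
    -- input wired to rx_line, and its outputs checked for a valid packet.

    stimulus : process
    begin
        for i in CAPTURE'range loop
            rx_line <= CAPTURE(i);
            wait for HALF_BIT;
        end loop;
        wait;  -- end of capture; leave the line idle
    end process stimulus;
end architecture sim;
```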
The next step was to figure out where in the receiving half of the design the fault lay… I was pretty sure the higher level stuff was fine, since it’s all pretty trivial and has very little state, unlike the actual receiver itself. So I figured the receiver must be either broken, or getting the wrong data. I poked holes in my once elegant design to provide debug outputs for the CRO, and used these to verify that the receiver was seeing the right data from the MAX23-whatever RS-232 chip we’re using… I had been hoping that silly little chip had been at fault, but it didn’t seem so at this point…
So I then started pulling out other signals from the deep internal machinery, like the transmitter & receiver busy signals. These are active when the respective components are transmitting or receiving a packet. The transmitter looked fine; it seemed to make sense. The receiver, on the other hand, was just whacked – the busy signal was more or less random, compared to the real line state. In fact, after running for a while the receiver would get stuck in an endless busy state… I’m not even sure how that’s possible, even in hindsight.
Anyway, this gave me an epiphany – the receiver was reading garbage because it wasn’t getting the start of packets correct (and possibly the bit duration, too). Ah-hah! So, I then realised that I hadn’t accounted for lost clock transitions in my original implementation… it would just sit around forever waiting for exactly 49 clocks (47 data + 2 for the reference start bit). Thus, if a clock transition was lost, the receiver would sit at the end of the real packet, waiting for those lost few transitions – which would eventually arrive as part of the next packet. Oh dear. I figured this had to be the problem – random noise was losing clock transitions, and the receiver was thus losing sync with reality. This would seemingly explain why the receiver ever so rarely did actually pick up a real packet…
So I implemented a timeout scheme; if a clock isn’t detected within 125% of the reference clock period (i.e. it’s more than 25% late) the receiver aborts and goes back to idle. This still has the problem that the receiver will then see the next transition in the current packet as the start of the next packet, but this can be combated by adding an extra bit of spacing at the end of the packet; this ensures the receiver will always timeout at the end of the real packet, even if it’s not expecting it. Anyway, that’s a side issue.
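The scheme is simple enough to sketch – something along these lines (illustrative only, with made-up names rather than the actual project code; it assumes the measured reference period is available as a count of system clocks):

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity rx_timeout is
    port (
        sys_clk         : in  std_logic;
        receiving       : in  std_logic;              -- receiver is mid-packet
        transition_seen : in  std_logic;              -- pulses on each detected clock transition
        ref_period      : in  unsigned(15 downto 0);  -- measured reference period, in system clocks
        abort_rx        : out std_logic               -- force the receiver back to idle
    );
end entity rx_timeout;

architecture rtl of rx_timeout is
    signal count : unsigned(17 downto 0) := (others => '0');
begin
    process (sys_clk)
    begin
        if rising_edge(sys_clk) then
            abort_rx <= '0';
            if receiving = '0' or transition_seen = '1' then
                -- idle, or a transition arrived on time: restart the count
                count <= (others => '0');
            elsif count > resize(ref_period, count'length)
                          + resize(shift_right(ref_period, 2), count'length) then
                -- no transition within 125% of the reference period (i.e. it's
                -- more than 25% late): abandon the packet and resync
                abort_rx <= '1';
                count    <= (others => '0');
            else
                count <= count + 1;
            end if;
        end if;
    end process;
end architecture rtl;
```

With the extra end-of-packet spacing, this timeout always fires between packets, so the receiver is back in idle before the next packet's start bit arrives.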
After implementing this timeout… nothing changed. It didn’t help at all. Now I was perplexed. I had been so sure that was the issue, and so sure I could actually get out of that damn lab before dinner time. No such luck, of course.
So then the only thing I could figure from all that remained was that my receiver code was just really stuffed in some fundamental yet subtle way. I pored over it for ages, as did Tony – running his own testbench simulations. Nothing seemed out of the ordinary. Particularly bizarre was how well everything worked in simulation – adding delays and even dropping clocks & data saw the receiver still recover ideally. Yet on the board, complete and utter failure.
So my next step was to hack out a whole lot of code from the ingress block and just push out, in place of the remote counter value, the value of the clock divider detected by the receiver. Surprisingly, this showed that the receiver was in fact getting the correct clock period. It also showed that I wasn’t in fact transmitting at 6kbps as I’d tried to manually override earlier; I’d overridden the wrong signal. Whoops. Fixing that, I got it transmitting at the 6kbps I wanted, which eased debugging on the CRO.
But still, this only showed that one element was correct – not which element was wrong.
At that point, after hours and hours of abject frustration & misery, I finally caved in and sought help from Jim. After hearing the situation explained, he suggested that it was most likely an error in the receiver code. He suggested doing things like decomposing everything back to its fundamental blocks, and testing simple things like Manchester encoding some data, then decoding it – without all the extra overhead of the CRC and transceiver machinery. I wasn’t thrilled with that idea; it’s very difficult to extract particular elements of a VHDL design in HDL Designer. He also suggested, after I queried it, that it could be a timing issue – the receiver has some quite deep logic paths, which could perhaps be running too slowly and causing random failures. Given we’re running at 50MHz, this isn’t inconceivable. But then that’s probably why the Xilinx compiler spends a good portion of its run analysing the timing of each path… ’course, Jim seemed to think there could still be signals not settling in time for successive clocks… I don’t know how that works, but anyway, it wasn’t the issue this time around.
What really irks me is that I specifically asked Jim if we had to debounce the incoming serial signal, and he said no. Not a dead affirmative no, granted, but he seemed pretty confident that wasn’t an issue. Well…
Anyway, while I was talking with Jim I decided to try a loopback inside the FPGA itself, rather than outside on the serial pins. At first glance this looked like it was also failing – nothing worked. Of course, I had at this point completely forgotten that I’d disabled all the real packet decoding earlier in order to display the receiver’s clock divider value… damn! It was only half an hour or so later, after Jim had left and I was fighting with implementing a system clock divider, that I realised.
I then restored the original ingress code, and that was it… it worked. Oh, I could have wept tears of joy… it worked!
This was the pivotal moment. It showed that my receiver was perfectly functional even in hardware; it was the signal from the serial chip into the FPGA that was somehow causing problems. This doesn’t leave much room for argument – I’d verified quickly with the CRO that skew was not an issue, which left only bouncing. So, despite Jim’s assertion that no debouncing was necessary, I implemented a debouncer. I hooked back up to the external hardware loopback, and yes yes yes it still worked!
I can’t express how relieved I was at this point. Not happy; just relieved. The end was finally in sight. Sure, we had been supposed to demo the finished system hours before, but Jim has granted an extension until Tuesday next week, given that everyone was having difficulties yet working really hard, and all had at least something working.
It was then a relatively simple matter of re-enabling all the bits of code I’d disabled for testing, such as the automatic speed scaling. This caused a few glitches; I was allowing a transmit rate of up to 25MHz, despite the serial chip’s official limit of 250kHz… turns out the serial chip really is limited to that speed. So, I adjusted the auto-scaling to limit itself to about 250kHz or so, and that’s it – it’s good to go!
Note: While I say 250kHz, that’s the maximum transition rate; the actual clock (i.e. data) rate is half that, 125kHz. In Manchester encoding you potentially have transitions at twice the bit clock rate, since you squeeze in a transition for data in between each clock transition. So the real transmission rate is 250kHz, but the data rate is 125kHz.
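To put numbers on it: at a 125kHz bit clock each bit spans two 4µs half-bit slots, so the line can change state every 4µs – up to 250k transitions per second. An encoder of that shape is tiny; here’s an illustrative sketch (not our actual transmitter – in particular, mapping a ‘1’ to the mid-bit toggle is just an assumption for the example):

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity biphase_tx is
    port (
        half_clk : in  std_logic;  -- ticks twice per data bit (250kHz here)
        half_sel : in  std_logic;  -- '0' = bit boundary, '1' = mid-bit
        data_bit : in  std_logic;  -- current bit being sent
        tx       : out std_logic
    );
end entity biphase_tx;

architecture rtl of biphase_tx is
    signal line : std_logic := '0';
begin
    process (half_clk)
    begin
        if rising_edge(half_clk) then
            if half_sel = '0' then
                line <= not line;  -- clock transition: every bit boundary
            elsif data_bit = '1' then
                line <= not line;  -- data transition: squeezed in mid-bit
            end if;
        end if;
    end process;
    tx <= line;
end architecture rtl;
```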
At first I was debouncing over 100 system clock cycles (2µs), but I then tried just 10 (200ns) with equal success, and then settled on a mere 5 (100ns)… so it seems the “bouncing” is just transistor jitter due to the relatively slow transition of the serial signal. Damn.
I haven’t ascertained how reliable each debounce time is; now that it’s working I’ll add functionality for counting errors and so forth. Since we don’t transmit faster than 250kHz in any case (a limitation of the serial chip), the debounce can quite happily be the 100 system clocks I first tried, given that 250kHz corresponds to a period of 200 system clocks at 50MHz. So I’ll probably bump it back up to that later today when I get back to work on it.
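For what it’s worth, the debouncer is the usual counter affair – roughly this shape (simplified, with made-up names, not our exact code; in practice the raw input should also go through a two flip-flop synchroniser first):

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity debouncer is
    generic (
        DEBOUNCE_CYCLES : positive := 100  -- 2µs at 50MHz; even 5 (100ns) worked
    );
    port (
        sys_clk   : in  std_logic;
        noisy     : in  std_logic;         -- raw signal from the serial chip
        debounced : out std_logic
    );
end entity debouncer;

architecture rtl of debouncer is
    signal stable : std_logic := '0';
    signal count  : natural range 0 to DEBOUNCE_CYCLES := 0;
begin
    process (sys_clk)
    begin
        if rising_edge(sys_clk) then
            if noisy = stable then
                count <= 0;                -- input agrees with output: nothing pending
            elsif count = DEBOUNCE_CYCLES then
                stable <= noisy;           -- new level held long enough: accept it
                count  <= 0;
            else
                count <= count + 1;        -- new level seen, keep counting
            end if;
        end if;
    end process;
    debounced <= stable;
end architecture rtl;
```

Each accepted edge lags the real one by the debounce time, but 100 system clocks is only half of the 200-clock period at 250kHz, so no legitimate transition can get swallowed.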
So, I’m pleased that it’s all worked out, after all this time. The next step is to get all the relevant code copied into Rob’s part of the project, and get all that working properly.
I also want to add extra things, like some sort of acknowledgement protocol; at the moment it’s theoretically possible to get the system into a state where it pretty much freezes up. I haven’t seen it in practice, but knowing my luck it’d be just the thing to happen during the demonstration.
There’s also the issue of buffering… I haven’t done it at this point since I figure for a high enough transmission rate you don’t need it (i.e. provided you can transmit state changes on the buttons and switches faster than a human can toggle them), but I’m considering doing it anyway just for the technical kudos. Plus, I realised today while perusing the board’s data sheet that it has 1 meg of SRAM slapped on the back… awesome. That should be fast enough to use more or less directly with the transceiver, so maybe – just maybe – I’ll do some really awesome buffering. :D
Anyway, that’s work for later today, and next week. I’m taking the weekend off, one way or another… well, from uni, anyway. I’ve already promised to work for some of it, for better or worse – I do need money to survive, like everyone else, and especially with petrol once again stupidly expensive – so it won’t be entirely cruisy… but then I like my work anyway, so it’s no real loss.