For the past two weeks or so Rob & I have been developing the UAV software on the AVR, finally. As I’ve noted in a few previous entries. And it’s been just one giant bug-hunt, for the most part, but we’re making progress. Bluetooth works. We have my very own printf substitute that’s quite tiny… the ADC and other sensors are all working (aside from the magnetometre; we’re working on that). And the buttons now work… reading the datasheet would have been a good idea, to realise that you can’t read input pins via the PORTn registers, only PINn. Whoops. Possibly explains the difficulty we had with hardware flow control on the Bluetooth module… damn.
Anyway, we’re using ImageCraft’s C compiler for AVRs, version 7. It’s not too hideous so far as small-time, homebrew compilers go… it supports a liveable fraction of the C language, has reasonably good error messages, and is pretty literal – which results in really bloated binaries, but ones with predictable behaviour, even when you do use it’s “optimisations”. Unfortunately, it is still a hobbyist compiler… it’s very buggy. I’ve been slowly weeding out the ones that annoy us must and working around them, but it’s been a tedious process. One in particular today was where it simply refused to generate an instruction to read from a volatile I/O port. It’s volatile; that means you read from it when I damn well tell you to. You don’t go being shifty and saying “ah, well, he doesn’t use the result anyway, so we’ll just ignore that”. That would be what you do for non-volatile addresses. Gah!
Anyway, luckily in that case it turns out we didn’t need that particular read. I was trying to get SPI working, you see, and having some really bizarre problems. We’re using the SPI interface in master mode, generating our own clock, and using PC1 for the MMC/SD card’s CS, because I thought the dedicated SPI SS (Slave Select) pin was only used when the AVR was operating in slave mode. Whoops. Anyway, I was doing the usual stuff – spinning waiting for the bit flag to be set indicating the transfer was complete, then returning success or failure depending on the status of the Write Collision status bit. Unfortunately, this was being set. What the? I couldn’t figure out why that would happen… it happened on the first write, which makes no sense… how can a singleton collide? Anyway… I figured there was some weirdness going on there – you’re supposed to read the SPI status register followed by the SPI data register to clear it and whatnot, which I don’t much care to do – so I just started ignoring the flag. That got me slightly further… now it died on the second SPI write. Well, that’s much better… grrrr.
So, now it was spinning waiting for the SPI write to complete. Hmm. I spent a very long time on this… I even went through the schematics and over the board itself, checking that every SPI pin was correctly connected, and that all the software config was correct.
I eventually decided to read through the SPI section of the 2561 manual from start to finish, rather than just the (supposedly) relevant sections which I’d scanned intently a dozen times already. And there I discovered this sneaky little bastard:
“The master may continue to shift the next byte by writing into SPDR, or signal the end of the packet by pulling high the Slave Select, SS line.”
Hang on a second… SS does what now? But, I left SS as an input, ’cause I didn’t know what it was connected to and didn’t want to start driving it… but I didn’t bother with enabling the internal pull-up resister, since I didn’t realise it was doing anything. Which of course means the pin was floating. As it happens it’s attached to a nice long track which runs all the way over to the programming header, so we’ve got a fantastic antenna just casually soaking up whatever. Brilliant.
So, one pathetic change later, SS was an output set always to 0, and SPI works. Duh.
Progress is made, but at what cost… my sanity?
Anyway, a bunch more random issues ensued, and so forth… my life is fascinating, really… anyway, I finally got to the point where an actual proper error was occurring – one that’s handled appropriately, resulting from command failure from the SD card.
Now, this introduces the main problem… our code wasn’t fitting so well. Which is to say, my code wasn’t fitting… at about 4000 lines or so, it’s not trivial. The FAT stuff is disgusting, but it works, and I can’t make the kinds of off-hand optimisations I would for a PC target because optimisation is a different business for the AVR… anyway, the result is that if we try to build the SD stuff as-is, it doesn’t fit. Quite a bit over, in fact – even with the compiler’s “Code Compression” and “Global Optimisation” features enabled, it still was too fat. D’oh.
I think I discussed this previously. Anyway, long story short the workaround I came up with was to trim out the strings where possible… if NDEBUG is defined, our logging functions just spit out the file name and line number, without the message itself. That way you know what code path the error goes through, but you lose all the groovy extras like logging of variable values and whatnot. Plus it can be tedious going through the code manually to find out what a particular log point is.
But then, over the past week or so I’ve been refining my code – fixing up the last few problems and whatnot – and optimising it at the same time; I’d use ints pretty generically, which are about 4 times more expensive on the AVR than chars, so reducing their sizes made a noticeable difference (a few percent of overall size). And even this solution only enabled the code to fit if we used the Code Compression and Global Optimisation features of the compiler. Since we’re only using the trial version, we get 50 compiles each of these. (and they still count compiles that fail, which seems mean)
So that’s not really an ideal solution… it puts way too much pressure on you to scrutinise every single build, because you don’t want to waste those precious 50 compiles. Thus, I spent most of today building without these features, analysing the code by hand or in the simulator to guesstimate variable values, and then only if absolutely necessary building the full version. Not fun.
And then, I got caught off on a tangent trying to figure out a bizarre bug, where error names were being garbled. I have the standard err/strerror(err) suffix on all relevant messages, and while the error code was printed just fine (so I could still figure out what was happening), the strerror() string wasn’t. I figured it was a bug in my strerror(); I’d recently changed from using a manual switch()-based implementation, which is horribly large on the AVR given it’s terrible efficacy with switch statements – to a simple static array and subsequent indexing. I was a little unsure how well that’d go down, though… so I figured it was busted. But the code seemed fine… so I began to dig around in the assembly the compiler generated from it. Unfortunately the AVR instruction set really sucks, and the documentation matches that general level of suckiness, so this is surprisingly difficult. Nonetheless, after a rather long time (half an hour or so) I finally concluded that the assembly did indeed seem correct.
So, I turned to the simulator and some test log messages. I was most surprised to see that the test messages worked. Huh.
I realised fairly quickly why – the difference was that the test messages just printed the result of strerror(), whereas the real messages printed err/strerror()… hmm. So, I dug into the assembly for some random invocation of a logging function, and discovered that – despite the fact that the error type is defined as an unsigned char – the compiler is pushing it onto the stack as an int. Nonetheless, the va_arg macros have no trouble popping off only a single byte when told to retrieve an unsigned char. So when my format-parser pops the next argument off the stack, it’s misaligned and consequently utterly mangled. Damn.
I’m not sure how I’ll overcome that yet, but the point is that while I was doing this, I wanted to see what the address of the strings was in Flash memory, so I could try to figure out why the pointer to them was being mangled. It was while scrolling through the Flash memory listing in the simulator that I noticed an odd thing… the code finished at 0x7213 (which is a word address, so twice that in bytes). That’s less than 2^15. 11% less, in fact. And guess how much space ICC AVR7 said was left when I compiled that binary?.. 11%. Coincidence? I think not…
Long story short, after some googling I found that ICC AVR7 supports 64k of Flash in the standard version, 128k in the advanced, and 256k in the pro. Although we get to try the code compression & optimisation which aren’t in the standard version, we are otherwise limited to the functionality of the standard version.
‘course, I found a neat hack, courtesy of https://www.ethernut.de/en/documents/atmega2561.html. You see, the “standard” compiler defines the func_lit segment as spanning to no further than 0x10000, despite recognising that the 2561 runs up to 0x40000. But if you respecify the func_lit segment manually, you can make it whatever size you want.
So, in the Compiler Options window, under the Target tab, in the Advanced box, there’s the “Other options” field… add “-bfunc_lit:0xcc.0x40000” to it. Voilá, instant 4x boost to our available space. I can’t see any problems with it as yet, although I’ve only programmed the board a few times since doing this.
So we can ditch those stupid optimisation options, and finally have proper debugging all the time. :D
And now I can finally get back to debugging our debugging messages, so that then I can get back to actually debugging real code. I only wish I was kidding.
I made the joke when talking to Rob earlier: “Error: error trying to retrieve errors. You may have too many errors.” It’s an old one, I know. He loved it nonetheless, and suggested we put in similarly sarcastic and patronising error messages… I think I’ll leave Rob to implement that part. :P
Not that I haven’t come up with some creative log messages in my time… there’s those shifty Smurfs, for a start… and of course the perennial favourite comment “// this is a bug”. Ah yes. :)