I was writing and testing the interrupt processing code for a real-time hypervisor on the MPC8641 Multi-core PowerPC SoC.
During testing, I hit a bug: the system would not take any more interrupts after the highest-priority interrupt was serviced.
Because I was working on the Wind River Simics virtual platform, tracking the problem down was pretty straightforward. I’ve had this exact kind of bug on a hardware platform before (ARM9-based), and it took me hours to debug.
With Simics, I have full system-level introspection, forward/reverse execution, full memory-space access and an infinite number of breakpoints. Since the platform is virtual, problems like interrupt debugging are unaffected by the passage of time, which also helps, although it wasn’t a problem in this case.
In the solution part of this post, I show how I found the bug and tested a fix. The video is a re-enactment (I wasn’t recording while I was working), but it shows the exact steps I took, in less than 5 minutes.
About the problem context
Classic PowerPCs have a 32-entry exception table. Of these entries, a single one (offset 0x0500) handles all external interrupts.
The MPC8641 has an OpenPIC-compliant peripheral interrupt controller (PIC) that makes it easy to handle a large number of interrupts from that single exception table entry. Each interrupt source has a vector identifier, a destination control and a priority field. Automatic nesting control is provided by the PIC.
Basically, the control flow of an exception with that setup is:
- CPU core is interrupted by PIC because a source is ready
- Interrupt processing vectors to 0x00000500 or 0xfff00500, depending on MSR[IP].
- We handle the interrupt:
- Save enough context to make the system recoverable and permit
- Read the IACK register of the PIC to acknowledge the interrupt and get its vector
- Re-enable interrupts by setting MSR[EE]
- Finish saving context
- → Jump to interrupt-specific handler
- ← Return from handler
- Write 0 to the PIC’s EOI register to signal the end of processing of the highest-priority interrupt
- Restore context
- Return from interrupt
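The acknowledge, dispatch and end-of-interrupt steps above can be sketched in C against a toy software model of the PIC. Everything named here is an assumption made for illustration: `toy_pic`, `toy_read_iack`, `toy_write_eoi`, the in-service stack and the `TOY_SPURIOUS` value are hypothetical and much simpler than a real OpenPIC. The real handler is PowerPC assembly that also saves and restores context and sets MSR[EE], which this sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NUM_SOURCES 4
#define TOY_MAX_NESTING 16
#define TOY_SPURIOUS    0xFFFFu  /* hypothetical "nothing to deliver" vector */

/* One interrupt source in the toy model (not the real register layout). */
struct toy_source {
    uint16_t vector;    /* identifier handed back by an IACK read */
    uint8_t  priority;  /* higher value = more urgent; 0 = never delivered */
    int      pending;   /* nonzero while the source is asserting */
};

struct toy_pic {
    struct toy_source src[TOY_NUM_SOURCES];
    uint8_t in_service[TOY_MAX_NESTING];  /* priorities being serviced */
    int     depth;                        /* current nesting depth */
};

/* Priority currently in service, or 0 when the PIC is idle. */
static uint8_t toy_current_priority(const struct toy_pic *pic) {
    return pic->depth > 0 ? pic->in_service[pic->depth - 1] : 0;
}

/* Model of an IACK read: acknowledge the most urgent deliverable source. */
static uint16_t toy_read_iack(struct toy_pic *pic) {
    struct toy_source *best = NULL;
    for (int i = 0; i < TOY_NUM_SOURCES; i++) {
        struct toy_source *s = &pic->src[i];
        if (s->pending && s->priority > toy_current_priority(pic) &&
            (best == NULL || s->priority > best->priority)) {
            best = s;
        }
    }
    if (best == NULL) {
        return TOY_SPURIOUS;
    }
    assert(pic->depth < TOY_MAX_NESTING);
    best->pending = 0;
    pic->in_service[pic->depth++] = best->priority;
    return best->vector;
}

/* Model of an EOI write: retire the highest-priority in-service interrupt. */
static void toy_write_eoi(struct toy_pic *pic) {
    if (pic->depth > 0) {
        pic->depth--;
    }
}

typedef void (*toy_handler)(uint16_t vector);

/* Body of the external-interrupt entry, minus context save/restore. */
static int toy_external_interrupt(struct toy_pic *pic, toy_handler handler) {
    uint16_t vector = toy_read_iack(pic);  /* acknowledge and fetch vector */
    if (vector == TOY_SPURIOUS) {
        return 0;                          /* nothing to service */
    }
    handler(vector);                       /* interrupt-specific handler */
    toy_write_eoi(pic);                    /* signal end of interrupt */
    return 1;
}

/* Tiny demonstration handler: remembers the last vector it was given. */
static uint16_t toy_last_vector;
static void toy_record_vector(uint16_t vector) {
    toy_last_vector = vector;
}
```

With two pending sources at priorities 5 and 3, successive calls to `toy_external_interrupt` dispatch the priority-5 vector first and the priority-3 vector second, and a third call reports that there is nothing left to service.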
That’s a lot of steps, and a lot can go wrong :).
Thanks to Simics, I got a pretty good idea of what was going wrong, simply from its event logging:
[pic spec-viol] Write to read-only register IACK0 (value written = 0x0).
Wow, thanks 🙂 ! That’s a good starting point. No way a development board would have told me that…
The video shows how I narrowed it down to a write to IACK where there should have been a write to EOI, just before context restoration. I used reverse execution and on-line code patching to get the job done.
After the bug was found, I identified the problem in the actual source code:
The EOI setting macro was:
    #define EOI_CODE(z,r) \
        li z,0; \
        lis r,(CCSR_BASE+BOARD_PIC_BASE+BOARD_PIC_IACK_OFFSET)@h; \
        ori r,r,(CCSR_BASE+BOARD_PIC_BASE+BOARD_PIC_IACK_OFFSET)@l; \
        stw z,0(r)
when it should have been:
    #define EOI_CODE(z,r) \
        li z,0; \
        lis r,(CCSR_BASE+BOARD_PIC_BASE+BOARD_PIC_EOI_OFFSET)@h; \
        ori r,r,(CCSR_BASE+BOARD_PIC_BASE+BOARD_PIC_EOI_OFFSET)@l; \
        stw z,0(r)
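It was simply a cut-and-paste error from the IACK-reading macro, but one that would have been pretty nasty to find using just a JTAG debugger. Because EOI was never written, the PIC still considered the highest-priority interrupt to be in service and held off every interrupt at or below that priority.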
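To see why a one-word difference in the macro produces that symptom, here is a minimal C sketch of a PIC that tracks a single in-service priority. It is a toy under stated assumptions: `mini_pic`, `mini_pic_write`, the `MINI_*_OFFSET` values and the violation counter are all hypothetical names and numbers, not the board's real offsets or the Simics device model.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register offsets for the toy model; not taken from the post. */
enum { MINI_IACK_OFFSET = 0xA0, MINI_EOI_OFFSET = 0xB0 };

struct mini_pic {
    uint8_t  in_service_priority;  /* 0 = nothing in service */
    unsigned spec_violations;      /* count of writes to read-only registers */
};

/* Stand-in for an IACK read: the given priority becomes in service. */
static void mini_take_interrupt(struct mini_pic *pic, uint8_t priority) {
    assert(priority > pic->in_service_priority);
    pic->in_service_priority = priority;
}

/* Register write as seen by the toy PIC. */
static void mini_pic_write(struct mini_pic *pic, uint32_t offset,
                           uint32_t value) {
    (void)value;  /* the value written does not matter in this model */
    if (offset == MINI_EOI_OFFSET) {
        pic->in_service_priority = 0;   /* EOI retires the interrupt */
    } else if (offset == MINI_IACK_OFFSET) {
        pic->spec_violations++;         /* read-only: write is ignored */
    }
}

/* A source is deliverable only above the in-service priority. */
static int mini_can_deliver(const struct mini_pic *pic, uint8_t priority) {
    return priority > pic->in_service_priority;
}
```

Taking a priority-15 interrupt and then calling `mini_pic_write` with `MINI_IACK_OFFSET` leaves priority 15 in service, so `mini_can_deliver` rejects everything afterwards. The same sequence with `MINI_EOI_OFFSET` clears it and lower-priority sources get through again.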