5 minute interrupt controller bug chase and fix with Simics

The problem

I was writing and testing the interrupt processing code for a real-time hypervisor on the MPC8641 Multi-core PowerPC SoC.

During testing, I hit a bug: the system would not take any more interrupts after the highest priority interrupt got serviced.

Since I was debugging on the Wind River Simics virtual platform, debugging the problem was pretty straightforward. I’ve had this exact kind of bug on a hardware platform before (ARM9-based) and it had taken me hours to debug.

With Simics, I have full system-level introspection, forward/reverse execution, full memory-space access and an infinite number of breakpoints. Since the platform is virtual, problems like interrupt debugging are unaffected by the passage of time, which also helps, although it wasn’t a problem in this case.

In the solution part of this post, I show how I found the bug and tested a fix. The video is a re-enactment (I wasn’t recording while I was working), but it shows the exact steps I took, in less than 5 minutes.

About the problem context

Classic PowerPCs have a 32-entry exception table. Of these, a single entry (offset 0x0500) serves all external interrupts.

The MPC8641 has an OpenPIC-compliant peripheral interrupt controller (PIC) that makes it easy to handle a large number of interrupts from that single exception table entry. Each interrupt source has a vector identifier, a destination control and a priority field. Automatic nesting control is provided by the PIC.

Basically, the control flow of an exception with that setup is

  • CPU core is interrupted by PIC because a source is ready
  • Interrupt processing vectors to 0x00000500 or 0xfff00500, depending on MSR[IP].
  • We handle the interrupt:
    • Save enough context to make the system recoverable and permit
    • Read the IACK register of the PIC to acknowledge the interrupt and get its vector
    • Re-enable interrupts by setting MSR[EE]
    • Finish saving context
    • → Jump to interrupt-specific handler
    • ← Return from handler
    • Write 0 to the PIC’s EOI register to signal the end of processing of the highest-priority interrupt
    • Restore context
    • Return from interrupt
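
The steps above can be sketched in C roughly as follows. This is a host-side illustration, not the hypervisor's actual handler (which is PowerPC assembly): pic_ops, external_interrupt, fake_pic and demo_one_interrupt are hypothetical names, and the steps that need real hardware (setting MSR[EE], saving and restoring context, returning from the interrupt) appear only as comments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register-access hooks. On real hardware these would be
 * loads/stores to the PIC's memory-mapped IACK and EOI registers. */
typedef struct {
    uint32_t (*read_iack)(void *ctx);             /* acknowledge, returns vector */
    void     (*write_eoi)(void *ctx, uint32_t v); /* signal end of interrupt */
    void     *ctx;
} pic_ops;

typedef void (*isr_fn)(uint32_t vector);

/* External-interrupt path, entered after the minimal context save. */
void external_interrupt(const pic_ops *pic, const isr_fn *table, size_t n)
{
    uint32_t vector = pic->read_iack(pic->ctx);   /* READ IACK: ack + vector */
    /* Real code would set MSR[EE] here and finish saving context. */
    if (vector < n && table[vector] != NULL)
        table[vector](vector);                    /* interrupt-specific handler */
    pic->write_eoi(pic->ctx, 0);                  /* WRITE 0 to EOI */
    /* Real code would restore context and return from interrupt here. */
}

/* --- Fake PIC, used only to exercise the sketch on a host machine --- */
typedef struct {
    uint32_t next_vector;  /* vector the fake IACK read will return */
    unsigned iack_reads;   /* number of IACK reads seen */
    unsigned eoi_writes;   /* number of EOI writes seen */
} fake_pic;

static uint32_t fake_read_iack(void *ctx)
{
    fake_pic *p = ctx;
    p->iack_reads++;
    return p->next_vector;
}

static void fake_write_eoi(void *ctx, uint32_t v)
{
    fake_pic *p = ctx;
    (void)v;
    p->eoi_writes++;
}

static int last_handled = -1;  /* vector seen by the demo handler, -1 if none */

static void demo_isr(uint32_t vector)
{
    last_handled = (int)vector;
}

/* Pushes one interrupt (vector 3) through the flow and returns the vector
 * that reached the handler. */
int demo_one_interrupt(void)
{
    fake_pic fake = { 3, 0, 0 };
    pic_ops ops = { fake_read_iack, fake_write_eoi, &fake };
    isr_fn table[8] = { 0 };

    table[3] = demo_isr;
    last_handled = -1;
    external_interrupt(&ops, table, 8);

    assert(fake.iack_reads == 1);  /* exactly one acknowledge */
    assert(fake.eoi_writes == 1);  /* exactly one end-of-interrupt */
    return last_handled;
}
```

Routing the register accesses through function pointers is only there so the sequence can be run and checked on a desktop machine; a real handler would touch the registers directly.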

That’s a lot of steps, and a lot can go wrong :).

The solution

Simics gave me a pretty good idea of the cause right away, simply through its event logging:

[pic spec-viol] Write to read-only register IACK0 (value written = 0x0).

Wow, thanks 🙂 ! That’s a good starting point. No way a development board would have told me that…

The video shows how I narrowed it down to a write to IACK where there should have been a write to EOI, just before context restoration. I used reverse execution and on-line code patching to get the job done.

After the bug was found, I identified the problem in the actual source code:

The EOI setting macro was:

#define EOI_CODE(z,r)                                                  \
    li      z,0;                                                        \
    lis     r,(CCSR_BASE+BOARD_PIC_BASE+BOARD_PIC_IACK_OFFSET)@h;       \
    ori     r,r,(CCSR_BASE+BOARD_PIC_BASE+BOARD_PIC_IACK_OFFSET)@l;     \
    stw     z,0(r)

when it should have been:

#define EOI_CODE(z,r)                                                  \
    li      z,0;                                                        \
    lis     r,(CCSR_BASE+BOARD_PIC_BASE+BOARD_PIC_EOI_OFFSET)@h;       \
    ori     r,r,(CCSR_BASE+BOARD_PIC_BASE+BOARD_PIC_EOI_OFFSET)@l;     \
    stw     z,0(r)

It was simply a cut-and-paste error from the IACK macro, but one that would have been pretty nasty to find using just a JTAG debugger.
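
To see why a misdirected EOI produces exactly this symptom, here is a toy model in C. It is a sketch under assumptions, not the MPC8641 PIC's real register behavior: it assumes the usual OpenPIC-style rule that an interrupt is delivered only if its priority is higher than that of the highest interrupt still in service, and every name in it (toy_pic, toy_read_iack, toy_write_eoi, toy_write_iack, toy_next_interrupt_delivered) is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Toy PIC: one bit per priority level (0..15), at most one source each. */
typedef struct {
    uint16_t pending;          /* bit p set: a priority-p source is requesting */
    uint16_t in_service;       /* bit p set: priority p acknowledged, no EOI yet */
    unsigned spec_violations;  /* counts writes to the read-only IACK */
} toy_pic;

/* Index of the highest set bit, or -1 if no bit is set. */
static int highest_bit(uint16_t bits)
{
    for (int p = 15; p >= 0; p--)
        if (bits & (1u << p))
            return p;
    return -1;
}

void toy_raise(toy_pic *pic, int prio)
{
    assert(prio >= 0 && prio < 16);
    pic->pending |= (uint16_t)(1u << prio);
}

/* The core is interrupted only if the best pending priority beats the
 * highest priority still in service. */
int toy_int_asserted(const toy_pic *pic)
{
    return highest_bit(pic->pending) > highest_bit(pic->in_service);
}

/* Reading IACK moves the winning interrupt from pending to in-service.
 * Returns its priority, or -1 if nothing is deliverable. */
int toy_read_iack(toy_pic *pic)
{
    if (!toy_int_asserted(pic))
        return -1;
    int p = highest_bit(pic->pending);
    pic->pending &= (uint16_t)~(1u << p);
    pic->in_service |= (uint16_t)(1u << p);
    return p;
}

/* Writing EOI retires the highest-priority in-service interrupt. */
void toy_write_eoi(toy_pic *pic)
{
    int p = highest_bit(pic->in_service);
    if (p >= 0)
        pic->in_service &= (uint16_t)~(1u << p);
}

/* Writing IACK changes nothing; the model only records the violation. */
void toy_write_iack(toy_pic *pic)
{
    pic->spec_violations++;
}

/* Services a priority-15 interrupt, ends it with either the correct EOI
 * write or the buggy IACK write, then raises a priority-4 interrupt.
 * Returns 1 if that next interrupt would reach the core, 0 otherwise. */
int toy_next_interrupt_delivered(int buggy_eoi)
{
    toy_pic pic = { 0, 0, 0 };

    toy_raise(&pic, 15);
    int first = toy_read_iack(&pic);
    assert(first == 15);

    if (buggy_eoi)
        toy_write_iack(&pic);  /* what the broken EOI_CODE macro did */
    else
        toy_write_eoi(&pic);   /* what the fixed macro does */

    toy_raise(&pic, 4);
    return toy_int_asserted(&pic);
}
```

In this model, toy_next_interrupt_delivered(0) returns 1 and toy_next_interrupt_delivered(1) returns 0: with the buggy write, the highest-priority interrupt is never retired, so everything raised afterwards stays masked behind it.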

2 thoughts on “5 minute interrupt controller bug chase and fix with Simics”

  1. That was a really neat way of using Simics! The bug was pretty pernicious, especially since it would look OK in the source of the interrupt handler… you might have been able to set up a source-code view for the assembly language, but that really does not give you much in terms of understanding the problem.

  2. It would indeed look OK in the interrupt handler source code since the problem was in the “EOI_CODE” macro. It’s interesting that CDT under Eclipse and Visual Studio both collapse macros by default. I couldn’t even SEE that I had made a mistake until I expanded the macro! So much for ease of code reviews.

    One thing is for sure: using a lot of preprocessor macros helps in making interrupt handling code more generic (no need to recopy all the prologue and epilogue instructions), but sometimes there are pitfalls, especially when espousing a Linux-like interrupt template macro system where about 5 macros are used for every interrupt (prologue, epilogue, template and template conditionals for each template type).
