
Interesting bugs we have to deal with as system and BIOS developers :)

sergey

Recently I got a bug report on the 8088 BIOS. I'll describe the debugging procedure and other interesting findings for everyone's enjoyment.

The bug reporter observed that the SysChk utility hangs on a system with 8088 BIOS but works OK on the same system with GLaBIOS, so it does appear to be BIOS related.
Symptoms seem to be related to the UART/Serial port and the interrupt controller (PIC).
  • I compared the PIC and UART initialization between my implementation, GLaBIOS, and the original IBM XT BIOS, and did not find anything that would explain the hang... While the initialization procedures are slightly different, they result in the same configuration. I tried to find an IBM XT emulator that allows some degree of logging of I/O port accesses to better understand any differences, but I couldn't find anything suitable without investing too much time... DOSBox, if anything, is not one of these emulators...
  • I decided to determine where in the code SysChk was hanging... Easier said than done: once it hangs, DOS DEBUG is not able to interrupt it. Then I recalled that many years ago I had used the NMI to debug a similar situation. I implemented an NMI handler that prints the registers and the code around the current IP location. I ran SysChk, waited until it hung, and generated an NMI using a piece of wire to connect ISA pin A1 (/IOCHCHK) to pin B1 (GND)... I know, I could have used a small flathead screwdriver instead...
  • The hang was happening in what appeared to be an interrupt service routine for one of the IRQs (later I found that it was IRQ3). The code was pretty short, and it wasn't a big problem to disassemble and understand it. At the same time, it wasn't clear why that code would hang with my BIOS but not with the other BIOS... I spent an hour unsuccessfully trying to understand what the deal was.
  • I thought about disassembling SysChk, but it appeared to be a compressed SFX binary, and I didn't want to spend time trying to find a decompressor for it.
  • Finally I went through the process of running SysChk under DOS DEBUG, stepping over the "CALL" instructions, trying to narrow down the place where it was hanging
    • Note on DEBUG commands: p ("proceed") steps over CALL instructions, while t ("trace") steps into them
  • Fairly quickly I found that the IRQ detection procedure that was causing the hang didn't run at all on GLaBIOS... SysChk was checking a variable, set earlier, to determine whether to run IRQ detection
  • So I had to do another series of runs under DOS DEBUG to find where and how this variable was being set
  • It turns out that SysChk calls INT 15h AH=0C0h (get system configuration parameters), and it appears to check bit #1 of the feature information byte, which would be set on an MCA system. SysChk will then skip the IRQ check on an MCA system (it is either not needed, or doesn't work there?!). Now, my BIOS does implement that function and returns the correct data (non-MCA system), while GLaBIOS does not implement that function and returns CF=1 AH=86h, as it should if the function is not implemented. But SysChk does not check whether the function call was successful. It simply uses the ES:BX value and assumes that the system configuration parameters structure will be there.
  • It so happens that the initial ES:BX value is 0000:0000, so SysChk goes and checks the byte at 0000:0005, which is actually part of the INT 1 vector, and it so happens that bit #1 is set there... (BTW, apparently DOS implements its own INT 1 ISR, and perhaps most DOS versions have a similar value there?!). And since that bit is set, SysChk presumes that the system is MCA and skips the IRQ check
  • Now it is a good question why exactly the IRQ check would hang. It appears to be a combination of two bugs:
    1. Hardware bug: on a typical IBM PC/XT, as well as on the Micro 8088, IRQ signals are left floating (mental note to self - put some pull-downs there next time). Perhaps it doesn't cause much trouble with the older NMOS 8259, but with the CMOS chipset it seems that the PIC reads floating IRQ signals as switching between 0 and 1 all the time. Normally, this shouldn't cause issues, as all unused interrupts are masked at the PIC
    2. Software bug: SysChk implements its own ISRs for IRQ3, IRQ4, IRQ5, and IRQ7 (presumably all IRQs that COM or LPT ports can use), and unmasks these interrupts. I assume the idea is that it then tries to trigger an interrupt using a COM or an LPT port and checks which IRQ level fires. Instead, due to the floating IRQs, this results in an interrupt storm, and possibly in a stack overflow and a hang...
  • The bug reporter eventually implemented pull-downs on the IRQs, and that resolved the issue for him...
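To make the INT 15h part of the story concrete, here is a small Python model of the two behaviours. It is only a sketch under stated assumptions: the table location, the INT 1 vector value, and all function names are made up for illustration, and nothing here is taken from SysChk, 8088 BIOS, or GLaBIOS code. The only facts it relies on are the ones described above: feature byte 1 sits at offset 5 of the configuration table, bit #1 means MCA, and an unimplemented function returns CF=1, AH=86h with ES:BX untouched.

```python
MCA_BIT = 0x02            # bit #1 of feature byte 1: bus is Micro Channel
FEATURE_BYTE_1 = 5        # offset of feature byte 1 in the configuration table
TABLE_SEG, TABLE_OFF = 0xF000, 0xE6F5   # hypothetical table location in ROM


def linear(seg, off):
    """Convert a real-mode segment:offset pair to a linear address."""
    return ((seg << 4) + off) & 0xFFFFF


def make_memory():
    """Build a 1 MB toy memory image (all values are illustrative)."""
    mem = bytearray(0x100000)
    # INT 1 vector at 0000:0004..0007 (offset word, then segment word).
    # Assumed value 0070:0256: the byte at 0000:0005 has bit #1 set.
    mem[4:8] = bytes([0x56, 0x02, 0x70, 0x00])
    # Configuration table of a non-MCA machine: feature byte 1 is 0.
    table = bytes([0x08, 0x00, 0xFE, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00])
    base = linear(TABLE_SEG, TABLE_OFF)
    mem[base:base + len(table)] = table
    return mem


def bios_with_c0(regs):
    """BIOS that implements AH=C0h: CF=0, AH=0, ES:BX points to the table."""
    regs.update(cf=0, ah=0x00, es=TABLE_SEG, bx=TABLE_OFF)


def bios_without_c0(regs):
    """BIOS without AH=C0h: CF=1, AH=86h, ES:BX left as the caller had them."""
    regs.update(cf=1, ah=0x86)


def call_int15_c0(bios):
    """Issue the call with ES:BX initially 0000:0000, as in the story."""
    regs = {"cf": 0, "ah": 0xC0, "es": 0x0000, "bx": 0x0000}
    bios(regs)
    return regs


def is_mca_unchecked(bios, mem):
    """The shortcut: ignore CF and trust whatever ES:BX points to."""
    regs = call_int15_c0(bios)
    feature = mem[linear(regs["es"], regs["bx"] + FEATURE_BYTE_1)]
    return bool(feature & MCA_BIT)


def is_mca_checked(bios, mem):
    """Defensive version: only read the table if the call succeeded."""
    regs = call_int15_c0(bios)
    if regs["cf"]:
        return False      # function not supported, so no table to read
    feature = mem[linear(regs["es"], regs["bx"] + FEATURE_BYTE_1)]
    return bool(feature & MCA_BIT)
```

With the BIOS that lacks the function, the unchecked version reads the byte at 0000:0005 and reports an MCA system, which is why the IRQ detection never ran there. The checked version reports non-MCA on both.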
 
That's really interesting, thanks for taking the time to share your process. I learnt it's really hard to debug custom BIOS issues, especially when there is no emulator for your system!

I went a bit mad trying to get a PS/2 mouse to work in DOS with only 1 PIC and getting the floppy drive to work without DMA. Lots of stepping through DEBUG and Turbo Debugger. With the floppy, too, you can't step through transfers: you need to let the CPU keep up with the incoming data or the buffer over/under runs... Something I thought would take a few hours went on for weeks!

Learned a lot though and so satisfying to get it working. I really need to clean the code up and get it on my GitHub - esp as I might need to merge some of your fixes into my well-mangled fork!

Cheers for all your efforts, I'd never have built my lovely little 8088 machine without it!
 
I have a software version of this story ...

Everybody knows the trick where you can enter a character on the IBM PC keyboard by hitting ALT and a three digit code on the number pad. It's how you enter high-bit characters that are not directly on the keyboard.

If you do this in IRCjr (my IRC client) you get the expected results; it's basic keyboard handling handled by the BIOS. If you do it under VirtualBox, it appears to lock the keyboard up but you can hit Ctrl-Break to end the program. Strange, but ok, must be a bug in VirtualBox. Works fine in VMware - you get the character and no lockup.

So I dig deeper and I write some test code. I isolate the problem to a small program compiled with Open Watcom, and it is still just a VirtualBox problem. I'm using Open Watcom's "_bios_keybrd" function, which is basically a passthrough to the BIOS INT 16h handler, so that shouldn't be the problem. But I'm curious and I write my own assembler code to replace the call to their function, and it works better - it doesn't lock up the keyboard and I can see the problem with VirtualBox now.

VirtualBox doesn't support that method of keyboard entry so when you try one of those special characters under VirtualBox it just interprets it one key at a time. For example, instead of interpreting ALT 1 5 3 as "give me ASCII char 153", it gives you the results of ALT 1, ALT 5, and ALT 3. And worse, it's returning zeros for both the scan code of the key and the ASCII code of the result. So clearly VirtualBox is broken here.
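For anyone who has not looked at how ALT plus number pad entry works, here is a minimal Python sketch of the idea. It assumes the usual scheme where the digits typed while ALT is held are accumulated into one decimal value that is delivered when ALT is released. The function name is made up, and the 8-bit wraparound is my assumption about how a classic BIOS behaves, not something taken from any particular BIOS source.

```python
def alt_numpad_code(digits):
    """Accumulate number pad digits typed while ALT is held.

    Returns the character code that would be delivered on ALT release.
    The accumulator is assumed to be a single byte, so it wraps at 256.
    """
    acc = 0
    for d in digits:
        if not 0 <= d <= 9:
            raise ValueError("number pad digits must be 0..9")
        acc = (acc * 10 + d) & 0xFF
    return acc
```

So ALT 1 5 3 should come back as one keystroke with character code 153, rather than as three separate ALT+digit keystrokes.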

So why is my hand coded assembler working but the Open Watcom's BIOS function is not? Here is where it gets fun.
  • BIOS INT 16h, AH=01h is the "check keyboard status" function. If a key is waiting, ZF is cleared (0) and the scan code/ASCII code combination is put in AX. Every key has a scan code, so the scan code should never be 0.
  • Open Watcom wraps the BIOS call. It tests ZF as expected. It also sets a return code in AX: 0 if no key is ready, 1 if a key is ready.
  • The screw-up is that instead of setting the return code based on the value of the ZF flag, it assumes that the BIOS has put a valid scan code/ASCII code combination in AX and passes that through. But AX is 0 because of the VirtualBox bug, so the caller is told that no key is ready.
  • The bogus value (0s) is now in the keyboard buffer, and will remain there, unable to be cleared until something reads the keyboard. But nothing is going to read the keyboard because you presumably were checking to see if a key was available before reading it, so why would you read a key if you were just told nothing was available?
  • Ctrl-Break doesn't break the loop; you have to reset the virtual machine.
So why did Ctrl-Break work in IRCjr and why did my replacement assembler code work?
  • Ctrl-Break works in IRCjr because it has a dedicated Ctrl-Break handler, and my keyboard polling code checks that.
  • My replacement assembler sets the return value directly based on the ZF flag returned by the BIOS, and doesn't just pass AX through as the return code.
So the bottom line is we have a coding shortcut exposed by a bug in a virtual machine. The coding shortcut is also a bug, especially when dealing with BIOS, as you should always code defensively.
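Here is a toy Python model of the two ways to report "key ready", in case it helps to see the stuck buffer. It is a sketch only: the function names are made up, the keyboard buffer is a plain list of AX values, and the passthrough version models the shortcut as I described it above, not Open Watcom's actual source.

```python
def bios_check_key(buffer):
    """Model of INT 16h, AH=01h: peek at the keyboard buffer.

    Returns (zf, ax). ZF=1 means no keystroke is waiting. ZF=0 means AX
    holds the scan/ASCII pair of the next key, which stays in the buffer.
    """
    if not buffer:
        return 1, 0
    return 0, buffer[0]


def bios_read_key(buffer):
    """Model of INT 16h, AH=00h: remove and return the next keystroke."""
    return buffer.pop(0)


def key_ready_passthrough(buffer):
    """The shortcut: hand AX back as the 'key ready' indication."""
    zf, ax = bios_check_key(buffer)
    return 0 if zf else ax


def key_ready_from_zf(buffer):
    """Defensive version: derive the result from ZF alone."""
    zf, _ = bios_check_key(buffer)
    return 0 if zf else 1


def poll(buffer, key_ready, max_polls=10):
    """Typical caller: only read a key after being told one is ready."""
    keys = []
    for _ in range(max_polls):
        if key_ready(buffer):
            keys.append(bios_read_key(buffer))
    return keys
```

Feed it a buffer that starts with a bogus 0x0000 entry followed by a normal key such as 0x1E61 ('a'). With the passthrough check, the poll loop never reads anything and both entries stay stuck in the buffer. With the ZF based check, both entries are read and the buffer drains.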

This has taken about 4 hours of my life to isolate and understand. :)
 