What's wrong with my timer chaining code? (3Com PCI packet driver mystery)

mbbrutman

So I have a mystery I’d like to understand. My NetDrive code works great with every packet driver I’ve tested it with except for 3Com PCI packet drivers. Some of those are known to be buggy, but let’s assume that’s not the problem here.

My device driver loads at CONFIG.SYS time and hooks the timer interrupt (IRQ0) for two purposes:
  • Implementing a simple countdown timer so I can detect timeouts.
  • Sending ARP responses
All packet drivers load after the device driver. Not all packet drivers care about the timer interrupt, but the 3Com PCI one does.
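The countdown half of that hook can be sketched in C. This is only an illustration, not NetDrive's code: it assumes the timer runs at the default PC rate of roughly 18.2 ticks per second, and every name in it (`ms_to_ticks`, `start_timeout`, `on_timer_tick`, `timed_out`) is hypothetical.

```c
#include <stdint.h>

/* Assumption: the PIT runs at its default rate, 1193182 Hz / 65536,
   which is roughly 18.2 ticks per second (about 55 ms per tick). */
#define PIT_HZ      1193182UL
#define PIT_DIVISOR 65536UL

static volatile uint16_t timeout_ticks;   /* decremented on every tick */

/* Convert a timeout in milliseconds to timer ticks, rounding up so the
   timeout never fires early. */
uint16_t ms_to_ticks(uint32_t ms)
{
    uint64_t num = (uint64_t)ms * PIT_HZ;
    uint64_t den = (uint64_t)PIT_DIVISOR * 1000u;
    return (uint16_t)((num + den - 1) / den);
}

void start_timeout(uint32_t ms)
{
    timeout_ticks = ms_to_ticks(ms);
}

/* Called once per timer interrupt (hypothetical hook). */
void on_timer_tick(void)
{
    if (timeout_ticks > 0)
        timeout_ticks--;
}

int timed_out(void)
{
    return timeout_ticks == 0;
}
```

Rounding up means a one-second timeout takes 19 ticks rather than 18, so it errs on the side of waiting slightly too long.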

With no remote drives connected everything is fine. If you connect a remote drive, the machine crashes a few seconds later. I’m pretty sure that it’s trying to send an ARP response when it crashes. If there is no ARP traffic everything is fine, but if it has to send an ARP response under the timer interrupt the machine will crash.

The failing code looks like this:

Code:
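timerInt proc far

  cmp cs:arp_pending, 0        ; is an ARP response waiting to be sent?
  jz timerInt_chain_only       ; no: go straight to the previous handler

  <switch to private stack>
  <push some regs>
  <call to a routine that does int 0x60 to send the packet>
  <pop the regs>
  <restore original stack>

  timerInt_chain_only:
  jmp cs:[timerOrigInt]        ; far jump: the previous handler's IRET returns to the interrupted code

timerInt endp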

The pseudo-code in the middle is not special, I just wanted to keep this readable. There is a call to a routine to do the interrupt because the packet driver can be anywhere from interrupt 0x60 to 0x6F and I need to use a jump table to branch to the right INT instruction. All registers that get touched are saved and restored, and that code works for both sending regular UDP packets and ARP packets - just not for ARP packets under the timer interrupt with the 3Com PCI packet driver.
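The jump table is needed because the INT instruction encodes its vector as an immediate byte, so the vector cannot be chosen at run time without self-modifying code. A rough C model of that dispatch is below. The stubs only record which vector they stand for; in the real driver each would be a tiny routine containing one INT instruction. All names are hypothetical.

```c
#include <stddef.h>

#define FIRST_VECTOR 0x60
#define LAST_VECTOR  0x6F

static int last_vector;   /* records which stub ran, for illustration only */

/* Each stub stands in for a two-instruction routine: "int 6Xh / ret". */
#define STUB(name, vec) static void name(void) { last_vector = (vec); }
STUB(int_60, 0x60) STUB(int_61, 0x61) STUB(int_62, 0x62) STUB(int_63, 0x63)
STUB(int_64, 0x64) STUB(int_65, 0x65) STUB(int_66, 0x66) STUB(int_67, 0x67)
STUB(int_68, 0x68) STUB(int_69, 0x69) STUB(int_6A, 0x6A) STUB(int_6B, 0x6B)
STUB(int_6C, 0x6C) STUB(int_6D, 0x6D) STUB(int_6E, 0x6E) STUB(int_6F, 0x6F)

/* One entry per possible packet driver vector, indexed by vector - 0x60. */
static void (*const int_table[16])(void) = {
    int_60, int_61, int_62, int_63, int_64, int_65, int_66, int_67,
    int_68, int_69, int_6A, int_6B, int_6C, int_6D, int_6E, int_6F
};

/* Invoke the stub for the given vector.
   Returns 0 on success, -1 if the vector is outside 0x60..0x6F. */
int call_packet_driver(int vector)
{
    if (vector < FIRST_VECTOR || vector > LAST_VECTOR)
        return -1;
    int_table[vector - FIRST_VECTOR]();
    return 0;
}
```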

After much experimentation I found a variation of code that works:

Code:
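timerInt proc far

  pushf                        ; fake the flags word that an INT would have pushed
  call cs:[timerOrigInt]       ; far call: the previous handlers run first and IRET back here

  cmp cs:arp_pending, 0        ; is an ARP response waiting to be sent?
  jz timerInt_chain_only       ; no: nothing left to do

  <switch to private stack>
  <push some regs>
  <call to a routine that does int 0x60 to send the packet>
  <pop the regs>
  <restore original stack>

  timerInt_chain_only:
  iret                         ; return to the interrupted code ourselves

timerInt endp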

The code in the middle is exactly the same. The big difference is in how I do the chaining to the other interrupt handlers. In the failing code I just do what I need to do, then use a far jump to chain to the next interrupt handler. In the working code I push the flags and make a far call; that combination makes it look like an interrupt has occurred when the next handler gets control, and its IRET instruction will pop both the return address and flags off of the stack before my code continues.
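To make the stack picture concrete, here is a small C model of a 16-bit stack. It is only a sketch with hypothetical names: it shows that PUSHF followed by a far CALL leaves the same three words on the stack as an interrupt does, which is why the next handler's IRET unwinds it cleanly.

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    uint16_t mem[16];   /* tiny stack that grows downward */
    int sp;             /* index of the top word; 16 means empty */
} Stack16;

static void push16(Stack16 *s, uint16_t v) { s->mem[--s->sp] = v; }
static uint16_t pop16(Stack16 *s) { return s->mem[s->sp++]; }

/* Interrupt entry: the CPU pushes FLAGS, then CS, then IP.
   (A real interrupt also clears IF and TF afterwards; PUSHF plus a far
   CALL does not, which this model ignores.) */
void do_int(Stack16 *s, uint16_t flags, uint16_t cs, uint16_t ip)
{
    push16(s, flags);
    push16(s, cs);
    push16(s, ip);
}

/* PUSHF followed by a far CALL: the same three words in the same order. */
void do_pushf_call_far(Stack16 *s, uint16_t flags, uint16_t cs, uint16_t ip)
{
    push16(s, flags);   /* pushf */
    push16(s, cs);      /* call far pushes the return CS ... */
    push16(s, ip);      /* ... and then the return IP */
}

/* IRET pops IP, then CS, then FLAGS. */
void do_iret(Stack16 *s, uint16_t *flags, uint16_t *cs, uint16_t *ip)
{
    *ip = pop16(s);
    *cs = pop16(s);
    *flags = pop16(s);
}

/* Returns 1 if both entry sequences leave identical stacks. */
int frames_match(uint16_t flags, uint16_t cs, uint16_t ip)
{
    Stack16 a, b;
    memset(&a, 0, sizeof a);
    memset(&b, 0, sizeof b);
    a.sp = b.sp = 16;
    do_int(&a, flags, cs, ip);
    do_pushf_call_far(&b, flags, cs, ip);
    return a.sp == b.sp && memcmp(a.mem, b.mem, sizeof a.mem) == 0;
}
```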

The code should be equivalent except for when my code runs. In the first variation my code runs first (tries to send a packet), then chains to the other interrupt handlers. In the second variation my code chains first, and then runs.

The only thing I can think of that might matter is that the working code allows the interrupt condition to be cleared on the 8259, and somehow the 3Com code is sensitive to that. Is there something else I’m missing?
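The ordering difference can be modelled in a few lines of C. This sketch assumes the handler at the end of the chain is the one that sends the end-of-interrupt to the 8259, which is an assumption about this machine rather than something shown in the thread. All names are hypothetical.

```c
typedef struct {
    int irq0_in_service;      /* 1 until someone sends EOI for the timer */
    int in_service_at_send;   /* snapshot taken when the packet is sent */
} Pic;

/* Assumption: the previous handlers end with EOI (out 20h, 20h). */
static void previous_handlers(Pic *p)
{
    p->irq0_in_service = 0;
}

/* Stand-in for the call into the packet driver. */
static void send_arp(Pic *p)
{
    p->in_service_at_send = p->irq0_in_service;
}

/* Failing variant: send first, then chain with a far jump. */
int send_then_chain(void)
{
    Pic p = { 1, -1 };
    send_arp(&p);
    previous_handlers(&p);
    return p.in_service_at_send;
}

/* Working variant: pushf / far call first, then send. */
int chain_then_send(void)
{
    Pic p = { 1, -1 };
    previous_handlers(&p);
    send_arp(&p);
    return p.in_service_at_send;
}
```

In the failing order the packet is sent while IRQ0 is still in service. With the 8259 in its default fully nested mode, IRQ0 has the highest priority, so no other hardware interrupt, including the network card's, can be delivered until the EOI is sent, even if the driver enables interrupts. Whether the 3Com driver depends on that is a guess.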

(I can't send packets under the received packet interrupt either, even though it works on every other packet driver. I suspect that path has the same problem, but my interrupt chaining technique wouldn't be the cause there.)
 
This is going to be tough to get to the bottom of without really knowing what they're doing in that packet driver. Looking at your code for the timer interrupt, can you explain where you clear the arp_pending flag? In the non-working case, if it gets cleared near the end of the routine, after calling the packet driver, try clearing it before calling the packet driver on the off chance they also use the timer interrupt for something. It isn't clear from your pseudo-code where the flag is actually being cleared, and this *might* explain why the working version works.

But I'm just guessing here, so take that with a grain of salt. The second case may be a re-entrancy issue in their driver. Maybe try putting a flag around your routines to check whether you're already in the middle of processing packets, to see if there is some bizarre feedback loop going on with sending and receiving packets.
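A guard flag of that kind might look like the following C sketch. The names are hypothetical. On real hardware the test-and-set would have to happen with interrupts disabled (or with an atomic exchange) to be safe.

```c
static volatile int in_packet_code;   /* nonzero while a send is in progress */
static int skipped;                   /* how many re-entrant attempts were refused */
static int nested_result = -1;        /* result of the nested attempt in the demo */

/* Run send() unless a send is already in progress.
   Returns 1 if send() ran, 0 if the call was refused as re-entrant. */
int guarded_send(void (*send)(void))
{
    if (in_packet_code) {
        skipped++;
        return 0;
    }
    in_packet_code = 1;
    send();
    in_packet_code = 0;
    return 1;
}

static void plain_send(void)
{
    /* stand-in for the real call into the packet driver */
}

/* Demo: a send routine that is "interrupted" and tries to send again. */
static void reentrant_send(void)
{
    nested_result = guarded_send(plain_send);
}
```

Counting the refused attempts instead of silently dropping them gives a cheap way to see whether any feedback loop is actually happening.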

I don't know if there is a version of SoftICE available, but I used that back in the day for some difficult debugging situations.

Good Luck!
 
Since the packet driver is loaded after it, its own timer handler would be executed first, so it can't be a reentrancy problem with that. Unless it also does a pushf/call instead of a jump to chain to your handler, maybe doing so on the same stack as it uses while in int 60h? But in that case, the changed code would also crash.

Is there any chance that the 'jmp cs:[timerOrigInt]' instruction was running with interrupts enabled?
 
resman,

The arp_pending flag gets cleared after I send the packet, once I know it's gone. Remember, this path is already under the timer interrupt, and in the non-working case the end-of-interrupt processing has not been done yet, so there won't be another timer interrupt to deal with.


dreNorteR,

If the packet driver does something like switch to its own stack on the timer interrupt and then chain to the other handlers, that would be a problem: my device driver code would get called, try to send a packet, and that might cause the packet driver to try to switch back to the same stack it's already on. (This assumes they switch stacks and didn't use a unique stack for each interrupt handler or path.) My code is careful to use different stacks for each possible entry point and to switch back to the original stack before chaining to the other interrupt handlers. I don't know if the 3Com code is that careful.
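That failure mode can be modelled in C. This sketch illustrates the hypothesis only, not what the 3Com driver is known to do: a driver with one private stack and one slot for the caller's stack pointer works when entered once, but hands back the wrong stack pointer when entered a second time before the first entry has finished. The names and the `PRIVATE_TOP` value are hypothetical.

```c
#include <stdint.h>

#define PRIVATE_TOP 0x0400u   /* hypothetical top of the driver's private stack */

static uint16_t cur_sp;       /* models the CPU's SP register */
static uint16_t saved_sp;     /* the driver's single save slot */

/* Driver entry: remember the caller's SP, switch to the private stack. */
static void enter(void) { saved_sp = cur_sp; cur_sp = PRIVATE_TOP; }

/* Driver exit: switch back to whatever the save slot holds. */
static void leave(void) { cur_sp = saved_sp; }

/* One entry, no nesting: the caller gets its own SP back. */
uint16_t single_entry(uint16_t caller_sp)
{
    cur_sp = caller_sp;
    enter();
    cur_sp -= 8;              /* driver pushes some words */
    cur_sp += 8;              /* and pops them again */
    leave();
    return cur_sp;
}

/* Second entry while the first is still active. */
uint16_t nested_entry(uint16_t caller_sp)
{
    cur_sp = caller_sp;
    enter();                  /* outer: the slot holds caller_sp */
    cur_sp -= 8;              /* outer has 8 bytes on the private stack */
    enter();                  /* inner: the slot is overwritten, and SP is
                                 reset to the top, over the outer's data */
    leave();                  /* inner returns to the outer's private SP */
    cur_sp += 8;
    leave();                  /* outer restores the slot: no longer caller_sp */
    return cur_sp;
}
```

With a caller SP of 0x1000, the single entry returns 0x1000, but the nested case returns 0x03F8, a pointer into the private stack.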



Maybe it's uninitialized storage in their driver and I'm just getting lucky.

Does anybody know of any good disassemblers for 16-bit DOS code? The modern Ghidra tool doesn't work with COM files. I can work with either a static disassembler or something that generates instruction traces. (The problem with trying to get an instruction trace is I'll need to do it from a virtual machine and I don't know of any VMs that emulate the 3Com PCI hardware.)

I can also keep trying variations where I save away the incoming stack segment or look for telltale signs of re-entrancy problems. That combined with just scanning interrupt handlers with DEBUG.COM might eventually get me to the answer, but it's slow and tedious.
 
Right, so it's a good indication that the 3Com driver is using the timer interrupt for its processing. You say the machine crashes. Does that mean it just locks up?
Have you ever tried SoftICE? It is very powerful but also a bit of a hassle to set up and use. It's been a few decades since I've used it, but I recall it was useful in these hard-to-diagnose cases without having a real ICE.
 
I've never used SoftICE, but it seems like it's the right tool for the job. I don't think the network hardware will tolerate the change in timings if I start hitting breakpoints, but all I need to do is trace through the driver long enough to see where the re-entrancy problem is.

(Argh, yet another learning curve ...)
 
The free version of IDA Pro 5.0 has always done well by me. Newer ones have sadly dropped 16-bit support, but there's a download link for 5.0 (sanctioned by Hex-Rays) at https://wiki.scummvm.org/index.php/HOWTO-Reverse_Engineering#Resources.
 
The modern Ghidra tool doesn't work with COM files.
Really? Remember COM is not a file format. It's just a filename extension; the file itself is a raw binary that gets loaded to a fixed memory address.

Ghidra likely lacks the logic to associate a .COM filename with "this is an x86-16 raw binary loaded at x".
 
Lacking the logic to do something is definitely in the "it didn't work" category, and we (a group of us) couldn't find the method to force the association.

When I go back to it I'll use DEBUG.COM to start, and then try something like SoftICE or IDA Pro if I can't get further.
 
As a test, I opened Ghidra, created a project, and imported a COM file. The file was automatically detected as raw binary.

I had to specify x86-16 real mode, as well as set the base address. That's about it.
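The base address matters because DOS loads a COM image at offset 100h of its segment, right after the 256-byte PSP, so every absolute address in the code assumes that offset. As a rough illustration in C with hypothetical helper names, this maps a file offset to its run-time offset and decodes the jump that many COM programs begin with:

```c
#include <stddef.h>
#include <stdint.h>

#define COM_LOAD_OFFSET 0x0100u   /* a COM image starts right after the PSP */

/* Run-time offset (IP value) of a byte at the given file offset. */
uint16_t com_file_to_ip(size_t file_offset)
{
    return (uint16_t)(COM_LOAD_OFFSET + file_offset);
}

/* If the image starts with a JMP, return the run-time offset it jumps to.
   Returns -1 if the first instruction is not a near or short jump. */
long com_entry_jump_target(const uint8_t *image, size_t len)
{
    if (len >= 3 && image[0] == 0xE9) {          /* jmp near rel16 */
        uint16_t rel = (uint16_t)(image[1] | (image[2] << 8));
        return (uint16_t)(COM_LOAD_OFFSET + 3 + rel);
    }
    if (len >= 2 && image[0] == 0xEB) {          /* jmp short rel8 */
        int8_t rel = (int8_t)image[1];
        return (uint16_t)(COM_LOAD_OFFSET + 2 + rel);
    }
    return -1;
}
```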
 
My own experience with SoftICE back in the day says that it is a very good tool for this kind of problem if you can figure out how to get failures to occur so you can observe them.

Back in the 90's I was contracted to fix an embedded application that ran in real mode with DOS for file access and somebody's real-time exec and TCP/IP code. It ran in unprotected mode, but all the units actually had a 386 CPU and 4MB RAM. There were many machines in the field (~100?). The spec said the units had to meet a 2500 hour MTBF, but about one unit crashed per day in the field. No way to predict which one would crash next.

I wrote some code to switch to protected mode and return in virtual 8086 mode, and use the 386 debug registers to detect failures and dump memory to a floppy when it failed. All the field units ran this stuff.

The traces allowed me to see that the problem occurred when the DOS file code was interrupted with network stuff, and the DOS stack overflowed. I found a lot of bugs, but the main fix was to switch to a separate stack for all device interrupts to avoid stepping on DOS' stack. I also wrote some Turbo C code that helped analyze the memory dumps, and allowed Turbo debugger to be used to paw through the image over a serial line. I still have most/all of that code if you think it would help you.
 
When we tried Ghidra we basically just got a hex dump, but I don't remember setting the entry point address, so I'll try again.

Even so, I think that SoftICE is what I need. DEBUG.COM can show me what will run when an interrupt is invoked, but it's too hard to follow the code through the various calls and branches. An instruction trace will make it so much easier.
 
It's amazing what we had to do back then ... your story reminds me of the failure analysis work I used to do on the AS/400 operating system.

I started with three separate stacks, one for each possible path through my code, but even then I got caught because the PCjr was particularly bad about stack usage and managed to overflow one of the stacks if a key was held down. The three stacks I have now are even more generous to prevent that from happening again.
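One common way to size private stacks like these is to fill them with a known pattern and later check how much of the pattern survived. The C sketch below is a generic technique, not necessarily what NetDrive does, and the names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FILL 0xA5u   /* arbitrary pattern; a stored byte equal to it goes unnoticed */

/* Fill the whole stack area with the pattern before it is first used. */
void stack_paint(uint8_t *stack, size_t size)
{
    memset(stack, FILL, size);
}

/* The stack grows downward from stack + size, so untouched bytes sit at
   the low end. Returns the deepest usage seen so far, in bytes. */
size_t stack_high_water(const uint8_t *stack, size_t size)
{
    size_t untouched = 0;
    while (untouched < size && stack[untouched] == FILL)
        untouched++;
    return size - untouched;
}
```

A result equal to the full stack size means the stack was used to the last byte and has probably overflowed.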

I really suspect the problem is in the 3Com driver and that it's doing something stupid like not being re-entrant safe or using uninitialized storage. No other packet driver that I've used has problems like this, and all of the other packet drivers let me send packets under their receive interrupt or under the timer interrupt. The 3Com packet driver is just seriously super picky about when I can send a packet (never under the receive interrupt and only under the timer interrupt in this very specific way) and that has to be a bug, not a conscious design choice.

If I can't make further progress with just DEBUG.COM and my own test code I'll move onto SoftICE. Just getting an instruction trace will be a big deal, as I'll be able to detect what it's touching that is not re-entrant safe. I'll ping when it gets worse. ;-0
 
I am silly unlucky on this problem ...

SoftICE works on my Pentium 133 but the machine doesn't lock up when SoftICE is resident. If I change my config.sys to not load SoftICE then I get the expected lock-up a few seconds after NetDrive connects to a drive and sends an ARP response. While SoftICE might be useful for stepping through the executed code, the change in behavior makes me worried that I might wind up wasting my time.
 
Not unusual. I remember having the same problem when using an Intel ICE-85 on an 8085 system. With the ICE plugged into the CPU socket, everything worked. Otherwise, every once in a while, hangup city.

I eventually resorted to adding code to the TRAP vector to at least show what was in the registers when triggering the trap line manually.
 
This may be a long shot... but could you use an emulator and do a PCI pass-through? That should allow you to get instruction traces from the emulated CPU, at least.
 
Somebody suggested that about a week ago - run QEMU on a machine and give it exclusive access to the card. The Pentium 133 that I used for SoftICE is probably underpowered for that, but I do have a Celeron 1100 system that might be able to run a modern Linux with QEMU.

I spent a few hours with it in DEBUG.COM, writing down addresses and taking notes. I also wrote some test code. I'm not finding any evidence of the 3Com driver switching stacks, which is a good thing, but I need to do a full search of the bytes for the stack switch code before I can declare that to be true in the entire driver. That would be one major source of non-reentrant code.
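A byte search for stack switches could be sketched in C as below. It looks for the x86 encodings that load SS: `8E /2` (mov ss, r/m16), `17` (pop ss), and `0F B2` (lss, 386 and later). Because it ignores instruction boundaries, every hit is only a candidate to inspect by hand, and the one-byte `17` match in particular will turn up in ordinary data. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Nonzero if the bytes at p (n bytes remaining) could be an instruction
   that loads the SS register. */
static int loads_ss(const uint8_t *p, size_t n)
{
    if (n >= 1 && p[0] == 0x17)
        return 1;                                   /* pop ss */
    if (n >= 2 && p[0] == 0x8E && (p[1] & 0x38) == 0x10)
        return 1;                                   /* mov ss, r/m16 */
    if (n >= 2 && p[0] == 0x0F && p[1] == 0xB2)
        return 1;                                   /* lss reg, mem */
    return 0;
}

/* Scan an image and store up to max_hits candidate offsets in hits[].
   Returns the total number of candidates found, which may exceed max_hits. */
size_t find_ss_loads(const uint8_t *img, size_t len,
                     size_t *hits, size_t max_hits)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if (loads_ss(img + i, len - i)) {
            if (count < max_hits)
                hits[count] = i;
            count++;
        }
    }
    return count;
}
```

A segment override prefix such as `2E` in front of the `8E` does not hide the match, since the scan tests every byte position.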

At some point though I'm just going to give up and move on because this is just kind of fruitless. That packet driver and card *never* get a DHCP address on the first try; it always times out and then on the retry it gets a response from the server. That doesn't give me warm fuzzies about the quality of their code, along with this specific problem that I've been chasing.
 
So you have a Pentium 133 system that reliably reproduces the lock-up within a few seconds of initiating a certain scenario?

One project I'm going to work on in a week or so is seeing if I can get a logic analyzer set up with a Socket 7 processor probe (E2457A). The only Socket 7 target system I have is currently running a Pentium 133. Until I try to get it set up I'm not sure if the processor probe will be compatible with that target system, and if it is, I'm not sure how useful capturing bus cycles will be. I haven't even checked to see if I have the Inverse Assembler software for that processor probe. Without that the bus trace would be too hard to interpret. If I can get it all working, I wonder how useful that would be for helping to figure out what might be going wrong in your scenario.

The logic analyzer might be able to capture on the order of 1M cycles. The trick is always to figure out how to set up the trigger conditions to capture a trace with useful information. For example, trigger the start of the capture going forward from an execution point that will lead to the problem before the capture buffer overflows, or trigger the end of the capture at an execution point where it can be determined (if that is possible) that the problem has occurred and hope that the capture buffer going backward contains the start of the problem.
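The two trigger strategies come down to how many samples are kept after the trigger fires. The toy C model below shows a capture buffer as a ring. It is a sketch of the general idea, not of how the E2457A or any particular analyzer works, and the names and the tiny `DEPTH` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEPTH 8   /* toy buffer depth; a real analyzer holds far more */

typedef struct {
    uint32_t buf[DEPTH];
    size_t head;        /* next slot to write */
    size_t stored;      /* number of valid samples, at most DEPTH */
    int triggered;      /* nonzero once the trigger condition was seen */
    size_t post_left;   /* samples still to record after the trigger */
} Capture;

/* post = number of samples to keep recording after the trigger.
   A small value keeps mostly history from before the trigger; a value
   near DEPTH keeps mostly what follows it. */
void capture_init(Capture *c, size_t post)
{
    memset(c, 0, sizeof *c);
    c->post_left = post;
}

/* Feed one sample. Returns 1 while capturing, 0 once the capture is done. */
int capture_sample(Capture *c, uint32_t sample, int is_trigger)
{
    if (c->triggered && c->post_left == 0)
        return 0;                        /* capture already complete */
    c->buf[c->head] = sample;            /* overwrite the oldest slot */
    c->head = (c->head + 1) % DEPTH;
    if (c->stored < DEPTH)
        c->stored++;
    if (c->triggered)
        c->post_left--;
    else if (is_trigger)
        c->triggered = 1;
    return !(c->triggered && c->post_left == 0);
}

/* Oldest and newest samples still held in the buffer
   (only meaningful once at least one sample has been stored). */
uint32_t capture_oldest(const Capture *c)
{
    return c->buf[(c->head + DEPTH - c->stored) % DEPTH];
}

uint32_t capture_newest(const Capture *c)
{
    return c->buf[(c->head + DEPTH - 1) % DEPTH];
}
```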
 
I just looked the E2457A up, and that's a monster ...

I'm pretty sure that I can send a funny opcode or do port I/O that can trigger a probe. After that it would probably be within 10s of thousands of clock cycles to capture the offending trace. All I need to do is send an ARP response under the packet interrupt.

(The thing that is killing me about this problem is that I don't think I'm doing anything terribly wrong, and it works on every other card I've run into. So I really think it's just a buggy packet driver, but I'd really like to know that for sure.)

Let me know how it goes with your project. Even if I never go that extreme it is nice to know somebody can; I didn't realize those existed.
 