You can rearrange the addressing of the integrated IDE ports however you'd like (e.g. another approach would be to invert A0-A3), but the supplemental "latch" port must follow the data port address. My approach was an "it's easy; any idiot can do it" one. I certainly wouldn't do it that way if I were designing from scratch.
Two easily-located trace cuts; two jumpers--and more important, easily reversible.
The instruction savings is considerably more than a single instruction fetch. Consider the original way of moving data from the IDE internal buffer to the PC memory (and this is but one way to code the loop).
Code:
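mov dx,[DataPort]
mov cx,512/2 ; 256 words per sector
mov di,[Buffer]
mov bl,8 ; XOR mask that flips DX between the data port and the latch port
loop2:
in al,dx ; low byte from the data port
xchg ah,al ; park it in AH
xor dl,bl ; DX -> latch port
in al,dx ; high byte from the latch
xchg ah,al ; AL = low byte, AH = high byte
xor dl,bl ; DX -> back to the data port
stosw
loop loop2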
By doing the A0-A3 swap (which was simple and easily reversible), your loop is:
Code:
mov dx,[DataPort]
mov cx,512/2 ; 256 words per sector
mov di,[Buffer]
loop2:
in ax,dx ; low byte from the data port, high byte from the latch at DataPort+1
stosw
loop loop2
If you've got a V20 or better for a CPU, the whole loop can become a single REP INSW instruction.
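For what it's worth, the string-input version would look something like the following. This is only a sketch: it assumes the same DataPort and Buffer variables as the loops above, that ES already points at the buffer's segment, and that your assembler has been told to accept 186 instructions (.186 in MASM/TASM, CPU 186 in NASM).
Code:
mov dx,[DataPort]
mov cx,512/2 ; 256 words per sector
mov di,[Buffer] ; ES:DI -> destination (ES assumed to be set up already)
cld ; make sure DI counts upward
rep insw ; read CX words from port DX into ES:DI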
You can't do writes this way, though. The latch needs to be written before the IDE data port, because the way the circuit is designed, it's the write to the even address (the IDE data port) that triggers the data transfer to the IDE internal buffer, and a word OUT on an 8-bit bus writes the even (low) byte first. You could devise additional circuitry to fix this, but my feeling was that it wasn't worth the effort because the ratio of reads to writes is probably over 10 to 1. But writes still get faster, as the write loop now looks like:
Code:
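mov dx,[DataPort]
mov cx,512/2 ; 256 words per sector
mov si,[Buffer]
loop2:
lodsw ; AL = low byte, AH = high byte
xchg ah,al ; AL = high byte
inc dx ; DX -> latch port
out dx,al ; high byte goes to the latch first
dec dx ; DX -> data port
xchg ah,al ; AL = low byte
out dx,al ; low byte to the data port; this write triggers the transfer
loop loop2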
And an INC/DEC is only a 1-byte instruction, as opposed to an XOR or XCHG (pick your poison), which are 2-byte instructions.
Note that you can get a small improvement in all loops by doing some unrolling--and I did that when I hacked Hargle's code.
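To give an idea of what unrolling means here, this is the modified read loop unrolled four times. It's an illustration only, not the actual code from that hack; the unroll factor of 4 is an arbitrary choice, and it assumes the same DataPort and Buffer variables as above.
Code:
mov dx,[DataPort]
mov cx,512/8 ; 64 passes, 4 words (8 bytes) per pass
mov di,[Buffer]
cld ; make sure DI counts upward
loop2:
in ax,dx
stosw
in ax,dx
stosw
in ax,dx
stosw
in ax,dx
stosw
loop loop2
The gain comes from paying for the LOOP instruction once per four words instead of once per word.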