HOWTO: Improve the performance of the XTIDE controller.

glitch · Apr 15, 2011

Also, as expected, the mod works flawlessly with the IBM PS/2 Model 30! I tried it with mine last weekend. I'm planning on posting a complete guide to modifying the controller and updating the firmware, including how to change the modified Hargle BIOS around with a hex editor to support different I/O base ports and such. The Hargle BIOS will also work with the Leading Edge Model D!

eeguru · Jun 10, 2011

So let's see if I have this straight. Normal IDE operation has up to 8 registers per chip select, laid out at contiguous 8-bit addresses, e.g. Reg0..7. If you perform a 16-bit data read from Reg0, on most systems you would get Reg0 in the low-order byte and Reg1 in the high-order byte. On IDE, however, the data port register gets a shadow version for the high byte instead. So a 16-bit R/W at Reg0 would transfer (Reg0' << 8) | Reg0 instead of (Reg1 << 8) | Reg0.

Correct?

So what Chuck did was reorder the registers so that there is a gap at the odd addresses, and the odd address becomes a physical representation of the shadow register. An even bus cycle triggers a real chip select assertion to the drive, presenting or capturing the high-order byte in a latch, while an odd bus cycle loads or reads the latch from the host's point of view.

Correct?

And the only reason A3/A0 were swapped is that it was less rework with a soldering iron. You could just as easily have shifted each address line up by one, so that the original logical register order is preserved while still creating the inter-byte address gaps.

Correct?
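The two wirings being compared can be sketched as address maps. This is a minimal model, assuming that in the stock layout host A0..A2 select the IDE register and host A3 selects the high-byte latch (the original read loop toggles the port with XOR 8); the function names are made up for illustration, and offsets the sketch does not decode are simply left out.

```python
# Sketch: which IDE register each host I/O offset (0..15) reaches under
# three wirings. Function names are hypothetical.

def stock(offset):
    """Assumed stock layout: registers at offsets 0..7, latch at offset 8."""
    if offset == 8:
        return "latch"
    return offset if offset < 8 else None  # None = not decoded in this sketch

def swapped(offset):
    """A0 and A3 exchanged: swap bit 0 and bit 3, then decode as stock."""
    a0 = offset & 1
    a3 = (offset >> 3) & 1
    return stock((offset & 0b0110) | (a0 << 3) | a3)

def shifted(offset):
    """Alternative: host A1..A3 drive IDE A0..A2, host A0 selects the latch."""
    if offset == 1:
        return "latch"
    return offset >> 1 if offset % 2 == 0 else None

def host_map(wiring):
    """Map each reachable target (register number or 'latch') to its host offset."""
    return {wiring(o): o for o in range(16) if wiring(o) is not None}

swapped_map = host_map(swapped)
shifted_map = host_map(shifted)
print("A0/A3 swap :", [swapped_map[r] for r in range(8)], "latch at", swapped_map["latch"])
print("shift by 1 :", [shifted_map[r] for r in range(8)], "latch at", shifted_map["latch"])
```

Both wirings put the latch at offset 1, right after the data port; only the order of the remaining registers differs.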

So the only real speed improvement is eliminating the additional instruction fetch when changing the code from 2x inport-byte to 1x inport-word?

I'm also assuming that on x86, a 16-bit port read from an 8-bit bus runs the low-order byte read cycle first, then the high, and a 16-bit port write to an 8-bit bus presents the high-order byte cycle first, then the low-order.

Correct?

Chuck(G) · Jun 10, 2011

You can rearrange the addressing of the integrated IDE ports however you'd like, (e.g. another approach would be to invert A0-A3), but the supplemental "latch" port must follow the data port address. My approach was an "it's easy; any idiot can do it" one. I certainly wouldn't do it that way if I were designing from scratch.
Two easily-located trace cuts; two jumpers--and more important, easily reversible.

The instruction savings is considerably more than a single instruction fetch. Consider the original way of moving data from the IDE internal buffer to the PC memory (and this is but one way to code the loop).

Code:

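    mov       dx,[DataPort]   ; DX = IDE data port (low byte)
    mov       cx,512/2        ; 256 words per sector
    mov       di,[Buffer]     ; ES:DI -> destination buffer
    mov       bl,8            ; toggling bit 3 of DL switches data port <-> latch
loop2:
    in        al,dx           ; read the low byte from the data port
    xchg      ah,al           ; park it in AH
    xor       dl,bl           ; switch to the latch port
    in        al,dx           ; read the high byte from the latch
    xchg      ah,al           ; AL = low byte, AH = high byte
    xor       dl,bl           ; back to the data port
    stosw                     ; store the word at ES:DI
    loop      loop2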

By doing the A0-A3 swap (which was simple and easily reversible), your loop is:

Code:

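    mov       dx,[DataPort]   ; DX = IDE data port, latch is at DX+1
    mov       cx,512/2        ; 256 words per sector
    mov       di,[Buffer]     ; ES:DI -> destination buffer
loop2:
    in        ax,dx           ; one word read: data port, then latch
    stosw                     ; store the word at ES:DI
    loop      loop2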

If you've got a V20 or better for a CPU, the loop can become a single REP INSW instruction.

You can't do writes this way, though, because the latch needs to be written before the IDE data port: the way the circuit is designed, it's the write to the even address (IDE data port) that triggers the data transfer to the IDE internal buffer. You could devise additional circuitry to fix this, but my feeling was that it wasn't worth the effort, because the ratio of reads to writes is probably over 10 to 1. But writes still get faster, as the write loop now looks like:

Code:

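    mov       dx,[DataPort]   ; DX = IDE data port, latch is at DX+1
    mov       cx,512/2        ; 256 words per sector
    mov       si,[Buffer]     ; DS:SI -> source buffer
loop2:
    lodsw                     ; AL = low byte, AH = high byte
    xchg      ah,al           ; AL = high byte
    inc       dx              ; point at the latch
    out       dx,al           ; write the high byte to the latch first
    dec       dx              ; back to the data port
    xchg      ah,al           ; AL = low byte
    out       dx,al           ; data port write triggers the transfer
    loop      loop2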

And an INC/DEC is only a 1-byte instruction, as opposed to an XOR or XCHG (pick your poison), which are 2-byte instructions.

Note that you can get a small improvement in all loops by doing some unrolling--and I did that when I hacked Hargle's code.
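To put rough numbers on the instruction sizes and the unrolling remark, here is a small tally of instruction bytes fetched per word moved. It is only a sketch: the sizes are the standard 8086 encodings of the instructions in the three loops above, byte count is used as a crude proxy for fetch traffic on the 8-bit bus rather than as a cycle count, and the helper name `bytes_per_word` is made up for illustration.

```python
# Encoded size in bytes of each 8086/8088 instruction used in the loops.
SIZE = {
    "in al,dx": 1, "in ax,dx": 1, "out dx,al": 1,
    "lodsw": 1, "stosw": 1, "inc dx": 1, "dec dx": 1,
    "xchg ah,al": 2, "xor dl,bl": 2, "loop": 2,
}

ORIGINAL_READ = ["in al,dx", "xchg ah,al", "xor dl,bl",
                 "in al,dx", "xchg ah,al", "xor dl,bl", "stosw"]
MODDED_READ = ["in ax,dx", "stosw"]
MODDED_WRITE = ["lodsw", "xchg ah,al", "inc dx", "out dx,al",
                "dec dx", "xchg ah,al", "out dx,al"]

def bytes_per_word(body, unroll=1):
    """Instruction bytes per 16-bit word when the loop body is repeated
    `unroll` times ahead of a single LOOP instruction."""
    total = unroll * sum(SIZE[insn] for insn in body) + SIZE["loop"]
    return total / unroll

print("original read :", bytes_per_word(ORIGINAL_READ))
print("modded read   :", bytes_per_word(MODDED_READ))
print("modded write  :", bytes_per_word(MODDED_WRITE))
print("modded read x4:", bytes_per_word(MODDED_READ, unroll=4))
```

With the body unrolled, CX has to be loaded with the word count divided by the unroll factor.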

mbbrutman · Jun 10, 2011

eeguru said:
So the only real speed improvement is eliminating the additional instruction fetch when changing the code from 2x inport-byte to 1x inport-word?

I'm also assuming that on x86, a 16-bit port read from an 8-bit bus runs the low-order byte read cycle first, then the high, and a 16-bit port write to an 8-bit bus presents the high-order byte cycle first, then the low-order.

Correct?

The original code has to do two 8-bit reads spread across two different I/O ports. The address of the register has to be in DX; it cannot be an immediate value. So besides the two reads, you have to adjust the port address on every read, and you need to do some bit shifting to form a 16-bit word to write to memory.

If the IDE data register maps to two consecutive 8-bit I/O registers on the machine, the CPU can be told to do a 16-bit operation starting at the base address. The CPU will automatically generate the bus cycles to read the next eight bits from the next I/O port address. Having the bus unit of the CPU do this takes only a few cycles, compared to the overhead of the multiple instructions and their instruction prefetches, which cause extra bus traffic and probably stall the CPU because its prefetch buffer is too small.
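The two automatic bus cycles can be modelled in a few lines. This is a toy sketch, assuming the 8088 runs the lower-address cycle first for word reads and word writes alike (which is what the write restriction described earlier in the thread implies); the class and function names are hypothetical. It shows why a word read works on the modded board while a word write hands the drive a stale high byte.

```python
class ChuckModBoard:
    """Toy model: data port at even offset 0, high-byte latch at odd offset 1."""

    def __init__(self, sector_words=()):
        self.from_drive = list(sector_words)  # words the drive will hand over
        self.to_drive = []                    # words the drive has received
        self.latch = 0

    def read_byte(self, offset):
        if offset == 0:                       # real IDE access
            word = self.from_drive.pop(0)
            self.latch = word >> 8            # high byte is parked in the latch
            return word & 0xFF
        return self.latch                     # odd address: read back the latch

    def write_byte(self, offset, value):
        if offset == 0:                       # real IDE access, drive gets latch:value
            self.to_drive.append((self.latch << 8) | value)
        else:
            self.latch = value

def in_word(board):
    """Models 'in ax,dx' on an 8-bit bus: low address first, then high."""
    low = board.read_byte(0)
    return (board.read_byte(1) << 8) | low

def out_word(board, word):
    """Models 'out dx,ax' on an 8-bit bus: low address first, then high."""
    board.write_byte(0, word & 0xFF)
    board.write_byte(1, word >> 8)

def out_latch_first(board, word):
    """Byte-wise order used by the write loop: latch, then data port."""
    board.write_byte(1, word >> 8)
    board.write_byte(0, word & 0xFF)

reader = ChuckModBoard([0x1234, 0xABCD])
words_read = [in_word(reader), in_word(reader)]

broken = ChuckModBoard()
fixed = ChuckModBoard()
for w in (0x1234, 0x5678):
    out_word(broken, w)
    out_latch_first(fixed, w)

print([hex(w) for w in words_read])       # ['0x1234', '0xabcd']
print([hex(w) for w in broken.to_drive])  # ['0x34', '0x1278']
print([hex(w) for w in fixed.to_drive])   # ['0x1234', '0x5678']
```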

Mike

Chuck(G) · Jun 10, 2011

Thanks for being a bit more clear--I tend to think that everyone knows these things, having lived in the x86 assembly code world for the last 30 or so years.

One problem with the x86 architecture is the half-hearted improvement in I/O instructions over x80. Everything still has to go through the A register, with 8-bit transfers confined to AL. And the only way to get to I/O port addresses above 0xff is to go through the DX register for addressing.

Why the x86 designers didn't give the I/O-space instructions the same addressing modes and orthogonality of the memory-space instructions is a mystery. You'd have thought that it would be a small matter. Perhaps they were afraid of running out of opcodes.

Another idea was to memory-map the XTIDE I/O ports into, say, the upper 512 bytes of the XTIDE BIOS ROM space (some SCSI adapters do this). You'd still want to keep the consecutive relationship between the IDE data port and the latch port so you could do a read transfer with a "REP MOVSW". Without another redesign, you'd still be dealing with single-byte-per-instruction transfers on writes, but there would be fewer instructions in the loop.

eeguru · Jun 10, 2011

Ok, that answers my main question of how the 8-bit cycles are ordered on a 16-bit write.

Is it just the first register offset in each bank that needs a shadow? I am writing some experimental CPLD code for the JR-IDE and it would be easy to do the following:

- Shift host A1..3 -> IDE A0..2 so the register ordering is preserved (mostly due to anal tendencies!)
- Memory map the registers
- Connect the ready signals

For a specific list of 16-bit register offsets only:

- On reads from the even address, assert CS, route the IDE lower byte to host data, and latch in the upper IDE byte at the end of the cycle
- On reads from the odd address, return the latch output.
- On writes to the even address, latch in the host data
- On writes to the odd address, route the host byte to the upper IDE path, enable the latch output on the lower path, and assert CS.
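The four rules above can be sketched as a small behavioural model. This is only an illustration of the listed behaviour, not the actual CPLD code: the class, method, and attribute names are hypothetical, a single latch is assumed to serve both directions, and the host is assumed to run the even (low) byte cycle first.

```python
class JrIdeDataPort:
    """Toy model of one 16-bit register pair following the four rules."""

    def __init__(self, words_from_drive=()):
        self.from_drive = list(words_from_drive)
        self.to_drive = []
        self.latch = 0

    def read(self, odd):
        if not odd:
            word = self.from_drive.pop(0)  # assert CS, fetch 16 bits from IDE
            self.latch = word >> 8         # keep the upper byte for the next cycle
            return word & 0xFF             # lower byte goes to the host now
        return self.latch                  # odd read returns the latch

    def write(self, odd, value):
        if not odd:
            self.latch = value             # hold the lower byte, no IDE cycle yet
        else:
            # assert CS: host byte on the upper path, latch on the lower path
            self.to_drive.append((value << 8) | self.latch)

def word_read(port):
    low = port.read(odd=False)
    return (port.read(odd=True) << 8) | low

def word_write(port, word):
    port.write(odd=False, value=word & 0xFF)
    port.write(odd=True, value=word >> 8)

port = JrIdeDataPort([0x1234, 0xABCD])
words_read = [word_read(port), word_read(port)]
for w in (0x1234, 0x5678):
    word_write(port, w)

print([hex(w) for w in words_read])     # ['0x1234', '0xabcd']
print([hex(w) for w in port.to_drive])  # ['0x1234', '0x5678']
```

Because the IDE cycle is deferred to the odd write, a plain 16-bit write works in this arrangement, at the cost that a lone byte write to the even address no longer reaches the drive.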

This JR-IDE design also has up to 864KB of the system RAM organized 16-bit wide on the same buffered data bus shared with the IDE header. Which makes the design friendly to future fast PIO or bus mastering by changing CPLD code (via parallel port cable to ISP header).

mbbrutman · Jun 10, 2011

eeguru said:
This JR-IDE design also has up to 864KB of the system RAM organized 16-bit wide on the same buffered data bus shared with the IDE header. Which makes the design friendly to future fast PIO or bus mastering by changing CPLD code (via parallel port cable to ISP header).

Can you explain the 864KB of memory? This is new to me. The Jr is wired so that the first 128KB is always on the motherboard; between that and the 64KB of system ROM, 864KB is too much, or it must not all be in the memory map.

Sector reads/writes are always 256 16-bit operations. If you have what is effectively a local bus on the card between the memory on the card and the IDE adapter, then you can remove the processor loop from the picture entirely. I don't think you have any additional intelligence on the card that could do this, but I'd love to be in a place where I could set a source/destination register and start a block transfer. A microcontroller could drive the transfer between the device and the SRAM, and all the PCjr CPU would have to do is sit in a busy loop waiting for the microcontroller to come back.

(But rats, that doesn't work for memory locations that are not in that local memory. You have to revert to a processor loop to handle that.)

eeguru · Jun 10, 2011

My latest design uses a CPLD rather than a couple SPLDs. And there isn't a full local bus as the address lines are not private. So I would have to request an 8088 processor hold. Which ultimately may not be doable depending on the sensitivity of the DRAM refresh generator. But I wanted to at least lay the board out so that possibility can be explored in the future.

The 864K assumes that if the internal 64KB RAM expansion isn't installed, memory cycles to/from that address range can still be serviced by the demuxed general bus. I know the wait state generator in the video controller will insert more waits if it doesn't detect it, but it should still work. And if it doesn't then that number simply becomes 800K. In the current code, there is a remap register that has 7 bits that allow the option BIOS to enable RAM fill for 7 different regions - 6 in upper RAM plus that one.

The CPLD comes in a PLCC-84 and is under $5 for 128 macrocells which is way overkill - even embedding a bus-master. It drops the entire board to 6 ICs (2x RAM, 1x Flash ROM, CPLD, RTC module, and a '245), 3 dip switches, 3 headers, 4 resistors and 9 caps. It also allows you to map part of the flash into f0000-fffff so you can play around with/update the JR BIOS if you're really adventurous.

On the wacky side I've studied this code, simulator output, and related docs so much I think I could draw out every single bridged junction on the routing matrix in the entire CPLD from the images that keep dancing around in my nightmares!

mbbrutman · Jun 10, 2011

Good news on one part - the PCjr doesn't have a refresh generator. Unlike the PC, which uses a DMA channel to refresh the system memory, the PCjr only handles refresh for the first 128KB, and it is tied to the video function. For anything above 128KB you are on your own - all of the memory expansion products include a DRAM controller to handle the refresh for that particular set of DRAM.

I need to check the tech ref but I am pretty sure that any address below 128KB is not going to be put on the sidecar expansion bus at all - there was no reason for it, and allowing it would have cost a few extra cents. (These are people who reused an existing oscillator to drive the UART divisor on board, making the BIOS code different from the PC and making it too far out of spec for operation above 4800 bps. Even without that problem the machine would die of a heart attack at those speeds because somebody hitting the keyboard triggers an NMI to decode the serial stream from the keyboard!)

I love the CPLD option - the reduced chip count makes up for the lack of space in a sidecar. I understand perfectly about the nightmares - I've been working on some code for work the past two weeks that has been bending my brain.

Software will be ready when you have a prototype. My existing XT-IDE works so well on the Jr I just think it is natural, and then I keep having to remind myself that other people need to experience the joy ...

Chuck(G) · Jun 10, 2011

Mike, I know that there are add-ons for the Peanut that provide a DMA controller. Are any of those tied to memory expansion?

eeguru · Jun 10, 2011

My point 'was' the DRAM is refreshed through the video controller autonomously. I haven't checked the schematics, but I'm assuming if I request a bus hold, all of the DRAM refresh logic would continue behind the video controller dynamically. If for some reason the refresh logic crosses the general bus or relies on the 8088 for any reason, then I can't hold the bus for very long. Depending on how long that is, it might be a problem.

As far as the 64/128K goes, it would actually take more parts and cost more to gate off those lower addresses from the Sidecar connector. So if the video controller tri-states its buffers for 64->128 memory accesses when the option card isn't installed, that RAM can be provided from a Sidecar. Though I suspect this option will never be used, as most JRs have 128K.

And I'm not sure what generates waits on the Jr. So that's another can of worms.

HOLD/HLDA are provided on the sidecar. So bus mastering is possible if the disk buffer target is in on-card memory. If it's not, it may still be possible, though I would have to generate all the other control lines that I wasn't planning on routing and don't have pins for. So for low target buffer addresses, it will probably have to revert to PIO.

For a normal XT-IDE in a PC slot, DMA operation would probably be easier with a CPLD. Since PIO vs DMA to the drive itself isn't really the bottleneck, you could still perform a PIO bus cycle to the drive for every 2 host bus cycles while DMA ack is asserted. Not to mention an XT-IDE board with a CPLD would be a single PLCC-84, a DIL-28W for the ROM, a DIP switch block and a few R/Cs. That's it! It wouldn't even have a single via.

Not sure programming would be a show stopper either as the Atmel-ISP software is a free download and you can hack up a home brew cable from a DB25M shell, pin sleeves, and wire.

pearce_jj · Aug 19, 2011

I'm trying to really follow this mod, but my measured results don't agree with the stated expected behaviour (these are all measured in a stock 5155 with Intel 8088@4.77MHz, with an SD-card). In post #14, per provided some excellent detail:

"about 14.2uS are saved on every word transfer...In a 4.77MHz computer, this should bring the performance from 85KB/s to 220KB/s. By halfing the number of loops; by adding 4 more "In ax,dx"+"movsw" instruction pairs (8 per loop), the speed should be able to get up to 234KB/s...The fastest would be to lay out the 256 instruction pairs with no loops at all, where about 247KB/s transfer rate is expected"

And Chuck noted later (#43),

"you can't do writes this way because the latch needs to be written before the IDE data port because the way the circuit is designed, it's the write to the even address (IDE data port) that triggers the data transfer to the IDE internal buffer. You could devise additional circuitry to fix this"

So, with the A0/A3 swapped card and revised BIOS (file I have is dated 14-Feb-11) I see 250KB/s read and 173KB/s write measured with my simple file IO tester. But I'm wondering whether this can be correct, since it is implied above that the write speed shouldn't be improved (or is it simply not improved as much?).
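The quoted 85KB/s to 220KB/s estimate can be checked with a little arithmetic. A minimal sketch, assuming the figures use 1024-byte kilobytes and that the 14.2uS saving applies to every 2-byte word; the function name is made up for illustration.

```python
KB = 1024  # assumption: the quoted KB/s figures use 1024-byte kilobytes

def rate_after_saving(old_kb_per_s, saved_us_per_word):
    """Transfer rate in KB/s after shaving `saved_us_per_word` off each word."""
    old_us_per_word = 2 / (old_kb_per_s * KB) * 1e6
    new_us_per_word = old_us_per_word - saved_us_per_word
    return 2 / (new_us_per_word * 1e-6) / KB

predicted = rate_after_saving(85, 14.2)
saved_clocks = 14.2e-6 * 4.77e6  # the saving expressed in 4.77MHz clock periods

print(f"predicted read rate: {predicted:.1f} KB/s")
print(f"saving per word    : {saved_clocks:.1f} clocks")
```

Under these assumptions the result lands within a few KB/s of the quoted 220KB/s, so the estimate is at least internally consistent.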

Many thanks!

Chuck(G) · Aug 19, 2011

My initial impression is that you're underestimating the overhead incurred by the Pascal I/O routines.

If you want to get a true estimate of speed, use interrupt 13H I/O directly.

pearce_jj · Sep 2, 2011

Thanks, my intention was to measure file system throughput hence avoiding INT13h. But the timing is certainly off; with the mod applied my utility is reporting 250KB/s write and 770KB/s read, but per the stop-watch it's more like 90KB/s write and 230KB/s read, which ties in closely to what was expected. But I don't know why as it's just calculating run-time by checking the system time at each end.

pearce_jj · Sep 5, 2011

Just as an update for the benefit of the search, the reason the benchmark is off (per this thread) is the operation of the XT/IDE BIOS with interrupts cleared during transfers. Hence there is clock skew during heavy IO - so all benchmarks with the board will be off unless a BIOS without this behaviour becomes generally available.
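The size of the skew can be estimated from the figures in the earlier post (utility 250KB/s write and 770KB/s read, stop-watch 90KB/s and 230KB/s). A rough sketch, assuming the utility divides bytes moved by the elapsed system time, so the ratio of the two rates is the fraction of real time the system clock actually counted; the function name is hypothetical.

```python
def fraction_of_time_counted(reported_kb_s, stopwatch_kb_s):
    """Share of the real elapsed time that the system clock registered."""
    return stopwatch_kb_s / reported_kb_s

read_fraction = fraction_of_time_counted(770, 230)
write_fraction = fraction_of_time_counted(250, 90)

print(f"reads : clock saw {read_fraction:.0%} of the real time")
print(f"writes: clock saw {write_fraction:.0%} of the real time")
```

On those numbers the clock registered only about a third of the real elapsed time during the test, which is why the reported rates read roughly three times too high.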

eeguru · Sep 5, 2011

The first board spin of JR-IDE is breathing. I'm able to transfer sectors back and forth with test code. I just need to write an int 13h BIOS for it (next weekend). It has both memory mapped I/O and the Chuck mod applied in both directions. Though it does create some quirks atm such as all register writes must be 16-bit (even writes to the latch, odd writes to the upper bus / lower from latch while asserting cs). That could be adjusted eventually with some PLD rework.

Perhaps I can run your benchmark code on it?

Spin 2 of the board plus a single chip ISA version happens either this weekend or next.

pearce_jj · Sep 5, 2011

JR-IDE being XT-IDE ported to the PCjr? Is this all in CPLD?

This is my simple benchmark; note, however, that it will read high unless the BIOS is fixed to enable interrupts during transfers.

mbbrutman · Sep 5, 2011

The original XT-IDE testing and performance page is here now:

http://www.vintage-computer.com/vcforum/showwiki.php?title=XTIDE+TestResults

Of course the prior benchmark numbers are missing, but we'll find them.

The benchmark that was used at the time is still in the page: http://www.brutman.com/iotest.zip . It tests sequential read and write performance, trying to test the drive interface and not the drive mechanicals. I think the interest in performance testing is great, but I hate seeing the wheel being reinvented.

Mike

mbbrutman · Sep 5, 2011

I found the data in the page - it is just not formatted correctly. As a result the new wiki doesn't display it, but it is there.

For now I made a small edit to expose the raw format of the table. I guess we have some more cleanup to do ... (grrr).

eeguru · Sep 11, 2011

Preliminary tests: ~230 KB/s direct read and write, transferring 63 sectors at a time. Unoptimized, but without the overhead of int13h wrapping. That should translate to nearly 400 KB/s on an 8 MHz bus machine.
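The 8 MHz projection is a straight clock-ratio scaling. A minimal sketch, assuming the 230 KB/s figure was measured at 4.77 MHz and that the transfer is bound entirely by the CPU/bus clock; the function name is made up for illustration.

```python
def scale_rate(rate_kb_s, from_mhz, to_mhz):
    """Scale a clock-bound transfer rate linearly with the bus clock."""
    return rate_kb_s * to_mhz / from_mhz

projected = scale_rate(230, from_mhz=4.77, to_mhz=8.0)
print(f"projected rate at 8 MHz: {projected:.0f} KB/s")
```

Linear scaling gives a figure in the high 380s, a little under the "nearly 400 KB/s" quoted.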
