ISA maximum sustained transfer rate.

mR_Slug · Feb 26, 2017

I have seen many sources state the max transfer rate of the ISA bus as something along the lines of "typically 1-2MB/s", but I can't really find any explanation as to why.

I found this: (InfoWorld Jan 25, 1993 - Steve Gibson)

https://books.google.co.uk/books?id...nfoWorld Jan 25, 1993 - Steve Gibson:&f=false

It states the theoretical max transfer rate as 5.3MB/s for 16-bit ISA @ 8MHz. Summarized:

2 bytes are transferred over the bus at a time, but it takes about 3 cycles to send them. At 8MHz:
(2 Bytes x 8MHz)/3 = 5.333MB/s.
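
As a rough check of that arithmetic, here is a minimal Python sketch. The function name `peak_rate_mb_s` and its parameters are made up for illustration, and it assumes decimal megabytes (10^6 bytes) and a bus that never inserts wait states.

```python
def peak_rate_mb_s(bytes_per_transfer, bus_mhz, clocks_per_transfer):
    """Peak transfer rate in MB/s (1 MB = 10**6 bytes), ignoring wait states."""
    return bytes_per_transfer * bus_mhz / clocks_per_transfer

# 16-bit ISA at 8 MHz, 3 bus clocks per transfer (figures from the article)
print(round(peak_rate_mb_s(2, 8.0, 3), 3))
```

Changing the clock count or transfer width shows how quickly the figure drops once a device needs more than 3 clocks per word.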

But this is a far cry from 1-2MB/s. From what I understand, when moving data from, say, a SCSI card to a NIC, this would be halved again, because you have SCSI>>CPU>>NIC. I think bus mastering can in theory speed things up, but I can't find a good source on ISA bus mastering.

Benchmarks are usually slightly slower than a theoretical maximum sustained transfer rate, but I think I'm missing part of the picture. What are the other variable(s)?

Can anyone point me in the right direction?

Chuck(G) · Feb 26, 2017

The best I've ever gotten from an 8 bit transfer was just shy of 1MB/sec. Double that for 16 bits.

What I've seen for ISA bus timings looks like this:

pearce_jj · Feb 26, 2017

This is the book to answer your questions:

https://www.mindshare.com/Books/Titles/ISA_System_Architecture_(3rd_Edition)

DMA will do the job for 8-bit boards but the controllers never scaled up with bus speed, so quicker to go via CPU with 16-bit 8MHz systems.

mbbrutman · Feb 26, 2017

From the IBM PC XT Technical Reference:

"Normal memory read and write cycles take four 210ns clocks for a cycle time of 840ns/byte.
Microprocessor-generated I/O read and write cycles require five clocks for a cycle time of 1.05us/byte.
DMA transfers require five clocks for a cycle time of 1.05us/byte."

So the bandwidth for memory reads and writes is 1.13 megabytes per second, which is based on an 8 bit bus at 4.77Mhz.

If you used I/O ports or DMA instead there would be a 20% penalty because of the fifth cycle. And of course this assumes that your devices can keep up with these speeds and not insert extra cycles.

From the IBM PC AT Technical Reference:

On an AT only 3 clock cycles are required for a bus transfer. At 6Mhz that means 2 bytes are transferred every 500ns. So that bandwidth is 3.8 megabytes per second. At 8Mhz that works out to around 5 megabytes per second.

There are complications; 8 bit operations to 8 bit devices take 6 clock cycles, not 3. 16 bit operations to 8 bit devices take 12 clock cycles. And the DMA controller operates at 3Mhz so anything it does takes 5 clock cycles.
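
To put numbers on those complications, here is a small Python sketch using only the clock counts quoted above. The names are illustrative, rates are in decimal MB/s (10^6 bytes), and it assumes no extra wait states and no instruction-fetch or refresh overhead.

```python
def at_bus_rate_mb_s(bus_mhz, bytes_moved, bus_clocks):
    """Peak rate in MB/s for one kind of AT bus operation."""
    return bytes_moved * bus_mhz / bus_clocks

# (bytes moved, bus clocks needed) for each case quoted from the AT Tech Ref
CASES = {
    "16-bit op, 16-bit device": (2, 3),
    "8-bit op, 8-bit device": (1, 6),
    "16-bit op, 8-bit device": (2, 12),
}

for name, (nbytes, clocks) in CASES.items():
    for mhz in (6.0, 8.0):
        rate = at_bus_rate_mb_s(mhz, nbytes, clocks)
        print(f"{name} @ {mhz:g} MHz: {rate:.2f} MB/s")
```

Under these assumptions an 8-bit device tops out at a quarter of the 16-bit figure whether it is accessed with byte or word operations.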

And of course all of this depends on the code you are running. The instructions you are executing also take up bus cycles. So this favors using the "REP" prefix if available for an instruction because that allows you to keep the CPU from accessing the bus to read more instructions. And DMA refresh gets in the way too.

Chuck(G) · Feb 26, 2017

Don't forget that if you're storing words to odd addresses, that eats cycles also.

I've read over and over that 5.33 MB/sec is the maximum, but I've never run into a peripheral that actually does this. As I mentioned, I can get a bit above 2MB/sec on an AT bus asserting 0WS and using REP INSW instructions, but could never get even close to 5.33 MB/sec. I've never tried bus mastering transfers, but I suppose that's another option.

Does anyone have any real-world examples?

eeguru · Feb 27, 2017

It would be 3 cycles for an independent memory access - 2 for the access and 1 rest. That's in one direction. For example, you could never hit 5.33 MB/s doing rep insw on a true 286 with a direct coupled ISA bus. The result of the inport would still need to be stored back to RAM. The only place where a 5.333 rate would be possible outside of a VLB/cache type setting is during DMA where the I/O and MEM operations could share cycles.

Chuck(G) · Feb 27, 2017

I assume that by "DMA" you don't mean the 8237 type--that's limited by the 8237 and is generally slower in 16-bit mode than programmed I/O. Are you talking about bus mastered DMA?

nc_mike · Feb 27, 2017

Would anyone know the maximum sustained transfer rate with an Inboard/386 installed in a PC/XT? I know that the Inboard takes over boot from the base system BIOS shortly after boot. I've got an Inboard in my PC/XT with a 4MB RAM daughter card (5MB total), with most of it running as extended memory. I've also upgraded the CPU with a 133-pin compatible 486 running at 40MHz (33MHz effective, limited a bit by the standard oscillator).

Mike

mR_Slug · Feb 27, 2017

Thank you all for your responses. I am reading that ISA book at the moment, so I still have a lot to understand. This is going to be a long post. This is how I understand it (so far):

The XT:
4.77MHz is ~210ns, 210ns = ~4.76Mhz

It takes 4 cycles to read or write 1 byte to/from RAM. (840ns)
It takes 5 cycles to read or write 1 byte to/from I/O. (1050ns)
It takes 5 cycles to read or write 1 byte Via DMA. (1050ns)

RAM: (1 byte x 4.77MHz)/4 = 1.1925 MB/s, or (1 byte x 1000/210ns)/4 = 1.1905 MB/s, so ~1.19MB/s
I/O and DMA: (1 byte x 4.77MHz)/5 = 0.954 MB/s, or (1 byte x 1000/210ns)/5 = 0.952 MB/s, so ~0.95MB/s

1.13 megabytes: I assume you mean 1.19, unless I missed something?
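
One possible explanation for the gap, and this is only an assumption about how the 1.13 figure was derived, is units: 1.19 in decimal megabytes (10^6 bytes) is about 1.135 in binary megabytes (2^20 bytes). The Python sketch below, with made-up names, converts a bus cycle time into both.

```python
MIB = 2 ** 20  # bytes in a binary megabyte

def rate_from_cycle_time(bytes_per_cycle, cycle_ns):
    """Return (decimal MB/s, binary MiB/s) for a given bus cycle time."""
    bytes_per_s = bytes_per_cycle * 1e9 / cycle_ns
    return bytes_per_s / 1e6, bytes_per_s / MIB

for label, nbytes, ns in [("XT memory, 840 ns/byte", 1, 840),
                          ("AT at 6 MHz, 500 ns/word", 2, 500)]:
    mb, mib = rate_from_cycle_time(nbytes, ns)
    print(f"{label}: {mb:.3f} MB/s = {mib:.3f} MiB/s")
```

The same conversion turns 4.0 MB/s for the 6MHz AT into roughly 3.8, which matches the other figure quoted from the AT reference.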

XT PIO transfer:
So if we are going to move data from a card (above port 255, so the port number has to go in DX) to another card, using PIO (is that the correct term?)

Code:

Loop:    mov     dx, 378h    ;Point at LPT1: data port      (2 cycles*)
         in      al, dx      ;Read byte from printer port.  (5 cycles)
         mov     dx, 278h    ;Point at LPT2: data port      (2 cycles*)
         out     dx, al      ;Write byte in AL to ptr port. (5 cycles)
         jmp     Loop        ;Repeat                        (15 cycles, what!*)

>* I am assuming "dx, 378h" is equivalent to reg, reg. 2 cycles according to:
http://zsmith.co/intel_m.html#mov
15 cycles!
http://zsmith.co/intel_j.html#jmp
Code example based on:
https://courses.engr.illinois.edu/ece390/books/artofasm/CH06/CH06-4.html#HEADING4-144

So using this method we have 2+5+2+5+15 = 14 + 15 (seriously!). Let's just ignore the jump instruction: if we copy-paste the first four lines enough times, we can effectively optimize it out of the equation. So:
14 cycles per byte moved.

So (1 byte x 1000/210ns)/14 = ~0.340MB/s Max transfer rate.
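
The same sum as a short Python sketch, comparing the unrolled sequence with the looped one. The cycle counts are the assumed values from the listing above, not verified 8088 timings, and the function name is made up.

```python
XT_CLOCK_NS = 210  # one 4.77 MHz clock period

def pio_rate_mb_s(cycles_per_instruction, bytes_per_pass=1, clock_ns=XT_CLOCK_NS):
    """Peak MB/s when one pass through the instructions moves bytes_per_pass."""
    total_cycles = sum(cycles_per_instruction)
    return bytes_per_pass * 1000 / clock_ns / total_cycles

unrolled = [2, 5, 2, 5]     # mov, in, mov, out
looped = unrolled + [15]    # same, plus the jmp back to Loop

print(f"unrolled: {pio_rate_mb_s(unrolled):.3f} MB/s")
print(f"looped:   {pio_rate_mb_s(looped):.3f} MB/s")
```

With these counts the jump alone roughly halves the rate, which is why unrolling (or a REP-prefixed string instruction where available) matters so much.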

XT DMA:
OK, I am having problems with this one: lots of diagrams, no code, sounds complicated.

AT
8MHz is 125ns; let's just stick with 8MHz for the time being.

It takes 3 cycles to read or write 2 bytes to/from I/O. (375ns)

(2 Bytes x 8MHz)/3 = 5.33 MB/s = (2 Bytes x 1000/125ns)/3

AT PIO:
So if we are going to move data from a card (above port 255) to another card, using PIO:

Code:

         mov     dx, 378h    ;Point at LPT1: data port      (2 cycles*)
         in      ax, dx      ;Read word from printer port.  (3 cycles)
         mov     dx, 278h    ;Point at LPT2: data port      (2 cycles*)
         out     dx, ax      ;Write word in AX to ptr port. (3 cycles)

>* Again I am assuming dx, 378h is equivalent to reg, reg. 2 cycles according to:
http://zsmith.co/intel_m.html#mov

So using this method we have 2+3+2+3 = 10 cycles, per word moved.
So ((2bytes x 1000)/125ns)/10 = 1.600MB/s Max transfer rate.

A 286/386 at 16MHz (bus is just half speed, 8MHz)
OK, now we have basically a double-speed CPU. Most instructions would be twice as fast, but if we access an 8MHz ISA bus, any instruction that uses the bus should take the same time in ns, right? So:

Code:

                                                            ISA:               CPU (2x ISA):
         mov     dx, 378h    ;Point at LPT1: data port      (1 cycle)          (2 cycles)
         in      ax, dx      ;Read word from printer port.  (3 cycles)         (6 cycles)
         mov     dx, 278h    ;Point at LPT2: data port      (1 cycle)          (2 cycles)
         out     dx, ax      ;Write word in AX to ptr port. (3 cycles)         (6 cycles)

I hope that makes sense. The CPU does the mov instruction (2 CPU cycles), within the time one ISA bus cycle has elapsed.

So using this method we have 1+3+1+3 = 8 cycles, per word moved.
So ((2bytes x 1000)/125ns)/8 = 2.000MB/s Max transfer rate.

32MHz CPU:

we have 0.5+3+0.5+3 = 7 cycles, per word moved.
So ((2bytes x 1000)/125ns)/7 = 2.286MB/s Max transfer rate.

64MHz CPU
we have 0.25+3+0.25+3 = 6.5 cycles, per word moved.
So ((2bytes x 1000)/125ns)/6.5 = 2.462MB/s Max transfer rate.
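
The pattern in those three results can be written as one formula. The Python sketch below encodes the same assumptions used above (IN and OUT always cost 3 ISA cycles each, each MOV costs 2 CPU cycles, bus fixed at 8MHz); the names are made up for illustration.

```python
ISA_MHZ = 8.0
IO_BUS_CYCLES = 3    # per IN or OUT, fixed by the bus
MOV_CPU_CYCLES = 2   # per MOV DX, imm (assumed, as in the listings above)

def word_copy_rate_mb_s(cpu_mhz):
    """Peak port-to-port rate in MB/s for a CPU clocked at cpu_mhz."""
    ratio = cpu_mhz / ISA_MHZ                        # CPU clocks per bus clock
    bus_cycles = 2 * IO_BUS_CYCLES + 2 * MOV_CPU_CYCLES / ratio
    return 2 * ISA_MHZ / bus_cycles                  # 2 bytes per pass

for mhz in (8, 16, 32, 64):
    print(f"{mhz} MHz CPU: {word_copy_rate_mb_s(mhz):.3f} MB/s")
```

As the CPU clock grows, the MOV term vanishes and the rate approaches 16/6, about 2.667MB/s, so in this model a faster CPU can never get past the two 3-cycle bus accesses.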

I think I'm on the right track, the figures look right, but that could just be coincidence.

Mike, does the Inboard operate at 16MHz with the bus still at 4.77MHz? If so, I think this is correct:
4.77MHz / 16MHz = ~0.3, so the CPU should be able to perform each mov instruction (previously 2 bus cycles) in 0.6 cycles, so:

0.6+5+0.6+5 = 11.2
So (1 byte x 1000/210ns)/11.2 = ~0.425MB/s Max transfer rate.

Of course, if anyone finds any errors in my calculations, please please let me know.

njroadfan · Feb 27, 2017

Bus mastering on ISA cards wasn't really supported. It was a hackjob at best (multiple bus masters should be avoided), but Adaptec and other SCSI adapter makers figured out how to do it. Using DMA, 2.5MB/sec was observed over an ISA SCSI card (vs. 1MB/sec with PIO IDE). See: http://www.os2museum.com/wp/booting-is-hard/

EISA (and MCA for that matter) fixed all these problems, as the bus explicitly supports multiple bus masters out of the box. Some ISA SCSI cards apparently supported enhanced bus mastering DMA functions in EISA systems as well. Hardly anything outside of the floppy controller and sound cards used the 8237 for DMA; it's just too damned slow.

Steve Gibson's article was grossly misguided though. The ISA bottleneck was quite apparent with video cards by 1993 and plenty of VLB cards readily crushed their ISA equivalent in benchmarks. 5400rpm hard drives became much more common in 1993 as well.

Chuck(G) · Feb 27, 2017

mR_Slug,

It's very rare to go I/O port to I/O port. I/O is usually I/O space to memory or vice versa. This is where the 80186/286/V20 I/O variants come handy. You can do a REP INSB/INSW and do a whole bunch of accesses with one instruction.

There are other work-arounds. Consider the XTIDE and the "Chuck mod". Since the ATA protocol uses 16-bit transfers, the XTIDE latches one byte and saves it at a different I/O address. If we re-arrange the I/O port mapping, we can use the 8088 BIU to do an operation to 2 I/O ports with one instruction. If you're using a V20, it's even better, because you can issue a single REP INSW to do the whole operation, which should be the upper limit on 8-bit I/O.

All of this implies that you have some sort of buffered I/O, so you don't have to check for data available. If it's a loop-on-data-not-ready, input when ready, all bets are off.

I have a bus-mastering NIC from Ansel. It's 10BaseT, so I don't think that it matters much--it uses the AMD LANCE chip.

pearce_jj · Feb 27, 2017

In terms of upper limit, we can go twice as fast with DMA as with REP INSW, at least at peak rate, since data is driven directly from the I/O device to memory (not copied through the CPU). The overhead of configuring the controller does reduce the real gains, though. Even so, in terms of disk performance with a 4.77MHz V20, we are talking 400KB/s max with PIO and over 550KB/s with DMA.

Chuck(G) · Feb 27, 2017

James, yes, 8-bit DMA is faster, but we've got two (at least!) separate discussions going here.

16-bit DMA is slower than 16-bit programmed I/O. There, you're dealing with the issue of what amounts to a 16-bit architecture shoehorned into two 8-bit DMA chips that have their roots in the 8085 era.

Using the OP's programmed I/O example, you can't do programmed 8-bit transfers faster than REP INSW. DMA is a different story. Of course, calling REP INSW over an 8 bit bus is a little bit of a chimera--to the casual observer, it appears to be 16 bit I/O, but is done over an 8 bit bus.

No matter how you cut it, the 8 bit DMA transfer limit does create problems and can be more complex than it would first appear. For example, I have a system here where I run 3 floppy controllers simultaneously (multithreaded) each controller has its own port, IRQ and DMA channel. You can write three 2D floppies at the same time, but not three HD ones, no matter how you program the 8237. At first blush, this would not seem to be a problem, as the HD data rate is 500Kbit/sec., which works out to be 62.5KB/sec., so three controllers would be a very moderate 187.5KB/sec total. But it won't work--you'll get "lost data" errors every time. I'm not sure why this happens, but it isn't a function of the CPU.
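
For what it's worth, the raw numbers can be laid out in a short Python sketch. It assumes the 1.05us/byte DMA cycle quoted earlier from the XT Technical Reference, which may not match this particular machine, and the function name is made up. It says nothing about the actual cause of the lost-data errors.

```python
def floppy_dma_load(controllers, kbit_per_s=500, dma_us_per_byte=1.05):
    """Return (total KB/s, microseconds between bytes per controller,
    fraction of time the DMA controller is busy moving data)."""
    bytes_per_s = kbit_per_s * 1000 / 8       # per controller
    byte_interval_us = 1e6 / bytes_per_s      # a new byte this often
    busy_fraction = controllers * dma_us_per_byte / byte_interval_us
    return controllers * bytes_per_s / 1000, byte_interval_us, busy_fraction

total_kb, interval_us, busy = floppy_dma_load(3)
print(f"{total_kb} KB/s total, one byte per controller every {interval_us} us, "
      f"DMA busy {busy:.1%} of the time")
```

Under those assumptions the three HD streams use only about a fifth of the DMA bandwidth, which fits the observation that raw throughput is not the obvious culprit.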

reenigne · Feb 27, 2017

<off-topic>

njroadfan said:
Bus mastering on ISA cards wasn't really supported. It was a hackjob at best (multiple bus masters should be avoided), but Adaptec and other SCSI adapter makers figured out how to do it.

Interesting - just from looking at what signals are exposed on the ISA bus, I didn't think it was possible at all! Do you have a link to any details about how it was done?

</off-topic>

What follows applies to the 8-bit (PC/XT) variant of the ISA bus rather than the 16-bit (AT) variant.

I haven't played with it much, but my understanding is that the 8237 DMA controller in block transfer mode normally takes 4 cycles per byte rather than 5. This chip also has a "compressed timing" command which reduces most transfers to 2 cycles per byte (3 when there's a change to address bits 8-15). On a 4.77MHz machine, this would put the theoretical transfer limit at 2.38MB/s. There may be practical considerations which reduce that a bit, though.
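
A quick Python sketch of where those limits come from. It assumes the 3-cycle case occurs once per 256 sequential bytes (each time the low address byte wraps), which is my reading rather than something measured; names are illustrative.

```python
XT_CLOCK_MHZ = 4.77

def dma_rate_mb_s(clocks_per_byte, clock_mhz=XT_CLOCK_MHZ):
    """Peak 8-bit DMA rate in MB/s for a given number of clocks per byte."""
    return clock_mhz / clocks_per_byte

# Compressed timing: 2 clocks per byte, 3 when address bits 8-15 change.
# Assumption: that happens once in every 256 sequential bytes.
avg_compressed = (255 * 2 + 1 * 3) / 256

for label, clocks in [("5 clocks (XT Tech Ref figure)", 5),
                      ("4 clocks (normal timing)", 4),
                      ("2 clocks (compressed, best case)", 2),
                      ("compressed, averaged over 256 bytes", avg_compressed)]:
    print(f"{label}: {dma_rate_mb_s(clocks):.3f} MB/s")
```

The averaged compressed figure is only slightly below the best case, so the occasional 3-cycle transfer barely matters in this model.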

gslick · Feb 27, 2017

reenigne said:
<off-topic>

Interesting - just from looking at what signals are exposed on the ISA bus, I didn't think it was possible at all! Do you have a link to any details about how it was done?

</off-topic>

ISA bus master transfers, as used for example by an Adaptec 1542 SCSI controller, are implemented by programming the DMA controller channel in Cascade Mode. Then the add-in controller can drive the address lines on the bus instead of the DMA controller.

There should be lots of information on the net with details of how Cascade Mode works with the DMA controller.

Here's one link I found with a very quick search:
https://docs.freebsd.org/doc/2.1.7-RELEASE/usr/share/doc/handbook/handbook248.html

Chuck(G) · Feb 27, 2017

reenigne said:
What follows applies to the 8-bit (PC/XT) variant of the ISA bus rather than the 16-bit (AT) variant.

I haven't played with it much, but my understanding is that the 8237 DMA controller in block transfer mode normally takes 4 cycles per byte rather than 5. This chip also has a "compressed timing" command which reduces most transfers to 2 cycles per byte (3 when there's a change to address bits 8-15). On a 4.77MHz machine, this would put the theoretical transfer limit at 2.38MB/s. There may be practical considerations which reduce that a bit, though.

I have never experienced, nor have been able to design a peripheral using 8237 8-bit DMA that does better than about 1MB/sec. Just doesn't exist. Probably because, as James noted, a DMA transfer involves moving between I/O space and memory.

AlexC · Feb 28, 2017

I don't know if this helps, but I have a 10MHz NEC V20 XT clone with an Orchid EMS card plugged into an 8-bit ISA slot and I've been messing around with memory timings. Best I can get, as reported by QEMM's Manifest, is 1,075KB/sec. The EMS board is set to zero wait state and has 70ns SIMMs installed. There's probably some overhead with the driver, etc.

I don't know if Manifest measures anything useful, but if so this tends to support a real-world limit over 8-bit ISA of around 1MB. For reference, main system board RAM on this machine is clocked at around 1.4MB/sec in Manifest.

Chuck(G) · Feb 28, 2017

<sideways topic>
Instead of using the old, slow 8237/8257 DMAC intended for the 8085 family, I wish Intel would have come out with a real DMA controller such as that integrated into the 80186. No "64K" boundary issues; go from I/O port to I/O port, or memory-to-memory with no problems. Full 20 bit address capability.

The 8089 wasn't it. It was what amounts to a separate processor with its own instruction set and very expensive at that. Applicability past the 8086 is doubtful.

But then, Intel was very slow in getting a range of 16-bit capable peripheral chips for the x86 platform.

As a side note, the 80186 belongs to a different generation and the DMA speed there (for a 10 MHz clock) is quoted at 1.25M (bytes for 80188 or words for 80186) per second.

</sideways topic>

mR_Slug · Feb 28, 2017

With regard to the input and output cards, yes, I am looking at it from the perspective of buffered I/O. Specifically, however fast you read data from the input card, it will refill its buffer with another word. The output card can be written to at any speed also. This eliminates the issues Chuck(G) mentioned with loop-on-data-not-ready, input when ready, etc.

I will have to check out the "Chuck mod" and the REP INSW instructions. Not sure I understand what the "compressed timing" command is.

reenigne, the book linked to by pearce_jj has a section on bus mastering, also available here:
https://archive.org/details/ISA_System_Architecture

DMA (as performed by the 8237, NOT bus-mastering DMA)
AT:
@8MHz, the DMA controller operates at 4MHz. The clock-cycle time is 250ns (1000/4MHz). "All DMA data-transfer bus cycles are 5 clock cycles...or 1.25 microseconds" -AT tech ref. i.e. 1.25us = 1250ns, 1250ns/5 = 250ns

One ISA bus cycle at 8MHz is 125ns. So in terms of ISA bus cycles at 8MHz, one DMA cycle takes the same time as 2 ISA bus cycles. 5 DMA cycles is 1250ns, which is 10 ISA bus cycles, right?

I can't find any assembly for programming the 8237, but AFAIK counting the instructions is irrelevant anyway, as it is all setup. The 8237 does the transfer, and we know this takes 5 DMA cycles, or 10 ISA bus cycles, per word moved.

From the ISA System Architecture book, there are 4 modes: Single Transfer Mode, Block Transfer Mode, Demand Transfer Mode and Cascade Mode. Block Transfer Mode, if I understand correctly, is the fastest (theoretically). Let's say the DMA controller is set up to initiate a block transfer, and it never stops. This will block memory refresh, so this won't actually work on a 286. However, I'm trying to keep this as simple as possible, and it's sufficient to find the upper limit:

So ((2bytes x 1000)/125ns)/10 = 1.600MB/s Max transfer rate with DMA on an 8MHz AT, with no RAM refresh.

XT:
It takes 5 cycles to read or write 1 byte Via DMA. (1050ns)

Using the same setup as for the AT, (1 byte x 1000/210ns)/5 = ~0.952MB/s Max transfer rate.
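
Both DMA figures come from the same calculation, sketched here in Python. It assumes 5 DMA clocks per transfer, a continuous block transfer and no refresh, as above; the function name is made up.

```python
ISA_CYCLE_NS = 125  # one 8 MHz ISA bus clock

def dma_transfer(bytes_per_transfer, dma_clock_ns, clocks=5):
    """Return (time per transfer in ns, peak rate in MB/s)."""
    transfer_ns = clocks * dma_clock_ns
    return transfer_ns, bytes_per_transfer * 1000 / transfer_ns

at_ns, at_rate = dma_transfer(2, 250)   # AT: 16-bit, 4 MHz DMA clock
xt_ns, xt_rate = dma_transfer(1, 210)   # XT: 8-bit, 4.77 MHz DMA clock

print(f"AT: {at_ns} ns = {at_ns / ISA_CYCLE_NS:g} ISA cycles, {at_rate:.3f} MB/s")
print(f"XT: {xt_ns} ns, {xt_rate:.3f} MB/s")
```

Writing it in terms of the DMA clock period makes it clear that the AT's limit comes from the 8237 running at half the bus clock, not from the bus itself.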

<side note>
AlexC gets 1,075KB/sec on 10MHz XT system (10MHz bus?)
(1 byte x 10MHz)/5 = ~2.00MB/s Max transfer rate. So it looks like my DMA calculations are either way off, the bus is slower, or some other factor?
</side note>

PIO mode I/O port to memory:
XT:
I can't understand the memory timing of "9+EA" for the 86/88. Can anyone explain it?

AT:
(Note, I am not well versed in assembly)

Code:

         mov     dx, 378h    ;Point at LPT1: data port      (2 cycles)
         in      ax, dx      ;Read word from printer port.  (3 cycles)
         mov     [1000h], ax ;Store word at address 1000h   (3 cycles*)

*mem,reg is 3, reg,mem is 5 (so memory to register is slower, didn't know that)
http://zsmith.co/intel_m.html#mov

I had originally added instructions to increase the memory address, however lets just say that 1000h is a memory-mapped peripheral/card. That is, you write a word and the card transmits it immediately. You can then write to the same address again.

The first instruction is setup, so really the last two are all that's needed, giving 6 cycles, per word moved.

So ((2bytes x 1000)/125ns)/6 = 2.667MB/s Max transfer rate.

If the CPU speed is increased, then as the memory address 1000h is on the ISA bus, as far as I can tell this won't increase the transfer speed. If we are talking ISA to real memory, we have to include an increment (say 4 cycles) of the memory address used in the MOV instruction. That is slower than 2.667MB/s on an AT. But if the CPU/RAM were 10 times faster, then those additional 4 cycles and the 3 for the actual memory access could occur in 1/10 of the time, e.g.:

3 cycles (the IN instruction) + (3+4)/10 = 3 + 0.7 = 3.7 AT bus cycles.

So, ((2bytes x 1000)/125ns)/3.7 = 4.324MB/s Max transfer rate.
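
That estimate generalizes to any CPU/RAM speedup. The Python sketch below encodes the same assumptions (3 ISA cycles for the IN, 3 for the store and a guessed 4 for the address increment, with only the last two scaling with CPU speed); the names are illustrative.

```python
def io_to_memory_rate_mb_s(cpu_speedup, in_cycles=3, store_cycles=3,
                           increment_cycles=4, isa_mhz=8.0):
    """Peak I/O-to-RAM rate in MB/s, one 16-bit word per pass."""
    # Only the IN instruction is tied to the ISA clock in this model.
    bus_cycles = in_cycles + (store_cycles + increment_cycles) / cpu_speedup
    return 2 * isa_mhz / bus_cycles

for speedup in (1, 2, 10, 100):
    print(f"{speedup}x CPU/RAM: {io_to_memory_rate_mb_s(speedup):.3f} MB/s")
```

In this model the rate climbs towards the 5.333MB/s bus limit as the CPU and RAM get faster, but only reaches it if everything other than the I/O read is free.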

I think I'm starting to understand this now. Cue a post explaining that I haven't.

Sorry for the long post.

AlexC · Feb 28, 2017

mR_Slug said:
<side note>
AlexC gets 1,075KB/sec on 10MHz XT system (10MHz bus?)
(1 byte x 10MHz)/5 = ~2.00MB/s Max transfer rate. So it looks like my DMA calculations are either way off, the bus is slower, or some other factor?
</side note>

I don't know if the bus speed is 10MHz, only that the CPU runs at that speed (or perhaps 9.54 since it's a turbo XT, so 2x 4.77?).

There could be several other factors involved. As noted, I don't know how accurate Manifest is, though I'd be slow to criticize Quarterdeck's coding since they did some very clever stuff with memory. But the EMS card itself may well have limitations.

Since the RAM-to-CPU speed is only measured at 1.4MB/sec, I guess that defines an upper limit for performance on this particular machine.

<yet another side note>
Incidentally, the reasoning in this thread is why I came to the conclusion some time ago that it's not worth using a software disk cache on an XT machine. The extra overhead of transferring the data over the 8-bit bus to an EMS card negates any performance benefit. In all my tests, using different caches including Norton's NCACHE2, I've never seen an overall performance increase greater than about 1% from using a cache, and usually there's actually a decrease. Buffers in main system RAM help, but not a cache using memory on an add-in card.
</yet another side note>
