
8088 vs V20 instruction timings...

pearce_jj

I would be very grateful for some help checking my workings for the transfer of 512 bytes in words (rates at 4.77MHz):

Port based:

8088:
IN 12
STOSW 15
LOOP 5 => 7,424 clocks, 320KB/s

V20: REP INM 9 + 16*rep => 4,105 clocks, 580KB/s


Memory-mapped:

8088: REP MOVSW 9 + 25*rep => 6,409 clocks, 372KB/s

V20: REP MOVKW 11 + 16*rep => 4,107 clocks, 580KB/s
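
For anyone checking my sums, the working is just the setup clocks plus the per-rep clocks times 256 words, then converted to a rate assuming 1KB = 1024 bytes:

Code:
rate = 512 bytes * 4,770,000Hz / clocks / 1024

REP INM       :  9 + 16*256 = 4,105 clocks -> ~580KB/s
REP MOVSW     :  9 + 25*256 = 6,409 clocks -> ~372KB/s
REP MOVKW     : 11 + 16*256 = 4,107 clocks -> ~580KB/s
IN/STOSW/LOOP :               7,424 clocks -> ~320KB/s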


Do these scale directly with clock, i.e. would an 8MHz V20 be running at 970KB/s?

Many thanks!
 
I just did a timing run on my XT and I make it 48 cycles per iteration for the "in ax,dx; stosw; loop" loop on the Intel 8088 (with interrupts and DRAM refresh disabled). That gives you 194KB/s. It's not quite as simple as just counting cycles for the individual instructions, because the execution unit has to wait for the bus. Unrolling the loop helps a bit (30 cycles or 311KB/s) if you're planning to execute it a lot and can spare the memory.
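
Something along these lines is what I mean by unrolling (just a sketch -- it assumes DX already holds the port, ES:DI points at the buffer, and you repeat the in/stosw pair as many times as you can spare the bytes for):

Code:
  cld                 ; stosw counts DI upward
  mov  cx, 64         ; 256 words, four per pass
.readNext:
  in   ax, dx         ; read a word from the port
  stosw               ; store it at ES:DI, DI += 2
  in   ax, dx
  stosw
  in   ax, dx
  stosw
  in   ax, dx
  stosw
  loop .readNext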

I believe these do scale directly with clock (at least for older machines) - the ISA bus shares its clock with the CPU.
 
Many thanks indeed. I guess the loop instruction is 17 cycles when taken - that pretty much adds up, plus the odd wait state.
 
reenigne pointed out something important -- waiting for the bus. More specifically, you have to look at the BIU (Bus Interface Unit), which can prefetch opcodes into its 4-byte cache. It takes 4 clock cycles to fetch a byte, so with small 2-byte low-clock operations the EU (Execution Unit) is going to spend a lot of clock cycles with its thumb up its arse. The BIU can take the clocks where the EU is running but the bus isn't needed for memory operations and use them to fill the cache, so a long-running operation followed by a couple of short ones can often run faster than a bunch of short ones.

IN ax,dx -- 1 byte (4 clocks fetch), 12 clocks EU, 2 bytes read (8 clocks BIU read)... so that takes a minimum of 24 clocks for the first one.
STOSW -- 1 byte (4 clocks fetch), 15 clocks, 2 bytes written (8 clocks BIU write, so only 7 clocks free for fetch)

If you had the flat loop of single ops:
Code:
.readNext:
  in ax,dx       ; 4 fetch, 12 exec, 8 port, so 4 free to fetch next
  stosw          ; 0 fetch (4-4), 15 exec, 8 mem, so 7 free to fetch next
  loop .readNext ; 1 fetch (8-7), 17 exec
Every iteration on an 8088 should take 49 clocks (16 + 15 + 18)... NOT the 44 you'd think from just adding 12 + 15 + 17.

Unrolling the loop: this oversimplifies and ain't quite right, but for purposes here it will have to do, 'cause I ain't got the time to explain it all.
Code:
.readNext:
  ; BIU cache empty after loop
1_in    ax, dx  ; 4 fetch, 12 exec - 8 port, 4 free           - 4 1_stosw        = 0 extra, BIU 1
1_stosw         ; 0 fetch, 15 exec - 8 mem,  7 free           - 4 2_in           = 3 extra, BIU 1
2_in    ax, dx  ; 0 fetch, 12 exec - 8 port, 4 free + 3 extra - 4 2_stosw        = 3 extra, BIU 1
2_stosw         ; 0 fetch, 15 exec - 8 mem,  7 free + 3 extra - 8 3_in, 3_stosw  = 2 extra, BIU 2
3_in    ax, dx  ; 0 fetch, 12 exec - 8 port, 4 free + 2 extra - 4 4_in           = 2 extra, BIU 2
3_stosw         ; 0 fetch, 15 exec - 8 mem,  7 free + 2 extra - 8 4_stosw, 5_in  = 1 extra, BIU 3
4_in    ax, dx  ; 0 fetch, 12 exec - 8 port, 4 free + 1 extra - 4 5_stosw        = 2 extra, BIU 3
4_stosw         ; 0 fetch, 15 exec - 8 mem,  7 free + 2 extra - 8 6_in, 6_stosw  = 1 extra, BIU FULL
5_in    ax, dx  ; 0 fetch, 12 exec - 8 port, 4 free + 1 extra - 4 7_in           = 1 extra, BIU FULL
5_stosw         ; 0 fetch, 15 exec - 8 mem,  7 free + 1 extra - 4 7_stosw        = 4 extra, BIU FULL
6_in    ax, dx  ; 0 fetch, 12 exec - 8 port, 4 free + 4 extra - 4 8_in           = 4 extra, BIU FULL
6_stosw         ; 0 fetch, 15 exec - 8 mem,  7 free + 4 extra - 4 8_stosw        = 7 extra, BIU FULL
7_in    ax, dx  ; 0 fetch, 12 exec - 8 port, 4 free + 7 extra - 4 loop 1st byte  = 7 extra, BIU FULL
7_stosw         ; 0 fetch, 15 exec - 8 mem,   entire routine cached, no more BIU calcs
8_in    ax, dx  ; 0 fetch, 12 exec - 8 port
8_stosw         ; 0 fetch, 15 exec - 8 mem
loop  .readNext ; 0 fetch, 17 exec - 0 mem

There's no fetch overhead after the first IN statement... so you basically threw away 8 clocks per item without even figuring the actual LOOP command's exec time into it! We don't just save 17 clocks by unrolling, we save ~22 -- 49 clocks per rolled iteration versus the 27 pure execution clocks of an IN/STOSW pair once the cache stays full.

Quite often I've found that swapping around the order things are executed in can leverage the BIU for speedups. If you have a MUL, for example, putting it before shorter operations on other registers allows the BIU fetch to fill up. STOSW, as we can see above, gives you 7 unused clocks every time you call it for the BIU to use to grab opcodes into its cache, which is why unrolling this loop is so brutally efficient -- it eliminates the 4-clock opcode fetch on most of the INs, as well as the entire fetch for the LOOP command itself. Basically that first IN takes 16 clocks instead of 12, but all the ones that follow take only their 12 clocks of execution time because we keep the cache full. The STOSWs all get cached, so they too take their listed times.
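
As a contrived little example of that reordering (assuming nothing downstream reads the flags, so both orders leave the registers the same):

Code:
; Order A: two 1-byte quickies first -- they can drain the prefetch cache
; faster than the BIU refills it, so each may stall on its own fetch; MUL's
; long, bus-idle execution only helps whatever comes after it.
  inc  si
  inc  di
  mul  cx          ; 100+ clocks in the EU, barely any bus traffic

; Order B: start the long MUL first -- while the EU grinds through it the
; BIU tops the 4-byte cache back up, so INC SI / INC DI arrive with their
; opcodes already fetched. Same work, fewer fetch stalls.
  mul  cx
  inc  si
  inc  di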

That's why a lot of ASM guys coming from other processors HATED the 8088/8086 back in the day; the efficiency 'tricks' it used to speed things up mean you can't just add execution times together. You simply could not reliably use execution times to accurately 'bit-bang' ports, so you had to latch a timer -- a relatively 'new' concept for microcomputing in the early 80s. It's also why software that tries to rely on the exact execution time of the 8088 breaks if you change to a V20 or 8086 -- even if in some cases it doesn't break at faster clock speeds (like the Tandy's 7.16MHz).
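
For anyone curious what 'latch a timer' looks like in practice, here's a bare-bones sketch using channel 0 of the 8253/8254 PIT, much like Abrash's Zen timer: reprogram it to mode 2 so the count reads back linearly, keep interrupts off, and it's only good for stretches under ~55ms:

Code:
  cli                  ; nothing may steal time or touch the PIT mid-measurement
  mov  al, 34h         ; channel 0, read/write LSB then MSB, mode 2, binary
  out  43h, al
  xor  al, al
  out  40h, al         ; reload = 0 (65536), so the ~18.2Hz BIOS tick is unchanged
  out  40h, al

  mov  al, 0           ; counter-latch command for channel 0
  out  43h, al
  in   al, 40h         ; latched count, LSB...
  mov  bl, al
  in   al, 40h         ; ...then MSB
  mov  bh, al          ; BX = start count (counts DOWN at 1,193,182Hz)

  ; ... code being timed goes here ...

  mov  al, 0           ; latch again
  out  43h, al
  in   al, 40h
  mov  cl, al
  in   al, 40h
  mov  ch, al          ; CX = end count

  sub  bx, cx          ; elapsed ticks, ~0.838us each
  sti

Strictly you'd want to put channel 0 back in mode 3 when you're done, but that's the gist of it.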
 
Ever tune loops on a CDC 6600 or an Intel i860? Makes the 8088 look like child's play.
 
While my big iron time prior to ~'84 was nonexistent, I did have the displeasure of dealing with a machine running Vox at one point; gah, the i860 was such a dog.

Though my statement was applying more to mainstream 8-bit processors contemporary to the 8088, like the Z80, 6502, 6809... or 16-bit ones like the TMS9900 -- NOT that I really got to use a TMS9900 to its full capacity, with the most common implementation being cripple-ware.
 
Where hand-timing gets difficult is when there are multiple functional units and you have to remember not only the execution time of an instruction, but the functional unit servicing it (and whether said functional units were duplexed), when an operand from a previous dependent operation becomes available, and lots of other little tidbits, including the size of the instruction cache, etc. On the 6600, you were particularly proud when you could fit a loop in the cache (called a "stack" then) and get one instruction to issue every cycle. Bottom-of-loop loads (to overlap with the loop branch) and using different functional units to perform the same operation (e.g. shift by zero bits or AND the source with itself to perform a register-to-register transfer) were some favorite techniques. There were also tricks only possible on a ones' complement architecture that were a joy to behold.

The Cray-1 was like that also.
 