
How many Intel 80x86 clock cycles for INC instruction?

Martin Hepperle

I am trying to find out how many clock cycles different CPUs of the 80x86 family need to increment a 16-bit word at an even memory address, like so:

Code:
INC WORD PTR [0002]

So far I came up with this table:

Code:
   INC mem        8088: 15+6+2+4 cycles 15+EA + 2(transfer) + 4(for 16-bit)
                  8086:     15+6 cycles 15+EA (assuming even alignment)
                 80286:        7 cycles
                 80386:        6 cycles
                 80486:        3 cycles

I read the Intel manuals but do not really understand the extra cycles added in the case of the 8088 and 8086.
My calculation for the 8088 and 8086 seems excessive.
I also guess that any wait states would be added on top of these memory accesses.

Maybe someone can correct my cycle count table?

Thank you,
Martin
 
I think the 8088 figure is actually worse than you have stated. I get 15+6+(2*4) = 29 cycles for a 16-bit operand.

The INC instruction makes two transfers (one read and one write), and on the 8088 each 16-bit transfer incurs a 4-cycle overhead.
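
The arithmetic above can be written down as a small sketch. It only uses the figures quoted in this thread (15 base clocks for INC mem, 6 clocks of EA for a direct address, 4 extra clocks per 16-bit transfer on the 8088). The function name and its parameters are made up for illustration, and it deliberately ignores instruction fetch and wait states.

```python
def inc_mem16_clocks(cpu, base=15, ea=6, transfers=2, penalty_per_transfer=4):
    """Nominal clocks for INC WORD PTR [addr] at an even address.

    Assumption: manual-style best-case timing only (no instruction fetch
    stalls, no wait states).
    """
    if cpu == "8088":
        # 8-bit bus: each 16-bit transfer (one read, one write) costs extra.
        return base + ea + transfers * penalty_per_transfer
    if cpu == "8086":
        # 16-bit bus, even address: no per-transfer penalty.
        return base + ea
    raise ValueError("only 8088 and 8086 are modelled in this sketch")


print(inc_mem16_clocks("8088"))  # 29
print(inc_mem16_clocks("8086"))  # 21
```

Treat the result as a lower bound for comparing instructions, not as a measured figure.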

Yes, wait states will add more clock cycles to the timing.

I think your table just shows the improvement in performance that the later processors brought with them, not just with the higher clock speeds.

We originally had a project that was using 8086 microprocessors. Before the system was installed, we worked out that an 8086 was not going to be fast enough for the software. I bought a 286/10 CPU card and 'bodged' that into our 8086 MULTIBUS 1 system as a test. The performance increase was dramatically larger than the clock speed increase (from 5.33 MHz to 8 MHz) implied. The software team did not believe the figures that I was coming up with until they tried it for themselves!

We never did upgrade to a 386...

Dave
 
Adding to what daver2 already explained, you should also take into account that INC mem is 4 bytes long (5 with a segment override prefix), whereas the INC reg16 instructions are 1 byte long. On the 8086/88 every bus cycle takes at least 4 clocks, and the 8088 with its 8-bit bus fetches only one byte per bus cycle, so there you should count 16 cycles just for reading the code, plus the slow address calculations, plus the cycles the instruction itself takes to execute, plus wait states and so on. The 286 is much more cycle-efficient than the 8086: at the same clock speed it runs twice as fast or more in practice.
 
The 8088 has a very short prefetch queue, which means the processor is constantly stalling on instruction fetches. In your example, that is an additional 16 cycles (four bytes at 4 cycles each) on top of what you have.

On the 8086 (and especially later processors), the prefetch queue is longer and the bus interface is faster. The speed is more likely to be limited by instruction execution rather than instruction fetching.
 
Thank you for your comments - my worst dreams have become reality ;-)

I had hoped to simply calculate the required cycles, but it seems the only way to really find out how a real system behaves is to use a logic analyzer.
The calculation still seems helpful for relative comparison.
 
Even using a logic analyser falls into the category of "not too simple"...

Wait states are generally applicable to the Bus Interface Unit (BIU).

When an operand is required (read or write) then the wait states will affect the execution. Otherwise, wait states implemented on opcode (and immediate operand) fetches "may" only affect the instruction queue. I say "may" because the instruction queue can be flushed (or just run out of instructions) and, therefore, the wait states on the opcode/operand fetch are an "in line" delay with the instruction execution itself.

To work out what is going on, you need to monitor the bus activity and the CPU queue status signals to work out when an instruction is actually being executed.

I may be overthinking what you are trying to do of course...

Dave
 
I wrote https://www.reenigne.org/software/xtce_trace.zip to answer questions like this for the 8088 - it's a CPU emulator that shows what the CPU is doing cycle by cycle. With the prefetch queue empty before the instruction starts, it's 38 cycles from the first byte of the "INC" instruction leaving the queue to the first byte of the following instruction leaving the queue. If the queue is full (for example if a long-running non-bus instruction like MUL is executed immediately prior) it's 29 cycles. So accurately counting cycles on this CPU is rather complicated. In practice, counting the number of bus cycles (byte length of the instruction plus bytes read/written by the instruction) is a good way to figure out which of two pieces of code is likely to execute faster.
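
As a rough illustration of the bus-cycle counting heuristic, here is a minimal sketch for the 8088. It assumes one byte per bus cycle and 4 clocks per bus cycle with no wait states; the function name is made up for this example and is not part of the trace tool.

```python
CLOCKS_PER_BUS_CYCLE = 4  # assumption: minimum bus cycle, no wait states


def bus_accesses_8088(instr_len, bytes_read=0, bytes_written=0):
    """Bus cycles on an 8-bit bus: every byte fetched, read or written is one access."""
    return instr_len + bytes_read + bytes_written


# INC WORD PTR [0002]: 4 code bytes, 2 bytes read, 2 bytes written.
inc_mem = bus_accesses_8088(4, bytes_read=2, bytes_written=2)
# INC reg16: 1 code byte, no memory operand.
inc_reg = bus_accesses_8088(1)

print(inc_mem, inc_mem * CLOCKS_PER_BUS_CYCLE)  # 8 32
print(inc_reg, inc_reg * CLOCKS_PER_BUS_CYCLE)  # 1 4
```

The 32 clocks for INC mem land between the 29 (queue full) and 38 (queue empty) cycles from the trace, which is about as much as this kind of estimate can promise.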
 
@reenigne, thank you for this very interesting tool.

I applied it to one of my test cases (a simple loop with register INC instructions) and am trying to understand its output alongside my interpretation of the Intel 8088/8086 data book.

In the attached document I added my understanding of the output - feel free to comment on or correct my interpretation.

Do you have any plans to extend this to 8086 and maybe the 80286?

Martin
 


At IBM I used a program that ran on a mainframe called SIM86. It was able to simulate 8088/8086 and 80286. I was worried that the PC/AT I was using to drive a LED printhead for a test machine I was building would not be able to keep up. The simulator showed that it could (barely) keep up. They decided to not update SIM86 for the 80386, because it would have been extremely involved.
 
Yes, I know the documentation leaves a bit to be desired but you've got more or less the right idea. Column 68 contains "I", "S", "E" or is blank depending on what prefetch queue operation occurs on that cycle. I take the "I" cycle as the cycle when an instruction starts, even though the disassembly of the instruction isn't shown until the last "S" cycle of the instruction if it's a 2+ byte instruction (this is because these logs were originally generated from logic analyser logs and we didn't necessarily know what the instruction was going to be yet on the "I" line). The execution unit (microcode logs in columns 104 onwards) executes in parallel with the bus interface unit (cycle and operation shown in columns 44-66).

The Intel manuals show the best-case timings (i.e. when the queue is full) so things will generally be slower in the worst case, though some of the timings in the Intel manual are just plain wrong. The "mov cx,000a" instruction is a "register, immediate" instruction, so it has a best-case time of 4 cycles according to the Intel manual, though it is only actually three steps in the microcode.

Your testcase is almost completely bus-bound since none of the instructions have sufficiently complex microcode to let the prefetch queue fill up (with the exception of the LOOP instruction, but prefetching isn't helpful there anyway).

I could probably modify this to simulate the 8086 fairly easily - Ken Shirriff's blog has a lot of information on the low-level differences between the 8088 and the 8086 and the complicated bits are generally the same between both. However, I don't have a good way to validate the result against real hardware at the moment. The 80286 would be quite a bit more difficult since there's not so much information available about how the microcode of that CPU works.
 
If SIM86 is cycle accurate it would be interesting to play with - do you know if it was ever made public? I had a brief search but there are lots of things by that name. Presumably Intel had simulators they used internally to validate the functionality and measure performance of the chip designs prior to tape out - perhaps they made these available to IBM.
 
@reenigne, I think there might be a problem with your TRACE tool and the INC WORD PTR instruction.

Code:
DEBUG XX.COM
-u100
0E20:0100 B90800        MOV     CX,0008
0E20:0103 FF060001      INC     WORD PTR [0100]
0E20:0107 FF060001      INC     WORD PTR [0100]
0E20:010B FF060001      INC     WORD PTR [0100]
0E20:010F E2F2          LOOPW   0103
0E20:0111 C3            RET
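
For anyone who wants to reproduce the test case, the bytes from the listing can be assembled by hand. The sketch below just builds the 18-byte image from the hex column and checks that the LOOP displacement really points back to 0103; the variable names are made up for illustration, and writing the file is left as a commented-out line.

```python
ORG = 0x100  # .COM files are loaded at offset 0100h

code = bytes.fromhex(
    "B90800"    # 0100 MOV  CX,0008
    "FF060001"  # 0103 INC  WORD PTR [0100]
    "FF060001"  # 0107 INC  WORD PTR [0100]
    "FF060001"  # 010B INC  WORD PTR [0100]
    "E2F2"      # 010F LOOP 0103
    "C3"        # 0111 RET
)

# LOOP uses a signed 8-bit displacement relative to the next instruction.
loop_at = 0x10F - ORG
disp = int.from_bytes(code[loop_at + 1:loop_at + 2], "little", signed=True)
target = ORG + loop_at + 2 + disp

print(len(code), hex(target))  # 18 0x103

# To write the file (assumed name, same as in the post):
# open("xx.com", "wb").write(code)
```

The image ends at 0111, so the first free word behind the code is at 0112.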


In a Windows CMD window:
D:\DOS\TRACE>trace xx.com
Writing file (console) : Not enough memory for this command

The TRACE program is terminated and no further output appears.

The program runs and terminates properly in DOSBox. I also tried incrementing different words behind the code region, e.g. INC WORD PTR [0112].

Or am I doing something wrong?

Martin
 
That's very strange - I just tried the same xx.com and it worked correctly for me. You were able to get it running before - are you doing anything different? Does it still work with the testcases you tried before? Just to eliminate possibilities, trace.exe should be 153600 bytes and have a crc32 of e31ed4c3 (v1.1 2022122719503801 in whatsnew.txt). And it shouldn't be a problem to modify the code region - if you modify code that is just about to run it should have the same interaction with the prefetch queue that real hardware exhibits.
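
If no CRC tool is at hand, the size and CRC-32 check can be done with a few lines of Python. This is only a sketch using the standard zlib module; the function name is made up, and it assumes the value quoted above is the common zlib/PKZIP CRC-32.

```python
import os
import zlib


def file_size_and_crc32(path, chunk_size=65536):
    """Return (size in bytes, CRC-32 as 8 lowercase hex digits) for a file."""
    crc = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)  # running CRC over all chunks
    return os.path.getsize(path), format(crc & 0xFFFFFFFF, "08x")
```

Calling file_size_and_crc32("trace.exe") should then give (153600, "e31ed4c3") for the version mentioned above.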
 
Hmmm, maybe it is a problem with my Windows installation (it is still Windows 7 on an older laptop).
I will investigate. File size 153600 and CRC-32 E31ED4C3 match.
Could your tool also be compiled under MS-DOS? As we are looking at MS-DOS .COM files anyway, this would be handy.
 
Try redirecting the output to a file ("trace xx.com >xx.log") since it looks like it's writing to the console that it's having problems with. Also try closing cmd.exe and opening a fresh cmd window if it's got itself into a strange mode (perhaps by running a DOS program in it?)
It might be a bit of work to compile it for DOS, as it's written in fairly modern C++. It also simulates 640kB of memory so you'd either need to change that or compile it as a 32-bit program using DJGPP. The source code is at https://github.com/reenigne/reenigne/tree/479807ae633df2456b81e949aaef344ddad74f48/8088/xtce/trace if you want to have a play with it.
 