8088 performance of PUSH/POP vs saving in spare register

neilobremski

On the 8088, is saving/restoring in a spare register significantly faster than using PUSH/POP?

PUSH and POP are single-byte instructions, which is fantastic for code size, but they access memory because they use the stack. Together, PUSH/POP take 19 to 23 clock cycles according to Intel's specifications. On the other hand, MOV reg,reg is two instruction bytes, which balloons to four when one MOV saves the register in a spare and a second one reloads it. But a MOV is rated at only 2 clock cycles.

I had been, up until this point, trying to avoid stack usage because of the aforementioned memory access, thinking that it would be much slower than saving in a spare register. However, considering the prefetch queue on the 8088 is only 4 bytes and it takes 4 cycles to read any single byte off the bus, this brings the performance of the two closer together, because 4 instruction bytes x 4 clocks to read a byte = 16 clock cycles with an empty prefetch queue.
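
To make that arithmetic concrete, here is a minimal Python sketch of the fetch-only estimate. It assumes an empty prefetch queue and the 4-clocks-per-byte figure mentioned above; the name `fetch_clocks` is made up for illustration, and the sketch deliberately ignores execution time and any stack traffic.

```python
# Fetch-only estimate for the 8088, assuming an empty prefetch queue and
# 4 clocks to read each instruction byte off the bus (figures from the post).
CLOCKS_PER_BUS_BYTE = 4

def fetch_clocks(instruction_bytes: int) -> int:
    """Clocks spent purely on fetching the given number of instruction bytes."""
    return instruction_bytes * CLOCKS_PER_BUS_BYTE

# Save and restore via a spare register: two 2-byte MOV reg,reg instructions.
mov_pair_fetch = fetch_clocks(2 + 2)

# Save and restore via the stack: 1-byte PUSH plus 1-byte POP (fetch only).
push_pop_fetch = fetch_clocks(1 + 1)

print(mov_pair_fetch, push_pop_fetch)
```

So by this rough estimate the MOV pair costs 16 clocks in fetches alone, well above its nominal 2 + 2 clocks of execution time.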

I suppose it comes down to whether you want to save code space (2 bytes versus 4) and whether you have a spare register to use.

Now this has just been a recent mental pondering and I haven't done any timings, but I was curious what people here tend to recommend ...
 
It depends on a lot of factors.

First of all, is the save-to register really "spare", or can it be put to better use? Or must it also be saved at some point to be a "spare"? There, the answer is clear: push-pop.
Do you need to save only 8 bits and do you have a spare 8-bit register? Will saving/restoring via push/pop complicate things in that it deals with 16 bits only?
Are you trying to reduce stack usage? Is the routine recursive? There, you use a spare register, if possible.

Finally, will using XCHG reg,reg be a better alternative?

It's not a cut-and-dried issue and treating it as one is simply not good practice.
 
These are all great points.

My main question is regarding a direct speed comparison of the two methods on the 8088. How tight of a loop must one have to decide that the PUSH overhead is too much?

I understand there is a lot to be said for the semantic issues given a particular situation. For example, the stack is accessed serially versus the randomly loadable spare register.
 
I think that PUSH/POP will always be slower than MOV reg,reg on an 8086/8088. Can you cite evidence where it would not be the case? While it's true that PUSH/POP are only one byte instructions, the memory access overhead is significant, where the 2-byte register instructions have no memory constraints other than the instruction fetch itself. Remembering that the 8088 prefetch queue is only 4 bytes, it'd be pretty hard to write a short-enough loop requiring register saving to capitalize on shorter instruction length of PUSH/POP.

Interestingly, on the 8080, PUSH/POP for general purpose 16-bit access does have a speed advantage over MOV M,B / DCX H / MOV M,C / DCX H and can be exploited to gain needed cycles on block I/O operations.
 
Finally, will using XCHG reg,reg be a better alternative?

Yes, but only in some cases.
Namely, the forms:
XCHG ax, r16
XCHG r16, ax

are encoded with a single-byte opcode.
Any other form of reg, reg is a 2-byte opcode, like MOV.
The shorter opcode will save you some cycles. Else, it may be one cycle slower than MOV, so it only makes sense if you actually want to exchange two registers, rather than just storing the contents of one register in another.
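
For anyone who wants to see the byte counts, here is a small Python sketch of the register-to-register encodings involved. The helper names `encode_xchg` and `encode_mov` are hypothetical; the opcodes (90+r for the short AX form, 87 /r for the general XCHG, 89 /r for MOV) are the standard 8086 ones, and only 16-bit registers are handled.

```python
# Register numbers used in 8086 instruction encodings (16-bit registers only).
REG16 = {"ax": 0, "cx": 1, "dx": 2, "bx": 3, "sp": 4, "bp": 5, "si": 6, "di": 7}

def encode_xchg(a: str, b: str) -> bytes:
    """Encode XCHG between two 16-bit registers, using the short AX form if possible."""
    a, b = a.lower(), b.lower()
    if a == "ax" or b == "ax":
        other = b if a == "ax" else a
        return bytes([0x90 | REG16[other]])        # 90+r: a single byte
    modrm = 0xC0 | (REG16[b] << 3) | REG16[a]      # mod=11 means register-to-register
    return bytes([0x87, modrm])                    # 87 /r: two bytes

def encode_mov(dst: str, src: str) -> bytes:
    """Encode MOV dst,src between two 16-bit registers using opcode 89 /r."""
    modrm = 0xC0 | (REG16[src.lower()] << 3) | REG16[dst.lower()]
    return bytes([0x89, modrm])                    # always two bytes

for name, code in [("xchg ax,bx", encode_xchg("ax", "bx")),
                   ("xchg bx,cx", encode_xchg("bx", "cx")),
                   ("mov bx,ax", encode_mov("bx", "ax"))]:
    print(name, code.hex(), len(code), "byte(s)")
```

Because an exchange is symmetric, the sketch emits the same single byte whichever operand is AX.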
 
Can you cite evidence where it would not be the case?

No, you're right there.

While it's true that PUSH/POP are only one byte instructions, the memory access overhead is significant, where the 2-byte register instructions have no memory constraints other than the instruction fetch itself.

Right, the published clock cycle time for a PUSH is 11/15 (not sure when it is one versus the other) and a POP is 8 whereas a MOV reg,reg is only 2. AFAIK this means a MOV will still take 8 cycles if no instruction bytes have been prefetched.

Remembering that the 8088 prefetch queue is only 4 bytes, it'd be pretty hard to write a short-enough loop requiring register saving to capitalize on shorter instruction length of PUSH/POP.

There's no denying this either, but I also haven't seen any comparison timings a la Abrash-style. (I'd do this but I currently am out of reach of my Tandy 1000.)

I know which one is faster, what I'm trying to figure out is how much faster (in the case of an empty prefetch queue) and if that is significant enough to try to optimize for assuming there is even the availability of a spare register.

I'm hypothesizing that the raw speed advantage is not enough over the course of an entire program to prefer the spare-register approach, which is what I had been naively doing (e.g. "register juggling"). Rather, saving two bytes each time a register must be saved and then restored later will add up and the result will be a smaller, simpler program. Of course, this is a very abstract statement to make! :)
 
Yes, but only in some cases.
Namely, the forms:
XCHG ax, r16
XCHG r16, ax

are encoded with a single-byte opcode.
Any other form of reg, reg is a 2-byte opcode, like MOV.
The shorter opcode will save you some cycles. Else, it may be one cycle slower than MOV, so it only makes sense if you actually want to exchange two registers, rather than just storing the contents of one register in another.

Oh, there are a couple of cases where XCHG is more useful, such as doing I/O to two ports in the same loop. Since the I/O instructions that take a 16-bit port address are restricted to DX for the port and AL/AX for the data (i.e. there's no such instruction as IN BL,CX), XCHG can be quite useful.

As observed, it depends--on a lot.
 
I'm hypothesizing that the raw speed advantage is not enough over the course of an entire program to prefer the spare-register approach, which is what I had been naively doing (e.g. "register juggling"). Rather, saving two bytes each time a register must be saved and then restored later will add up and the result will be a smaller, simpler program. Of course, this is a very abstract statement to make! :)

Use of memory vs. register file is a common problem in HLL optimization. The Dragon book has a good discussion of this.
 
I know which one is faster, what I'm trying to figure out is how much faster (in the case of an empty prefetch queue) and if that is significant enough to try to optimize for assuming there is even the availability of a spare register.

When optimizing 8088 code, I find that a useful rule of thumb is to not try to count CPU cycles but count bus IO cycles - i.e. reads/writes to memory/ports including the ones needed to fetch the instruction bytes. When comparing two functionally-identical pieces of code, the one that does the fewest IOs is usually the faster one (the most common exception involves long-running instructions such as multiplies, divides and multi-bit shifts). So by this metric, two "mov rw,rw"s take 2/3 of the time of a "push rw" and a "pop rw". The difference is significant enough that one of the first things you want to do when writing an inner loop is try to arrange things in registers to avoid excess memory operations.
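
Here is a minimal Python sketch of that counting rule applied to the two save/restore sequences. It assumes the 8088's 8-bit bus, so each instruction byte and each data byte is one bus access; the function name `bus_accesses_8088` is invented for illustration, and the prefetch queue state is ignored.

```python
from fractions import Fraction

def bus_accesses_8088(instruction_bytes: int, data_bytes: int = 0) -> int:
    """Bus accesses for one instruction on an 8-bit bus: one per byte moved."""
    return instruction_bytes + data_bytes

# Two "mov rw,rw": 2 instruction bytes each, no data traffic.
mov_pair = 2 * bus_accesses_8088(2)

# "push rw" writes a 2-byte word; "pop rw" reads it back. Each is 1 opcode byte.
push_pop = bus_accesses_8088(1, 2) + bus_accesses_8088(1, 2)

ratio = Fraction(mov_pair, push_pop)
print(mov_pair, push_pop, ratio)
```

That gives 4 accesses against 6, which is where the 2/3 figure comes from.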

To get a more accurate measurement than that you need to know the actual instructions in the inner loop in order to determine the state of the prefetch queue.

I'd do this but I currently am out of reach of my Tandy 1000.

This is a job for the XT Server! You can upload to it a floppy disk image that contains a suitable DOS, your Zen-timer test program and an autoexec.bat that runs it and you'll see results from a real 4.77MHz 8088 (not an emulator).
 
When optimizing 8088 code, I find that a useful rule of thumb is to not try to count CPU cycles but count bus IO cycles - i.e. reads/writes to memory/ports including the ones needed to fetch the instruction bytes.
Yeah, this is probably the one reliable truism of 8088/8086 optimization, since only a few operations like multiply/divide are more bound by CPU time than by I/O. The prefetch queue does mitigate things a bit, but it's still probably the most reliable assumption you can make.
 
Yeah, this is probably the one reliable truism of 8088/8086 optimization, since only a few operations like multiply/divide are more bound by CPU time than by I/O. The prefetch queue does mitigate things a bit, but it's still probably the most reliable assumption you can make.

For 8088 optimization, yes. 8086 optimization is a bit different because it transfers 2 bytes in a single IO cycle when prefetching or doing an aligned word access. That means that bus IO is less of a bottleneck and the EU timings (and the state of the prefetch queue) are more significant, but I haven't experimented enough with 8086 optimization to determine if the rule of thumb is still as useful. But it makes most sense to optimize for the slowest machine that you're targeting, so 8088 optimization is more important unless you know that you're targeting an 8086 machine (e.g. if you're doing Amstrad PC1512 640x200x16 graphics). Normally, the only 8086-specific optimization worth thinking about is making sure that word accesses are aligned unless aligning them would make things slower on the 8088.
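
The alignment point can be sketched in Python with a simplified bus model. It only counts the bus cycles needed for a data transfer and assumes a bus that moves one naturally aligned unit per cycle (1 byte wide for the 8088, 2 bytes wide for the 8086); the function name `bus_cycles` and the example addresses are made up for illustration.

```python
def bus_cycles(address: int, size: int, bus_width: int) -> int:
    """Bus cycles needed to transfer `size` bytes starting at `address`.

    bus_width is the bus width in bytes: 1 for the 8088, 2 for the 8086.
    Each cycle moves one aligned bus-width unit, so an access that
    straddles a unit boundary needs an extra cycle.
    """
    first_unit = address // bus_width
    last_unit = (address + size - 1) // bus_width
    return last_unit - first_unit + 1

for cpu, width in (("8088", 1), ("8086", 2)):
    aligned = bus_cycles(0x0100, 2, width)     # word at an even address
    unaligned = bus_cycles(0x0101, 2, width)   # word at an odd address
    print(cpu, "aligned word:", aligned, "unaligned word:", unaligned)
```

In this model the 8088 always needs two cycles for a word, so alignment only makes a difference on the 8086.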
 
Yes, but only in some cases.
Namely, the forms:
XCHG ax, r16
XCHG r16, ax

are encoded with a single-byte opcode.
Any other form of reg, reg is a 2-byte opcode, like MOV.

Actually, XCHG ax, r16 is a 2-byte instruction but most, if not all, assemblers (with the exception of DEBUG) hide this fact by using the encoding for XCHG r16, ax.

Right, the published clock cycle time for a PUSH is 11/15 (not sure when it is one versus the other)
As I understand it, 11 is for the 8086 and 15 for the 8088. Another thing to consider is that segment registers take 1 cycle less to push (10 and 14, respectively) on the 808x processors.
 
As I understand it, 11 is for the 8086 and 15 for the 8088. Another thing to consider is that segment registers take 1 cycle less to push (10 and 14, respectively) on the 808x processors.

I just did some 8088 measurements and got 15 cycles average for both "PUSH AX" and "PUSH ES" with standard (period 18) DRAM refresh and 14 cycles for both with DRAM refresh disabled. This is running a long sequence of the same instruction in an unrolled loop, and subtracting timings from loops of different lengths to correct for initialization effects. So I think the published value of 15 cycles is actually incorrect.
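
The subtraction trick can be sketched in Python like this. The totals in the example are invented purely to show the arithmetic (they are not measured values), and the function name `cycles_per_instruction` is hypothetical.

```python
def cycles_per_instruction(short_count: int, short_cycles: float,
                           long_count: int, long_cycles: float) -> float:
    """Average cycles per instruction from two unrolled runs of different length.

    Both totals include the same fixed overhead (timer setup, loop entry,
    prefetch queue warm-up), so it cancels out in the subtraction.
    """
    return (long_cycles - short_cycles) / (long_count - short_count)

# Hypothetical totals: 120 cycles of fixed overhead plus 15 cycles per instruction.
short_run = 120 + 15 * 100   # 100 copies of the instruction
long_run = 120 + 15 * 200    # 200 copies of the instruction

print(cycles_per_instruction(100, short_run, 200, long_run))
```

Dividing either total by its instruction count on its own would overstate the cost, since the fixed overhead would get spread across the instructions.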
 