neilobremski
Experienced Member
On the 8088, is saving/restoring in a spare register significantly faster than using PUSH/POP?
PUSH and POP are single-byte instructions, which is fantastic for code size, but they access memory because they go through the stack. Together, a PUSH/POP pair takes 19 to 23 clock cycles according to Intel's specifications. MOV reg,reg, on the other hand, is two instruction bytes, which balloons to four when one MOV saves the register in a spare register and a second MOV restores it. But each MOV is rated at only 2 clock cycles.
Up until now I had been trying to avoid stack usage because of that memory access, thinking it would be much slower than saving in a spare register. However, the prefetch queue on the 8088 is only 4 bytes and it takes 4 cycles to read any single byte off the bus. That brings the performance of the two closer together, because 4 instruction bytes x 4 clocks per byte = 16 clock cycles for the MOV pair with an empty prefetch queue, rather than the 4 cycles the two MOVs are rated at.
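To make that arithmetic concrete, here is a rough sketch in Python. It only uses the figures quoted above (4 clocks per byte fetched, 2 clocks per MOV, 19 to 23 clocks for the PUSH/POP pair) and assumes the worst case of an empty prefetch queue, taking the larger of the rated time and the fetch time as the estimate. The function names are made up for illustration, and this is a back-of-the-envelope model, not a measurement.

```python
# Crude 8088 cost model using only the figures quoted in the post.
# Assumption: the prefetch queue is empty, so every instruction byte
# has to be fetched over the 8-bit bus at 4 clocks per byte.
BUS_CLOCKS_PER_BYTE = 4


def fetch_clocks(code_bytes):
    """Clocks needed just to fetch the instruction bytes."""
    return code_bytes * BUS_CLOCKS_PER_BYTE


def estimate_clocks(code_bytes, rated_clocks):
    """Take whichever is larger: the rated time or the fetch time.

    This is a simplification (hypothetical model): it treats fetching
    and execution as fully overlapped apart from the larger of the two.
    """
    return max(rated_clocks, fetch_clocks(code_bytes))


# Two MOV reg,reg instructions: 2 bytes and 2 rated clocks each.
mov_pair = estimate_clocks(code_bytes=4, rated_clocks=2 + 2)

# One PUSH plus one POP: 1 byte each, 19 to 23 rated clocks together.
push_pop_low = estimate_clocks(code_bytes=2, rated_clocks=19)
push_pop_high = estimate_clocks(code_bytes=2, rated_clocks=23)

print("MOV pair:", mov_pair, "clocks")
print("PUSH/POP pair:", push_pop_low, "to", push_pop_high, "clocks")
```

Under these assumptions the MOV pair is limited by fetching its 4 bytes, while the PUSH/POP pair is limited by its rated time, so the gap is much smaller than the rated 4 versus 19 to 23 suggests. Real code with a partly filled queue will land somewhere in between, which is why actual timings would still be needed.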
I suppose it comes down to whether you want to save code space (2 bytes versus 4) and whether you have a spare register to use.
This is just something I've been pondering recently and I haven't done any timings, but I'm curious what people here tend to recommend ...