• Please review our updated Terms and Rules here

8088 prefetch algorithm

aitotat

Experienced Member
Joined
Aug 13, 2009
Messages
351
Location
Finland
Are there any documentation for 8088 (and later x86 CPUs too) prefetch algorithm? I know that prefetching occurs when execution unit does not need to access bus. I'd like to know exactly how and when prefetching is triggered and how (or if) execution unit overrides any prefetch cycles that have begun but not yet completed.
 
Are there any documentation for 8088 (and later x86 CPUs too) prefetch algorithm? I know that prefetching occurs when execution unit does not need to access bus. I'd like to know exactly how and when prefetching is triggered and how (or if) execution unit overrides any prefetch cycles that have begun but not yet completed.

The 8088/8086 were the first CPU's by Intel who did any form for prefetching, so the prefetch algorithm is actually rather simple for those early processors. The two processors contains a queue of 4 (for the 8088) or 6 (for the 8086) bytes.

So the algoritms are as follows:
If the queue is not full (one byte[8088]/word[8086] empty) and the execution unit doesn't request bus-usage, then read one byte(8088/8086 odd addresses) or word(8086 even addresses) of the next instruction and store it in the queue.
If a jump (any kind, even "POP CS" IIRC) occours, then completely reset the queue.
 
That much i already know but i need more details.

Lets assume that there is only one instruction (one byte) in the prefetch queue: PUSH AX. It executes in 15 cycles on a 8088. It has two byte writes so that leaves 7 cycles for prefetch unit.

7 cycles is enough for fetching one byte since 4 cycles is needed for byte transfer (assuming zero wait states). If prefetch unit tries to fetch second byte it would delay execution unit by one cycle. That would mean that PUSH AX would execute in 16 cycles.

That cannot be right (execution unit has higher priority) so how does prefetch unit know to fetch only one byte instead of two?
 
That much i already know but i need more details.

Lets assume that there is only one instruction (one byte) in the prefetch queue: PUSH AX. It executes in 15 cycles on a 8088. It has two byte writes so that leaves 7 cycles for prefetch unit.

7 cycles is enough for fetching one byte since 4 cycles is needed for byte transfer (assuming zero wait states). If prefetch unit tries to fetch second byte it would delay execution unit by one cycle. That would mean that PUSH AX would execute in 16 cycles.

That cannot be right (execution unit has higher priority) so how does prefetch unit know to fetch only one byte instead of two?

From a demonstration in the 8088/8086 manual (note that the demonstration is according to the 8086):
Clock cycle 13 begins the execution of the PUSH AX instruction, and in clock cycle 15, the BIU begins the fourth opcode fetch. The BUI finishes the fourth fetch in clock cycle 18 and prepares for another fetch when it receives a request from the EU for a memory write (the stack push). Instead of completing the opcode fetch and forcing the EU to wait four additional clock cycles, the BIU immediately aborts the fetch cycle (ressulting in two idle clock cycles (TI) in clock cycles 19 and 20) and performs the required memory write. This interaction between the EU and BIU results in a single clock extension to the execution time of the PUSH AX instruction, the maximum delay that can occur in response to an EU bus cycle request.
 
That is the kind of information i'm looking for. Where can i find that manual?
 
That is the kind of information i'm looking for. Where can i find that manual?

I bought it on Ebay. It's called "iAPX 8086/8088 User's Manual". I have attached a scan of the example I was refering to.
 

Attachments

  • Example 1.zip
    60.6 KB · Views: 4
  • Example 3.zip
    48.3 KB · Views: 1
  • Example 2.zip
    48.1 KB · Views: 1
Thank you for the scans. They were very interesting.

I was surprised to see that there are already one word fetched when execution of JMP instruction ends. I assumed that prefetch queue would always be empty when execution of any sort of jump instruction ends. Does that book contain information at what cycle the read or write starts for all instructions that needs to use memory?
 
The book has the instruction timings and the rough algorithm for instruction prefetch. But you should keep in mind that at best the timings and examples are estimates. Memory refresh cycles and DMA can hold things off, making the timings worse than what is published.

The prefetch queue blindly prefetches instructions in order. Even while a JMP instruction is executing .. which is fine if the jump is not taken, but if the jump is taken the prefetch queue is worthless and needs to be reloaded. All good code should be written such that branches are assumed not taken. In the case of tight loops where the branch will be taken you can reduce the penalty by unrolling the loop one or more times.

Check out Michael Abrash's book "Graphics Programming Black Book" - he has a very good chapter on the fallacy of cycle counting on x86 architecture machines. It is required reading for anybody interested in writing high performance code.


Mike
 
Even while a JMP instruction is executing .. which is fine if the jump is not taken, but if the jump is taken the prefetch queue is worthless and needs to be reloaded.

Is there ever a case where a JMP is not taken? I can see that this would be an issue with a conditional branch/jump (JZ, JB, etc.), but a JMP, to the best of my knowledge is always taken.
 
I should have been more precise. Substitute 'JMP' with 'jump', or conditional branch ..

For that matter calls and returns have the same problem too - they are always executed and represent a change of control flow.
 
It's worthwhile observing that the prefetch queue is just that--a unidirectional lineup of bytes. There's no provision in the hardware for satisfying jump targets from within the queue (as in a relative branch to skip over a single instruction). Any (taken) jump, interrupt (internal(taken) or external) or return resets the prefetch pointer and it starts again.

On the CDC 6600, there was a 10 word (60 bit, 4 "parcel") "stack" (that's what it was called) that could hold up to 32 instructions--and loops could be coded to fit "in stack", so that no memory references were required to fetch instructions. Pity that the 808x didn't do that.

If you use a V20, all of your timings will probably be off. It's quite easy on a V20 to outrun the prefetch mechanism if short execution-time instructions are used.

Other than needing to clear the queue because of self-modifying code, it's probably best to ignore the prefetch queue, given the wild variations of x86 family instruction timings.
 
If you use a V20, all of your timings will probably be off. It's quite easy on a V20 to outrun the prefetch mechanism if short execution-time instructions are used.

My understanding is that the V20 has a better (or bigger) prefetch queue, so that it is not as likely to run out of bytes. This is part of the performance improvement that the V20 gives - it is less likely to run out of instructions, even though it is more efficient on certain instructions.
 
Nope, same as the 8088 - 4 bytes or 6 for the V30. I think that was to avoid the issue of incompatibility with 8088 code.

The V20 mostly gets its speed from using dual internal data buses, additional temporary registers (for multiply and iterative instructions) and a hardware effective address generator.

For example, if you perform the instruction ADD AX,BX, the operands are transmitted to the adder inputs simultaneously, instead of serially.

Same situation exists for the temporary and loop counter registers--they eliminate needless shuffling of data around with the accompanying microcode overhead. An instruction such as SHR AX,CL needs only the instruction startup overhead plus one clock per shift.
 
I'm thinking of writing a cycle exact emulator so i need much more detailed information than can be found on the usual programming books.

I would need to know the exact cycle when bus operations get started. Those three scanned pages show, for example, that JMP instruction clears prefetch queue on 9th cycle. Instruction fetching starts after that so first word for target instruction is already fetched one cycle before the execution of JMP instruction ends. Those scans also shows that execution unit always reads one byte at a time from prefetch queue. It makes sense when opcode or ModRM byte is expected (since they need to be decoded to know what byte(s), if anything, follows them). But even immediate words are read as two bytes from prefetch queue to execution unit. I have not found that detailed information on any programming book.
 
I'm thinking of writing a cycle exact emulator so i need much more detailed information than can be found on the usual programming books.

I would need to know the exact cycle when bus operations get started. Those three scanned pages show, for example, that JMP instruction clears prefetch queue on 9th cycle. Instruction fetching starts after that so first word for target instruction is already fetched one cycle before the execution of JMP instruction ends. Those scans also shows that execution unit always reads one byte at a time from prefetch queue. It makes sense when opcode or ModRM byte is expected (since they need to be decoded to know what byte(s), if anything, follows them). But even immediate words are read as two bytes from prefetch queue to execution unit. I have not found that detailed information on any programming book.

For that, I think you will have to manually monitor the QS0 and QS1 lines while you run every single opcode, or ask Intel. They haven't really published anymore else than that example (Maybe in the "iAPX 88 Book") about detailed facts on how the prefetch queue works.

But I think the instruction cycle-time table present in the manual should be rather accurate. For an example, most conditional jumps says they uses 16 or 4 cycles, and it obivously means that 16 is when the jump is taken and 4 is then the jump is not taken.
 
But I think the instruction cycle-time table present in the manual should be rather accurate. For an example, most conditional jumps says they uses 16 or 4 cycles, and it obivously means that 16 is when the jump is taken and 4 is then the jump is not taken.

In many, if not most cases, the instruction timing tables are fiction in the real world, as the 8088 prefetch queue rarely has an instruction ready for execution, unless you're using slow, short instructions (e.g. AAD). Most often, actual execution time turns out to be (4*number of bytes in an instruction) or the published instruction time, whichever is greater (plus a little bit for refresh overhead).

But the definitive answer could probably be obtained from the 8088 microcode. But so many years after the release of the 8088, I don't know where you'd find a copy, save from the Intel archives.
 
Sorry to resurrect an old thread, but I'm interested in writing a cycle-exact 8088 emulator as well. Aitotat, did you ever get anywhere with yours? Are you interested in collaborating? I have some preliminary code on github but it's not really in a working state at the moment - I got it running as far as the PIT test in the IBM BIOS, and it takes a long time to get that far! I also need to get the instruction timings calibrated properly, which as you know is very difficult. I've got some code running to do cycle-exact timings but the results are even more complicated than I imagined they would be. For example, "cmp [bx],dl" seems to take 15 cycles if bx is even and 13 cycles if bx is odd, which is very surprising to me! I guess there must be some remnants of the 8086's 16-bit bus in the 8088. Even weirder, "and [bx+0x100],dx" takes 27 cycles, the same as "and [bx+0x100],dl" rather than the 35 cycles taken by "and [bp+0x100],dx". There is a similar discrepancy between "bp+si+0x100" and "bp+di+0x100" EAs. And I've only explored a small part of the space of instructions!

Per's suggestion of monitoring the QS0 and QS1 lines has been fermenting in my brain for a while, and I think it's the best way to figure out exactly when each byte of each instruction leaves the prefetch queue and straighten all this out. What I'm thinking of doing is making a ISA "bus sniffer" card with a microcontroller (probably an AVR ATMega328, as I have a couple of those lying around and I'm familiar with them). The idea is that it'll monitor various lines from the bus and also have some wires leading to the CPU (or possibly FPU) socket so that it can monitor the CPU status lines that aren't exposed on the bus (QS0 and QS1 are the important ones, but there are some others that might be interesting as well). It'll probably expose a couple of IO ports, one to start the monitoring (causing the microcontroller to save the state of the monitored lines in its RAM on each cycle until the RAM is full) and the other to read back the stored values once the piece of code in question has finished executing.
 
A net exact HDL representation of an 8088 would be a nice project too. It's a shame Intel doesn't open up more information now +30 years.
 
Maybe getting someone interested at e.g. Bletchly might yield something from Intel?
 
On the CDC 6600, there was a 10 word (60 bit, 4 "parcel") "stack" (that's what it was called) that could hold up to 32 instructions--and loops could be coded to fit "in stack", so that no memory references were required to fetch instructions. Pity that the 808x didn't do that.

Well, one of my favorite translation bits comes close:

Code:
LODSB
XLAT
STOSB

Each opcode is a byte. Usually followed by LOOP, which is 2 bytes, that's 5 which "fits" into the 8086's prefetch queue. (But not really, since the LOOP empties it.)
 
Back
Top