8088 prefetch algorithm

Trixter · Jul 24, 2012

reenigne said:
For example, "cmp [bx],dl" seems to take 15 cycles if bx is even and 13 cycles if bx is odd, which is very surprising to me! I guess there must be some remnants of the 8086's 16-bit bus in the 8088. Even weirder, "and [bx+0x100],dx" takes 27 cycles, the same as "and [bx+0x100],dl" rather than the 35 cycles taken by "and [bp+0x100],dx". There is a similar discrepancy between "bp+si+0x100" and "bp+di+0x100" EAs. And I've only explored a small part of the space of instructions!

8-O Never knew that! Those are some pretty big discrepancies!

I've resumed work on my benchmark and should have something out by the weekend, but sadly the numbers in the database won't really help you because I explicitly leave interrupts enabled during benchmarking and also because I store microseconds elapsed instead of cycles. They'll get very close, but that's it.

sergey · Jul 24, 2012

Trixter said:
8-O Never knew that! Those are some pretty big discrepancies!

I've resumed work on my benchmark and should have something out by the weekend, but sadly the numbers in the database won't really help you because I explicitly leave interrupts enabled during benchmarking and also because I store microseconds elapsed instead of cycles. They'll get very close, but that's it.

Not sure what is the reason for leaving interrupts enabled. If it is for timing - you can read 8254 timer's channel 0 counter directly instead of using IRQ0 interrupt. Or use channel 2 (if you want to use a divider other than 65536 and don't want to reprogram channel 0), just make sure to turn off the speaker
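As a rough illustration of the counter-reading approach, here is a minimal Python sketch of the arithmetic once two counter values have been latched. It assumes an IBM PC/XT-style clock tree (14.31818 MHz crystal, divided by 3 for the CPU and by 12 for the timer) and a counter that decrements by one per tick (mode 2); as far as I know the BIOS default for channel 0 is mode 3, which counts down by two, so that case would need adjusting. The function names are made up for this example.

```python
# Sketch only: assumes a PC/XT-style clock tree and a mode 2 down-counter.
CRYSTAL_HZ = 14_318_180
CPU_HZ = CRYSTAL_HZ / 3          # about 4.77 MHz
PIT_HZ = CRYSTAL_HZ / 12         # about 1.19 MHz
CPU_CYCLES_PER_PIT_TICK = 4      # 12 / 3


def elapsed_pit_ticks(start_count: int, end_count: int, reload: int = 65536) -> int:
    """Ticks elapsed between two latched reads of a down-counter.

    Assumes less than one full reload period passed between the reads,
    so a single wraparound is handled by the modulo.
    """
    return (start_count - end_count) % reload


def pit_ticks_to_cpu_cycles(ticks: int) -> int:
    """Convert timer ticks to CPU cycles (resolution is 4 CPU cycles)."""
    return ticks * CPU_CYCLES_PER_PIT_TICK


def pit_ticks_to_microseconds(ticks: int) -> float:
    """Convert timer ticks to microseconds."""
    return ticks * 1_000_000 / PIT_HZ
```

Since one timer tick spans four CPU cycles under these assumptions, single-cycle differences only become visible when the code under test is repeated many times and the total is divided down.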

Interesting if anyone implemented a bus sniffer as was described above... FPU socket provides all the necessary signals for that.

sergey · Jul 24, 2012

Chuck(G) said:
On the CDC 6600, there was a 10 word (60 bit, 4 "parcel") "stack" (that's what it was called) that could hold up to 32 instructions--and loops could be coded to fit "in stack", so that no memory references were required to fetch instructions. Pity that the 808x didn't do that.

It is forgivable given the age of the 808x/80286, their purpose (cheap general-purpose CPUs), and the fact that memory at that time was almost as fast as the CPU itself. It might be less forgivable for the 386 and 486, which were pipelined: when a branch was taken, they had not only to clear the instruction queue but also to flush the pipeline.

If I remember correctly, the first branch prediction mechanism in the x86 family appeared in the Pentium. It was very crude, but it guessed correctly more than 50% of the time (so it was better than having no branch prediction at all). P6 (Pentium Pro and so on) implemented much better branch prediction and speculative execution. Later x86 CPUs went even further: they cache uops, so for a (short) loop they don't have to decode the loop's instructions multiple times.

reenigne · Jul 24, 2012

Trixter said:
8-O Never knew that! Those are some pretty big discrepancies!

I'm guessing the difference is just one or two cycles in the EU, but that this is the difference between making or missing a particular BIU slot so the difference gets amplified that way.

What's really surprising to me is that it's not a byte instruction getting slowed down to the speed of a word instruction, but the other way around - a word instruction (apparently) getting sped up to the speed of a byte instruction!

EA	ALU [],dl	ALU [],dx	cmp [],dl	cmp [],dx	cmp [odd],dx	cmp [odd],dl	lea dx,[]
0	23	31	16	20	20	16	10
bx	22	30	15	19	19	13	7
si	22	30	15	19	19	13	7
di	22	30	15	19	19	13	7
bx+di	24	32	17	21	21	17	10
bx+si	24	32	17	21	21	17	9
bp+si	24	32	17	21	21	17	10
bp+di	24	32	17	21	21	17	9
bp+di+1	28	36	21	25	27	21	13
bp+si+1	28	36	21	25	27	21	14
bx+di+1	28	36	21	25	27	21	14
bx+si+1	28	36	21	25	27	21	13
bp+1	26	34	19	23	23	19	11
bx+1	26	34	19	23	23	19	11
si+1	26	34	19	23	23	19	11
di+1	26	34	19	23	23	19	11
bp+di+0x100	28	28	21	21	21	21	13
bp+si+0x100	28	36	21	25	27	21	13
bx+di+0x100	28	36	21	25	27	21	14
bx+si+0x100	28	36	21	25	27	21	14
bp+0x100	27	35	19	23	23	19	11
bx+0x100	27	27	19	19	19	19	11
si+0x100	27	35	19	23	23	19	11
di+0x100	27	35	19	23	23	19	11
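The odd rows can be picked out mechanically. The following Python sketch copies the first four timing columns from the table above and flags every EA whose word form is no slower than its byte form. The dictionary layout and function names are just choices made for this illustration.

```python
# Cycle counts copied from the table above, per EA:
# (ALU [],dl, ALU [],dx, cmp [],dl, cmp [],dx)
TIMINGS = {
    "0": (23, 31, 16, 20),
    "bx": (22, 30, 15, 19),
    "si": (22, 30, 15, 19),
    "di": (22, 30, 15, 19),
    "bx+di": (24, 32, 17, 21),
    "bx+si": (24, 32, 17, 21),
    "bp+si": (24, 32, 17, 21),
    "bp+di": (24, 32, 17, 21),
    "bp+di+1": (28, 36, 21, 25),
    "bp+si+1": (28, 36, 21, 25),
    "bx+di+1": (28, 36, 21, 25),
    "bx+si+1": (28, 36, 21, 25),
    "bp+1": (26, 34, 19, 23),
    "bx+1": (26, 34, 19, 23),
    "si+1": (26, 34, 19, 23),
    "di+1": (26, 34, 19, 23),
    "bp+di+0x100": (28, 28, 21, 21),
    "bp+si+0x100": (28, 36, 21, 25),
    "bx+di+0x100": (28, 36, 21, 25),
    "bx+si+0x100": (28, 36, 21, 25),
    "bp+0x100": (27, 35, 19, 23),
    "bx+0x100": (27, 27, 19, 19),
    "si+0x100": (27, 35, 19, 23),
    "di+0x100": (27, 35, 19, 23),
}


def word_penalties(timings):
    """Set of (ALU word-minus-byte, cmp word-minus-byte) cycle differences."""
    return {(aw - ab, cw - cb) for ab, aw, cb, cw in timings.values()}


def anomalous_eas(timings):
    """EAs whose word form takes no longer than the byte form."""
    return [ea for ea, (ab, aw, cb, cw) in timings.items()
            if aw == ab or cw == cb]


print(word_penalties(TIMINGS))
print(anomalous_eas(TIMINGS))
```

Every other row shows 8 extra cycles for the word ALU form and 4 for the word cmp, which is what one extra 4-clock bus cycle per additional byte transferred would give (read plus write for the ALU op, read only for cmp). Only bp+di+0x100 and bx+0x100 show no penalty at all.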

sergey said:
Not sure what is the reason for leaving interrupts enabled. If it is for timing - you can read 8254 timer's channel 0 counter directly instead of using IRQ0 interrupt. Or use channel 2 (if you want to use a divider other than 65536 and don't want to reprogram channel 0), just make sure to turn off the speaker

I think Trixter's benchmark program has a rather different purpose than my experiments - he wants to see what the "real life" speed of various machines is (and therefore leaves enabled all these features like interrupts and DRAM refresh), while I'm trying to reverse-engineer some fiddly details of the CPU so I disable them all.

sergey said:
Interesting if anyone implemented a bus sniffer as was described above... FPU socket provides all the necessary signals for that.

Yeah, the most interesting ones are available on the FPU socket. The ones that aren't are NMI, INTR, -LOCK, -RQ/-GT0, -RD and -SS0. I'll probably try using the FPU socket first but I'll design the sniffer with enough multiplexers to do those ones too - I might find that I want to look at some of them (or indeed some signals from elsewhere on the board) as my experiments progress.

Chuck(G) · Jul 24, 2012

Do I understand correctly that the timing tables given above are all memory-op-register->memory types? Have you deducted the instruction fetch times?

Does the same difference hold if timings for memory-op-register->register instructions are used?

Trixter · Jul 24, 2012

sergey said:
Not sure what is the reason for leaving interrupts enabled. If it is for timing - you can read 8254 timer's channel 0 counter directly instead of using IRQ0 interrupt. Or use channel 2 (if you want to use a divider other than 65536 and don't want to reprogram channel 0), just make sure to turn off the speaker

My reason is that I am benchmarking real-world performance, and in the real world, you don't CLI or mask everything out of the PIC for 2 seconds. Doing so on some machines (like the PCjr or DEC Rainbow or some other clones) will bork the entire machine. Even on true IBMs, disabling interrupts for 2 seconds will likely kill things that need regular acknowledgement, like a packet driver. So that's my justification. I didn't want my benchmark locking up people's machines or causing trouble.

Trixter · Jul 24, 2012

sergey said:
P6 (Pentium Pro and so on) implemented much better branch prediction and speculative execution.

If this was the case, why was the PPro so much worse at running 16-bit code than the Pentium? (I'm guessing it's because they tuned the prediction and speculative execution for 32-bit code, but if you know the real answer, I'd be curious to know.)

gslick · Jul 24, 2012

sergey said:
Interesting if anyone implemented a bus sniffer as was described above... FPU socket provides all the necessary signals for that.

I have an HP 10305B (64653A) 8086/8088 pre-processor probe interface for the HP 16500 series logic analyzers. I haven't had a need to use it yet. I'll have to take a look at the manual for some details on how it works and I should give it a try sometime to see if it would provide useful cycle accurate information for these types of experiments.

-Glen

sergey · Jul 24, 2012

Trixter said:
If this was the case, why was the PPro so much worse at running 16-bit code than the Pentium? (I'm guessing it's because they tuned the prediction and speculative execution for 32-bit code, but if you know the real answer, I'd be curious to know.)

It must be a rumor of some sort

As far as I know, PPro 16-bit performance was the same as or even better than that of a similarly clocked Pentium. Perhaps people were expecting even better performance given the premium price of the PPro (or because of the 'Pro' suffix...).
Pentium Pro was not as popular as one could expect, but that could be explained by higher system price (not only CPU, but also motherboard) than that of P54 or P55C systems, and then availability of non-Intel Socket 7 CPUs...

See this (look for The DOS Performance Of The Pentium II and DOS Game Performance):
http://www.tomshardware.com/reviews/empire-strikes-back,23.html

Or this (for Windows 95 performance):
ftp://ftp.gwdg.de/pub/misc/x86.org/http/digest/May97/Feature01.html

reenigne · Jul 24, 2012

Chuck(G) said:
Do I understand correctly that the timing tables given above are all memory-op-register->memory types?

Yes, that's right.

Chuck(G) said:
Have you deducted the instruction fetch times?

In as far as it's possible to do so - these timings were done by having a large unrolled loop of a set of instructions executed 480 times (actually I run it 48 and 528 times and then subtract to eliminate startup effects). The set of instructions consists of a multiply (to force the test to be EU bound and to fill up the prefetch queue) and the particular instruction I'm measuring. Then I subtract the time taken for the multiply on its own (measured the same way). However, I'm not sure of all the interactions between memory accesses initiated by the instructions under test, and memory accesses for fetching the next instruction - I'm guessing it's these interactions that are responsible for this strange behavior.
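The subtraction scheme described above can be written out as a short Python sketch. The function names and the sample totals are invented for illustration and are not measurements from the XT.

```python
def cycles_per_iteration(total_short, total_long, n_short=48, n_long=528):
    """Per-iteration cost with fixed startup overhead cancelled out.

    total_short / total_long are the measured cycle totals for runs of
    n_short and n_long iterations of the same unrolled body.
    """
    return (total_long - total_short) / (n_long - n_short)


def instruction_cycles(pair_short, pair_long, mul_short, mul_long):
    """Cost of the instruction under test.

    pair_* are totals for (multiply + instruction under test),
    mul_* are totals for the multiply on its own.
    """
    return (cycles_per_iteration(pair_short, pair_long)
            - cycles_per_iteration(mul_short, mul_long))


# Made-up totals: 500 cycles of startup, a 120-cycle multiply and a
# 15-cycle instruction under test.
pair_short = 500 + 48 * (120 + 15)
pair_long = 500 + 528 * (120 + 15)
mul_short = 500 + 48 * 120
mul_long = 500 + 528 * 120

print(instruction_cycles(pair_short, pair_long, mul_short, mul_long))  # 15.0
```

The two-run difference removes anything that happens once per run, while the multiply baseline removes the cost of the padding instruction. Whatever remains still includes any interaction between the instruction's own memory accesses and the prefetcher, which is the part that can't be subtracted away.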

Chuck(G) said:
Does the same difference hold if timings for memory-op-register->register instructions are used?

Good question. I don't have my XT at the moment to try it out (I recently relocated from the US to the UK and many of my possessions including this one are in transit and won't arrive until the beginning of September). If you (or anyone) has a suitable machine to try it with, I'll put together the experiment as a DOS program (currently it gets loaded over the serial port by a program that's loaded over the keyboard port by a microcontroller).

pearce_jj · Jul 24, 2012

sergey said:
Pentium Pro was not as popular as one could expect, but that could be explained by higher system price (not only CPU, but also motherboard) than that of P54 or P55C systems, and then availability of non-Intel Socket 7 CPUs...

It did though, with NT4, put 'wintel' servers on the map. Up until that point it was AS/400 or Sun SPARC or better for anything critical and/or database like. But 4-way SMP, relatively decent address space and a stable(ish) OS to use it were compelling at the prices Compaq were selling machines like the Proliant 5000 for (compared to the aforementioned, that is).

I actually thought the P-Pro ran 16-bit code through a microcode 'interpreter' of sorts - but I could well be wrong.

sergey · Jul 25, 2012

pearce_jj said:
It did though, with NT4, put 'wintel' servers on the map.

I actually thought the P-Pro ran 16-bit code through a microcode 'interpreter' of sorts - but I could well be wrong.

Well, the Pentium Pro was specifically marketed for workstations and servers (in contrast to the Pentium and Pentium MMX, which were marketed for desktops).

The P6 micro-architecture (used in PPro, PII, P3, Pentium M, Core, and later Intel CPUs) is actually implemented as a RISC machine with a front end that converts x86 instructions to RISC uops. AFAIK operand size doesn't really matter; the ALUs operate on 32-bit (nowadays 64-bit) internal registers anyhow. There are some tags that indicate the desired operand / result size, but they won't make 8-bit or 16-bit instructions slower than 32-bit or 64-bit ones. Here is a pretty detailed description of P6 architecture.

From further investigation, it looks like PII 16-bit performance was improved (vs. PPro) by implementing an additional segment descriptor cache. This is not directly related to the performance of 16-bit vs. 32-bit instructions, but to the tendency of 16-bit code to manipulate segment registers more frequently than 32-bit code does (which most likely just uses flat 4 GiB segments).

Trixter · Jul 25, 2012

sergey said:
See this (look for The DOS Performance Of The Pentium II and DOS Game Performance):
http://www.tomshardware.com/reviews/empire-strikes-back,23.html

Thank you, but these are all 32-bit benchmarks

When I release my full benchmark tool, I'll be sure to include both my 200MHz PPro and my 233MHz PII numbers in it so that I can either prove my point, or disprove it

westveld · Jul 25, 2012

gslick said:
I have an HP 10305B (64653A) 8086/8088 pre-processor probe interface for the HP 16500 series logic analyzers. I haven't had a need to use it yet. I'll have to take a look at the manual for some details on how it works and I should give it a try sometime to see if it would provide useful cycle accurate information for these types of experiments.

-Glen

Is that an adapter to probe the CPU signals?

I've got a PC 64-256 board and an XT 64-256 board if someone wants code run and signals sniffed with:
http://www.seeedstudio.com/depot/open-workbench-logic-sniffer-p-612.html?cPath=174

16 signals or fewer; I don't have a 5 V buffer for the extra 16 pins.

reenigne · Jul 29, 2012

Here's my design for a bus sniffer, such as it is so far. Be gentle, it's my first attempt to design an ISA card. Also since this is a piece of experimental apparatus rather than a piece of end-user hardware, I didn't spend much time making (e.g.) the routing beautiful.

This will capture all the signals from the ISA bus and the CPU (or FPU) socket, at the same time. Which is really far more information than is needed but the more signals I capture the easier it'll be to debug the thing and to see exactly what's going on (besides which, multiplexers are cheap). I ended up not putting in proper IO ports as doing so started getting complicated, so instead I'll be triggering the sniffer by having the microcontroller watch some subset of the address lines and having the 8088 access particular addresses to set parameters and start recording. I also realized that I'd be sending the results over serial to a modern PC anyway, so I might as well send them over serial straight from the microcontroller. So the sniffer ends up being completely invisible to software running on the 8088.

I'll try to get this built in the next month or so so that I can try it out when my XT shows up.

eeguru · Jul 29, 2012

I'm a bit confused about that design. It looks like you have 11 8:1 muxes. The minimum ISA bus clock you'll face is 4.77 MHz, with an average transition rate closer to 2 MHz on most signals. So the rate needed to poll all of them, assuming you could capture at most half the muxes at once on an 8-bit micro, is:

2 MHz * 2.5x for Nyquist (since you're not phase-locked to the ISA clock) * 2 since you need to perform 2 sets of mux captures * 8 inputs on each mux = 160 MHz sampling rate. And conservatively, if it took you 8 instructions (assuming a 1:1 clock/instruction rate) to change mux inputs and perform the reads, you would need an AVR8 running at ~1.3 GHz. Not to mention that realistically capturing all traffic in real time over ISA would take somewhere between 40 and 250 Mbit/s. Not going to happen over serial.
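As a rough check of the estimate above, the factors can be multiplied out in a few lines of Python. The function and parameter names are invented for this sketch; the input values are the ones listed in the estimate.

```python
def required_sample_rate_hz(transition_hz, oversample, capture_passes, mux_inputs):
    """Sampling rate needed to visit every mux input often enough."""
    return transition_hz * oversample * capture_passes * mux_inputs


def required_mcu_clock_hz(sample_rate_hz, instructions_per_sample,
                          clocks_per_instruction=1):
    """Core clock needed if each sample costs a fixed number of instructions."""
    return sample_rate_hz * instructions_per_sample * clocks_per_instruction


rate = required_sample_rate_hz(
    transition_hz=2e6,   # average signal transition rate
    oversample=2.5,      # margin for sampling without phase lock
    capture_passes=2,    # two sets of mux captures
    mux_inputs=8,        # inputs per 8:1 mux
)
clock = required_mcu_clock_hz(rate, instructions_per_sample=8)

print(rate / 1e6, "MHz sampling")   # 80.0 MHz sampling
print(clock / 1e6, "MHz core")      # 640.0 MHz core
```

Taken literally, the listed factors multiply to 80 MHz and 640 MHz, so the 160 MHz and ~1.3 GHz figures imply a further factor of two that is not spelled out. Either way, the result is far beyond what a 20 MHz 8-bit micro can do.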

You'd be better off trying this with an FPGA or small CPLD connected to an SDR SDRAM or SRAM. Then use a fast micro, like an STM32F4, Cypress FX2, or FT2232H, to move data over USB. I have a hardware design that would be capable of this with little modification; however, I have no time to write code for it.

I hope I was gentle enough

eeguru · Jul 29, 2012

I have another project idea in my head lately that would allow someone to code a perfectly cycle-accurate recreation of an 8088 in HDL, if they were after that. The Zed Board is finally shipping. It would be straightforward to build an FMC-to-header adapter board with level translation that would allow the connection of all signals of a real 8088. You could then instantiate an 8088 development core in the programmable fabric and run both at full speed in lock step. There would be enough memory bandwidth (to Linux, no less) to capture all of the signal traces to memory in real time and perform a parallel signal-to-signal compare at each rising and falling edge of the base clock. When the signal sets diverged, execution could be stopped and the data examined in memory. It would be a pretty neat experimental rig for hacking on other processors too.
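The compare step of such a rig might look like the following Python sketch, applied to traces already captured to memory. It assumes each sample is an integer with one bit per pin, taken at each clock edge; that encoding and the function names are assumptions made for this illustration, not part of any existing design.

```python
from typing import List, Optional, Sequence


def first_divergence(reference: Sequence[int],
                     candidate: Sequence[int]) -> Optional[int]:
    """Index of the first sample where the traces differ, or None if equal.

    reference is the trace from the real chip, candidate the trace from
    the HDL core, both sampled on the same clock edges.
    """
    for index, (ref, cand) in enumerate(zip(reference, candidate)):
        if ref != cand:
            return index
    if len(reference) != len(candidate):
        # One trace ended early: report the first missing sample.
        return min(len(reference), len(candidate))
    return None


def differing_signals(ref_sample: int, cand_sample: int) -> List[int]:
    """Bit positions (pin indices) that differ between two samples."""
    diff = ref_sample ^ cand_sample
    return [bit for bit in range(diff.bit_length()) if (diff >> bit) & 1]


real = [0b1010, 0b1011, 0b0001]
core = [0b1010, 0b1001, 0b0001]
where = first_divergence(real, core)
print(where, differing_signals(real[where], core[where]))  # 1 [1]
```

In hardware the same comparison would be an XOR across the pin vector at every edge; the software version is only useful for inspecting a capture after the fact.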

At $300 the Zed Board is fairly expensive for most hobbyists though.

westveld · Jul 30, 2012

I've been down this road before - and the micro was too slow by far

An ATmega644P at 20 MHz barely has time to detect an 8-bit value and turn on an output port just before a write cycle is over (4.77 MHz bus on an XT).

This worked great for me -
https://www.sparkfun.com/products/9857

You can get it for $50 shipped if you can wait:
http://www.seeedstudio.com/depot/open-workbench-logic-sniffer-p-612.html

That was what allowed me to see how far off the micro was.

The only way I could see doing it with a micro would be with an SRAM buffer and/or lots of latches and external logic to do all the work, and the micro would just look at the results.

westveld · Jul 30, 2012

Went looking for the logic capture of when I tested the AVR, couldn't find it.

This is the system clock and a MEM write pulse -
http://dl.dropbox.com/u/2024756/stuff/isa-cap.png

When I tested the AVR, I had a super simple/small ASM loop - check port for value (high 4 bits of address bus & memw & aen) and immediately flip an output bit high.

The AVR output bit didn't go high until the last third of the MEMW pulse.

AVR loop:

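LOOP:
in addr,PINA      ; sample port A (high 4 address bits, MEMW, AEN); addr is assumed to be a register alias
cpse r18,addr     ; skip the next instruction if the sample equals the pattern held in r18
rjmp LOOP         ; no match: poll again
sbi PORTD,7       ; match: drive PORTD bit 7 high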
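A rough cycle count for the loop above shows why the response arrives so late. This Python sketch uses the usual classic-AVR timings (in: 1 cycle, cpse: 1 cycle without a skip or 2 when skipping a one-word instruction, rjmp: 2, sbi: 2), which are assumed here to apply to the ATmega644P. It ignores the input synchronizer delay on the port pins, so real latency would be somewhat worse.

```python
AVR_HZ = 20_000_000
BUS_HZ = 14_318_180 / 3      # XT bus clock, about 4.77 MHz

# Assumed classic-AVR instruction timings, in CPU cycles.
IN_CYCLES = 1
CPSE_NO_SKIP = 1
CPSE_SKIP_ONE_WORD = 2
RJMP_CYCLES = 2
SBI_CYCLES = 2


def to_ns(cycles, hz=AVR_HZ):
    """Convert a cycle count to nanoseconds at the given clock."""
    return cycles * 1e9 / hz


# One pass of the loop when the pattern does not match.
poll_cycles = IN_CYCLES + CPSE_NO_SKIP + RJMP_CYCLES
# Path taken on a match: sample, skip the rjmp, set the bit.
match_cycles = IN_CYCLES + CPSE_SKIP_ONE_WORD + SBI_CYCLES

best_ns = to_ns(match_cycles)                  # pattern present when sampled
worst_ns = to_ns(poll_cycles + match_cycles)   # pattern arrives just after a sample

bus_clock_ns = 1e9 / BUS_HZ
print(best_ns, worst_ns)                       # 250.0 450.0
print(best_ns / bus_clock_ns, worst_ns / bus_clock_ns)
```

Under these assumptions the reaction costs somewhere between about 1.2 and 2.1 bus clocks even with the tightest possible loop. That fits the observation that the output only went high late in the MEMW pulse.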

Chuck(G) · Jul 30, 2012

An 8-bit MCU seems to me to be the wrong tool for the job. A DSP with capture logic and DMA might cut the mustard, but again, I've got to wonder if an FPGA or even a couple of CPLDs might be a better choice.
