Modern XT compatible PC on FPGA with real 8088

Trixter · Mar 2, 2015

You can/should unpack everything you try to run/test with UNP before starting. Maybe your Packman has the same issue?

I can provide uncompressed paratrooper if you need it; PM me.

Chuck(G) · Mar 2, 2015

There are several things that depend upon at least a register-compatible NEC uPD765 FDC. Of course, if you're just going to run games that have been hacked to disable copy protection, maybe that doesn't matter and you can funnel everything through INT 13H BIOS services.

But if you want to run, say, Harvard Presentation Graphics 1.0 in its "unhacked" format, good luck--the copy protection key is located in the gap between sectors and pretty much demands that the floppy is written a whole track at a time (format pattern and all). Or if you want to run the first version of FastBack (backup utility), good luck--it programs the FDC for its own unusual format. How well can you run Lotus 1-2-3 in its original version (with no hacking)?

So, it really depends on what you want to do. Games should be easy, as the copy protection has been hacked out on just about all of them. Productivity software, not so much.

DDS · Mar 3, 2015

reenigne said:
It does not. It does do some cycle counting, but uses the published (best case) counts and does not emulate waiting for the bus at all, as far as I can see.

I think it's actually marginally harder, because on the 8086 there are penalties for non-aligned accesses which (I'm pretty sure) don't exist on 8088.

I'm not aware of any programs that have been published and for which the consequences for running without cycle exact timing are anything except running at the wrong speed. I expect that to change in the next couple of months.

Back in the dim recesses of my memory lurks the phrase "always empty queue" that was in reference to the 8088 instruction prefetch queue. IIRC the main, perhaps only, difference between the 8088 and 8086 is the Bus Interface Unit (BIU) as their Execution Units (EU) are identical or nearly so. But since the 8088 could only fetch one half of an instruction word at a time its prefetch cycle was only half as fast as an 8086's. The article referencing this mentioned that only hardware multiply and hardware divide instructions took longer to execute than to fetch, and as a result the 8088 prefetch queue was (almost) always empty. A quick search didn't turn up the article in question but did turn up a discussion on the resulting variability of 8088 execution timings in one of the letters here:

http://collaboration.cmc.ec.gc.ca/s...ebsite/articles/DDJ/1988/8801/8801k/8801k.htm

cr1901 · Mar 3, 2015

Chuck(G) said:
But if you want to run, say, Harvard Presentation Graphics 1.0 in its "unhacked" format, good luck--the copy protection key is located in the gap between sectors and pretty much demands that the floppy is written a whole track at a time (format pattern and all).

Just out of curiosity... let's see if I've retained the information from those helpful floppy docs you sent me :3. To actually extract the key, you'd program the NEC Gap Length a different number than 0x2A so that the controller is effectively fooled into reading "Gap 3 as data"?

Chuck(G) · Mar 3, 2015

Nope--to extract the key, you'd program a "Read Track/Read Diagnostic" (which doesn't actually read a track, but collects all sectors from index to index without regard to sector header content). So what governs is the CHRN,DTL data supplied to the Read Track function. However, the function does use DTL (sector length code). So, set the sector length code (DTL) to 03 (1024 bytes) and read a track with 512-byte sectors--you get the gap bytes returned, as well as the CRC bytes and probably the address mark information of the next sector.

Now, here's the gotcha that makes these disks very hard to copy with standard PC gear. There's no way to write into the gap on a 765 as there's no "write track" operation. You can, however, do this during the format operation on a WD17xx or 27xx chip, as formatting is done differently.
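As a rough illustration of the oversized read described above, here is a toy Python model of a stretch of track as a byte stream. Everything in it is an assumption for illustration: the field layout follows the usual IBM MFM scheme (sync, address marks, Gap 2, Gap 3), the 80-byte Gap 3 and the hidden `KEY!` bytes are hypothetical, and the CRC bytes are dummies. A real controller reading through a gap may also return bit-shifted bytes, which this sketch ignores.

```python
GAP_BYTE = 0x4E
SYNC = bytes(12)                                # 12 bytes of 0x00
ID_MARK = bytes([0xA1, 0xA1, 0xA1, 0xFE])       # ID address mark
DATA_MARK = bytes([0xA1, 0xA1, 0xA1, 0xFB])     # data address mark
DUMMY_CRC = bytes([0xCC, 0xCC])                 # placeholder, not a real CRC


def build_sector(c, h, r, n, data, gap3):
    """Return (bytes before the data field, bytes from the data field onward)."""
    header = SYNC + ID_MARK + bytes([c, h, r, n]) + DUMMY_CRC
    header += bytes([GAP_BYTE]) * 22 + SYNC + DATA_MARK   # Gap 2, sync, data mark
    return header, data + DUMMY_CRC + gap3


def read_oversized(sectors, length):
    """Model a read that starts at the first data field and runs for `length` bytes."""
    stream = sectors[0][1]
    for header, body in sectors[1:]:
        stream += header + body
    return stream[:length]


# Hypothetical protected disk: a key hidden in Gap 3 after sector 1.
key = b"KEY!"
gap3_with_key = bytes([GAP_BYTE]) * 10 + key + bytes([GAP_BYTE]) * 66   # 80 bytes
track = [
    build_sector(0, 0, 1, 2, bytes([0xE5]) * 512, gap3_with_key),
    build_sector(0, 0, 2, 2, bytes([0xF6]) * 512, bytes([GAP_BYTE]) * 80),
]

returned = read_oversized(track, 1024)   # ask for 1024 bytes from a 512-byte sector
print(key in returned)                   # gap contents come back as "data"
print(ID_MARK in returned)               # so does the next sector's address mark
```

The point of the model is only that a 1024-byte transfer from a 512-byte sector keeps going past the data field, so the CRC, the gap, and the start of the next sector all land in the buffer.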

Trixter · Mar 3, 2015

DDS said:
But since the 8088 could only fetch one half of an instruction word at a time its prefetch cycle was only half as fast as an 8086's. The article referencing this mentioned that only hardware multiply and hardware divide instructions took longer to execute than to fetch, and as a result the 8088 prefetch queue was (almost) always empty.

This is mostly correct. The two main differences between the 8088 and 8086:

- 8086 BIU can fetch a 16-bit word in 4 cycles, whereas the 8088 BIU can fetch an 8-bit byte in 4 cycles. So, word accesses take half the time on an 8086 that they take on an 8088 -- if they're word-aligned. If they're not, they are broken up into two 8-bit accesses, just like 8088.

- 8086 BIU prefetch queue is 6 bytes (3 words), whereas 8088 prefetch queue is 4 bytes (2 words).

On the 8088, most typical instructions are 2-3 bytes and execute in fewer cycles than it takes to fetch them (for example, mov cx,bx is 2 bytes which takes 8 cycles to fetch, but only 2 cycles to execute), so the article you remember is correct; the 8088's prefetch queue is mostly empty. On the 8086, however, there are more opportunities to keep it filled because it gets filled twice as fast and is 2 bytes larger. It's still empty half the time, roughly, but that's better than being empty nearly all of the time.

Because of these limitations, optimizing for the 8088 is an exercise in optimizing for size. Smaller code takes less time to read than larger code, regardless of execution cycle count timings. Optimizing for 8086 is mostly the same, but you can focus on word accesses since they're "free" if they're aligned.
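The fetch-versus-execute comparison above can be sketched in a few lines of Python. This is a deliberately simplified model, not a cycle-exact one: it assumes an empty queue, 4 cycles per bus access, no wait states, no DRAM refresh, and word-aligned code on the 8086. The function names are made up for the illustration.

```python
import math

CYCLES_PER_BUS_ACCESS = 4  # one bus cycle with no wait states (assumption)


def fetch_cycles(length_bytes, bus_width_bytes):
    """Best-case cycles to fetch an instruction of the given length."""
    return math.ceil(length_bytes / bus_width_bytes) * CYCLES_PER_BUS_ACCESS


def is_fetch_bound(length_bytes, eu_cycles, bus_width_bytes):
    """True when fetching the instruction takes longer than executing it."""
    return fetch_cycles(length_bytes, bus_width_bytes) > eu_cycles


# mov cx,bx: 2 bytes, 2 EU cycles
print(fetch_cycles(2, 1), is_fetch_bound(2, 2, 1))   # 8088 (8-bit bus)
print(fetch_cycles(2, 2), is_fetch_bound(2, 2, 2))   # 8086 (16-bit bus, aligned)
```

Under these assumptions mov cx,bx costs 8 fetch cycles on the 8088 and 4 on the 8086, and is fetch-bound on both, which is why code size dominates on either chip.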

Edit: I looked over that DDJ letters compilation you linked to and the author is spot-on. I should have told you to just read that over again

newold86 · Mar 3, 2015

Chuck(G) said:
Nope--to extract the key, you'd program a "Read Track/Read Diagnostic" (which doesn't actually read a track, but collects all sectors from index to index without regard to sector header content). So what governs is the CHRN,DTL data supplied to the Read Track function. However, the function does use DTL (sector length code). So, set the sector length code (DTL) to 03 (1024 bytes) and read a track with 512-byte sectors--you get the gap bytes returned, as well as the CRC bytes and probably the address mark information of the next sector.

Now, here's the gotcha that makes these disks very hard to copy with standard PC gear. There's no way to write into the gap on a 765 as there's no "write track" operation. You can, however, do this during the format operation on a WD17xx or 27xx chip, as formatting is done differently.

OK, it seems I was way too self-confident in thinking it would not be too difficult to emulate accurate FDD behavior.

newold86 · Mar 3, 2015

Another question - is anyone aware of any program that calls BIOS procedures at specific addresses instead of using INTs?

Krille · Mar 3, 2015

Lee Pelletier (DDJ Letters) said:
The few exceptions are multiply and divide instructions, which take longer to execute than to fetch. There are also a few esoteric instructions that take longer to execute than to fetch, but their use is so rare that you can forget about them.

This made me wonder which instructions he was talking about so I made this little list. I've only included instructions where the time to fetch is less than or equal to the time it takes to execute. Also, this list does not include any of the (forms of) instructions where one of the operands is a memory location (my head starts to hurt just by trying to figure those out). The I/O instructions are not included either for the same reason.

Code:

Mnemonic            EU Cycles (8088)      Bytes
AAA                 8                     1
AAD                 60                    2
AAM                 83                    2
AAS                 8                     1
CWD                 5                     1
DAA                 4                     1
DAS                 4                     1
DIV                 80-162                2
IDIV                101-184               2
IMUL                80-154                2
INTO (no jump)      4                     1
LAHF                4                     1
LEA                 2+EA (7+ in total)    2-4
MUL                 70-118?               2
RCL (reg,CL)        8+4n                  2
RCR (reg,CL)        8+4n                  2
ROL (reg,CL)        8+4n                  2
ROR (reg,CL)        8+4n                  2
SAHF                4                     1
SAL/SHL (reg,CL)    8+4n                  2
SAR (reg,CL)        8+4n                  2
SHR (reg,CL)        8+4n                  2
WAIT/FWAIT          4                     1

Shift/Rotate using CL is slow and I'm sure the mem,CL variants should be included in the above list as well.

Please note! I've pulled these numbers from a document that I know is full of errors so take this with a grain of salt.

Trixter · Mar 3, 2015

Krille said:
I've only included instructions where the time to fetch is less than or equal to the time it takes to execute.

Just FYI, you missed XLAT. Also, the string instructions (LODS, MOVS, CMPS, SCAS).

Shift/Rotate using CL is slow

Yes, but there is a break-even point which IIRC is shifting/rotating by 3. Meaning, once you need to iterate 4 or more times, the shift/rotate reg,CL method is faster than doing reg,1 reg,1 reg,1 etc. repeatedly. Unfortunately, both are slow enough that when I need to do nybble work (like swap nybbles, or shift by 4), I try to use a translation table and XLAT.

Chuck(G) · Mar 3, 2015

I never understood the omission of a nibble shift (left or right) from the 8086 instruction set, given the segment/offset addressing scheme. I was even more surprised that there were no "adjust segment after add/subtract" instructions. For example, suppose that I'm at address 1000:FFFE and I'm going through an array of 4-byte elements. So to get to the next element, from my address in (DS:BX) I'd write:

Code:

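    add     bx,4        ; step to the next 4-byte element
    jnc     l1          ; no carry: still inside the current 64K
    mov     ax,ds       ; segment registers can't be added to directly
    add     ah,10h      ; segment += 1000h, i.e. linear address += 64K
    mov     ds,ax
l1: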
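To check the arithmetic of that sequence, here is a small Python model of it. It is only a sketch: `linear` and `advance` are names invented for the illustration, and the 20-bit mask assumes 8086/8088 real-mode address wraparound.

```python
def linear(seg, off):
    """20-bit physical address for a segment:offset pair."""
    return ((seg << 4) + off) & 0xFFFFF


def advance(seg, off, step):
    """Mirror the add / jnc / add ah,10h sequence for a far pointer."""
    new_off = off + step
    if new_off > 0xFFFF:                   # carry out of the 16-bit add
        new_off &= 0xFFFF
        seg = (seg + 0x1000) & 0xFFFF      # add ah,10h adds 1000h to the segment
    return seg, new_off


seg, off = advance(0x1000, 0xFFFE, 4)
print(f"{seg:04X}:{off:04X} -> {linear(seg, off):05X}")   # 2000:0002 -> 20002
```

Starting from 1000:FFFE (linear 1FFFE), stepping by 4 gives 2000:0002, which is linear 20002, exactly 4 bytes further on.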

I remember trying to explain to management why a 68000 was much superior in this respect. I got no traction at all--we went with the 80186. By 1981, Intel should have been in the position that a version of the x86 architecture could be produced without the stupid segments and with 32-bit registers. It would have saved everybody a bunch of trouble. I suspect that the 432 project was sucking up a lot of resources, however.

Trixter · Mar 3, 2015

I've always wished that the segment:offset architecture used 64K paragraphs instead of 16-byte paragraphs. If that were the case, and you could perform math on segment registers, then moving through segmented memory could have been as simple as this:

Code:

add si,value
adc es,0

...to advance to the next segment. I'm still miffed there's no way to perform math on segment registers.

I've always been jealous of the 68000. 8 32-bit general-purpose registers and 8 32-bit address registers?!? You'd have to be a complete moron to not make that CPU perform wonderfully! Although, big-endian seems like it would be annoying (I like how little-endian casts are free).

Chuck(G) · Mar 3, 2015

Not only that, but Intel boxed themselves into a 1MB address space, which was incredibly short-sighted.

Krille · Mar 3, 2015

Trixter said:
Just FYI, you missed XLAT. Also, the string instructions (LODS, MOVS, CMPS, SCAS).

They all access memory so I left them out since I don't know how they interfere with the BIU prefetching code. Though I suppose any instruction that takes 4 or more cycles to execute (excluding the memory accesses) would allow the BIU to fill up the prefetch queue. The problem is knowing how much of the execution time is used for memory transfers. An obvious example that is guaranteed to fill the prefetch queue is IDIV [mem16] which is 2-4 bytes but takes a staggering 175-194 cycles to execute on an 8088.

Yes, but there is a break-even point which IIRC is shifting/rotating by 3. Meaning, once you need to iterate 4 or more times, the shift/rotate reg,CL method is faster than doing reg,1 reg,1 reg,1 etc. repeatedly.

Let's see, if we include the overhead for setup, assuming an empty prefetch queue (mov cl,4 ; 2 bytes, 8 cycles) and add the cycles for fetching the shift/rotate; another 2 bytes/8 cycles, then add 8+4*4 cycles for the execution we get 40 cycles in total.

With the shift/rotate reg,1 instructions we get 4 instructions x 2 bytes each x 4 cycles per byte, resulting in 32 cycles in total.

If we shift/rotate 5 bit positions then the results are 44/40. With 6 bit positions the results are 48/48. This is all assuming the prefetch queue is empty to begin with; the results change considerably if it's not. We also need to consider that the prefetch queue will be full after the shift/rotate reg,CL but completely empty after the shift/rotate reg,1 instructions. Not to mention the latter is more likely to be interrupted by a DRAM refresh. Bottom line is, if you say the real-world sweet spot is 4 then I'll take your word for it.
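The arithmetic in the last few paragraphs can be reproduced with a short Python sketch. It encodes only the assumptions stated there (empty prefetch queue, 4 cycles per fetched byte, 8+4n execution cycles for the reg,CL form, and the reg,1 form treated as purely fetch-bound); the function names are invented for the illustration.

```python
FETCH_CYCLES_PER_BYTE = 4


def shift_by_cl(n):
    """mov cl,n followed by one shift/rotate reg,cl."""
    setup = 2 * FETCH_CYCLES_PER_BYTE    # fetch the 2-byte mov cl,n
    fetch = 2 * FETCH_CYCLES_PER_BYTE    # fetch the 2-byte shift/rotate reg,cl
    execute = 8 + 4 * n                  # EU cycles for the reg,cl form
    return setup + fetch + execute


def shift_by_ones(n):
    """n copies of the 2-byte shift/rotate reg,1, limited by fetch time."""
    return n * 2 * FETCH_CYCLES_PER_BYTE


for n in range(4, 8):
    print(n, shift_by_cl(n), shift_by_ones(n))
# 4 40 32
# 5 44 40
# 6 48 48
# 7 52 56
```

Under this model the two methods tie at 6 bit positions and the reg,CL form only wins from 7 upward; a queue that is not empty at the start would shift that point.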

Oh btw, I remember reading somewhere (I think it might have been one of your posts here actually) that fetching code from ROM only takes 3 cycles per byte? If true then that would also change the math to give more weight to the execution time vs fetching time. Now my head is really starting to hurt!

DDS · Mar 4, 2015

Trixter said:
I've always wished that the segment:offset architecture used 64K paragraphs instead of 16-byte paragraphs. If that were the case, and you could perform math on segment registers, then moving through segmented memory could have been as simple as this:

Code:

add si,value
adc es,0

...to advance to the next segment. I'm still miffed there's no way to perform math on segment registers.

I've always been jealous of the 68000. 8 32-bit general-purpose registers and 8 32-bit address registers?!? You'd have to be a complete moron to not make that CPU perform wonderfully! Although, big-endian seems like it would be annoying (I like how little-endian casts are free).

Not to mention that you could very easily create endless seemingly unique segment:offset pairs that actually addressed the same memory location.
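To put a number on that aliasing, here is a minimal Python sketch that enumerates every 16-bit segment and offset combination mapping to one physical address. It assumes 8086/8088 real-mode addressing (segment times 16 plus offset, wrapped to 20 bits); the helper names are hypothetical.

```python
def linear(seg, off):
    """20-bit physical address for a segment:offset pair."""
    return ((seg << 4) + off) & 0xFFFFF


def aliases(address):
    """All (segment, offset) pairs that resolve to the given 20-bit address."""
    pairs = []
    # The offset must match the address in its low 4 bits, so step by 16.
    for off in range(address & 0xF, 0x10000, 16):
        seg = ((address - off) >> 4) & 0xFFFF
        pairs.append((seg, off))
    return pairs


pairs = aliases(0x20002)
print(len(pairs))                                    # 4096
print(all(linear(s, o) == 0x20002 for s, o in pairs))
```

So each physical address has 4096 different segment:offset spellings, for example 2000:0002 and 1001:FFF2 for address 20002.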

Many years ago I was taking a comparative computer architecture class and had a discussion on just this topic with the professor. Why in the world would IBM box themselves into a similar segment:offset mess with the new PC that they had endured with the S360? He responded that IBM was never known for making the most technically superior computers, rather they made the most commercially successful ones. They were also looking at the same "upgrade path" that Intel was looking at when they designed the x86 series to be a superset of the 8080. One could easily create a translator that would take 8080 assembler files and create 8088/8086 versions with little need for rewrite, something that would have been a lot more difficult for the MC68K folks who on the other hand had the advantage of a clean rewrite (although some say the 68k looks a lot like a "VAX on a chip").

But we're treading on dangerous ground here. The religious fervor displayed in the flame wars between "The 6'ers" and "The 8'ers" can be intense. ;-) And whatever you do, don't say anything negative about EMACS!

Chuck(G) · Mar 4, 2015

On your FPGA emulator, what happens to AX if you execute "8D C2"? (sort of LEA AX,DX, which DEBUG and most assemblers will choke on).

newold86 · Mar 4, 2015

Chuck(G) said:
On your FPGA emulator, what happens to AX if you execute "8D C2"? (sort of LEA AX,DX, which DEBUG and most assemblers will choke on).

Is this question for me?

In my project I don't emulate the CPU, I just use a real one - the main reason for going with a real 8088 was to avoid all those things that are being discussed above and get real compatibility with the XT...

Chuck(G) · Mar 4, 2015

Ah, okay, I understand. For some reason, I got the impression that you were using an FPGA to emulate the CPU also.

So, never mind. But the 8D C2 sequence is very interesting and I suspect that few, if any, 8088 emulators actually handle it the same way that a real 8088 does.

Trixter · Mar 4, 2015

Chuck(G) said:
But the 8D C2 sequence is very interesting and I suspect that few, if any, 8088 emulators actually handle it the same way that a real 8088 does.

I've never heard of this before! Holy mackerel, it crashes DOSBox! LEA performs the effective address calculation of the second argument and puts it into the first argument, but "DX" is not a valid EA expression, so what is this supposed to do? What does this do on a real 8088?

Trixter · Mar 4, 2015

Krille said:
Oh btw, I remember reading somewhere (I think it might have been one of your posts here actually) that fetching code from ROM only takes 3 cycles per byte?

About that: I got that info from Richard Wilton's book about programming video hardware. He had the observation that the BIOS was able to avoid CGA "snow" using MOV AX,BX; STOSW whereas when you run that code out of system RAM, you get snow. His words:

The IBM ROM BIOS routines that write to the video buffer during
horizontal retrace use the sequence

mov ax,bx
stosw

to move a character and attribute into the buffer without snow.
Nevertheless, if you use the same two instructions in a RAM-based
program, you see snow on a CGA running on a 4.77 MHz PC. The reason is
that, at the point where these instructions are executed, the 4-byte
instruction prefetch queue in the 8088 has room for only two more
bytes. This means that the STOSW opcode cannot be prefetched. Instead,
the 8088 must fetch the opcode from memory before it can be executed.

That last memory access to fetch the STOSW instruction makes the
difference. Because accesses to ROM are faster than accesses to RAM,
the instruction fetch is slightly faster out of ROM, so no snow is
visible because the STOSW can run before the horizontal blanking
interval ends. The routine in Listing 3-10 sidesteps the problem by
using XCHG AX,BX (a 1-byte opcode) instead of MOV AX,BX (a 2-byte
opcode). This avoids the extra instruction fetch, so the code executes
fast enough to prevent display interference.

(emphasis mine) However, a few months ago I tried timing a REP LODSW from a ROM location and it was no faster than a RAM location. So this is slightly confusing, because you can definitely observe the behavior that the MOV AX,BX; STOSW executes faster out of ROM than RAM -- you can see the speed difference onscreen. So my only conclusion is that, while general data accesses from ROM may not be faster, code executes slightly faster out of ROM.

I entertain proofs or challenges to the above.
