
Observing a CPU bug in action

Trixter

I stumbled onto the 8088/8086 "only the last prefix is honored after an interrupt" bug while doing something visual and could reproduce it, so I thought I'd make a short video about it:

 
Since ES: is assumed for the destination, and DS: for the source, a couple of pushes and pops on the segregs before and after the movsb should handle things quite nicely. The rep/movs combination was a minefield on the early steppings of the 186 also. If executed while a DMA transfer was in progress, SI and DI could get clobbered.
 
There were two fixes we used. The size-optimized fix was:

Code:
@@again:
        seges   movsb
        loop    @@again

The speed-optimized fix gives the user one of two ways to work around the problem: Either disable interrupts around the REP MOVS, or spend time rearranging register contents so it can be a normal DS:SI -> ES:DI REP MOVS copy leaving interrupts enabled. User gets to pick via a define in the assembler source.
 
It mostly depends on how much data you're moving as to which triumphs. I suspect a major code overhaul would eliminate the need for any fixup code at all. :)
 
When I timed the code with varying input, this:

Code:
        cli
        es: rep movsb
        sti

...was faster than this:

Code:
        mov     bp,ds
        mov     bx,es
        mov     ds,bx
        rep     movsb
        mov     ds,bp

...which was faster than this:

Code:
        push    ds
        push    es
        pop     ds
        rep     movsb
        pop     ds

The first two are given as a configuration option in the code for the user to choose which one they want to use. The one that disables interrupts never takes a CX higher than 127 due to what the code does, so the maximum number of cycles interrupts could be disabled at any one time is (127*4)+(4*4)=524 cycles. The user is warned about both tradeoffs in the comments around the compile directives they can alter.
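As a rough sanity check of that worst case, here is a minimal Python sketch of the arithmetic. The per-byte cost of 4 and the fixed 4*4 term are taken straight from the post as given; the 4.772727 MHz clock (14.31818 MHz / 3, the stock IBM PC rate) and the function names are assumptions added for illustration.

```python
# Worst-case time with interrupts disabled, using the figures quoted in the post.
# ASSUMPTION: clock is the stock IBM PC 14.31818 MHz / 3; names are hypothetical.

PC_CLOCK_HZ = 14_318_180 / 3  # about 4.77 MHz


def worst_case_cycles(max_cx=127, per_byte=4, fixed=4 * 4):
    """Cycles spent between CLI and STI for the largest copy the code issues."""
    return max_cx * per_byte + fixed


def cycles_to_microseconds(cycles, clock_hz=PC_CLOCK_HZ):
    """Convert a cycle count to microseconds at the given clock rate."""
    return cycles / clock_hz * 1_000_000


cycles = worst_case_cycles()
print(cycles, "cycles, roughly", round(cycles_to_microseconds(cycles), 1), "microseconds")
```

Under those assumptions the interrupts-off window is on the order of a tenth of a millisecond.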
 
Normally I'd say no, since the 8088 is the target. But if the 8086 were the target, then yes, branching to a rep movsw section would help for longer copies (like, 16 bytes or more). However, the code in question is decompression code, and the compression method rarely results in match lengths over 10 bytes for typical inputs, so a check-and-branch to handle it better would take more time than it saves.

An alternative to a branch would be something that handles everything, like this:

Code:
shr cx,1
rep movsw
adc cx,cx
rep movsb

However, the code in question makes extensive use of the carry bit, which means I'd have to preserve carry before the above sequence, and restore it afterwards, and that also takes more time than it saves.
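To make the carry problem concrete, here is a small Python model of that four-instruction sequence. It is only a sketch: the function name and the `carry_in` parameter are hypothetical, and the model tracks just CX, the carry flag, and the bytes copied.

```python
# Model of: shr cx,1 / rep movsw / adc cx,cx / rep movsb
# ASSUMPTION: illustrative model only; names are made up for this sketch.

def copy_words_then_byte(src, cx, carry_in):
    """Return (bytes copied, carry flag afterwards) for a copy of cx bytes."""
    dst = bytearray()
    si = 0
    carry = carry_in      # whatever the surrounding code had in CF

    carry = cx & 1        # shr cx,1: bit 0 of CX is shifted out into CF
    cx >>= 1              # CX now holds the word count

    while cx:             # rep movsw: MOVS does not touch the flags
        dst += src[si:si + 2]
        si += 2
        cx -= 1

    cx = cx + cx + carry  # adc cx,cx with CX == 0 leaves CX = old CF (0 or 1)
    carry = 0             # 0 + 0 + CF can never carry out, so CF is now clear

    while cx:             # rep movsb: copies the odd trailing byte, if any
        dst += src[si:si + 1]
        si += 1
        cx -= 1

    return bytes(dst), carry


print(copy_words_then_byte(b"abcdefg", 7, carry_in=1))
```

Whatever `carry_in` was, the carry flag comes out cleared, which is why the caller's carry would have to be saved and restored around the sequence.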

Speed optimization, like compression algorithms, is a minefield of trade-offs.
 
Does (movsb) rep movsw gain anything in this case?

rep movsb and rep movsw certainly perform differently, even on 8088.
I found that my CGA clone was slightly too slow to run Codeblasters' CGA demo properly on a 4.77 MHz system.
You could see that the bottom scanline of the scroller was not updated at the time the CRT hit that part of the screen.
The demo was apparently designed to *just* finish updating the scroller (they probably hand-tuned the size of the rasterbars at the top to be as big as possible). However, this CGA clone apparently inserted a few waitstates more than a real CGA card does.
So I disassembled and studied the code, and found that it did rep movsb for the scroller.
By rewriting it to rep movsw movsb (it was an odd number of bytes to be copied, namely 79 bytes per scanline), I saved just enough cycles to make it run perfectly on the clone CGA card.
See blog and code here: https://scalibq.wordpress.com/2014/11/22/cgademo-by-codeblasters/
 
rep movsw is the reason that the Lo-tech storage adapters are able to perform better than other types. But not all early hardware correctly implements the byte transfer order (AT&T PC6300 I think was one).
 
I think this bug only exists on the 8088/8086, not on anything newer.

The size-optimized fix was:

Code:
@@again:
        seges   movsb
        loop    @@again

This will lead to an off-by-one error for every interrupt(ion). All the workarounds I've seen return to the string instruction with CX unchanged.
This might be a better way;
Code:
@@again:
        seges   movsb
        inc     cx
        loop    @@again

Also;

When I timed the code with varying input, this:

Code:
        cli
        es: rep movsb
        sti

This will not work because the prefixes are in the wrong order. The CPU only "remembers" the last prefix so the above code will use DS as the source segment when returning from an interrupt.

EDIT: Scratch this last one, I'm stupid.
 
This will lead to an off-by-one error for every interrupt(ion). All the workarounds I've seen return to the string instruction with CX unchanged.

Are you sure?
Note that this code does not use the 'rep' prefix at all. It uses loop *instead* of rep, therefore the issue does not exist.
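The difference can be sketched with a tiny Python model. This is an illustration under stated assumptions, not a cycle-accurate emulation: it assumes the 8088/8086 behaviour discussed in this thread (an interrupted repeated string instruction resumes at its last prefix byte only), and the function names and the `irq_after` parameter are made up for the sketch.

```python
# ASSUMPTION: models only the byte count and CX, with interrupts given as the
# set of iteration numbers after which one arrives.

def rep_es_movsb(cx, irq_after=frozenset()):
    """Model `rep es: movsb` (REP first, ES: last) on a CPU with the bug."""
    copied = 0
    while cx:
        copied += 1       # one MOVSB step
        cx -= 1           # REP decrements CX
        if cx and copied in irq_after:
            # Execution resumes at the ES: prefix, so REP is lost: one plain
            # `es: movsb` runs, CX is left alone, and the copy ends early.
            copied += 1
            return copied, cx
    return copied, cx


def es_movsb_loop(cx):
    """Model `es: movsb` followed by `loop`: no REP prefix to lose."""
    copied = 0
    while True:
        copied += 1              # es: movsb, a complete single-prefix instruction
        cx = (cx - 1) & 0xFFFF   # LOOP decrements CX, then branches while CX != 0
        if cx == 0:
            return copied, cx


print(rep_es_movsb(10, irq_after={3}))  # copy cut short, CX left non-zero
print(es_movsb_loop(10))                # all 10 bytes copied
```

In the loop version an interrupt can only land between complete instructions, so there is nothing for the CPU to forget.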
 
rep movsw is the reason that the Lo-tech storage adapters are able to perform better than other types. But not all early hardware correctly implements the byte transfer order (AT&T PC6300 I think was one).

On the 6300, the problem was the 16-bit-to-8-bit bus conversion, which was implemented in external hardware. It wasn't so much the MOVSW (which works okay), but the IN AX,DX instruction. The Olivetti engineers didn't quite get it right.
 
Are you sure?
Note that this code does not use the 'rep' prefix at all. It uses loop *instead* of rep, therefore the issue does not exist.

Aargh :headslap: You're right of course. I need to work on my speed reading. :)
 
I'd call it a "documented bug" :) especially since they "fixed" it on later processors.

Yea, I guess it wasn't a 'bug' until they introduced a CPU that had different behaviour.
In fact, you could even argue that the new behaviour is a bug, since it is not backward-compatible with 8088/8086 :)
But they probably documented that just as nicely. I guess I'd have to check the 80186, 286 and possibly 386 manuals as well, to see where they started reporting different behaviour.

Edit: I'm not entirely sure, but I think I found it in the 286 manual: http://bitsavers.informatik.uni-stu...d_80287_Programmers_Reference_Manual_1987.pdf
At the part discussing the rep-instruction, they mention overriding ds:si to es:si, but make no mention of any special case.
Go further to the chapter about interrupts, and on page 5-5 it says:
"(the saved value of CS:IP will include all leading prefixes)"
But this part of the manual seems to deal with exceptions, not specifically rep movsw etc.
In appendix C-1 they list changes from 8088/8086, and they do point out that:
"Any interrupt on the 80286 will always leave the saved CS:IP value pointing at the beginning of the instruction that failed (including prefixes). On the 8086, the CS:IP value saved for a divide exception points at the next instruction."
But that is the only difference in interrupt handling that they specify.
So the manual isn't entirely clear about this, but it sounds like the 286 behaves differently than 8088/8086 in this case.

Did you test and verify whether the 286 has the bug or not?
 
Did you test and verify whether the 286 has the bug or not?

I did not. I don't have easy access to a 286 right now, but it would be pretty easy for someone to test: set DS:SI = ES:DI, then perform a REP ES MOVS with CX=FFFF while IRQ 0 is set to fire at a high rate, and test whether or not CX=0 when done. If the bug exists, the REP will be dropped and CX won't have counted all the way down to 0.
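The logic of that test can be modelled in a few lines of Python. This is a sketch of the reasoning only, not something that runs on the hardware: `final_cx`, `irq_every`, and the `last_prefix_only` flag are hypothetical names, and the model assumes the buggy CPU keeps just the last prefix when it resumes.

```python
# ASSUMPTION: simplified model; an interrupt arrives every `irq_every` iterations.

def final_cx(prefixes, last_prefix_only, irq_every=1000, cx=0xFFFF):
    """Return CX after a prefixed MOVS that is interrupted periodically.

    prefixes: prefix bytes in program order, e.g. ("rep", "es").
    last_prefix_only: True models a CPU with the bug.
    """
    active = list(prefixes)
    done = 0
    while cx:
        done += 1
        cx -= 1
        if cx and done % irq_every == 0:   # timer interrupt mid-copy
            if last_prefix_only:
                active = active[-1:]       # only the last prefix survives
            if "rep" not in active:
                # One un-repeated MOVS runs and execution falls through
                # with CX still non-zero.
                return cx
    return cx


print(final_cx(("rep", "es"), last_prefix_only=True))   # non-zero: bug detected
print(final_cx(("rep", "es"), last_prefix_only=False))  # 0: no bug
print(final_cx(("es", "rep"), last_prefix_only=True))   # 0: bug present but invisible to a CX check
```

The last case suggests why the prefix order matters for the test: with REP as the last prefix, the count still runs down to zero and only the segment override is lost, so checking CX would not reveal the bug.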
 
Ah yes, I have a 286 as well of course (a late model Harris 286-20). I could do a little test myself if I get round to it.
 
Ah yes, I have a 286 as well of course (a late model Harris 286-20). I could do a little test myself if I get round to it.
There are still no results. :(

Interestingly, http://tcm.computerhistory.org/ComputerTimeline/Chap37_intel_CS2.pdf (page 631) has the following text about the 8086:
"During the execution of a repeated primitive operation the operand pointer registers (SI and DI) and the operation count register (CX) are updated after each repetition, whereas the instruction pointer will retain the offset address of the repeat prefix byte (assuming it immediately precedes the string operation instruction). Thus, an interrupted repeated operation will be correctly resumed when control returns from the interrupting task."
So the described behavior doesn't guarantee correct execution of a sequence of two prefixes either.
 