The prefetch queue is treated as a buffer of 6 bytes. It must be, because otherwise it would have to redo the instruction fetch for all instructions not aligned on a WORD boundary. That would be a serious design flaw considering that so many instructions have an odd length. So yes, the benefit of that extra byte fetched is what makes the difference in performance.
This makes sense.
If I understand reenigne's description correctly, there's an extra string operation for every interrupt? So if there are several interrupts, the count will be off by that same amount? It makes me wonder, though: everywhere I've read about this interrupted-string-operation bug on 808x CPUs, they only mention that it falls through with the count in CX being non-zero (hence the workaround in the eSEG_STR macro). Was that a misunderstanding of the details of this bug, or is this yet another, previously unknown, bug?
I think all of the other explanations are simply wrong, propagated down through the years.
Since there's no easy way to determine how many interrupts occurred (i.e., by how much CX is off), this technique simply can't be trusted on 808x systems.
BTW, I'm curious how you discovered it. What did the code look like and what were you doing?
Decompression code. My test harness does a REP CMPSW to ensure the decompressed data matches the source. It was also easy to visualize: decompress a 16K raw CGA image to b800:0000, and you can literally see where each interrupt occurred, as the image shifts vertically from that point on, once the extra iteration has been copied to the screen.
Fun fact: At speeds as slow as 4.77 MHz, I could deliberately introduce more errors by mashing the keyboard keys in the 0.5 seconds it took for the decompression routine to finish.
Also, this made me wonder why we haven't had more reports of buggy behaviour in XUB. That is, until I counted the number of times the eSEG_STR macro is used in the list files of the official builds. It turns out that the affected code is used only once, and only in the large XT build.
Anyway, so the fix is to disable interrupts during the string operation? It won't help with NMIs but that shouldn't be a concern in this case.
In my case, I was going for the fastest possible speed and had a requirement of not disabling interrupts (a music player was in use, and disabling them would have introduced audible jitter in the music output whenever something was decompressed), so I had to rewrite my code to not use REP ES: MOVSB. I set DS:SI and ES:DI to the proper source and destination, and left interrupts enabled. To minimize the setup/teardown cost, I used registers for temp variables, i.e. MOV AX,DS; MOV ES,AX instead of PUSH DS; POP ES.
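A rough sketch of what that kind of setup could look like, in case it helps anyone reading later. This is not the actual decompressor code. It assumes the source bytes live in the segment currently held in ES (the case that would otherwise need the ES: override), that SI, DI and CX are already loaded, and that AX and BX are free to clobber. Syntax is NASM-style.

```asm
; Hypothetical sketch: copy CX bytes from ES:SI to ES:DI without a
; segment override, so REP is the only prefix on the string instruction.
; Assumes AX and BX are free and SI, DI, CX are already set up.
        mov     ax, ds          ; park the caller's DS in a register
        mov     bx, es
        mov     ds, bx          ; DS = ES, so the default DS:SI source is right
        rep movsb               ; single prefix, so it resumes correctly after an interrupt
        mov     ds, ax          ; restore the caller's DS
```

The register-to-register moves stand in for PUSH DS / POP DS, which is the point of the "registers for temp variables" remark; whether spare registers are actually available depends on the surrounding routine.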
Since you're only using it to put strings on screen, IIRC, it probably doesn't hurt to just disable interrupts. If the routine is used both inside and outside interrupt handlers, or you're not sure whether it will be, don't do CLI ... STI; do PUSHF; CLI ... POPF instead, so that the previous interrupt state is restored rather than forced on, and it works in both scenarios.
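Something along these lines, as a minimal sketch. The string instruction is written the way it has been written in this thread, so adjust the prefix syntax to whatever your assembler (or the eSEG_STR macro) expects. SI, DI and CX are assumed to be set up already.

```asm
; Sketch only: protect a REP + segment-override string operation
; from maskable interrupts without assuming what IF was on entry.
        pushf                   ; save FLAGS, including the current IF
        cli                     ; mask maskable interrupts (an NMI can still arrive)
        rep es movsb            ; two prefixes, but no IRQ can interrupt it now
        popf                    ; put IF back the way the caller had it
```

If the caller already had interrupts disabled (for example inside an interrupt handler), POPF leaves them disabled, whereas a trailing STI would have turned them back on too early.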