
XTIDE Universal BIOS

Eudimorphodon

Veteran Member
Joined
May 9, 2011
Messages
5,084
Location
Upper Triassic
If you're not (which would be kind of silly) then yes, you would get higher performance with the newer version with XT-CF cards also.

Already on the Plus train. (It's been a long time since I did the bake-off but my recollection is that it totally delivered on being at least 50%-ish faster.) So, yeah, just wondering if the loop unrolling would do anything with the 8-bit hardware. I'm pretty sure my current flash is configured to use one of the "BIU Offload" variants of the CF driver, if that would have any bearing.
 

Krille

Veteran Member
Joined
Aug 14, 2010
Messages
1,008
Location
Sweden
I'm pretty sure my current flash is configured to use one of the "BIU Offload" variants of the CF driver, if that would have any bearing.
Yeah, it's as fast as it gets then. Still, you might want to upgrade anyway for other reasons, depending on what revision you're using now and whether any bugs have been fixed since then. But the performance won't improve.
 

Cloudschatze

Veteran Member
Joined
Apr 17, 2007
Messages
633
Location
Western United States
Slower transfers but higher IOPS for random reads. That was unexpected. Did you run the benchmark several times with consistent results? BTW, what kind of controller is this?
The results are mostly consistent over several iterations, but with some slight variation. Not enough to change things significantly though; the reported read/write performance with r623 is consistently ~10KB/s less than with r604. I suspect a comparison between r622 and r623 might prove the latter slightly better instead, similar to Malc's results, and I can run those tests later just to confirm. In other words, I don't think r623 makes things slower; that seems to have happened somewhere between r604 and r614.

I'm using one of the "XT-IDE Deluxe" cards, based on the r2 design.
 

Trixter

Veteran Member
Joined
Aug 31, 2006
Messages
7,262
Location
Chicagoland, Illinois, USA
Why would it only help with even-sized instructions?

If you have an odd-sized instruction, it doesn't matter if you jump to it aligned or not, since reading the last one or two bytes still costs a full fetch cycle either way. For example, if I have a 3-byte instruction:

Aligned: first fetch grabs bytes 0 and 1, second fetch grabs 2
Unaligned: first fetch grabs byte 0, second fetch grabs bytes 1 and 2

It's two fetches no matter the alignment. So that's why I say alignment only helps with even-sized instructions.
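
To make that concrete, here's a contrived example (labels and instructions are mine, purely for illustration):

Code:
        ALIGN   2               ; pad so the next label lands on an even address
CopyWords:                      ; made-up jump target
        lodsw                   ; 1-byte opcode
        stosw                   ; 1-byte opcode - aligned, this pair is one word fetch
        loop    CopyWords       ; 2-byte instruction, also a single fetch when aligned

Jump to CopyWords at an even address and those four bytes arrive in two word fetches; start it at an odd address and the same bytes take three.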

If the 8086 prefetch queue can make use of the "extra" fetched byte in the Aligned scenario, then that would be beneficial... but I was under the impression the 8086 prefetch queue is 3 words, not 6 bytes, and is handled as such.

jcxz %%End ; Jump to end if no repeats left (preserves FLAGS)

Actually, you can't rely on this. On 8088 (I have not verified on other CPUs), I witnessed behavior that proved CX was not consistently decremented when an interrupt occurs. I brought this to @reenigne's attention and he checked the microcode and confirmed it. Since what he wrote me is a good explanation, I don't think he'll mind me reproducing it here:

The way that MOVSB works internally is that it first (if the REP flag is set) checks for CX==0 and decrements, then does a move, then (if the REP flag is set) checks for interrupts and repeats.

Let's consider the case going into this code with CX==1. The REP flag is set so we check for CX==0 (it's not) and then decrement. Then we move a byte, incrementing SI and DI. Now let's suppose we've got an interrupt. The RPTI microcode subroutine is executed, which moves the instruction pointer back by 2 bytes, assuming that it will now point at the "REP" (but it doesn't; it now points at the "ES:").

After the interrupt returns, execution is resumed with "ES: MOVSB" even though CX is already 0 at this point! So a second byte is copied. Basically there will always be an extra byte copied if an interrupt occurs on the last byte.

Unfortunately I don't see an easy way to tell if an interrupt happened on that last byte - CX is 0 either way.

If there's just one instance of this code, and you know about every interrupt that could occur (as could be done in a demo situation) then the interrupt vectors could check to see if the return address is at the "ES:" and decrement it if it is. Then the JCXZ/DEC/JMP sequence could be removed and the routine would be faster in the common case when no interrupt occurs. I'm not sure if this would be more trouble than it's worth - I'll let you be the judge of that!

To be as general as possible you could just override all the PC/XT hardware interrupts (INT 8 - INT 0xF) to do this adjustment, and even all the AT hardware interrupts too (INT 0x70 - INT 0x77). I think at some point CPUs changed to work correctly if a multiple-prefix instruction was interrupted but in that case the return-address-adjustment code path would just never get hit.
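
To illustrate what that return-address adjustment might look like, here's a rough sketch (my own, with entirely made-up labels, and with the saved-CS check left out for brevity):

Code:
OldInt8         dd      0               ; original INT 8 vector, saved during setup

Int8Adjust:                             ; made-up replacement INT 8 handler
        push    bp
        mov     bp, sp                  ; [bp+2] = interrupted IP, [bp+4] = CS
        cmp     word [bp+2], SegPrefix  ; did we interrupt right on the "ES:" byte?
        jne     .Chain                  ; (a real handler would compare the CS too)
        dec     word [bp+2]             ; back IP up so it points at the "REP" again
.Chain:
        pop     bp
        jmp     far [cs:OldInt8]        ; chain to the original handler

CopyLoop:                               ; the copy sequence being protected
        rep                             ; the prefix that can be lost
SegPrefix:
        es                              ; the prefix that won't be lost
        movsb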
 

Krille

Veteran Member
Joined
Aug 14, 2010
Messages
1,008
Location
Sweden
I suspect a comparison between r622 and r623 might prove the latter slightly better instead, similar to Malc's results, and I can run those tests later just to confirm. In other words, I don't think r623 makes things slower; that seems to have happened somewhere between r604 and r614.
Yeah, my immediate thought when reading your post was that something changed between r604 and r623 that made transfers a lot slower. However, when looking through the changes in each of those revisions I can't find anything that should affect the performance so drastically. It would be great if you could pinpoint the exact revision where performance dropped.
 

Krille

Veteran Member
Joined
Aug 14, 2010
Messages
1,008
Location
Sweden
If the 8086 prefetch queue can make use of the "extra" fetched byte in the Aligned scenario, then that would be beneficial... but I was under the impression the 8086 prefetch queue is 3 words, not 6 bytes, and is handled as such.

The prefetch queue is treated as a buffer of 6 bytes. It must be, because otherwise it would have to redo the instruction fetch for all instructions not aligned on a WORD boundary. That would be a serious design flaw considering that so many instructions have an odd length. So yes, the benefit of that extra byte fetched is what makes the difference in performance.

Actually, you can't rely on this. On 8088 (I have not verified on other CPUs), I witnessed behavior that proved CX was not consistently decremented when an interrupt occurs. I brought this to @reenigne's attention and he checked the microcode and confirmed it. Since what he wrote me is a good explanation, I don't think he'll mind me reproducing it here:

If I understand reenigne's description correctly, there's an extra string operation for every interrupt? So if there are several interrupts the count will be off by that same amount? It makes me wonder though, everywhere I've read about this string-operation-interrupted bug on 808x CPUs they only mention that it falls through with the count in CX being non-zero (hence the workaround in the eSEG_STR macro). Was it because of a misunderstanding of the details of this bug or is this yet another, previously unknown, bug?

BTW, I'm curious how you discovered it. What did the code look like and what were you doing?

Also, this made me wonder why we haven't had more reports of buggy behaviour in XUB. That is, until I counted the number of times the eSEG_STR macro is used in the list files of the official builds. It turns out that the affected code is used only once and only in the large XT build.

Anyway, so the fix is to disable interrupts during the string operation? It won't help with NMIs but that shouldn't be a concern in this case.
 

Trixter

Veteran Member
Joined
Aug 31, 2006
Messages
7,262
Location
Chicagoland, Illinois, USA
The prefetch queue is treated as a buffer of 6 bytes. It must be, because otherwise it would have to redo the instruction fetch for all instructions not aligned on a WORD boundary. That would be a serious design flaw considering that so many instructions have an odd length. So yes, the benefit of that extra byte fetched is what makes the difference in performance.

This makes sense.

If I understand reenigne's description correctly, there's an extra string operation for every interrupt? So if there are several interrupts the count will be off by that same amount? It makes me wonder though, everywhere I've read about this string-operation-interrupted bug on 808x CPUs they only mention that it falls through with the count in CX being non-zero (hence the workaround in the eSEG_STR macro). Was it because of a misunderstanding of the details of this bug or is this yet another, previously unknown, bug?

I think all of the other explanations are simply wrong, propagated down through the years.

Since there's no easy way to determine how many interrupts occurred (ie. how much CX is wrong), this technique simply can't be trusted on 808x systems.

BTW, I'm curious how you discovered it. What did the code look like and what were you doing?

Decompression code. My test harness does a REP CMPSW to ensure the decompressed data matches the source. It was also easy to visualize: decompress a 16K CGA raw image to b800:0000, and you can literally see where the interrupts occurred, as the image shifts vertically wherever the extra iteration was copied to the screen.
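
The check itself boils down to something like this (simplified; DATA_LEN and the label are placeholders, not my real harness code):

Code:
        ; DS:SI -> original data, ES:DI -> decompressed output
        mov     cx, DATA_LEN / 2        ; DATA_LEN = data size in bytes (assumed even)
        cld
        repe    cmpsw                   ; compare word by word while they match
        jne     VerifyFailed            ; any mismatch leaves ZF clear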

Fun fact: At speeds as slow as 4.77 MHz, I could deliberately introduce more errors by mashing the keyboard keys in the 0.5 seconds it took for the decompression routine to finish.

Also, this made me wonder why we haven't had more reports of buggy behaviour in XUB. That is, until I counted the number of times the eSEG_STR macro is used in the list files of the official builds. It turns out that the affected code is used only once and only in the large XT build.

Anyway, so the fix is to disable interrupts during the string operation? It won't help with NMIs but that shouldn't be a concern in this case.

In my case, I was going for the fastest possible speed, and had a requirement of not disabling interrupts (a music player was in use; disabling them would have introduced audible jitter in the music output whenever something was decompressed), so I had to rewrite my code to not use REP ES: MOVSB. I set DS:SI and ES:DI to the proper source and destination and left interrupts enabled. To minimize the setup/teardown cost, I used registers for temp variables, i.e. MOV AX,DS; MOV ES,AX instead of PUSH DS; POP ES.
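
Something along these lines (simplified, with placeholder offsets; not the actual routine):

Code:
        mov     ax, ds
        mov     es, ax                  ; register move instead of PUSH DS / POP ES
        mov     si, src_off             ; placeholder offsets; with the segments set
        mov     di, dst_off             ; up front, no override prefix is needed
        rep     movsb                   ; single prefix, so it restarts correctly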

Since you're only using it to put strings to screen IIRC, probably doesn't hurt to just disable interrupts. If the routine is used inside and outside interrupts, or you're not sure if it will be, don't do CLI; STI but instead do PUSHF; CLI; POPF so that it will work in both scenarios.
 

Krille

Veteran Member
Joined
Aug 14, 2010
Messages
1,008
Location
Sweden
Sorry for disappearing on you guys. I've been sick and also busy with other stuff.

It looks like r605, based on the following testing/results:

Google Docs - "ide_xtp.bin, V30@8MHz"

Thanks for doing all that testing and creating a spreadsheet. I bet that was a lot of work.

Anyway, I can't explain the loss of performance in r605. The only change that should affect performance is WORD_ALIGN being set to 2 and that is supposed to increase performance on 8086/V30 CPUs by WORD aligning stuff like function tables. So I'm stumped. You might want to make a custom build with WORD_ALIGN set to 1 as it used to be just to confirm.
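
For clarity, what that alignment amounts to is roughly this (illustrative only, not the literal XUB source):

Code:
%define WORD_ALIGN 2                    ; 1 = no padding, 2 = pad to even addresses
        ALIGN   WORD_ALIGN
FunctionTable:                          ; made-up table name
        dw      Function0, Function1    ; WORD entries start on an even address, so
                                        ; each lookup is a single fetch on 8086/V30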

There's one thing I find a bit peculiar though. The write speed is consistently slower on the first two iterations for every revision with the first iteration being the slowest. I wonder why that is?

Since you're only using it to put strings to screen IIRC, probably doesn't hurt to just disable interrupts. If the routine is used inside and outside interrupts, or you're not sure if it will be, don't do CLI; STI but instead do PUSHF; CLI; POPF so that it will work in both scenarios.
This is what I've come up with:
Code:
;--------------------------------------------------------------------
; Repeats string instruction with segment override.
; This macro prevents the 8088/8086 restart bug.
;
; eSEG_STR
;    Parameters:
;        %1:        REP/REPE/REPZ or REPNE/REPNZ prefix
;        %2:        Source segment override (destination is always ES)
;        %3:        String instruction
;        %4:        An exclamation mark (!) if the state of the IF must
;                be preserved (can not be used together with CMPS or
;                SCAS instructions), otherwise it will be set on
;                return from the macro (i.e. interrupts will be on)
;        CX:        Repeat count
;    Returns:
;        FLAGS for CMPS and SCAS only
;    Corrupts registers:
;        FLAGS
;--------------------------------------------------------------------
%macro eSEG_STR 3-4
%ifndef USE_186    ; 8088/8086 has string instruction restart bug when more than one prefix
%ifidn %4, !                ; Preserve the IF
    FSIS    cmps, %3
%ifn strpos
    FSIS    scas, %3
%endif
%if strpos
    %error "The state of the IF can not be preserved when using CMPS or SCASB!"
%endif
    pushf
    cli
    %1                        ; REP is the prefix that can be lost
    %2                        ; SEG is the prefix that won't be lost
    %3                        ; String instruction
    popf
%else                        ; No need to preserve the IF
    cli
    %1
    %2
    %3
    sti
%endif
%else    ; No bug on V20/V30 and later, don't know about 188/186
    %2
    %1 %3
%endif
%endmacro
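
And a usage sketch (the parameters here are just an example, not lifted from the actual XUB sources):

Code:
        ; copy CX bytes from CS:SI to ES:DI, preserving the caller's IF state
        eSEG_STR    rep, cs, movsb, !

        ; same copy, but with interrupts simply enabled on return
        eSEG_STR    rep, cs, movsb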
 

Cloudschatze

Veteran Member
Joined
Apr 17, 2007
Messages
633
Location
Western United States
Anyway, I can't explain the loss of performance in r605. The only change that should affect performance is WORD_ALIGN being set to 2 and that is supposed to increase performance on 8086/V30 CPUs by WORD aligning stuff like function tables. So I'm stumped. You might want to make a custom build with WORD_ALIGN set to 1 as it used to be just to confirm.
I'll give that a shot soon and will report back. Thank you for the suggestion!

There's one thing I find a bit peculiar though. The write speed is consistently slower on the first two iterations for every revision with the first iteration being the slowest. I wonder why that is?
Each initial run would have included the disk free space calculation. I'm not sure what might explain the second...
 