
XTIDE Universal BIOS

Eudimorphodon

Veteran Member
Joined
May 9, 2011
Messages
5,084
Location
Upper Triassic
If you're not (which would be kind of silly) then yes, you would get higher performance with the newer version with XT-CF cards also.

Already on the Plus train. (It's been a long time since I did the bake-off but my recollection is that it totally delivered on being at least 50%-ish faster.) So, yeah, just wondering if the loop unrolling would do anything with the 8-bit hardware. I'm pretty sure my current flash is configured to use one of the "BIU Offload" variants of the CF driver, if that would have any bearing.
 

Krille

Veteran Member
Joined
Aug 14, 2010
Messages
1,008
Location
Sweden
I'm pretty sure my current flash is configured to use one of the "BIU Offload" variants of the CF driver, if that would have any bearing.
Yeah, it's as fast as it gets then. Still, you might want to upgrade anyway for other reasons, depending on what revision you're using now and whether any bugs have been fixed since then. But the performance won't improve.
 

Cloudschatze

Veteran Member
Joined
Apr 17, 2007
Messages
633
Location
Western United States
Slower transfers but higher IOPS for random reads. That was unexpected. Did you run the benchmark several times with consistent results? BTW, what kind of controller is this?
The results are mostly consistent over several iterations, but with some slight variation. Not enough to change things significantly though; the reported read/write performance with r623 is consistently ~10KB/s less than with r604. I suspect a comparison between r622 and r623 might prove the latter slightly better instead, similar to Malc's results, and I can run those tests later just to confirm. In other words, I don't think r623 makes things slower; that seems to have happened somewhere between r604 and r614.

I'm using one of the "XT-IDE Deluxe" cards, based on the r2 design.
 

Trixter

Veteran Member
Joined
Aug 31, 2006
Messages
7,262
Location
Chicagoland, Illinois, USA
Why would it only help with even-sized instructions?

If you have an odd-sized instruction, it doesn't matter if you jump to it aligned or not, since reading the last one or two bytes still costs a full fetch cycle either way. For example, if I have a 3-byte instruction:

Aligned: first fetch grabs bytes 0 and 1, second fetch grabs 2
Unaligned: first fetch grabs byte 0, second fetch grabs bytes 1 and 2

It's two fetches no matter the alignment. So that's why I say alignment only helps with even-sized instructions.
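
To make that concrete, here's a contrived example (labels and instructions are mine, purely for illustration):

Code:
        ALIGN   2               ; pad so the next label lands on an even address
CopyWords:                      ; made-up jump target
        lodsw                   ; 1-byte opcode
        stosw                   ; 1-byte opcode - aligned, this pair is one word fetch
        loop    CopyWords       ; 2-byte instruction, also a single fetch when aligned

Jump to CopyWords at an even address and those four bytes arrive in two word fetches; start it at an odd address and the same bytes take three.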

If the 8086 prefetch queue can make use of the "extra" fetched byte in the Aligned scenario, then that would be beneficial... but I was under the impression the 8086 prefetch queue is 3 words, not 6 bytes, and is handled as such.

jcxz %%End ; Jump to end if no repeats left (preserves FLAGS)

Actually, you can't rely on this. On 8088 (I have not verified on other CPUs), I witnessed behavior that proved CX was not consistently decremented when an interrupt occurs. I brought this to @reenigne's attention and he checked the microcode and confirmed it. Since what he wrote me is a good explanation, I don't think he'll mind me reproducing it here:

The way that MOVSB works internally is that it first (if the REP flag is set) checks for CX==0 and decrements, then does a move, then (if the REP flag is set) checks for interrupts and repeats.

Let's consider the case going into this code with CX==1. The REP flag is set so we check for CX==0 (it's not) and then decrement. Then we move a byte, incrementing SI and DI. Now let's suppose we've got an interrupt. The RPTI microcode subroutine is executed, which moves the instruction pointer back by 2 bytes, assuming that it will now point at the "REP" (but it doesn't; it now points at the "ES:").

After the interrupt returns, execution is resumed with "ES: MOVSB" even though CX is already 0 at this point! So a second byte is copied. Basically there will always be an extra byte copied if an interrupt occurs on the last byte.

Unfortunately I don't see an easy way to tell if an interrupt happened on that last byte - CX is 0 either way.

If there's just one instance of this code, and you know about every interrupt that could occur (as could be done in a demo situation) then the interrupt vectors could check to see if the return address is at the "ES:" and decrement it if it is. Then the JCXZ/DEC/JMP sequence could be removed and the routine would be faster in the common case when no interrupt occurs. I'm not sure if this would be more trouble than it's worth - I'll let you be the judge of that!

To be as general as possible you could just override all the PC/XT hardware interrupts (INT 8 - INT 0xF) to do this adjustment, and even all the AT hardware interrupts too (INT 0x70 - INT 0x77). I think at some point CPUs changed to work correctly if a multiple-prefix instruction was interrupted but in that case the return-address-adjustment code path would just never get hit.
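
To illustrate what that return-address adjustment might look like, here's a rough sketch (my own, with entirely made-up labels, and with the saved-CS check left out for brevity):

Code:
OldInt8         dd      0               ; original INT 8 vector, saved during setup

Int8Adjust:                             ; made-up replacement INT 8 handler
        push    bp
        mov     bp, sp                  ; [bp+2] = interrupted IP, [bp+4] = CS
        cmp     word [bp+2], SegPrefix  ; did we interrupt right on the "ES:" byte?
        jne     .Chain                  ; (a real handler would compare the CS too)
        dec     word [bp+2]             ; back IP up so it points at the "REP" again
.Chain:
        pop     bp
        jmp     far [cs:OldInt8]        ; chain to the original handler

CopyLoop:                               ; the copy sequence being protected
        rep                             ; the prefix that can be lost
SegPrefix:
        es                              ; the prefix that won't be lost
        movsb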
 

Krille

Veteran Member
Joined
Aug 14, 2010
Messages
1,008
Location
Sweden
I suspect a comparison between r622 and r623 might prove the latter slightly better instead, similar to Malc's results, and I can run those tests later just to confirm. In other words, I don't think r623 makes things slower; that seems to have happened somewhere between r604 and r614.
Yeah, my immediate thought when reading your post was that something changed between r604 and r623 that made transfers a lot slower. However, when looking through the changes in each of those revisions I can't find anything that should affect the performance so drastically. It would be great if you could pinpoint the exact revision where performance dropped.
 

Krille

Veteran Member
Joined
Aug 14, 2010
Messages
1,008
Location
Sweden
If the 8086 prefetch queue can make use of the "extra" fetched byte in the Aligned scenario, then that would be beneficial... but I was under the impression the 8086 prefetch queue is 3 words, not 6 bytes, and is handled as such.

The prefetch queue is treated as a buffer of 6 bytes. It must be, because otherwise it would have to redo the instruction fetch for all instructions not aligned on a WORD boundary. That would be a serious design flaw considering that so many instructions have an odd length. So yes, the benefit of that extra byte fetched is what makes the difference in performance.

Actually, you can't rely on this. On 8088 (I have not verified on other CPUs), I witnessed behavior that proved CX was not consistently decremented when an interrupt occurs. I brought this to @reenigne's attention and he checked the microcode and confirmed it. Since what he wrote me is a good explanation, I don't think he'll mind me reproducing it here:

If I understand reenigne's description correctly, there's an extra string operation for every interrupt? So if there are several interrupts the count will be off by that same amount? It makes me wonder though, everywhere I've read about this string-operation-interrupted bug on 808x CPUs they only mention that it falls through with the count in CX being non-zero (hence the workaround in the eSEG_STR macro). Was it because of a misunderstanding of the details of this bug or is this yet another, previously unknown, bug?

BTW, I'm curious how you discovered it. What did the code look like and what were you doing?

Also, this made me wonder why we haven't had more reports of buggy behaviour in XUB. That is, until I counted the number of times the eSEG_STR macro is used in the list files of the official builds. It turns out that the affected code is used only once and only in the large XT build.

Anyway, so the fix is to disable interrupts during the string operation? It won't help with NMIs but that shouldn't be a concern in this case.
 

Trixter

Veteran Member
Joined
Aug 31, 2006
Messages
7,262
Location
Chicagoland, Illinois, USA
The prefetch queue is treated as a buffer of 6 bytes. It must be, because otherwise it would have to redo the instruction fetch for all instructions not aligned on a WORD boundary. That would be a serious design flaw considering that so many instructions have an odd length. So yes, the benefit of that extra byte fetched is what makes the difference in performance.

This makes sense.

If I understand reenigne's description correctly, there's an extra string operation for every interrupt? So if there are several interrupts the count will be off by that same amount? It makes me wonder though, everywhere I've read about this string-operation-interrupted bug on 808x CPUs they only mention that it falls through with the count in CX being non-zero (hence the workaround in the eSEG_STR macro). Was it because of a misunderstanding of the details of this bug or is this yet another, previously unknown, bug?

I think all of the other explanations are simply wrong, propagated down through the years.

Since there's no easy way to determine how many interrupts occurred (ie. how much CX is wrong), this technique simply can't be trusted on 808x systems.

BTW, I'm curious how you discovered it. What did the code look like and what were you doing?

Decompression code. My test harness does a REP CMPSW to ensure the decompressed data matches the source. It was also easy to visualize: decompress a 16K CGA raw image to b800:0000, and you can literally see where the interrupts occurred, as the image shifts vertically wherever the extra iteration was copied to the screen.
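
The check itself boils down to something like this (simplified; DATA_LEN and the label are placeholders, not my real harness code):

Code:
        ; DS:SI -> original data, ES:DI -> decompressed output
        mov     cx, DATA_LEN / 2        ; DATA_LEN = data size in bytes (assumed even)
        cld
        repe    cmpsw                   ; compare word by word while they match
        jne     VerifyFailed            ; any mismatch leaves ZF clear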

Fun fact: At speeds as slow as 4.77 MHz, I could deliberately introduce more errors by mashing the keyboard keys in the 0.5 seconds it took for the decompression routine to finish.

Also, this made me wonder why we haven't had more reports of buggy behaviour in XUB. That is, until I counted the number of times the eSEG_STR macro is used in the list files of the official builds. It turns out that the affected code is used only once and only in the large XT build.

Anyway, so the fix is to disable interrupts during the string operation? It won't help with NMIs but that shouldn't be a concern in this case.

In my case, I was going for the fastest possible speed, and had a requirement of not disabling interrupts (a music player was in use; disabling them would have introduced audible jitter in the music output whenever something was decompressed), so I had to rewrite my code to not use REP ES: MOVSB. I set DS:SI and ES:DI to the proper source and destination and left interrupts enabled. To minimize the setup/teardown cost, I used registers for temp variables, i.e. MOV AX,DS; MOV ES,AX instead of PUSH DS; POP ES.
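
Something along these lines (simplified, with placeholder offsets; not the actual routine):

Code:
        mov     ax, ds
        mov     es, ax                  ; register move instead of PUSH DS / POP ES
        mov     si, src_off             ; placeholder offsets; with the segments set
        mov     di, dst_off             ; up front, no override prefix is needed
        rep     movsb                   ; single prefix, so it restarts correctly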

Since you're only using it to put strings to screen IIRC, probably doesn't hurt to just disable interrupts. If the routine is used inside and outside interrupts, or you're not sure if it will be, don't do CLI; STI but instead do PUSHF; CLI; POPF so that it will work in both scenarios.
 

Krille

Veteran Member
Joined
Aug 14, 2010
Messages
1,008
Location
Sweden
Sorry for disappearing on you guys. I've been sick and also busy with other stuff.

It looks like r605, based on the following testing/results:

Google Docs - "ide_xtp.bin, V30@8MHz"

Thanks for doing all that testing and creating a spreadsheet. I bet that was a lot of work.

Anyway, I can't explain the loss of performance in r605. The only change that should affect performance is WORD_ALIGN being set to 2 and that is supposed to increase performance on 8086/V30 CPUs by WORD aligning stuff like function tables. So I'm stumped. You might want to make a custom build with WORD_ALIGN set to 1 as it used to be just to confirm.
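
For clarity, what that alignment amounts to is roughly this (illustrative only, not the literal XUB source):

Code:
%define WORD_ALIGN 2                    ; 1 = no padding, 2 = pad to even addresses
        ALIGN   WORD_ALIGN
FunctionTable:                          ; made-up table name
        dw      Function0, Function1    ; WORD entries start on an even address, so
                                        ; each lookup is a single fetch on 8086/V30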

There's one thing I find a bit peculiar though. The write speed is consistently slower on the first two iterations for every revision with the first iteration being the slowest. I wonder why that is?

Since you're only using it to put strings to screen IIRC, probably doesn't hurt to just disable interrupts. If the routine is used inside and outside interrupts, or you're not sure if it will be, don't do CLI; STI but instead do PUSHF; CLI; POPF so that it will work in both scenarios.
This is what I've come up with:
Code:
;--------------------------------------------------------------------
; Repeats string instruction with segment override.
; This macro prevents the 8088/8086 restart bug.
;
; eSEG_STR
;    Parameters:
;        %1:        REP/REPE/REPZ or REPNE/REPNZ prefix
;        %2:        Source segment override (destination is always ES)
;        %3:        String instruction
;        %4:        An exclamation mark (!) if the state of the IF must
;                be preserved (can not be used together with CMPS or
;                SCAS instructions), otherwise it will be set on
;                return from the macro (i.e. interrupts will be on)
;        CX:        Repeat count
;    Returns:
;        FLAGS for CMPS and SCAS only
;    Corrupts registers:
;        FLAGS
;--------------------------------------------------------------------
%macro eSEG_STR 3-4
%ifndef USE_186    ; 8088/8086 has string instruction restart bug when more than one prefix
%ifidn %4, !                ; Preserve the IF
    FSIS    cmps, %3
%ifn strpos
    FSIS    scas, %3
%endif
%if strpos
    %error "The state of the IF can not be preserved when using CMPS or SCASB!"
%endif
    pushf
    cli
    %1                        ; REP is the prefix that can be lost
    %2                        ; SEG is the prefix that won't be lost
    %3                        ; String instruction
    popf
%else                        ; No need to preserve the IF
    cli
    %1
    %2
    %3
    sti
%endif
%else    ; No bug on V20/V30 and later, don't know about 188/186
    %2
    %1 %3
%endif
%endmacro
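
And a usage sketch (the parameters here are just an example, not lifted from the actual XUB sources):

Code:
        ; copy CX bytes from CS:SI to ES:DI, preserving the caller's IF state
        eSEG_STR    rep, cs, movsb, !

        ; same copy, but with interrupts simply enabled on return
        eSEG_STR    rep, cs, movsb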
 

Cloudschatze

Veteran Member
Joined
Apr 17, 2007
Messages
633
Location
Western United States
Anyway, I can't explain the loss of performance in r605. The only change that should affect performance is WORD_ALIGN being set to 2 and that is supposed to increase performance on 8086/V30 CPUs by WORD aligning stuff like function tables. So I'm stumped. You might want to make a custom build with WORD_ALIGN set to 1 as it used to be just to confirm.
I'll give that a shot soon and will report back. Thank you for the suggestion!

There's one thing I find a bit peculiar though. The write speed is consistently slower on the first two iterations for every revision with the first iteration being the slowest. I wonder why that is?
Each initial run would have included the disk free space calculation. I'm not sure what might explain the second...
 