XTIDE Universal BIOS

Krille · Jun 9, 2022

@FreddyV; Have you tried replacing the first "xchg ah, al" with a "mov ah, al"? It should be slightly faster in theory (if my documentation is correct).

It would be interesting to see the difference in performance (on various machines, with 8088 and 8086 CPUs) so if anyone wants to implement your suggestion in the actual XUB code and test/benchmark it then I might consider adding another "controller" specifically for 8086 machines if there's a significant difference.

maxtherabbit · Jun 9, 2022

Why not ROL AX,8?

Krille · Jun 9, 2022

maxtherabbit said:
Why not ROL AX,8?

That's a 186+ instruction and much slower. Besides, there's no need to preserve the contents of AH.

maxtherabbit · Jun 9, 2022

Krille said:
That's a 186+ instruction and much slower. Besides, there's no need to preserve the contents of AH.

Pretty sure it was always in the instruction set. My understanding is that ROL is faster than XCHG. But yeah if you don't need to preserve AH then it's a moot point

Krille · Jun 9, 2022

maxtherabbit said:
Pretty sure it was always in the instruction set.

Yes, but only the rotate-by-one variant of the instruction. You can rotate by multiple bit positions but then you need to put the count in CL (e.g. ROL AX, CL).

maxtherabbit said:
My understanding is that ROL is faster than XCHG. But yeah if you don't need to preserve AH then it's a moot point

Multibit shifts and rotates are among the slowest instructions on the 808x (depending on how many bits you shift/rotate). Only the division and multiplication instructions are slower. If you need to rotate 8 bit positions then XCHG is way faster.
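To put rough numbers on the comparison above, here is a minimal Python sketch. It checks that rotating a 16-bit value by 8 gives the same result as swapping its two bytes, then compares clock counts. The clock figures are the register-operand values from the commonly published 8086/8088 timing tables and are assumed here, not measured; they ignore instruction fetch time. The function names are made up for illustration.

```python
def rol16(value: int, count: int) -> int:
    """Rotate a 16-bit value left by `count` bits, like ROL AX, CL."""
    count %= 16
    value &= 0xFFFF
    return ((value << count) | (value >> (16 - count))) & 0xFFFF


def xchg_ah_al(value: int) -> int:
    """Swap the high and low bytes of a 16-bit value, like XCHG AH, AL."""
    value &= 0xFFFF
    return ((value & 0x00FF) << 8) | (value >> 8)


# Assumed clock counts from published 8086/8088 tables (register operands).
XCHG_REG_REG_CLOCKS = 4
MOV_REG_IMM_CLOCKS = 4  # MOV CL, 8 to set up the rotate count


def rol_by_cl_clocks(count: int) -> int:
    """ROL reg, CL is listed as 8 clocks plus 4 clocks per bit rotated."""
    return 8 + 4 * count


if __name__ == "__main__":
    ax = 0x12AB
    print(hex(rol16(ax, 8)), hex(xchg_ah_al(ax)))
    print("MOV CL,8 + ROL AX,CL:", MOV_REG_IMM_CLOCKS + rol_by_cl_clocks(8), "clocks")
    print("XCHG AH,AL:", XCHG_REG_REG_CLOCKS, "clocks")
```

With those assumed figures the rotate sequence comes to 44 clocks against 4 for the XCHG.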

Trixter · Jun 9, 2022

FreddyV said:
Then, for the XTIDE it is not worth the effort.
For the CH375 it is a great improvement and simple to use (Just use a .SYS file instead of the other).

But you can't boot off of it, which is why I'm not that excited about it.

I agree XUB can be made faster by unrolling, but unfortunately there is a code space limitation. You'd need more than 8K EEPROM to do it, and the speedup would only be about 8% or so.

Krille said:
Have you tried replacing the first "xchg ah, al" with a "mov ah, al"? It should be slightly faster in theory (if my documentation is correct).

I don't think there's any difference because the opcode sizes are the same (86 C4 vs. 88 C4).

I was just verifying this with reenigne the other day, with real measured tests: 8088 is almost completely I/O bound. Whether a CPU instruction takes 2, 3, or 4 cycles, it all gets quantized to 4 cycles because that's how long it takes to read a byte, so the only way such instructions can actually execute in less than 4 cycles is if they're prefetched, and that almost never happens. If you put them after a very long instruction like MUL, then yes, they'll be prefetched, but you just spent 140 cycles doing the MUL so saving 2 or 3 cycles after that is a moot point. So, while that's depressing, optimizing for 8088 is really easy: The smallest code wins >95% of the time.

8086 and higher change that; 16-bit reads/writes help greatly, and 16-bit alignment is important. On the 286 and higher, MUL/DIV take 22 cycles or fewer, so you can multiply by 51 faster with MUL than with adds and shifts. But for 8088, smallest wins.
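The "smallest code wins" rule of thumb can be sketched as a very rough Python model. It assumes an instruction that was not prefetched costs whichever is larger: its fetch time (4 clocks per byte on the 8088's 8-bit bus) or its listed execution time. This is a simplification for illustration only; it ignores partial prefetch overlap, memory operands and DRAM refresh, and the function name is made up.

```python
BUS_CYCLE_CLOCKS = 4  # one 8088 bus cycle fetches one byte


def estimated_clocks_8088(size_bytes: int, eu_clocks: int) -> int:
    """Rough cost of a non-prefetched instruction: the slower of fetch and execution."""
    return max(size_bytes * BUS_CYCLE_CLOCKS, eu_clocks)


if __name__ == "__main__":
    # XCHG AH, AL: 2 bytes, 4 clocks listed. MOV AH, AL: 2 bytes, 2 clocks listed.
    print("xchg ah, al:", estimated_clocks_8088(2, 4))
    print("mov  ah, al:", estimated_clocks_8088(2, 2))
```

Under this model both two-byte instructions cost 8 clocks, which is why swapping one for the other is not expected to change anything on an 8088.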

FreddyV · Jun 10, 2022

Trixter said:
But you can't boot off of it, which is why I'm not that excited about it.

It is just a matter of time; it could take only a few days to update the original BIOS to make it as fast as my driver.
With the Lo-tech board it is not a problem; with this board, a logic chip needs to be added to correct the address decoding.

It is also really simple to build a parallel port version of this board (like OPL2LPT and so on) to have something faster and cheaper than a ZIP on the parallel port.
We could also do a mouse/keyboard driver, or whatever.
So, this is definitely worth a try.

The Lo-tech board also supports 16-bit I/O read/write (we can place A1 on the chip address line); it is working, for an additional speed boost.
On an 8-bit bus, it is faster than anything we can do with the XTIDE.
Plus, we are less dependent on the CHS geometry (the drive is accessed in LBA), so we can partition/format from Windows 10 with no problem.
I will also soon add partition selection, to select the partition we want to mount with this driver.
It is also possible to write a utility to change the partition on the fly (no reboot) and, of course, to mount all the partitions at the same time.

Ultimately, we could have a board with 2 USB ports: one bootable, the other using the driver to mount whatever we want and do USB key hot swap.

Finally, there is no limit....

sorphin · Jun 10, 2022

FreddyV said:
It is just a matter of time; it could take only a few days to update the original BIOS to make it as fast as my driver.
With the Lo-tech board it is not a problem; with this board, a logic chip needs to be added to correct the address decoding.

I don't use the onboard socket for the BIOS.. I use one of the spare sockets on the 5150/5160/clone boards, or one of my doublerom boards (or an ethernet card rom socket or....) you get the idea. less hassle than butchering the card itself (unless you're somehow out of slots)

FreddyV · Jun 10, 2022

sorphin said:
I don't use the onboard socket for the BIOS.. I use one of the spare sockets on the 5150/5160/clone boards, or one of my doublerom boards (or an ethernet card rom socket or....) you get the idea. less hassle than butchering the card itself (unless you're somehow out of slots)

Of course. The PCs I use the most are the PC1512/1640, with only 3 ISA slots and no additional ROM socket (the PC200 has only 2 slots).
Anyway, yes, the onboard ROM socket is not mandatory.

Krille · Jun 15, 2022

Trixter said:
I agree XUB can be made faster by unrolling, but unfortunately there is a code space limitation. You'd need more than 8K EEPROM to do it, and the speedup would only be about 8% or so.

Let's see if you're right! (see below)

Trixter said:
I don't think there's any difference because the opcode sizes are the same (86 C4 vs. 88 C4).

I was just verifying this with reenigne the other day, with real measured tests: 8088 is almost completely I/O bound. Whether a CPU instruction takes 2, 3, or 4 cycles, it all gets quantized to 4 cycles because that's how long it takes to read a byte, so the only way such instructions can actually execute in less than 4 cycles is if they're prefetched, and that almost never happens. If you put them after a very long instruction like MUL, then yes, they'll be prefetched, but you just spent 140 cycles doing the MUL so saving 2 or 3 cycles after that is a moot point. So, while that's depressing, optimizing for 8088 is really easy: The smallest code wins >95% of the time.

8086 and higher change that; 16-bit reads/writes help greatly, and 16-bit alignment is important. On the 286 and higher, MUL/DIV take 22 cycles or fewer, so you can multiply by 51 faster with MUL than with adds and shifts. But for 8088, smallest wins.

I knew it wouldn't matter on an 8088 but I am curious if it makes a difference on an 8086. That is what FreddyV is testing on after all.

Anyway, I've created a couple of new defines that can be used to unroll the transfer loops a bit more. EXTRA_LOOP_UNROLLING_SMALL and EXTRA_LOOP_UNROLLING_LARGE. They are included in the small and large BIOS builds respectively. Do note that EXTRA_LOOP_UNROLLING_LARGE actually fits in the small builds but I decided against that because I don't want to spend too much of the available ROM space on this. We might need the space for future changes (supporting the CH375 controller for example). Besides, the law of diminishing returns applies here, so I'm not sure it's worth the ROM space for this reason. Anyone who disagrees with me can always make a custom build with EXTRA_LOOP_UNROLLING_LARGE included.

I also added a USE_086 define. This is for people with actual 8086 or V30 processors and all it does is WORD align jump targets for higher performance. How much higher you say? I don't know. That's what I need you guys to test.

So, in short, r623 is out.

Trixter · Jun 15, 2022

I can't test until at least the second week of August; I'm currently working on a project, sorry. Maybe someone else can help in the meantime.

I read in the r623 notes "jump destinations WORD aligned which should improve performance on 8086/V30 CPUs" -- Where did you source that information from? I don't think that's true; 16-bit-aligned word memory accesses occur in 4 cycles, but if you jump to a word-aligned instruction that is an odd number of bytes, I don't think you've gained anything there...

Are there instructions for obtaining and assembling the XUB source? I would imagine that building from source is how I'll be able to enable the extra loop unrolling, but the changeset for r623 (and that entire site, actually) doesn't seem to have a simple download link for the source. Is there a way to download the entire trunk easily? If not, what SCCS is it using and how can one grab the source?

Malc · Jun 16, 2022

Trixter said:
Are there instructions for obtaining and assembling the XUB source? I would imagine that building from source is how I'll be able to enable the extra loop unrolling, but the changeset for r623 (and that entire site, actually) doesn't seem to have a simple download link for the source. Is there a way to download the entire trunk easily? If not, what SCCS is it using and how can one grab the source?

xtideuniversalbios

xtideuniversalbios.org

Different builds

XTIDE Universal BIOS is modular and has many optional features. It is not possible to include all features in the Small (8 kiB) builds. Officially released builds include the modules that benefit most people. If you are not satisfied with the official builds, you can quite easily make your own custom build from source to include only the features you need.
See the build instructions for module descriptions and how to create custom builds.

Krille · Jun 16, 2022

Trixter said:
I can't test until at least the second week of August; I'm currently working on a project, sorry. Maybe someone else can help in the meantime.

No worries!

Trixter said:
I read in the r623 notes "jump destinations WORD aligned which should improve performance on 8086/V30 CPUs" -- Where did you source that information from? I don't think that's true; 16-bit-aligned word memory accesses occur in 4 cycles, but if you jump to a word-aligned instruction that is an odd number of bytes, I don't think you've gained anything there...

I don't think the length of the instruction is relevant. On 8086 and V30 CPUs RAM is accessed 16 bits at a time on WORD boundaries. Fetching an instruction from an odd address will waste half of that memory access and will also require an extra memory access to fetch the rest of the instruction if it is shorter than or equal to 3 bytes (which most instructions are). If, on the other hand, the instruction is WORD aligned then the EU can start executing the instruction as soon as the BIU fetches it which a lot of the time will be immediately since many instructions are 2 bytes or less. Longer instructions will of course require additional memory accesses but the take-away is that the EU should be fed ASAP and with no waste of RAM access bus cycles.

At least that's my understanding of how alignment works. I could be wrong.
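For anyone who wants to play with the numbers, here is a small Python sketch that counts how many aligned 16-bit bus accesses are needed to read one instruction of a given length starting at an even or an odd address. It only looks at the first instruction at a jump target and does not model the prefetch queue or the instructions that follow, so it is an illustration of the counting, not a performance prediction. The function name and the example addresses are made up.

```python
def word_fetches(start_address: int, length: int) -> int:
    """Aligned 16-bit accesses needed to read `length` bytes starting at `start_address`."""
    if length <= 0:
        return 0
    first_word = start_address // 2
    last_word = (start_address + length - 1) // 2
    return last_word - first_word + 1


if __name__ == "__main__":
    print("bytes  even  odd")
    for length in range(1, 5):
        even = word_fetches(0x100, length)  # WORD-aligned jump target
        odd = word_fetches(0x101, length)   # unaligned jump target
        print(f"{length:5}  {even:4}  {odd:3}")
```

For lengths 1 to 4 this gives 1, 1, 2, 2 accesses at the even address and 1, 2, 2, 3 at the odd one.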

Trixter said:
Are there instructions for obtaining and assembling the XUB source? I would imagine that building from source is how I'll be able to enable the extra loop unrolling, but the changeset for r623 (and that entire site, actually) doesn't seem to have a simple download link for the source. Is there a way to download the entire trunk easily? If not, what SCCS is it using and how can one grab the source?

Malc has answered this but I just wanted to say that the extra loop unrolling is in the official builds already so if you or anyone else wants to run benchmarks to see the speed differences there's no need for custom builds.

Trixter · Jun 16, 2022

I appreciate the link to the build instructions, thank you.

Krille said:
At least that's my understanding of how alignment works. I could be wrong.

No, your understanding is correct; we just disagree on how useful it will be. It only helps even-sized instructions, and I don't know if the code has more of those than not. As always, there's no substitute for metrics, so I'll test this when I'm able, someday.

I noticed a few things in the CGA snow code:

Code:

WAIT_UNTIL_SAFE_CGA_WRITE
xchg    ax, bx

Is nasm smart enough to use the optimized form ("xchg bx,ax") when it encounters this? If not, may want to change it.

Code:

.RepMovsbWithoutWaitSinceUnknownPort:
    eSEG_STR rep, es, movsb

Does that code generate REP ES: MOVSB? If so, is that guaranteed to run with interrupts disabled? It doesn't look like it, and if interrupts are enabled and it's <386, an interrupt will result in the REP stopping before CX=0.

Cloudschatze · Jun 17, 2022

So, in my V30@8MHz-based system:

r604, ide_xtp.bin

Code:

DiskTest, by James Pearce.  Version 1.2.

Configuration: 4096 KB test file, 256 IOs in random tests.

Write Speed         : 205.42 KB/s
Read Speed          : 848.03 KB/s
8K random, 70% read : 43.6 IOPS
Sector random read  : 122.5 IOPS

Average seek, including latency, is 8 ms.

r623, ide_xtp.bin

Code:

DiskTest, by James Pearce.  Version 1.2.

Configuration: 4096 KB test file, 256 IOs in random tests.

Write Speed         : 193.66 KB/s
Read Speed          : 839.34 KB/s
8K random, 70% read : 43.5 IOPS
Sector random read  : 126.1 IOPS

Average seek, including latency, is 8 ms.
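To make the two runs easier to compare, here is a short Python sketch that computes the relative change from r604 to r623 for each figure. The numbers are copied from the two DiskTest outputs above; the dictionary keys and the function name are just labels chosen for this example.

```python
def percent_change(old: float, new: float) -> float:
    """Relative change from `old` to `new`, in percent."""
    return (new - old) / old * 100.0


# Figures copied from the r604 and r623 DiskTest runs above.
r604 = {"write KB/s": 205.42, "read KB/s": 848.03,
        "8K random IOPS": 43.6, "sector random read IOPS": 122.5}
r623 = {"write KB/s": 193.66, "read KB/s": 839.34,
        "8K random IOPS": 43.5, "sector random read IOPS": 126.1}

if __name__ == "__main__":
    for name in r604:
        print(f"{name}: {percent_change(r604[name], r623[name]):+.1f}%")
```

This works out to about -5.7% for writes, -1.0% for reads, -0.2% for the 8K random test and +2.9% for sector random reads.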

Krille · Jun 17, 2022

Trixter said:
No, your understanding is correct; we just disagree on how useful it will be. It only helps even-sized instructions, and I don't know if the code has more of those than not. As always, there's no substitute for metrics, so I'll test this when I'm able, someday.

Why would it only help with even-sized instructions? By making sure that instructions start on even addresses we don't waste RAM accesses. Now, there are plenty of alignment directives in the code where we both jump to AND fall through to a label. Those are the ones I feel unsure about whether they should be removed or not. I read somewhere that if you fall through to and jump to an aligned label about an equal amount then it's not worth having the alignment padding. Of course, the hard part is knowing exactly how much we fall through to versus jump to a label. But having only jump targets aligned seems to be a good thing to me. The only downside might be cache pollution where the cache is not used as effectively with the code being larger than it has to be - especially where loops become too large to fit in the cache. But these old systems don't have any cache, generally speaking.

Trixter said:
I noticed a few things in the CGA snow code:

Code:

WAIT_UNTIL_SAFE_CGA_WRITE
xchg    ax, bx

Is nasm smart enough to use the optimized form ("xchg bx,ax") when it encounters this? If not, may want to change it.

Yes it is. In fact, I don't know of any assembler that doesn't use the shorter form of the instruction except for the original DOS DEBUG.

Trixter said:
Code:

.RepMovsbWithoutWaitSinceUnknownPort:
    eSEG_STR rep, es, movsb

Does that code generate REP ES: MOVSB? If so, is that guaranteed to run with interrupts disabled? It doesn't look like it, and if interrupts are enabled and it's <386, an interrupt will result in the REP stopping before CX=0.

Yes it does, and it does not disable interrupts. However, the eSEG_STR macro is used specifically to avoid this bug (which exists only in 8088/8086 CPUs). It looks like this:

Code:

;--------------------------------------------------------------------
; Repeats string instruction with segment override.
; This macro prevents 8088/8086 restart bug.
;
; eSEG_STR
;	Parameters:
;		%1:		REP/REPE/REPZ or REPNE/REPNZ prefix
;		%2:		Source segment override (destination is always ES)
;		%3:		String instruction
;		CX:		Repeat count
;	Returns:
;		FLAGS for cmps and scas only
;	Corrupts registers:
;		FLAGS
;--------------------------------------------------------------------
%macro eSEG_STR 3
%ifndef USE_186	; 8088/8086 has string instruction restart bug when more than one prefix
	%%Loop:
		%1						; REP is the prefix that can be lost
		%2						; SEG is the prefix that won't be lost
		%3						; String instruction
	FSIS	cmps, %3
%ifn strpos
	FSIS	scas, %3
%endif
%if strpos						; Must preserve FLAGS
		jcxz	%%End			; Jump to end if no repeats left (preserves FLAGS)
		jmp		SHORT %%Loop	; Loop while repeats left
	%%End:
%else							; No need to preserve FLAGS
		inc		cx
		loop	%%Loop
%endif
%else	; No bug on V20/V30 and later, don't know about 188/186
	%2
	%1 %3
%endif
%endmacro

Cloudschatze said:
So, in my V30@8MHz-based system:

Slower transfers but higher IOPS for random reads. That was unexpected. Did you run the benchmark several times with consistent results? BTW, what kind of controller is this?

Malc · Jun 17, 2022

Just did a quick test on my XT 5160 with stock 8088 and GW R4 with SanDisk Ultra II 2 GB CF.

Official XUB r622 8 KB IDE_XT.BIN. Write = 127.72 KB/s | Read = 246.15 KB/s

Official XUB r623 8 KB IDE_XT.BIN. Write = 132.21 KB/s | Read = 253.62 KB/s

Official XUB r623 IDE_XTL.BIN. Write = 132.73 KB/s | Read = 258.10 KB/s

So the EXTRA_LOOP_UNROLLING_LARGE is a tad better on my XT.

Krille · Jun 17, 2022

Thanks for testing Malc! That's more in line with what I expected.

Eudimorphodon · Jun 17, 2022

Would we expect any change in performance with these changes on a V20 with XT-CF-Lite hardware? It's been a while since I tried a new version, I'd be willing to test this on my Tandy 1000 HX.

Krille · Jun 17, 2022

No, with a V20 you would only get higher performance with XT-IDE cards. That's with the assumption that you are using one of the XT Plus builds of course. If you're not (which would be kind of silly) then yes, you would get higher performance with the newer version with XT-CF cards also.
