Please Help Debug Assembler Routine!

pearce_jj · May 2, 2012

Just for interest, I tried to follow the Int13h read routine from start to finish. If I've followed it properly there are something like 600 clocks spent just in jumps getting to the actual read routine.

Although that's only about 10% overhead even for single-sector transfers, maybe there is some scope to build an even faster BIOS somehow: stripping it down to a custom BIOS specific to the card, making more use of macros instead of routines, etc. I'm new to ASM at this level of detail, so I'm really looking for suggestions...

BTW I have 32KB ROM on this card.

Chuck(G) · May 2, 2012

You've got a valid point--when you try to generalize BIOS routines, there are decisions to be made and every decision slows things down. One approach is to code several different versions (if you've got the ROM space) of the same routine and use a word in RAM to select the desired combination (i.e. indirect jump). Some SCSI adapters provide a little private RAM within the address space of the BIOS ROM for just that purpose ("scratch pad RAM").
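
The indirect-jump idea can be sketched roughly as below. This is only an illustration, not code from the BIOS under discussion: the labels g_wReadRoutine, SelectReadRoutine, ReadSectors, ReadGeneric and ReadFast are all hypothetical, and it assumes DS addresses the RAM that holds the selector word.

```nasm
; Sketch only: every label here is hypothetical.
; Assumes DS points at the RAM segment holding g_wReadRoutine.
bits 16

section .text

; Run once at initialization, after the hardware has been detected.
; In:  AL = nonzero if the specialised routine may be used (assumed flag)
SelectReadRoutine:
        mov     word [g_wReadRoutine], ReadGeneric
        test    al, al
        jz      .Done
        mov     word [g_wReadRoutine], ReadFast
.Done:
        ret

; Read entry point: one indirect jump replaces the chain of
; compare-and-branch decisions made on every call.
ReadSectors:
        jmp     word [g_wReadRoutine]

ReadGeneric:
        ; general-purpose transfer code would go here
        ret

ReadFast:
        ; transfer code tailored to one configuration would go here
        ret

section .data
g_wReadRoutine: dw      ReadGeneric     ; RAM word that selects the variant
```

The decisions are made once in SelectReadRoutine, so each later call pays only for the indirect jump. The cost is ROM space for every variant plus one writable word, which is where scratch pad RAM would come in.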

You can strip the code down--or even provide a tailored DOS device driver once the system gets booted as an alternative.

pearce_jj · May 2, 2012

I like the device driver idea; that would be a very neat way to keep the 'universal' aspect of the BIOS.

Mike Chambers · May 2, 2012

Trixter said:
The code change you suggested makes the inner loop slower. Bad Mike Chambers! Bad dog!

Oops! Yeah, you're right: that loads 0x200 into DI from an immediate instead of a register.

aitotat · May 3, 2012

pearce_jj said:
is it possible that the BIOS code is somehow taking a (relatively) long time between polls?

The IDE status register is polled (or an IRQ is awaited) before transferring the first block and after transferring each block. Once the status register reports not busy, possible error flags are tested from it and, if necessary, from the IDE error register. The error checking certainly should not be removed to gain speed, but there are a couple of other things you can try. There are some risks involved.

You could try to remove the polling before transferring the first sector. Modern drives are fast and XT systems are slow, so the wait might be unnecessary. This is a bit risky, since errors from outputting the transfer command and parameters wouldn't be checked until after transferring the first block.

A less risky optimization would be to remove the first poll of the status register in IdePollBsyAndFlgInAH and PollBsyOnly in IdeWait.asm. The contents of the first status register read should be discarded to give the drive enough time to update the register properly, but I don't think there will be any problems with modern drives and slow CPUs like the 8088 or V20.

If you want to do some testing, please test how much the timeout processing slows things down. Remove the calls to Timer_InitializeTimeoutWithTicksInCL and Timer_SetCFifTimeout from IdeWait.asm.

Trixter · May 4, 2012

aitotat said:
You could try to remove the polling before transferring the first sector. Modern drives are fast and XT systems are slow, so the wait might be unnecessary. This is a bit risky, since errors from outputting the transfer command and parameters wouldn't be checked until after transferring the first block.

I'd be worried about exploring this because it was my understanding that the controller was not meant to be limited to XT systems. It's an 8-bit ISA card, sure, but it has other advantages (i.e. not limited to 540M drives) and people might want to use it in something faster.

pearce_jj · May 4, 2012

Me too, I'm just trying to understand where the apparent performance degradation for compact flash cards is coming from specifically.

Chuck(G) · May 4, 2012

pearce_jj said:
Me too, I'm just trying to understand where the apparent performance degradation for compact flash cards is coming from specifically.

Many IDE drives (particularly the later ones) have a read-ahead feature. Subcommand 55H of the "set features" command will disable it. I don't know if CF cards have the same functionality.
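
For anyone wanting to try this, a minimal sketch of issuing that command follows. It assumes the standard primary port block at 1F0h and the master device, which may well not match this card (its BIOS translates port addresses), and every name in it is made up for the example. There is no timeout, so it is for experiments only.

```nasm
; Sketch only: port addresses and all names are assumptions.
bits 16

IDE_BASE                equ 1F0h            ; assumed standard primary port block
REG_FEATURES            equ IDE_BASE + 1    ; features register (write)
REG_DRIVE_HEAD          equ IDE_BASE + 6    ; drive/head select register
REG_COMMAND_STATUS      equ IDE_BASE + 7    ; command (write) / status (read)

CMD_SET_FEATURES        equ 0EFh            ; ATA SET FEATURES command
FEAT_DISABLE_READ_AHEAD equ 55h             ; subcommand: disable read look-ahead
FLG_BSY                 equ 80h             ; status bit 7: device busy
FLG_ERR                 equ 01h             ; status bit 0: error

; Out: CF set if the device reported an error (e.g. feature unsupported)
; Destroys AL and DX. No timeout handling.
DisableReadAhead:
        mov     dx, REG_COMMAND_STATUS
.WaitIdle:
        in      al, dx
        test    al, FLG_BSY
        jnz     .WaitIdle                   ; wait until the device is not busy

        mov     dx, REG_DRIVE_HEAD
        mov     al, 0A0h                    ; select the master device
        out     dx, al

        mov     dx, REG_FEATURES
        mov     al, FEAT_DISABLE_READ_AHEAD
        out     dx, al                      ; subcommand goes in the features register

        mov     dx, REG_COMMAND_STATUS
        mov     al, CMD_SET_FEATURES
        out     dx, al                      ; issue the command
.WaitDone:
        in      al, dx
        test    al, FLG_BSY
        jnz     .WaitDone                   ; wait for completion

        test    al, FLG_ERR                 ; TEST always clears CF
        jz      .Done                       ; no error: return with CF clear
        stc                                 ; error bit set: return with CF set
.Done:
        ret
```

A device that does not implement the feature should abort the command and set the error bit, so the carry flag on return is a cheap way to find out whether a given CF card accepts it.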

aitotat · May 4, 2012

pearce_jj said:
Me too, I'm just trying to understand where the apparent performance degradation for compact flash cards is coming from specifically.

I've tested a few CF cards and only one of them supports blocks larger than 1 sector (it supports 2-sector blocks). They support block mode commands, but when the block size is 1 sector, it is essentially the same as no block mode at all. For comparison, the 6 GB Hitachi microdrive supports 16-sector blocks.

pearce_jj · May 4, 2012

Thanks (Chuck+aitotat) for the replies. So the speed at which the code can get the next command to the card is then highly significant, perhaps explaining why there is less difference with a much faster processor.

pearce_jj · May 5, 2012

The block size theory seems to hold - I bodged up some 'short-cut' code and transfer rates from CF were significantly improved, from about 255KB/s to just about 300KB/s. Here's the code:

Code:

ReadFromDrive:
	; Prepare to read data to ES:SI
	mov		bx, g_rgfnPioRead
	call	InitializePiovarsInSSBPwithSectorCountInAH

	; Wait until drive is ready to transfer
	call	IdeWait_IRQorDRQ					; Wait until ready to transfer
	jc		SHORT ReturnWithTransferErrorInAH		; Jump out if there was a device error
	
	mov		cx, [bp+PIOVARS.wSectorsInBlock]		; Max 128
	cmp		cx, 1						; Are we working in single-sector transfers?
	jne	.ReadNextBlockFromDriveEntry				; if not, use the normal block-transfer loop

	; single-sector mode - we'll take some short-cuts to reduce overhead
	; find the status port for use here
	mov		dl, STATUS_REGISTER_in				; 
	mov		bl, IDEVARS.wPort
	call	GetPortToDXandTranslateA0andA3ifNecessary
	push		dx						; then save it for later
	xchg		si, di						; ES:DI now points buffer

.ReadNextSectorFromDrive:						; single-sector transfer short-cut code
	mov		dx, [bp+PIOVARS.wDataPort]			; get IO port
	call		[bp+PIOVARS.fnXfer]				; get the data
	dec		BYTE [bp+PIOVARS.bSectorsLeft]			; 
	inc		BYTE [bp+PIOVARS.bSectorsDone]			; update variables
	mov		cl, [bp+PIOVARS.bSectorsLeft]			; 
	pop		dx						; get back status port (or clear it from stack)
	or		cl, cl						; check if there's more to do
	jz	.SectorTransferComplete					; we're done when CL is zero

.ReadNextSectorFromDrivePollLoop:					; more to get so wait for the device
	in		al, dx						; read status port value
	test		al, FLG_STATUS_BSY				; Is the controller busy?
	jnz	.ReadNextSectorFromDrivePollLoop			; If so, keep checking
	push		dx						; Ready for next sector - save status port again
	mov		cx, 1						; reset sector count to 1
	jmp	.ReadNextSectorFromDrive				; and collect the data

.SectorTransferComplete:							; finished the transfer
	xchg		di, si						; reset DS:DI to point to DPT
	jmp	 CheckErrorsAfterTransferringLastBlock			; check all is well

ALIGN JUMP_ALIGN
.ReadNextBlockFromDriveEntry:
	xchg		si, di						; ES:DI now points buffer
.ReadNextBlockFromDrive:

(.... same as normal from there)

Some rough calculations from that suggest the poll-loop in the current v2b is about 1400 clocks per iteration?
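
The arithmetic behind a figure of that size can be reconstructed as follows. This is a sketch under assumptions that are not stated in the post: a 4.77 MHz CPU clock, 512-byte sectors, and 1 KB = 1024 bytes.

```latex
% Assumptions (not from the post): f = 4.77 MHz, 512-byte sectors, 1 KB = 1024 B.
% Clocks available per sector at a sustained transfer rate R (bytes/s):
\[
  N(R) = \frac{f \cdot 512}{R}
\]
% Before the short-cut (255 KB/s) and after it (300 KB/s):
\[
  N_{255} = \frac{4.77 \times 10^{6} \cdot 512}{255 \cdot 1024} \approx 9353
  \qquad
  N_{300} = \frac{4.77 \times 10^{6} \cdot 512}{300 \cdot 1024} = 7950
\]
% Clocks saved per sector by the short-cut:
\[
  \Delta N = N_{255} - N_{300} \approx 1403
\]
```

On those assumptions the short-cut saves roughly 1400 clocks per sector, which matches the estimate above. Using 1 KB = 1000 bytes instead gives about 1440, so the conclusion does not depend much on that choice.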
