Some new code, joystick and BCD

deathshadow · Mar 2, 2014

Working on my game library for a few things -- biggest of which being memory footprint and speed, I've found two major bottlenecks...

I didn't realize just how ridiculously USELESS the joystick's range of sensitivity was on the Jr at 128k. My existing routine was returning a center value of 16 -- and running that twice (once for each axis) sucks down an ungodly 30% of my frame rate cycle!!! No wonder having interrupts enabled caused jitter!

I can't use the PIT for it, as both channels 0 and 2 are in use for other things like actual game timing and audio control... and one of these upcoming programs is going to be multiplayer supporting two joysticks, so I've had to come at this from a whole different direction.

First thing I needed to do was come up with a test for the joystick values that would take the same amount of time whether true or false, so I could then use a single loop to read all four values. It would need to preserve the original value, and have a way to mask off joysticks that aren't connected.

Masking is easy: we figure out which sticks we want to read, and simply AND their bits against the value we have in AL from port $201. Since the joystick 0 X axis is bit 0, we can ROR AL, 1 and then ADC to the memory location storing the X coord value. Lather, rinse, repeat for each of the other axes. Since I use ROR for this the value is still in there, so we can OR AL, AL and then LOOPNZ... The loop is needed so there's a timeout, since an unconnected, unmasked stick would otherwise loop indefinitely.

Figuring out which axes are connected isn't too hard -- we just set our mask to $0F and run our routine. If an axis's count equals our timeout (CX) starting value, it bombed.

So that gives me this TP7 unit.

Code:

unit joystick;

interface
	
var
	stick0x, stick0y, stick1x, stick1y:word;
	stickMask:byte;
	
procedure stickUpdate;	
function button0a:boolean;
function button0b:boolean;
function button1a:boolean;
function button1b:boolean;
	
implementation

const
	stickLimit = $8000;

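procedure stickUpdate; assembler;
asm
	xor  al, al
	mov  ah, stickMask    { axes we actually want to read }
	xor  bx, bx           { BX = 0, so ADC mem, BX adds only the carry }
	mov  cx, stickLimit   { timeout for the polling loop }
	mov  dx, $201         { game port }
	mov  stick0x, bx
	mov  stick0y, bx
	mov  stick1x, bx
	mov  stick1y, bx
	cli
	out  dx, al           { any write fires the one-shots }
@loop:
	in   al, dx
	and  al, ah           { drop the button bits and masked-off axes }
	ror  al, 1            { axis bit -> carry }
	adc  stick0x, bx      { count it while the bit is still set }
	ror  al, 1
	adc  stick0y, bx
	ror  al, 1
	adc  stick1x, bx
	ror  al, 1
	adc  stick1y, bx
	or   al, al           { any axis still timing? }
	loopnz @loop          { repeat until all clear or CX runs out }
	sti
end;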

function button0a:boolean; assembler;
asm
	mov  dx, $201
	in   al, dx
	and  al, $10
	xor  al, $10
end;

function button0b:boolean; assembler;
asm
	mov  dx, $201
	in   al, dx
	and  al, $20
	xor  al, $20
end;

function button1a:boolean; assembler;
asm
	mov  dx, $201
	in   al, dx
	and  al, $40
	xor  al, $40
end;

function button1b:boolean; assembler;
asm
	mov  dx, $201
	in   al, dx
	and  al, $80
	xor  al, $80
end;

begin
	asm
		mov  stickMask, $0F
		call stickUpdate
		xor  al, al
		mov  bx, stickLimit
		cmp  stick0x, bx
		je   @test0y
		or   al, $01
	@test0y:
		cmp  stick0y, bx
		je   @test1x
		or   al, $02
	@test1x:
		cmp  stick1x, bx
		je   @test1y
		or   al, $04
	@test1y:
		cmp  stick1y, bx
		je   @done
		or   al, $08
	@done:
		mov  stickMask, al
	end;
end.

Which is stable for all four stick axes, auto-detects if they're connected, and so forth. I also hard-coded the button checks instead of making them functional because even the stupid shift was killing me. If you're wondering why I'm using BX for zero, mem:imm16 are usually two or three bytes larger than mem:reg (depending on the operation), so that's ~6 bytes saved on average, inside the loop netting more loops.

On the 128k Jr it's returning a center value of 10, which is even more sucktastic, but since I'm only using it for digital-style input with a dead zone, it's functional. With this routine, disabling interrupts also no longer appears to cause timing problems on the T1K; I suspect that's because I'm only doing it once instead of twice in the input "slice".
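For anyone who wants to poke at the logic without the hardware, here is a rough Python model of the polling loop and the auto-detect pass described above. It is only a sketch under assumptions: `make_port`, `stick_update` and `detect_mask` are made-up names, and the fake port simply holds each axis bit high for a fixed number of reads instead of modelling the real capacitor timing.

```python
def make_port(release_after):
    """Hypothetical stand-in for port $201.

    Axis bit i reads 1 until release_after[i] reads have happened;
    None means nothing is connected, so the bit never clears.
    """
    reads = 0

    def read_port():
        nonlocal reads
        value = 0xF0                      # button bits, all released
        for axis, n in enumerate(release_after):
            if n is None or reads < n:
                value |= 1 << axis
        reads += 1
        return value

    return read_port


def stick_update(read_port, stick_mask, limit=0x8000):
    """One pass over the port counts all four axes at the same time."""
    counts = [0, 0, 0, 0]                 # stick0x, stick0y, stick1x, stick1y
    for _ in range(limit):                # CX = stickLimit
        bits = read_port() & stick_mask   # IN AL,DX / AND AL,AH
        for axis in range(4):             # ROR AL,1 / ADC mem,BX (BX = 0)
            counts[axis] += (bits >> axis) & 1
        if bits == 0:                     # OR AL,AL, so LOOPNZ falls through
            break
    return counts


def detect_mask(counts, limit=0x8000):
    """An axis that counted all the way to the timeout is treated as absent."""
    mask = 0
    for axis, count in enumerate(counts):
        if count != limit:
            mask |= 1 << axis
    return mask
```

With only stick 0 "plugged in", `stick_update(make_port([10, 12, None, None]), 0x0F)` runs to the timeout on the two missing axes, `detect_mask` then yields `0x03`, and later calls with that mask stop as soon as both real axes have cleared.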

The other bottleneck is outputting score updates. 32 bit math is slow on the 8088, converting a 32 bit integer to a string even more so. While the routine I had for a fixed 8 digit result was pretty peppy as such things go, the simple fact was that by the time I got to actually outputting the score it was just painful to use.

Having made a port to the C64 I knew BCD was a great way around these issues; but I know jack **** about doing it on the x86 platform. (I know it on the Z80 and 6502 pretty good). For genuine speed at outputting the score I need to use unpacked, and since the only math I need is addition I was like "ok, how hard can it be?" especially since little-endian 8 bytes would be my fastest approach (both for math and display)

Thing is, while it's WAY faster, my addition method feels, I dunno... sloppy. Just wondering if anyone knows a better way of doing it. I've unrolled the loop for speed and to remove extra unneeded operations.

BCDUnpacked is an array[0..7] of byte;

Code:

procedure BCDUnpackedAdd(var b1, b2:BCDUnpacked); assembler;
asm
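	les  di, b1           { ES:DI -> b1, the running total }
	mov  dx, ds           { keep the program's DS in DX }
	lds  si, b2           { DS:SI -> b2, the amount to add }
	
	{0}
	lodsb
	xor  ah, ah
	add  al, es:[di]
	aaa                   { adjust to 0..9, carry lands in AH }
	stosb
	
	{1}
	lodsb
	add  al, ah           { carry from the previous digit }
	xor  ah, ah
	add  al, es:[di]
	aaa
	stosb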
	
	{2}
	lodsb
	add  al, ah
	xor  ah, ah
	add  al, es:[di]
	aaa
	stosb
	
	{3}
	lodsb
	add  al, ah
	xor  ah, ah
	add  al, es:[di]
	aaa
	stosb
	
	{4}
	lodsb
	add  al, ah
	xor  ah, ah
	add  al, es:[di]
	aaa
	stosb
	
	{5}
	lodsb
	add  al, ah
	xor  ah, ah
	add  al, es:[di]
	aaa
	stosb
	
	{6}
	lodsb
	add  al, ah
	xor  ah, ah
	add  al, es:[di]
	aaa
	stosb
	
	{7}
	lodsb
	add  al, ah
	xor  ah, ah
	add  al, es:[di]
	aaa
	stosb
	
	mov  ds, dx
end;

Something about it feels wrong... I can't put my finger on it. At first I thought I needed extra AAA in there, once for the carry in AH and once for the add, but since 9+1+9 == 19, I only need one so that's not the problem... It works, it's WAY faster than longint math with trying to turn that longint to a string for output... I dunno, I can't place what's not kosher about it. I almost want to add some short-circuit code for when the value being added is empty/done and CF is zero, but those extra jumps and tests take longer than just letting it finish. (scary with that many memory ops involved)... Maybe pair them up for LODSW? Nah, too much flipping to another register... What am I doing wrong there that's making me think I'm doing it wrong?!?
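As a sanity check on the "only one AAA per digit" reasoning, here is a small Python model of the unrolled add. It is a sketch, not a replacement for the routine: `aaa`, `bcd_unpacked_add`, `to_bcd` and `from_bcd` are illustrative names, and the model returns the final carry, which the assembly version simply drops.

```python
def aaa(al, ah, af):
    """8086 AAA: fix up AL after adding two unpacked BCD digits."""
    if (al & 0x0F) > 9 or af:
        al = (al + 6) & 0xFF
        ah = (ah + 1) & 0xFF
    return al & 0x0F, ah


def bcd_unpacked_add(b1, b2):
    """b1 += b2 in place; both are 8 digits, least significant first."""
    carry = 0                                        # AH
    for i in range(8):
        al = b2[i] + carry                           # LODSB / ADD AL,AH
        af = (al & 0x0F) + (b1[i] & 0x0F) > 0x0F     # AF comes from the last ADD
        al = (al + b1[i]) & 0xFF                     # ADD AL,ES:[DI]
        b1[i], carry = aaa(al, 0, af)                # AH was zeroed; AAA; STOSB
    return carry                                     # the asm version drops this


def to_bcd(n):
    """Integer -> 8 unpacked digits, little-endian."""
    return [(n // 10 ** i) % 10 for i in range(8)]


def from_bcd(digits):
    """8 unpacked little-endian digits -> integer."""
    return sum(d * 10 ** i for i, d in enumerate(digits))
```

The worst case per digit is 9 + 9 plus a carry of 1, which is 19, so a single AAA after the second ADD is enough to bring the digit back into range and hand the carry on in AH.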

The output routine ended up being something of a laugh -- TP7 lets you assign functions as a type, so I have an array of functions assigned for coded sprites... each of the functions takes the character being displayed (0..9) and the video offset to display them at, returning the video offset at which the next character should be shown... this results in a rather unusual approach to doing this:

Code:

function fastNumberBCD(var b:BCDUnpacked; vOffset, colorPair:word):word;
begin
	cPair := colorPair;
	fastNumberBCD := fastNumber[b[0]](
		fastNumber[b[1]](
			fastNumber[b[2]](
				fastNumber[b[3]](
					fastNumber[b[4]](
						fastNumber[b[5]](
							fastNumber[b[6]](
								fastNumber[b[7]](vOffset)
							)
						)
					)
				)
			)
		)
	);
end;

In case you're curious, colorPair is a word containing 4 nybbles -- first two are background:background, second two are foreground:foreground. I'm sending the colors packed that way so as to be able to use AND instead of shifts... and rotates to set up some 'easy' access to the pairs. Take drawing a zero:

Code:

function fastNum0(vOffset:word):word; assembler;
asm
	mov  ax, textSegment
	mov  es, ax
	mov  di, vOffset
	or   di, 1
	mov  dx, cPair { should be bb:FF }
	mov  bx, dx
	ror  bx, 1
	ror  bx, 1
	ror  bx, 1
	ror  bx, 1 { BX should be Fb:bF }
	mov  cx, 157
	
	mov  al, bl
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, cx
	
	stosb
	inc  di
	stosb
	add  di, cx
	
	stosb
	inc  di
	stosb
	add  di, cx
	
	stosb
	inc  di
	stosb
	add  di, cx
	
	mov  al, dl
	stosb
	inc  di
	mov  al, dh
	stosb
	
	mov  ax, di
	sub  ax, 639
end; { fastNum0 }

If we sent that a colorPair of $11FF (white on blue), we'd get:
DH = $11
DL = $FF
BH = $F1
BL = $1F

Which I can then quickly copy to AL to STOSB them out.
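The register values above can be double-checked with a few lines of Python. This is just the arithmetic of a 16-bit rotate; `ror16` and `color_bytes` are names invented for the example.

```python
def ror16(value, count):
    """Rotate a 16-bit value right by count bits, like ROR BX, CL."""
    count %= 16
    return ((value >> count) | (value << (16 - count))) & 0xFFFF


def color_bytes(color_pair):
    """Split colorPair into the four attribute bytes the draw routines use."""
    dx = color_pair & 0xFFFF
    bx = ror16(dx, 4)                     # four ROR BX, 1 in a row
    return {
        "DH": dx >> 8, "DL": dx & 0xFF,   # bb and FF
        "BH": bx >> 8, "BL": bx & 0xFF,   # Fb and bF
    }
```

For `color_bytes(0x11FF)` this gives DH = $11, DL = $FF, BH = $F1 and BL = $1F, matching the list above.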

I'm playing with just hardcoding the various number draws into the output function directly. Still trying to decide if the reduction in playing with setting ES, BX and DX over and over again along with the array lookup and far calls would equal the various TEST and JMP an equivalent "all in one" function would need... If I btree my TEST I could set the maximum number of test/jmp ever run to 4, that wouldn't be too bad...

Something like:

Code:

	test al, $08
	jnz  @test_89
	test al, $04
	jnz  @test_4567
	test al, $02
	jnz  @test_23
	test al, $01
	jz  @char_0
	{ output character 1 }
	jmp  @next
@test_89:
	test al, $01
	jnz  @char_9
	{ output character 8 }
	jmp  @next
@char_9:
	{ output character 9 }
	jmp  @next
@test_4567:
	test al, $02
	jnz  @test_67
	test al, $01
	jnz  @char_5
@char_4:
	{ output character 4 }
	jmp  @next
@char_5:
	{ output character 5 }
	jmp  @next
@test_67:
	test  al, $01
	jnz  @char_7
	{ output character 6 }
	jmp  @next
@char_7:
	{ output character 7 }
	jmp  @next
@test_23:
	test al, $01
	jnz  @char_3
	{ output character 2 }
	jmp @next
@char_3:
	{ output character 3 }
	jmp  @next
@char_0:
	{ output character 0 }
@next:

I dunno, that feels ugly as hell too...

Just thought I'd share what I've been working on -- any suggestions and/or improvements are welcome.

deathshadow · Mar 2, 2014

It just hit me that the largest value a player would score at once is five digits long, which means for bytes 5, 6 and 7 in my add I could drop the lodsb.

Code:

	mov  al, ah
	xor  ah, ah
	add  al, es:[di]
	aaa
	stosb

on each of them... Maybe toss a JNC in there too? I dunno, that might be taking the optimization a bit too far.

deathshadow · Mar 2, 2014

Did some real testing over five seconds (cheap check of $6C in the BDA for rollovers) of my various number routines, doing a +1 per loop, on the Jr.

Longint, string conversion, "normal" string output routine (how paku paku does it)
491 loops in five seconds

BCD, nested calls to array lookup of functions (working code above)
1150 loops in five seconds

BCD, btree TEST/JNZ with characters as part of the output function
1497 loops in five seconds

Realized that TEST is a bit slow though compared to say... a shift and checking JC instead. So...

BCD, btree SHR/JC with characters as part of the output function
1592 loops in five seconds

We have a winner. (unless someone can suggest a faster way to handle these jumps) -- Over three times the speed of what I was doing in Paku Paku for displaying score changes! Moving this into my new textgraph unit as a core part of it.

Code:

procedure tg_numberBCD(var value; vOffset, colorPair:word); assembler;
asm
	push ds
	push bp

	mov  ax, textSegment
	mov  es, ax

	mov  di, vOffset
	or   di, 1
	add  di, 28

	mov  dx, colorPair
	mov  bx, dx
	mov  cl, 4
	ror  bx, cl

	lds  si, value

	mov  cx, 157
	mov  bp, 8


@loop:

	lodsb

	shr  al, 1
	jc   @test_1_3_5_7_9

@test_0_2_4_6_8:
	jz   @char_0

@test_2_4_6_8:
	shr  al, 1
	jc   @test_2_6

@test_4_8:
	shr  al, 1
	jc   @char_4

@char_8:

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, cx

	stosb
	inc  di
	stosb
	add  di, cx

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, cx

	stosb
	inc  di
	stosb
	add  di, cx

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb

	jmp  @done

@char_4:

	mov  al, bh
	stosb
	inc  di
	stosb
	add  di, cx

	stosb
	inc  di
	stosb
	add  di, cx

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, cx

	mov  al, dh
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, cx

	mov  al, dh
	stosb
	inc  di
	mov  al, bh
	stosb

	jmp  @done

@test_2_6:
	shr  al, 1
	jc   @char_6

@char_2:

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, cx

	mov  al, dh
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, cx

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, cx

	stosb
	inc  di
	mov  al, dh
	stosb
	add  di, cx

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb

	jmp  @done

@char_6:

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, cx

	stosb
	inc  di
	mov  al, dh
	stosb
	add  di, cx

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, cx

	stosb
	inc  di
	stosb
	add  di, cx

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb

	jmp  @done

@test_1_3_5_7_9:
	jz   @char_1
	shr  al, 1
	jc   @test_3_7

@test_5_9:
	shr  al, 1
	jc   @char_5

@char_9:

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, cx

	stosb
	inc  di
	stosb
	add  di, cx

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, cx

	mov  al, dh
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, cx

	mov  al, dh
	stosb
	inc  di
	mov  al, bh
	stosb

	jmp  @done

@char_5:

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, cx

	stosb
	inc  di
	mov  al, dh
	stosb
	add  di, cx

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, cx

	mov  al, dh
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, cx

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb

	jmp  @done

@test_3_7:
	shr  al, 1
	jc   @char_7

@char_3:

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, cx

	mov  al, dh
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, cx

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, cx

	mov  al, dh
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, cx

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb

	jmp  @done

@char_7:

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, cx

	mov  al, dh
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, cx

	mov  al, dh
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, cx

	mov  al, dh
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, cx

	mov  al, dh
	stosb
	inc  di
	mov  al, bh
	stosb

	jmp  @done

@char_1:

	mov  al, bl
	stosb
	inc  di
	mov  al, dh
	stosb
	add  di, cx

	mov  al, dl
	stosb
	inc  di
	mov  al, dh
	stosb
	add  di, cx

	mov  al, bl
	stosb
	inc  di
	mov  al, dh
	stosb
	add  di, cx

	mov  al, bl
	stosb
	inc  di
	mov  al, dh
	stosb
	add  di, cx

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb

	jmp  @done

@char_0:
	mov  al, bl
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, cx

	stosb
	inc  di
	stosb
	add  di, cx

	stosb
	inc  di
	stosb
	add  di, cx

	stosb
	inc  di
	stosb
	add  di, cx

	mov  al, dl
	stosb
	inc  di
	mov  al, dh
	stosb

@done:
	sub  di, 647

	dec  bp
	jnz  @loop

	pop  bp
	pop  ds
end;

Might seem silly to use BP for the outer count, but it lets me use CX for that 157, making the routine ~10% faster overall. Of course like an idiot I tried doing my lds si, value AFTER I changed BP in the first version -- yeah, that was gonna work! (shame you can't add dx, bp)
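To see that the SHR/JC/JZ tree in tg_numberBCD lands on the right glyph for every digit, the branches can be followed in a short Python model. It is only a sketch: `classify` is a made-up name, and it counts shifts rather than clock cycles.

```python
def classify(digit):
    """Follow the SHR/JC/JZ branches; return (glyph, number_of_shifts)."""
    al = digit

    def shr():
        """SHR AL, 1: return the bit shifted out (the carry flag)."""
        nonlocal al
        carry = al & 1
        al >>= 1
        return carry

    shifts = 1
    if shr():                              # jc @test_1_3_5_7_9
        if al == 0:                        # jz @char_1
            return 1, shifts
        shifts += 1
        if shr():                          # jc @test_3_7
            shifts += 1
            return (7 if shr() else 3), shifts
        shifts += 1                        # @test_5_9
        return (5 if shr() else 9), shifts
    if al == 0:                            # jz @char_0
        return 0, shifts
    shifts += 1
    if shr():                              # jc @test_2_6
        shifts += 1
        return (6 if shr() else 2), shifts
    shifts += 1                            # @test_4_8
    return (4 if shr() else 8), shifts
```

In this model 0 and 1 leave the tree after a single shift, and every other digit after three.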

Krille · Mar 3, 2014

deathshadow said:
If you're wondering why I'm using BX for zero, mem:imm16 are usually two or three bytes larger than mem:reg (depending on the operation), so that's ~6 bytes saved on average, inside the loop netting more loops.

Good thinking but wouldn't it be even better to use the 3 free registers (SI, DI and BP) inside the loop and update the variables after?

deathshadow said:
We have a winner. (unless someone can suggest a faster way to handle these jumps)

I believe this should be faster;

Code:

procedure tg_numberBCD(var value; vOffset, colorPair:word); assembler;
asm
	push ds
	push bp

	mov  ax, textSegment
	mov  es, ax

	mov  di, vOffset
	or   di, 1
	add  di, 28

	mov  dx, colorPair
	mov  bx, dx
	mov  cl, 4
	ror  bx, cl

	lds  si, value

	mov  cx, 157
	mov  bp, 8


@loop:

	lodsb

	cbw		; Clear AH
	shl  al, 1
	add  ax, @table	; Add offset to table
	xchg bx, ax	; Swap since we cannot call [ax]
	call cs:[bx]	

	sub  di, 647

	dec  bp
	jnz  @loop

	pop  bp
	pop  ds

	ret


@table:
	dw  @char_0	; Offsets to each function
	dw  @char_1
	dw  @char_2
	dw  @char_3
	dw  @char_4
	dw  @char_5
	dw  @char_6
	dw  @char_7
	dw  @char_8
	dw  @char_9

@char_0:
	xchg bx, ax
	; Do actual work here
	ret

@char_1:
	xchg bx, ax
	; Do actual work here
	ret

	; Etc...
	
end;

(shame you can't add dx, bp)

I'm not sure what you mean here?

deathshadow · Mar 3, 2014

Krille said:
Good thinking but wouldn't it be even better to use the 3 free registers (SI, DI and BP) inside the loop and update the variables after?

You'd think so, it's effectively the same code size (only 4 bytes smaller), and that section isn't speed essential since you're waiting around with your thumb up yer ass waiting for the capacitor to charge. Still, it's a good thought, I might switch to that just because hey... 3 bytes is three bytes.

Krille said:
I believe this should be faster;

I'm going to add that to my test suite -- it's close to what I had originally but it was for some reason painfully slow; but I should compare side-by-side to be sure. I think in TP I'll need to use the OFFSET keyword, and I'm not 100% sure it'll even let me do it the way you have it there. (limitations of the inline compiler).

I'd probably also try to jmp instead of call, and put a jnz @loop / jmp @done at the end of each item instead of trying to ret... no reason to get the stack involved even if the result is a few more bytes of code.

I'm actually playing with reversing es:di and ds:si's roles, so I can
mov [si], bh
mov [si+2], dl
mov [si+160], bh
etc, etc, etc...

so the math at the end is only sub si, 4

It sounds weird, but it's almost identical in execution time to mov al, value; stosb; inc di; since using SI and setting up DS I don't need a segment override -- the long execution time (18 clocks, -4 memory access) filling the BIU with the next MOV. This frees up CX -- which with your smaller loop I could actually use loop...
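The constants in the two drawing styles (157, 647, 639, and the +2 / +160 displacements) all fall out of the 160-byte text row. Here is a quick Python check of that arithmetic, assuming a 2-cell by 5-row glyph written to attribute bytes only; `stosb_walk` and `indexed_offsets` are names made up for the sketch.

```python
ROW_BYTES = 160      # 80 columns x (character byte + attribute byte)


def stosb_walk(rows=5):
    """Track DI through the STOSB / INC DI / STOSB / ADD DI, 157 pattern."""
    di = 0
    touched = []
    for row in range(rows):
        touched.append(di)
        di += 1                  # stosb
        di += 1                  # inc di, skipping the character byte
        touched.append(di)
        di += 1                  # stosb
        if row < rows - 1:
            di += 157            # add di, cx -> same column, next row
    return di, touched


def indexed_offsets(rows=5):
    """Displacements for the mov [si+disp] version of the same glyph."""
    return [row * ROW_BYTES + col * 2 for row in range(rows) for col in range(2)]
```

`stosb_walk()` ends with DI at +643, so `sub di, 647` steps four bytes (one glyph) to the left and `sub ax, 639` steps one glyph to the right, while the indexed version never moves SI and only needs the `sub si, 4`.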

Thanks! That helps.

-- edit -- P.S. Thanks for reminding me about CBW, I completely spaced the existence of that! (doh!)

deathshadow · Mar 3, 2014

NUMTEST.PAS(254): Error 155: Invalid combination of opcode and operands
call cs:[bx]

So much for that...

I tried changing to this:

Code:

@loop:

	mov   bl, es:[di]
	inc   di
	xor   bh, bh
	shl   bx, 1
	add   bx, offset @table
	mov   bx, [bx]
	call  bx
	sub   si, 4
	
	loop  @loop
	
	pop ds
	ret
	
@table:
	dw  @char_0
	dw  @char_1
	dw  @char_2
	dw  @char_3
	dw  @char_4
	dw  @char_5
	dw  @char_6
	dw  @char_7
	dw  @char_8
	dw  @char_9

But that's just crash crash... crash crash crash... complete and total system lockup. I HAVE to say 'offset' on @table or it won't compile... I thought adding 'offset' to the dw would help, no difference either way. I THINK the problem is that DS isn't pointed at... the data, it's pointed at screen. To implement this I'd have to swap ds and es around so blasted much, it's not worth the overhead. There's just not enough registers to do it efficiently this way with DS and ES already in use pointing at two locations, neither of which is the actual program ds...

Though since I'm NOT using the heap at ALL, I could make this routine only need the offset of the variable instead of passing it as var... Eh, this would probably work better with the mov al, stosb, inc di version.

Oh, duh... not only is DS not pointed right, we can't call, BP is in use too! (no stack!)

deathshadow · Mar 3, 2014

Even with DS and BP out of the picture, trying to do a call table heads off to never never land. I think TP7's inline compiler is plugging garbage into that table. I'd drop to a real assembler, but given how TASM 5's interfacing to TP7 is completely buggered here... (Noticed they don't even have the chapter on interfacing to TP in the manual for it?!? Same year 7 was introduced and MENTIONS a chapter about it?!?)

Seems like none of my old externals will compile and link right anymore. Anybody got a copy of TASM 3 DOS? I can find 2 (which sucks) and 5 (which doesn't seem to work)...

I can't even get a simple test "increment a var" program working in TASM 5...

deathshadow · Mar 3, 2014

nevermind on the TASM stuff -- like an idiot I forgot to declare all my externals as 'far'.

Though, can anyone explain what I'm doing wrong in this?

Code:

DATA SEGMENT WORD PUBLIC
	ASSUME ds:DATA
	EXTRN  textSegment
DATA ENDS

CODE SEGMENT
	ASSUME cs:CODE,ds:DATA
	
; procedure fastFont(valueOffset, vidOffset, colorPair:word);
fastFont PROC FAR

	PUBLIC fastFont
	ARG colorPair : WORD, vidOfs : WORD, valueOfs : WORD = retBytes
	
	push bp
	mov  bp, sp
	
	mov  ax, textSegment
	mov  es, ax
	
	mov  dx, colorPair
	mov  bx, dx
	mov  cl, 4
	ror  bx, cl
	
	mov  di, vidOfs
	or   di, 1
	add  di, 28
	
	mov  si, valueOfs
	
	mov  cx, 8
	
charLoop:
	
	lodsb
	aaa ; force result to 0..9 during testing since values are garbage
	cbw
	shl  ax, 1
	add  ax, OFFSET charTable

	xchg bx, ax
	mov  bx, [bx]
	xchg bx, ax
	call ax

	sub  di, 647
	
	loop charLoop
	
	pop  bp
	ret  retBytes
	
char_1:

	mov  al, bl
	stosb
	inc  di
	mov  al, dh
	stosb
	add  di, 157

	mov  al, dl
	stosb
	inc  di
	mov  al, dh
	stosb
	add  di, 157

	mov  al, bl
	stosb
	inc  di
	mov  al, dh
	stosb
	add  di, 157

	mov  al, bl
	stosb
	inc  di
	mov  al, dh
	stosb
	add  di, 157

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb

	ret 0
	
char_2:

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, 157

	mov  al, dh
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, 157

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, 157

	stosb
	inc  di
	mov  al, dh
	stosb
	add  di, 157

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb

	ret 0

char_3:

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, 157

	mov  al, dh
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, 157

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, 157

	mov  al, dh
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, 157

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb

	ret 0

char_4:

	mov  al, bh
	stosb
	inc  di
	stosb
	add  di, 157

	stosb
	inc  di
	stosb
	add  di, 157

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, 157

	mov  al, dh
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, 157

	mov  al, dh
	stosb
	inc  di
	mov  al, bh
	stosb

	ret 0

char_5:

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, 157

	stosb
	inc  di
	mov  al, dh
	stosb
	add  di, 157

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, 157

	mov  al, dh
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, 157

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb

	ret 0

char_6:

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, 157

	stosb
	inc  di
	mov  al, dh
	stosb
	add  di, 157

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, 157

	stosb
	inc  di
	stosb
	add  di, 157

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb

	ret 0

char_7:

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, 157

	mov  al, dh
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, 157

	mov  al, dh
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, 157

	mov  al, dh
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, 157

	mov  al, dh
	stosb
	inc  di
	mov  al, bh
	stosb

	ret 0

char_8:

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, 157

	stosb
	inc  di
	stosb
	add  di, 157

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, 157

	stosb
	inc  di
	stosb
	add  di, 157

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb
 
	ret 0

char_9:

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, 157

	stosb
	inc  di
	stosb
	add  di, 157

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, 157

	mov  al, dh
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, 157

	mov  al, dh
	stosb
	inc  di
	mov  al, bh
	stosb

	ret 0
	
char_0:

	mov  al, bl
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, 157

	stosb
	inc  di
	stosb
	add  di, 157

	stosb
	inc  di
	stosb
	add  di, 157

	stosb
	inc  di
	stosb
	add  di, 157

	mov  al, dl
	stosb
	inc  di
	mov  al, dh
	stosb

	ret 0
	
charTable:
	dw  offset char_0
	dw  offset char_1
	dw  offset char_2
	dw  offset char_3
	dw  offset char_4
	dw  offset char_5
	dw  offset char_6
	dw  offset char_7
	dw  offset char_8
	dw  offset char_9
	
fastFont ENDP

CODE ENDS

END

When it gets to the section that does the xchg ax, bx, mov bx, [bx], xchg ax, bx it seems that BX, DX, DI and SI all get banjaxed, and I can't figure out why. EVENTUALLY TP7 bombs out with a runtime error 202, so it's screwing up the stack as well?!?

Krille · Mar 4, 2014

Sorry about that, I'm used to NASM where the keyword OFFSET isn't needed. I guess you'll have to use OFFSET both when addressing the table and in the actual table.

About the CS segment override; would this work in TP7?

Code:

	cs:
	call [bx]

If not, try this;

Code:

	db  $2E
	call [bx]

deathshadow · Mar 4, 2014

I think you're not used to 8088 either... there is no [bx], [bp] or any other [reg16] for call on the 8088/8086. you MUST store it in the register FIRST before you can call it. "call mem16" is 286+ only.

mov bx, [bx]
call bx

hence why my version is doing:

xchg bx, ax
mov bx, [bx]
xchg bx, ax
call ax

in an attempt to preserve BX

A segment override first doesn't seem to help... though usually calls are CS based anyways so...

I'm just trying to figure out why as TASM code it's corrupting DI, SI, BX and DX... when it DOES seem to be calling the routines properly (finally), just by the time it gets there those regs are fubar.

I'm starting to think this technique is just too advanced for an 8088/8086 to actually do. Even controlled tests not as fancy are failing miserably, as after "call ax" every GP, Base and Index register except CX and SP are totally banjaxed.

Krille · Mar 4, 2014

That variant of the call instruction does exist on the 808x. The problem is likely that the OFFSETs are relative to whatever TP thinks DS points to and since CS is probably not the same, the offsets will be wrong. My version of the code will work if you can just cajole TP into using the correct offsets.

Your version should also work if you change "mov bx, [bx]" into "mov bx, cs:[bx]". The corruption of registers is because some random code is being called.
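A toy Python model of real-mode addressing shows why the same table offset gives a sane jump target through CS and junk through DS. Everything concrete here is invented for illustration: the segment values, the table offset and the memory contents are assumptions, not values from the actual program.

```python
def phys(segment, offset):
    """8086 real-mode address: segment * 16 + offset, wrapped to 20 bits."""
    return ((segment << 4) + (offset & 0xFFFF)) & 0xFFFFF


def read_word(memory, segment, offset):
    """Little-endian word read from a sparse {physical address: byte} dict."""
    lo = memory.get(phys(segment, offset), 0)
    hi = memory.get(phys(segment, offset + 1), 0)
    return lo | (hi << 8)


CS, DS = 0x2000, 0x3000          # hypothetical code and data segments
TABLE = 0x0100                   # hypothetical offset of the jump table

memory = {
    phys(CS, TABLE): 0x42, phys(CS, TABLE + 1): 0x00,   # dw @char_0 (= $0042)
    phys(DS, TABLE): 0xEF, phys(DS, TABLE + 1): 0xBE,   # unrelated program data
}

via_ds = read_word(memory, DS, TABLE)    # what mov bx, [bx] fetches
via_cs = read_word(memory, CS, TABLE)    # what mov bx, cs:[bx] fetches
```

Here `via_cs` is $0042, the intended routine, while `via_ds` is $BEEF, whatever happened to sit at that offset in the data segment.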

deathshadow · Mar 4, 2014

Krille said:
The corruption of registers is because some random code is being called.

That's what's weird, the correct code is in fact being called as it IS blitting to ES, and DI is maintained, it's just the WRONG starting value in DI and BX/DX contain the wrong values. I traced and the calls ARE going to the right place, but when it gets there DI, SI, BX and DX are gibberish! Somehow when that call runs, it screws up half the register values. (both on real hardware and in dosbox)

... and no, that variant does NOT exist on the 808x, as that's a mem16. Early x86 only has reg16 (value in register), rel16 (basically an imm16), ptr16:16 (also imm) or m16:16, which is oddball as it's basically memory as the segment and the offset as immediate... that's IT for choices. Mem16 only exists 286/newer last I knew...

deathshadow · Mar 4, 2014

FINALLY got it working in the inline assembler (TASM is still messed up, even with the same code?)

Code:

procedure callNumberBCD(valueOffset, vidOffset, colorPair:word); assembler;
asm	
	mov  ax, textSegment
	mov  es, ax
	
	mov  dx, colorPair
	mov  bx, dx
	mov  cl, 4
	ror  bx, cl
	
	mov  di, vidOffset
	or   di, 1
	add  di, 28
	
	mov  si, valueOffset
	
	mov  cx, 8
	mov  bp, 157
	
@charLoop:
	
	lodsb
	cbw
	shl  ax, 1
	add  ax, OFFSET @charTable

	xchg bx, ax
	mov  bx, cs:[bx]
	xchg bx, ax
	call ax 

	sub  di, 647
	
	loop @charLoop
	
	jmp  @done
	
@char_0:

	mov  al, bl
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, bp

	stosb
	inc  di
	stosb
	add  di, bp

	stosb
	inc  di
	stosb
	add  di, bp

	stosb
	inc  di
	stosb
	add  di, bp

	mov  al, dl
	stosb
	inc  di
	mov  al, dh
	stosb

	ret
	
@char_1:

	mov  al, bl
	stosb
	inc  di
	mov  al, dh
	stosb
	add  di, bp

	mov  al, dl
	stosb
	inc  di
	mov  al, dh
	stosb
	add  di, bp

	mov  al, bl
	stosb
	inc  di
	mov  al, dh
	stosb
	add  di, bp

	mov  al, bl
	stosb
	inc  di
	mov  al, dh
	stosb
	add  di, bp

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb

	ret
	
@char_2:

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, bp

	mov  al, dh
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, bp

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, bp

	stosb
	inc  di
	mov  al, dh
	stosb
	add  di, bp

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb

	ret

@char_3:

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, bp

	mov  al, dh
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, bp

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, bp

	mov  al, dh
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, bp

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb

	ret

@char_4:

	mov  al, bh
	stosb
	inc  di
	stosb
	add  di, bp

	stosb
	inc  di
	stosb
	add  di, bp

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, bp

	mov  al, dh
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, bp

	mov  al, dh
	stosb
	inc  di
	mov  al, bh
	stosb

	ret

@char_5:

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, bp

	stosb
	inc  di
	mov  al, dh
	stosb
	add  di, bp

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, bp

	mov  al, dh
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, bp

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb

	ret

@char_6:

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, bp

	stosb
	inc  di
	mov  al, dh
	stosb
	add  di, bp

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, bp

	stosb
	inc  di
	stosb
	add  di, bp

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb

	ret

@char_7:

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, bp

	mov  al, dh
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, bp

	mov  al, dh
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, bp

	mov  al, dh
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, bp

	mov  al, dh
	stosb
	inc  di
	mov  al, bh
	stosb

	ret

@char_8:

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, bp

	stosb
	inc  di
	stosb
	add  di, bp

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, bp

	stosb
	inc  di
	stosb
	add  di, bp

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb
 
	ret

@char_9:

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, bp

	stosb
	inc  di
	stosb
	add  di, bp

	mov  al, dl
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, bp

	mov  al, dh
	stosb
	inc  di
	mov  al, bh
	stosb
	add  di, bp

	mov  al, dh
	stosb
	inc  di
	mov  al, bh
	stosb

	ret
	
	
@charTable:
	DW  OFFSET @char_0
	DW  OFFSET @char_1
	DW  OFFSET @char_2
	DW  OFFSET @char_3
	DW  OFFSET @char_4
	DW  OFFSET @char_5
	DW  OFFSET @char_6
	DW  OFFSET @char_7
	DW  OFFSET @char_8
	DW  OFFSET @char_9

@done:
	
end;

Unfortunately on real hardware it's 10% slower than the shift and jc approach. I'm actually not too surprised by that -- using call/ret introduces overhead comparable to one jump branching or two drop-throughs, while the extra code of adding an immediate, pulling a memory lookup (that's the real pain there), ends up almost as much execution time as the 3-depth test all branching... and since, well... the average number of tests using my method is TWO since I'm basically splitting it as binary...

That's the fun of binary splitting, I could compare up to 0..255 and my maximum execution depth would be eight. For 0..9 the max depth is 4, and only two of the values would ever go that deep. The ~20 cycle average per test ends up comparable to doing a memory lookup. Remember, lookups are often NOT faster if setting up for them or doing something useful with the result takes longer.
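The worst-case depth figures are just a base-2 logarithm rounded up. A tiny Python check, assuming every test splits the remaining candidates roughly in half (`max_depth` is a name invented for the example):

```python
def max_depth(n_values):
    """Worst-case number of two-way tests needed to tell n_values cases apart."""
    return (n_values - 1).bit_length()


depth_byte = max_depth(256)      # every value of 0..255
depth_digit = max_depth(10)      # the digits 0..9
```

That gives 8 for 0..255 and 4 for 0..9, the two maximums quoted above.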

I am gonna make a test-case using mov [si+disp] approach with the table lookup/call method -- that might be easier as I can use AL instead of BL in the routines, and use mov bl, es:[di] to pull the value.

Right now the test case is kinda funny as on the Jr. and an 8088-equipped 1K HX, mov [si+disp] is ~1% faster than the lodsb/inc di approach, while on the V20-equipped 1K SX it's the other way around.

... but since both routines are more than four times faster than what I was using originally, it's all good. I'll probably go with the mov [si+disp] approach anyways just because it's a lot easier to understand the code.

-- edit -- the TP7 manual says you 'have' to preserve BP, but that's not ENTIRELY correct. You only need BP to access values passed on the stack. Once you have those values you can manipulate BP until blue in the face as the function wrapper automatically adds push bp, mov bp, sp to the beginning and pop bp at the end ANYWAYS!

reenigne · Mar 4, 2014

deathshadow said:
... and no, that variant does NOT exist on the 808x, as that's a mem16. Early x86 only has reg16 (value in register), rel16 (basically an imm16), ptr16:16 (also imm) or m16:16, which is oddball as it's basically memory as the segment and the offset as immediate... that's IT for choices. Mem16 only exists 286/newer last I knew...

I'm with Krille on this one. I just tried assembling the following with yasm and running it on my XT Server and it works exactly like I'd expect:

org 0
cpu 8086

mov ax,cs
mov ds,ax
mov bx,foo
call [bx]
mov ax,0x1234
int 0x63
int 0x67
bar:
mov ax,0x5678
int 0x63
ret

foo: dw bar

The output is 56781234.

Mike Chambers · Mar 4, 2014

Yep, it exists. If I remember correctly from writing my emu, it's GRP 5 (FFh) then a 2 in the register field in the addr mode byte.
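As a sanity check on that encoding, here is a small C sketch that builds the two bytes for a near indirect CALL through memory with no displacement. The helper names are made up for illustration; the field layout (mod in bits 7-6, reg in bits 5-3, r/m in bits 2-0) is the standard 8086 addressing-mode byte.

```c
#include <stdint.h>

/* Pack the three fields of an 8086 addressing-mode byte. */
static uint8_t modrm(uint8_t mod, uint8_t reg, uint8_t rm)
{
    return (uint8_t)(((mod & 3u) << 6) | ((reg & 7u) << 3) | (rm & 7u));
}

/* Encode CALL through a 16-bit memory operand with no displacement.
   rm picks the address register: 4 = [si], 5 = [di], 7 = [bx].
   (rm = 6 with mod = 0 means a direct address and needs a disp16,
   which this sketch does not handle.) */
static void encode_call_mem16(uint8_t rm, uint8_t out[2])
{
    out[0] = 0xFF;             /* GRP 5 opcode                      */
    out[1] = modrm(0, 2, rm);  /* mod=00, reg=2 selects near CALL   */
}
```

With rm = 7 this produces FF 17, the encoding for the call [bx] in the test program above.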

deathshadow · Mar 5, 2014

I stand corrected, it compiles in BASM... I think my copy of TASM 5 is screwed up or something, it's giving me bizarro-land errors and incompatible code that does all sorts of weirdness.

I knew there was a reason I usually stick to the safety of BASM (the TP7 inline compiler) -- even if I do miss having macros available.

IF one were looking for an external assembler for TP7, what would you guys think is the best choice? I'm half tempted to grab MASM because TASM is pissing me off.

Trixter · Mar 5, 2014

deathshadow said:
I can't use the PIT for it, as both channels 0 and 2 are in use for other things like actual game timing and audio control... and one of these upcoming programs is going to be multiplayer supporting two joysticks, so I've had to come at this from a whole different direction.

You can read the 8253 without reprogramming it and the values you read are CLK/12, which are the same no matter what the divisor is set to. I can post my entire joystick routine that gets values from reading the timer if you'd like to see it. It isn't (too) sensitive to interrupt jitter because it is grabbing values from a constant timer.

I could then use a single loop to read all four values.

This is acceptable for digital up/down/left/right values, but not good for analog work because a single routine that figures out all four values results in coarser values.

When I wrote my joystick code, I was optimizing for the most granular analog values possible, so my code is called once per stick+axis you want to read (I test only one bit at a time). For X and then Y, two calls are necessary -- but the results when using a loop-based method have a range of approximately 0-140, so I decided the tradeoff was worth it. You can certainly alter the code to test all four bits.

32 bit math is slow on the 8088

For scores? I don't know what you're doing, but adding a 16-bit value to a 32-bit value is this:

Code:

add ax,val
adc dx,0

Maybe I'm misunderstanding where you're trying to optimize. I don't know if you're trying to unlock an optimization achievement or something, but "updating the score" is typically not something you need to spend a lot of time on...?
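For readers less used to carry chains, the add/adc pair can be modelled in C like this. It is only a sketch: the Score32 struct and function name are invented here, with lo and hi standing in for AX and DX.

```c
#include <stdint.h>

/* Model of a 32-bit value held as two 16-bit halves (DX:AX). */
typedef struct {
    uint16_t lo;   /* plays the role of AX */
    uint16_t hi;   /* plays the role of DX */
} Score32;

/* Add a 16-bit value to the 32-bit score. */
static void score_add16(Score32 *s, uint16_t val)
{
    uint32_t sum = (uint32_t)s->lo + val;      /* add ax,val           */
    s->lo = (uint16_t)sum;
    s->hi = (uint16_t)(s->hi + (sum >> 16));   /* adc dx,0: add carry  */
}
```

Adding 1 to a score of 0001:FFFF wraps the low half to 0 and bumps the high half to 2, which is all the second instruction is there for.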

deathshadow · Mar 5, 2014

Trixter said:
When I wrote my joystick code, I was optimizing for the most granular analog values possible, so my code is called once per stick+axis you want to read (I test only one bit at a time).

I get 0..20 now for range on a 128k jr, so I think that's more than sufficient from an analog perspective; usually faster routines have +-6 jitter making that 0..140 you mentioned nearly the same thing.

Trixter said:
Maybe I'm misunderstanding where you're trying to optimize.

You are, the problem isn't with the longint addition, it's with turning that longint into a string to then turn into rendered text on the screen. Even the 'faster' method of DIV 1000 then AAD AX and DX is agonizingly slow. The 'slower' BCD math of 8 bytes is faster because the result is eight BCD bytes I can then use to either index a table of routines or shift/jc.

Trixter said:
but "updating the score" is typically not something you need to spend a lot of time on...?

Yeah, you missed it. The problem isn't the addition, it's the string conversion.

I was profiling the code, and when it runs it's sucking >20% of each frame time; which is part of why the speed visibly changes on a 128k Jr. when you're eating pellets vs. when you're not... and it's all because turning a longint into a string SUCKS. It's more efficient to just store it as BCD in the first place; the overhead of BCD addition being far, far lower than the process of turning longint into a string.
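To show why keeping the score as BCD sidesteps the conversion, here is a minimal C model of an eight-digit add. It assumes one decimal digit per byte stored least significant digit first; that layout, the names, and the loop (the real thing would be unrolled assembly) are all assumptions for the sake of the sketch.

```c
#include <stdint.h>

#define SCORE_DIGITS 8

/* Add `points` into `score`. Both hold one decimal digit (0..9) per
   byte, least significant digit first (an assumed layout).
   Returns the carry out of the top digit. */
static int bcd_add(uint8_t score[SCORE_DIGITS],
                   const uint8_t points[SCORE_DIGITS])
{
    int carry = 0;
    for (int i = 0; i < SCORE_DIGITS; i++) {
        int sum = score[i] + points[i] + carry;
        carry = sum > 9;                       /* decimal carry */
        score[i] = (uint8_t)(carry ? sum - 10 : sum);
    }
    return carry;
}

/* Turn the digits into text, most significant first. No division is
   needed: each digit only has '0' added to it. */
static void bcd_to_text(const uint8_t score[SCORE_DIGITS],
                        char out[SCORE_DIGITS + 1])
{
    for (int i = 0; i < SCORE_DIGITS; i++)
        out[i] = (char)('0' + score[SCORE_DIGITS - 1 - i]);
    out[SCORE_DIGITS] = '\0';
}
```

The point is that the display step is one add per digit with no divides at all, so the cost moves into the (cheap) addition.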

This is the best I've been able to come up with for doing longint to string:

Code:

function longToSt8(n:longint):st8; assembler;
asm
	les  di, @result
	mov  ax, $0008
	mov  es:[di],al
	add  di, ax
	mov  dx, word ptr n+2
	mov  ax, word ptr n
	mov  bx, 10000
	div  bx { ax=high 0..9999, dx=low 0..9999 }
	mov  si, ax
	mov  ax, dx
	mov  cx, 4
	mov  bx, 10
	std
	
@loop1:
	xor  dx, dx
	div  bx
	xchg al, dl
	or   al, $30
	stosb
	mov  al, dl
	loop @loop1
	
	mov  ax, si
	mov  cx, 4
	
@loop2:
	xor  dx, dx
	div  bx 
	xchg al, dl
	or   al, $30
	stosb
	mov  al, dl
	loop @loop2

	cld
end;

Two or three times faster than TP's "STR" function... and it's still PAINFULLY BAD. An unrolled BCD add is WAY faster. I originally thought leveraging AAA might be the answer, but that got so complex it was in fact many times worse than sucking it up and just using DIV.
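For anyone who finds the register shuffling hard to follow, the structure of that routine can be sketched in C: one divide by 10000 splits the value into two halves of 0..9999 each, then each half takes four divides by 10. The names are made up, and it writes a NUL-terminated C string rather than a Pascal length-prefixed one. It also assumes the input fits in eight digits, the same limit the 16-bit quotient of the first DIV imposes.

```c
#include <stdint.h>

/* Write the four decimal digits of v (0..9999) into dst[0..3],
   filling from the last position backwards like the STD/STOSB loops. */
static void four_digits(uint16_t v, char dst[4])
{
    for (int i = 3; i >= 0; i--) {
        dst[i] = (char)('0' + v % 10);  /* remainder becomes the character */
        v /= 10;                        /* quotient feeds the next pass    */
    }
}

/* Convert n (assumed 0..99999999) to eight zero-padded digits. */
static void long_to_text8(uint32_t n, char out[9])
{
    uint16_t high = (uint16_t)(n / 10000);  /* the single 32-by-16 divide */
    uint16_t low  = (uint16_t)(n % 10000);
    four_digits(low,  out + 4);
    four_digits(high, out);
    out[8] = '\0';
}
```

That is nine divides in total for one score, which is where the time goes.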

Makes sense it would be more efficient this way, BCD is the most efficient way to do long values you're going to want to display quickly on 8-bit targets like the 6502... 8088 might as well be a (crippled) 8 bit processor on things like this.

Trixter · Mar 5, 2014

deathshadow said:
I get 0..20 now for range on a 128k jr, so I think that's more than sufficient from an analog perspective; usually faster routines have +-6 jitter making that 0..140 you mentioned nearly the same thing.

My loop-based method has no jitter; the values are indeed in that range and stable. But you don't need that kind of accuracy for up/down/left/right, and you're not interested in seeing my code anyway, so I'll drop the issue.

Yeah, you missed it. The problem isn't the addition, it's the string conversion.

How fast does the string conversion need to be? Are you updating the score at 60Hz? What percentage of time is used by the string conversion in comparison to all of the other code necessary to update numbers onscreen in graphics mode?

I was profiling the code, and when it runs it's sucking >20% of each frame time ... and it's all because turning a longint into a string SUCKS.

A single longint to string conversion is not taking up 20% of available CPU. I think updating numbers onscreen in graphics mode is taking up 20% of the CPU. Optimize your graphics routines or your expectations, not the string conversion.

If you don't care and want to speed up your routine anyway, then replace your DIV with a series of shifts.

deathshadow · Mar 5, 2014

Trixter said:
My loop-based method has no jitter; the values are indeed in that range and stable. But you don't need that kind of accuracy for up/down/left/right, and you're not interested in seeing my code anyway

I'd be interested in seeing it, I just can't waste the time to read each axis individually. Takes too long.

Trixter said:
What percentage of time is used by the string conversion in comparison to all of the other code necessary to update numbers onscreen in graphics mode?

Using longint? Around 60% of the time -- though that's before changing to the method of blitting discussed so far in this thread (with the whole call vs shift/jc thing)... Now that I have the blitting optimized, it jumps to 85% or so of the time spent updating the score. It's around ~300 clocks per DIGIT using the code I posted (longToSt8), so upwards of 2400 clocks just to convert it? In an area where on a 128k Jr I need it to take <500 clocks?

Trixter said:
A single longint to string conversion is not taking up 20% of available CPU.

It is over the narrow timeslice which is allocated to it -- I'd have to do the conversion and the update in separate timeslices, and that would add overhead; not a great solution.

I should have said per timeslice though -- there's 6 to 8 timeslices per frame.

Trixter said:
I think updating numbers onscreen in graphics mode is taking up 20% of the CPU. Optimize your graphics routines or your expectations, not the string conversion.

Not anymore since I have the new blitting routines, but as I just said, the number conversion took longer than the blitting did on the old routines because divide by 10 or its equivalent are slow as molasses. Great choices I have for doing this: eight 60 cycle a pop BCD functions with tons of register flipping per word (result size), or four divide by ten per 4 string bytes.

Trixter said:
If you don't care and want to speed up your routine anyway, then replace your DIV with a series of shifts.

Never quite grasped how to do a divide by 10 with shifts... at least not in anything resembling an efficient manner... I mean you do four of them, on the fourth one you condition and carry, then have to reverse DAA -- the net result again being SLOWER than just doing MOV BX, 1000, div BX, xchg bl, al, div bl, div bl, div bl

Even with div being 'slow' it's one opcode vs. christmas knows how many. Even the best software div by ten using shifts is what, 9 shifts, double that for dword, a nightmare once carry is involved... and doesn't return a remainder without a subtractive loop -- at which point you might as well just use a subtractive loop.
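For reference, one known shift-and-add formulation of divide by ten (the version popularised by Hacker's Delight) looks like this in C. It is a sketch of the general technique only, not anyone's routine from this thread, and it makes no claim about being faster than DIV on an 8088, where every 32-bit shift turns into several instructions. The function name is made up.

```c
#include <stdint.h>

/* Divide n by ten using shifts, adds and a single correction step.
   Returns the quotient and stores the remainder in *rem. */
static uint32_t div10_shift(uint32_t n, uint32_t *rem)
{
    uint32_t q = (n >> 1) + (n >> 2);  /* n * 0.75                      */
    q += q >> 4;                       /* refine toward n * 0.8 ...     */
    q += q >> 8;
    q += q >> 16;                      /* ... slightly under n * 0.8    */
    q >>= 3;                           /* about n / 10, maybe one short */

    uint32_t r = n - ((q << 3) + (q << 1));  /* n - q*10, shifts only   */
    if (r > 9) {                       /* fix up the off-by-one case    */
        q++;
        r -= 10;
    }
    *rem = r;
    return q;
}
```

Note the remainder falls out of one multiply-by-ten (two shifts and an add) plus a subtract, so no subtractive loop is needed for it.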

I wrote a testcase to compare BCD vs. Longint using the old blitting routine vs. my new blitting routine; over five seconds flat out the routines managed this many intervals on a 128k Jr:

Longint + old text blitter = 452
Longint + new text blitter = 844
BCD + old text blitter = 1455
BCD + new text blitter = 2492

Writing the graphics was only HALF the problem.
