deathshadow
I've been working on my game library for a few things, the biggest being memory footprint and speed, and I've found two major bottlenecks...
I didn't realize just how ridiculously USELESS the joystick's range of sensitivity was on the 128k Jr. My existing routine was returning a center value of 16 -- and running that twice (once for each axis) sucks down an ungodly 30% of each frame's cycle budget!!! No wonder having interrupts enabled caused jitter!
I can't use the PIT for it, as both channels 0 and 2 are in use for other things like actual game timing and audio control... and one of these upcoming programs is going to be multiplayer supporting two joysticks, so I've had to come at this from a whole different direction.
First thing I needed to do was come up with a test for the joystick values that would take the same amount of time whether true or false, so I could then use a single loop to read all four values. It would need to preserve the original value, and have a way to mask off joysticks that aren't connected.
Masking is easy: we figure out which sticks we want to read, and simply AND their bits against the value we have in AL from port $201. Since joystick 0's X axis is bit 0, we can ROR AL, 1 to drop it into the carry flag and then ADC to the memory location storing the X coord value. Lather, rinse, repeat for each of the other axes. Since I use ROR for this the masked bits are still in AL afterwards (just rotated into the upper nibble), so we can OR AL,AL and LOOPNZ... we need the LOOP so there's a timeout, since an unconnected unmasked stick would otherwise loop indefinitely.
Figuring out which axes are connected isn't too hard -- we just set our mask to $0F and run our routine. If an axis equals our timeout (CX) starting value, it bombed.
So that gives me this TP7 unit.
Code:
unit joystick;

interface

var
  stick0x, stick0y, stick1x, stick1y:word;
  stickMask:byte;

procedure stickUpdate;
function button0a:boolean;
function button0b:boolean;
function button1a:boolean;
function button1b:boolean;

implementation

const
  stickLimit = $8000;

procedure stickUpdate; assembler;
asm
  xor  al, al
  mov  ah, stickMask     { axes we want to read }
  xor  bx, bx            { BX = 0, so ADC mem, BX adds only the carry }
  mov  cx, stickLimit    { timeout count }
  mov  dx, $201          { game port }
  mov  stick0x, bx
  mov  stick0y, bx
  mov  stick1x, bx
  mov  stick1y, bx
  cli
  out  dx, al            { writing to the port starts the axis timers }
@loop:
  in   al, dx
  and  al, ah            { drop masked-off axes (and the button bits) }
  ror  al, 1             { bit 0 -> CF }
  adc  stick0x, bx
  ror  al, 1
  adc  stick0y, bx
  ror  al, 1
  adc  stick1x, bx
  ror  al, 1
  adc  stick1y, bx
  or   al, al            { ZF set once every unmasked axis reads 0 }
  loopnz @loop           { also falls through when CX runs out }
  sti
end;

{ button bits read 0 when pressed, so the XOR flips them to nonzero = pressed }
function button0a:boolean; assembler;
asm
  mov  dx, $201
  in   al, dx
  and  al, $10
  xor  al, $10
end;

function button0b:boolean; assembler;
asm
  mov  dx, $201
  in   al, dx
  and  al, $20
  xor  al, $20
end;

function button1a:boolean; assembler;
asm
  mov  dx, $201
  in   al, dx
  and  al, $40
  xor  al, $40
end;

function button1b:boolean; assembler;
asm
  mov  dx, $201
  in   al, dx
  and  al, $80
  xor  al, $80
end;

begin
  asm
    mov  stickMask, $0F  { probe all four axes }
    call stickUpdate
    xor  al, al
    mov  bx, stickLimit
    cmp  stick0x, bx     { count = timeout means that axis never answered }
    je   @test0y
    or   al, $01
  @test0y:
    cmp  stick0y, bx
    je   @test1x
    or   al, $02
  @test1x:
    cmp  stick1x, bx
    je   @test1y
    or   al, $04
  @test1y:
    cmp  stick1y, bx
    je   @done
    or   al, $08
  @done:
    mov  stickMask, al   { keep only the axes that responded }
  end;
end.
Which is stable for all four stick axes, auto-detects if they're connected, and so forth. I also hard-coded the button checks instead of making one parameterized function because even the stupid shift was killing me. If you're wondering why I'm using BX for zero, mem:imm16 are usually two or three bytes larger than mem:reg (depending on the operation), so that's ~6 bytes saved on average, and fewer bytes inside the loop nets more loops.
On the 128k Jr it's returning a center value of 10, which is even more sucktastic, but since I'm only using it for digital style input with a dead zone, it's functional. Disabling interrupts during the read also no longer appears to cause timing issues on the t1k with this routine, I suspect because I'm only doing it once instead of twice in the input "slice".
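For anyone who wants to poke at the loop logic off-hardware, here is a rough Python model of the polling and auto-detect steps. It is only a sketch: the names `stick_update`, `detect_mask` and the `discharge_polls` list (how many polls each axis bit stays high, `None` for an unconnected axis) are made up for illustration, and it says nothing about real cycle timing.

```python
def stick_update(discharge_polls, mask=0x0F, limit=0x8000):
    """Model of the single-loop poll; returns the four axis counts.

    discharge_polls[i] is a hypothetical number of polls during which
    axis bit i still reads 1; None means it never drops (no stick).
    """
    counts = [0, 0, 0, 0]
    for poll in range(limit):                 # CX countdown (LOOPNZ)
        port = 0
        for bit, n in enumerate(discharge_polls):
            if n is None or poll < n:
                port |= 1 << bit              # axis timer still running
        al = port & mask                      # AND AL, AH
        for bit in range(4):                  # the four ROR / ADC pairs
            counts[bit] += (al >> bit) & 1
        if al == 0:                           # OR AL, AL sets ZF, loop ends
            break
    return counts


def detect_mask(discharge_polls, limit=0x8000):
    """Model of the unit's startup code: an axis that hits the limit is absent."""
    counts = stick_update(discharge_polls, 0x0F, limit)
    mask = 0
    for bit, count in enumerate(counts):
        if count != limit:
            mask |= 1 << bit
    return mask
```

With only stick 0 attached, for example `[10, 12, None, None]`, the probe pass runs to the timeout once, and every later pass with the detected mask stops as soon as the slowest connected axis has dropped.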
The other bottleneck is outputting score updates. 32 bit math is slow on the 8088, converting a 32 bit integer to a string even more so. While the routine I had for a fixed 8 digit result was pretty peppy as such things go, the simple fact was that by the time I got to actually outputting the score it was just painful to use.
Having made a port to the C64 I knew BCD was a great way around these issues; but I know jack **** about doing it on the x86 platform. (I know it pretty well on the Z80 and 6502.) For genuine speed at outputting the score I need to use unpacked, and since the only math I need is addition I was like "ok, how hard can it be?", especially since little-endian 8 bytes would be my fastest approach (both for math and display).
Thing is, while it's WAY faster, my addition method feels, I dunno... sloppy. Just wondering if anyone knows a better way of doing it. I've unrolled the loop for speed and to remove extra unneeded operations.
BCDUnpacked is an array[0..7] of byte;
Code:
procedure BCDUnpackedAdd(var b1, b2:BCDUnpacked); assembler;
asm
  les  di, b1          { ES:DI -> b1, which receives b1 + b2 }
  mov  dx, ds          { save DS }
  lds  si, b2          { DS:SI -> value being added }
  {0}
  lodsb
  xor  ah, ah
  add  al, es:[di]
  aaa                  { AL = digit 0..9, AH = carry (0 or 1) }
  stosb
  {1}
  lodsb
  add  al, ah          { fold in the carry from the previous digit }
  xor  ah, ah
  add  al, es:[di]
  aaa
  stosb
  {2}
  lodsb
  add  al, ah
  xor  ah, ah
  add  al, es:[di]
  aaa
  stosb
  {3}
  lodsb
  add  al, ah
  xor  ah, ah
  add  al, es:[di]
  aaa
  stosb
  {4}
  lodsb
  add  al, ah
  xor  ah, ah
  add  al, es:[di]
  aaa
  stosb
  {5}
  lodsb
  add  al, ah
  xor  ah, ah
  add  al, es:[di]
  aaa
  stosb
  {6}
  lodsb
  add  al, ah
  xor  ah, ah
  add  al, es:[di]
  aaa
  stosb
  {7}
  lodsb
  add  al, ah
  xor  ah, ah
  add  al, es:[di]
  aaa
  stosb                { any carry out of digit 7 is dropped }
  mov  ds, dx          { restore DS }
end;
Something about it feels wrong... I can't put my finger on it. At first I thought I needed extra AAA in there, once for the carry in AH and once for the add, but since 9+1+9 == 19, I only need one so that's not the problem... It works, it's WAY faster than longint math with trying to turn that longint to a string for output... I dunno, I can't place what's not kosher about it. I almost want to add some short-circuit code for when the value being added is empty/done and CF is zero, but those extra jumps and tests take longer than just letting it finish. (scary with that many memory ops involved)... Maybe pair them up for LODSW? Nah, too much flipping to another register... What am I doing wrong there that's making me think I'm doing it wrong?!?
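A quick way to check the "one AAA is enough" reasoning without an 8088 handy is to model AAA in Python and run the same digit-by-digit sequence. This is a sketch, not the routine itself: `aaa`, `add8`, `to_bcd` and `from_bcd` are hypothetical helpers, and the flag behaviour is modelled only as far as this addition needs. The second function is one possible variant rather than the code above: since AAA also leaves its carry in CF and neither STOSB nor LODSB touches the flags, the carry could in principle be chained with ADC instead of going through AH.

```python
def aaa(al, ah, af):
    """Model of 8086 AAA: returns (AL, AH, CF) after the adjust."""
    if (al & 0x0F) > 9 or af:
        return (al + 6) & 0x0F, (ah + 1) & 0xFF, 1
    return al & 0x0F, ah, 0


def add8(a, b, carry_in=0):
    """8-bit ADD/ADC: returns (result, auxiliary carry out of bit 3)."""
    af = int((a & 0x0F) + (b & 0x0F) + carry_in > 0x0F)
    return (a + b + carry_in) & 0xFF, af


def bcd_unpacked_add(b1, b2):
    """Model of the posted routine: b1 += b2, carry passed along in AH."""
    ah = 0
    for i in range(8):
        al, _ = add8(b2[i], ah)          # ADD AL, AH (no-op for digit 0)
        al, af = add8(al, b1[i])         # XOR AH, AH then ADD AL, ES:[DI]
        b1[i], ah, _ = aaa(al, 0, af)    # AAA leaves the carry in AH
    return b1


def bcd_unpacked_add_adc(b1, b2):
    """Possible variant: carry passed along in CF, digits joined by ADC."""
    cf = 0
    for i in range(8):
        al, af = add8(b2[i], b1[i], cf)  # ADD for digit 0, ADC afterwards
        b1[i], _, cf = aaa(al, 0, af)
    return b1


def to_bcd(n):
    """Integer -> 8 little-endian unpacked BCD digits."""
    return [(n // 10 ** i) % 10 for i in range(8)]


def from_bcd(digits):
    """8 little-endian unpacked BCD digits -> integer."""
    return sum(d * 10 ** i for i, d in enumerate(digits))
```

Both versions agree on every case tried, including the 9+1+9 worst case, and both silently wrap when the score overflows eight digits.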
The output routine ended up being something of a laugh -- TP7 lets you assign functions as a type, so I have an array of functions assigned for coded sprites... each of the functions takes the character being displayed (0..9) and the video offset to display them at, returning the video offset at which the next character should be shown... this results in a rather unusual approach to doing this:
Code:
function fastNumberBCD(var b:BCDUnpacked; vOffset, colorPair:word):word;
begin
  cPair := colorPair;
  fastNumberBCD := fastNumber[b[0]](
    fastNumber[b[1]](
      fastNumber[b[2]](
        fastNumber[b[3]](
          fastNumber[b[4]](
            fastNumber[b[5]](
              fastNumber[b[6]](
                fastNumber[b[7]](vOffset)
              )
            )
          )
        )
      )
    )
  );
end;
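The nesting reads inside-out: the innermost call, for b[7], runs first, so the most significant digit lands at the starting offset and each result feeds the next call. A rough Python model of that chaining, with made-up names (`make_draw`, `drawn`) and an assumed 4-byte advance per glyph standing in for the real sprite code:

```python
drawn = []  # (digit, offset) pairs in draw order, standing in for video writes


def make_draw(digit):
    """Build a stand-in for one coded-sprite function."""
    def draw(v_offset):
        drawn.append((digit, v_offset))
        return v_offset + 4      # assumption: each glyph advances 4 bytes
    return draw


fast_number = [make_draw(d) for d in range(10)]   # plays the role of fastNumber


def fast_number_bcd(b, v_offset):
    """Same order as the nested calls: b[7] first, b[0] last."""
    for i in range(7, -1, -1):
        v_offset = fast_number[b[i]](v_offset)
    return v_offset
```

The loop form is only there to make the evaluation order obvious; the nested expression in the Pascal does the same thing without a loop counter.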
In case you're curious, colorPair is a word containing 4 nybbles -- first two are background:background, second two are foreground:foreground. I'm sending the colors packed that way so as to be able to use AND instead of shifts... and rotates to set up some 'easy' access to the pairs. Take drawing a zero:
Code:
function fastNum0(vOffset:word):word; assembler;
asm
  mov  ax, textSegment
  mov  es, ax
  mov  di, vOffset
  or   di, 1
  mov  dx, cPair { should be bb:FF }
  mov  bx, dx
  ror  bx, 1
  ror  bx, 1
  ror  bx, 1
  ror  bx, 1 { BX should be Fb:bF }
  mov  cx, 157
  mov  al, bl
  stosb
  inc  di
  mov  al, bh
  stosb
  add  di, cx
  stosb
  inc  di
  stosb
  add  di, cx
  stosb
  inc  di
  stosb
  add  di, cx
  stosb
  inc  di
  stosb
  add  di, cx
  mov  al, dl
  stosb
  inc  di
  mov  al, dh
  stosb
  mov  ax, di
  sub  ax, 639
end; { fastNum0 }
If we sent that a colorPair of $11FF (white on blue), the registers end up as:
DH = $11
DL = $FF
BH = $F1
BL = $1F
Which I can then quickly copy to AL to STOSB them out.
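The register values above are easy to double-check with a tiny Python model of the four ROR BX, 1 instructions. `ror16` and `split_color_pair` are hypothetical helper names used only for this check.

```python
def ror16(value, count):
    """Rotate a 16-bit value right by count bits."""
    count %= 16
    return ((value >> count) | (value << (16 - count))) & 0xFFFF


def split_color_pair(color_pair):
    """Return the byte registers fastNum0 ends up with for a given colorPair."""
    dx = color_pair
    bx = ror16(dx, 4)            # four ROR BX, 1 in a row
    return {"DH": dx >> 8, "DL": dx & 0xFF, "BH": bx >> 8, "BL": bx & 0xFF}
```

Calling `split_color_pair(0x11FF)` gives DH = $11, DL = $FF, BH = $F1, BL = $1F, matching the list above.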
I'm playing with just hardcoding the various number draws into the output function directly. Still trying to decide if the reduction in playing with setting ES, BX and DX over and over again along with the array lookup and far calls would equal the various TEST and JMP an equivalent "all in one" function would need... If I btree my TEST I could set the maximum number of test/jmp ever run to 4, that wouldn't be too bad...
Something like:
Code:
  test al, $08
  jnz  @test_89
  test al, $04
  jnz  @test_4567
  test al, $02
  jnz  @test_23
  test al, $01
  jz   @char_0
  { output character 1 }
  jmp  @next
@test_89:
  test al, $01
  jnz  @char_9
  { output character 8 }
  jmp  @next
@char_9:
  { output character 9 }
  jmp  @next
@test_4567:
  test al, $02
  jnz  @test_67
  test al, $01
  jnz  @char_5
@char_4:
  { output character 4 }
  jmp  @next
@char_5:
  { output character 5 }
  jmp  @next
@test_67:
  test al, $01
  jnz  @char_7
  { output character 6 }
  jmp  @next
@char_7:
  { output character 7 }
  jmp  @next
@test_23:
  test al, $01
  jnz  @char_3
  { output character 2 }
  jmp  @next
@char_3:
  { output character 3 }
  jmp  @next
@char_0:
  { output character 0 }
@next:
I dunno, that feels ugly as hell too...
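As a sanity check on the tree (not a replacement for it), here is a small Python model that walks the same TEST/JNZ order and counts how many TESTs each digit costs. The function name `dispatch` is made up for this sketch, and it assumes AL only ever holds 0..9.

```python
def dispatch(al):
    """Return (digit branch reached, number of TESTs executed)."""
    tests = 1                           # test al, $08
    if al & 0x08:                       # jnz @test_89
        tests += 1                      # test al, $01
        return (9 if al & 0x01 else 8), tests
    tests += 1                          # test al, $04
    if al & 0x04:                       # jnz @test_4567
        tests += 1                      # test al, $02
        if al & 0x02:                   # jnz @test_67
            tests += 1                  # test al, $01
            return (7 if al & 0x01 else 6), tests
        tests += 1                      # test al, $01
        return (5 if al & 0x01 else 4), tests
    tests += 1                          # test al, $02
    if al & 0x02:                       # jnz @test_23
        tests += 1                      # test al, $01
        return (3 if al & 0x01 else 2), tests
    tests += 1                          # test al, $01
    return (1 if al & 0x01 else 0), tests
```

Every digit reaches its own branch, 8 and 9 take two TESTs, and everything else takes four, which matches the "maximum of 4" estimate.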
Just thought I'd share what I've been working on -- any suggestions and/or improvements are welcome.