deathshadow
Veteran Member
- Joined
- Jan 4, 2011
- Messages
- 1,378
One of the big reasons for my rewriting Paku Paku and the engine it's based on from scratch was to make it more viable on an unexpanded 128k PCjr... which laughably is lowering the system target from the original 5150 concept... a LOT.
I've got my own internal profiler that lets me break the code into slices while running them in their normal order. It lets me dial in things a normal profiler wouldn't tell me -- isolating spikes in execution in realtime that would be missed by 'classic' profiling methods.
I was thinking that blitting the screen updates was the big bottleneck, but isolated to its own slice it wasn't taking any more or less time, while the slice it was originally in was still eating up massive time... move it back and move out the refresh of the old data to the backbuffer... and that's not it either? The only thing left in there is keyboard... KEYBOARD? That makes no sense, wait... it is spiking on every keypress (or key-repeat)! Something that on a "real" PC takes effectively zero time (below the 1/2400th of a second resolution of my profiler) is taking almost a sixtieth of a second?
Then it hit me -- the Junior's keyboard process is REALLY convoluted.
1) messages are handled by the NMI.
2) The ISR then uses a lookup table to turn the Junior's scancodes into PC/XT scancodes so "normal" software can even work.
3) I'm based on replicating TP's CRT keypressed and readkey -- which call int 21h functions 0x0B and 0x07... and DOS is in RAM.
4) Those int 21h calls themselves call int 16h in BIOS...
5) both of which mean pushing/popping a lot of registers to RAM.
6) ... and RAM on an unexpanded PCjr is slow as molasses since it's the same speed as Video RAM.
Add that mess all together? OUCH. Almost the same amount of time as an entire frame of video sucked down JUST by checking if a key has been pressed and reading its value?
That's why this original code:
Code:
; function readkey:char;
pProcNoArgs readkey
mov ah, 0x07 ; DOS direct character input, no echo; result in AL
int 0x21
retf
; function keypressed:boolean;
pProcNoArgs keypressed
mov ah, 0x0B ; DOS input status; AL = 0xFF if a key is waiting, 0 if not
int 0x21
retf
is a disaster on the Junior... when a key is actually pressed, running both of them takes longer than blitting all 9 game sprites (4 pellets, 4 ghosts, player) from the backbuffer to screen AND updating the score.
So... thinking on how to fix the problem.
1) I was looking at it and int16h AH=0x00 returns PC/XT scancodes regardless of the underlying hardware... so why am I dicking with ASCII and char0/extended codes?
2) Int16h AH=01 (equivalent to keypressed) only checks if 0x0040:001A == 0x0040:001C, so rather than have the overhead of an INT involved, why not just check that my damned self?
3) the CASE statement to check what key is pressed would be faster if there were fewer values -- so how about a lookup table for the scancodes to translate ones that do the same thing -- like turning arrow keys into WASD. Sucks down a handful of RAM, but would also let me filter oddities like AT&T and Tandy numpad6 returning 0xFC when num-lock is off (instead of the proper response of 0x77).
4) since the Jr. has a function to do it, disable key repeat. For good measure, set key repeat really low with a high delay for non-jr systems.
5) trap pointless repeated keystrokes and strip them out of the buffer
6) You know the Junior's memory scheme is crap when ROM is faster than RAM.
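For anyone who wants to poke at the lookup-table idea from point 3 off the real hardware, here is a minimal sketch in portable C rather than the game's assembly. The table contents are an assumption for illustration only: `control_remap` and `remap_key` are hypothetical names, and the choice of returning ASCII 'W', 'A', 'S', 'D' is a guess at one reasonable layout, not the actual table from the game. The scancodes themselves are the standard PC/XT make codes.

```c
#include <stdint.h>

/* Standard PC/XT scancodes, as returned in AH by int 16h function 0. */
enum {
    SC_W = 0x11, SC_A = 0x1E, SC_S = 0x1F, SC_D = 0x20,
    SC_UP = 0x48, SC_LEFT = 0x4B, SC_RIGHT = 0x4D, SC_DOWN = 0x50
};

/* Hypothetical remap table. 256 entries so that every possible byte is a
   valid index (the same property an XLAT-style lookup relies on).
   Entries not listed here default to 0, meaning "ignore this key". */
static const uint8_t control_remap[256] = {
    [SC_W] = 'W', [SC_UP]    = 'W',
    [SC_A] = 'A', [SC_LEFT]  = 'A',
    [SC_S] = 'S', [SC_DOWN]  = 'S',
    [SC_D] = 'D', [SC_RIGHT] = 'D'
};

/* Collapse keys that do the same thing into one value. */
uint8_t remap_key(uint8_t scancode)
{
    return control_remap[scancode];
}
```

With a table like this, the game-side CASE only has to test four values plus zero, and any machine-specific oddball scancode can be handled by adding one more table entry instead of one more branch.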
So the routine I came up with is this:
Code:
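; function gameKey:char;
pProcNoArgs gameKey
xor ax, ax
mov es, ax ; ES = 0, so the BDA sits at 0x0400 and up
mov dh, [lastKey] ; real scancode from the previous call
.loop:
mov cx, [es : 0x041A] ; keyboard buffer head
cmp cx, [es : 0x041C] ; head == tail means the buffer is empty
je .noKey
xor ah, ah ; function 0 = read key; AH still holds a scancode when we loop
int 0x16
cmp ah, dh
je .loop ; same key as last time, discard the repeat
mov al, ah
mov bx, controlRemap
xlat ; AL = controlRemap[scancode]
jmp short .done
.noKey:
xor al, al ; return 0 even if repeats were read and discarded
.done:
mov [lastKey], ah ; store the real scancode (0 if the buffer was empty on entry)
retf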
Since no key is assigned scancode zero, simply returning zero for no keypress OR an unrecognized key seems good enough. Oh, cute trick in there: since the BDA is in the bottom 64k, use segment zero to access it. Notice I store the real scancode for the compare and not the XLAT one.
... and this simple change reduces keyboard handling back to being 'unnoticeable', just like on a real PC.
This probably explains why int21h functions 0x09 and 0x06 inhale sharply upon the proverbial equine of short stature as well... strings sent through function 0x09 output to the screen slower than old school 300 baud modem communications.
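The head/tail check and the repeat filter are easy to get subtly wrong, so here is a rough host-side model in C that can be stepped through on any machine. It is only a sketch under assumptions: `KeyBuffer`, `push_key` and `game_key` are hypothetical names, the buffer is indexed by slot instead of by the byte offsets the real BDA uses, overflow handling is left out, and the remap step is skipped so the function returns the raw scancode.

```c
#include <stdint.h>

#define KB_SLOTS 16  /* the default BIOS buffer holds 16 words */

typedef struct {
    uint16_t head;            /* models the word at 0040:001A */
    uint16_t tail;            /* models the word at 0040:001C */
    uint16_t slot[KB_SLOTS];  /* high byte = scancode, low byte = ASCII */
} KeyBuffer;

/* Real scancode accepted by the previous call, 0 if there was none. */
static uint8_t last_key;

/* Test helper standing in for the keyboard ISR (no overflow check). */
void push_key(KeyBuffer *kb, uint8_t scancode, uint8_t ascii)
{
    kb->slot[kb->tail] = (uint16_t)((scancode << 8) | ascii);
    kb->tail = (uint16_t)((kb->tail + 1) % KB_SLOTS);
}

/* Returns the next scancode that differs from the previous one,
   or 0 when the buffer is empty or held only repeats. */
uint8_t game_key(KeyBuffer *kb)
{
    uint8_t seen = 0;  /* stays 0 if the buffer is empty on entry */

    while (kb->head != kb->tail) {           /* head == tail: empty */
        seen = (uint8_t)(kb->slot[kb->head] >> 8);
        kb->head = (uint16_t)((kb->head + 1) % KB_SLOTS);
        if (seen != last_key) {              /* a genuinely new key */
            last_key = seen;
            return seen;
        }
        /* same scancode as last time: drop it and keep draining */
    }

    /* Empty on entry resets the filter; discarded repeats leave it as-is. */
    last_key = seen;
    return 0;
}
```

Pushing up-arrow (0x48) twice followed by left-arrow (0x4B) and then polling gives 0x48, then 0x4B (the repeated 0x48 is swallowed), then 0. Because an empty poll resets the filter, pressing the same key again after a quiet poll is still reported.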
I'd probably use my own output routines there if not for trying to keep the size under control... though it LOOKS like I may end up with some spare space. If once the game is fully functional and playable the memory footprint is less than 64k, I may go back and write something better for those too.
THEN, there was a really strange error: if I tried to play with the keyboard when the joystick was enabled, the system went off to never-never land. Play with the joystick, it's fine... keyboard with joystick disabled, fine... joystick enabled, play with keyboard, totally banjaxed... and I was unable to recreate this on the Tandy 1000 or my Sharp PC7000 or in DOSBox...
See my first 'problem' above -- keyboard is handled by the NMI? The keyboard NMI expects a valid stack with enough room on it for whatever it is it thinks it's doing. My joystick routine was repurposing SP to store "0" for adc in an attempt to make 'inside the loop' faster since that routine doesn't need a stack... rather than disable the NMI, I just switched those back to "adc r16, 0" instead of "adc r16, sp".
Laughably, changes elsewhere (thanks guys for noticing some of my flubs) tripled the output range of that routine anyways, so switching back to immediate zero is no biggy. No more crashes there -- and I'm getting a VERY usable range of 1..28 for the joystick reads on the Junior, 0..94 on the 1000SX in "Slow".
Cute how PC NMIs don't try to use the stack, but Junior? Pfft.
... and thanks to these changes, I've dragged my codebase kicking and screaming into being fast enough on an unexpanded 128k Jr.
Of course I'm sitting here with sockets, 41256's and a 128k external expansion ... and not bothering for two reasons.
1) moving stuff into faster high memory is "cheating" and not what I'm testing.
2) My Parkinsonism is so bad right now I'm not certain I trust myself soldering anything that detailed... Hoping that maybe come spring I'll be up for it, if not I might have to ask if anyone out there would be willing to do that for me. At least I have a nice 1" wide dual-tip 300 watt baton for pulling the old chips in one yank without snipping the legs.
In any case, if you care about speed on the Junior, stay the hell away from Int 21h (DOS) if you have a viable alternative... and since Turbo Pascal's CRT unit uses same, just another reason NOT to use CRT. Also a laugh that something that takes almost no time on a real PC was burying the game on the Junior.