
Help testing my next game's engine

deathshadow

I don't currently have a true IBM CGA or a PCjr for testing, and I need to know if what I'm trying to do will be too slow on said systems. I'm only aiming for 24 FPS (which at 160x100 seems sufficient)... on my Tandy 1000 HX in 'slow' I'm getting around 23 FPS under the worst-case scenario (24 scrolling stars and 16 sprites)... which is close enough that I'm not gonna sweat the difference.

http://www.deathshadow.com/downloads/SPRTEST1.RAR

If anyone out there with a STOCK PCjr, or a 4.77MHz 8088 with a REAL CGA card (the type that has snow), can report in with the results (it dumps the output to a results.txt file in addition to showing it on screen), it would greatly help in figuring out if my concept is even viable.

For reference, the results from my 1000HX in "slow" (4.77MHz) are as follows...
Code:
CPU:2172
Test 1 - FPS:55.60
Test 2 - Excess at 24 FPS:186.00
Test 3 - FPS:37.80
Test 4 - Excess at 24 FPS:118.72
Test 5 - FPS:28.60
Test 6 - Excess at 24 FPS:52.31
Test 7 - FPS:23.00
Test 8 - Excess at 24 FPS: 0.00

In "fast" (7.16mhz) I get this:
Code:
CPU:3230
Test 1 - FPS:70.60
Test 2 - Excess at 24 FPS:275.29
Test 3 - FPS:48.20
Test 4 - Excess at 24 FPS:208.18
Test 5 - FPS:36.60
Test 6 - Excess at 24 FPS:140.74
Test 7 - FPS:29.40
Test 8 - Excess at 24 FPS:74.75

I'm hoping that the Jr will come in about equal to the 1k since they both only have 1 wait instead of 2 on the video memory, thanks to the use of buffered (aka dual-ported) RAM instead of single-ported... The real CGA worries me since, as a rule, it's easily half the speed of the CGA implementation in a Tandy 1000.

So long as test 5 is reporting more than 24 FPS, I should be in good shape. Test 7 is pretty much ice-skating uphill for that level of hardware... but even then I'm only 1 FPS slower than my target framerate.
 
I'll give your test code a spin later, but I already determined experimentally the theoretical wait states for a genuine CGA on a 4.77MHz 8088. It's between 3 and 8 CPU cycles, depending on where in the CGA's cycle the access happened. There are 16 possible relative phases of the CGA and CPU clock, and they give 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 7, 8, 8 and 8 cycle wait states respectively. So if your code isn't (deliberately or accidentally) synchronized to the CGA's clock, there will be an average CGA wait state of about 5.8 CPU cycles.
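
If you want to sanity-check that average, here's a trivial sketch (plain Turbo Pascal, using nothing but the sixteen per-phase values listed above):
Code:
{ averaging the sixteen per-phase CGA wait states quoted above }
const
  phaseWait : array[0..15] of byte = (3,4,4,4,5,5,5,6,6,6,7,7,7,8,8,8);
var
  i, total : integer;
begin
  total := 0;
  for i := 0 to 15 do
    inc(total, phaseWait[i]);
  writeln('average CGA wait = ', total / 16 : 6 : 4);  { prints 5.8125 }
end.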
 
It doesn't run on my PCjr so well for some reason.

I downloaded it multiple times with multiple disks, so I think I have eliminated a possibility of bad disks/bad downloads.

It starts testing the CPU for a bit, then it errors out with runtime error 158 at 0110:0091, then it fails to read the disk afterwards (sector not found).
 
I'd suspect a bad floppy, or an incompatible OS or something... that error is "invalid register operation", which usually means an invalid offset like [bx] ending up pointing at [bp+bx] -- but tracing the line for that error puts it in the file loading routine, which should work just fine since that section of code is:

Code:
constructor tileSet.init(dataFile:dosName);
var
	f:file;
begin
	name:=dataFile;
	assign(f,name+'.DAT');
	reset(f,1);                   { <<< 0110:0091 is here -- open as untyped file, 1-byte records }
	size:=filesize(f);
	getmem(dataStart,size);       { allocate a buffer the size of the file }
	dataEnd:=dataStart;
	inc(dataEnd,size);            { dataEnd = first byte past the loaded data }
	blockread(f,dataStart^,size); { slurp the whole file in one read }
	close(f);
end;

Most unusual. I'll double check the archive file to be sure, but I'd say somewhere between point A and Point B there's an issue.
 
I just uploaded a new copy at the same URL... I was calling the file load AFTER reprogramming the system timer for gameplay, and I know the Jr. is VERY finicky about the timer when it comes to reading/writing the disk... Moved the save/load of files outside the timer change; maybe that will fix the problem for you?
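
So the startup order now looks roughly like this (a sketch only -- the procedure names are made up for illustration, not the actual engine code):
Code:
begin
  loadTiles;       { all disk I/O happens first, on the stock BIOS timer }
  loadSprites;
  setGameTimer;    { only now reprogram PIT channel 0 for the game rate }
  runGame;
  restoreTimer;    { back to the 18.2Hz tick before touching the disk again }
  saveScores;
end.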

Though this is EXACTLY why I'm releasing tests like this one... hopefully that's one bug down.
 
My results from stock XT + IBM CGA:

Code:
CPU:2044
Test 1 - FPS:49.60
Test 2 - Excess at 24 FPS:162.50
Test 3 - FPS:36.60
Test 4 - Excess at 24 FPS:106.89
Test 5 - FPS:29.00
Test 6 - Excess at 24 FPS:51.43
Test 7 - FPS:23.80
Test 8 - Excess at 24 FPS: 0.00

Snow was somewhat prohibitive in viewing enjoyment; the earlier tests had less snow than the later tests, and some tests had so much snow that it would effectively prevent playing the game. Quick phone camera video is at ftp://ftp.oldskool.org/pub/misc/temp if you'd like to see what I saw. The video focuses on the upper right corner because there was a font/printing glitch every other update of that area; towards the end of the video I focus on it explicitly.

If you want to be able to write a sprite at any time (i.e. not wait for vertical retrace) but still want to reduce the snow somewhat, maybe wait for horizontal retrace? There will still be snow, but it will be somewhat localized over on the left side of the screen.
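
The classic poll looks something like this (a rough sketch -- port 3DAh bit 0 reads 1 whenever it's safe to touch the regen buffer, and the names are made up):
Code:
{ sketch: wait for the start of a retrace, then slip one word in quickly }
procedure snowWrite(cellOfs, charAttr : word);
begin
  asm
    mov  dx,$03DA        { CGA status register }
  @waitActive:
    in   al,dx
    test al,1
    jnz  @waitActive     { wait until we're back in active display... }
  @waitRetrace:
    in   al,dx
    test al,1
    jz   @waitRetrace    { ...then catch the very start of the next retrace }
    mov  ax,$B800
    mov  es,ax
    mov  di,cellOfs      { byte offset of the target cell in screen RAM }
    mov  ax,charAttr     { character in AL, attribute in AH }
    cld
    stosw                { one quick word while it's safe }
  end;
end;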
 
Also, it just dawned on me that if you're willing to play with the MC6845 start address registers, you could do a hardware scroll of the entire playfield downward and only have to repaint the sprites and the top info bar (the one line of the info bar that moves downward can be replaced with a line of background+stars). This should work on all machines, even PCjr. It is demonstrated nicely in the game Prohibition, where the entire playfield is larger than the screen but it scrolls fullscreen at 60Hz because they're adjusting the start address registers and repainting only the new material. (You can snag Prohibition in ftp://ftp.oldskool.org/pub/misc/xtfiles.rar , it should be in the GAMES\TECH\PROHIBIT directory.)
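
Poking the start address looks roughly like this (a sketch; newStart is in character/word units, and how you pick it depends on the mode and where the playfield sits):
Code:
{ sketch: hardware scroll by reprogramming the MC6845 start address }
procedure setStartAddress(newStart : word);
begin
  asm
    mov  bx,newStart
    mov  dx,$03D4        { 6845 index register }
    mov  al,12           { R12 = start address, high byte }
    out  dx,al
    inc  dx              { $03D5 = 6845 data register }
    mov  al,bh
    out  dx,al
    dec  dx
    mov  al,13           { R13 = start address, low byte }
    out  dx,al
    inc  dx
    mov  al,bl
    out  dx,al
  end;
end;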

It is possible that there is too much to repaint to make this technique work, but since you're erasing/drawing all the sprites anyway, I thought you'd want to know about it since it means you won't have to repaint all of the stars every frame, only one line of them. (And if you want them to blink, just don't turn off the blink bit ;-)

Also#2, I just remembered that you can't wait for horizontal retrace (see previous reply) on a PCjr, the bits aren't there. However, you happily don't have to, since PCjr doesn't have CGA snow.
 
That one worked, pretty graphics!

Stock PCjr save for the memory/floppy expansion (extra floppy and 256K RAM). Intel 8088.

Code:
CPU:1120
Test 1 - FPS:45.80
Test 2 - Excess at 24 FPS:155.57
Test 3 - FPS:34.60
Test 4 - Excess at 24 FPS:100.12
Test 5 - FPS:27.80
Test 6 - Excess at 24 FPS:43.82
Test 7 - FPS:23.20
Test 8 - Excess at 24 FPS: 0.00
 
That's faster than "stock" as you're running the jrmem driver to enable the extra RAM, which is not as slow as the first 128K. For a true "stock" speed rating (and to simulate a 128K jr's speed), re-run the test without that driver loaded.
 
My response went AWOL or got deleted somehow... probably due to the PAINFUL forum speeds and recent increase in 500 errors, so let's try this again.

That's faster than "stock" as you're running the jrmem driver to enable the extra RAM, which is not as slow as the first 128K. For a true "stock" speed rating (and to simulate a 128K jr's speed), re-run the test without that driver loaded.
Check the cpu test -- 1120 is HALF what it should be, and the only excuse for that on a Jr is it running in the bottom 128, so I kinda doubt that. It's actually more of a RAM and CPU test since it's just a repeat until 5 seconds expire or ch is pressed... The frame rates are about what I was expecting out of a Jr due to less video wait states, which is where the REAL bottleneck is.
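
Roughly, the shape of that test is this (not the actual code, just the idea -- count how many passes of a fixed chunk of work fit into five seconds of BIOS ticks):
Code:
{ sketch: iterations over a fixed time, using the 18.2Hz BIOS tick count }
var
  ticks : longint absolute $0040:$006C;
  started, count : longint;
begin
  count := 0;
  started := ticks;
  repeat
    { ...the fixed chunk of work being measured goes here... }
    inc(count);
  until ticks - started >= 91;   { 91 ticks is roughly 5 seconds }
  writeln('CPU:', count);
end.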

If anything, your XT numbers seem a bit off to me -- is that a V20? You sure it's at 4.77MHz? Are you sure that's a REAL CGA card? Those numbers just look wrong. (Well, the low sprite count numbers look right, the high sprite count numbers look WAY too high.)

Ever run CGA_COMP on a Jr? (now who wrote that again?) -- specifically the video bandwidth tests? You might be in for a surprise, as the block read/block write tests actually come in 50% FASTER than an XT with a CGA! (The interleaved read/write tests come in at about 2/3rds if CGA_COMP is operating out of the bottom 128.)

I don't have a Jr anymore to test on, but if you want FUN in that department, take my Tandy 1000HX as an example. According to CGA_COMP, the results for a normal 4.77MHz PC with CGA are:
Block Read: 246 KB/sec
Block Write: 298 KB/sec
Interlaced Read: 175 KB/sec
Interlaced Write: 170 KB/sec

My 1000HX in "Slow" - 4.77MHz
Block Read: 483 KB/sec
Block Write: 692 KB/sec
Interlaced Read: 248 KB/sec
Interlaced Write: 267 KB/sec

Back when I had a Jr, running from the bottom memory it returned around 350ish for block read, 400ish for block write, and low 100s for the interlaced tests... forcing the code out of that region, the numbers jumped up into Tandy 1000 territory.

That's due to the lack of extra wait states on video memory -- it's why the PCjr is actually FASTER at reading/writing video memory (especially on REP operations small enough to fit into the BIU) than a stock CGA card... the bus isn't locked up as often with wait states since it has dual-ported RAM.

Gah, scary thought -- could you imagine a PCjr with single-ported/unbuffered RAM? Snow while code is running (we'd be yelling "What is this, a ZX-80?")... even MORE wait states dragging code execution in that bottom RAM to a crawl.

In testing here between various machines, I noticed something odd... For 1000s I've got an SX and two HXes set up here... the SX has the stock AMD 8088-2, one HX has the stock Siemens 8088-2, and the other HX recently got a V20 (wonder where that came from?)... I thought the Siemens was just another 1:1 8088 knockoff made under license, but the numbers don't reflect that. I ended up pulling all the chips and trying them in both SX and HX machines to verify it wasn't mainboard differences skewing my results. The "CPU Count" test from this little video test of mine showed something... odd.

SX Stock AMD 8088-2 - 2046
SX Siemens 8088-2 - 2163
SX NEC V20 - 2304

HX Stock AMD 8088-2 - 2051
HX Siemens 8088-2 - 2172
HX NEC V20 - 2342

I pulled out Norton's SI, as well as MIPS, and got similar skews in the numbers. The Siemens appears to be marginally faster! (I chalk the HX being faster up to BIOS or other board differences.) Not as much of a boost as the V20, but it's still interesting to see.

It also shows why all the old games that ran "unthrottled" with no timer control at all behave so... badly, even across systems operating at 4.77MHz.

The SX also gave some odd speed results -- this is with the V20:
CPU: 2304
Test 1 - FPS:48.00
Test 3 - FPS:36.20
Test 5 - FPS:28.60
Test 7 - FPS:23.60

Back on the AMD 8088-2:
CPU: 2046
Test 1 - FPS:42.00
Test 3 - FPS:33.70
Test 5 - FPS:24.80
Test 7 - FPS:22.00

Eerily low compared to my HX, or the numbers Trixter reported from the XT... Kind of strange as I didn't think there was that much difference between the SX and HX.

But again, that's why I'm putting this test out there, so I have an idea how much I can actually put on-screen at once and keep the frame rate playable, instead of getting my heart set on doing things that just aren't feasible.

Oh, and I'm aware of that bit of video corruption up top -- originally it ran all 16 sprites from the start; when I broke them into 4 pieces, some of the initialization code ended up a bit off -- that's the 'erase the old location' code screwing up. Not a big deal given this is only a test and it has no impact on actual speed measurements.

Though I am a bit shocked at the amount of snow... I really shouldn't be, as about 70% of each loop's clock cycles are spent blitting to the screen, but still...

Thankfully, unlike Paku Paku where I needed 240 ticks/second for sound management, I can use my desired frame rate as my timer rate, since I can make the sound however I want for this game instead of trying to mimic an existing game. (Couldn't even put it into an interrupt, as it firing during the blit to screen looked like arse.)

I was looking at the hardware scrolling, but found issues between video adapters with more than 16K of video RAM; it's also not entirely viable since I'm only going to be using 112px of width for the play area, with stats on the right much akin to Paku Paku -- or more specifically Silpheed... keeping that large a sidebar fixed with hardware scrolling just isn't an option. Also, all the calculations to subtract the offset from my sprites and to redraw the top end up being MORE work, and slower, than just drawing and erasing the 24 pixels... Looks like a great technique for games like River Raid, but not so great for what I'm doing. (Especially since... well... I'm not going to give away the surprises just yet.)

Though I really squeezed a lot of speed out of the stars by NOT tracking them via X/Y but by their memory offset, and by only allowing stars on every other pixel so I don't even need shifts -- it's just a single attribute write. (Well, really 3 bytes of writes per frame...)
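
Roughly the idea, as a sketch of the bookkeeping only (the buffer width, star count and attribute values here are made up, not the real engine code):
Code:
{ stars tracked purely by byte offset into the back-buffer }
const
  bufWidth = 128;               { back-buffer width in bytes }
  bufSize  = 100 * bufWidth;
  numStars = 24;
var
  buf  : array[0..bufSize-1] of byte;
  star : array[1..numStars] of word;   { current byte offset of each star }
  i    : integer;
begin
  { per frame: erase, move one row, wrap, redraw -- no X/Y, no shifts }
  for i := 1 to numStars do begin
    buf[star[i]] := $00;                 { erase at the old offset }
    inc(star[i], bufWidth);              { scroll one row }
    if star[i] >= bufSize then
      dec(star[i], bufSize);             { wrap back to the top }
    buf[star[i]] := $0F;                 { redraw the star's attribute }
  end;
end.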

I'm also going to be using MUCH larger back-buffers this time out so I can have sprites smoothly enter/exit the screen, and to simplify/speed up address calculations. That 112 isn't a coincidence, as it gives me 8px per side to reach 128... room enough for my sprites, and making the back-buffer address calculation a simple:

Code:
mov  bx,x
mov  ah,y
xor  al,al           { ax = y * 256 }
add  bx,ax           { bx = x + y * 256 }
mov  ax,spriteShift  { source adjustment used when x is odd }
les  di,backBuffer
lds  si,spriteBuffer
shr  bx,1            { bx = x div 2 + y * 128 (two pixels per byte) }
jnc  @noShift
add  si,ax           { odd x: offset the sprite source by spriteShift }
@noShift:
add  di,bx           { es:di = destination in the back-buffer }

Which is a lot nicer than trying to calculate ((x + y*160) and $FFFE); or more specifically:
Code:
mov ah,y
xor al,al   { ax = y * 256 }
shr ax,1    { y * 128 }
add bx,ax
shr ax,1
shr ax,1    { y * 32 }
add bx,ax   { bx = bx + y * 160 }

Which is pretty hefty.
 
Check the cpu test -- 1120 is HALF what it should be, and the only excuse for that on a Jr is it running in the bottom 128, so I kinda doubt that. It's actually more of a RAM and CPU test since it's just a repeat until 5 seconds expire or ch is pressed... The frame rates are about what I was expecting out of a Jr due to less video wait states, which is where the REAL bottleneck is.

I somehow missed the CPU score; I was looking only at the framerates. It now dawns on me that maybe your framerates were meant to be fixed (the last one certainly is).

If anything, your XT numbers seem a bit off to me -- is that a V20? You sure it's at 4.77MHz? Are you sure that's a REAL CGA card? Those numbers just look wrong. (Well, the low sprite count numbers look right, the high sprite count numbers look WAY too high.)

Way to turn it around. Yes it's a real XT with a real CGA, though I'm not sure how I'm supposed to convince you of that.

Ever run CGA_COMP on a Jr? (now who wrote that again?) -- specifically the video bandwidth tests? You might be in for a surprise, as the block read/block write tests actually come in 50% FASTER than an XT with a CGA! (The interleaved read/write tests come in at about 2/3rds if CGA_COMP is operating out of the bottom 128.)

The timings tests in CGA_COMP are broken -- sorry. They were an afterthought. Don't use those. My benchmark in progress is not broken, so use that instead.

I don't know where you got this idea that the PCjr has less wait states on memory, or that the video memory is dual-ported, but you need to throw those assumptions away. The first 128KB of a PCjr is *all* wait state, which is why all of it can be used as video ram, and why all of it is slow. The memory is not dual-ported, and CPU accesses to that memory are constantly getting blocked by the video circuitry. The reason there's no snow is because the blocking mechanism actually works on a PCjr, as opposed to IBM+CGA where it does not for 80-col modes.

Here are some relevant benchmark results from both an XT and a jr:

[UID9485166A30]
memory_test=3774
opcode_test=1753
vidram_test=2652
mem_ea_test=1935
3dgame_test=1851
score=4
machine=PC/XT (enhanced)
cpu=Intel 8088
cpuspeed=4.77 MHz
biosinfo=62X0851 COPR. IBM 1986 (01/10/86, rev. 1)
biosdate=19860110
bioscrc16=9485
videosystem=CGA
videoadapter=CGA


[UID7F5C71D]
memory_test=5926
opcode_test=3584
vidram_test=3373
mem_ea_test=4392
3dgame_test=3490
score=2
machine=PCjr
cpu=Intel 8088
cpuspeed=4.77 MHz
biosinfo=1504037 COPR. IBM 1981,1983PS (06/01/83, rev. 86)
biosdate=19830601
bioscrc16=7F5C
videosystem=CGA
videoadapter=IBM PCjr

The first is my 4.77MHz 8088 IBM PC/XT w/CGA. The second is my 4.77MHz 8088 128K IBM PCjr. In every single way, a stock PCjr is slower than a stock PC/XT. If you would like the source for the routines above so you can see what they're doing, just ask, but they're REP string instructions so there's nothing to fake. The meat of the memory block test is below, minus a few preamble/postamble instructions (the 4-digit number in "memory_test=5926" is the number of microseconds it takes the following code to execute):

Code:
;bufsize=257, deliberately odd.  Both buffers are in system memory.
    cld
    xor     ax,ax
    les     di,[buf1]
    mov     cx,bufsize
    shr     cx,1
    rep     stosw           {fill buf1 with 00h}
    adc     cx,0
    rep     stosb

    les     di,[buf2]
    lds     si,[buf1]
    mov     cx,bufsize
    shr     cx,1
    rep     movsw
    adc     cx,0
    rep     movsb           {typical copy routine that handles cx=odd number}

    sub     si,bufsize
    sub     di,bufsize      {reset buffer pointers}
    mov     byte ptr es:[di+bufsize-2],$ff
                            {put a target search byte "FF" at end of es:di buffer}
    mov     cx,bufsize
    mov     al,$ff
    repne   scasb           {should stop shortly before the end of the buffer}

    sub     di,bufsize-1    {reset buffer pointers}
    mov     cx,bufsize
    repe    cmpsb           {should stop one byte from the end of both buffers}

    sub     si,bufsize-1
    mov     cx,bufsize
    shr     cx,1
    rep     lodsw           {maximum transfer rate block memory read}

The vidram_test code is (again, minus a few preamble/postamble opcodes):

Code:
;screenarea=320.  buf1 in system ram.
    les     di,[buf1]
    lds     si,[screenseg]
    mov     cx,screenarea
    shr     cx,1
    cld
    rep     movsw           {copy screen ram to buffer}
    mov     ds,dx
    les     di,[screenseg]
    lds     si,buf1
    lodsb                   {simulate writing a single character+attr to the}
    stosw                   {screen from an ascii text buffer}
    lodsb                   {again, from odd address}
    stosw
    sub     si,2
    sub     di,4            {reset buffer pointers}
    mov     cx,screenarea
    shr     cx,1
    rep     movsw           {simulate restoring an entire saved text screen}

Hopefully the above dispels your notion that a PCjr is faster in any way than a PC+CGA.

I pulled out Norton's SI, as well as MIPS, and got similar skews in the numbers. The Siemens appears to be marginally faster! (I chalk the HX being faster up to BIOS or other board differences.) Not as much of a boost as the V20, but it's still interesting to see.

That is indeed very interesting; I have not seen that before. If you get the opportunity to run my benchmark stub on each of them, I'd be curious to see in what areas they differ.

I was looking at the hardware scrolling, but found issues between the video adapters with more than 16k of video RAM; it's also not entirely viable as I'm only going to be using 112px of width for the play area, with stats on the right much akin to paku paku -- or more specifically Silpheed... keeping that large a sidebar fixed with hardware scrolling just isn't an option...

When I was trying to write that "detect 32K CGA" code, I found that the extra 32K is disabled until you actually init one of those modes (which is why I couldn't come up with a reliable detection routine). So it would still work on all CGA cards. But if you plan on having a sidebar, then it won't work of course.

Performance thought: If your sprites are only 24 pixels, maybe you should look into compiled sprites for any sprite that doesn't touch a border (then fall back to a clipped sprite routine for any that does). Your test appears to be 4 different ways to update the screen, but a compiled sprite always wins when you're not dealing with bitplanes.
 
making the back-buffer address calculation a simple:

Code:
mov bx,x
mov ah,y
xor  al,al
add bx,ax
mov ax,spriteShift
les di,backBuffer
lds si,spriteBuffer
shr  bx,1
jnc  @noShift
add si,ax
@noShift:
add di,bx

Or, just use a lookup table. Stick it at the front of your sprite buffer, and load sprites after it; that way, you can address both with ds. Tables are usually addressed with bx, so "mov ax,[si+bx]" is fast, and "xlat" is faster (although xlat only loads a byte value so that doesn't help your address word calc).
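
A rough sketch of the table idea (names and layout invented for illustration -- per the above you'd really park the table at the front of the sprite segment so DS covers both):
Code:
{ 100 precalculated row offsets, fetched with a single [bx+si] }
var
  rowTable : array[0..99] of word;   { rowTable[y] = y * 128 }
  i : integer;
  x, y, dstOfs : word;
begin
  for i := 0 to 99 do
    rowTable[i] := i * 128;          { built once at startup }
  x := 10; y := 50;                  { example coordinates }
  asm
    mov  bx,y
    shl  bx,1                { entries are words }
    mov  si,offset rowTable
    mov  ax,[bx+si]          { ax = y * 128, no shift chain }
    add  ax,x
    mov  dstOfs,ax           { final back-buffer offset }
  end;
  writeln(dstOfs);                   { 50*128 + 10 = 6410 }
end.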
 
Ok, I took my PCjr apart and made it really, really stock. No jrconfig, no extra memory, no extra floppy, nothing, just the 128k module.

Code:
CPU:1015
Test 1 - FPS:21.40
Test 2 - Excess at 24 FPS:0.00
Test 3 - FPS:15.80
Test 4 - Excess at 24 FPS:0.00
Test 5 - FPS:12.60
Test 6 - Excess at 24 FPS:0.00
Test 7 - FPS:10.40
Test 8 - Excess at 24 FPS:0.00
 
I somehow missed the CPU score; I was looking only at the framerates. It now dawns on me that maybe your framerates were meant to be fixed (the last one certainly is).
That's actually the minimum BUS response time 'locking' it. The more sprites there are, the less CPU-bound the code ends up, and the larger the chunk of time spent copying the backbuffer to the screen.

Way to turn it around. Yes it's a real XT with a real CGA, though I'm not sure how I'm supposed to convince you of that.
Most unusual -- the numbers are nowhere near what my experience told me to expect; but that's OK, since that's what this test was for.

The timings tests in CGA_COMP are broken -- sorry. They were an afterthought. Don't use those. My benchmark in progress is not broken, so use that instead.
alrighty then.

I don't know where you got this idea that the PCjr has less wait states on memory, or that the video memory is dual-ported, but you need to throw those assumptions away. The first 128KB of a PCjr is *all* wait state, which is why all of it can be used as video ram, and why all of it is slow. The memory is not dual-ported, and CPU accesses to that memory are constantly getting blocked by the video circuitry. The reason there's no snow is because the blocking mechanism actually works on a PCjr, as opposed to IBM+CGA where it does not for 80-col modes.
I had thought (and been told) it's why clone CGAs and the Jr. don't have snow... but Mike's latest report in is consistent with what you are saying, though the result from the CPU part of my test makes no sense if the code wasn't running in the bottom 128.

;bufsize=257, deliberately odd. Both buffers are in system memory.
;screenarea=320. buf1 in system ram.
I would have thought that, even with a good timer like Abrash's Zen timer in place, that would be too small a sample pool to get accurate results; but I was always taught that when benchmarking you never test a fixed number of iterations, and instead bench based on how many iterations you can do over a fixed period of time, removing timer granularity from the equation.

Also, your test doesn't actually seem to render to the visible page, and non-displayed video RAM is accessed faster than displayed video memory... last I knew, at least. I'd make the test actually show something, just to be 100% sure things like value latches (on EGA/VGA, writing the same value twice is faster the second time) aren't getting in the way -- besides, it would make the end user feel more like the test is doing something.

Hopefully the above dispels your notion that a PCjr is faster in any way than a PC+CGA.
Mike's latest report in convinced me, though that's most unusual and not consistent with what I had for code results back when I had a Jr... unless there was something screwy with the Jr. I had... That the CPU test was off the way it was when fully configured makes me wonder if it was running half-in and half-out of that bottom 128K window or something.

It also shows that there's no way no how my game will support a 128k Jr.

That is indeed very interesting; I have not seen that before. If you get the opportunity to run my benchmark stub on each of them, I'd be curious to see in what areas they differ.
Good idea that -- when it warms up mid-day in the garage I'll go down and give it a whirl.

Performance thought: If your sprites are only 24 pixels, maybe you should look into compiled sprites for any sprite that doesn't touch a border (then fall back to a clipped sprite routine for any that does). Your test appears to be 4 different ways to update the screen, but a compiled sprite always wins when you're not dealing with bitplanes.
I'm not entirely certain what you mean by that... the sprites are 24 BYTES, but 48 pixels (except for the 24 stars, which are 1 byte each)... I have no clue what a "compiled sprite" is, but adding any extra tests for edge-of-screen interactions seems like a waste of time when I could just make the backbuffer bigger than the display area... especially when I'm using tile-based blits. (Though backbuffer-to-screen would still 'need' edge checks.)

Or, just use a lookup table. Stick it at the front of your sprite buffer, and load sprites after it; that way, you can address both with ds. Tables are usually addressed with bx, so "mov ax,[si+bx]" is fast, and "xlat" is faster (although xlat only loads a byte value so that doesn't help your address word calc).
XLAT is actually slower, for the reason you listed... If I cut out the extra code, what we're really talking about here is the Y calculation... by going with a 128-byte-wide back-buffer it works out to:

Code:
mov ah,y
xor al,al
shr ax,1 { ax = y*128 }

XLAT would return a byte, meaning I'd still have to shift it -- and put an extra byte memory read in place...

Code:
mov al,y
xlat         { bx assumed to point at the table }
xor ah,ah
xchg ah,al
shr ax,1

Uhm... no.

I've rarely if ever found XLAT to be useful... or speedy... for much of anything. It WOULD be if it dealt with word-based widths, but without that the only thing it's really good for is... uhm... well there's... yeah... A byte sized data field is useless and would end up being just more code before running the SAME code. I mean, what am I gonna do, create an array 0..100 of byte that contains 0..100? I could interlace the table to store the word width, but that's even more work and a shift of BX with TWO XLAT -- all to avoid one shift? I think not.

Seriously, what is XLAT good for? I've never seen it used anyplace it wasn't just a waste of memory and speed compared to LODS... especially with it sucking down one of the few general purpose registers that could be better used to do something else.

MAYBE for the screen address calculation it might be semi-useful... If I stored the 100 valid offsets shifted right by two, since *160 would have the bottom two bits empty...

Code:
mov bx,si    { si = table base }
mov di,x
mov al,y
xlat         { al = the byte table entry for y }
xor ah,ah
shl ax,1
shl ax,1
add di,ax
and di,$FFFE


But with the 1 byte memory read on it I'm not convinced that would be worth sucking down the extra RAM compared to just doing the math.

Code:
mov di,x
mov ah,y
xor al,al   { ax = y * 256 }
shr ax,1    { y * 128 }
add di,ax
shr ax,1
shr ax,1    { y * 32 }
add di,ax   { di = x + y * 160 }
and di,$FFFE

Figuring in prefetch, it's probably a wash... or slower to use XLAT thanks to that pesky memory read... or the overhead it would add to get BX set to SI while still being able to read my x and y vars (since those are passed on the stack).
 
I would have thought that, even with a good timer like Abrash's Zen timer in place, that would be too small a sample pool to get accurate results; but I was always taught that when benchmarking you never test a fixed number of iterations, and instead bench based on how many iterations you can do over a fixed period of time, removing timer granularity from the equation.

DRAM refresh can add some jitter, yes. The microsecond timings are for curiosity's sake, for emulator authors. The actual synthetic benchmark score ("Score=" in the output) is how many iterations of the five test suites can be executed in 50ms; that way, the score is still relevant as machines get so fast that the individual suites execute in under a microsecond.

Also your test doesn't actually seem to render to the visible page

Check the code again :) Your reaction is exactly what I wanted, to not notice that I was writing to the visible page. The full benchmark tool has a mode where the score is constantly calculated realtime and displayed on the screen; if it corrupted the display every time it ran, it would ruin the experience for the end-user. The drawback with the technique I use is that reading from display RAM is, on many (but not all) VGA cards, much slower than writing. I don't compensate for this at all in the benchmark (because doing so would skew in favor of VGA cards) so it is one of the few things about the test suite I'm not thrilled about. Most PC games made after 1984 don't read from video memory unless they're plotting a sprite directly to display ram.

Mike's latest report in convinced me, though that's most unusual and not consistent with what I had for code results back when I had a Jr... ...It also shows that there's no way no how my game will support a 128k Jr.

It's possible you have always used Jrs with a RAM expansion and the driver loaded. As for targeting a 128K PCjr, I wouldn't worry about it; most retrocomputists who use Jrs use expanded Jrs. Supporting a stock Jr. is more of a personal challenge than anything else. Since your game is action, I'd target 512K machines and burn up as much RAM as will help speed things up. Precompiled sprites, lookup tables, precalced music code output, the works.

I have no clue what a "compiled sprite" is, But adding any extra tests for edge-of-screen interactions seems like a waste of time when I could just make the backbuffer bigger than the display area...

Compiled sprites are clever and awesome. In a nutshell, a compiled sprite is machine code that draws that sprite -- it's the actual instructions that plot the pixels/bytes. Obvious Pros include speed and automatic transparent areas. Cons are size (machine code is roughly 4x larger than the raw sprite data itself) and no easy way to clip. But since your playfield is both virtual and larger than the screen area, you don't have to clip at all.

Here's a good intro to compiled sprites that directly relates to your programming environment: http://nondot.org/sabre/graphpro/sprite4.html
Here's another one: http://www.superscalar.org/gptricks/gp_sprc.html#desc

Seriously, what is XLAT good for? I've never seen it used anyplace it wasn't just a waste of memory and speed compared to LODS... especially with it sucking down one of the few general purpose registers that could be better used to do something else.

XLAT is used for replacing values with those indexed in a precalculated table (xlat="translate"). XLAT is a 1-byte 11-cycle instruction functionally equivalent to MOV AL,DS:[BX+AL] (note that [BX+AL] is not valid x86 code, I'm just illustrating how it works). This is the classic best-case use of XLAT:

Code:
  lds si,buffer ;also has translation table
  les di,buffer ;same as source
  mov bx,offset transtable
  mov cx,bufsize
@translate:
  lodsb
  xlat
  stosb
  loop @translate

(Unrolled would be faster; I'm just illustrating a point.) The above construct can be used to translate an entire buffer to uppercase or lowercase, replace linedraw characters with +-|, keep values within a bound range, translate single digits to hex (and back), whatever you like, all based on the table that BX is pointing to at DS:BX.

It is not used for word-sized lookups. That's why I suggested mov ax, ds:[bx+si] instead.
 
XLAT is used for replacing values with those indexed in a precalculated table (xlat="translate"). XLAT is a 1-byte 11-cycle instruction functionally equivalent to MOV AL,DS:[BX+AL] (note that [BX+AL] is not valid x86 code, I'm just illustrating how it works).
I know what it is/how it works, I just can't figure out a scenario where I'd ever have a reason to use it.

This is the classic best-case use of XLAT:
for example...

1) xlat INSIDE the loop? painful at best and not sure why you'd do that...
2) I don't see any scenario where doing that would serve a purpose.

The above construct can be used to translate an entire buffer to uppercase or lowercase
Where I'd use a cmp or test rather than waste RAM...

replace linedraw characters with +-|
Sounds like a REAL waste of RAM...

keep values within a bound range
How would that work? Well, unless you're forcing it... no... ok, I just don't get that one.

translate single digits to hex (and back)
Ok, I could see using 16 bytes to avoid using cmp/jmp in this case... though the flipping back and forth between bl and al could get annoying... and not sure I'd ever need just one nybble translated... I mean, are two xlat really going to be that much better than using daa with adc? (both suck needing a shift by 4 though).
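
For what it's worth, that nybble-to-hex case would look something like this (just a sketch):
Code:
{ 16-byte table, one XLAT, no cmp/jmp }
const
  hexDigits : array[0..15] of char = '0123456789ABCDEF';

function nybbleToHex(value : byte) : char; assembler;
asm
  mov  bx,offset hexDigits
  mov  al,value
  and  al,$0F          { keep the low nybble }
  xlat                 { al = '0'..'9' / 'A'..'F' }
end;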
 
Compiled sprites are clever and awesome.
Ah, looking at what you linked to, I'm used to those under another name -- immediate blitting, since it's using a fixed offset and immediate values for the data -- usually dismissed as useless on anything but 8-bit images... With the need to read, AND, OR and then write again, on 4-bit (or worse, 2-bit) images with transparency it ends up at something like 32 bytes of code per sprite pixel; hardly practical compared to the 6 to 8 bytes per pixel needed in 8-bit modes.

Code:
mov al,$0F        { mask }
and al,es:[di+2]
or  al,$E0        { yellow pixel / no pixel pairing }
mov es:[di+2],al

over and over with different offsets and values is not going to be that big an improvement over

Code:
lodsw
and al,es:[di]
or  al,ah
stosb

At least, not if using a backbuffer in system RAM and then direct-blitting the back-buffer to screen... Those 12-clock EAs are gonna be murder... though when you have two non-transparent pixels side-by-side on the byte boundary, or even better four, it could then see a speedup and not consume large amounts of codespace.

I'll have to play with that again, I've not seen or played with immediate blitting in like a decade... and certainly never done so in a 4 bit graphics mode.

Though as it is, I'm back at 128K as my target for PC/Tandy, since I don't need a base-surface buffer for this game, just a backbuffer... It turns out I was using a larger backbuffer than I needed on Paku Paku to keep the code simple -- so instead of two 8K buffers, I'm going to have a single 7K buffer.
 
Where I'd use a cmp or test rather than waste RAM...

Lookup tables are, by definition, the quintessential size vs. speed tradeoff. You were talking about speed of address calculation, so I was offering up the suggestion of using lookup tables. While XLAT does not immediately help you with that because you require word-sized values, I thought based on your comments that you didn't understand XLAT, so I thought I would help out since it's an invaluable speedup tool on 808x CPUs.

A test/branch (assuming reg,reg) can take either 7 cycles (no jump) or 24 cycles (jump). xlat always takes 11. Branches take roughly 4 cycles if they fall through, but 17 cycles if they jump, and another penalty for jumping is that the prefetch queue gets emptied. This is why Abrash spends several chapters on "Don't Jump!" (ie. thinking like a CPU and structuring your code to avoid branches as much as possible, or at least structure the jumps to favor the fall-through case) in Zen of Assembler.

Ok, I could see using 16 bytes to avoid using cmp/jmp in this case... though the flipping back and forth between bl and al could get annoying...

No flipping. You point bx to the table, then run through your buffer.

Let's think of this as a task where you have to translate a string of non-trivial length (say, 80 characters) to uppercase. Here's that loop again:

Code:
  lds si,buffer ;also has translation table
  les di,buffer ;same as source
  mov bx,offset transtable
  mov cx,bufsize
@translate:
  lodsb
  xlat
  stosb
  loop @translate

If you only had to translate one character, the above is neither helpful nor fast. But for a loop where you have to run through more than a few characters, it is significantly faster. Once you do the setup (the first four lines in the example), all you have is the inner loop which loads a byte, translates it, and stores it back. On 808x there is simply no faster way to do this. The tradeoff, as you noted, is burning up 256 bytes for a translation table.

I am not trying to sell you on XLAT, just trying to get you to understand and think about it. Everything-to-uppercase is a simple example; how about everything-except-vowels-to-uppercase? You'd have to test for 5 different characters in a typical test/branch construct, but with lookup tables, you use exactly the same code I posted above; only the lookup table itself changes.
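
For instance, the everything-except-vowels-to-uppercase table could be built something like this (a sketch; build it once, and the loop above never changes):
Code:
{ translation table setup -- the xlat loop itself stays identical }
var
  transTable : array[0..255] of char;
  i : integer;
  c : char;
begin
  for i := 0 to 255 do
    transTable[i] := chr(i);             { default: pass everything through }
  for c := 'a' to 'z' do
    transTable[ord(c)] := upcase(c);     { lowercase -> uppercase... }
  transTable[ord('a')] := 'a';
  transTable[ord('e')] := 'e';
  transTable[ord('i')] := 'i';
  transTable[ord('o')] := 'o';
  transTable[ord('u')] := 'u';           { ...except the vowels }
end.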
 
I took some time today to get a second bone-stock PCjr working so I don't have to run through the hassle of taking apart and putting together my expanded one. Right now my expanded Jr is the only 360K floppy machine that can reach the network, so it's the only way I have to write 360K disks. The bad part is I can't make another boot disk until I get my XT working.
 
What the? ANOTHER of my posts in this thread went AWOL... what gives? Thankfully I typed it into notepad2 first...

I was just thinking about compiled sprites and looking at the examples on those pages -- and while they might be acceptable on a 386 or higher, or even a 286... they are DISASTROUSLY bad on an 8088 the way they are coded.

How can I say this? Simple: using a segment override + displacement means not only a MASSIVE EA calculation, but a 6-byte instruction, so even in the best-case scenario we're talking 8 clocks of hanging the BIU.

Let's take one of the examples:

Code:
                          Bytes   EU clocks   BIU fetch next
                                                    24
mov es:[di+62],33153        6         25             8
mov es:[di+64],33410        6         25             8
mov es:[di+66],33410        6         25             4
mov byte ptr es:[di+68],129 5         25             8
mov es:[di+378],33153       6         25             8
mov es:[di+380],33410       6         25             8
mov es:[di+382],33410       6         25             8
mov es:[di+384],33667       6         25

Works out to 47 bytes, 200 EU clocks, 76 BIU 'hung' clocks... 276 clocks total to run 'real world'. What if we were to just put the value into AX and use STOSW/STOSB, throwing in additions to DI as needed?

Code:
                          Bytes   EU clocks   BIU fetch next
                                                    16
add di,62                   4          4            12
mov ax,33153                4          4             0
stosw                       1         15             5
mov ax,33410                4          4             0
stosw                       1         15             0
stosw                       1         15             2
mov  al,129                 3          4             0
stosb                       1         11            16
add di,309                  4          4            12
mov ax,33153                4          4             0
stosw                       1         15             5
mov ax,33410                4          4             0
stosw                       1         15             0
stosw                       1         15             5
mov ax,33667                4          4             0
stosw                       1         15

That's 39 bytes, 148 EU Clocks, 73 BIU 'hung' clocks... so around 221 clocks... 20% faster while being less code.

Displacements suck... segment overrides suck... immediate values suck... opcodes larger than 4 bytes suck... so let's put them all together in the same instruction and then run them all back-to-back so the BIU is always full? I think not.

Though it's probably great on 286+ -- and the examples are OBVIOUSLY for VGA; so many people coding for VGA tended to say "screw anyone with less than a 386".

-- edit -- my math was off above somewhat. The example code should be 265 clocks EU+BIU combined, the rewrite using stosw should be 223 clocks.
 