
Optimizing a QuickBasic game

I would also say, judging by the difference on the Pentium, that the floating point is likely causing quite a bit of slowdown. In my experience it is best to avoid floating point whenever you can on everything before the 486DX.

I would say before Pentium.
While the 486DX had an onboard FPU for the first time, it wasn't exactly fast. The Pentium's FPU was massively improved.
On a 486DX, fixed-point math still reigns supreme in most cases.
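For anyone who hasn't done it before, the trick behind fixed-point is to scale your values by a power of two and keep all the math in integers. A minimal sketch in Python (illustrative only; the 16.16 format and the names are my choice, not anything from QB):

```python
# Minimal 16.16 fixed-point sketch: values are stored as integers
# scaled by 2^16, so all arithmetic stays in the integer unit.
FRAC_BITS = 16
ONE = 1 << FRAC_BITS

def to_fixed(x):
    return int(x * ONE)

def fixed_mul(a, b):
    # The product of two 16.16 numbers is 32.32; shift back down.
    return (a * b) >> FRAC_BITS

def fixed_div(a, b):
    # Pre-shift the numerator so the quotient keeps its fraction bits.
    return (a << FRAC_BITS) // b

def to_float(a):
    return a / ONE
```

(Python's `>>` and `//` floor toward negative infinity, so negative values behave slightly differently than they would with x86 arithmetic shifts plus truncating division; the sketch is meant for the concept, not the exact semantics.)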
 
Hey, I think I got the high score for this game at the game jam. Nice to see you improving it. :)
Neat discussion to read. I haven't used QB since before I learned any proper comp sci. As a kid I was unaware of what led to good or bad performance in QB or in DOS graphics in general. Makes sense that you want to reduce pixel drawing as much as possible.

I think I can implement "smart updates" for the enemies & player movement along the Z axis without too much pain, I'll start from there and see how much gains that gets me. The problem is when the player moves on X or Y *everything* on the screen shifts, so I'm forced to choose between either keeping a shadow mask of where everything was in memory (maybe not a problem?) or just shotgunning somewhat-oversized black boxes over it all. Unless there's another obvious way to do it that I'm missing?

I think the shotgun approach should outperform the CLS approach with pretty minimal work. Even if the boxes aren't aligned quite right at the beginning, you should get an idea of how much it helps.

In fact, you can probably see the FPS delta between CLS vs no screen clearing at all and get an idea of how much speedup is potentially possible with optimizing your redraw logic.
 
^^ Haha didn't you wrap the score around to -32768? (that bug has long been fixed BTW. :P )

FWIW the version I demoed at the end of the Jam was basically stuck at ~5 FPS on a 1GHz P3 (or an i7 Macbook running DOSBox) due to rushed & shoddy code. The fact that I'm now getting close to that on a 386DX makes me pretty pleased. This was my first hobby coding project in at least 15 years so yeah, I was pretty rusty when I started.

As much as I complain about having to code in Matlab/Octave for work, I'm REALLY missing the vector & matrix operations in this thing. Maybe I should use MMX. :P
 
Welp, tried a 'smart clear.' I wrote what I thought was a pretty neat routine that generates a "shadow" for all objects on-screen and saves that. Essentially when it calculates the screen projection of each object, it stores the upper-left & lower-right bounds in an array, and when it loops around and hits the draw routine again it blanks the object's previous location by drawing the smallest possible box that covers it.

It was slower.

Like, a lot slower. I lost half my speed on the 386 and about 10-15% on faster machines (5x86/100, P233MMX).

Optimizing it by only drawing the shadows of things that move doesn't help much (I made a half-assed attempt at this) because when the player moves in X/Y (which is probably 80% of the gameplay!) you have to draw all 576 shadows anyway. This is naturally right when the framerate matters most.

It was also bugged to hell on the first attempt; it doesn't seem to clear both video pages. TBH I'm not inclined to put any more time into fixing it.

Incidentally I wrote a really simple program to benchmark this on the 386 - clearing the screen ran at 27 iterations per second, while drawing 576 3x3 boxes (simulating a full-grid clear for both eyes in my program) ran at 3.6 iterations/sec. I don't think you guys who are used to blitting bits to vram in assembly appreciate how slow the drawing routines I'm using really are. :P

I think there are some gains to be had, but not this way. It doesn't know if any of the shadow elements overlap, so some parts are getting blanked twice, but I don't have a good way to deal with that. It probably makes more sense to do a single box clear that extends from the upper left grid square to the lower right (taking advantage of the fact that the left & right eye projections of objects are always at the same Y value), but I estimate I'd still be covering 80% of the screen each frame, so I'm not sure it's going to help much. Maybe in DK1 mode.
 
Note that drawing 576 3x3 boxes per frame, with very little else being done, maxes out at 3.6 FPS on the 386. My game is already getting 2.7 FPS and it essentially does that plus a bunch of other stuff. I'm actually getting close to the limit of how fast it can run without a major restructuring.

...restructuring like, right now I read through the z-buffer in order and draw to video memory. Should I instead be rendering a flat projection to an array in system memory and copying that to vram to avoid overwrites? (That's a bit of a pain as QB doesn't allow an array big enough to hold a video frame... I think I'd need 4 per frame to do that & at some point I'm gonna be getting out of memory errors because AFAIK it's not using EMS or XMS.)
 
I think the key problem with performance here is that pixel-oriented drawing is extremely slow in bitplane modes like the one you're using.
I had a similar problem with the sprite part of 8088 MPH: there are multiple pixels packed into a single byte. A pixel-exact clear of each sprite would take a lot of read-modify-write operations to clear only the pixels belonging to the sprite.
Instead, I made a 'lossy' clear routine, which worked on byte-boundaries only, so it never had to do any read-modify-write. It just overwrote entire bytes with the background.

In your case, it may be even more of a win. A byte contains 8 pixels in 16-colour EGA/VGA modes, so trying to erase 3x3 areas is always going to require read-modify-write operations.
If instead you just make a clear-routine that either overwrites an 8x3 area, or a 16x3 area, you can do it with simple byte and word-writes (you'd need to clear 2 adjacent bytes if the 3 pixels cross a byte-boundary).
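To make the byte-boundary point concrete, here's a tiny Python sketch (a hypothetical helper, just to illustrate the arithmetic) showing which byte columns a 3-pixel-wide erase touches when 8 pixels share a byte:

```python
def bytes_touched(x, width=3):
    """Return the range of byte columns covered by pixels x .. x+width-1
    in a planar mode with 8 pixels per byte."""
    first = x // 8
    last = (x + width - 1) // 8
    return first, last

# A 3-pixel area fits in one byte unless it straddles a boundary;
# a 'lossy' clear just overwrites those whole bytes (8 or 16 pixels
# wide), which avoids read-modify-write entirely.
```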

Even so, it's difficult to say beforehand whether or not this will be a win in your specific case. Since you have 576 different objects, that is a lot of calls to clear-routines, and there may be considerable overdraw.
Sometimes it's just faster to do a bruteforce clearscreen than trying to be 'smart'.

In that case, the most obvious 'optimization' is just to reduce the visible window on screen somewhat, like the 'letterboxing' done on movies. If you leave a border on the sides and/or on the top and bottom of the screen where you never draw anything, that's just fewer pixels to update.
 
Welp, tried a 'smart clear.' I wrote what I thought was a pretty neat routine that generates a "shadow" for all objects on-screen and saves that. Essentially when it calculates the screen projection of each object, it stores the upper-left & lower-right bounds in an array, and when it loops around and hits the draw routine again it blanks the object's previous location by drawing the smallest possible box that covers it.

You're doing it wrong. Use dirty rectangles.

Essentially, you want to keep track of what needs to be erased and then erase everything optimally. Meaning, if you have 100 small sprites in a 160x120 area, you don't issue 100 tiny erase calls, but rather a single 160x120 call.
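As a sketch of the 'one big erase' idea (Python, purely illustrative; the function name is mine):

```python
def bounding_rect(rects):
    """Merge a list of dirty rectangles (x0, y0, x1, y1) into the single
    smallest rectangle covering them all, so one erase call suffices
    instead of one call per sprite."""
    if not rects:
        return None
    x0 = min(r[0] for r in rects)
    y0 = min(r[1] for r in rects)
    x1 = max(r[2] for r in rects)
    y1 = max(r[3] for r in rects)
    return (x0, y0, x1, y1)
```

(In a real dirty-rectangle system you might also split distant clusters into a few separate rects rather than one huge one, but the principle is the same: batch the erases.)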
 
In SCREEN 13 using a filled box (LINE (0,0)-(319,199),0,BF) is much faster than using CLS. Might be worth a try in SCREEN 9 as well.

I looked at GRIDFLIP.BAS

The perspective calculations for the grid look extremely expensive. 4608 floating point operations (at a glance) per frame.
Code:
  FOR zb = 8 TO 1 STEP -1
    FOR B = 1 TO 6
      FOR A = 1 TO 6
        flatXL = (((Grid(zb, A, B).X - ReyeX) / (Grid(zb, A, B).Z - eyez)) * HVW) + (HVW / 2) + RiftFudge
        flatXR = (((Grid(zb, A, B).X - LeyeX) / (Grid(zb, A, B).Z - eyez)) * HVW) + (3 * HVW / 2) - RiftFudge
        flatY = (((Grid(zb, A, B).Y - eyeH) / (Grid(zb, A, B).Z - eyez)) * VH) + HVH
        flatXL = flatXL + eyeoffset
        flatXR = flatXR + eyeoffset
        flatY = flatY + offsetY
...

Since you only have 6x6 possible states for your projected coordinates, why not precalculate them into an integer lookup? The size should only be about 20 KB.
But even that would be unnecessary, since you only need to calculate the first Z-row of points in your grid and the distance to the next point. Then you can just use integer addition to get the next points at that height.

Something like this maybe:
Code:
  halfHVW = (HVW / 2)
  halfHVWx3 =  halfHVW * 3
  
  FOR zb = 8 TO 1 STEP -1
    baseXL = (((Grid(zb, 1, 1).X - ReyeX) / (Grid(zb, 1, 1).Z - eyez)) * HVW) + halfHVW + RiftFudge
    baseXR = (((Grid(zb, 1, 1).X - LeyeX) / (Grid(zb, 1, 1).Z - eyez)) * HVW) + halfHVWx3 - RiftFudge
    baseY  = (((Grid(zb, 1, 1).Y - eyeH) / (Grid(zb, 1, 1).Z - eyez)) * VH) + HVH
    distance = ((((Grid(zb, 2, 1).X - ReyeX) / (Grid(zb, 2, 1).Z - eyez)) * HVW) + halfHVW + RiftFudge) - baseXL
    ydist = -distance + offsetY
    
    FOR B = 1 TO 6
      ydist = ydist + distance
      xdist = -distance + eyeoffset
      flatY = baseY  + ydist
      
      FOR A = 1 TO 6
        xdist = xdist + distance        
        flatXL = baseXL + xdist
        flatXR = baseXR + xdist

...

Most of the objects are rendered with QBasic's unoptimized geometry primitives, like CIRCLE (flatXR, flatY), 5, 15.
It might be much faster to prerender them, store them with GET, and draw them with PUT in-game.
 
All the variables referenced in those lines are INTs, but I don't know how the interim result (the stuff in the brackets) gets calculated before it gets rounded.

You should be able to use integer division then (backslash instead of slash), with the multiplication done before the division:

Code:
 FlatXL = (Grid(zb, a, B).X - ReyeX) * HVW \ (Grid(zb, a, B).Z - eyez)

if proper rounding is important:

Code:
 FlatXL = ((Grid(zb, a, B).X - ReyeX) * HVWtimes2 \ (Grid(zb, a, B).Z - eyez) + 1) \ 2
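For the curious, the doubling trick is easy to sanity-check. A quick Python check (note that Python's // floors while QB's \ truncates toward zero, so the two only agree for positive operands):

```python
def div_round(n, d):
    # Integer round-to-nearest for positive n and d, with halves
    # rounding up: double the numerator, divide, add 1, halve.
    return (n * 2 // d + 1) // 2
```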
 
You're doing it wrong. Use dirty rectangles.

Essentially, you want to keep track of what needs to be erased and then erase everything optimally. Meaning, if you have 100 small sprites in a 160x120 area, you don't issue 100 tiny erase calls, but rather a single 160x120 call.

Thanks! I looked into that. I can probably do it with six rectangles that cover each grid row and leave the black spaces between alone (there aren't actually complete black lines between grid rows that often), but I'm not even sure that's necessary.

Considering this & Scali's earlier post, I wrote a "smart-sizing" routine that clears the screen with one rectangle: it sets the dimensions based on the upper-left & lower-right points each frame and intelligently resizes if you zoom out, so you can shrink the screen for a performance boost. I did that, and got this (ignore the blue; it's a debug feature so I could see where the rectangle was being drawn).

[attached screenshot: 3fps.jpg]


Yeah, it's a milestone. :P Considering the maximum speed of the draw routine without doing a screen clear, geometry calculation, or any page flipping is 3.6 FPS, I'm definitely hitting a point of diminishing returns. I think I'll go with the single-box smart-sizing clear for now, it still needs some tweaks but seems to be the best compromise on this system.

(FWIW I don't really expect to see much improvement on this machine but it's nice to test on because it's the slowest thing I have. I'm targeting 10FPS on a 486DX, or 15-20 on a DX2/DX4, which I'm already getting.)
 
In SCREEN 13 using a filled box (LINE (0,0)-(319,199),0,BF) is much faster than using CLS. Might be worth a try in SCREEN 9 as well.

This is true in mode 9 too (see above), even if I clear the full screen, but the difference isn't as dramatic. Resizing the clear box dynamically definitely helps.

I looked at GRIDFLIP.BAS

The perspective calculations for the grid look extremely expensive. 4608 floating point operations (at a glance) per frame. [...]

Since you only have 6x6 possible states for your projected coordinates, why not precalculate them to an integer lookup? The size should only be 20kb.

The only problem with this approach is I left the geometry completely variable on purpose, so that users can adjust for their headset (or I can add support for extra headsets easily by creating a new geometry profile) and I don't want to sacrifice that. You can change all kinds of things in-game - the zoom level, eye spacing, stereo balance, even the grid position on-screen.

Otherwise I could just use simple + & - ops for *all* X/Y movements, essentially turning it into a 2D parallax scroll. I think I prefer it to be a real 3D engine conceptually, though.

But even that would be unnescessary, since you only need to calculate the first Z-row of points in your grid and the distance to the next point. Then you can just use integer addition to get the next points at that height.

Something like this maybe:
Code:
  halfHVW = (HVW / 2)
  halfHVWx3 =  halfHVW * 3
  
  FOR zb = 8 TO 1 STEP -1
    baseXL = (((Grid(zb, 1, 1).X - ReyeX) / (Grid(zb, 1, 1).Z - eyez)) * HVW) + halfHVW + RiftFudge
    baseXR = (((Grid(zb, 1, 1).X - LeyeX) / (Grid(zb, 1, 1).Z - eyez)) * HVW) + halfHVWx3 - RiftFudge
    baseY  = (((Grid(zb, 1, 1).Y - eyeH) / (Grid(zb, 1, 1).Z - eyez)) * VH) + HVH
    distance = ((((Grid(zb, 2, 1).X - ReyeX) / (Grid(zb, 2, 1).Z - eyez)) * HVW) + halfHVW + RiftFudge) - baseXL
    ydist = -distance + offsetY
    
    FOR B = 1 TO 6
      ydist = ydist + distance
      xdist = -distance + eyeoffset
      flatY = baseY  + ydist
      
      FOR A = 1 TO 6
        xdist = xdist + distance        
        flatXL = baseXL + xdist
        flatXR = baseXR + xdist

...

Wow, thanks for looking into it so much! If I'm reading this right, you've essentially made a hybrid version that calculates the upper left and then bases everything else off that? That should work with my geometry adjustment controls too. I'm gonna give this a try & see how it does.

I also think I can get away with mirroring the right eye from the left (just the projected grid positions, not the objects on the grid) and eliminate all the floating point calculations for flatXR. *** EDIT: that got me a bunch of speed, but of course doesn't work with the way the game board moves around to react to the player movements. And that's done as part of the 3D geometry calculation, so there's no good 2D way to do any mirroring. The board isn't symmetrical in X unless the player is in the exact middle, which can't happen. I should remember how my own game is written.

Most of the objects are rendered with QBasic's unoptimized geometry primitives, like CIRCLE (flatXR, flatY), 5, 15.
It might be much faster to prerender them, store them with GET, and draw them with PUT in-game.
I thought you couldn't use GET & PUT with double-buffering? I.e., you can't PUT things onto the offscreen buffer.

The original plan was for the player & enemy sprites to be 3D wireframe objects, but obviously that didn't materialize (I don't have a very easy way to "project" things off the grid, especially in Z.)
 
The only problem with this approach is I left the geometry completely variable on purpose, so that users can adjust for their headset (or I can add support for extra headsets easily by creating a new geometry profile) and I don't want to sacrifice that.

The lookup could be recalculated when those variables would be adjusted. In case the single Z-row + integer addition works well enough, you would only need (2 x 6 x 6 x 3) == 216 integers stored per recalc. That should still be well under a second even on a slow machine.

Although I can understand features regarding VR could easily grow beyond that.

I thought you couldn't use GET & PUT with double-buffering? I.e. you can't PUT things onto the offscreen buffer
Ah, I didn't know that.
 
Happy new year guys! :) Working on this again.

Here's a new version if anyone wants to try it out. It's somewhat of a release candidate for 1.3.

Binary (DOS only for now): https://www.dropbox.com/s/5w7gz517jkhij6k/GRID3D21.EXE?dl=0
Source: https://www.dropbox.com/s/nckiq4klgkv2lfx/GRID3D21.BAS?dl=0

I've basically taken all your suggestions and worked them into this one, to varying degrees. The last HUGE improvement was using iterative addition-based geometry calculation based on Mangis's routine above (thanks TONS for that!) I'm still not pre-calculating anything, but this was a nice compromise in flexibility and ease of implementation for pretty big gains. I'm now seeing ~4.5 FPS on the 386DX, which is what the original game jam version ran at on a P3/1000 :shock: and far faster than the "theoretical" maximum I calculated with the old render routine. Also does an easy 40-60FPS on my Pentium MMX (with vsync disabled).

Optional command line parameters:
LOWRES, NOVSYNC (self-explanatory)

I'm pretty proud of how far this has come. On slow systems you can use the LOWRES parameter and shrink the playfield (or use Oculus DK1 mode) for big speed gains. It scales well.

Some maybe-useful debug keys:
~ - debug info drop-down
! - displays framerate when you quit with ESC (resets if you die)

[ ] - change eye spacing
( ) - change stereo balance (essentially changes the distance between the eyes POST render)
, . - zoom game board in/out
7/8/9/0 - nudge game board around the screen

There are some other secret debug keys that do silly things and a "secret" VR mode (WIP) that shouldn't be too hard to find.

If you give this a try and note any bugs or things that look like bugs, I would LOVE to hear them!

I particularly had trouble with the screen-clearing routine not clearing everything when messing with the geometry, and rounding pixel coordinates causing compound errors over time. I'm actually not doing rounding anymore beyond what PSET does when fed a decimal value, it wasn't necessary. Both those issues should be fixed (unless you zoom in so much it wraps around to negative zoom.)
 
If anyone has a CyberMaxx or Virtual-I/O iGlasses headset (or a VFX1, so I can confirm the geometry profile I made on mine is right for all of them) and wants to help me add support for these devices, I would love to hear from you.
 
I did try writing an "optimized PSET" in assembly based on this example code but it ended up being slower than using the draw routines. What happens is every time it plots a pixel it has to jump to a SUB, declare some variables, iterate a loop, etc. It seems LINE and BOX are more optimized than calling an assembly routine once for each pixel. :P (I mostly expected that based on your comments above, but wanted to try for myself.)
I just stumbled across this thread 8 years later, and wish I had been there at the time to engage. This looks like it was a really fun time
:)

I want to share some additional information, even though I'm sure you're probably way beyond caring about this at this point. It has to do with why PSET is so slow in modes like SCREEN 12. It's because of an optimization done in the hardware. It was necessary because the RAM used by the video adapter wasn't fast enough to just directly supply pixels at the rate needed to rasterize the screen.

The adapter operates on a "dot clock", and every time there's a dot clock, it advances to the next pixel. But here's the thing: the memory banks that supply the pixel data can only supply one bit of data per dot clock. (They supply 8 bits at a time, but the reads can't be done fast enough to satisfy more than 1 bit per dot clock.) A pixel, though, is 4 bits wide.

So how is this reconciled? The VGA adapter has four RAM banks that operate in parallel. Each time a read is done, each one of them supplies a byte of data. That means that each read is 32 bits wide -- but effectively only happens once every 8 dot clocks. Each of these banks is called a "plane".

There are different modes for how they get recombined. CGA came up with a unique way to interleave the data. It's called "shift interleave", and it alternates between planes from pixel to pixel using a pair of planes. It also alternates which pair of planes is used from scan line to scan line.

Code:
First plane:  00224466
Second plane: 11335577

y = 0, 2, 4, ...: planes 0 and 2
y = 1, 3, 5, ...: planes 1 and 3

Heck of a thing :-)

Mode 13h (SCREEN 13) takes it in the opposite direction, and is as direct and uncomplicated as it is possible to be. It activates a mode called "chain 4" which remaps the memory's addressing system. In chain 4 mode, the RAM chip addressing is completely rewired so that when you ask for the byte at offset 0, it actually reads bits 7 and 6 from each of planes 0, 1, 2, 3 and strings them together. That is to say, the first 4 bits of the byte are spread across the planes, and then the next 4 bits of the byte are spread across the planes again one bit along. All of the bits are done like this transparently behind the scenes, so you can just read or write any byte of the 64000 pixels in SCREEN 13 with no crazy math or extra work.

On the back-end, this works with the 4-bits-per-dot-clock system previously described in the following way: The effective dot clock is halved (which is why there are 320 columns rather than 640), but the full dot clock rate is used for reads. It reads half of each byte on each real dot clock (4 bits, 1 from each plane), which means it pulls in two of these 4-bit groups for each effective dot clock. It then recombines them on the effective dot clock to make the full 8 bits per pixel.

Code:
Plane 0: ABCDEFGH
Plane 1: ABCDEFGH
Plane 2: ABCDEFGH
Plane 3: ABCDEFGH
                        .----- read in the even dot clock
                        |   .- read in the odd dot clock
Pixel 0 at A000:0000: AAAABBBB
Pixel 1 at A000:0001: CCCCDDDD
Pixel 2 at A000:0002: EEEEFFFF
Pixel 3 at A000:0003: GGGGHHHH

After each second read finishes, it's got the full 8 bits to send to the RAMDAC to look up the RGB colour in the palette to send down the wire to the monitor.

You don't generally have to worry about this, because the remapping means that your view of VRAM is a flat, consecutive sequence of pixel bytes. How the hardware is handling it isn't really important for programming it. :-) But, it helps to understand the world the VGA is living in: It really, fundamentally works by combining a bit from each of the planes to reconstitute pixels. It does this because that's how it can stream the data fast enough to generate a 640x480 signal 70 times per second.

So, bringing this full-circle, this provides the context needed to understand what's going on with pixels in modes like SCREEN 12. Unlike SCREEN 13, there's no remapping of the memory in SCREEN 12 to make things into friendly linear bytes per pixel. The host sees what's actually in the VGA planes as they are. That means that every byte you read/write contains bits from multiple pixels (8 different pixels) and yet holds only part of any one of those pixels (1 of the 4 bits that make up the pixel).

Code:
         (one byte)   at memory address
Plane 0: 01234567     A000:0000 when read map is set to plane 0
Plane 1: 01234567     A000:0000 when read map is set to plane 1
Plane 2: 01234567     A000:0000 when read map is set to plane 2
Plane 3: 01234567     A000:0000 when read map is set to plane 3

Logical pixel (0, 0): 0000 combined from all 4 planes
Logical pixel (1, 0): 1111 combined from all 4 planes
Logical pixel (2, 0): 2222 combined from all 4 planes
etc.

In order to randomly PSET a single pixel on the screen, then, you have to account for both of these things: You figure out the plane offset of the pixel, and then you have to read what's there in order to leave the adjacent pixels untouched. Then, you swap out just the bit for your pixel and write it back. Now you've written just 1 of the bits for your pixel, so you have to do it again for the 2nd bit on plane 1, the 3rd bit on plane 2 and the 4th bit on plane 3.

This is an incredibly inefficient process. No amount of optimization is going to make writing a single pixel fast. To compare:

- In SCREEN 13, writing a pixel looks like this:

Code:
o = y * 320 + x
POKE o, colour ' just one write :-)

- In CGA modes like SCREEN 1, writing a pixel is a bit more involved:

Code:
IF (y AND 1) = 0 THEN plane = (x AND 1) * 2 ELSE plane = 1 + (x AND 1) * 2 ' parentheses needed: = binds tighter than AND in QB
o = plane * 8192 + y * 40 + x \ 2
shiftfactor = 1
FOR i = x AND 3 TO 2: shiftfactor = shiftfactor * 4: NEXT i ' Could be a LUT (pixel 0 -> 64, pixel 3 -> 1)
bitmask = 3 * shiftfactor
existingdata = PEEK(o) ' MAIN BOTTLENECK: one read
existingdata = existingdata AND NOT bitmask ' clear just your pixel
existingdata = existingdata OR (colour * shiftfactor) ' replace it with the new bits
POKE o, existingdata ' MAIN BOTTLENECK: one write

- But, when you get to the planar modes like SCREEN 12:

Code:
o = y * 80 + x \ 8
vrambit = 1
FOR i = x AND 7 TO 6: vrambit = vrambit * 2: NEXT i ' Could be a LUT (pixel 0 -> 128, pixel 7 -> 1)

pixelbit = 1
FOR plane = 0 TO 3 ' here's where the hurt comes
  OUT &H3CE, 4: OUT &H3CF, plane ' enable reading only from the current plane
  OUT &H3C4, 2: OUT &H3C5, pixelbit ' Sequencer Map Mask: enable writing only to the current plane

  packeddata = PEEK(o) ' even though it's the same o, this is reading different bytes because of switching between planes

  IF (colour AND pixelbit) = 0 THEN
    packeddata = packeddata AND NOT vrambit ' zero just the bit for the pixel we're writing
  ELSE
    packeddata = packeddata OR vrambit ' set just the bit for the pixel we're writing
  END IF

  POKE o, packeddata ' now write the updated byte back, having changed one bit in it

  pixelbit = pixelbit * 2 ' pixelbit always has exactly one bit set, and that bit is number 'plane'
NEXT plane

There might be some optimizations possible with different write modes. The VGA has 4 of them, and I think some operate directly in the card on the last byte read. But this should make it patently clear why there is a vast performance difference between PSET in SCREEN 13 and PSET in SCREEN 12.

* SCREEN 13: write one byte
* SCREEN 12: 4 port I/O, read byte, write byte, 4 port I/O, read byte, write byte, 4 port I/O, read byte, write byte, 4 port I/O, read byte, write byte

...just to set one pixel. Ouch.

If you want to see significant performance improvements, then it's going to end up being algorithms that figure out how to write more than one pixel at once. For instance, you can make a "horizontal line" function that takes into account the fact that multiple target pixels are in a given byte and combines the update operations. You still need to cycle through the planes, which requires 16 port I/O ops generically, but you can do that just once for the entire span, rather than per-pixel. Then when writing the data, for bytes where all 8 pixels are being overwritten, you don't need to read the existing byte at all -- just one POKE and you've set the bit for 8 pixels in one go.
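As an illustration of that span decomposition (Python, names are mine; bit 7 is taken as the leftmost pixel of each byte, as on the VGA):

```python
def span_masks(x0, x1):
    """Split the pixel span x0..x1 (inclusive) in an 8-pixels-per-byte
    planar mode into (byte_column, mask) edge writes plus a run of full
    byte columns that need no read at all."""
    first_byte, last_byte = x0 // 8, x1 // 8
    left_mask = 0xFF >> (x0 % 8)                 # pixels x0.. within first byte
    right_mask = (0xFF << (7 - x1 % 8)) & 0xFF   # pixels ..x1 within last byte
    if first_byte == last_byte:
        # Whole span fits in one byte: one read-modify-write.
        return [(first_byte, left_mask & right_mask)], []
    edges = [(first_byte, left_mask), (last_byte, right_mask)]
    full = list(range(first_byte + 1, last_byte))  # plain POKEs, 8 px each
    return edges, full
```

Only the two edge bytes need the read-modify-write dance; everything in between is a straight write.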

You can then use that horizontal line function as the basis for other algorithms, such as drawing lines or ellipses by decomposing them into horizontal spans rather than pixels.

QBASIC's LINE and CIRCLE functions work in exactly this way, allowing them to be insanely faster than setting the equivalent pixels with PSET.

Another optimization is that if you're writing the same bit to multiple planes, you can actually enable writing the same data in parallel to those planes in a single write operation. There are limitations to when this is useful, but for instance it can be used to very quickly blank the screen.
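Here's a toy Python model of that parallel-write behaviour (purely a simulation of the Sequencer Map Mask idea, not real hardware access; all names are mine):

```python
# Toy model of the VGA Sequencer Map Mask (index 2): a single host
# write lands on every plane whose bit is set in the mask, in parallel.
planes = [bytearray(80 * 480) for _ in range(4)]  # 640x480, 4 planes

def masked_write(offset, value, map_mask):
    for p in range(4):
        if map_mask & (1 << p):
            planes[p][offset] = value

# Blanking the screen: enable all 4 planes (mask 0xF) and write zeros;
# each single write then clears 8 pixels across all 4 planes at once.
masked_write(0, 0x00, 0xF)
```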

Another optimization is, if you're drawing a sprite, you can do an algorithm that only switches through the planes once for the entire sprite (or indeed the entire screen, if you're drawing multiple sprites).

It all comes down to ways to combine the heavy operations -- do multiple things per plane switch, and where possible, multiple pixels per memory read/write.
 