I did try writing an "optimized PSET" in assembly based on this example code, but it ended up being slower than using the draw routines. What happens is every time it plots a pixel it has to jump to a SUB, declare some variables, iterate a loop, etc. It seems LINE and BOX are more optimized than calling an assembly routine once for each pixel.

(I mostly expected that based on your comments above, but wanted to try for myself.)
I just stumbled across this thread 8 years later, and wish I had been there at the time to engage. This looks like it was a really fun time.
I want to share some additional information, even though I'm sure you're probably way beyond caring about this at this point. It has to do with why PSET is so slow in modes like SCREEN 12. It's because of an optimization done in the hardware. It was necessary because the RAM used by the video adapter wasn't fast enough to directly supply pixels at the rate needed to rasterize the screen.
The adapter operates on a "dot clock", and on every tick of the dot clock it advances to the next pixel. But here's the thing: the memory banks that supply the pixel data can only supply one bit of data per dot clock. (They supply 8 bits at a time, but the reads can't be done fast enough to satisfy more than 1 bit per dot clock.) A pixel in SCREEN 12, though, is 4 bits wide. So how is this reconciled? The VGA adapter has four RAM banks that operate in parallel. Each time a read is done, each one of them supplies a byte of data. That means that each read is 32 bits wide -- but effectively only happens once every 8 dot clocks. Each of these banks is called a "plane".
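To put rough numbers on that, here is the arithmetic as a small Python sketch. The 25.175 MHz dot clock is the standard figure for 640x480 on VGA; the constant names are just mine for illustration.

```python
# Rough numbers for 640x480 in 16 colours. The dot clock value is the
# standard VGA one; everything else follows from the description above.
DOT_CLOCK_HZ = 25_175_000
BITS_PER_PIXEL = 4
PLANES = 4
BITS_PER_PLANE_READ = 8

# One read fetches a byte from every plane at once.
bits_per_read = PLANES * BITS_PER_PLANE_READ            # 32 bits
# Each dot clock consumes one 4-bit pixel, so one read lasts this long.
dot_clocks_per_read = bits_per_read // BITS_PER_PIXEL   # 8 dot clocks
reads_per_second = DOT_CLOCK_HZ / dot_clocks_per_read

print(bits_per_read, dot_clocks_per_read, reads_per_second)
```

So each plane only has to deliver a byte about 3.1 million times per second, even though pixels go out about 25 million times per second.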
There are different modes for how they get recombined. The CGA-compatible modes use a very unusual way to interleave the data. It's called "shift interleave", and it alternates between planes from pixel to pixel using a pair of planes. It also alternates which pair of planes is used from scan line to scan line.
Code:
First plane: 00224466
Second plane: 11335577
y = 0, 2, 4, ...: planes 0 and 2
y = 1, 3, 5, ...: planes 1 and 3
Heck of a thing.
Mode 13h (SCREEN 13) takes it in the opposite direction, and is as direct and uncomplicated as it is possible to be. It activates a mode called "chain 4" which remaps the memory's addressing system. In chain 4 mode, the low two bits of the address you ask for select the plane: the byte at offset 0 lives in plane 0, offset 1 in plane 1, offset 2 in plane 2, offset 3 in plane 3, and offset 4 is back in plane 0. Each pixel's byte sits whole in a single plane, and consecutive pixels rotate through the four planes. All of this is done transparently behind the scenes, so you can just read or write any byte of the 64000 pixels in SCREEN 13 with no crazy math or extra work.
On the back-end, this works with the 4-bits-per-dot-clock system previously described in the following way: The effective dot clock is halved (which is why there are 320 columns rather than 640), but the full dot clock rate is still used to shift the data out. It shifts out half of a byte on each real dot clock (4 bits, all from the same plane), which means it pulls in two of these 4-bit groups for each effective dot clock. It then recombines them on the effective dot clock to make the full 8 bits per pixel.
Code:
Plane 0: AAAAaaaa
Plane 1: BBBBbbbb
Plane 2: CCCCcccc
Plane 3: DDDDdddd
                      .----- shifted out in the even dot clock
                      |   .- shifted out in the odd dot clock
Pixel 0 at A000:0000: AAAAaaaa
Pixel 1 at A000:0001: BBBBbbbb
Pixel 2 at A000:0002: CCCCcccc
Pixel 3 at A000:0003: DDDDdddd
After each second group arrives, it's got the full 8 bits to send to the RAMDAC to look up the RGB colour in the palette to send down the wire to the monitor.
You don't generally have to worry about this, because the remapping means that your view of VRAM is a flat, consecutive sequence of pixel bytes. How the hardware is handling it isn't really important for programming it.

But, it helps to understand the world the VGA is living in: It really, fundamentally works by combining a bit from each of the planes to reconstitute pixels. It does this because that's how it can stream the data fast enough to generate a 640x480 signal 60 times per second.
So, bringing this full-circle, this provides the context needed to understand what's going on with pixels in modes like SCREEN 12. Unlike SCREEN 13, there's no remapping of the memory in SCREEN 12 to make things into friendly linear bytes-per-pixel. The host sees what's actually in the VGA planes as they are. That means that every byte you read/write holds bits from 8 different pixels, and holds only 1 of the 4 bits that make up each of those pixels.
Code:
         (one byte)  at memory address
Plane 0: 01234567    A000:0000 when read map is set to plane 0
Plane 1: 01234567    A000:0000 when read map is set to plane 1
Plane 2: 01234567    A000:0000 when read map is set to plane 2
Plane 3: 01234567    A000:0000 when read map is set to plane 3
Logical pixel (0, 0): 0000 combined from all 4 planes
Logical pixel (1, 0): 1111 combined from all 4 planes
Logical pixel (2, 0): 2222 combined from all 4 planes
etc.
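If it helps to see that recombination spelled out, here is a small Python sketch that takes the four bytes found at one offset (one per plane) and rebuilds the 8 pixel colours. The function name is made up for illustration, and it assumes plane n supplies bit n of the colour, with bit 7 of each byte being the leftmost pixel.

```python
def pixels_from_planes(p0, p1, p2, p3):
    """Combine one byte from each plane into eight 4-bit pixel values."""
    pixels = []
    for bit in range(7, -1, -1):          # bit 7 is the leftmost pixel
        colour = 0
        for plane, byte in enumerate((p0, p1, p2, p3)):
            if byte & (1 << bit):
                colour |= 1 << plane      # plane n supplies bit n of the colour
        pixels.append(colour)
    return pixels

# Leftmost pixel has its bit set in planes 0, 1 and 3 -> colour 11;
# the next pixel only in plane 1 -> colour 2.
print(pixels_from_planes(0b10000000, 0b11000000, 0b00000000, 0b10000000))
# prints [11, 2, 0, 0, 0, 0, 0, 0]
```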
In order to PSET a single arbitrary pixel on the screen, then, you have to account for both of these things: You figure out the byte offset of the pixel within the plane and which bit of that byte it occupies, and then you have to read what's there in order to leave the adjacent pixels untouched. Then, you swap out just the bit for your pixel and write it back. Now you've written just 1 of the 4 bits for your pixel (the one on plane 0), so you have to do it again for the 2nd bit on plane 1, the 3rd bit on plane 2 and the 4th bit on plane 3.
This is an incredibly inefficient process. No amount of optimization is going to make writing a single pixel fast. To compare:
- In SCREEN 13, writing a pixel looks like this:
Code:
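DEF SEG = &HA000 ' point PEEK/POKE at VGA memory
o = y * 320& + x ' 320& keeps the math from overflowing if y is an INTEGER
POKE o, colour ' just one write :-)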
- In CGA modes like SCREEN 1, writing a pixel is a bit more involved:
Code:
DEF SEG = &HB800 ' CGA video memory
o = (y AND 1) * 8192 + (y \ 2) * 80 + x \ 4 ' odd scan lines live in the second 8K bank; 4 pixels per byte
shiftfactor = 1
FOR i = (x AND 3) TO 2: shiftfactor = shiftfactor * 4: NEXT i ' leftmost pixel is in the top 2 bits. Could be a LUT
bitmask = 3 * shiftfactor
existingdata = PEEK(o) ' MAIN BOTTLENECK: one read
existingdata = existingdata AND NOT bitmask ' clear just your pixel
existingdata = existingdata OR (colour * shiftfactor) ' replace it with the new bits
POKE o, existingdata ' MAIN BOTTLENECK: one write
- But, when you get to the planar modes like SCREEN 12:
Code:
DEF SEG = &HA000 ' point PEEK/POKE at VGA memory
o = y * 80& + x \ 8
vrambit = 1
FOR i = (x AND 7) TO 6: vrambit = vrambit * 2: NEXT i ' leftmost pixel is bit 7. Could be a LUT
pixelbit = 1
FOR plane = 0 TO 3 ' here's where the hurt comes
  OUT &H3CE, 4: OUT &H3CF, plane ' Read Map Select: enable reading only from the current plane
  OUT &H3C4, 2: OUT &H3C5, pixelbit ' Map Mask (on the sequencer): enable writing only to the current plane
  packeddata = PEEK(o) ' even though it's the same o, this is reading different bytes because of switching between planes
  IF (colour AND pixelbit) = 0 THEN ' parentheses needed: = binds tighter than AND
    packeddata = packeddata AND NOT vrambit ' zero just the bit for the pixel we're writing
  ELSE
    packeddata = packeddata OR vrambit ' set just the bit for the pixel we're writing
  END IF
  POKE o, packeddata ' now write the updated byte back, having changed one bit in it
  pixelbit = pixelbit * 2 ' pixelbit always has exactly one bit set, and that bit is number 'plane'
NEXT plane
OUT &H3C4, 2: OUT &H3C5, 15 ' re-enable writes to all planes so later drawing isn't affected
There might be some optimizations possible with different write modes. The VGA has 4 of them, and I think some operate directly in the card on the last byte read. But this should make it patently clear why there is a vast performance difference between PSET in SCREEN 13 and PSET in SCREEN 12.
* SCREEN 13: write one byte
* SCREEN 12: 4 port I/O, read byte, write byte, 4 port I/O, read byte, write byte, 4 port I/O, read byte, write byte, 4 port I/O, read byte, write byte
...just to set one pixel. Ouch.
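To make that tally concrete, here is a toy Python model of the planar memory that just counts operations. It is not talking to real hardware: the class and method names are made up, and "selecting a plane" is simply charged as 4 port writes (two index/data pairs), matching the count above.

```python
class PlanarModel:
    """Toy stand-in for 640x480x16 video memory: four 38400-byte planes."""
    def __init__(self):
        self.planes = [bytearray(80 * 480) for _ in range(4)]
        self.current = 0
        self.port_writes = self.reads = self.writes = 0

    def select_plane(self, plane):
        # Two index/data pairs: one for the read map, one for the map mask.
        self.port_writes += 4
        self.current = plane

    def peek(self, offset):
        self.reads += 1
        return self.planes[self.current][offset]

    def poke(self, offset, value):
        self.writes += 1
        self.planes[self.current][offset] = value

def pset(vga, x, y, colour):
    offset = y * 80 + x // 8
    vrambit = 0x80 >> (x & 7)          # leftmost pixel is bit 7
    for plane in range(4):
        vga.select_plane(plane)
        data = vga.peek(offset)        # read so the 7 neighbours survive
        if colour & (1 << plane):
            data |= vrambit
        else:
            data &= ~vrambit & 0xFF
        vga.poke(offset, data)

vga = PlanarModel()
pset(vga, 10, 5, 0b1010)
print(vga.port_writes, vga.reads, vga.writes)   # prints 16 4 4
```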
If you want to see significant performance improvements, then it's going to end up being algorithms that figure out how to write more than one pixel at once. For instance, you can make a "horizontal line" function that takes into account the fact that multiple target pixels are in a given byte and combines the update operations. You still need to cycle through the planes, which requires 16 port I/O ops generically, but you can do that just once for the entire span, rather than per-pixel. Then when writing the data, for bytes where all 8 pixels are being overwritten, you don't need to read the existing byte at all -- just one POKE and you've set the bit for 8 pixels in one go.
You can then use that horizontal line function as the basis for other algorithms, such as drawing lines or ellipses by decomposing them into horizontal spans rather than pixels.
QBASIC's LINE and CIRCLE functions work in exactly this way, allowing them to be insanely faster than setting the equivalent pixels with PSET.
Another optimization is that if you're writing the same bit to multiple planes, you can actually enable writing the same data in parallel to those planes in a single write operation. There are limitations to when this is useful, but for instance it can be used to very quickly blank the screen.
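As a sketch of that screen-blanking case, here is a Python model where one CPU write lands in every plane enabled by a 4-bit mask. The `write_byte` helper and the `map_mask` parameter are stand-ins I made up for what the card does in hardware.

```python
def write_byte(planes, map_mask, offset, value):
    """Model one CPU write: every plane whose bit is set in map_mask gets it."""
    for plane in range(4):
        if map_mask & (1 << plane):
            planes[plane][offset] = value

# Start with junk in all four planes, then blank the whole screen.
planes = [bytearray(b"\xAA" * (80 * 480)) for _ in range(4)]
cpu_writes = 0
for offset in range(80 * 480):
    write_byte(planes, 0b1111, offset, 0x00)   # all four planes at once
    cpu_writes += 1

print(cpu_writes)   # prints 38400, for all 307200 pixels
```

That works out to 8 pixels cleared per write, with no reads and no plane switching inside the loop.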
Another optimization is, if you're drawing a sprite, you can do an algorithm that only switches through the planes once for the entire sprite (or indeed the entire screen, if you're drawing multiple sprites).
It all comes down to ways to combine the heavy operations -- do multiple things per plane switch, and where possible, multiple pixels per memory read/write.