I've been writing some Z80 sprite-drawing code recently, trying to get it as fast as possible. Right now my fastest routine with transparency runs at 1/4 the speed of drawing the same image with a set of unrolled ldirs (no transparency). For the target system, that is still a bit limiting!
I can give you a quick-and-dirty explanation of how my sprite routine works. I basically divided things up into all possible cases, based on 16-bit words. Something like:
- Starting on even or odd scanline (CGA stores even and odd scanlines in separate memory banks)
- Starting on even or odd x-coordinate (there are 2 pixels packed in a byte in the mode I use)
- All pixels in word opaque
- All pixels in word transparent
- Some pixels in word opaque/transparent
- All pixels in byte opaque
- All pixels in byte transparent
- Some pixels in byte opaque/transparent
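To make the opaque/transparent/mixed cases above a bit more concrete, here is a minimal sketch in Python of how a group of pixels could be classified and packed. This is not my actual tool: the names `TRANSPARENT`, `classify` and `pack` are made up for illustration, colour index 0 is assumed to be the transparent one, and the leftmost pixel is assumed to go in the high nibble (byte order within a word is ignored here).

```python
TRANSPARENT = 0  # assumed colour index treated as transparent (hypothetical)

def classify(pixels):
    """Classify a group of 4-bit pixels (2 for a byte, 4 for a word)."""
    opaque = [p != TRANSPARENT for p in pixels]
    if all(opaque):
        return "opaque"
    if not any(opaque):
        return "transparent"
    return "mixed"

def pack(pixels):
    """Pack pixels into (data, mask), leftmost pixel in the highest nibble.

    Mask nibbles are 0xF where the background must be kept, so a mixed
    group can be drawn as: dest = (dest AND mask) OR data.
    """
    data = mask = 0
    for p in pixels:
        see_through = (p == TRANSPARENT)
        data = (data << 4) | (0 if see_through else p & 0xF)
        mask = (mask << 4) | (0xF if see_through else 0)
    return data, mask
```

For example, `classify([0, 5])` gives `"mixed"` and `pack([0, 5])` gives data `0x05` with mask `0xF0`, so only the right-hand pixel of that byte gets overwritten.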
I then coded hand-optimized assembly 'templates' for each case. Then I derived some heuristics for when to select which variation for the fastest/smallest possible code (sometimes it is faster to process things per word, other times it is faster to process per byte; not a problem you would have on Z80).
Then I made a 'compiler' for this: I load a bitmap, and have the compiler automatically generate the proper blocks of code for each case, inserting the proper pixel/masking data into the 'template', with a few peephole optimizations added (e.g., if you have multiple opaque bytes/words next to each other, it merges the pointer updates into a single instruction).
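A rough sketch of the idea in Python, heavily simplified: it only works per byte, handles a single row with an even number of pixels, and ignores the even/odd scanline handling. The function name `compile_row`, the choice of colour 0 as transparent, and the NASM-style output with ES:DI as the destination pointer are all assumptions for the example, not what my real compiler emits. Instead of advancing the pointer after every byte, it folds the advance into the displacement and emits one pointer update at the end, which is one possible way to get the merging described above.

```python
TRANSPARENT = 0  # assumed transparent colour index (hypothetical)

def compile_row(pixels):
    """Emit assembly text for one sprite row, 2 pixels per byte.

    Assumes ES:DI points at the first destination byte of the row and
    that len(pixels) is even.
    """
    def nibble(p):
        # Transparent pixels contribute no data bits.
        return 0 if p == TRANSPARENT else p & 0xF

    lines = []
    offset = 0  # deferred pointer advance, used as a displacement
    for i in range(0, len(pixels), 2):
        hi, lo = pixels[i], pixels[i + 1]
        data = (nibble(hi) << 4) | nibble(lo)
        mask = (0xF0 if hi == TRANSPARENT else 0) | (0x0F if lo == TRANSPARENT else 0)
        dest = f"byte [es:di+{offset}]" if offset else "byte [es:di]"
        if mask == 0x00:
            # Fully opaque byte: plain store, no read of the background.
            lines.append(f"mov {dest}, 0x{data:02X}")
        elif mask != 0xFF:
            # Mixed byte: keep background bits, then merge sprite bits.
            lines.append(f"and {dest}, 0x{mask:02X}")
            lines.append(f"or {dest}, 0x{data:02X}")
        # Fully transparent byte: emit nothing at all.
        offset += 1
    if offset:
        # All per-byte pointer updates merged into one instruction.
        lines.append(f"add di, {offset}")
    return lines
```

Calling `compile_row([1, 2, 0, 0, 0, 3, 4, 5])` produces a `mov` for the first byte, nothing for the transparent second byte, an `and`/`or` pair for the third, a `mov` for the fourth, and a single `add di, 4` at the end.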
So basically the sprite code is 'perfect' hand-optimized code for drawing sprites with transparency.