Yeah, the problem with a PUSH / SHIFT / POP approach is that the cycle counts probably blow away any "optimization" you earned originally. Another idea is to use a less obvious operation in place of a shift, if the shift is greater than four bits. Now, I've been doing 6502 asm of late, and haven't ever really done 8086 assembler (though I mean to get into that sooner rather than later!), but I'm thinking that if you needed to do something like a logical shift right by six bits on a byte, you could use a rotate left instruction followed by an AND instead...
Starting Value: 11011100
Target: Shifted right six bits = 00000011
11011100 -> ROL -> 10111001 -> ROL -> 01110011
01110011 -> AND #3 -> 00000011
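So two ROLs and an AND, instead of six SHRs (which is apparently the 8086 instruction). Note that this relies on the 8086's ROL, which wraps bit 7 straight around into bit 0; the 6502's ROL rotates through the carry flag, so there you'd have to account for carry. This should also work in reverse for a large left shift (rotate right, then AND off the low bits).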
That's just an example, I don't know what kind of shifts you're dealing with. But if the amount is greater than 4 on a byte value, and it's a LOGICAL not arithmetic shift, this type of trick should prove valuable. Saving, loading, and restoring a register probably isn't going to net you that much performance.
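Here's a rough C model of the rotate-and-mask trick above, just to check the bit arithmetic. It's a sketch, not 8086 code: the function names (rol8, ror8, shr6_via_rol, shl6_via_ror) are made up for illustration, and it assumes an 8-bit value with a plain rotate that doesn't go through carry.

```c
#include <stdint.h>

/* Rotate an 8-bit value left by one: bit 7 wraps around into bit 0.
   This models a plain rotate (no carry flag involved). */
uint8_t rol8(uint8_t v) {
    return (uint8_t)((v << 1) | (v >> 7));
}

/* Rotate an 8-bit value right by one: bit 0 wraps around into bit 7. */
uint8_t ror8(uint8_t v) {
    return (uint8_t)((v >> 1) | (v << 7));
}

/* Logical shift right by 6, done as two left rotates plus a mask.
   After two rotates the top two bits sit in bits 1..0; the AND
   clears everything else. */
uint8_t shr6_via_rol(uint8_t v) {
    v = rol8(v);
    v = rol8(v);
    return (uint8_t)(v & 0x03);
}

/* The reverse case: shift left by 6 as two right rotates plus a mask
   that keeps only bits 7..6. */
uint8_t shl6_via_ror(uint8_t v) {
    v = ror8(v);
    v = ror8(v);
    return (uint8_t)(v & 0xC0);
}
```

Feeding it the starting value from the example, shr6_via_rol(0xDC) gives 0x03, same as the hand-worked bits.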
Also, you mentioned a 768 byte copy. Is that done every frame or just once in a while? Because if it's every frame, you MAY get a little performance boost by unrolling that loop. (E.g. write 8 bytes 96 times instead of 1 byte 768 times.)
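To show what I mean by unrolling, here's a minimal sketch in C. The names (copy768_unrolled, COPY_BYTES, UNROLL) are hypothetical, and I'm assuming the copy is a straight byte-for-byte move of exactly 768 bytes between buffers that don't overlap.

```c
#include <stddef.h>
#include <stdint.h>

enum { COPY_BYTES = 768, UNROLL = 8 };  /* 768 / 8 = 96 loop passes */

/* Copy 768 bytes, eight per loop pass, so the loop counter and branch
   are paid for 96 times instead of 768 times. */
void copy768_unrolled(uint8_t *dst, const uint8_t *src) {
    for (size_t i = 0; i < COPY_BYTES; i += UNROLL) {
        dst[i + 0] = src[i + 0];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
        dst[i + 4] = src[i + 4];
        dst[i + 5] = src[i + 5];
        dst[i + 6] = src[i + 6];
        dst[i + 7] = src[i + 7];
    }
}
```

The byte count has to be a multiple of the unroll factor for this to work without a cleanup loop, which 768 is. If the copy is already done with one of the 8086 string instructions (REP MOVSB or REP MOVSW), unrolling probably won't buy anything.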
Finally, porting to EGA or CGA -- shouldn't be IMPOSSIBLE, though I don't know if the "snow" problems of a CGA might cause trouble. The big deal is you'll pretty much have to rewrite the rendering code to deal with the different pixel layouts: 4 colors on CGA means 4 pixels packed into each byte, and 16 colors means either 2 pixels per byte or, on EGA, bit planes (each byte covers 8 pixels in one of four planes). That would be some pretty serious reprogramming. But, if you're dedicated enough...
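To give a feel for the repacking work, here's a small C sketch of the 4-pixels-per-byte case. It assumes the CGA 4-color layout, where each pixel takes two bits and the leftmost pixel sits in the top two bits of the byte. The function names (pack_cga4, get_cga4) are invented for the example.

```c
#include <stdint.h>

/* Pack four 2-bit pixel values (0..3) into one byte.
   px[0] is the leftmost pixel and lands in bits 7..6. */
uint8_t pack_cga4(const uint8_t px[4]) {
    return (uint8_t)(((px[0] & 3) << 6) |
                     ((px[1] & 3) << 4) |
                     ((px[2] & 3) << 2) |
                      (px[3] & 3));
}

/* Pull pixel number idx (0 = leftmost, 3 = rightmost) back out of a
   packed byte. */
uint8_t get_cga4(uint8_t packed, unsigned idx) {
    return (uint8_t)((packed >> (6 - 2 * idx)) & 3);
}
```

So the pixels 3, 0, 1, 2 from left to right pack to binary 11000110, or 0xC6. Every pixel write in the renderer turns into a shift-and-mask like this, which is why I say it's a rewrite rather than a tweak.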