
Real mode memory copy optimization

snq

Experienced Member
Joined
Mar 29, 2009
Messages
164
Location
Sweden, way up north
I've taken up old-school programming again and I'm wondering if there are any optimizations for a regular memory copy.
I'm concentrating on performance on a 286 and so far rep movsw seems to be as fast as it gets. I've tried unrolling the loop, reading 8 bytes at a time into 4 registers and then writing them, some alignment stuff, but it seems nothing beats the regular rep movsw. In fact whatever I do seems to be significantly slower.
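For reference, the unrolled variant described above looks roughly like this when written out in C. This is only a sketch of the idea, not the actual 286 assembly that was timed; the function name `copy_words_unrolled4` is made up for illustration, and the four locals stand in for four 16-bit registers.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy `words` 16-bit words, four per iteration.
   Each pass loads four words into locals (stand-ins for four registers),
   then stores all four, mirroring the "read 8 bytes, then write 8 bytes"
   approach from the post. */
void copy_words_unrolled4(uint16_t *dst, const uint16_t *src, size_t words)
{
    while (words >= 4) {
        uint16_t a = src[0], b = src[1], c = src[2], d = src[3];
        dst[0] = a;
        dst[1] = b;
        dst[2] = c;
        dst[3] = d;
        src += 4;
        dst += 4;
        words -= 4;
    }
    /* Copy the 0 to 3 words left over after the unrolled passes. */
    while (words--) {
        *dst++ = *src++;
    }
}
```

On a cacheless CPU every one of those loads, stores and pointer updates is an instruction that has to be fetched over the same bus as the data, which is what rep movsw avoids.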

On modern machines there are plenty of ways to improve performance over rep movsd, and it just seems too easy if rep movsw is actually the fastest method on a 286.

Anyone got some tricks up their sleeve here?
 
I think it may have something to do with how those old CPUs address memory. I don't know about the 286, but the 8088 generally uses four clock cycles (plus wait states) to fetch one byte of code. With a loop it has to read several instructions from memory for every byte or word it copies, while with rep movsw it only needs to read the instruction once, saving a LOT of unnecessary (and slow) memory reads.
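To put rough numbers on that, here is a small back-of-the-envelope model in C. It counts only instruction-fetch clocks, using the four-clocks-per-byte figure above, and assumes a minimal LODSW / STOSW / LOOP copy loop as the comparison. It ignores the data transfers themselves, execution time and any overlap from the prefetch queue, and it assumes rep movsw is fetched only once, so treat the result as an illustration of the argument rather than a measurement. The function names are made up.

```c
#include <stdint.h>

enum {
    CLOCKS_PER_FETCHED_BYTE = 4, /* 8088 figure from the post, no wait states */
    LOOP_BODY_BYTES = 4,         /* LODSW (1) + STOSW (1) + LOOP rel8 (2) */
    REP_MOVSW_BYTES = 2          /* REP prefix (1) + MOVSW (1) */
};

/* Fetch clocks for a LODSW/STOSW/LOOP loop: the whole loop body is
   fetched again for every word copied. */
uint32_t fetch_clocks_loop(uint32_t words)
{
    return words * LOOP_BODY_BYTES * CLOCKS_PER_FETCHED_BYTE;
}

/* Fetch clocks for REP MOVSW under the assumption that the two
   instruction bytes are fetched once, regardless of the word count. */
uint32_t fetch_clocks_rep(uint32_t words)
{
    (void)words; /* the count does not affect the fetch cost in this model */
    return REP_MOVSW_BYTES * CLOCKS_PER_FETCHED_BYTE;
}
```

Under these assumptions, copying 1000 words costs 16000 clocks of pure instruction fetch with the loop, against 8 for rep movsw.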

On today's processors, memory is handled differently, and I guess the internal L1 cache is the main reason why loops may be faster than rep movsw/movsd.
 
True, the 286 not having any cache would probably explain it. I guess unrolling and similar optimizations would not work on anything lower than a 486.

What if (hypothetically) one of the pointers is misaligned while the other is aligned? There's no way to get both of them aligned by using a movsb before doing our rep movsw, but we can change which one of them is aligned that way. Which one would be the preferred one?
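The "one movsb, then rep movsw" idea can be sketched in C like this. It does not answer which pointer is the better one to align; the `align_dst` flag just lets you pick either and time both. The function name `copy_align_one` and the flag are made up for illustration, and `memcpy` is used for the two-byte moves only so the C stays well defined on odd addresses.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy n bytes, first copying a single byte if the
   chosen pointer (dst when align_dst is nonzero, otherwise src) sits on
   an odd address, then copying words, then any trailing byte. */
void copy_align_one(uint8_t *dst, const uint8_t *src, size_t n, int align_dst)
{
    uintptr_t chosen = align_dst ? (uintptr_t)dst : (uintptr_t)src;

    /* The single MOVSB: makes the chosen pointer even. */
    if (n > 0 && (chosen & 1)) {
        *dst++ = *src++;
        n--;
    }

    /* The REP MOVSW part: the chosen pointer is now word aligned,
       the other one may still be odd. */
    for (; n >= 2; n -= 2) {
        uint16_t w;
        memcpy(&w, src, 2);
        memcpy(dst, &w, 2);
        src += 2;
        dst += 2;
    }

    /* Trailing byte when an odd number of bytes remains. */
    if (n) {
        *dst = *src;
    }
}
```

Calling it once with `align_dst` set to 1 and once with 0 on the same buffers gives the two cases from the question.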
 
It's more than that, per. Starting with the 386, instructions such as MOVS were microcoded, while the more elementary ones were hardcoded. Thus, it can be faster to do a "DEC CX / JNZ ..." instead of "LOOP ...". Plenty of articles have been written over the years about how optimization techniques differ from one CPU generation to the next.

ISTR that REP MOVS on the 8088 and 8086 re-issued the instruction (like the Z80 LDIR) on every repetition, so as to be interruptible. The 186 and later treat REP MOVS as a single instruction. Motorola did the same thing when going from the 68000 to the 68010: when the '10 sees one of a class of "loopable" memory-reference instructions followed by a DBcc that loops back to it, the CPU will quit interpreting instructions and just execute the loopable instruction until the loop termination conditions are satisfied.
 