Real mode memory copy optimization

snq · Aug 9, 2010

I've taken up old-school programming again and I'm wondering if there are any ways to optimize a regular memory copy?
I'm concentrating on performance on a 286 and so far rep movsw seems to be as fast as it gets. I've tried unrolling the loop, reading 8 bytes at a time into 4 registers and then writing them, some alignment stuff, but it seems nothing beats the regular rep movsw. In fact whatever I do seems to be significantly slower.

On modern machines there's plenty of ways to improve performance over rep movsd, and it just seems to be too easy if rep movsw is actually the fastest method on a 286.

Anyone got some tricks up their sleeve here?

Chuck(G) · Aug 9, 2010

rep movsw is about as fast as it gets on a 286. However, make sure your source and target are aligned on a word boundary.
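
To make the alignment point concrete, here is a rough C model of the idea, not real-mode code. The helper name `copy_word_aligned` is made up for illustration. It assumes the only case worth fixing up is when both addresses are odd, since a single leading byte move (the MOVSB) then makes both even before the word loop (the REP MOVSW part). When only one address is odd it does nothing special.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy n bytes in 16-bit units, word-aligning first
   when one leading byte move can make both pointers even. */
void copy_word_aligned(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Both addresses odd: one byte move (MOVSB) makes both even. */
    if (n > 0 && ((uintptr_t)d & 1) && ((uintptr_t)s & 1)) {
        *d++ = *s++;
        n--;
    }

    /* Bulk copy, one word per step (the REP MOVSW part).
       memcpy of 2 bytes keeps this valid C even if a pointer is still odd. */
    for (size_t words = n / 2; words > 0; words--) {
        memcpy(d, s, 2);
        d += 2;
        s += 2;
    }

    /* Odd byte left over at the end (a final MOVSB). */
    if (n & 1)
        *d = *s;
}
```

In assembly the same shape would be a test of the low bit of SI and DI, an optional movsb, a shift of the count, rep movsw, and an optional trailing movsb.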

per · Aug 9, 2010

I think it may have something to do with how those old CPUs address memory. I don't know about the 286, but the 8088 generally uses four clock cycles (plus wait states) to fetch one byte of code. With a loop it has to read several instructions from memory in order to load and store each byte, while with rep movsw it only needs to read the instruction once, thus saving a LOT of unnecessary (and slow) memory reads.
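
As a back-of-the-envelope check on that, here is a small C sketch that counts bus traffic only, for an 8088 with no wait states. It compares a hypothetical `lodsw / stosw / loop` copy loop (1 + 1 + 2 opcode bytes) against `rep movsw` (2 opcode bytes, fetched once). It assumes the loop body is refetched on every iteration because the taken jump empties the prefetch queue, and it ignores execution time entirely, so treat the numbers as a rough lower bound rather than real timings. The function names are made up for illustration.

```c
/* Rough bus-traffic model for copying 16-bit words on an 8088.
   Assumptions: 4 clocks per byte transferred, no wait states,
   execution time ignored. */
enum {
    CLOCKS_PER_BUS_BYTE  = 4, /* one 8-bit bus cycle */
    DATA_BYTES_PER_WORD  = 4, /* 2 bytes read + 2 bytes written */
    LOOP_CODE_BYTES      = 4, /* lodsw (1) + stosw (1) + loop rel8 (2) */
    REP_MOVSW_CODE_BYTES = 2  /* rep prefix (1) + movsw (1) */
};

/* Explicit loop: code bytes are fetched again on every iteration. */
unsigned long loop_copy_bus_clocks(unsigned long words)
{
    return words * (LOOP_CODE_BYTES + DATA_BYTES_PER_WORD)
                 * CLOCKS_PER_BUS_BYTE;
}

/* rep movsw: code bytes are fetched once, then only data moves. */
unsigned long rep_movsw_bus_clocks(unsigned long words)
{
    return (REP_MOVSW_CODE_BYTES + words * DATA_BYTES_PER_WORD)
           * CLOCKS_PER_BUS_BYTE;
}
```

Under these assumptions, 1000 words come out at 32000 bus clocks for the loop versus 16008 for rep movsw, so the instruction fetches alone roughly double the bus time.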

On today's processors, memory is handled differently, and I guess the internal L1 cache is the main reason why loops may be faster than rep movsw/movsd.

snq · Aug 9, 2010

True, the 286 not having any cache would probably explain it. I guess optimizations like unrolling would not work on anything lower than a 486.

What if (hypothetically) one of the pointers is misaligned while the other is aligned? There's no way to get both of them aligned using a movsb before doing our rep movsw, but we can change which one of them is aligned that way. Which one would be the preferred one?

Chuck(G) · Aug 9, 2010

It's more than that, per. Starting with the 386, instructions such as MOVS were microcoded, while the more elementary ones were hardcoded. Thus, it can be faster to do a "DEC CX / JNZ ..." instead of "LOOP ...". There have been articles written over the years about how optimization techniques differ from one CPU generation to the next.

ISTR that REP MOVS on the 8088 and 8086 re-issued the instruction (like the Z80 LDIR) on every repetition, so as to be interruptible. The 186 and later treat the REP MOVS as a single instruction. Motorola did the same thing when going from the 68000 to the 68010: when the '10 sees one of a class of "loopable" memory-reference instructions followed by a DBcc that loops back to it, the CPU stops fetching instructions and simply repeats the loopable instruction until the loop termination conditions are satisfied.
