
Cool stuff in text mode

By the way, I don't feel I should be giving advice to a master, but I hope some of this may improve that already excellent code. At line 1316:

Code:
    asm mov cl,30
    _loop: //Update Lines
        asm movsw            
        asm movsw
        asm movsw
        asm movsw
        asm movsw
        asm movsw
        asm movsw
        asm movsw
        asm movsw
        asm movsw
        asm movsw            
        asm movsw
        asm movsw
        asm movsw
        asm movsw
        asm movsw
        asm movsw
        asm movsw
        asm movsw
        asm movsw
        asm add di,84-40
        asm sub si,40
        asm loop _loop

This could perhaps be faster if reduced to something like this:

Code:
    asm mov bl,30
    asm sub ch,ch
    _loop: //Update Lines
        asm mov cl,20
        asm rep movsw
        asm add di,84-40
        asm sub si,40
        asm dec bl
        asm jnz _loop

REP MOVSW is usually faster than a run of separate MOVSW instructions, if only because there are far fewer instructions to fetch. The downside is that CX is needed both as the LOOP counter and as the REP count. How to resolve this conflict? We can use a spare register and mimic the LOOP instruction with DEC, taking advantage of the flags with JNZ. It's not as efficient as LOOP, but in my opinion it's pretty close, and the speed of REP MOVSW more than makes up for it.

An alternative could be something like this: saving the row counter in BL with the very fast XCHG, so that CL can serve both REP and LOOP.

Code:
    asm mov cl,30
    asm sub ch,ch
    _loop: //Update Lines
        asm xchg bl,cl
        asm mov cl,20
        asm rep movsw
        asm add di,84-40
        asm sub si,40
        asm xchg cl,bl
        asm loop _loop
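
For anyone who doesn't read x86 assembly, here is roughly what these loops do, modelled in portable C. It's only a sketch under assumptions: `fill_rows`, `ROWS`, `ROW_WORDS` and `DST_STRIDE` are names I made up, the buffers are ordinary arrays rather than video memory, and the figures (30 rows, 20 words per row, rows 84 bytes apart, source rewound by 40 bytes) are read off the listings above.

```c
#include <stdint.h>

enum { ROWS = 30, ROW_WORDS = 20, ROW_BYTES = ROW_WORDS * 2, DST_STRIDE = 84 };

/* Model of the loop: the same 40-byte source row is written to 30
   destination rows whose starts are 84 bytes apart. */
static void fill_rows(uint8_t *dst, const uint8_t *src)
{
    int rows_left = ROWS;                 /* plays the role of BL */
    do {
        const uint8_t *s = src;           /* SI at the start of the row */
        int words = ROW_WORDS;            /* mov cl,20 */
        while (words-- != 0) {            /* rep movsw */
            dst[0] = s[0];
            dst[1] = s[1];
            dst += 2;
            s += 2;
        }
        dst += DST_STRIDE - ROW_BYTES;    /* add di,84-40 */
        /* s is thrown away here, which has the same effect as sub si,40 */
    } while (--rows_left != 0);           /* dec bl / jnz _loop */
}
```

The point of the model is that the row counter and the per-row word count are two separate variables, which is exactly why the assembly needs BL alongside CX.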
 
Thanks!
That code updates big chunks of tiles for the twister, and it is very fast as it is now, but your code is smaller and looks nicer.

The part I'd love to optimize more is the sprite drawing, to use this for games.
 
At first sight, I think there are a few things that could be tried to speed up the code.

For example, this function

Code:
void Enable_TileData_Write(){
    asm mov dx,0x03C4
    asm mov ax,0x0402    //Enable plane 0100 (2 - glyphs)
    asm out dx,ax
    asm mov ax,0x0404    //Sequential memory access
    asm out dx,ax
    asm mov dx,0x03CE
    asm mov ax,0x0204    //Read plane (2 - glyphs)
    asm out dx,ax
    asm mov ax,0x0005
    asm out dx,ax
    asm mov ax,0x0406    //Select VRAM A0000h-AFFFFh, Chain O/E OFF; keep text mode
    asm out dx,ax
};

if converted to a macro, would save a branch: the CALL, the RET, and the PUSHes and POPs of C's stack frame.

Code:
#define Enable_TileData_Write() \
    asm mov dx,0x03C4; \
    asm mov ax,0x0402; /* Enable plane 0100 (2 - glyphs) */ \
    asm out dx,ax; \
    asm mov ax,0x0404; /* Sequential memory access */ \
    asm out dx,ax; \
    asm mov dx,0x03CE; \
    asm mov ax,0x0204; /* Read plane (2 - glyphs) */ \
    asm out dx,ax; \
    asm mov ax,0x0005; \
    asm out dx,ax; \
    asm mov ax,0x0406; /* Select VRAM A0000h-AFFFFh, Chain O/E OFF; keep text mode */ \
    asm out dx,ax
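
If you want to try the macro idea on a modern compiler first, something like this portable sketch may help. Everything in it is an assumption for illustration: `out16` is a hypothetical stand-in for `out dx,ax` that only records the write, and `ENABLE_TILEDATA_WRITE` just mirrors the port and register values from the function above. The `do { ... } while (0)` wrapper is the usual C idiom for making a multi-statement macro behave like a single statement.

```c
#include <stdint.h>

/* Hypothetical stand-in for "out dx,ax": records each port write in a log
   so the sequence can be checked on a machine without VGA hardware. */
enum { LOG_MAX = 16 };
static uint16_t log_port[LOG_MAX];
static uint16_t log_value[LOG_MAX];
static int log_count = 0;

static void out16(uint16_t port, uint16_t value)
{
    if (log_count < LOG_MAX) {
        log_port[log_count] = port;
        log_value[log_count] = value;
        log_count++;
    }
}

/* The body is pasted in at every use, so the macro itself costs no
   CALL/RET and no stack frame. (In this model out16 is still a function;
   in the original each line is a single OUT instruction.) */
#define ENABLE_TILEDATA_WRITE() \
    do { \
        out16(0x03C4, 0x0402); /* enable plane 2 (glyphs) */ \
        out16(0x03C4, 0x0404); /* sequential memory access */ \
        out16(0x03CE, 0x0204); /* read plane 2 (glyphs) */ \
        out16(0x03CE, 0x0005); \
        out16(0x03CE, 0x0406); /* VRAM at A0000h-AFFFFh, chain O/E off */ \
    } while (0)
```

Writing `ENABLE_TILEDATA_WRITE();` expands to the five writes in order. The trade-off is code size, since every use duplicates the body.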

I would also try to avoid calls inside a tight loop as far as possible (it isn't always possible).

I would also avoid using memcpy or other C standard library routines inside critical code. IIRC Borland's C compilers don't inline memcpy; they use CALLs. With the large memory models it's even worse, as each address is a full double word that must be pushed to and retrieved from the stack.

Code:
    for (i = 0; i < 6;i+=2){
        memcpy((byte *)(0xA0000000+tilepos),&VGA[SpBKG[i+sprpos]<<5],8); tilepos+=32;
        memcpy((byte *)(0xA0000000+tilepos),&VGA[SpBKG[i+sprpos+6]<<5],8); tilepos+=32;
        memcpy((byte *)(0xA0000000+tilepos),&VGA[SpBKG[i+sprpos+12]<<5],8); tilepos+=32;
    }

I would try to convert this code to a pure assembler equivalent (LDS, LES, MOVS, etc.), while also avoiding, where possible, the use of memory variables in critical sections. You could copy the value to a register and then use the register for all subsequent operations (if one is available). Memory variables are quite slow when used in loops. For example, I had some code that converted one bitmap format into another, and it took 140 seconds using a FOR loop with local variables on an 8 MHz machine. A pure assembler version I wrote afterwards, which used only registers after fetching the values from the variables the first time, did the same job in just 8 seconds on the same machine.
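
As a halfway step before going to assembler, the structure could look something like this in plain C. This is a sketch under assumptions, not a drop-in replacement: `copy_sprite_tiles` is a made-up name, `dst`, `glyphs` and `bkg` stand in for the A000h segment at tilepos, `VGA[]` and `&SpBKG[sprpos]`, all three are ordinary byte arrays here, and I'm assuming SpBKG holds byte-sized tile numbers. Whether the locals really end up in registers is up to the compiler.

```c
#include <stdint.h>

/* Copies nine 8-byte glyphs without calling memcpy, in the same order
   as the original loop: offsets i, i+6, i+12 for i = 0, 2, 4. */
static void copy_sprite_tiles(uint8_t *dst, const uint8_t *glyphs,
                              const uint8_t *bkg)
{
    int i, k, n;
    for (i = 0; i < 6; i += 2) {
        for (k = 0; k <= 12; k += 6) {
            /* Fetch the tile number once and keep the source pointer local. */
            const uint8_t *src = glyphs + ((unsigned)bkg[i + k] << 5);
            for (n = 0; n < 8; n++)
                dst[n] = src[n];          /* 8 bytes per glyph */
            dst += 32;                    /* tilepos += 32 */
        }
    }
}
```

On the real hardware the innermost loop is the part that would become LES/LDS plus four MOVSW; the sketch only shows how the memory variables can be read once and carried in locals.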
 
Thanks, I left some code in C because I saw it was fast enough, or I didn't know how to do it better than the compiler. Also, the demo uses the huge memory model just because I wanted all the data to be inside the exe.
 
I didn't know this could be done:
Code:
#define Enable_TileData_Write() \
It's awesome :). I use C because it is easy to read, so this is a big improvement, even if that function is only called once per frame.

Code:
    for (i = 0; i < 6;i+=2){
        memcpy((byte *)(0xA0000000+tilepos),&VGA[SpBKG[i+sprpos]<<5],8); tilepos+=32;
        memcpy((byte *)(0xA0000000+tilepos),&VGA[SpBKG[i+sprpos+6]<<5],8); tilepos+=32;
        memcpy((byte *)(0xA0000000+tilepos),&VGA[SpBKG[i+sprpos+12]<<5],8); tilepos+=32;
    }

I didn't know how to do this in assembly; my resulting asm function was bigger than the code produced by the compiler.
 