Cool stuff in text mode

Mills32 · Mar 28, 2022

Hi, I just finished a tiny demo for MS-DOS. It only requires an 8088 at 4.77 MHz, VGA and AdLib.

Source code: https://github.com/mills32/Tiny-Demo

My code is not very clear, but I'm sure it will inspire someone to make something with it, or to improve it, because I don't think my sprite simulation is very good.

Caluser2000 · Mar 28, 2022

It looks and sounds great. Well done!

carlos12 · Mar 29, 2022

This is superb! Thank you for sharing it along with its source. It's very inspiring.

carlos12 · Mar 29, 2022

By the way, I don't feel like I should give advice to a master, but I hope some of this may improve that already excellent code. On line 1316:

Code:

    asm mov cl,30
    _loop: //Update Lines
        asm movsw            
        asm movsw
        asm movsw
        asm movsw
        asm movsw
        asm movsw
        asm movsw
        asm movsw
        asm movsw
        asm movsw
        asm movsw            
        asm movsw
        asm movsw
        asm movsw
        asm movsw
        asm movsw
        asm movsw
        asm movsw
        asm movsw
        asm movsw
        asm add di,84-40
        asm sub si,40
        asm loop _loop

This might be faster if reduced to something like this:

Code:

    asm mov bl,30
    asm sub ch,ch
    _loop: //Update Lines
        asm mov cl,20
        asm rep movsw
        asm add di,84-40
        asm sub si,40
        asm dec bl
        asm jnz _loop

REP MOVSW is usually faster than a run of separate MOVSW instructions, if only because far fewer instruction bytes have to be fetched. The downside is that both LOOP and REP use CX as their counter. To resolve the conflict, we can keep the row count in a spare register and mimic LOOP with DEC and JNZ, which branches on the zero flag that DEC sets. That is not quite as efficient as LOOP, but in my opinion it is pretty close, and the speed of REP MOVSW more than makes up for it.

An alternative is to back up CL with the very fast XCHG, so that the same register can serve both REP and LOOP:

Code:

    asm mov cl,30
    asm sub ch,ch
    _loop: //Update Lines
        asm xchg bl,cl
        asm mov cl,20
        asm rep movsw
        asm add di,84-40
        asm sub si,40
        asm xchg cl,bl
        asm loop _loop
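For readers who want to check that the rewrites keep the behaviour of the unrolled loop, here is a minimal portable C sketch of what all three versions appear to do, judging only from the snippets above: the same 40-byte source row (SI is rewound by 40 after every row) is copied to 30 destination rows that start 84 bytes apart. The names `update_lines`, `ROWS`, `ROW_BYTES` and `DST_STRIDE` are made up for this illustration and do not come from the demo's source.

```c
#include <stdint.h>
#include <string.h>

enum { ROWS = 30, ROW_BYTES = 40, DST_STRIDE = 84 };

/* Portable model of the asm loop: one 40-byte source row is copied to
   30 destination rows whose start addresses are 84 bytes apart. */
static void update_lines(uint8_t *dst, const uint8_t *src)
{
    int row;
    for (row = 0; row < ROWS; row++) {
        memcpy(dst, src, ROW_BYTES); /* stands in for "mov cl,20 / rep movsw" */
        dst += DST_STRIDE;           /* 40 bytes copied plus "add di,84-40" */
        /* src is not advanced: "sub si,40" rewinds SI after every row */
    }
}
```

A model like this can serve as a reference when comparing the output of the asm variants against each other.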

Mills32 · Mar 29, 2022

Thanks!
That code updates big chunks of tiles for the twister, and it is already very fast as it is, but your code is smaller and looks nicer.

The part I'd love to optimize more is the sprite drawing, to use this for games.

carlos12 · Mar 29, 2022

At first sight, I think there are a few things that could be tried to speed up the code.

For example, this function

Code:

void Enable_TileData_Write(){
    asm mov dx,0x03C4
    asm mov ax,0x0402    //Enable plane 0100 (2 - glyphs)
    asm out dx,ax
    asm mov ax,0x0404    //Sequential memory access
    asm out dx,ax
    asm mov dx,0x03CE
    asm mov ax,0x0204    //Read plane (2 - glyphs)
    asm out dx,ax
    asm mov ax,0x0005
    asm out dx,ax
    asm mov ax,0x0406    //Select VRAM A0000h-AFFFFh, Chain O/E OFF; keep text mode
    asm out dx,ax
}

if converted to a macro, would save the call overhead: the CALL, the RET, and the PUSHes and POPs of C's stack frame.

Code:

#define Enable_TileData_Write() \
    asm mov dx,0x03C4;\
    asm mov ax,0x0402; /* Enable plane 0100 (2 - glyphs) */ \
    asm out dx,ax;\
    asm mov ax,0x0404; /* Sequential memory access */ \
    asm out dx,ax;\
    asm mov dx,0x03CE;\
    asm mov ax,0x0204; /* Read plane (2 - glyphs) */ \
    asm out dx,ax;\
    asm mov ax,0x0005;\
    asm out dx,ax;\
    asm mov ax,0x0406; /* Select VRAM A0000h-AFFFFh, Chain O/E OFF; keep text mode */ \
    asm out dx,ax
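As a rough illustration of the multi-line macro technique, here is a portable C sketch that runs without VGA hardware. It assumes a simulated register file: `out_word`, `seq_regs`, `gc_regs` and `ENABLE_TILEDATA_WRITE` are hypothetical names invented for this example. It relies on the fact that a 16-bit OUT to a VGA index port puts the low byte in the index register and the high byte in the data register. Note that every backslash is the last character on its line, with comments placed before it.

```c
#include <stdint.h>

#define SEQ_PORT 0x03C4 /* VGA Sequencer index port */
#define GC_PORT  0x03CE /* VGA Graphics Controller index port */

/* Simulated register files; hypothetical stand-ins for the real hardware. */
static uint8_t seq_regs[256];
static uint8_t gc_regs[256];

/* Models a 16-bit OUT to an index port: the low byte selects the
   register, the high byte is the value written to it. */
static void out_word(uint16_t port, uint16_t value)
{
    uint8_t index = (uint8_t)(value & 0xFF);
    uint8_t data  = (uint8_t)(value >> 8);
    if (port == SEQ_PORT)
        seq_regs[index] = data;
    else if (port == GC_PORT)
        gc_regs[index] = data;
}

/* Multi-line macro; do { } while (0) makes it behave as one statement. */
#define ENABLE_TILEDATA_WRITE() do {                                 \
        out_word(SEQ_PORT, 0x0402); /* map mask: plane 2 */          \
        out_word(SEQ_PORT, 0x0404); /* sequential memory access */   \
        out_word(GC_PORT,  0x0204); /* read map select: plane 2 */   \
        out_word(GC_PORT,  0x0005); /* graphics mode register = 0 */ \
        out_word(GC_PORT,  0x0406); /* map VRAM at A0000h-AFFFFh */  \
    } while (0)
```

The macro expands in place at every use, so the trade-off is larger code whenever it is used in more than one spot.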

I would also try to avoid calls inside a tight loop as much as possible (it's not always possible).

I would also avoid using memcpy or other C standard library functions inside critical code. IIRC Borland's C compilers don't inline memcpy; they use CALLs. With the large memory models it's even worse, because the addresses are full double words (segment and offset) that must be pushed onto and retrieved from the stack.

Code:

    for (i = 0; i < 6;i+=2){
        memcpy((byte *)(0xA0000000+tilepos),&VGA[SpBKG[i+sprpos]<<5],8); tilepos+=32;
        memcpy((byte *)(0xA0000000+tilepos),&VGA[SpBKG[i+sprpos+6]<<5],8); tilepos+=32;
        memcpy((byte *)(0xA0000000+tilepos),&VGA[SpBKG[i+sprpos+12]<<5],8); tilepos+=32;
    }

I would try to convert this code to a pure assembler equivalent (LDS, LES, MOVS, etc.), while also avoiding memory variables in critical sections where possible. You can copy the value into a register once and then use the register for all subsequent operations (if enough registers are available). Memory variables are quite slow when used in loops. For example, I had code that converted one bitmap format into another and took 140 seconds on an 8 MHz machine using a FOR loop with local variables. A pure assembler version I wrote afterwards, which used only registers after reading the variables once, did the same job in just 8 seconds on the same machine.
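As a stepping stone towards an assembler version, here is a hedged portable C sketch of the same restore loop with the addresses computed once and kept in locals, and the 8-byte copy written inline instead of calling memcpy. It is a model, not the demo's code: `restore_tiles` is a made-up name, `dst` stands in for the memory at A000:0000, `font` for the `VGA` array and `bkg` for `SpBKG`, and the element type of `SpBKG` is assumed to be a byte.

```c
#include <stdint.h>

enum { GLYPH_STRIDE = 32, GLYPH_ROWS = 8 };

/* Copies 9 glyphs of 8 bytes each to consecutive 32-byte tile slots,
   visiting the tile ids in the same order as the original loop. */
static void restore_tiles(uint8_t *dst, const uint8_t *font,
                          const uint8_t *bkg, unsigned sprpos,
                          unsigned tilepos)
{
    uint8_t *out = dst + tilepos;      /* destination pointer, computed once */
    const uint8_t *ids = bkg + sprpos; /* base of the saved tile ids */
    int i, j, k;

    for (i = 0; i < 6; i += 2) {
        for (j = 0; j < 18; j += 6) {  /* offsets 0, 6 and 12 */
            const uint8_t *glyph = font + ((unsigned)ids[i + j] << 5);
            for (k = 0; k < GLYPH_ROWS; k++) /* in asm: four MOVSW */
                out[k] = glyph[k];
            out += GLYPH_STRIDE;
        }
    }
}
```

Whether a given compiler keeps `out`, `ids` and `glyph` in registers is not guaranteed, which is the reason the advice above is to write this part in assembler.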

Mills32 · Mar 29, 2022

Thanks. I left some code in C because I saw it was fast enough, or because I didn't know how to do it better than the compiler. Also, the demo uses the huge memory model just because I wanted all the data to be inside the EXE.

Mills32 · Mar 29, 2022

I didn't know this could be done:

Code:

#define Enable_TileData_Write() \

It's awesome. I use C because it is easy to read, so this is a big improvement, even if that function is only called once per frame.

carlos12 said:

Code:

    for (i = 0; i < 6;i+=2){
        memcpy((byte *)(0xA0000000+tilepos),&VGA[SpBKG[i+sprpos]<<5],8); tilepos+=32;
        memcpy((byte *)(0xA0000000+tilepos),&VGA[SpBKG[i+sprpos+6]<<5],8); tilepos+=32;
        memcpy((byte *)(0xA0000000+tilepos),&VGA[SpBKG[i+sprpos+12]<<5],8); tilepos+=32;
    }

I didn't know how to do this in assembly; my resulting asm function was bigger than the code the compiler produced.
