neilobremski
Experienced Member
I am learning and working on a quadrilateral DDA texture mapper for CGA that uses the 8x8 monochrome BIOS characters for texels, for use in an optimized version of Magenta's Maze. Reading Mats's and Abrash's tutorials [SUP][1][/SUP] has given me new insights, and I have resisted scrapping my triangle/line routines until I make this next step.
Stepping is really the name of the game: adding the deltas found by linear interpolation. It has been difficult to integrate the term interpolate into my brain's common math glossary. I understood it at a basic level but never played with it enough for it to become an inherent capability. This is one of the things I have had to remedy while learning texture mapping. Interpolation is the key to quickly determining what texel to read at a given pixel location.
Today my goal is simply to draft an inner loop which reads texels and writes pixels across a single scanline. This assumes that interpolation has been done and the registers are used like so:
- ES:DI = vidptr
- DS = texture (8 bytes) starting at offset 0
- AX/AH = T (texel)
- AX/AL = P (pixel)
- BX/BH = U (3:5 fixed point)
- BX/BL = V (3:5 fixed point)
- CX/CH = scratch
- CX/CL = scratch
- DX/DH = O (previous V)
- DX/DL = C (4:4 foreground:background)
- BP = Ustep,Vstep (3:5,3:5 fixed point)
- SI = X1,X2
My target machine is a Tandy 1000 HX, which runs on an 8088-based core with a puny 8-bit bus and a 4-byte prefetch queue. My intention is therefore to limit memory access and instruction size as much as possible. As a sad side effect there is a lot of shifting and masking, which would be the slower choice (in clocks) on newer processors. Just to give you an idea, a simple PUSH takes up to 15 cycles on the 8088 and only three on the 286.
Code:
MOV DH, FF ; 800 force load of texel on first pixel
MOV CX, SI ; 802
AND CH, 3 ; 804
JZ 1812 ; 807 >TEX_PIX_LOOP
MOV AL, ES:[DI] ; 809
SHL CH, 1 ; 80C
MOV CL, CH ; 80E
ROL AL, CL ; 810 P = *(ES:DI) ROL ((X1 % 4) * 2), kept pixels land in the low bits
;
MOV CX, BX ; 812 :TEX_PIX_LOOP
ROL CX, 1 ; 814
ROL CX, 1 ; 816
ROL CX, 1 ; 818
AND CH, 7 ; 81A CH = V % 8
CMP CH, DH ; 81D
JNE 185A ; 81F if (V != O) >TEX_PIX_LOAD
NOP ; 821
;
AND CL, 7 ; 822 CL = U % 8 :TEX_PIX_TEXEL
INC CL ; 825
MOV CH, AH ; 827
ROL CH, CL ; 829
MOV CL, DL ; 82B
AND CH, 1 ; 82D
JZ 1836 ; 830 if (!texel) >TEX_PIX_DRAW
ROL CL, 1 ; 832
ROL CL, 1 ; 834
;
AND CL, 03 ; 836 :TEX_PIX_DRAW
SHL AL, 1 ; 839
SHL AL, 1 ; 83B P <<= 2
OR AL, CL ; 83D P |= ((C <<ROL (((T <<ROL (U+1)) & 1) * 2)) & 3)
;
MOV CX, SI ; 83F :TEX_PIX_MOVE
INC CH ; 841
MOV SI, CX ; 843
CMP CH, CL ; 845
JAE 186A ; 847 >TEX_PIX_END
;
MOV CX, BP ; 849 :TEX_PIX_STEP
ADD BH, CH ; 84B U += Ustep
ADD BL, CL ; 84D V += Vstep
MOV CX, SI ; 84F reload X1,X2 (CH = X1) for the byte test
;
TEST CH, 3 ; 851 :TEX_PIX_BYTE
JNZ 1812 ; 854 >TEX_PIX_LOOP
STOSB ; 856 *(ES:DI++) = P
JMP 1812 ; 857 >TEX_PIX_LOOP
NOP ; 859
;
MOV DH, CL ; 85A :TEX_PIX_LOAD
MOV CL, CH ; 85C
XOR CH, CH ; 85E
XCHG BX, CX ; 860
MOV AH, [BX]; 862 T = *(DS:V)
XCHG CX, BX ; 864
XCHG DH, CL ; 866 O = V
JMP 1822 ; 868 >TEX_PIX_TEXEL
;
AND CH, 3 ; 86A :TEX_PIX_END
JZ 1886 ; 86D if (X1 % 4 == 0) byte is complete >TEX_PIX_LAST
MOV AH, ES:[DI] ; 86F
MOV CL, CH ; 872
SHL CL, 1 ; 874
MOV CH, FF ; 876
SHR CH, CL ; 878
AND AH, CH ; 87A
NEG CL ; 87C
ADD CL, 08 ; 87E CL = 8 - (X1 % 4) * 2
SHL AL, CL ; 881 P <<= ((4 - (X1 % 4)) * 2)
OR AL, AH ; 883 P |= *(ES:DI) & (0xFF >> (X1 % 4) * 2)
NOP ; 885
;
STOSB ; 886 :TEX_PIX_LAST
The U and V texture coordinates are stored in register BX and their deltas (Ustep and Vstep) in BP. These are 3:5 fixed point numbers [SUP][2][/SUP] that when added together result in a step along the scanline within the texture map.
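To make the 3:5 stepping concrete, here is a small C model of the two byte-wide additions. It is only a sketch: `FIX35_WHOLE` and `step_uv` are names made up for illustration, and the wrap at 8 texels simply falls out of 8-bit arithmetic.

```c
#include <stdint.h>

/* 3:5 fixed point: the top 3 bits are the whole texel index (0-7),
   the low 5 bits are the fraction in steps of 1/32. */
#define FIX35_WHOLE(x) ((uint8_t)((x) >> 5))

/* Hypothetical helper modelling ADD BH,CH / ADD BL,CL: U and V are
   stepped independently, and each wraps around after 8 texels. */
static void step_uv(uint8_t *u, uint8_t *v, uint8_t ustep, uint8_t vstep)
{
    *u = (uint8_t)(*u + ustep);
    *v = (uint8_t)(*v + vstep);
}
```

With `ustep = 0x10` (half a texel) the whole part of U advances every second pixel, which is a 2x enlargement along the scanline.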
I'm hoping over time that I can optimize this with more tricks but first let me explain the pieces ...
First, the whole part of V (0 - 7) is put in CH and tested against a copy of the previous V (notated as O) in DH. If these differ, the texture byte containing the current row of texels must be loaded from memory. That happens rarely, because with such a tiny texture the routine will mostly be scaling up rather than down.
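The row caching can be modelled in C roughly like this. The struct and function names are hypothetical, and the `loads` counter exists only to show how seldom memory is touched; it has no counterpart in the assembly.

```c
#include <stdint.h>

/* Hypothetical state standing in for AH (T) and DH (O). */
struct row_cache {
    uint8_t texels;  /* T: the 8 texels of the cached row */
    uint8_t row;     /* O: which row is cached; 0xFF means none yet */
};

/* Return the row byte for v (3:5 fixed point), reading the texture
   only when the whole part of V differs from the cached row. */
static uint8_t fetch_row(struct row_cache *c, const uint8_t texture[8],
                         uint8_t v, unsigned *loads)
{
    uint8_t row = (uint8_t)(v >> 5);  /* whole part of V, 0-7 */
    if (row != c->row) {
        c->texels = texture[row];
        c->row = row;
        ++*loads;
    }
    return c->texels;
}
```

Seeding `row` with 0xFF plays the same role as `MOV DH, FF`: no valid row can match it, so the first pixel always loads.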
Given that V is merely used to determine the texture byte containing the current row, it is discarded afterwards to free up CX. The whole part of U (0 - 7) is put in CL and used to rotate a copy of the texture byte in CH so that the current texel is in bit 0, i.e. the lowest bit. The rotate count is U + 1 because texel 0 sits in bit 7 of the row byte.
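Here is the rotate trick as a C sketch, assuming the usual BIOS font layout where the leftmost texel is bit 7. `rol8` and `texel_at` are illustrative names, not part of the routine.

```c
#include <stdint.h>

/* Rotate an 8-bit value left by n; a count of 8 behaves like 0. */
static uint8_t rol8(uint8_t x, unsigned n)
{
    n &= 7;
    return (uint8_t)((x << n) | (x >> ((8 - n) & 7)));
}

/* Texel at column u (3:5 fixed point) of a row byte. Models
   INC CL / ROL CH,CL / AND CH,1: rotating left by U + 1 brings
   bit (7 - U) down into bit 0. */
static unsigned texel_at(uint8_t row, uint8_t u)
{
    unsigned col = u >> 5;  /* whole part of U, 0-7 */
    return rol8(row, col + 1) & 1u;
}
```

For the row byte 0x18 (binary 00011000) this reports columns 3 and 4 as lit and everything else as unlit.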
Now here I played around a lot with jumps and no-jumps, various ways to twiddle the bits. I ended up using a jump because from everything I can tell, it is less egregious on cycles on the fall through than my other methods. This means that textures will draw faster the more sparse they are; I believe it's the right thing to do for BIOS characters.
The texture colors are stored split in DL where the upper nibble is the foreground color and the lower is the background color. I'm only using the upper 2 and the lower 2 bits depending on whether a texel is lit, but this gives me some breathing room for switching to 16 colors later. Of course, by then I'll probably rewrite this dang thing from scratch.
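In C terms the color pick looks like the sketch below. `pick_color` is a made-up name; the bit positions follow the listing, where the foreground comes from the top two bits of DL (rotated down by the two ROLs) and the background from the bottom two.

```c
#include <stdint.h>

/* c is the 4:4 foreground:background byte (DL). Only 2 bits of each
   nibble are used for 4-color CGA: bits 7-6 when the texel is lit,
   bits 1-0 when it is not. */
static uint8_t pick_color(uint8_t c, unsigned texel)
{
    return texel ? (uint8_t)(c >> 6)   /* foreground */
                 : (uint8_t)(c & 3);   /* background */
}
```

For example, `c = 0xC1` draws lit texels in color 3 and unlit ones in color 1.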
Placing the new color bits is always done in the lower 2 bits of the pixel byte P after it has been shifted twice to the left. This lets the loop write in as many pixels as necessary, and it also necessitates using SI to store the X1,X2 pair, where the former is incremented on each iteration. Whenever the last two bits of X1 are zero (X1 % 4 == 0), the current pixel byte P is written and the video pointer is incremented using STOSB.
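A C sketch of the packing, under the simplifying assumption that the span starts and ends on byte boundaries (partial bytes are a separate problem). `pack_span` and its `colors` array are hypothetical; in the real loop each color comes straight from the texel test.

```c
#include <stdint.h>
#include <stddef.h>

/* Pack 2-bit colors into CGA bytes, leftmost pixel in the top bits.
   Assumes x1 and x2 are both multiples of 4. Returns bytes written. */
static size_t pack_span(uint8_t *vid, const uint8_t *colors,
                        unsigned x1, unsigned x2)
{
    uint8_t p = 0;  /* P: the pixel byte being assembled */
    size_t n = 0;
    while (x1 < x2) {
        p = (uint8_t)((p << 2) | (*colors++ & 3));  /* P <<= 2; P |= color */
        ++x1;
        if ((x1 & 3) == 0)  /* four pixels collected: flush, like STOSB */
            vid[n++] = p;
    }
    return n;
}
```

Because every flush happens after exactly four shifts, stale bits from the previous byte are always pushed out of P and it never needs clearing.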
Finally, the first and last CGA memory bytes of the scanline add some complication because they may only partially overlap existing pixels. So there's code in there to initialize AL for the first byte if not starting on a byte boundary. And if X1 ends within a pixel byte, then there's a lot of fun and funky shifting and masking to merge P with the existing pixels of that last byte.
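The two edge cases can be written out in C as follows. This models the intended result rather than transcribing the listing instruction by instruction, and `seed_first` and `merge_last` are illustrative names only.

```c
#include <stdint.h>

/* Seed P for a span starting at x1: the k = x1 % 4 pixels already on
   screen to the left are parked in the low bits, so the loop's later
   shifts carry them back to the top of the byte. */
static uint8_t seed_first(uint8_t existing, unsigned x1)
{
    unsigned k = x1 & 3;
    return (uint8_t)(existing >> (8 - 2 * k));  /* k == 0 yields 0 */
}

/* Finish a span whose X1 has stopped at x_end: P holds k = x_end % 4
   new pixels in its low bits. Move them to the left of the byte and
   keep the 4 - k existing pixels on the right. */
static uint8_t merge_last(uint8_t p, uint8_t existing, unsigned x_end)
{
    unsigned k = x_end & 3;
    uint8_t keep;
    if (k == 0)
        return p;  /* byte is complete, nothing to merge */
    keep = (uint8_t)(0xFF >> (2 * k));
    return (uint8_t)((p << (8 - 2 * k)) | (existing & keep));
}
```

As a check, ending at X = 6 with P = 0x0B (pixels 2 and 3) over an existing byte of 0x55 gives 0xB5: the two new pixels on the left, the two old ones untouched on the right.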
Footnotes:
[SUP][1][/SUP]. Mats Byggmastar's FATMAP.TXT and Michael Abrash's "Pooh and the Space Station".
[SUP][2][/SUP]. This gives a fractional resolution of 1/32, which is very low. Textures can thus not be enlarged more than 32x or shrunk smaller than 1/32 of their original size. The limited fractional bits also make steps a bit uneven, which results in "hairiness", but this is okay given the context of the mapping (ASCII characters printed on 3D squares).