
My variation on masked blitting in Mode X

PgrAm
Experienced Member
Joined: Sep 28, 2011 | Messages: 276 | Location: Toronto, Canada
Hey all, I've been experimenting with different ways to implement transparent sprites in Mode X, and I came up with a method I haven't seen before, so I thought I'd share it.

The image is stored in a standard planar format, with palette index zero representing a transparent pixel, and blitted in the following manner:

Code:
outb(SC_INDEX, MAP_MASK);
for each plane:
{
	mov cx, VGA_SEGMENT
	mov es, cx
	lds si, planeData 	;the bitmap data 
	mov ah, planeMask	;the bit mask to enable this plane
	mov bx, bmpHeight 	;image height
	mov dx, SC_DATA	
	
	rowLoop:
	mov cx, bytes_per_line ;load the pixel counter for this row

	pixLoop:
	xor al, al 		;AL = 0
	cmp al, ds:[si] 	;if ds:[si] > 0 then CF = 1, else CF = 0
	sbb al, al		;AL = 0 - 0 - CF: 0 if transparent, 0xFF if opaque
	and al, ah		;combine with the plane mask
	out dx, al		;output the new mask setting
	movsb			;plot the pixel
	loop pixLoop		;continue as long as there are pixels left

	add si, lineDiff	;add the remaining distance to the edge of the bitmap
	add di, screenDiff	;add the remaining distance to the edge of the screen
	dec bx			;continue as long as there are rows left
	jnz rowLoop
}

The outer parts are pseudocode because the code I wrote this for is complicated in a way that isn't really relevant here. The interesting idea is generating a mask from the pixel's value and using the VGA map mask to avoid branching. This could also work with a pre-computed mask, but I didn't really want to waste the memory for that in my application.
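For what it's worth, here is a rough sketch of what the pre-computed-mask variant of that inner loop could look like. It assumes the mask bytes (0 or 0FFh) are interleaved with the pixel bytes so LODSW fetches both at once, that DX still holds SC_DATA, and that the plane bit is kept in BH since LODSW clobbers AH:

Code:
	mov  bh, planeMask	;enable bit for the current plane
pixLoop:
	lodsw			;AL = stored mask (0 or 0FFh), AH = pixel value
	and  al, bh		;keep only this plane's bit
	out  dx, al		;program the VGA map mask
	mov  al, ah
	stosb			;write the pixel (disabled planes ignore it)
	loop pixLoop		;CX counts mask/pixel pairs in the row

This trades the second read of each pixel for roughly double the sprite memory, which is exactly the trade-off mentioned above.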

Advantages:
- No branching required
- Clipping is fairly straightforward
- Can be stored in the same format as opaque sprites

Disadvantages:
- Reads each pixel twice. I tried keeping the data around in a register, but the register juggling made it slower on my target (a 286); it might be worth it on an 8088.
- Requires an OUT instruction for each pixel, which is slow on a protected-mode 386/486

Well, that's the gist of it. I'm curious to hear what you guys think of it, or whether there are any improvements you can come up with.
 
I like the SBB trick too, but the overhead to eliminate a single jump doesn't seem worth it here.
Accessing video registers or memory is usually slower than normal RAM, so it's better to avoid this.

Code:
	jmp pixLoop

	align 2

skipPixel:
	inc di		;2
	dec cx		;2
	jz endPixL	;3 (assuming CX > 0)
pixLoop:
	lodsb		;5
	test al,al	;2
	jz skipPixel	;3/8
	stosb		;3
	loop pixLoop	;9
endPixL:

According to the 286 manual, this should take 22 clock cycles in both paths. Your loop takes 30, and writes to both video RAM and I/O every time, so VGA wait states will have a greater effect. Maybe Trixter or Scali could offer further comments?
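(Summing the cycle comments above, taking them at face value: the opaque path is 5 + 2 + 3 + 3 + 9 = 22, and the skip path is 5 + 2 + 8 + 2 + 2 + 3 = 22, so both really do come out the same.)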
 
Wouldn't compiled sprites be faster/easier?

Faster, yes... easier at Mode X resolutions? It comes down to the sprite size.
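For illustration, a compiled sprite boils the image down to straight-line code, so transparent pixels cost no instructions at all. A rough sketch of one plane of a generated routine might look like this (the offsets and pixel values are invented, the map mask is assumed to be set for this plane before the call, and an 80-byte Mode X row pitch is assumed):

Code:
	mov byte ptr es:[di+0], 14h	;row 0, byte 0
	mov byte ptr es:[di+1], 15h	;row 0, byte 1
	mov byte ptr es:[di+80], 14h	;row 1, byte 0 (80-byte row pitch)
	mov word ptr es:[di+82], 1716h	;row 1, bytes 2 and 3 in a single write
	ret

The catch in Mode X is that you generally need code like this per plane, which is part of why it comes down to the sprite size.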

A method I use from time to time is to store, first, a word-sized offset (based on the screen width), then how many bytes to write, then the data for a 'section'. This is highly inefficient if you have dithering across planes, but brutally efficient for images with only a few holes in them. To that end I often have the first word act as an indicator of which encoding I'm using.

Code:
	lodsw			;first offset (no zero check, so it may be zero)
.segmentLoop:
	add  di, ax		;skip the transparent run
	lodsw			;count of opaque bytes in this section
	mov  cx, ax
	rep  movsb		;copy the section to the screen
	lodsw			;offset to the next section
	or   ax, ax		;a zero offset ends the sprite
	jnz  .segmentLoop

That's the heart of it. I load the first offset on the assumption that no check is needed (which also allows the first offset to be zero), add it to DI, load the count of how many bytes to output, then REP MOVSB the data over. Then I load the next offset; if it's non-zero, keep going. This method also works well in plain old mode 13h.

Again, if every other byte alternates between write and don't-write this can be slow, but if you have more than two opaque bytes in a row it is WAY faster. As mentioned, a header byte can be used to alternate between this and a more conventional "0 as transparent" technique.

It can also result in the sprites being much smaller in memory, since any run of more than 4 transparent bytes ends up as just 4 bytes (the offset word plus the count word for the next section).
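To make that layout concrete, a hypothetical stream for a sprite with two opaque runs might look something like this (the skip distances and pixel values are invented):

Code:
spriteData:
	dw 3			;first offset: skip 3 transparent bytes
	dw 2			;section length: write 2 bytes
	db 21h, 22h		;pixel data for the section
	dw 78			;skip to the next opaque run (can cross rows)
	dw 4			;write 4 bytes
	db 23h, 24h, 25h, 26h
	dw 0			;zero offset terminates the sprite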
 
I did several tests on my 286 and even made a partially unrolled version using LODSW and STOSB. This brought the average cycle count closer to the version with the branch, but in my testing it still ended up slower than branching, by a significant margin, probably due to the time needed to refill the prefetch queue given the larger code size. Anyway, for now I'm just using the old branch version, but I haven't given up on this method yet; I may still come up with a way to improve it. If you're interested, here's the 2x-unrolled version:

Code:
rowLoop:
	mov cx, bytes_per_line
	shr cx, 1		;process pixels in pairs
	jz lastByte
pixLoop:
	lodsw			;AL = first pixel, AH = second pixel
	mov bl, al		;save the first pixel (BH holds the plane mask)
	xor al, al
	cmp al, bl		;CF = 1 if the first pixel is opaque
	sbb al, al		;AL = 0 or 0xFF
	and al, bh
	out dx, al		;set the map mask for the first pixel
	mov al, bl
	stosb			;write the first pixel
	xor al, al
	cmp al, ah		;repeat for the second pixel
	sbb al, al
	and al, bh
	out dx, al
	mov al, ah
	stosb
	loop pixLoop
lastByte:
	test bytes_per_line, 1	;odd row width? handle the last pixel
	jz endLine
	xor al, al
	cmp al, ds:[si]
	sbb al, al
	and al, bh
	out dx, al
	movsb
endLine:
	add si, lineDiff	;advance to the next bitmap row
	add di, screenDiff	;advance to the next screen row
	dec heightCount
	jnz rowLoop
 
closer to the version with the branch, but in my testing it still ended up slower than branching, by a significant margin
Well, that's a lot of code... and if it's slower on a 286 it's going to be hell on an 8088. Remember, fetch is the enemy, so if you can do it in less code it's almost always faster, ESPECIALLY if there are wait states on the memory. There's a reason zero wait states paid such high dividends on 286s.

Ballparking that code, you're looking at 100 to 150 clocks per pixel inside the loop just from all the operations -- upwards of twice that on an 8088. Given that something more like this "inside the loop":

Code:
.loop:
	lodsb			;fetch the next pixel
	or  al, al		;palette index 0 = transparent
	jz  .next
	mov  es:[di], al	;write opaque pixels only
.next:
	inc  di
	loop .loop

... is going to come in well under 30 clocks when the jz is not taken, and only 10 more clocks when it is? Yeah... that. Even on an 8088, using a jump inside it is 61 clocks accounting for the BIU. That means -- in theory -- a 4.77 MHz 8088 running the jump version would be faster than a 6 MHz AT running the "dick around with the ports" approach.
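Rough arithmetic, taking those ballpark figures at face value: 61 clocks at 4.77 MHz is about 12.8 µs per pixel, while even the low end of the 100-to-150-clock estimate at 6 MHz is about 16.7 µs, so the XT really would come out ahead.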
 