8088 optimization question (eliminating branches)

VileR · Nov 10, 2019

Consider the following scenario: let "background" and "foreground" be two layers of visual data, composed of 16-bit values. I'd like to merge them as follows - for each offset, if foreground contains zero, it is considered transparent (=use the background value). Otherwise, use the value in foreground.

Something like this:

Code:

for (i=0; i<size; i++)
{
  merged[i] = (foreground[i] == 0) ? background[i] : foreground[i];
}

I'm looking to implement this in 8086 assembly. If I test each word value for zero and use conditional jumps, it's straightforward... but also very, very slow.
Is there a faster, more optimized way to do this? Hopefully without branching, if at all possible? (The choice of zero to represent transparency is arbitrary, BTW).

One way is to "compile" the layers, as in the "compiled sprites" technique sometimes used for fast animation... i.e. instead of plain data, generate the layers as code that performs the merging in-place (with the data as immediate values), and simply does nothing in the 'transparent' case. But that wouldn't work so well, because my use-case requires working with arbitrary pieces of each layer (so the code wouldn't have a fixed entry point, and the expense of compensating for this could nullify a lot of the gain).

Chuck(G) · Nov 10, 2019

My first and easiest step would be unrolling the loop, or at least a few iterations of it. It is possible to do the selection without branching, but I'm not sure that it will save you many cycles. Running the loop in reverse and initializing the index to its final value gets rid of a compare at the bottom of the loop (decrement and loop until zero).

What are the possible values of "foreground"?

VileR · Nov 10, 2019

Chuck(G) said:
My first and easiest step would be unrolling the loop, or at least a few iterations of it. It is possible to do the selection without branching, but I'm not sure that it will save you many cycles. Running the loop in reverse and initializing the index to its final value gets rid of a compare at the bottom of the loop (decrement and loop until zero).

True, some unrolling will get rid of one major cycle-eater (loop - 18 cycles), but doing a conditional jump on each word is still 16 cycles a pop (if taken; 4 otherwise). I have the nagging feeling that I'm missing some trick that could be a major time-saver... if it's possible to do the selection without branching, as you say, please do share your idea - it might help. I haven't started on the implementation yet; just doing the math first to see if it's feasible at an acceptable speed.

My "compiled" approach could sort of work, I think, if the do-nothing (transparency) cases constituted single jumps to the nearest do-something (nonzero) case, which would mov the foreground value as an immediate. That would take care of arbitrary entry-points, but arbitrary exit-points would still need self-modification, or more conditional checks... a bit hairy, but possibly still faster than a branch on each value?

What are the possible values of "foreground"?

Any 16-bit value. 0000h through FFFFh are all valid - 0000h is sort of an arbitrary choice for transparency, but thinking about it, every value with a high byte of 00h could also be considered transparent (if that makes things easier). These are text-mode character/attribute pairs, if it helps visualize what I'm trying to do.

Chuck(G) · Nov 10, 2019

Well, I'm willing to look at the code that you've come up with and suggest speedups.

Your code would be two instructions on a vector machine.

Dwight Elvey · Nov 10, 2019

If you have a limited number of items to analyze and they are numerically close, it makes sense to make a jump table. This way you only make one offset calculation and then one or two jumps, depending on coding.
Another way to look at things is to efficiently encode things as bits. One can then do a series of ANDs, ORs and NOTs to create logical branching without actually branching. Remember, a 32 bit processor has 32 states possible treating each bit as a possible state. It has billions of states of using combinations of bits. Do remember that most compilers ( like C ) think of logical compares as branching statements. The thinking is that if you write your code well, you'll put things that are most often or most important for speed at the beginning of the branching statements. Still, if you have a case statement with more than 8 possible branches, there is usually a better way to handle it.
I think both Chuck and I are wondering what transparent means to decision making ( a branch ).
Dwight

Chuck(G) · Nov 10, 2019

FWIW, if you're really looking to do things branch-less, consider the following bit of code (ax has the value to be tested):

Code:

    xor     bx,bx
    sub    bx,ax
    sbb    bx,bx

(bx) will be FFFF if ax is nonzero; 0000 otherwise. You can then use boolean operations (AND) to select the value. But I'm not sure that it will be any faster than a short jump.

commodorejohn · Nov 10, 2019

Yeah, I don't think going through major contortions to avoid a short branch is really worth the trouble until you get into heavily pipelined CPUs (Pentium or higher, 486 to a lesser extent,) where the cost for a missed branch becomes heavier. Still, the trick Chuck mentions is a nifty one, especially since x86 is one of the architectures where move operations don't set the flags, so any test requires a separate instruction. Might as well fold the test and select into the same set of operations, in that case. Lemme see, example implementations:

Code:

	; branch-less version
	std			; prepare to decrement!
	; pre-condition: DS and ES are set to the segments for source & dest.
	; limitation: foreground & background must be in the same segment
	mov di, DESTINATION	; destination address (end of destination)
	mov bx, FOREGROUND	; foreground address (start of foreground - 2)
	mov bp, BACKGROUND	; background address (start of background - 2)
	mov si, SIZE		; field size
loop	mov ax, [bx + si]	; get foreground value
	mov cx, [bp + si]	; get background value
	xor dx, dx		; clear DX
	sub dx, ax		; set DX to -AX
	sbb dx, dx		; set DX to (AX == 0) ? 0x0000 : 0xFFFF
	and cx, dx		; pass background only if foreground = 0
	or ax, cx		; select whichever was not zeroed
	stosw			; place it in the destination field
	sub si, 2		; decrement the source index
	jnz loop		; continue
	
	; branching version
	std			; prepare to decrement!
	; pre-condition: DS and ES are set to the segments for source & dest.
	; limitation: foreground & background must be in the same segment
	mov di, DESTINATION	; destination address (end of destination)
	mov bx, FOREGROUND	; foreground address (start of foreground - 2)
	mov bp, BACKGROUND	; background address (start of background - 2)
	mov si, SIZE		; field size
loop	mov ax, [bx + si]	; get foreground value
	test ax, ax		; is foreground zero?
	jnz .st			; if no, store what we have
	mov ax, [bp + si]	; if yes, get background value
.st	stosw			; place it in the destination field
	sub si, 2		; decrement the source index
	jnz loop		; continue

Note that the branching version is actually three instructions shorter. Whether it's faster in practice? I dunno, try it and see.

It would be easier still to optimize if this were a destructive operation (i.e. background = (foreground == 0) ? background : foreground.) But depending on your application, that might necessitate making a separate copy of the destination layer anyway.

Chuck(G) · Nov 10, 2019

I don't know if it would shorten things any, given your problem, but that 3-instruction snippet can be simplified to 2, thus testing the contents of ax->ffff if nonzero, 0 otherwise.

Code:

    neg     ax
    sbb     ax,ax

FWIW, "neg" is an oft-overlooked instruction. It can be used to set the carry flag if a register is nonzero. Of course, it destroys the content of the register.

commodorejohn · Nov 10, 2019

Yeah...problem is, you still need the value of AX, unless (as mentioned) you're doing a destructive copy.

Chuck(G) · Nov 10, 2019

That's what I thought--hence, the first version.

The second version is handy when computing FORTRAN logical IFs and other such nonsense. F77, IIRC, didn't allow for "shortcutting" compound IFs--you computed the values of all sub-expressions. It's easy to do a magnitude comparison, just subtract and a sign-extended right shift.

e.g. IF (I.GT.J.AND.K.NE.0)...

VileR · Nov 10, 2019

Thanks for the replies - appreciated!

@commodorejohn: a destructive operation is perfectly good actually; I should've made my example overwrite the background. Re: cost of branching, on the 8088 the major hit may just be the conditional jump itself (16 cycles when taken). Losing the prefetch queue is a lesser concern, but then again fetching instructions from RAM is still 4 cycles/byte.

I believe I'll unroll major chunks of the loop regardless of implementation, so the bigger worry would be the cost of processing each element. And if we do a destructive copy, the difference doesn't look like much: we can skip one read in the branching version, but the non-branching version would still needs two reads + more arithmetics (NEG, SBB, AND, OR)... and since reads are also expensive (12 for LODSW, more for MOV) the branching version might actually have a slight edge. Guess I'll have to try both and see.

Trixter · Nov 11, 2019

Remember reenigne's rules of optimization:

1. Speed is roughly measured in number of I/Os, including the instruction opcode bytes.
2. Always profile the code, as there are non-obvious factors that can affect #1. (This is an Abrash rule as well

To those, I'll add my own:

1. Branchless is only faster if you're eliminating more I/Os than the worst-case branch (see reenigne #1).
2. REP MOVS/STOS (with mov cx,num setup) break-even is 4 output words, so if you're storing/copying 8 bytes or more, do the REP; if less, see if you can get away with individual MOVS/STOS.

(Don't know how large some opcodes are? Run debug, a)ssemble, type some instructions, hit enter twice, then u)nassemble to check their size.)

Looking at commodorejohn's implementation, the branchless version is clever, but it only eliminates one jump so the jumping version should be faster. Worst case on the jump is 16 cycles, but the branchless operations take up 10 bytes (40 cycles).

Writing directly to video memory ;-) ;-) ;-) will minimize the speed gains of one over the other, but I still think the branching one will be faster once you benchmark it.

VileR · Nov 11, 2019

Hah, thank you for those. I wasn't prepared for the I/Os sneaking up on me - seems like both of your First Rules are validated by the conclusions in this thread.

I routinely do the DEBUG -a/-u dance for instruction sizes; however, shameful as it is, my idea of "profiling" still boils down to modifying the border color for each interesting block of code and guesstimating things visually. You could say I'm not yet attuned to the ways of Zen.

My original problem isn't very relevant any longer, because the premise and scope of what I'm trying to do have changed (and much for the better, in terms of speed). ;-) Not that the discussion hasn't helped- eye-openers are always a good thing.

As for doing operations like these directly to video memory... oh, not by a long shot. That would just wrap around to pure evil. :D

Chuck(G) · Nov 12, 2019

Are you interested in some profiling routines? I've got some that I wrote years ago.

My first one was resident in a CDC 6600 PPU--a PP can directly read the CP P-counter, so it was mostly a matter of bookkeeping.

reenigne · Nov 12, 2019

Nothing shameful about profiling by modifying the border colour - I quite often do that too, and I have all sorts of tools for measuring how long a bit of code takes down to the cycle.

As for the original question - I haven't generally found branchless techniques to be particularly helpful on 8088. The cost of a taken branch (and refilling the prefetch queue) is high compared to an untaken branch but by the time you've executed a couple of 2-byte instructions to do the same thing in a branchless way, you've lost any advantage. Definitely try to arrange things so that the untaken branch is the common case if you can (but not at the expense of adding an extra unconditional jump).

VileR · Nov 12, 2019

Chuck(G) said:
Are you interested in some profiling routines? I've got some that I wrote years ago.

My go-to would've been Abrash's Zen Timer, which is quite general-purpose, though if you have your own to share I'm always interested in looking at different approaches.

reenigne said:
I haven't generally found branchless techniques to be particularly helpful on 8088. The cost of a taken branch (and refilling the prefetch queue) is high compared to an untaken branch but by the time you've executed a couple of 2-byte instructions to do the same thing in a branchless way, you've lost any advantage. Definitely try to arrange things so that the untaken branch is the common case if you can (but not at the expense of adding an extra unconditional jump).

Yep, that's what I'm beginning to realize. I should really go back to some of my existing code and apply some of these insights - especially in 'precalc' situations where I wasn't too rigorous... and perhaps still had an inertial tendency to optimize for size, rather than speed. The latter is definitely trickier.

commodorejohn · Nov 13, 2019

The advice about rearranging the clauses of a branch is also very good. I've lost count of the number of times I've been trying to get my head around the best way to do something and realized that if I just switch around which of the possibilities I'm testing is the fall-through case, everything makes so much more sense.

dreNorteR · Nov 14, 2019

I believe this is the fastest branchless way to do it:

Code:

	; DS:SI => foreground
	; DS:BX => background
	; ES:DI => output
	; CX = length

	; make BX relative to SI
	sub bx, si
	dec bx
	dec bx			;BX = bg_array - fg_array - 2

	mov bp, 1		;constant
	cld

l1	lodsw			;load foreground into AX
	cmp ax, bp		;set carry if transparent
	sbb dx, dx		;DX = mask
	and dx, [bx + si]	;DX = (fg == 0? bg : 0)
	or ax, dx		;AX = either fg or bg
	stosw
	loop l1

But the most straightforward branching version should be faster in any case:

Code:

l1	lodsw
	test ax,ax
	jnz l2
	mov ax, [bx + si]
l2	stosw

* 2 bytes shorter: -8 clocks
* jump taken, mov skipped: +16 -19

8088 optimization question (eliminating branches)

VileR

Veteran Member

Chuck(G)

25k Member

VileR

Veteran Member

Chuck(G)

25k Member

Dwight Elvey

Veteran Member

Chuck(G)

25k Member

commodorejohn

Veteran Member

Chuck(G)

25k Member

commodorejohn

Veteran Member

Chuck(G)

25k Member

VileR

Veteran Member

Trixter

Veteran Member

VileR

Veteran Member

Chuck(G)

25k Member

reenigne

Veteran Member

VileR

Veteran Member

commodorejohn

Veteran Member

dreNorteR

Experienced Member