
8088/8086 Branch optimization

mbbrutman
Here is a little brain teaser .. First, the original code:

next = (next+1) % (RINGBUFFER_SIZE);

Next is an unsigned 16 bit int. The idea here is that there is a ring buffer and when next goes past the end of the ring buffer we need it to wrap around and become zero.

The compiler generates a DIV instruction for this which is nice and compact, and leaves the remainder in a register. But the DIV instruction is horrible - something like 80 cycles to compute.

So the next, improved version of the code looks like this:

next++;
if ( next == RINGBUFFER_SIZE ) next = 0;

And this is what the code gen looks like:

cmp next, RINGBUFFER_SIZE
jne skiparound
mov next, 0

skiparound:

It is a few more bytes of code, but the jump at worst case takes around 16 cycles. This is a win.
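As a quick sanity check, the two formulations compute the same values. A minimal sketch (the buffer size of 10 is arbitrary, chosen just for the demonstration):

```c
#include <stdint.h>

#define RINGBUFFER_SIZE 10  /* arbitrary size for the demonstration */

/* Original form: one expensive DIV on the 8088. */
static uint16_t advance_mod(uint16_t next) {
    return (uint16_t)((next + 1) % RINGBUFFER_SIZE);
}

/* Rewritten form: increment, then reset at the boundary. */
static uint16_t advance_cmp(uint16_t next) {
    next++;
    if (next == RINGBUFFER_SIZE) next = 0;
    return next;
}
```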

Now to make this even faster, consider the following: if the jump is taken it costs approximately 16 cycles, but if the jump is not taken the cost is only four cycles. You want to optimize for 'branch not taken'. Even though the condition that I checked for above (next is at the end of the ring buffer) is less likely than next being a safe value, the generated code is the opposite of what you want, because the jump instruction has to branch around the infrequently used code.

So the challenge is, how to rewrite this in such a way that the generated code causes a branch less often than it does now.

I want the compiler to generate the branch opposite of what it does now - it should branch if next == RINGBUFFER_SIZE to somewhere else, set next to zero there, and then execute another branch back. That would make the common case fall through the branch, which is far faster, at the cost of doing two branches in the uncommon case. (One branch to get you to the extra code, and one branch to get you back.) But I can't think of a clever way to do this without dropping into assembler.

(The situation with 'if-then-else' constructs is even worse. Even if you code the branch the correct way to make the common case fall through, you still need to jump around the else leg. This seems to be the way the compilers like to generate the code.)

Any ideas on how to write this in C without dropping into assembly?


Regards,
Mike
 
Thought of that, but want to keep the code somewhat generic. If I start imposing restrictions like that somebody is going to get burned ...

And if you think about it, the problem is more general. In this case if it was really important I can do it. But in the general case, how does one hint a branch such that the compiler optimizes for branch not taken (which is best for the hardware) when the compiler views the world differently and generally wants to generate a branch to skip around the little used code?
 
Without seeing the whole picture, I should remind you that C does have a goto statement. e.g.:

Code:
  next++;
  if ( next != RING_BUFFER_SIZE) goto foof2;
foof1:
  ....

foof2:
  next = 0;
  goto foof1;

If I were running on a 286 with a barrel shifter, I'd just sign-extend the result of the comparison and use ANDs and ORs to select the correct value without using jumps at all.
 
I've thought about the goto as well and at worst case I would implement that instead of reverting to ASM. The goto is slightly more readable, but still ugly.

On newer compilers there are mechanisms like "__builtin_expect" that allow the programmer to tell the compiler if the condition being tested is likely or not, and the compiler to generate better code. In the case above it would be trivial for the compiler to move the code out of line (past the epilog of the function) and generate the jumps/gotos to make it work. The performance impact can be quite dramatic even factoring in the extra branches - in the case above if the condition is true one in 10 times that would give you more than a 2x speedup. And that is just for fixing the branching behavior; getting rid of the original DIV instruction is the real score.
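On such compilers the hint can be wrapped in a macro. A sketch, assuming a GCC-style __builtin_expect (the UNLIKELY name is my own, and on compilers without the builtin it collapses to a plain expression):

```c
#include <stdint.h>

/* Hypothetical UNLIKELY wrapper; without a GCC-style
   __builtin_expect it degrades to the bare condition. */
#if defined(__GNUC__)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define UNLIKELY(x) (x)
#endif

#define RINGBUFFER_SIZE 10

static uint16_t advance(uint16_t next) {
    next++;
    /* Hint: the wrap happens only once per RINGBUFFER_SIZE calls. */
    if (UNLIKELY(next == RINGBUFFER_SIZE)) next = 0;
    return next;
}
```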

I guess it's time to start scratching around the Open Watcom code. These optimizations are as valid on 8088s as they are today and it doesn't seem that hard to implement.

As for the 286 .. this is 8088 code that is 286/386 friendly. So I'm trying to keep data and branch targets aligned, but it has to run on the 8088 too. Getting rid of branch penalties will help on all of the variants.
 
My problem with your brain-teaser is that you don't provide much context. Are you trying to buffer blocks of data? If so, you need only check to see if there's sufficient space for an entire block, transfer it, and then update the buffer pointers after the transfer.

If it's just a simple character, then you can certainly do what you're after with a simple if...else structure.

More than 40 years ago, on CDC SCOPE, we defined our circular buffers thusly (translating to C, as it didn't exist then):

Code:
unsigned char *first, *in, *out, *last;

"first" was the address of the first byte of the buffer, "last" was the last+1 address of the buffer. "in" = "out" = "first" signified an empty buffer. If you were adding to the buffer, you'd increment "in" until it was equal to "last", then set it to "first". No fooling with divides, or subscripts.

You'd have to limit your buffers to less than 65KB to make this work and if far pointers were used, you'd be advised to reduce the segment values for the minimum offset values.

The interesting thing about this ancient bit of code was that it was used to communicate between an I/O processor and the CPU. As long as the CPU could keep up with the I/O processor, the process kept issuing I/O operations. You could copy an entire reel of tape to disk this way by issuing only a single I/O request. Otherwise, no OS intervention.
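The (first, in, out, last) scheme described above can be sketched in C like so. The names come from the description; this is a minimal sketch that only shows the insert side and omits the empty/full checks:

```c
#define BUFSIZE 8

static unsigned char buffer[BUFSIZE];
static unsigned char *first = buffer;           /* first byte of the buffer */
static unsigned char *last  = buffer + BUFSIZE; /* last+1 address           */
static unsigned char *in    = buffer;           /* next free slot           */
static unsigned char *out   = buffer;           /* next byte to remove      */

/* Store a byte and advance "in", wrapping at "last".
   No divides, no subscripts - just a compare against "last". */
static void put_byte(unsigned char c) {
    *in++ = c;
    if (in == last) in = first;
}
```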
 
How does your compiler (WATCOM?) translate this:

if(next < RINGBUFFER_SIZE - 1) next++; else next = 0;

or this:

next = (next < RINGBUFFER_SIZE - 1) ? next + 1 : 0;

if it happens to be at the end of a function you could try something like this:

Code:
  ...
  next++;
  if(next < RINGBUFFER_SIZE) return;
  next = 0;
  return;

And if optimization is really so crucial, what's wrong with some inline assembler?


But as Chuck already suggested, choosing a power of 2 for the buffer size would certainly be the most efficient way to do it. Even if you want to maintain some flexibility and allow e.g. the size of the buffer to be defined at program start-up or in some kind of configuration file, you could still round up the value to the next higher power of two before actually using it.
 
My problem with your brain-teaser is that you don't provide much context. Are you trying to buffer blocks of data? If so, you need only check to see if there's sufficient space for an entire block, transfer it, and then update the buffer pointers after the transfer.

<snip>

Forget the buffers for a moment .. that would be suitable for another thread if I thought it was a problem. The problem is how to code if statements in such a way that the common case that you expect falls through (which is faster) while the not-as-common path takes the penalty for any jump.

What I'm seeing is that the compilers are generally falling through the branch (which is faster) if the condition is true, and jumping if the condition is false. If you are testing for a condition that is not hit as often (like needing to wrap around the end of a circular buffer) then you inadvertently penalize the common case. That seems to be a bad thing.

The solution would be to code something like your goto example above, where the uncommon case is forced to take an explicit jump somewhere else, do what it needs to do, and then take a second jump back. It's two jumps instead of the one jump that is currently generated, but it doesn't penalize the commonly used case.

I'm having a problem following the CDC example above. When you increment "in" until it is "last" and then set it to "first", are you making a test with a compare and jump there, or just relying on overflowing an n-bit register to get the wraparound effect? My ring buffer was fairly small - 5 to 10 entries - and I wanted the size to be up to the programmer. (I have changed it and made it a power of two, and I'm using AND now to do the wraparound, which is faster at the expense of generality. But the general problem of getting better branch performance by moving the uncommonly used code out of line still exists.)
 
WiWa:

For a simple if-then statement the pattern is:

Code:
if ( condition ) {
  <true leg code>
}
gets compiled as:

Code:
test condition
conditional branch if false around <true leg code>
<true leg code>

So here if the condition is true you fall through to the <true leg code>. If your condition is uncommon then you usually branch around the true leg code, which is a performance penalty.


For an if-then-else statement the general pattern seems to be:

Code:
if ( condition ) {
  <true leg code>
}
else {
  <else leg code>
}

gets compiled as

Code:
test condition
if false jump to else leg code
<true leg code>
jump to <next basic block>
<else leg code>
<next basic block>


So here we see that if the condition is true, <true leg code> falls through the first jump (four cycles) and then has to make an unconditional jump around the else leg code. That's 20 cycles. If the condition was false then you endure a 16 cycle branch, but that's it.

So without doing anything extraordinary, if you have an if-then statement then try to ensure the condition is true more often than false. If it is false more often than true you are going to incur a performance penalty.

For an if-then-else construct it is less overhead to have your more frequently run code in the else leg.

The question comes up because in my original code I was using if-then to handle a pointer wrapping around the end of a buffer. On a circular buffer, wrapping around is uncommon compared to not wrapping around, and I was trying to avoid the branch overhead.

I've since said 'screw it' and am just requiring that the buffer size be a power of two. That lets me use an AND to handle the wraparound; the penalty for always doing the AND (which may not be necessary) is less than the penalty for the jump. (Assuming all other things are equal. Having operands in storage might change this.)
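The AND trick relies on the identity that, for unsigned values, x % N equals x & (N-1) whenever N is a power of two. A minimal sketch (the size of 8 is arbitrary):

```c
#include <stdint.h>

#define RINGBUFFER_SIZE 8                     /* must be a power of 2 */
#define RINGBUFFER_MASK (RINGBUFFER_SIZE - 1)

/* Wrap with a single AND instead of a compare-and-branch or a DIV. */
static uint16_t advance_and(uint16_t next) {
    return (uint16_t)((next + 1) & RINGBUFFER_MASK);
}
```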


Mike
 
Forget the buffers for a moment .. that would be suitable for another thread if I thought it was a problem. The problem is how to code if statements in such a way that the common case that you expect falls through (which is faster) while the not-as-common path takes the penalty for any jump.

The problem with your example is that the penalty is a double-one -- a jump to the exception and then a jump back from it.

My point was that cases don't exist in isolation. For example, your code could achieve the desired goal quite simply if the context were provided. For example, if you were simply storing a byte in a buffer, your code would look like this:

Code:
  if ( next != (RING_BUFFER_SIZE-1))
  {
    buf[++next] = what;
  }
  else
  {
    *buf = what;
    next = 1;
  }

Of course, you still pay a penalty to jump around the "else" clause. If this were a function, you could insert a return as the last statement of both branches of the if().

The CDC example described used the instruction set of the machine. No compare instructions nor condition codes. Rather, we had "compare register Bx with By and branch if greater or equal" types of instructions. One could also branch on the sign of a register or whether the register was zero or not. The systems also used an instruction cache, so branches out-of-cache were expensive, while branches taken in-cache cost little more than branches not taken. Getting rid of condition codes makes instruction scheduling much easier.

My point was: why do you bother using an index rather than a pointer? The index would seem to be just another thing to get in the way of the optimizer.
 
The problem with your example is that the penalty is a double-one -- a jump to the exception and then a jump back from it.

If the case is rare enough then the double penalty is easily justified. Let's look at the ring buffer case, where you need to detect wrapping around the end of the buffer. On a 10 element buffer taking a double penalty for jumps when you are wrapping is cheap (32 cycles once every 10 times) compared to the way the code is now, where 9 out of 10 times you are taking a 16 cycle jump instead of a four cycle jump.

That's not a problem - that's a performance tradeoff I want to make.

My point was that cases don't exist in isolation. For example, your code could achieve the desired goal quite simply if the context were provided. For example, if you were simply storing a byte in a buffer, your code would look like this:

Code:
  if ( next != (RING_BUFFER_SIZE-1))
  {
    buf[++next] = what;
  }
  else
  {
    *buf = what;
    next = 1;
  }

Of course, you still pay a penalty to jump around the "else" clause. If this were a function, you could insert a return as the last statement of both branches of the if().

The CDC example described used the instruction set of the machine. No compare instructions nor condition codes. Rather, we had "compare register Bx with By and branch if greater or equal" types of instructions. One could also branch on the sign of a register or whether the register was zero or not. The systems also used an instruction cache, so branches out-of-cache were expensive, while branches taken in-cache cost little more than branches not taken. Getting rid of condition codes makes instruction scheduling much easier.

My point was: why do you bother using an index rather than a pointer? The index would seem to be just another thing to get in the way of the optimizer.


The CDC instructions that combined comparison with branching might have gotten rid of the condition codes from the pipeline, but they had to increase the cycle time of the machine or the relative number of cycles for those instructions - unless the cycle time was already long because of some other instruction, in which case you can tuck extra goodies into the less complex instructions. Either way, a combined 'compare and branch' in the same instruction seems expensive to implement and must have influenced cycle time.

Excluding the cache penalty, does the CDC make taken branches more expensive than non-taken branches like x86 does?

I don't think an index into an array vs. a pointer makes a difference in the code gen. I have looked at it both ways. A good C compiler treats an array with an index as a pointer anyway.
 
And the code in question ... the current version is using the AND trick, but you can see that I still have a branch that I'd like to get rid of. This one is harder to guess the performance on though - it's quite possible to try to dequeue a pointer when the queue is empty, or to try to put too many on. (It's not as clear cut as knowing that you only wrap around the end of the buffer 1 in n times.)

Code:
// RINGBUFFER_SIZE must be a power of 2.

#define RINGBUFFER_SIZE TCP_SOCKET_RING_SIZE
#define RINGBUFFER_MASK (RINGBUFFER_SIZE-1)



class RingBuffer {

  public:

    uint16_t  first; // Index to first item to be dequeued.
    uint16_t  next;  // Index to next place to enqueue an item.

    // Entries can tell us if we are empty or full.  It is easier to maintain
    // and use a counter than it is to compare the indexes.
    uint16_t  entries;

    void    *ring[ RINGBUFFER_SIZE ];


    // Do not do this unless you know the number of entries is already zero.
    inline void init( ) { first = next = entries = 0; }

    inline int16_t enqueue( void *data ) {
      if ( entries == RINGBUFFER_SIZE ) return -1;
      ring[next] = data;
      next++;
      next = next & RINGBUFFER_MASK;
      entries++;
      return 0;
    }

    inline void *dequeue( void ) {
      if ( entries == 0 ) return NULL;
      uint16_t i = first;
      first++;
      first = first & RINGBUFFER_MASK;
      entries--;
      return ring[i];
    }

    inline void *peek( void ) {
      if ( entries == 0 ) {
        return NULL;
      }
      else {
        return ring[first];
      }
    }

    inline uint16_t hasRoom( void ) { return ( entries < RINGBUFFER_SIZE ); }

};

The code is inline, so 'returns' are not really function exits but more of an 'ok, put this value in the register for the next op'.

One pleasant surprise - the original code that used the modulo operator generated very nice code when a power of 2 was used. It basically realized that it could AND with a mask and did the optimization automatically! (It would not have been able to do that if the size wasn't known at compile time.)
 
Okay. Note that if you use separate pointers for "in" and "out", you don't have to worry about multitasking issues. The number of entries is simply equal to in-out or queue_size + (in-out). The queue is empty if in==out; the queue is full if in == wrap(out-1). Since the in pointer is modified only by routines putting data into the queue and out by routines taking it out, there's no collision between routines (including updating entry count).

I don't know if that matters or not.
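The entry count described above can be sketched in C like so (hypothetical names; assumes in/out indexes that wrap at the queue size, with in == out meaning empty):

```c
#include <stdint.h>

#define QUEUE_SIZE 8

/* Number of occupied slots. With in == out meaning empty,
   at most QUEUE_SIZE-1 slots are usable. */
static uint16_t entry_count(uint16_t in, uint16_t out) {
    return (uint16_t)(in >= out ? in - out : QUEUE_SIZE + in - out);
}
```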
 
Okay. Note that if you use separate pointers for "in" and "out", you don't have to worry about multitasking issues. The number of entries is simply equal to in-out or queue_size + (in-out). The queue is empty if in==out; the queue is full if in == wrap(out-1). Since the in pointer is modified only by routines putting data into the queue and out by routines taking it out, there's no collision between routines (including updating entry count).

I don't know if that matters or not.

Multi-tasking is no issue here, but I love ring buffers in multi-tasking environments because you can usually update the head and tail independently from two separate tasks. And if you put them in their own cache lines you avoid cache misses when the producer and consumer try to update the head and tail.

(My lower level packet handling code does something similar - the packet driver adds on one side while the networking code reads from the other. That does require my code to disable interrupts when removing a packet, but that's about the only place in the code where I have to worry about locking.)

As you note, the number of entries can be computed using arithmetic instead of maintaining a separate counter, but the counter serves two purposes. It's easier to increment and decrement than it is to do the arithmetic and worry about the wraparound situation, and it makes the condition where first == next unambiguous. (It means empty.) Without a counter, first == next could also happen when the queue was full, not just when it was empty. Having a counter keeps me from having to have the smarts to figure out whether it is full or empty when first == next.

Moral of the story .. branches are bad. And division is far worse. :)
 
In the (first, in, out, last) scheme, you use a "guard byte", such that the only time in==out, the queue is empty; i.e. you can never advance in such that it equals out.
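The guard-slot idea can be sketched in C as follows (hypothetical names; one slot is always left unused, so in == out can only mean empty):

```c
#include <stdint.h>

#define QUEUE_SIZE 8   /* power of 2 so the wrap is a single AND */
#define QUEUE_MASK (QUEUE_SIZE - 1)

/* With the guard slot, in == out unambiguously means empty. */
static int queue_empty(uint16_t in, uint16_t out) {
    return in == out;
}

/* Full when advancing "in" would run into "out". */
static int queue_full(uint16_t in, uint16_t out) {
    return ((in + 1) & QUEUE_MASK) == out;
}
```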

Do you store things byte-by-byte, or do you have some knowledge of the length of the datum to be stored in advance? If that's the case, then you needn't perform the check repeatedly for each byte.

I do miss the "conditional skip" in the x86 instruction set. It would make the issue with the branch become negligible.

Branches aren't necessarily bad; it depends on the architecture. For example the STRETCH used a branch-prediction mechanism that assumed that branches would be taken, and so began fetching from the branch target in its look-ahead. Taken branches were cheap. Or take the i860; a conditional branch has an instruction following it that's executed regardless of whether or not the branch is taken. This allows for overlapping the branch with the following instruction, hiding the overhead.

The 8088 is what it is; a slow primitive CPU. That being said, if I were coding assembly, your original problem could be coded with no jumps at all--and quite easily.

Suppose that CX = queue size and BX = next queue position. Then your problem becomes:

Code:
       inc     bx                      ; bump next
       cmp     bx,cx                   ; next - size; carry set if next < size
       sbb     ax,ax                   ; if carry set, ax = ffff, else ax = 0
       and     bx,ax                   ; bx = either next+1 or 0

I don't think you'll ever get a C code generator to be smart enough to generate that; although I have seen the ? operator occasionally implemented using the cmp-sbb trick.
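The sign-extended-comparison trick translates to C roughly as follows. A sketch, not a guarantee about any code generator - the compiler may or may not turn the negated comparison back into an SBB:

```c
#include <stdint.h>

/* Branchless wrap: build an all-ones mask while next+1 is still in
   range, then AND it in. Mirrors the inc/cmp/sbb/and sequence. */
static uint16_t advance_branchless(uint16_t next, uint16_t size) {
    uint16_t mask = (uint16_t)-(uint16_t)(next + 1 < size); /* 0xFFFF or 0 */
    return (uint16_t)((next + 1) & mask);
}
```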
 
I've done variations on the guard byte, but maintaining the counter of items in the queue turned out to be cheaper. All of these ideas have a common thread - get to known states such that conditional logic is not necessary. If you can design the data structure to eliminate the conditional logic, you are probably better off at run-time, even if the start-up cost or storage cost is more.

The code above stores pointers to data structures. The data structures are either buffers that need to be transmitted, buffers that have already been sent and are awaiting acknowledgment, or buffers that have been received but not processed yet. Nothing is done by single bytes.

"Conditional skip" would be awesome .. most of the current architectures have something like that. Cell (my specialty) had the ability to compute both sides of a condition and select the committed one using a logical op, which was far better than branch overhead on that CPU.

The i860 mechanism you mention is known as a branch delay slot - the instruction after a branch always gets executed because it is already in the pipeline. I think MIPS pioneered that; MIPS is always mentioned when branch delay slots come up. One thing I'll disagree on is that branches aren't necessarily bad .. they are always bad. If you have branch prediction hardware you can limit the chances of getting hit, and if you have speculation you can throw enough pipeline resources at it to make it look like there's no penalty even if you do mispredict. But you've used quite a bit of chip real estate to do that.

And of course anything lower than a Pentium or Pentium Pro doesn't do these things. But the relative branch penalty on those chips is not as bad as it is on the 8088/8086.

I like the SBB trick ... that's a keeper.
 
Most modern MCUs have skip in some variation; ARM, of course, takes conditional execution to its logical conclusion. But even on the x80 architecture, a "skip on condition" would have been easy to implement. While the 386 gave us SETxx to avoid a branch in some HLLs, there still is no conditional move or load. The ironic thing is that these instructions were added long after they would have given the most "bang for the buck". Pipelining and caching with multiple issue enable one to hide a lot of branch latency--and strangely enough, the use of condition codes now sets up an artificial dependency: the instruction that sets the branch's condition has to be logically close to the branch.

Ah well, we work with what we've got...and x86 is nothing so much as an enhanced 8008 when you get down to it.
 
It's just a thought I had - sometimes if I don't like using 'IF' statements in Pascal, I may use "repeat...until", "while" or "for" statements. "while" can be used to jump over code since it first checks its condition, "for" is only any good if you really want looping (and control over how much of it), and "repeat...until" will take you into a loop and won't exit until its condition has been met! C offers similar methods, so perhaps they may be worth considering. The other useful construct C has is the "case" ("switch") statement. In Pascal, a "case" statement checks a value and, depending on the conditions set, applies the code relevant to that condition.

But I bet you've got it all figured out! :D
 
Unfortunately, it's very common that the source code doesn't exert much control over how the object is generated. What Mike is asking C to do is generate 2 jumps where it would normally generate only one. Most C code generators are written to treat conditional comparisons as if all results of the comparison bear equal weight. So, you'd think that Mike could satisfy his requirement by saying something like if ( index != BUFFER_SIZE);else index = 0; or some such thing, but in fact, the code generated will be equivalent to if (index == BUFFER_SIZE) index = 0;

As a matter of fact, my C compiler (Microsoft) generates exactly the same code for:

(a == b) ? c : d; and
(a != b) ? d : c;

Moral: Sometimes there's a price to be paid for using an HLL.
 
Chuck(G) wrote:

Moral: Sometimes there's a price to be paid for using an HLL.

Naturally, sadly the answer to this problem may be on this page! :mad:

This was my solution to number crunching in Turbo Pascal 3 on a Z80; the result appears to be slower than the straightforward Pascal! :confused: Just perhaps it could be optimised a tad so my routine uses one variable instead of a second, though I guess it's a question of how fast is fast enough! :D
 