
8088/8086 Branch optimization

My gripe with a lot of current thought is that it's held that everything can be solved with fast enough hardware; that being aware of the generated code isn't important.

Maybe so, but I don't agree. I like to know what's going on.
 
Chuck(G) wrote:

My gripe with a lot of current thought is that it's held that everything can be solved with fast enough hardware; that being aware of the generated code isn't important.

Agreed; the difference between the same program generated on a 16MHz computer and a 4MHz computer was the difference between it running perfectly and running too slowly. Admittedly the 16MHz computer had a 32-bit processor with a 16-bit bus, which might have helped it along vs. an 8-bit computer! There were some obvious disadvantages to the 8-bit version as well which hindered its speed. But even on a 4MHz computer, writing efficient code is difficult; 16MHz is a nice speed for run-of-the-mill stuff, even though writing really optimised code on any system would be difficult (unless you're a real programmer! :wink:). But 2GHz is simply overkill! :onfire:

Maybe so, but I don't agree. I like to know what's going on.

Not sure what you're referring to. I agree if you're referring to "Real Programmers..." which may have some truth to it, though I saw it for some crackpot doing their nana! :curse: I just pretend to play the harp! :harp: And kill their OS/370 when they're not lookin'! :killcomputer::happy3:
 
My take on this has always been that since the code generated varies by compiler, if you really care about what the compiler is outputting you should drop to asm. I know you aren't asking for that -- but really, if you care about the machine language that's built from your code, C, Pascal, and most any other language are the wrong tools for the job.

Probably why I always liked the inline assembler in Borland flavors of C and Pascal -- it makes it SO easy to pop in and out of ASM to 'optimize inside the loop'.
 
Probably why I always liked the inline assembler in Borland flavors of C and Pascal -- it makes it SO easy to pop in and out of ASM to 'optimize inside the loop'.

Sometimes, even an inline assembler won't do the job. Consider this scenario:

You want to write a utility that can be run on any x86 platform from 8088 to Pentium, but you want to use the CPU instruction set extensions if they exist, including 32-bit or 64-bit registers--but you don't want the compiler to generate code for the later processors by default. In other words, you want to make the decision at run-time as to what the hardware is and then use the code that works best for it.

I know of no compiler that allows this. C, Pascal, PL/I or any other language.
 
Sometimes, even an inline assembler won't do the job. Consider this scenario:

You want to write a utility that can be run on any x86 platform from 8088 to Pentium, but you want to use the CPU instruction set extensions if they exist, including 32-bit or 64-bit registers--but you don't want the compiler to generate code for the later processors by default. In other words, you want to make the decision at run-time as to what the hardware is and then use the code that works best for it.

I know of no compiler that allows this. C, Pascal, PL/I or any other language.

I have to ask: why would you want to do that instead of creating a single executable with multiple paths depending on which CPU is running the software? The multi-path executable will be fatter, but I doubt it would add as much complexity or size as a general-purpose runtime adaptive engine.
 
I'm curious, what does watcom (I've not used that one in like a decade) output for:

if ((next+=1)==ringBuffer_size) next=0;

I think that's the one approach we've not seen yet in this thread. Probably no different from the second example in the OP's post, but it's worth a shot... I do remember Watcom having very few conditional optimizations -- it often performed lots of memory operations instead of taking the time to say "hey, I'm working on this memory location ten times in a row, MAYBE I should dump that in a register" like other compilers do.

I remember Watcom not exactly being the best at optimizations though -- for the next bit I'm assuming EAs are calculated as displacement+BASE (9 clocks) and are seg:imm, and that rb_size is a constant/define compiled as an immediate.

Code:
//                    type     clocks    bytes
inc  next          // mem      24        4
cmp  next,rb_size  // mem,imm  19        6
jne  skip          //          16 or 4   2
mov  next,0        // mem,imm  19        4 
skip:

That's 69 clocks most of the time, 66 when the buffer pointer loops.. and takes 16 bytes. Very typical of the output of a lot of high level compilers. (6809 compilers go NUTS with the memory references like that!)

Which is RIPE for optimization... like say...

Code:
//                    type     clocks    bytes
mov  ax,next       // reg,mem  17        4
inc  ax            // reg      3         1
cmp  ax,rb_size    // reg,imm  4         3
jl   skip          //          16 or 4   2
xor  ax,ax         // reg,reg  3         1
skip:
mov  next,ax       // mem,reg  17        4

Which is 57 clocks most passes, 48 clocks when it loops... and takes 15 bytes. Saves one byte, 12 clock cycles on most passes and 18 clocks on the loop. (I used the jl so if something else accidentally updates it via overflow or other means it's safeguarded)

Just occurred to me that this is a perfect example of why I tend to make my buffers powers of two... why? Let's say you make a 128 byte ring buffer.

next = (next+1) & $7F;

no jumps, no divide needed, there's not a lot of room for the compiler to screw that one up.

Code:
//                    type     clocks    bytes
inc  next          // mem      24        4
and  next,$007F    // mem,imm  26        6

vs.

Code:
//                    type     clocks    bytes
mov  ax,next       // reg,mem  17        4
inc  ax            // reg      3         1
and  ax,$007F      // reg,imm  4         4
mov  next,ax       // mem,reg  17        4

Which comes down to one of the classic machine language questions -- speed or size? First one there is 10 bytes but takes 50 clocks, second one is 13 bytes but only takes 41 clocks.

Wow, I've not thought along these lines in a LONG time.

-- edit -- What the? This forum skin doesn't use a monospace font for CODE blocks?!?
 
I have to ask: why would you want to do that instead of creating a single executable with multiple paths depending on which CPU is running the software? The multi-path executable will be fatter, but I doubt it would add as much complexity or size as a general-purpose runtime adaptive engine.

Suppose (and this was my case) you have a 300KB executable designed to boot on any system on any common media. That means that you have to boot from a 360K floppy as well as a 2.88MB one. Asking the user to carry multiple versions is ridiculous.

And this goes to another bit about optimization. Usually, there are very few routines that need this treatment because they represent execution "hot spots"--i.e., you get a lot of improvement for a very little effort and added space. Attached is an example:

Code:
	page	,132
	title	CRC32 - 32-Bit CRC generator

	.8086

	ifndef	MODEL
	.model	small,c
	else
%	.model	MODEL,c
	endif

;*	These routines generate a 32-bit CRC coefficient table and calculate
;	CRC sums based on it.  The CRC is based on the standard IEEE polynomial.
;

COEFF32_VAL_HI	 equ	 0EDB8h		 ; high-order part of coefficient term
COEFF32_VAL_LO	 equ	 08320h		 ; low order part of coefficient term
CRC_BUFFER_SIZE	 equ	 1024		 ; Size of CRC buffer needed for tables

	.code

_fmalloc	proto	howmuch:word	; malloc interface

CRC32_Table_Ptr dd	0		; pointer to generator table

Is_386		db	0		; nonzero if 386 or better hit
CRC_Initialized db	0		; nonzero if CRC initialized
 
;	Short-form tables used by the InitCRC routine.

CRC32_ShortForm label	dword
	dd	0edb88320h, 076dc4190h, 03b6e20c8h, 01db71064h
	dd	00edb8832h, 0076dc419h, 0ee0e612ch, 077073096h


;*   InitCRC - Generate the CRC lookup tables.
;    -----------------------------------------
;
;   We return -1 if no memory is available for the generator, 0 otherwise.
;   This routine calls _fmalloc.
;

InitCRC		proc	far public
	local	scratch:qword
	push	di
	cmp	cs:CRC_Initialized,0
	jne	InitCRC24		; if already initialized, return status
	invoke	_fmalloc,CRC_BUFFER_SIZE
	mov	es,dx
	mov	di,ax			; save location
	or	ax,dx
	mov	ax,-1
	jz	InitCRC24		; if no memory
	cld
	mov	word ptr cs:CRC32_Table_Ptr,di
	mov	word ptr cs:CRC32_Table_Ptr+2,es     ; stash the table pointer
	les	di,cs:CRC32_Table_Ptr
	xor	cx,cx			    ;  counter
InitCRC6:
	xor	ax,ax
	xor	dx,dx			    ;  clear accumulator
	lea	bx,word ptr CRC32_ShortForm
	mov	cl,ch
InitCRC8:
	test	cl,cl
	jns	InitCRC10		;  if no bit
	xor	ax,cs:[bx]		;  low order
	xor	dx,cs:[bx+2]		;  high order
InitCRC10:
	add	bx,4
	shl	cl,1
	jnz	InitCRC8		;  compute a term
	stosw				;  stow low order
	mov	ax,dx
	stosw				;  stow high order
	inc	ch
	jnz	InitCRC6		;  loop...

;	Determine if we have a 386 or better...

	mov	Is_386,0
	push	sp
	pop	ax
	cmp	ax,sp			; see if the same
	jne	InitCRC22		; if not a 286/386
	.486
	sgdt	scratch			; stash the GDT
	.8086
	mov	al,byte ptr scratch+5
	test	al,al
	jnz	InitCRC22
	mov	Is_386,1
InitCRC22:
	mov	cs:CRC_Initialized,1	; note the tables are built
	xor	ax,ax
InitCRC24:
	pop	di
	ret
InitCRC		endp


;   MakeCRC32 - Make a 32-bit CRC sum.
;   ----------------------------------
;
;   unsigned long MakeCRC32( void _far *buf, unsigned int len)
;

MakeCRC32	proc	far public init:dword, buf:far ptr, len:word
	cmp	cs:CRC_Initialized,0
	jnz	MakeCRC1
	invoke	InitCRC
	test	ax,ax
	jns	MakeCRC1		; if okay
	ret				; just exit if failure

MakeCRC1:
	cld
	cmp	Is_386,0
	jnz	Make386			; if a 386 or better

	push	ds
	push	si
	push	di
	mov	ax,word ptr init
	mov	dx,word ptr init+2 ;  get old accumulator
	les	si,buf			;  buffer
	lds	di,cs:CRC32_Table_Ptr	;  table
	mov	cx,len			;  length
	jcxz	MakeCRC4		;  if just getting
MakeCRC2:
	mov	bl,es:[si]
	inc	si

	xor	bl,al
	sub	bh,bh
	add	bx,bx
	add	bx,bx			 ;  * 4

	mov	al,ah
	mov	ah,dl
	mov	dl,dh
	xor	dh,dh			;  right shift 8

	xor	ax,ds:[di+bx]
	xor	dx,ds:[di+bx+2]
	loop	MakeCRC2		;  do it

MakeCRC4:
MakeCRC8:
	pop	di
	pop	si
	pop	ds			;  restore registers
	ret

;	32-bit code--lots faster.

	.486

Make386:
	push	ds
	push	si
	push	di
	mov	edx,init
	lds	si,buf			;  buffer
	les	di,cs:CRC32_Table_Ptr	;  table
	mov	cx,len			;  length
	shr	cx,2			;  do doublewords
	jcxz	Make3864		;  if just getting
	lodsd
Make3862:
	movzx	ebx,al
	xor	bl,dl
	shr	edx,8
	xor	edx,es:[edi+4*ebx]
	movzx	ebx,ah
	shr	eax,16
	xor	bl,dl
	shr	edx,8
	xor	edx,es:[edi+4*ebx]
	movzx	ebx,al
	xor	bl,dl
	shr	edx,8
	xor	edx,es:[edi+4*ebx]
	movzx	ebx,ah
	xor	bl,dl
	shr	edx,8
	xor	edx,es:[edi+4*ebx]
	lodsd				; overlap the fetch
	loop	Make3862		; loop for another quadword
	sub	si,size dword		; back up 4 bytes!

;	Handle individual bytes here.

Make3864:
	mov	cx,len			;  length
	and	cx,3			; just the last three bytes
	jcxz	Make3868		;  if just getting
Make3866:
	movzx	bx,byte ptr ds:[si]
	inc	si
	xor	bl,dl
	shr	edx,8
	shl	bx,2
	xor	edx,es:[di+bx]
	loop	Make3866		;  do it

Make3868:
	mov	eax,edx
	shr	edx,16			; so we have ax:dx to return
	jmp	MakeCRC8

	.8086

MakeCRC32      endp


	end
 
@deathshadow

AFAIK, xor ax, ax is a 2 byte instruction. And your cycle count seems to be a bit off. I get 59/66 in your first example (typo I guess?)

Also, I don't understand the use of JL. If next goes negative we are hosed anyway so there's no point in using a signed Jcc here - it just adds to confusion. Besides, trying to cover up for possible bugs elsewhere is not what I would call good programming practice. </nitpicking> :D

This was my take on the original problem. DL is next and DH is 0 on entry.
Code:
		CMP DL, PACKET_RB_SIZE-1
		XCHG DH, DL
		JE @F
		XCHG DH, DL
		INC DX
@@:		MOV [_Buffer_next], DL

Needless to say Chuck(G)'s variant blows mine away.
 
AFAIK, xor ax, ax is a 2 byte instruction. And your cycle count seems to be a bit off. I get 59/66 in your first example (typo I guess?)
What I get for pulling numbers out of my backside instead of looking it up :D -- good catch.

Also, I don't understand the use of JL. If next goes negative we are hosed anyway so there's no point in using a signed Jcc here - it just adds to confusion. Besides, trying to cover up for possible bugs elsewhere is not what I would call good programming practice. </nitpicking> :D
Normally I'd agree, but with things like buffers I'm overly paranoid about range-checking. Call it a lesson learned by observing things like twenty-year-old bugs in the BSD codebase, flaws in the JPEG decoder, and how even something simple like null-terminated strings can turn around and bite you in the backside. (there's a reason I'm a fan of byte-length-first strings - xor ah,ah; lodsb; mov cx,ax; rep movsb;)

It also costs no more bytes and no more execution time, so I put the range-checking in there. It's also why I prefer power-of-two-sized buffers with an AND. You never even have to think about it if you use AND.

Or were you just referring to my using JL instead of JB? I rarely make the distinction in my head when dealing with words -- if it was bytes, that extra bit usually matters. Words, generally not as much of a concern.

This was my take on the original problem. DL is next and DH is 0 on entry.
An interesting approach, assuming the buffer is byte-sized... you know how you said using JL made it 'more complex'? (compared to what, JNE?) -- to me, XCHG does that; it feels like two extra instructions that shouldn't even be needed...

Though being we're talking about a buffer pointer, maybe we should be talking BL and/or BX that way it would work with XLAT? (that hinges on what one is doing with the buffer I guess)

Needless to say Chuck(G)'s variant blows mine away.
... and he made some REALLY good points in that last post. Such as:

(in response to krebizfan's notion of including branching to multiple CPU combinations)
Suppose (and this was my case) you have a 300KB executable designed to boot on any system on any common media. That means that you have to boot from a 360K floppy as well as a 2.88MB one. Asking the user to carry multiple versions is ridiculous.

Much less the overhead added to the program during startup to figure out just which optimized version to run! What are you going to do, include versions of every routine for EVERY processor iteration? That's not just 'fatter'; it ends up ridiculous if you're talking about targeting 8088, 8086, 80186, 80188, 80286, 80386, 80486, Pentium... shall I go on?

And this goes to another bit about optimization. Usually, there are very few routines that need this treatment because they represent execution "hot spots"--i.e., you get a lot of improvement for a very little effort and added space.
The classic line -- optimize inside the loop. Something that needs to execute once every sixty seconds probably isn't worth obsessing over every clock, so you optimize those for size -- something that executes multiple loops 12,800 times a second, like a 115,200 baud bit-banging buffer, is where you put the effort in...

... and if nothing else, the branching logic is more overhead -- even if you make it a dynamic call. More CALLS == BAD. CALL BAD, JMP BAD... you get a lot more bang for your buck minimizing the use of those than you would ever get by optimizing versions for each processor.

Again, why I'd use an AND and a power-of-two-sized buffer -- though sometimes you don't have control over your buffer size.

Funny to talk about buffers -- I was just dealing with trying to speed up BIOS keyboard routines (or more specifically turbo pascals calling them) and ended up writing my own to replace both keypressed and readkey. I'm also writing my own INT09 handler since I'd like to maintain a live keymap.

The BIOS keyboard buffer is interesting to play with as it has five values you need to deal with in the BIOS data area (segment $0040)...

The head and tail:
$0040:001A buffer head (word)
$0040:001C buffer tail (word)

are offsets from $0040 into the buffer, meaning they range USUALLY from 32..80. Problem is you cannot rely on that being fixed as there are two more values:

$0040:0080 buffer start (word)
$0040:0082 buffer end (word)

... and on some systems those can be many times larger. Some BIOSes used to let you change the size of the buffer, and there were TSRs that did it too.

So to test for a keypress, you just check if head=tail.


Code:
function keypressed:boolean; assembler;
asm
	xor  ax,ax
	mov  es,ax
	mov  bx,es:[$041A]  {keybuffer_head}
	cmp  bx,es:[$041C]  {keybuffer_tail}
	je   @retval
	not  ax
	@retval:
	{ tp7 ASM exit values are in AL, which we set to zero at start! }
end;

Which works pretty well... though I may have made a 'pointless' optimization in there -- re-using AX for ES and using the segment overlap to my advantage. I can't do that in the readkey function, since keyBuffer_head is based off $0040, and doing an ADD before accessing it would be a lot slower than just letting segments and offsets do their job.

Replicating TP's readkey function means looping until the head doesn't equal the tail, reading the value from the head, testing for extended keycodes, and (for those) returning zero after stripping the extension flag off.

Code:
function readkey:char; assembler;
asm
	mov  dx,$0040      { BIOS Data RAM Segment }
	mov  es,dx
@readKeyWait:
	mov  di,es:[$001A] { keybuffer_head }
	cmp  di,es:[$001C] { keybuffer_tail }
	je   @readKeyWait
	mov  al,es:[di]
	test al,$80
	jz   @notExtendedKey
	and  al,$7F
	mov  es:[di],al
	xor  al,al { return zero, aka extended key flag }
	ret
@notExtendedKey:
	cli
	add  di,2
	cmp  di,es:[$0082] { keyBuffer_end }
	jne  @bufferUpdate
	mov  di,es:[$0080] { keyBuffer_start }
@bufferUpdate:
	mov  es:[$001A],di { keyBuffer_head }
	sti
end;

Function works just like TP's readkey and updates the buffer pointers just like calling INT $16 function 0 does. Took me a bit to realize "hey dumbass you need to turn interrupts off" when dealing with changing the pointers, since if ISR_09 fires while we're playing with the buffer pointers... I should PROBABLY be loading the full word and sending the actual scancode for the extended key, but TP doesn't seem to do that properly either so I'll keep it as is.

It's kind of a unique buffer system BIOS has set up. Of course I'm also writing this in parallel for BP7's protected mode, where I'm going to install my own workalike ISR09 that just happens to use the BIOS data area in an identical fashion to the real BIOS. That will be cute to deal with though since there's no segment overlap to 'help' with the math.

It is also a good way to illustrate 'pointless optimizations' -- I could probably shave a couple clocks out of there by using AX instead of DI and doing an XCHG where I actually read from the buffer -- the accumulator is always faster at memory operations... but with something called so infrequently, is it really worth spending the extra couple of bytes to squeeze extra clocks out?

Though I'm also wondering if I should consider setting DS as my segment so I can use XLAT with BX as that would be a flat 11 clocks when I read the value into AL -- but the screwing around with setting up for that might be more effort than it's worth. (which has most always been XLAT's problem).
 
If you use a first,last,in,out scheme like int 16h does, you don't need to fool with disabling interrupts. First and Last don't ever change; and the keyboard ISR manipulates only the "in" and a routine reading manipulates only the "out".

This is, of course, subject to a couple of restrictions. First, that the routine taking data out of the buffer must read the data before updating the "out" pointer. Second, the ISR putting data into the buffer must store the data before updating the "in" pointer. Otherwise, you have some race conditions.

The quote about "hot spots" is mine; in a former life on big iron, I was given an unrestricted hand in improving performance of a then-supercomputer system and product set. The result was judged by management to be rousing success and I learned a great many things in the process, not the least of which was to take care not to tread on toes too heavily.

One of these days, I should write a tutorial on how to improve performance, as it seems to me that an inordinate amount of effort is spent in the wrong places.
 
Or were you just referring to my using JL instead of JB?
Yes.
I rarely make the distinction in my head when dealing with words -- if it was bytes, that extra bit usually matters. Words, generally not as much of a concern.
Maybe you should make that distinction. Using signed Jcc's for unsigned vars and vice versa is probably one of the most common reasons for bugs in ASM code (if not THE most common).
An interesting approach assuming the buffer is byte-sized... you know how you said using JL made it 'more complex'? (compared to what, JNE?) -- to me XCHG does that. To me that feels like two extra instructions that shouldn't even be needed...
I didn't say it made it more complex, I said it adds to confusion. When I read code that works with unsigned vars and suddenly stumble upon a signed Jcc I start to wonder what I've missed, you know? And reading other people's code can be hard enough as it is anyway.

About the XCHG instructions, one jump taken costs more cycles than the two XCHG instructions and the jump not taken together so they are simply there to avoid doing the jump on (almost) every pass. Should be pretty obvious. ;)
 