AFAIK, xor ax, ax is a 2 byte instruction. And your cycle count seems to be a bit off. I get 59/66 in your first example (typo I guess?)
What I get for pulling numbers out of my backside instead of looking it up :D -- good catch.
Also, I don't understand the use of JL. If next goes negative we are hosed anyway so there's no point in using a signed Jcc here - it just adds to confusion. Besides, trying to cover up for possible bugs elsewhere is not what I would call good programming practice. </nitpicking> :D
Normally I'd agree, but with things like buffers I'm overly paranoid about range-checking. Call it a lesson learned by observing things like twenty year old bugs in the BSD codebase, flaws in the JPEG decoder, and how even something simple like null terminated strings can turn around and bite you in the backside. (there's a reason I'm a fan of byte-length first strings -
xor ah,ah; lodsb; mov cx,ax; rep movsb;)
It also ends up no more or fewer bytes and no more or less execution time, so I put the range-checking in there. It's also why I prefer power-of-two sized buffers with an "AND". You never have to even think about it if you use AND.
Or were you just referring to my using JL instead of JB? I rarely make the distinction in my head when dealing with words -- if it was bytes, that extra bit usually matters. Words, generally not as much of a concern.
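For anyone following along, here is a rough C sketch of what the JL versus JB distinction comes down to. The function names are made up for illustration, and the cast to int16_t assumes the usual two's complement behaviour. The two tests only disagree once bit 15 of the word is set:

```c
#include <stdint.h>
#include <stdbool.h>

/* JB/JAE-style test: the 16-bit value is treated as unsigned (0..65535). */
static bool below_unsigned(uint16_t next, uint16_t limit) {
    return next < limit;
}

/* JL/JGE-style test: the same 16 bits are treated as signed (-32768..32767).
   The cast assumes two's complement wraparound, as on x86. */
static bool less_signed(uint16_t next, uint16_t limit) {
    return (int16_t)next < (int16_t)limit;
}
```

With next = 10 and limit = 256 both tests agree. With next = $FFFF the unsigned test says "not below" while the signed test sees -1 and says "less", which is the whole difference being argued about.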
This was my take on the original problem. DL is next and DH is 0 on entry.
An interesting approach assuming the buffer is byte-sized... you know how you said using JL made it 'more complex'? (compared to what, JNE?) -- to me XCHG does that. To me that feels like two extra instructions that shouldn't even be needed...
Though since we're talking about a buffer pointer, maybe we should be talking BL and/or BX so that it would work with XLAT? (that hinges on what one is doing with the buffer I guess)
Needless to say Chuck(G)'s variant blows mine away.
... and he made some REALLY good points in that last post. Such as:
(in response to krebizfan's notion of including branching to multiple CPU combinations)
Suppose (and this was my case) you have a 300KB executable designed to boot on any system on any common media. That means that you have to boot from a 360K floppy as well as a 2.88MB one. Asking the user to carry multiple versions is ridiculous.
Much less the overhead added to the program during startup to figure out just what optimized version to run! What are you going to do, include versions of every routine for EVERY processor iteration? That's not just 'fatter', it ends up ridiculous if you're talking about targeting 8088, 8086, 80186, 80188, 80286, 80386, 80486, Pentium... shall I go on?
And this goes to another bit about optimization. Usually, there are very few routines that need this treatment because they represent execution "hot spots"--i.e., you get a lot of improvement for a very little effort and added space.
The classic line -- optimize inside the loop. Something that needs to execute once every sixty seconds probably isn't worth obsessing over every clock, so you byte-size optimize those -- something that executes multiple loops 12,800 times a second like a 115,200 baud bit-banging buffer that's where you put the effort in...
... and if nothing else, the branching logic is more overhead -- even if you make it a dynamic call. More CALLS == BAD. CALL BAD, JMP BAD... you get a lot more bang for your buck minimizing the use of those than you would ever get by optimizing versions for each processor.
Again, why I'd use an AND and a power of two sized buffer -- though sometimes you don't have control over your buffer size.
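A minimal sketch of the AND trick in C, next to the compare-and-reset version you are stuck with when the size isn't a power of two. BUF_SIZE and both function names are hypothetical, picked only for the example:

```c
#include <stdint.h>

#define BUF_SIZE 256u              /* hypothetical size; must be a power of two */
#define BUF_MASK (BUF_SIZE - 1u)

/* Advance a ring-buffer index. The AND forces the result into
   0..BUF_SIZE-1 whatever value comes in, so no compare or branch is needed. */
static uint16_t next_masked(uint16_t index) {
    return (uint16_t)((index + 1u) & BUF_MASK);
}

/* Compare-and-reset variant for sizes that are not a power of two. */
static uint16_t next_compared(uint16_t index, uint16_t size) {
    index++;
    if (index >= size) {
        index = 0;
    }
    return index;
}
```

Note that next_masked pulls even a garbage index like $FFFF back into range, which is the "never have to think about it" part. next_compared only stays sane if the index was sane going in.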
Funny to talk about buffers -- I was just dealing with trying to speed up the BIOS keyboard routines (or more specifically Turbo Pascal's calls to them) and ended up writing my own to replace both keypressed and readkey. I'm also writing my own INT09 handler since I'd like to maintain a live keymap.
The BIOS keyboard buffer is interesting to play with as it has five values you need to deal with in the BIOS data area (segment $0040)...
The head and tail:
$0040:001A buffer head (word)
$0040:001C buffer tail (word)
are offsets from segment $0040 into the buffer, meaning they USUALLY range from $1E to $3C (30..60), since the default buffer is the 16 words starting at $0040:001E. Problem is you cannot rely on that being fixed as there are two more values:
$0040:0080 buffer start (word)
$0040:0082 buffer end (word)
... and on some systems those can be many times larger. Used to be some BIOSes let you change the size of the buffer, and there were TSRs that did it too.
So to test for a keypress, you just compare head and tail: if head=tail the buffer is empty, otherwise a key is waiting.
Code:
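function keypressed:boolean; assembler;
asm
  xor  ax,ax              { AX = 0: doubles as segment 0 and the 'false' result }
  mov  es,ax
  mov  bx,es:[$041A]      { keybuffer_head, $0000:041A = $0040:001A }
  cmp  bx,es:[$041C]      { keybuffer_tail }
  je   @retval            { head = tail, buffer empty, return false }
  not  ax                 { AX = $FFFF, so AL is nonzero for true }
@retval:
  { tp7 ASM exit values are in AL, which we set to zero at start! }
end;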
Which works pretty good... though I may have made a 'pointless' optimization in there -- re-using AX for ES and using the segment overlap to my advantage. I can't do that in the readkey function since keyBuffer_head is based off $0040, and doing an ADD before accessing it would be a lot slower than just letting segments and offsets do their job.
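For anyone who hasn't run into the segment overlap before, this is all it is. A quick C sketch (the function name linear is just for illustration) of the real-mode address math:

```c
#include <stdint.h>

/* Real-mode address translation: linear = segment * 16 + offset. */
static uint32_t linear(uint16_t segment, uint16_t offset) {
    return ((uint32_t)segment << 4) + offset;
}
```

linear($0000, $041A) and linear($0040, $001A) both come out to $0041A, which is why the zero already sitting in AX can be reused as the segment as long as $0400 is folded into the offset.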
Replicating TP's readkey function means looping until the head doesn't equal the tail, reading the value from head, testing for extended keycodes, sending zero and stripping the extension off it, and otherwise advancing head to the next entry:
Code:
function readkey:char; assembler;
asm
  mov  dx,$0040           { BIOS Data RAM Segment }
  mov  es,dx
@readKeyWait:
  mov  di,es:[$001A]      { keybuffer_head }
  cmp  di,es:[$001C]      { keybuffer_tail }
  je   @readKeyWait       { head = tail means empty, keep polling }
  mov  al,es:[di]
  test al,$80
  jz   @notExtendedKey
  and  al,$7F             { strip the flag, leave the key in the buffer }
  mov  es:[di],al
  xor  al,al              { return zero, aka extended key flag }
  ret                     { head not advanced, next call returns the stripped code }
@notExtendedKey:
  cli                     { keep ISR 09 out while head is updated }
  add  di,2
  cmp  di,es:[$0082]      { keyBuffer_end }
  jne  @bufferUpdate
  mov  di,es:[$0080]      { keyBuffer_start }
@bufferUpdate:
  mov  es:[$001A],di      { keyBuffer_head }
  sti
end;
Function works just like TP's readkey and updates the buffer pointers just like calling INT $16 function 0 does. Took me a bit to realize "hey dumbass you need to turn interrupts off" when dealing with changing the pointers, since if ISR_09 fires while we're playing with the buffer pointers... I should PROBABLY be loading the full word and sending the actual scancode for the extended key, but TP doesn't seem to do that properly either so I'll keep it as is.
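If it helps to see the head/tail/start/end dance outside of assembly, here is a rough C model of it. This is a sketch under assumptions, not the BIOS code: bda[] is a plain array standing in for segment $0040, every function name is made up, kb_init fills in the stock $1E/$3E layout, kb_put is my guess at what the INT 09 side does, and the extended-key handling and cli/sti are left out entirely.

```c
#include <stdint.h>
#include <stdbool.h>

/* Simulated BIOS data area (stand-in for segment $0040). */
static uint8_t bda[0x100];

enum { KB_HEAD = 0x1A, KB_TAIL = 0x1C, KB_START = 0x80, KB_END = 0x82 };

static uint16_t rd16(unsigned off) {
    return (uint16_t)(bda[off] | (bda[off + 1] << 8));
}

static void wr16(unsigned off, uint16_t v) {
    bda[off] = (uint8_t)(v & 0xFF);
    bda[off + 1] = (uint8_t)(v >> 8);
}

/* Hypothetical setup: the stock 16-word buffer at $1E..$3D, empty. */
static void kb_init(void) {
    wr16(KB_START, 0x1E);
    wr16(KB_END, 0x3E);
    wr16(KB_HEAD, 0x1E);
    wr16(KB_TAIL, 0x1E);
}

/* A key is waiting whenever head and tail differ. */
static bool kb_pressed(void) {
    return rd16(KB_HEAD) != rd16(KB_TAIL);
}

/* Step a pointer one word forward, wrapping from end back to start. */
static uint16_t kb_advance(uint16_t ptr) {
    ptr = (uint16_t)(ptr + 2);
    if (ptr == rd16(KB_END)) {
        ptr = rd16(KB_START);
    }
    return ptr;
}

/* Writer side (roughly the INT 09 job): store at tail, then advance it.
   Returns false when the buffer is full and the key is dropped. */
static bool kb_put(uint16_t key) {
    uint16_t tail = rd16(KB_TAIL);
    uint16_t next = kb_advance(tail);
    if (next == rd16(KB_HEAD)) {
        return false;
    }
    wr16(tail, key);
    wr16(KB_TAIL, next);
    return true;
}

/* Reader side: fetch the word at head, then advance head.
   The caller is expected to check kb_pressed() first. */
static uint16_t kb_get(void) {
    uint16_t head = rd16(KB_HEAD);
    uint16_t key = rd16(head);
    wr16(KB_HEAD, kb_advance(head));
    return key;
}
```

One thing the model makes obvious: because head=tail means empty, a 16-word buffer only ever holds 15 keys. In the real thing the head update in kb_get is the part that wants interrupts off.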
It's kind of a unique buffer system BIOS has set up. Of course I'm also writing this in parallel for BP7's protected mode, where I'm going to install my own workalike ISR09 that just happens to use the BIOS data area in an identical fashion to the real BIOS. That will be cute to deal with though since there's no segment overlap to 'help' with the math.
It is also a good way to illustrate 'pointless optimizations' -- I could probably shave a couple clocks out of there by using AX instead of DI and doing an XCHG for where I actually read from the buffer -- the accumulator is always faster at memory operations... but with something called so infrequently, is it really worth spending the extra couple bytes on squeezing extra clocks out of it?
Though I'm also wondering if I should consider setting DS as my segment so I can use XLAT with BX as that would be a flat 11 clocks when I read the value into AL -- but the screwing around with setting up for that might be more effort than it's worth. (which has most always been XLAT's problem).