
Some new code, joystick and BCD

I'd be interested in seeing it, I just can't waste the time to read each axis individually. Takes too long.

You can modify it to read all four bits if you'd rather. The timer-based code is this:

Code:
@timer_method:
  mov    bl,which_bit   {mask for desired bit}
                        {Channel 0, Latch Counter, Rate Generator, Binary}
  mov    bh,iMC_Chan0+iMC_LatchCounter+iMC_OpMode2+iMC_BinaryMode
  mov    cx,j_maxtimer  {maximum compare value for inner loop below}
  mov    al,bh          {Begin building timer count}
  mov    di,$FFFF       {value to init the one-shots with}
  pushf                 {Save interrupt state}
  cli                   {Disable interrupts so our operation is atomic}
  out    43h,al         {Tell timer about it}
  in     al,40h         {Get LSB of timer counter}
  xchg   al,ah          {Save it in ah (xchg accum,reg is 3 clocks, 1 byte)}
  in     al,40h         {Get MSB of timer counter}
  popf                  {Restore interrupt state}
  xchg   al,ah          {Put things in the right order; AX:=starting timer}
  xchg   di,ax          {load AX with 1's, while storing AX into DI for further comparison}
  out    dx,al          {write all 1's to start the one-shots}
@read:
  mov    al,bh          {Use same Mode/Command as before (latch counter, etc.)}
  pushf                 {Save interrupt state}
  cli                   {Disable interrupts so our operation is atomic}
  out    43h,AL         {Tell timer about it}
  in     al,40h         {Get LSB of timer counter}
  xchg   al,ah          {Save it in ah for a second}
  in     al,40h         {Get MSB of timer counter}
  popf                  {Restore interrupt state}
  xchg   al,ah          {AX:=new timer value}
  mov    si,di          {copy original value to scratch}
  sub    si,ax          {subtract new value from old value}
  cmp    si,cx          {compare si to maximum time allowed}
  ja     @nostick       {if above, then we've waited too long -- blow doors}
  in     al,dx          {if we're still under the limit, read all eight bits}
  test   al,bl          {check axis bit we care about}
  jnz    @read          {loop while the bit tested isn't zero yet}
  jmp    @joy_exit      {si holds number of timer ticks gone by}

Full pascal unit is here: ftp://ftp.oldskool.org/pub/misc/code/JOYSTICK.PAS with other code in that directory to provide other things such as interrupt control, simulation of a vertical retrace interrupt, and hooking the actual PCjr hardware VINT.
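For anyone following along, the heart of that loop is just a wraparound-safe subtraction of two latched counter reads. Here's the same arithmetic sketched in Python (illustrative only -- the counter in the real code is the 8253, which counts down and wraps modulo 65536):

```python
def elapsed_ticks(start, now):
    """Ticks gone by between two reads of a down-counting 16-bit
    timer; masking to 16 bits absorbs the borrow, so this stays
    correct across a single wraparound (the SUB SI,AX step)."""
    return (start - now) & 0xFFFF

# counter counts DOWN, so "now" is normally below "start"
assert elapsed_ticks(0x8000, 0x7FF0) == 0x10
# counter passed 0 and reloaded near 0xFFFF
assert elapsed_ticks(0x0005, 0xFFF0) == 0x15
```

The `cmp si,cx / ja @nostick` test is then just comparing that elapsed count against `j_maxtimer`.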

Never quite grasped how to do a divide by 10 with shifts...

Enough handling is required that it would probably be faster to just multiply by the reciprocal instead (MUL is faster than DIV on x86) and it might be possible to optimize the MUL itself via shift-and-add. More info: http://www.hackersdelight.org/divcMore.pdf and also https://en.wikipedia.org/wiki/Division_algorithm#Division_by_a_constant
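To make the reciprocal trick concrete, here's a Python sketch of the usual 16-bit magic-number form (the constant $CCCD and shift of 19 are the standard choice from the links above, and it's small enough to check exhaustively):

```python
def div10_16bit(x):
    """floor(x / 10) for any 16-bit x using one multiply and one
    shift instead of DIV: 0xCCCD / 2**19 is just over 1/10, and
    the rounding error never reaches the next integer for x < 65536."""
    return (x * 0xCCCD) >> 19

# exhaustive check over the whole 16-bit range
assert all(div10_16bit(x) == x // 10 for x in range(65536))
```

On an 8086 that's one MUL plus shifts, which is why it can beat the ~80-clock DIV.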

The maximum number of times you'll DIV a dword is 10; dubious that this is worth spending time on.
 
Sorry for butting in, but are you after the remainder or the quotient? There are shortcuts to getting the remainder and a few cute moves to get the quotient.

Some things just aren't obvious--take Morton "magic numbers" for de-interleaving bits.
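Since Morton codes came up: the "magic numbers" in question are just alternating bit masks applied in halving strides. A Python sketch of the standard interleave/de-interleave pair (masks per the usual bit-twiddling references):

```python
def interleave16(x):
    """Spread the 16 bits of x into the even bit positions of a 32-bit word."""
    x &= 0xFFFF
    x = (x | (x << 8)) & 0x00FF00FF
    x = (x | (x << 4)) & 0x0F0F0F0F
    x = (x | (x << 2)) & 0x33333333
    x = (x | (x << 1)) & 0x55555555
    return x

def deinterleave16(x):
    """Inverse: gather the even bits of a 32-bit word back into 16 bits."""
    x &= 0x55555555
    x = (x | (x >> 1)) & 0x33333333
    x = (x | (x >> 2)) & 0x0F0F0F0F
    x = (x | (x >> 4)) & 0x00FF00FF
    x = (x | (x >> 8)) & 0x0000FFFF
    return x

assert interleave16(0xFFFF) == 0x55555555
assert all(deinterleave16(interleave16(v)) == v for v in (0, 1, 0xABCD, 0xFFFF))
```

Not obvious at all until you've seen the masks written out, which I suppose was the point.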
 
Sorry for butting in, but are you after the remainder or the quotient?
Yes...

as in both.

To make it simple to explain what's being done, here's Pascal code for what I need done if I'm going to store the score as a longint:

Code:
procedure longToBCD(l:longint; var bcd:array[0..7] of byte);
var
  a:longint;
  t:word;
begin
	a := l;
	for t := 0 to 7 do begin
		bcd[t] := a mod 10;
		a := a div 10;
	end;
end;

Really that's the key -- I need the result as a BCD array (or even just an ascii string as "and al, $0F" isn't any more rocket science than the "or al, $30" or "add al, $30" you'd use to turn it into ascii) with leading zeros for the graphics routine to show it... which is why just working in BCD in the first place seems the most efficient way, no matter how 'fast' a DWORD add might be, the conversion takes so ridiculously long by comparison.

.. and again, it's what you do on 8 bit targets for longer scores for a reason. No matter what I try, the best dword to 8 byte BCD conversion I can find is ten times slower than just doing a unpacked BCD add. It's just too damned complicated to try switching back and forth... unless someone knows a really good trick for doing it I can't find.

If someone has something actually faster as working code, I'm all ears... same for if you know a better way to do a BCD add of 8 bytes to 8 bytes.

-- edit -- the big godsend being that AAA is 8 clocks, when every other BCD function seems to be 60 to 80 clocks... Probably because internally it's not a divide, it's a sub, adc and another sub.... though AAM might be an option for longint conversion.
 
So you want to implement a fast version of the Chinese Remainder theorem?

Have you looked into the "Magic 3" conversion?

I stumbled across it not long ago, and it looked promising.

If you're looking for a fast divide-by-10, this might be of interest.
 
This is the best I've been able to come up with for doing longint to string:
Code:
function longToSt8(n:longint):st8; assembler;
asm
    les  di, @result
    mov  ax, $0008
    mov  es:[di],al
    add  di, ax
    mov  dx, word ptr n+2
    mov  ax, word ptr n
    mov  bx, 10000
    div  bx { ax=high 0..9999, dx=low 0..9999 }
    mov  si, ax
    mov  ax, dx
    mov  cx, 4
    mov  bx, 10
    std
    
@loop1:
    xor  dx, dx
    div  bx
    xchg al, dl
    or   al, $30
    stosb
    mov  al, dl
    loop @loop1
    
    mov  ax, si
    mov  cx, 4
    
@loop2:
    xor  dx, dx
    div  bx 
    xchg al, dl
    or   al, $30
    stosb
    mov  al, dl
    loop @loop2

    cld
end;
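The flow of longToSt8, restated as a Python sketch for anyone who doesn't read TP inline asm (one dword divide by 10000, then four divide-by-10 steps per 16-bit half, digits emitted right to left the way STD+STOSB does it):

```python
def long_to_st8(n):
    """Sketch of the two-stage conversion: divmod(n, 10000) is the
    DIV BX step, then each half yields four digits via divide-by-10,
    filling the 8-char string from the right."""
    high, low = divmod(n, 10000)          # DIV BX: ax=high, dx=low
    out = [""] * 8
    pos = 7
    for half in (low, high):              # low half fills positions 7..4
        for _ in range(4):
            half, digit = divmod(half, 10)
            out[pos] = chr(0x30 | digit)  # the OR AL,$30 step
            pos -= 1
    return "".join(out)

assert long_to_st8(12345678) == "12345678"
assert long_to_st8(42) == "00000042"
```

Same digit count every time, which is what gives the leading zeros for free.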

For comparison, here is the actual TP int2str source code:

Code:
; Convert integer to string
; In    DX:AX = Value
;       ES:DI = String end pointer
; Out   CX    = String length
;       ES:DI = String pointer


Int2Str:


        MOV     CX,DI
        MOV     SI,10
        MOV     BX,DX
        OR      BX,BX
        JNS     @@1
        NEG     BX
        NEG     AX
        SBB     BX,0
        CALL    @@1
        DEC     DI
        MOV     ES:[DI].b0,'-'
        INC     CX
        RET
@@1:    XOR     DX,DX
        XCHG    AX,BX
        DIV     SI
        XCHG    AX,BX
        DIV     SI
        ADD     DL,'0'
        CMP     DL,'0'+10
        JB      @@2
        ADD     DL,'A'-'0'-10
@@2:    DEC     DI
        MOV     ES:[DI],DL
        MOV     DX,AX
        OR      DX,BX
        JNE     @@1
        SUB     CX,DI
        RET
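The clever bit in Int2Str is the XCHG/DIV/XCHG/DIV pair, which divides the 32-bit value in BX:AX by 10 using two 16-bit divides. In Python terms (a sketch of the arithmetic only, not the register shuffling):

```python
def divmod32_by_16(high, low, divisor):
    """The Int2Str trick: divide the 32-bit value high:low by a
    16-bit divisor with two chained divides.  The first remainder
    becomes the upper half of the second dividend, so the second
    quotient always fits in 16 bits and DIV cannot fault."""
    q_high, rem = divmod(high, divisor)
    q_low, rem = divmod((rem << 16) | low, divisor)
    return q_high, q_low, rem   # new high:low quotient, plus the digit

v = 123456789
q_hi, q_lo, digit = divmod32_by_16(v >> 16, v & 0xFFFF, 10)
assert ((q_hi << 16) | q_lo) == v // 10
assert digit == 9
```

The loop then repeats until the whole quotient is zero, which is why it skips leading zeros.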
 
FINALLY got it working in the inline assembler

Great to see you finally got something working! ;) Did you also try my version? It should be slightly faster, although probably not fast enough to make up for that 10% difference. Also, I forgot to mention that the table and functions should be word aligned (does the inline assembler in TP support ALIGN 2 or EVEN?). It might not matter on an 8-bit bus machine, though I seem to recall reading somewhere that it does. In any case, it does matter on machines with a 16-bit bus.

Your shr/jc routine can be improved by replacing this;
Code:
@test_2_6:
	shr  al, 1
	jc   @char_6

with this;
Code:
@test_2_6:
	jnz  @char_6

and this;
Code:
@test_3_7:
	shr  al, 1
	jc   @char_7

to this;
Code:
@test_3_7:
	jnz  @char_7

Alternatively, you can use the parity flag to reduce the shifting, but TP will likely insert extra jumps due to the jump distances. Example;
Code:
@loop:
 
	lodsb
 
	shr  al, 1
	jc   @test_1_3_5_7_9
 
@test_0_2_4_6_8:
	jz   @char_0
	jpe  @char_6
 
@test_2_4_8:
	shr  al, 1
	jc   @char_2
 
@test_4_8:
	shr  al, 1
	jc   @char_4
 
@char_8:
	; snip 
	jmp  @done
 
@char_4:
	; snip
	jmp  @done
 
@char_2:
	; snip
	jmp  @done
 
@char_6:
	; snip
	jmp  @done
 
@test_1_3_5_7_9:
	jz   @char_1
	jpe  @char_7
	shr  al, 1
	jc   @char_3
 
@test_5_9:
	shr  al, 1
	jc   @char_5
 
@char_9:
	; snip
	jmp  @done
 
@char_5:
	; snip
	jmp  @done
 
@char_3:
	; snip
	jmp  @done
 
@char_7:
	; snip
	jmp  @done
 
@char_1:
	; snip
	jmp  @done
 
@char_0:
	; snip
 
@done:

Rearranging the order of the functions may help with this to some degree.
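Just to sanity-check the flag logic in that dispatch, here's the decision tree modeled in Python, with CF/ZF/PF set the way SHR AL,1 sets them (PF being parity of the result byte). Every digit 0-9 lands on its own label:

```python
def parity_even(b):
    """PF as the 8086 sets it: 1 when the low byte has an even bit count."""
    return bin(b & 0xFF).count("1") % 2 == 0

def classify(d):
    """Which @char_N label a digit 0-9 reaches in the dispatch above.
    ZF/PF are only sampled right after the first SHR, as in the asm."""
    al, cf = d >> 1, d & 1
    zf, pf = al == 0, parity_even(al)
    if cf:                        # @test_1_3_5_7_9
        if zf: return 1
        if pf: return 7           # al == 3 -> two bits set, even parity
        cf, al = al & 1, al >> 1  # shr al,1
        if cf: return 3
        cf, al = al & 1, al >> 1  # @test_5_9
        return 5 if cf else 9
    else:                         # @test_0_2_4_6_8
        if zf: return 0
        if pf: return 6
        cf, al = al & 1, al >> 1  # @test_2_4_8
        if cf: return 2
        cf, al = al & 1, al >> 1  # @test_4_8
        return 4 if cf else 8

assert [classify(d) for d in range(10)] == list(range(10))
```

So the ordering does hold up: at most three shifts for any digit, versus up to eight in the pure shr/jc chain.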
 
Did you also try my version?
I cannot get an equivalent to call cs:[bx] to even exist... if I put a segcs before it, it's the segment being called that gets overridden, not the segment the table is looked up in.

Even if I got it working, it's unlikely to be faster. The only real difference between our two versions is that in mine the xchg is before the call (so it's possibly gaining some BIU speedup there instead of being uncached), and that I do a mov reg,[reg] followed by a call reg versus your call [reg] with a segment override.

A mem16 call (yours) is 21 + EA; since it's base addressing with a segment override, that should be 7 clocks of EA, so 28 clocks total... plus the instruction itself being 4 bytes.

A mov reg, seg:[reg] is 3 bytes and 8+EA, so that's 15 clocks; a call reg is 20 clocks flat and 2 bytes.

Yours is one less byte and seven fewer clocks, but by the time you figure in the BIU's caching of the XCHG, it's a wash. That xchg after the jump in yours is adding ~12 clocks and only paying out 4 to the BIU. The XCHG and its 4-clock payout after the MOV reg, seg:[reg] ends up being effectively 'free', and still provides enough that the call is prefetched as well.

Did you also try my version?
It should be slightly faster although probably not fast enough to make up for that 10% difference. Also, I forgot to mention that the table and functions should be word aligned (does the inline assembler in TP support ALIGN 2 or EVEN?).
{$A+} { word aligned data here } {$A-}

I did that too.

Your shr/jc routine can be improved by replacing this;
Yer right, the final shift of each 'split' could be eliminated.

Alternatively, you can use the parity flag
Good point, I always forget that one even exists. That could eliminate two sets of shifts per depth, and with a depth of eight that really reduces the number of checks.

but TP will likely insert extra jumps due to the jump distances.
Actually it bombs out with the really vague error "133: Cannot evaluate this expression".

That can be avoided, though, just by keeping the order of the jumps proper so that each 'block' is < 256 bytes in length. Using mov [si], bl; mov [si+2], al; mov [si+160], bh and so forth instead of the mov al, …; stosb; inc di stuff really helps with that (as you then pretty much know each 'char' routine is only 42 bytes long).

Though I'm not 100% sure your logic order is right, I'm gonna go through that and implement it to see how it works out. Less shifting is a good thing.

Again, thanks for more useful advice.
 
For comparison, here is the actual TP int2str source code:
That's really interesting, I'd have expected that to be faster than what I'm doing... though they aren't padding with leading zeros like I am. It's actually a hair faster at values below 3 digits, but it's way slower at anything bigger.

I've actually made an 'improvement' to mine by further leveraging 'divide and conquer'.

Code:
function longToBCD8ASCII(value:longint):st8; assembler;
asm
	les  di, @result
	mov  al, 8
	stosb        { assumes DF is already clear }
	add  di, 6
	mov  ax, word ptr value
	mov  dx, word ptr value + 2
	mov  cx, $3030
	
	mov  bx, 10000
	div  bx  { ax = high 0..9999, dx = low 0..9999 }
	
	std
	xchg ax, dx  { do the low group first; STD fills right to left }
	
	mov  bl, 100
	div  bl  { al = high pair 0..99, ah = low pair 0..99 }
	mov  bh, al
	mov  al, ah
	aam      { ah = tens, al = units }
	xchg al, ah  { STOSW stores AL at the lower address }
	or   ax, cx
	stosw
	mov  al, bh
	aam      { ah = tens, al = units }
	xchg al, ah
	or   ax, cx
	stosw
	
	mov  ax, dx  { now the high group }
	div  bl  { al = high pair 0..99, ah = low pair 0..99 }
	mov  bh, al
	mov  al, ah
	aam      { ah = tens, al = units }
	xchg al, ah
	or   ax, cx
	stosw
	mov  al, bh
	aam      { ah = tens, al = units }
	xchg al, ah
	or   ax, cx
	stosw
	
	cld      { I always keep it clear so I can assume clear in most routines }
	
end;

Naturally, for my internal use it drops all those ORs and outputs to an array instead of a string. It reduces it to exactly 3 divides regardless of number size, and leverages AAM to do the smaller divide-by-10 without resorting to changing the contents of BL. AAM is handy, even if it is 83 clocks; given that moving 10 into BL and doing a div bl would work out to ~96 clocks, it's a small but important savings.
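For anyone keeping score (sorry), the whole three-divide split restated as a Python sketch, with divmod(x, 10) standing in for AAM:

```python
def long_to_digits(value):
    """Sketch of the three-divide conversion: DIV 10000 once, DIV 100
    per half, with AAM (modeled by divmod(pair, 10)) splitting each
    0..99 pair.  Returns 8 unpacked BCD digits, most significant first."""
    high, low = divmod(value, 10000)            # DIV BX
    digits = []
    for group in (high, low):
        pair_hi, pair_lo = divmod(group, 100)   # DIV BL
        for pair in (pair_hi, pair_lo):
            digits += list(divmod(pair, 10))    # AAM: tens, units
    return digits

assert long_to_digits(12345678) == [1, 2, 3, 4, 5, 6, 7, 8]
assert long_to_digits(90210) == [0, 0, 0, 9, 0, 2, 1, 0]
```

Three true divides plus two AAMs per half, versus eight divides for the naive repeated divide-by-10.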

STILL slower than a fixed BCD add though; the whole BCD routine working out to the same as two divides.

Oh, and multiplying by the reciprocal isn't entirely viable in integer math; the adjustment afterwards would cost more than just using DIV in the first place. That's the problem with a LOT of the 'alternatives': by the time you implement them they are WAY more code and WAY slower.
 
So you want to implement a fast version of the Chinese Remainder theorem?
I'm actually looking for something FASTER; there's all that extra tracking inside it. With things like:

"Magic 3" conversion

or:

this might be of interest.

by the time you implement them in software, you are many times slower than the 'painfully slow' 8088 operation. As bad as the microcode might be compared to a hardware divide, it still seems to beat out most software alternatives for divisors that aren't powers of two.

Routines and methods like those are all well and good as a way to implement it in silicon, or if your processor has no divide at all, but as a software alternative to a hardware microcode routine they are not an improvement, taking two to four times as long to actually execute per divide.

It ALMOST makes one think maybe Intel knew what they were doing.
 
What's the real object of the display--that is, what will the code be used for?
Displaying the score as it's updated during gameplay in games I write with the 160x100 semigraphics.

Right now Paku Paku 1.7 beta (already 10% faster under the hood than the public 1.6) on a 128k Jr is running flat out with none of the timeslices having any excess - I might as well not even HAVE timing code... while still ending up at HALF the framerate of a proper PC (or an expanded Junior). When the score updates it consumes 20% of its timeslice on a proper PC, and 60%+ of the timeslice on the Jr. That's why I'm updating the blitting method AND the calculation method; between the two I've made it a little over 5x faster.

That's why there's two ways of implementing it -- the way I was doing it, tracking it as a longint and doing integer math, then using long-to-string (well, BCD) conversion for the display routine; or the new way, just working in unpacked BCD across the board. An 8-byte BCD addition is faster than ANY longint-to-BCD conversion I can find... particularly when I also need leading zeros. Even if that addition is WAY slower than "add ax, value; adc dx, 0".

Second, I have another 'to be released this century' game I'm working on that, using the same engine, seems to have a minimum spec of a 7.16MHz T1K or 8MHz XT. I want to get that trimmed down to actually run full speed on a regular 4.77MHz 8088. (I'm not even going to TRY going for a 128k Jr on that.)

I'm also trying to get the memory footprint down on Paku Paku to under 48k, and for the future game to under 128k (right now it's hovering around 180k).

Finally, Paku Paku for DOS is getting a total rewrite from scratch for version 2.0 to leverage a lot of what I learned making the C64 version. I'm also planning a special Jr/T1K version, as well as an EGA/AT version. 1.7 is getting pitched in the trash completely, as is ~80%+ of the codebase. I'm even considering switching to arpeggios for the sound instead of the 'priority' method I was using (which means doubling my timer interval and handling timeslices a wee bit differently). Objects are getting the boot, and I've no intention of using the heap either; they introduce too much overhead.

Though that raises a silly question -- can Jr/Tandy graphics do more than one page at 320x200 16 color? I would think it could, since you can declare 64k of video memory; I just can't figure out how to page-flip. The handful of texts and code examples I've found (much less the Jr tech reference on the topic) are just gibberish to me. I'm not comprehending them at ALL.

Oh, one other thing... does anyone have information on making cassette loadable machine language programs for the 5150 and/or Junior? It's just a crazy idea.
 
The score's only additive, right? I'd just do the arithmetic in BCD or ASCII. It's not as if you're taking the hyperbolic arctangent of something, where you'd actually need to compute something, is it? And you get the benefit of not having to look at the entire number if it's smaller than the full number of digits that you're keeping around (i.e., you can keep track of the number of digits that are currently significant and so preset your loop count).

And if you're using a V20, you can even use the string BCD instructions.
 
The score's only additive, right? I'd just do the arithmetic in BCD or ASCII.
Which is what I'm doing, and what some folks seemed to be questioning. See the original post where I have an 8-byte BCD add routine right there. It just feels 'wrong' after doing it on a 6502, as if it's more complex than it needs to be.

Seems like books and tutorials on x86 machine language skip right past BCD as if it's a disowned bastard child. I had one really good book on it ages ago, but I can't remember what it was called or who it was by.

Though Krille's advice helped remind me to use CBW instead of xor ah, ah... resulting in each digit being:

Code:
	lodsb
	add  al, ah
	cbw
	add  al, es:[di]
	aaa
	stosb

I'm not sure what it is that's bothering me about that.
 
No, DFP (decimal floating point) is still quite desirable, particularly in financial computation. The IBM Power6 CPU has them.

When I worked on the math package for SuperCalc, it was mandatory that everything be in decimal, even the floating point. It would have been much simpler to do things in binary. But then, what do spreadsheets most commonly get used for?

When we did our BASIC compiler and runtime, we found that spending an extra 4 bits on each number to indicate the number of significant digits could speed things up considerably. Consider that if your score counter is, say, 1000 and you add 9999 to it, you have to loop for, at most, 5 digits, instead of slavishly going through all of those non-significant zeroes. In addition, if you want to blank leading zeroes, you know how many leading digits aren't zeroes.
 
JUST figured out what was bugging me!

Largest result from an add would be 19... so the largest carry from AAA would be 1, and ... AAA sets the carry flag. DOH! NOW I feel like a right proper idiot. All that screwing around with AH is pointless.

Code:
procedure addBCDUnpacked(var b1, b2:BCDUnpacked); assembler;
asm
	les  di, b1
	mov  dx, ds
	lds  si, b2
	
	{0}
	lodsb
	add  al, es:[di]
	aaa
	stosb
	
	{1}
	lodsb
	adc  al, es:[di]
	aaa
	stosb
	
	{2}
	lodsb
	adc  al, es:[di]
	aaa
	stosb
	
	{3}
	lodsb
	adc  al, es:[di]
	aaa
	stosb
	
	{4}
	lodsb
	adc  al, es:[di]
	aaa
	stosb
	
	{5}
	lodsb
	adc  al, es:[di]
	aaa
	stosb
	
	{6}
	lodsb
	adc  al, es:[di]
	aaa
	stosb
	
	{7}
	lodsb
	adc  al, es:[di]
	aaa
	stosb
	
	mov  ds, dx
end;

Smegging A, that's better.
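The routine models out like this in Python, with divmod(al, 10) standing in for the AAA-plus-carry step (digits stored least significant first, as in the asm):

```python
def add_bcd_unpacked(b1, b2):
    """Model of addBCDUnpacked: add two 8-digit unpacked BCD arrays
    (least significant digit first), with AAA supplying both the
    adjusted digit and the carry consumed by the next ADC."""
    carry = 0
    for i in range(8):
        al = b1[i] + b2[i] + carry      # ADD / ADC AL, ES:[DI]
        carry, b1[i] = divmod(al, 10)   # AAA: digit in AL, carry in CF
    return b1

score = [9, 9, 9, 9, 9, 9, 9, 0]       # 09999999, LSD first
assert add_bcd_unpacked(score, [1, 0, 0, 0, 0, 0, 0, 0]) == [0, 0, 0, 0, 0, 0, 0, 1]
```

Since each per-digit sum is at most 9+9+1 = 19, a single decimal carry is the worst case, which is exactly why AAA's CF can feed the next ADC directly.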
 
No, DFP (decimal floating point) is still quite desirable, particularly in financial computation.
I know that -- it's not an uncommon thing.

The IBM Power6 CPU has them
So do most Javascript implementations -- when you are mostly working with numbers as strings for display and/or need arbitrary precision, it's the all-around best choice.

I just can't find any x86 books that do anything more than even mention its existence... even the ultra in-depth (and often way over my head) 80386 "A Programming and Design Handbook" doesn't do anything more than list the various AAx and DAx instructions in the opcode reference. I know how to do it on Z80 and 6502, and I've got books here that cover it on Alpha, MIPS, 68K and even ARM... but x86?!? It's like nobody wants to even touch the subject. (And I've got ~2 dozen x86 ASM books here - most of which came from the Keene State library when they sold off their "outdated" books a few years ago.)

Consider, that if your score counter is, say, 1000 and you add 9999 to it, you have to loop for at most, 5 digits, instead of slavishly going through all of those non-significant zeroes. In addition, if you want to blank leading zeroes, you know how many leading digits aren't zeroes.

I've been tempted to add some jnc's in there for when it doesn't carry, but testing BOTH values for length plus the extra jumps ends up slower, thanks to the extra code, than just processing the handful of extra digits. If I was doing more than 8-digit precision, I'd probably put them in.
 
Well, to be fair, DAA hasn't changed since the 4004. DAS is new to x86, though the Z80 could handle it with the subtraction condition code, so it's not terribly exciting.

AAM, AAD, AAA, AAS are pretty straightforward--I suspect that Intel regrets including them in the instruction set. Did PL/I-86 or COBOL-86 use DAA/DAS for COMPUTATIONAL-3 numbers? Or AAM, AAD, AAS, AAA for DISPLAY fields?
 
Well, to be fair, DAA hasn't changed since the 4004. DAS is new to x86, though the Z80 could handle it with the subtraction condition code, so it's not terribly exciting.
Which is probably why I'm able to fake it without proper instructions...

AAM, AAD, AAA, AAS are pretty straightforward--I suspect that Intel regrets including them in the instruction set.
I think you're on to something with that, given how well documented their use is.
 
I have to admit that the ASCII adjust instructions left me wondering why the guys at Intel wasted their time on them. Maybe someone thought they were "neat"? I'd rather they had used their time to implement a single-cycle barrel shifter.

The DAA made sense on the 4004--consider that it has a 4-bit accumulator and that memory references are preceded by another instruction that specifies the scratchpad register pair containing the address of the memory location to be used. DAA did set the carry bit, but would not clear it if set. A little strange, no?

Of course, if the 4004 had the DAA instruction, then the 4040 had to have it, then the 8008 had to have it, then the 8080, 8085 and 8086...

One instruction that the 4004 had, but was dropped was ISZ, which was a sort of LOOP instruction (increment and skip if zero), which is again a little misleading. One is added to the register; if the result is used, execution proceeds with the next instruction. Otherwise, the branch is taken. You can see the same instruction surfacing on the Motorola 68K--but not on any Intel CPU after the 4040.

Forgive my rambling...
 
One instruction that the 4004 had, but was dropped was ISZ, which was a sort of LOOP instruction (increment and skip if zero), which is again a little misleading. One is added to the register; if the result is used, execution proceeds with the next instruction. Otherwise, the branch is taken. You can see the same instruction surfacing on the Motorola 68K--but not on any Intel CPU after the 4040.

I'm having trouble understanding how this differs from LOOP. What do you mean by "if the result is used"?
 