
My Floating Point adventure

Ruud

For a very long time I have been working, among other things, on three software projects:
- My own OS meant to run on 8088 PC compatibles. My primary targets: C= PC10 series. Main hurdle at this moment: directories.
- My own bootable BASIC, same targets. Shares a lot of routines with the OS.
- My own Pascal, written in Turbo Pascal, which outputs macros. My own assembler should turn the macros into binaries for the various target machines. For the moment only the PC and the Commodore 64.

The last two projects have one thing in common: they need Floating Point (FP) routines for the mathematical part. For the C64 I can use the onboard FP routines, so that problem has already been solved. But FP routines for the 8086.....
So far for the 8086 I only found the sources of the original and of a hacked version of Microsoft's GW-BASIC, and the so-called UCR library. The problem: the code of both is very hard to understand, and I'm not the only one with this opinion. What doesn't help either: I'm a NASM man and the last time I really worked with MASM/TASM was thirty years ago.

That raised the idea of converting the C64's FP routines to the 8086 and using them for my Pascal and BASIC as well. OK, it is "only" 40 bits, but that is good enough for me. Once I understand things better, I can always extend the number of bits. The main advantage: to check whether things work OK, the idea is to just run a small BASIC/ML program and see what happens with the Floating Point registers FAC1 and FAC2. And then the "only" thing I have to do is achieve the same result with my Pascal/8086 routines.

I don't want to copy everything from the C64, so I decided to write the various routines in Pascal first, in a way that more or less mimics the use of registers. I decided to start with converting a string to FP. I have to check for several things: I assume that a number can look like "-1234.5678 e -123", with or without spaces around the 'e', without one or both '-' signs, or maybe with a '+' sign instead.

After parsing the string I will end up with two booleans/bytes representing the two signs, one word for the exponent, four words for the fractional part and four words for the integer part. I know, massive overkill, but I'm already thinking about supporting the extended format, which is eighty bits. And scaling down afterwards is easier than scaling up.

Generating the sign values, the exponent word and the integer part was relatively easy compared to the fractional part. While parsing the number to create the integer part, I had to multiply the previous result by ten, and I used a combination of additions and "shift left" instructions. For creating the fraction part I thought it was just a matter of subtraction and "shift right", but unfortunately that wasn't the case. And to make matters worse, I cannot use the assembly instructions DIV and IDIV when more than sixteen bits are involved.
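Before going on to the division problem: to show what I mean with that multiply-by-ten step, here is a small sketch in C (only meant as pseudo-code for the later assembly version; the four 16-bit words are just my current choice) that uses nothing but shifts and additions:

#include <stdint.h>

#define WORDS 4                           /* four 16-bit words, as described above */

/* Shift a little-endian multi-word value one bit to the left. */
static void shl1(uint16_t v[WORDS])
{
    uint16_t carry = 0;
    for (int i = 0; i < WORDS; i++) {
        uint16_t next = v[i] >> 15;       /* bit that moves into the next word */
        v[i] = (uint16_t)((v[i] << 1) | carry);
        carry = next;
    }
}

/* dst = dst + src, multi-word addition with carry. */
static void addw(uint16_t dst[WORDS], const uint16_t src[WORDS])
{
    uint32_t carry = 0;
    for (int i = 0; i < WORDS; i++) {
        uint32_t sum = (uint32_t)dst[i] + src[i] + carry;
        dst[i] = (uint16_t)sum;
        carry  = sum >> 16;
    }
}

/* v = v * 10, using 10*x = 8*x + 2*x, so only shifts and additions. */
void mul10(uint16_t v[WORDS])
{
    uint16_t twice[WORDS];
    for (int i = 0; i < WORDS; i++)
        twice[i] = v[i];
    shl1(twice);                          /* twice = 2*x */
    shl1(v);
    shl1(v);
    shl1(v);                              /* v = 8*x */
    addw(v, twice);                       /* v = 8*x + 2*x = 10*x */
}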

I had to find a way to divide multiple words by ten and in the end I came up with four possible solutions:

- found on the internet:

unsigned divu10(unsigned n)
{
    unsigned q, r;
    q = (n >> 1) + (n >> 2);        /* q is roughly n * 0.75      */
    q = q + (q >> 4);               /* q is roughly n * 0.796875  */
    q = q + (q >> 8);               /* q is roughly n * 0.79999   */
    q = q + (q >> 16);              /* q is roughly n * 0.8       */
    q = q >> 3;                     /* q is roughly n / 10, possibly a bit too small */
    r = n - (((q << 2) + q) << 1);  /* r = n - q * 10             */
    return q + (r > 9);             /* correct q if the remainder shows it was too small */
}

It works for the biggest LONGINT I can produce in Pascal, $7FFFFFFF, but it looks a bit complicated (read: I'm lazy) and I have no idea what the error at the end is, so at this moment I'm not really convinced.
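If I ever want to settle that, a simple brute-force desk check against ordinary division would show whether the routine above ever differs from n div 10. A sketch in C, just for checking, not target code:

#include <stdio.h>

unsigned divu10(unsigned n);    /* the routine shown above */

/* Compare divu10() with the compiler's own division. */
int main(void)
{
    unsigned n;
    for (n = 0; n < 10000000u; n++) {    /* extend the range as far as patience allows */
        if (divu10(n) != n / 10) {
            printf("mismatch at %u\n", n);
            return 1;
        }
    }
    printf("no mismatches found\n");
    return 0;
}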


- found on the internet:

Have a look at this:

result = (input * 205) >> 11

This can be seen as "r = (i * 205) div 2048"; the error will be less than one per mille.
When using 16 bits, "r = (i * 6554) >> 16", the error will be 4/65536. I even found the 6502 version of this calculation on the internet.
When using 32 bits the error will be roughly 1/1.000.000.000.

Thinking things over, I'm leaning toward the idea of using 19 bits. The error will be 2/524.288 and, more importantly, the multiplier will be 52429: this means I can still use the 8088's own multiply instruction MUL.

I have no idea whether this will be more complicated to turn into code than the first idea, but its biggest advantage is that you know in advance what the error will be.
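To make the multiply/shift idea concrete, here is a small sketch in C of both the 16-bit and the 19-bit variant. On the 8088 the 16x16 MUL leaves its 32-bit product in DX:AX, so "shift right by 16" simply means "take DX", and "shift right by 19" means "take DX and shift it right three more times":

#include <stdint.h>

/* Divide by ten by multiplying with a scaled reciprocal and shifting:
   6554  = round(65536 / 10),  error a few counts in 65536
   52429 = round(524288 / 10), error about 2 in 524288 (as mentioned above) */

static uint16_t div10_via_16bits(uint16_t n)
{
    return (uint16_t)(((uint32_t)n * 6554u) >> 16);
}

static uint16_t div10_via_19bits(uint16_t n)
{
    return (uint16_t)(((uint32_t)n * 52429u) >> 19);
}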


- using tables:

The main idea is to generate a table with the values for 0.1, 0.01, 0.001, 0.0001, 0.00001, 0.000001, etc. The question is: how many records do we need and how many bits will each record contain? FYI: using just six records, thus including the one for 0.000001, already gives a smaller error than the above 16-bit multiplication/shift idea. Another advantage of using records: speed.
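A little sketch of the table idea, here with only 32 bits per record to keep it readable (the real table would use 80-bit records); each constant is simply round(10^-k * 2^32):

#include <stdint.h>

/* Table with 0.1, 0.01, ... 0.000001 as 32-bit binary fractions (value / 2^32). */
static const uint32_t frac_table[6] = {
    0x1999999Au,   /* 0.1      */
    0x028F5C29u,   /* 0.01     */
    0x00418937u,   /* 0.001    */
    0x00068DB9u,   /* 0.0001   */
    0x0000A7C6u,   /* 0.00001  */
    0x000010C7u    /* 0.000001 */
};

/* Turn up to six digits after the decimal point into a binary fraction:
   the result is the sum of digit * 10^-k, scaled by 2^32. */
uint32_t fraction_from_digits(const char *digits)
{
    uint32_t frac = 0;
    for (int k = 0; k < 6 && digits[k] >= '0' && digits[k] <= '9'; k++)
        frac += (uint32_t)(digits[k] - '0') * frac_table[k];
    return frac;                   /* value = frac / 2^32 */
}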


- using the C64 routines as a base:

That was the original idea, but then I found out that the C64 does the division by using FP routines. And that means I ran more or less into the "chicken and egg" problem: I cannot produce an egg because I don't have a chicken yet.
Side note: the C64 uses a table to look up the FP value for ten, and that triggered the idea of using tables.


Once my own FP routines seem to work, I will see whether I keep the original idea of division-by-ten or replace it with an FP division. The latter will mean less code, but speed can be a decisive factor.

To be continued.....

Feel free to add comments, criticism, ideas, etc. Everything is welcome!
 
Many DOS compilers handle floating point with interrupts and self-modifying code. If an FPU is present, the interrupt calls get replaced with opcodes. Otherwise the interrupts emulate the FPU.

Some more details here.
 
I know about this interrupt method. I tried to trick Turbo by using Turbo Debugger to show me the code behind an 8087 instruction, by stepping through the code on a machine without an FPU. Once it reached that instruction I hoped it would jump into the alternative code. But no, it ran the alternative code behind the scenes and the pointer simply moved to the next instruction.

Of course I looked for the source code of any FPU emulator, but I have found nothing so far. So now I'm busy in the way described above.
 
An update:

I was leaning toward the idea of using 19 bits. The error would be 2/524.288 and, more importantly, the multiplier would be 52429. This meant I could still use the 8088's own multiply instruction MUL. But..... The result was that the fraction 0.1 was converted into "CC CD 00 00" while the C64 converted it into "CC CC CC CC". In other words, the C64 was much more precise! (Logical, in hindsight: 0.1 in binary is the endlessly repeating pattern 1100, so a longer mantissa just means more $CC bytes.) It was clear that I had to use at least 32 bits. But that means I cannot use MUL and therefore have to find another way to do the multiplication. And if I want to end up with the 64-bit or even 80-bit FP format, I will eventually need at least a 64-bit version.

Hmmm, just an idea: what about splitting the multiplier into two words, doing two multiplications, and adding the results after shifting the hi-word result first?
 
I guess I'm not following. On an x86, the product of two 16-bit numbers is 32 bits (returned in DX:AX). So shifting the 32-bit product 11 bits right should preserve the desired precision for your problem, no?
 
To clarify my last post: using 16 bits is not enough. In fact I found out that I needed at least the 32-bit solution: multiplying with 429.496.730 (= $1999 999A) and then doing the shift-right thing. But the 8088 cannot do a 32-bit multiplication, so I have to split it into three parts:
- multiplying the original value by $999A and shifting the result 16 bits to the right
- multiplying the original value by $1999
- adding both results.

In fact I don't do the shift in the first part. I store the result of each multiplication in its own array, and in the third part I just add the corresponding elements. And as you already guessed: my idea worked.
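For whoever wants to see those parts spelled out, here is a sketch in C, restricted to a 16-bit input so that each part maps onto a single 8088 MUL (product in DX:AX):

#include <stdint.h>

/* n * $1999999A, split into two 16x16 multiplications.
   $1999999A = 429.496.730 = round(0.1 * 2^32). */
uint64_t mul_1999999A(uint16_t n)
{
    uint32_t lo = (uint32_t)n * 0x999Au;   /* n times the low word of the multiplier  */
    uint32_t hi = (uint32_t)n * 0x1999u;   /* n times the high word of the multiplier */
    /* The high partial product is worth 2^16 times as much,
       so shift it one word up before adding. */
    return (uint64_t)lo + ((uint64_t)hi << 16);
}

/* n div 10 is then the product shifted right by 32 bits,
   i.e. the upper words of the 48-bit result:
   q = (uint32_t)(mul_1999999A(n) >> 32);  */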

Busy with the renovation, I at least have some time to think things over.
- Using tables is more or less the same as the multiplication/shift method, except that you have even more bits at your disposal.
- But how far must one go? 0.1 .... 0.000001 is OK, but then what about 0.12345 e -10?
- When running into an exponent at all, how do I deal with it? 0.000001 = 1 e -6. Handling something like 1 e -18 could be done by using the 0.000001 entry three times in a row, but that would increase the error. Certainly something to think about.

Am I reinventing the wheel? Sure I am! But the reason why I am doing it is to find out HOW it is done. Even a nice book like "The Art of Assembly Language" explains what the FP format means but never explains how to get there, except for pointing to the UCR library. I want to end up with routines that I understand, and I want to use this thread so other people can understand them as well. The story will be found on my site as well.

So one task for now is to find out how to convert a decimal number with an exponent, like the example above, into an FP one without losing the original precision. But as I already mentioned, the C64 uses FP routines to divide by ten and I'm sure it uses FP routines when handling an exponent as well. So I think I will have a look at those routines first, create my own ones using only numbers without exponents for the moment, and then fine-tune them to handle exponents.

Back to my renovation....
 
I'm not following, but then I don't know your level of formal training in numerical methods.

Do keep in mind that precision can be illusory--see the WikiP article on Significance arithmetic

I recall the first problem given to the class as a homework exercise in my numerical methods class--a simple quotient of two expressions--one involving sin(x) and the other, cos(x). If one codes this (in, say, FORTRAN) as it appears, the result is essentially noise for certain values of x--all significance is lost in the computation. It was a good object lesson.

The CDC 6000 series mainframes had essentially two sets of FP instructions. The first was the ordinary zero-fill-during-normalizing; the second was rounding-during-normalizing (essentially using a pattern of 01010101 as fill). Even the lowly IBM 1620 floating point had a user-specifiable "noise digit" option.

To misquote Mark Twain, "there are liars, damned liars and floating point math".
 
An update:

The C64 handles exponents by either multiplying or dividing the result so far by 10, using the corresponding FP routine, and doing that in a loop as many times as the value of the exponent. To be honest, I had hoped there was a more intelligent method for dealing with larger exponents, but for the moment I have to live with it.
My own idea: what about dividing by, just as an example, 10.000.000 first, followed by smaller values later?
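In code, that idea would look something like this little sketch (the division helper here is only a stand-in for the real FP division routine, and the chunk size of 10^7 is just the example above):

/* Stand-in for the real FP division routine; here just C's own doubles. */
static void fp_div(double *acc, double divisor)
{
    *acc /= divisor;
}

/* "Big steps first": instead of exp separate divisions by 10,
   divide by 10^7 as long as possible, then by 10 for the remainder. */
void apply_negative_exponent(double *acc, int exp)   /* exp >= 0 */
{
    while (exp >= 7) {
        fp_div(acc, 10000000.0);   /* one division instead of seven */
        exp -= 7;
    }
    while (exp > 0) {
        fp_div(acc, 10.0);
        exp -= 1;
    }
}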

Back to my renovation....
 
At the moment I have a 7-bit exponent and one bit for the sign. But there is IMHO no reason why I couldn't expand it if I have very good reasons to. And at the moment I have a 63-bit mantissa plus one sign bit. Not IEEE-754 compliant, but as long as I use it internally only, no problem. The UCR library, for example, also uses a different format internally.

Regarding the subtracting and counting: AFAIK that won't work for the fraction part of a number. The method used by the C64 certainly is not fast, but it has one advantage: small code.

Back to my.. eh, no, three days off into the country with my wife :)
 
You know, on an 8 bit machine, decimal floating point is about the same speed as binary, especially if you don't have hardware multiply/divide. Conversion between display and internal representations is a lot simpler--and you get the benefit that 0.1 is an exact value (which matters in monetary transactions).
 
My main target for the moment is the 8088, written in assembly.

Your "that 0.1 is an exact value" needs some explanation IMHO. Converting 0.1 into FP ends up in a repeating value: 0x3FB999999999999A for double precision. And this leads to another challenge for me: the FP to string conversion. If I add 0.1 to 0.2 I expect 0.3. But when converting the FP value to string I wouldn't wonder if my first result would be something like 0.299999999. Yet I know it should be rounded to 0.3. My question so far: what are the conditions to round something up? You see, still a lot of research to do......
 
When you normalize a number, what value do you shift into the right-most positions?

As an aside, in S/360 FP, normalization is done only at 4-bit boundaries. This makes the exponent a power of 16, so only 32 bits were used for single-precision FP. This means that the precision could be as bad as 20 bits.

On the CDC STAR, our single precision was 48 bit mantissa+16 bit exponent.
 
As said before, I intend to use a table for the values 0.1 ... 0.000001, using 80 bits for each value. I used this FP calculator to get the values for each record; I incremented the value if the 81st bit was a one. And that is the only normalization I do so far.
I just started to have a look at how to do the FP -> string conversion, and that's when this question arose. I have gathered some documentation about normalization and I will see what I have to do at the right moment.
 
An update:

This FP -> string conversion isn't as simple as I thought. That is, the fraction part isn't. I read everywhere that it is just a matter of adding 1/2, 1/4, 1/8, 1/16, etc. if the corresponding bit is set. Fine, that works for us humans but not for computers.
The C64 does it by multiplying the fraction by 1000000000 and then treating the result as an integer. But that means one is limited to nine digits after the point. Using 80 bits should mean I can do much better. I'll just write a little program that shows at what point Free Pascal draws the line. This will then also give me an idea which number I should multiply with.
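As a desk check of that multiply-and-treat-as-integer idea, here is a sketch in C with a 32-bit binary fraction and the C64's factor of 1000000000; the 64-bit multiply is only there for the check, on the 8088 it becomes a multi-word multiplication:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t frac   = 0x1999999Au;   /* binary fraction, value = frac / 2^32, about 0.1 */
    /* adding 2^31 before dropping the low 32 bits rounds instead of truncates */
    uint64_t scaled = ((uint64_t)frac * 1000000000u + (1ull << 31)) >> 32;

    /* scaled now holds the first nine decimal digits of the fraction */
    printf("0.%09llu\n", (unsigned long long)scaled);   /* prints 0.100000000 */
    return 0;
}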

Yet a question: is there another way to convert the fraction part into a decimal number? TIA!

Edit: Free Pascal only shows 15 digits.
 
An update:

I thought I had it all in the bag, but alas, no. I went for the multiplication just like the C64, but using a bigger number: 1 000 000 000 000 000 000. The multiplication went fine and, after getting rid of some small errors, the hexadecimal value was what I expected. But my routine to turn that value into a decimal one was complete rubbish. Back to the drawing board....
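One common way to do that last step is plain long division by ten, one 16-bit word at a time (on the 8088 every step is a single DIV, with the remainder carried into the next word), collecting the remainders as decimal digits. A sketch in C, with the word count just as an example:

#include <stdint.h>

#define NWORDS 4    /* 64 bits as four 16-bit words, most significant word first */

/* Divide the multi-word value by 10 in place and return the remainder. */
static unsigned div10_words(uint16_t v[NWORDS])
{
    uint32_t rem = 0;
    for (int i = 0; i < NWORDS; i++) {
        uint32_t cur = (rem << 16) | v[i];   /* remainder:word, always below 10 * 65536 */
        v[i] = (uint16_t)(cur / 10);
        rem  = cur % 10;
    }
    return (unsigned)rem;
}

static int all_zero(const uint16_t v[NWORDS])
{
    for (int i = 0; i < NWORDS; i++)
        if (v[i] != 0)
            return 0;
    return 1;
}

/* Write the decimal representation of v into buf (destroys v). */
void words_to_decimal(uint16_t v[NWORDS], char buf[24])
{
    char tmp[24];
    int  n = 0;
    do {
        tmp[n++] = (char)('0' + div10_words(v));   /* digits appear right to left */
    } while (!all_zero(v));
    for (int i = 0; i < n; i++)                    /* reverse them for printing */
        buf[i] = tmp[n - 1 - i];
    buf[n] = '\0';
}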
 
Not sure if this will help you, but here is the code I wrote for parsing/generating strings to/from floating point values in my Java VM for the 6502. It was written more for clarity than for speed. Mind you, it uses the IEEE 32-bit single precision floating point representation, but you should be able to follow the logic pretty easily. It also hasn't gone through a great deal of testing, as I moved off in a different VM direction. And it's in Java - it shouldn't be hard to translate into Pascal.

 
"floatAsBits" is a Java function, I presume? Yes, I am using Pascal but for the simple reason that it is easier to write out ideas using Pascal than using ML. Once the idea works fine, the whole must be translated into ML in the end and that means I cannot use built-in functions.
But thank you anyway for wanting to help me!
 