Writing Assemblers... What should a good assembler do?

Kelly Gray · Dec 11, 2023

The first thing that jumped out at me is that / and \ are very similar, but do very different things. It's an easy mistake to swap them, and a pain to spot when reading the source.
Most of the time the % is used for modulus. That conflicts with your number type designation though.
You could use C conventions, 0xdddd for hex, 0bdddd for binary. ddddH and ddddB would also work and are also common.

Include files are very useful to break large programs up into modules. It's handy to be able to build a library of useful routines, and just include them as needed.
Local labels make writing reusable modules much easier since it's easier to eliminate label conflicts.

I would prefer the output filenames be specified on the command line. Anything not specified is not output. A COM file is fine if you're creating a CP/M executable, but if the code is going into ROM then an Intel HEX file is almost a requirement for the EPROM burner software.
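For anyone unfamiliar with the format, here is a minimal Python sketch of writing Intel HEX records. It is only an illustration and is not taken from any assembler in this thread: the names hex_record and hex_file are made up for the example, and only record types 00 (data) and 01 (end of file) are covered.

```python
def hex_record(address: int, data: bytes, record_type: int = 0x00) -> str:
    """Build one record: ':' + byte count + address + type + data + checksum."""
    body = bytes([len(data), (address >> 8) & 0xFF, address & 0xFF, record_type]) + data
    checksum = (-sum(body)) & 0xFF  # two's complement of the sum of all record bytes
    return ":" + (body + bytes([checksum])).hex().upper()


def hex_file(origin: int, image: bytes, row: int = 16) -> str:
    """Split an assembled image into data records and append the end-of-file record."""
    lines = [hex_record(origin + i, image[i:i + row]) for i in range(0, len(image), row)]
    lines.append(hex_record(0, b"", 0x01))
    return "\n".join(lines) + "\n"
```

Calling hex_file(0x0100, image) on the assembled bytes would give text that most programmer software should accept, assuming the image fits in a 16-bit address space.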

Ideally there should be assembler directives to turn the listing on and off. That way I can turn printing off, include a bunch of library stuff, then turn the printing back on, so I don't have to scroll past tons of well debugged library code in my listing file.

Chuck(G) · Dec 11, 2023

cj7hawk said:
Do you know if any assemblers offered "custom" instructions in the assembler itself?

OPDEF was common in DEC assemblers; CDC COMPASS PDF page 167 and many others.

Svenska · Dec 11, 2023

cj7hawk said:
Is there any advantage to relocatable code in CP/M?

Relocatable code is required for any code which does not know its final runtime address at build time. Examples are CP/M's RSX modules, but also more advanced overlay systems. Also, having relocatable object files is a requirement for linking and separate compilation. Not your use-case, but worth mentioning.

cj7hawk said:
I'll see what options are... C seems a bit far from compatible with the approach I've used, but now you've mentioned it, I should be able to put together a list of assembler operator functions.

Whatever approach you choose will make your assembler easier to use for people who know your blueprint, and harder to use for anyone else. Know your audience if you want anyone else to use it. Otherwise don't care. Since I know C reasonably well, I generally prefer having C-style operators and C-style number prefixes.

Another thing to consider for CP/M is the influence of ISO-646 (predecessor to Unicode, ISO-10646): International systems may have replaced a few characters with national language characters. Most notably, this includes brackets [], braces {}, backslash \, pipe |, caret ^ and tilde ~ characters. Relying on them exclusively doesn't work well on some systems. May or may not matter to you.

The US keyboard layout (and close derivatives) makes both slash and backslash very easy to reach. I believe this is the main reason why people confuse and misuse them. Most other layouts are different, often by having slash on the second layer (Shift+'7') and backslash on the third layer (AltGr+'+').

cj7hawk said:
Also, when you say character manipulation, do you mean adding characters into formula to represent a number value - eg, 'A'+$80 ?

Yes, this is a very useful thing to have.

Phil_G · Dec 11, 2023

tofro said:
For "normal" programs, you don't really need or want relocatable code. But if you want to write RSX (resident system extensions) modules or GSX drivers, (and some "more exotic" system extensions that need to relocate themselves to high memory), you need to have relocatable programs - Especially when you're loading more than one driver or RSX, the load address is not known beforehand.

Provided you know where in high memory it's loading, that's catered for with David's 'offset' or 'phase' as M80 users will know it. If you don't know, then yes it needs to be relocatable.
For me personally it's not a requirement, the only circumstance where I've needed to use relocatable code is when linking PLI-80 and RMAC assembler source.

Just thought: some means of ORG'ing tables onto a boundary would be handy, MOD or whatever

tofro · Dec 11, 2023

Phil_G said:
Provided you know where in high memory it's loading, that's catered for with David's 'offset' or 'phase' as M80 users will know it. If you don't know, then yes it needs to be relocatable.
For me personally it's not a requirement, the only circumstance where I've needed to use relocatable code is when linking PLI-80 and RMAC assembler source.

Just thought: some means of ORG'ing tables onto a boundary would be handy, MOD or whatever

Well, with RSXs and GSX drivers, you can't know a load address. They're loaded to the top of memory (which is different for nearly every CP/M machine) and you can load more than one of them ("stack" them from the top down).

But to elaborate a bit on what my essentials would be:

Access to the current address ("*" and/or "$"). I think that's standard
Some minor macro support. Some assemblers overdo this a bit with code-generating loops and the like, that's nice if you have it, but awkward to implement
Expression evaluation should follow the principle of least surprise. Left-to-right, unfortunately, doesn't. Also, I think I would like to see full "C" operator support
Decimal, hex and binary constant formats I think are mandatory. PDP-11 guys like octal, but that's never been a thing on the Z80
A CP/M assembler should support a directive for $-terminated strings
Proper tabulated listing and symbol table generators are a bit more than nice to have
Local labels can be nice, especially in large programs ("just came across the 15th 'str_loop'"), but also can be a nightmare....
Assemblers I like tend to have directives for "non-mutable" and another one for "mutable" equates (the first one gives an error when redefined, the second one is explicitly allowed to be). I think my assembler of choice uses "EQU" and "SET", respectively.
INCLUDE is, I think, mandatory if you don't support linkable code and relocatable object files
INCBIN (include a binary file literally) is nice to have in places
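To make the EQU/SET distinction from the list above concrete, here is a small Python sketch of a symbol table that enforces it. The class and method names are invented for this illustration, and a real assembler may well handle redefinition differently (for example across two passes).

```python
class SymbolTable:
    """Symbols defined with EQU are fixed; symbols defined with SET may change."""

    def __init__(self):
        self._values = {}
        self._mutable = set()

    def equ(self, name, value):
        # EQU: any second definition of the same name is an error.
        if name in self._values:
            raise ValueError(f"{name} is already defined")
        self._values[name] = value

    def set(self, name, value):
        # SET: may be repeated, but may not overwrite an EQU symbol.
        if name in self._values and name not in self._mutable:
            raise ValueError(f"{name} was defined with EQU")
        self._values[name] = value
        self._mutable.add(name)

    def lookup(self, name):
        return self._values[name]
```

With this, two SET lines for the same counter are accepted, while a second EQU for an existing name raises an error.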

cj7hawk · Dec 12, 2023

Phil_G said:
Provided you know where in high memory it's loading, that's catered for with David's 'offset' or 'phase' as M80 users will know it. If you don't know, then yes it needs to be relocatable.
For me personally it's not a requirement, the only circumstance where I've needed to use relocatable code is when linking PLI-80 and RMAC assembler source.

Just thought: some means of ORG'ing tables onto a boundary would be handy, MOD or whatever

Hmmm ORG-ing tables onto a boundary... Should be possible to pick up via the existing ORG function on most assemblers I would assume...

For me, it would be something like ( Let's assume it's a 1-256 byte table, on a 256 byte boundary, so you can look it up with an 8 bit half of a 16 bit register without an ADD operation. ).

.ORG ^&%1111.1111.0000.0000+$100
Where ^ is the symbol for the current PC. This should mask it to drop it back to the previous boundary, then add $100 to move to the next boundary.

Which will locate the next boundary, except when it's already aligned with a boundary, in which case, 256 bytes will be lost. But otherwise this will shift the table forward to the next 256 byte boundary edge.
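As a rough illustration of the arithmetic above (a Python sketch, not the assembler's own code; both function names are made up for this example), the mask-then-add form can be compared with a round-up form that leaves an already aligned PC where it is:

```python
PAGE = 0x100  # size of the boundary: 256 bytes


def align_mask_then_add(pc: int) -> int:
    """The expression from the post: drop to the previous boundary, then add $100."""
    return (pc & 0xFF00) + PAGE


def align_round_up(pc: int) -> int:
    """Alternative: add $FF first, then mask, so aligned addresses do not move."""
    return (pc + PAGE - 1) & 0xFF00
```

Here align_mask_then_add(0x8200) gives 0x8300 (the lost 256 bytes), while align_round_up(0x8200) stays at 0x8200. In a strictly left-to-right evaluator the second form could presumably be written as ^+$FF&%1111.1111.0000.0000, assuming & is the AND operator as used above.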

tofro · Dec 12, 2023

cj7hawk said:
That makes sense, thank you for the example. I guess my current architecture would just replace these with process hooks, so they could all install in the same location ( Typically $1000 ) and page in/out as called, while still being able to access common tables and other data in the original TPA. So it's pseudo relocatable from that perspective. But from your example, assuming I saved a list of fixed vectors to update with the new code location while loading, I could relocate code anywhere in memory.

Do you know if CP/M had a common loader for such code, or did linkers get used for this purpose?

CP/M uses the ".PRL" file format for relocatable binaries. Here's a description. LINK80 could create such files, but a linker is not mandatory.

Chuck(G) · Dec 12, 2023

CP/M 3 and MP/M can load page-relocatable executables. Relocatable objects, if generated by an assembler like M80, are still .REL. The old way, for things like generating page-relocatables for MOVCPM and MP/M, was to assemble a program twice, with one version offset by 100h, and combine the two objects to produce a relocation bitmap.
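The assemble-twice trick can be sketched in a few lines of Python. This is an illustration only: the function name relocation_bitmap is hypothetical, and the bit layout (one bit per program byte, most significant bit first) is an assumption of the sketch rather than a statement about any particular file format.

```python
def relocation_bitmap(image_a: bytes, image_b: bytes) -> bytes:
    """Compare two builds of one program, the second assembled 100h higher.

    A byte that is exactly 1 larger in the second build is taken to be the
    high byte of an absolute address, so it is flagged for relocation.
    """
    if len(image_a) != len(image_b):
        raise ValueError("both images must have the same length")
    bitmap = bytearray((len(image_a) + 7) // 8)
    for i, (a, b) in enumerate(zip(image_a, image_b)):
        diff = (b - a) & 0xFF
        if diff == 1:
            bitmap[i // 8] |= 0x80 >> (i % 8)  # assumed order: MSB first
        elif diff != 0:
            raise ValueError(f"unexpected difference at offset {i}")
    return bytes(bitmap)
```

For example, JP 0105h assembled at 0100h is C3 05 01 and assembled at 0200h is C3 05 02, so only the third byte gets its bit set. A loader can then add its page offset to every flagged byte.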

whartung · Dec 12, 2023

If you don't plan to support a linker, then you should most certainly support an include facility. Ideally you'd have some kind of MODULE concept specifically in regards to things like local and global (exported) variables. It's not a crisis if you don't do this, but it makes things easier in terms of reusable code broken up into include files.

If you don't have the module concept, then everything will be in a global name space, so you would need to ensure that MOD1.INC and MOD2.INC don't stomp on each other in terms of symbol names. You can do something as simple as prefixing local symbols (with a "." or "#" or whatever).

But, of course, the other issue is that you're A) writing this in Z80 which means, B) you're running this on a Z80. So you will potentially run into memory problems.

Also, there's a difference if you're running on a 4MHz Z80 or a modern 10, 20, 50MHz one. Line lengths, symbol lengths, etc. were all part and parcel of running on small systems (not necessarily "stupid parsers"). Also, added flexibility and generality cost performance. Not necessarily as much of an issue on a 512K, banked memory, 30MHz Z80. But on a 64K CP/M machine, running at 4MHz with 48K of free memory, and glacial floppy drives, it's a different story.

Same with the idea of a linker. On modern machines, linkers are pretty much optional, it's trivial enough to just assemble everything in one big go using include files.

On smaller, slower systems, assembly was more expensive than linking. So shorter programs that were linked together improved cycle times.

Chuck(G) · Dec 12, 2023

A linker also gives you access to object libraries, which can be useful.

cj7hawk · Dec 12, 2023

A question out of what wasn't mentioned - Is there value in the source being able to send notices to the console - eg;

.console "NOTE: This is NOT the latest version of this code. It's a test version. Go ask Peter for the latest source" -

And all it does is print that message during compile as a way to alert the person assembling it of the message during that time;

eg;
A> ASSEMBLE MYCODE.ASM /T /Y /L
NOTE: This is NOT the latest version of this code. It's a test version. Go ask Peter for the latest source

Assembling 723 lines. 0 errors. Created MYCODE.COM
A>

Regards
David

Svenska · Dec 13, 2023

cj7hawk said:
Is there value in the source being able to send notices to the console

Yes, and I think both warnings/informational messages and errors are useful.

Many assembly programs are customizable through definitions at the top, which are then evaluated by the assembler through IF statements. Having explicit WARNING and ERROR directives allows the author to handle unlikely or invalid configurations. For example, a configurable XMODEM tool could warn if no drivers are enabled (making it useless), or fail if incompatible drivers are enabled.

tofro · Dec 13, 2023

Ah, I forgot: Some means of conditional assembly (like #IFDEF/#END pairs) are very handy.

cj7hawk · Dec 15, 2023

So I'm about halfway through the instructions - Well through the single-byte instructions, and I've included the IX/IY (+D) versions of them with automatic prefixes and suffixes.... I've finished the maths and settled on single-byte text fields that can be mixed with calculations (eg, DB $20+'ABCDEf'-$20 would be the same as having DB $20+'A','B','C','D','E','f'-$20 which would be aBCDEF in terms of actual characters. ) I haven't added DW separately yet but it will only allow a single entry - eg, DW $1234 / 32 >> ! +1 would be legitimate, but DW $1234+"AB" would switch back to 8 bits and error since it really messes with the maths routines to flip values.
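A small Python sketch of how that DB expansion might behave, for anyone trying to picture it. This is a guess at the semantics based only on the example above (the leading term applies to the first character, the trailing term to the last), and the function name is made up.

```python
def expand_db_string(text: str, lead: int = 0, trail: int = 0) -> bytes:
    """Expand lead+'text'+trail into bytes.

    Assumption: 'lead' is added to the first character only, 'trail' to the
    last character only, and every result is truncated to 8 bits.
    """
    if not text:
        raise ValueError("empty string")
    values = [ord(ch) for ch in text]
    values[0] += lead
    values[-1] += trail
    return bytes(v & 0xFF for v in values)
```

So expand_db_string("ABCDEf", lead=0x20, trail=-0x20) produces the bytes for aBCDEF, matching the example in the post.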

Rather than use tables, I've been using calculations to form values - so I get some quirks presently like LD (HL),(HL) being accepted syntactically. ( Is that a real word? )

And I'm wondering if there are common ways to code this... At the moment, I'm grouping them by syntax, which creates a few interesting issues, such as whether most assemblers individually trap LD (HL),(HL) ( ie, HALT ) and what the approach is to coding them...

I'm finding text errors are taking up a lot of space, and that I tend to "bomb out" on errors rather than continuing assembly ( warnings excepted, which I have not included yet, since I'm focussed on errors at this time ). I'm also wondering whether it's acceptable to only use the necessary number of bits on expansion of arithmetic values - eg, using the lower 8 bits for 8 bit values - Later I'll start to include generic checks for such assignments, eg, "CALL 8BITWARNING" before an assignment so that if bits are present in the upper 8 bits, it issues a warning, but otherwise assembles.

Also, I allow labels to override values, which is interesting, as it means it's probably going to be possible to do something like EQU 8,9 which would install a genuine label named 8, except that it doesn't since decimals are evaluated first, but EQU 123456,12345 does work since it would determine that 123456 wasn't a valid 16 bit number and might be a label. Hopefully no one would obfuscate code like that, but it does

And it allows numbers and symbols ( except operators ) to be mixed into labels eg;
LOOP[1]:
1DATA:
TEST(4)BETA:
[START]:
are all valid labels. Though for the life of me, I'm not sure why I thought labels like [START] were a good idea, but I figure it might be useful for some.

And I'm wondering if anyone ever wrote a book back in the day on what a good assembler should do... Also the code is getting pretty big. It's a little over 4K at the moment, and I'm expecting to hit around 6K to 8K by the time I'm done and included both documented and "undocumented" instructions and included the appropriate warnings.

Something like a "How to write an assembler for dummies" or similar.

But it is working so far, and produces good code...
Also the error system is producing a bit of a cryptic response, which is a code like 3>00167-8131-LABEL-,- which means the third term on line 167 assembles to an instruction at hex location 8134 ( not necessarily that byte - that's where the instruction starts ) and has the token "LABEL" with the operator ,

And then I provide a text description of the error such as "LABEL not defined." so it's immediately clear which term caused the assembly failure and what line, but it's not all that user friendly yet...

But it's coming along, and it should work with the same source code as my PC based cross-assembler...

At the moment, I'm looking to allow it to assemble direct to memory, to a disk file, or to an intermediate format like HEX. Also it writes instructions separately from other bytes, so I could probably get it to produce debugging style outputs where the code is printed to the left of the line in hex along with a memory target, or things like that. Which could lead to adding in a relocatable format. Kind of like the old CP/M paper listings.

Also it should be possible to include warnings like "Not 8080 compatible" if it hits those instructions since at the moment, they seem grouped pretty cleanly.

I haven't changed any of the maths symbols yet, but they are easy enough to change in the source, so I'll leave that thinking for later.

Writing an assembler in assembly has been an interesting experience so far.

Svenska · Dec 15, 2023

cj7hawk said:
Something like a "How to write an assembler for dummies" or similar.

I linked to a very long video on this topic in a previous post. Did you look at it?

When it comes to characters, I prefer the differentiation between strings (in double quotes) and characters (in single quotes). Only characters are useful for arithmetic. Generally, I am also a fan of rejecting invalid input and decent error messages. Simply using software should not require having the documentation in a binder next to the keyboard because the author decided to be cryptic.

How do you differentiate between label names clashing with instructions, i.e. "LD A, 0x33" vs "LD: LD A, 0x33"?

Does your math handle operator precedence?

cj7hawk · Dec 16, 2023

Svenska said:
I linked to a very long video on this topic in a previous post. Did you look at it?

I didn't watch it all - I just scanned through it and bookmarked it to have a more detailed look at later - but it's all in C ( I've written my assembler before in BASIC and it works OK - but writing in Assembly is very different when building an assembler... )

In BASIC I just matched bytes to tokens while in assembly I'm assembling command bytes from different things, and trying to reuse code as much as possible... Which improves reliability also... And I'm making a lot of mistakes, but I started with the lexical analyzer and then built the mathematical process parts, then some basic assembler directives like ORG so I could test the maths, then I went onto the instructions starting with LD. I'm about 75% of the way through the first page of opcodes now.

Also, my lexical analyzer doesn't permit a "peek" ahead, so I can't look at what's coming next to see what to do with what I have. Also, I don't store what I have unless it's related to a calculation.

This streamlines the code a lot. If a routine is called and not needed, it steps back and out of the way, so I call one of two token fetches -
1. Get the next token and stop on any named operator, or whitespace or EOL, that is encountered. Used for labels with a colon and for instructions and directives.
2. Same as 1 but ignore any and all whitespace. Used for anything after the instruction. This means LD A,B and LD A , B are read the same and return the same token.
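To show what those two fetch modes might look like, here is a Python sketch. It is not the actual Z80 code: the operator set, the function names, and the choice to skip leading whitespace in mode 1 are all assumptions made for the illustration.

```python
OPERATORS = set(",:+-*/\\&()")  # assumed operator set, for illustration only


def next_token(line: str, pos: int, skip_whitespace: bool):
    """Return (token, terminator, new_pos). The terminator is not part of the token."""
    token = []
    while pos < len(line):
        ch = line[pos]
        pos += 1
        if ch in " \t":
            if skip_whitespace:
                continue              # mode 2: whitespace is ignored entirely
            if token:
                return "".join(token), " ", pos   # mode 1: whitespace ends the token
            continue                  # assumed: leading whitespace is skipped
        if ch in OPERATORS:
            return "".join(token), ch, pos
        token.append(ch)
    return "".join(token), "", pos    # end of line


def split_statement(line: str):
    """Fetch the instruction with mode 1, then everything after it with mode 2."""
    word, term, pos = next_token(line, 0, skip_whitespace=False)
    parts = [(word, term)]
    while pos < len(line):
        word, term, pos = next_token(line, pos, skip_whitespace=True)
        parts.append((word, term))
    return parts
```

With this sketch, split_statement("LD A,B") and split_statement("LD A , B") return the same list, and no lookahead or stored token is needed because each call only reports what ended the token.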

Also, operators are not included in the token fields. The only exception to this is the ')' token when used with IX+DISP) since the DISP can be a calculation, and ) is not a valid number, and is not supported in mathematical labels for this reason.

Svenska said:
When it comes to characters, I prefer the differentiation between strings (in double quotes) and characters (in single quotes). Only characters are useful for arithmetic. Generally, I am also a fan of rejecting invalid input and decent error messages. Simply using software should not require having the documentation in a binder next to the keyboard because the author decided to be cryptic.

The more I've been thinking of that, the more I realize I need to put better error reporting in... I'm writing it and I've already started to forget what the order of the number is while debugging.

But I figured out I need to include which term in a line caused the issue, as knowing the line doesn't always mean I notice which part of a line is faulty. Errors cause a hard stop, but warnings will not affect assembly.

Also, I hate having to remember that " means string and ' means character, so I just decided " and ' both mean the same thing, which means I can do stuff like;


EQU DQUOTE,'"'
EQU SQUOTE,"'"
EQU CHAR,'A'
DB STRING,'THIS IS A STRING'
DB LEADING_BIT8_SET,$80+'STRING'
DB TRAILING_BIT8_SET,'STRING'+$80
DB SENTENCE,"Here's a sentence with an apostrophe."
DB SQUOTED,"'This would be single-quoted text'"
DB DQUOTED,'"This would be double-quoted text"'

But they must be closed by the same. For example, LD A,'A" would be invalid.

So DB means Byte, Bytes, Strings, etc... Any sequence of 8 bits.
DW strictly means two bytes, little endian, value expansions only.
BLOCK means a number of reserved bytes of a specific character.
This makes it very simple, is versatile and I don't think there's anything I can't do that would require bytes and a byte to be different directives.

Svenska said:
How do you differentiate between label names clashing with instructions, i.e. "LD A, 0x33" vs "LD: LD A, 0x33"?

Oh, the colon differentiates it... And that's perfectly valid syntax and works most of the time. For example, I use instruction names for the strings to make it easier to read...
eg


LD: DB 2,'LD' ; The LD command as found by the lexical analyzer.
LD DE,LD
CALL TESTLOOP
JP Z,LD_COMMANDS

etc.

But for DE, I had to use something different, eg,

DE_REG: DB 2,'DE'
LD DE,DE_REG

Because clearly DE is a recognized register.
But LD DE,DE2 or anything else it doesn't recognize as a register is perfectly valid.
Also, unless a label is declared with a : as the first term, or with an EQU then it's going to cause an error... So mistyping DE2 instead of DE isn't going to create a random variable. For that reason, I've had to go with two passes. But the first pass can waste the code writes.
Also, this is how I'm going to implement MACROs. I'll just have the assembler redirect the output from codespace, or filespace, to the contents of a label... Then the label will hold the code sequence for that macro.

eg
MSTART THREE_BITS_RIGHT
RRA
RRA
RRA
MEND

Then this can be called by
MACRO THREE_BITS_RIGHT

Or at least that's the plan at the moment... I still have to do a bit more research into macros to make sure I understand them.

I have used as few data structures as I can, but the ones I have used are very versatile.

Svenska said:
Does your math handle operator precedence?

No it doesn't.

It's like reverse polish, except forward direction... An operator works left to right and only on the last and next value and while operators can be chained at times, there's no BIMDAS / BODMAS / ASMD or whatever it is now. So there's no need to think about operator precedence.

So, something like;
A+B/C * 12 \ 8 + %11100000 means Add A to B, then the result divided by C then the result multiplied by 12, then the result mod 8, then %11100000 (0xE0) added to the result.

It would expand to something like this in normal maths notation ( not how the assembler does it ). So values on the left have precedence over values on the right.

((((( a+b ) /c ) *12 ) \8 ) +%11100000 )

This means there is no number stack, which makes it particularly convenient to upgrade to 32 or even 64 bit maths later... It's incredibly simple. There's probably a name for this type of precedence-free maths, but I don't know what it is.
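The left-to-right scheme can be sketched with a single accumulator. The Python below is only an illustration of the idea, not the assembler's code: the token pattern, the 16-bit wrap, and the function names are assumptions, and it only knows +, -, *, /, \ (mod) and & plus $hex, %binary and decimal values.

```python
import re

# One value or one operator per match; anything unrecognised is silently skipped.
TOKEN = re.compile(r"\s*(\$[0-9A-Fa-f]+|%[01.]+|\d+|[A-Za-z_]\w*|[-+*/\\&])")


def parse_value(text: str, symbols: dict) -> int:
    if text.startswith("$"):
        return int(text[1:], 16)
    if text.startswith("%"):
        return int(text[1:].replace(".", ""), 2)  # dots are just digit separators
    if text.isdigit():
        return int(text)
    return symbols[text]  # anything else is looked up as a label


def evaluate(expr: str, symbols: dict = None) -> int:
    """Apply each operator to the running result and the next value, left to right."""
    symbols = symbols or {}
    tokens = TOKEN.findall(expr)
    acc = parse_value(tokens[0], symbols)
    for op, operand in zip(tokens[1::2], tokens[2::2]):
        value = parse_value(operand, symbols)
        if op == "+":
            acc += value
        elif op == "-":
            acc -= value
        elif op == "*":
            acc *= value
        elif op == "/":
            acc //= value
        elif op == "\\":
            acc %= value
        elif op == "&":
            acc &= value
        acc &= 0xFFFF  # assumed 16-bit arithmetic
    return acc
```

Note that evaluate("2+3*4") gives 20 here, not 14, which is exactly the behaviour described: only one running value is ever held, so no stack is needed however long the expression gets.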

It still supports complex stuff like (A+B)*(C+D) - but would ask the user to do it on multiple lines.

eg,
EQU CDSUM,C+D
EQU RESULT,A+B*CDSUM

or
EQU ABSUM,A+B
EQU CDSUM,C+D
EQU RESULT,ABSUM*CDSUM
Which is a little more complex since it's using the label table as a de facto mathematics stack.

But it's an assembler, not a mathematics program so I don't mind that. It's very easy to get your head around it and removes the problem that precedence changes every few years... Which is why we can't help our grandkids with their homework if we don't want them to fail.

Also, it supports *any* length of calculation, and quite complex calculations at that, but is never going to run out of stack.

Speaking of which the label table I made up is very simple but will support maths calculations, values, and even macros... It's quite versatile. And treats them the same way COBOL treats numbers... In the end it's all just a bit pattern.

cj7hawk · Dec 16, 2023

Also, as a thought, it means that Macros and String as variables are kind of the same thing...
So I guess it would be possible for a label to be both a value and a string.

eg;

MSTART STRING
DB "This is a string I might change later. But I want to store it as a label. I don't know why"
MEND

This string is 90 characters, so
DB STRING or LD A,STRING would give the value of 90 (numerical).
while
MACRO STRING would write the 90 bytes to code at the current PC.

That's not intentional. It's just how it would work. Which is interesting.

Also,

MSTART STRING
MACRO STRING

Would crash the assembler- since it's infinitely recursive. But the principle is that Macros can macro macros.
But as I noted, I haven't started on the Macros yet, so this is just reflective of my thinking process.
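Thinking out loud, the macro-as-label idea could be modelled roughly like the Python below. It is a sketch of the plan described above, not working assembler code: the class and method names are invented, nested MSTART blocks are not handled, and the self-reference check is just one possible way to avoid the infinite recursion.

```python
class MacroTable:
    """Labels that hold recorded bytes: the value is the length, MACRO replays them."""

    def __init__(self):
        self.bodies = {}           # label name -> recorded bytes
        self.recording = None      # name of the macro between MSTART and MEND
        self.output = bytearray()  # normal code space

    def mstart(self, name):
        self.recording = name
        self.bodies[name] = bytearray()

    def mend(self):
        self.recording = None

    def emit(self, data: bytes):
        # While recording, assembled bytes are redirected into the label's contents.
        target = self.bodies[self.recording] if self.recording else self.output
        target.extend(data)

    def value(self, name) -> int:
        return len(self.bodies[name])

    def macro(self, name):
        if name == self.recording:
            raise ValueError(f"macro {name} expands itself")
        self.emit(bytes(self.bodies[name]))
```

Recording three RRA bytes (1Fh each) under THREE_BITS_RIGHT and then calling macro("THREE_BITS_RIGHT") would write those three bytes at the current PC, while value("THREE_BITS_RIGHT") would give 3.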

cj7hawk · Dec 17, 2023

My assembler Bingo card is slowly being filled in... I don't know why coincidentally I left the hardest for last... Maybe I should just replan that entire routine.

Currently up to 6K assembled.. Which feels kind of large for an assembler, but there's a lot of error codes in there and I've improved the user error reports. I realized quite late that there's no easy way to know which errors to combine. I imagine I'll hit 8K by the time I've added in the BIT and Extended codes.

Svenska · Dec 17, 2023

The original DRI assembler is 8K, and hjalfi's implementation is slightly smaller. Both only handle 8080 opcodes.

Your feature set is comparable to both of them, so I'd assume you to land around the same size.

cj7hawk · Dec 17, 2023

Svenska said:
The original DRI assembler is 8K, and hjalfi's implementation is slightly smaller. Both only handle 8080 opcodes.

Your feature set is comparable to both of them, so I'd assume you to land around the same size.

As long as I can keep the executable a little under 8K I will be happy. I have already implemented the 8080 commands and all the non-bitwise IX/IY commands and am sitting at about 6.5K so I think I'll be OK.

I have a bit more to go before I can start throwing assembly files at it though... I still haven't completed my first-pass activities or added the warnings. I will have to optimize it a little more too. Some code is repeated where I could use the same routines... But the deeper I get, the more I realize what I could have done. Planning would have been smart, but I didn't know how I wanted to do it, so just starting on the lexical analyzer seemed like the best choice... So far, it's working OK. But I need a way to check for errors, which is more complex than it sounds... I'll need a way to trap the failure and pick up where it left off.
