
Writing Assemblers... What should a good assembler do?

Do you mean "Drop the link pointer" as in "using something like the counter to denote the list and scan for it?"
I don't know what you mean by that, but that's probably not what I'm suggesting.

You currently have entries in the form nnls...vv, where nn is the link pointer, l is the length of the symbol, s... is l bytes of symbol name, and vv is the value of the symbol. You currently save nn so you can move your pointer into the symbol table forward to that location if the current symbol doesn't match. But you don't need this: you can find the start of the next symbol simply by finishing a complete parse of the current entry. I.e., when you are c bytes into the symbol name and find it doesn't match, just add l-c+2 to your current location and you're at the start of the next symbol.
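As a rough illustration of the skip arithmetic described above, here is a minimal Python sketch of a table scan with no link pointer. The entry layout (length byte, name, two value bytes) follows the description; the little-endian value order, the zero-length end marker, and the names `make_entry` and `find_symbol` are assumptions made for the example, not anything taken from the actual assembler.

```python
def make_entry(name: bytes, value: int) -> bytes:
    """Build one entry: length byte, name bytes, 2-byte value (little-endian assumed)."""
    return bytes([len(name)]) + name + value.to_bytes(2, "little")


def find_symbol(table: bytes, name: bytes):
    """Return the value stored for name, or None if it is not in the table."""
    pos = 0
    while pos < len(table):
        length = table[pos]
        if length == 0:  # assumption: a zero length byte marks the end of the table
            return None
        start = pos + 1
        c = 0  # how many bytes of the symbol name have matched so far
        while c < length and c < len(name) and table[start + c] == name[c]:
            c += 1
        if c == length and c == len(name):
            # Full match: the two value bytes follow the name directly.
            return int.from_bytes(table[start + length:start + length + 2], "little")
        # Mismatch c bytes into the name: skip the remaining (length - c)
        # name bytes plus the 2 value bytes to reach the next entry.
        here = start + c
        pos = here + (length - c) + 2
    return None


table = make_entry(b"FOO", 0x1234) + make_entry(b"FOOBAR", 0x0005) + b"\x00"
print(find_symbol(table, b"FOOBAR"))
```

The same walk works if more data follows the value, provided the scanner knows how long that data is for each entry.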

This is entirely orthogonal to having additional data types in the list. That, by the way, can be very compactly done by restricting the symbol length to, say, 63 (or 64 if the stored value is length minus one), which requires only 6 bits to represent, leaving the top two bits of the length byte free for other purposes. (It would even be reasonable IMHO to restrict max symbol length to 31, or 32 with the same trick, thus giving you three bits for type, covering eight different types.)
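A small Python sketch of that packing, assuming the low 6 bits hold the length and the top 2 bits hold a type tag. A 6-bit field stores 0 to 63, so this version caps names at 63 characters. The names `pack_header` and `unpack_header` and the tag values are hypothetical.

```python
LEN_BITS = 6
LEN_MASK = (1 << LEN_BITS) - 1      # 0x3F: low six bits hold the name length
TYPE_COUNT = 1 << (8 - LEN_BITS)    # 4 possible type tags in the top two bits


def pack_header(type_tag: int, length: int) -> int:
    """Combine a type tag and a name length into one header byte."""
    if not 0 <= length <= LEN_MASK:
        raise ValueError("name too long for a 6-bit length field")
    if not 0 <= type_tag < TYPE_COUNT:
        raise ValueError("type tag does not fit in two bits")
    return (type_tag << LEN_BITS) | length


def unpack_header(header: int):
    """Split a header byte back into (type_tag, length)."""
    return header >> LEN_BITS, header & LEN_MASK


print(hex(pack_header(2, 5)))
```

Changing `LEN_BITS` to 5 gives the three-bit type field mentioned above.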

A linked list may not be the best option, but it does allow future requirements that I am adding in presently and has a small overhead while allowing for this.

In any event, there are far worse uses of space in the current code that I will need to address later such as the wasted space in the code to store the opcodes and scan them. This would be better as Bit7 start/terminate list...
The overhead is not small: as I pointed out it's over a kilobyte for your assembler itself. That is, for example, close to 20 times as much space as you'll save from removing the length bytes from the opcode list and using bit 7 termination.
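For reference, bit 7 termination can be sketched as follows in Python: the mnemonics are stored back to back, with the high bit set on the last character of each, so no length byte is needed. The function names and the three sample mnemonics are only for illustration, and the sketch assumes non-empty 7-bit ASCII names.

```python
def pack_names(names):
    """Store names back to back, setting bit 7 on the last character of each."""
    out = bytearray()
    for name in names:
        raw = name.encode("ascii")
        out += raw[:-1]
        out.append(raw[-1] | 0x80)
    return bytes(out)


def find_index(packed: bytes, name: str):
    """Return the position of name in the packed list, or None if absent."""
    target = name.encode("ascii")
    index = 0
    current = bytearray()
    for b in packed:
        current.append(b & 0x7F)        # strip the terminator bit for comparison
        if b & 0x80:                    # bit 7 set: this byte ends the current name
            if bytes(current) == target:
                return index
            index += 1
            current.clear()
    return None


packed = pack_names(["LD", "CALL", "RET"])
print(len(packed), find_index(packed, "CALL"))
```

Each name costs exactly its own length in bytes, which is where the one-byte-per-opcode saving comes from.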

I have an objective for the assembler to work under CP/M, but the target system will have 1 MB of memory...
That is a pretty radical change from the original spec you were talking about when you wanted this to run on a 20 KB system. If it really is intended to run on a 1 MB system, why do you keep worrying about code and data size?

LABEL EQU VALUE and LABEL CALL FOO are fine as high level concepts, but the : is the exception delimiter there, and if you examine the parser, it allows an operator to change the context of the line. The parser is very simple and simply breaks up the next incoming instruction, and as you note, *everything* is an instruction. A ':' operator is an instruction to add a label and set the value to PC.
Right, so in this case of EQU, you add a label, set the value to PC, discover that the "instruction" is an EQU, read and parse the operand, and reset the label value to that result. I'm not seeing the issue here.

- ' and " are both valid quotes. I think this may be unique to my assembler since quotes can be quoted - eg, "'" and '"' are valid, so can be used directly.
Nope, not at all unique. ASL and many other assemblers allow this.
 
I don't know what you mean by that, but that's probably not what I'm suggesting.

More or less we're thinking the same - I guess the difference is I'm planning on adding more values after vv, such as Macros, so additional data structure is needed, and I would prefer that to scanning the list and contextually figuring out where the next code is. It means I need two bytes for *every* entry, but as I mentioned, saving executable code is preferred over saving functional space.
That is a pretty radical change from the original spec you were talking about when you wanted this to run on a 20 KB system. If it really is intended to run on a 1 MB system, why do you keep worrying about code and data size?

I can see where this would be confusing, but it's just the specification I've set for myself. It should run and be somewhat functional on a 20 KB system, but ultimately it will run on a 1 MB system and future versions will be able to use all unused memory for the table.

Part of it is also that I'm making up the requirements as I go, so it's not as though I thought this out any further than the assembler code you've seen. As other ideas emerge I'm examining them and making notes, so what you suggest may be a great way to go.

Right, so in this case of EQU, you add a label, set the value to PC, discover that the "instruction" is an EQU, read and parse the operand, and reset the label value to that result. I'm not seeing the issue here.

I do see what you mean, but I'd still have to store the label somewhere temporarily... A vector would be enough though... Also, a label by itself usually should trigger a Macro, not adding the label. It's the colon that tells the assembler to add the label in as a label.

There is food for thought though. I don't think the solution is simple, but if I can use what you've suggested and keep the code space down, then that is likely to happen.

Nope, not at all unique. ASL and many other assemblers allow this.

LoL! I guess there really is nothing new under the sun. Just stuff I don't know about yet and it makes sense that other designers would do the same, since otherwise you have to use kludges like LD A,$22 rather than LD A,'"' or LD A,"'" :)

I feel pretty good about that - it means I'm on the right track for features - :)
 
More or less we're thinking the same - I guess the difference is I'm planning on adding more values after vv, such as Macros, so additional data structure is needed....
Yes, but something in your code has to be able to parse that, so you can just use that same parser to skip past the data.

It means I need two bytes for *every* entry, but as I mentioned, saving executable code is preferred over saving functional space.
That doesn't really match up with ideas like your CLEAR command that add code in order to (sometimes, if the user wants to put in the effort) save a little bit of data space. (I reckon more often that just won't be used, and you end up just with more code.)

And in fact, even in the absence of CLEAR or similar commands, it's still saving you the extra code you need for building and following the linked list.

But in the end, I'm not really seeing the difference between saving on space used by code and space used by data; on a CP/M system they're both in RAM anyway, so either will run into problems if you don't have enough RAM for the program and the data that the program needs to keep in memory.

I do see what you mean, but I'd still have to store the label somewhere temporarily...
I don't see why you need temporary storage for a symbol name used for EQU but not for a symbol name for any other mnemonic. It seems to me in both cases you just stick it at the end of the heap.

A vector would be enough though... Also, a label by itself usually should trigger a Macro, not adding the label. It's the colon that tells the assembler to add the label in as a label.
Right, and that colon tells the assembler to add the label in as a label regardless of whether it appears before a "real" instruction like PUSH or CALL or LD or a pseudo-instruction like EQU.
 
Yes, but something in your code has to be able to parse that, so you can just use that same parser to skip past the data.
It's definitely modular and replaceable. Just a few routines.
But in the end, I'm not really seeing the difference between saving on space used by code and space used by data; on a CP/M system they're both in RAM anyway, so either will run into problems if you don't have enough RAM for the program and the data that the program needs to keep in memory.
It has to do with storing a bare minimum of programs in a ROM that appears as a RO disk under CP/M, which is a part of the Loki architecture. It boots its CP/M OS off of a ROM. Adding a basic assembler to the ROM is an objective, but ROM space is far more at a premium than RAM space. I need to fit the OS and all basic utilities into 64K of space... Well, about 59K of space after the boot sector with the BIOS/BDOS and the directory structure. 1K block size for the ROM. So that's 59K for the CCP, Assembler, Utilities, Monitor, Editor, some drivers, etc.

I want it to work in low RAM systems also, but that's not a requirement - just an objective. 20K would still allow a fair bit of assembly.

The main reason for being able to clear the table is to support local variables in included files. It's possible to use in the main code, but that's not its purpose.

I don't see why you need temporary storage for a symbol name used for EQU but not for a symbol name for any other mnemonic. It seems to me in both cases you just stick it at the end of the heap.
I don't need a symbol name, but I do need a vector to where that symbol is in the table at the least. Otherwise I need a symbol to locate the vector.

Right, and that colon tells the assembler to add the label in as a label regardless of whether it appears before a "real" instruction like PUSH or CALL or LD or a pseudo-instruction like EQU.

It would definitely work if I added the colon. But I'm not sure that's any better than what I have.

Still, it would be nice to add. I'll put that in for things to look at later.
 
It's definitely modular and replaceable. Just a few routines.
I think that's orthogonal. If you don't have the routines to read that data type, you (hopefully!) don't have the routines to write that data type either.

It has to do with storing a bare minimum of programs in a ROM that appears as a RO disk under CP/M, which is a part of the Loki architecture.
Fair enough. But again, removing the linked list processing from your code reduces the code size too, so this is basically a twofer.

I don't need a symbol name, but I do need a vector to where that symbol is in the table at the least. Otherwise I need a symbol to locate the vector.
It would definitely work if I added the colon. But I'm not sure that's any better than what I have.
Well, for a start, doing the same parsing for EQU-generated labels as you do for non-EQU-generated labels will both save on code and be easier to learn because it's more consistent for the programmer.

I don't get what you're talking about when you say you're not needing a symbol name but needing a pointer. In most assemblers an EQU-generated symbol is no different from a symbol generated from a PC location. I.e., when someone does a jp foo, it works exactly the same whether that symbol was (or will be) defined via foo: equ $1234 or org $1234 followed by foo: or a,a. So regardless of the situation, you process both the definitions and the references the same way.
 
I don't get what you're talking about when you say you're not needing a symbol name but needing a pointer. In most assemblers an EQU-generated symbol is no different from a symbol generated from a PC location. I.e., when someone does a jp foo, it works exactly the same whether that symbol was (or will be) defined via foo: equ $1234 or org $1234 followed by foo: or a,a. So regardless of the situation, you process both the definitions and the references the same way.

By the time the parser reads the "EQU" bit, it has already erased the label... So if I make a state exception, I still need a vector to where the subsequent value will be placed in memory.

Also, all of the matching routines are written to assume the buffer is the only place in memory that it matches, so if the symbol was created in the list, then the next thing I need to do is remember where it was - if I know the vector, I don't need to match it again.

The parser only moves forward. Not backwards. Nor does it buffer the line. It only reads forwards and it copies the next segment of code, normally a single element, into the buffer where it is processed.

But as you say, if it's already created... a line like this;

LABEL: EQU VALUE

Could be easily implemented - not as a single command, but as two commands... though it would clash with the existing EQU LABEL,VALUE form.

Overall, it would be far easier to implement as

LABEL= VALUE

This fits the current syntax, requires no further processing and is already supported with just a few bytes of code.

But LABEL EQU VALUE is either going to generate an error ( command not found) or execute a macro stored after the value.

Unless I rewrite the core lexical analyzer it's not going to be a simple process so it's something I need to look at more deeply. Though first I want to get the extra features in and working. I can work on optimising the code later.

David.
 
By the time the parser reads the "EQU" bit, it has already erased the label... So if I make a state exception, I still need a vector to where the subsequent value will be placed in memory.
Yeah, this is still all confusing as heck to me. It seems to me that these two lines should be parsed in the same way, using the same code up to the point where it decides how to process the mnemonic:
Code:
FOO:   EQU $1234
FOO:   JP  $1234
If you do just that syntax, that would make for a simpler and smaller parser than having alternate forms of EQU and label-prefixed EQU and whatever that work differently.
 
Oh, wait: I just did realise that you may be thinking of what to do about code that looks like this:
Code:
SOME_SYMBOL_NAME:
           EQU $1234
For that, can't you just update the value of the last symbol added to the symbol table? The same code could be used to process single-line NAME: EQU $1234.
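To make the "update the last symbol added" idea concrete, here is a minimal Python sketch, not the real parser. The `Assembler` class, the instruction-size table, and the `$` hex prefix handling are all assumptions made for the example. A `NAME:` prefix always defines the label at the current PC, and EQU then overwrites the value of whichever label was defined most recently, so the one-line and two-line forms share the same path.

```python
INSTRUCTION_SIZE = {"JP": 3, "CALL": 3, "RET": 1, "NOP": 1}  # sample Z80 sizes


def parse_value(text: str) -> int:
    """Parse $hex or decimal; a stand-in for a real expression parser."""
    return int(text[1:], 16) if text.startswith("$") else int(text)


class Assembler:
    def __init__(self):
        self.pc = 0
        self.symbols = {}
        self.last_label = None  # name of the most recently defined label

    def assemble_line(self, line: str) -> None:
        tokens = line.split()
        if tokens and tokens[0].endswith(":"):
            # The colon always means: add the label with the current PC.
            self.last_label = tokens.pop(0)[:-1]
            self.symbols[self.last_label] = self.pc
        if not tokens:
            return  # label alone on the line; an EQU may follow on the next one
        mnemonic, operands = tokens[0].upper(), tokens[1:]
        if mnemonic == "EQU":
            if self.last_label is None:
                raise ValueError("EQU without a preceding label")
            self.symbols[self.last_label] = parse_value(operands[0])
        elif mnemonic == "ORG":
            self.pc = parse_value(operands[0])
        else:
            self.pc += INSTRUCTION_SIZE.get(mnemonic, 1)
        self.last_label = None  # a label only applies to the next statement


asm = Assembler()
for src in ["ORG $100", "FOO: EQU $1234", "BAR:", "    EQU $0005",
            "START: JP $1234", "NEXT: RET"]:
    asm.assemble_line(src)
print(asm.symbols)
```

Only the EQU branch differs from an ordinary instruction; the label handling in front of it is identical for both.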
 
Oh, wait: I just did realise that you may be thinking of what to do about code that looks like this:
Code:
SOME_SYMBOL_NAME:
           EQU $1234
For that, can't you just update the value of the last symbol added to the symbol table? The same code could be used to process single-line NAME: EQU $1234.

Both
FOO: EQU $1234
and
FOO:
EQU $1234

would parse almost the same at present, though I get a collision error on both since;

LOCATION: EQU VARIABLE, $1234
and
VARIABLE: EQU $1234

would look the same to the EQU handler and it would install a label called $1234 into the table, then error out on the lack of a value. Any attempt then to do the first would expand out and whatever "VARIABLE" was would go into "LOCATION:" - It's a syntax handling conflict. There's no clear differentiator on which way it's supposed to go.

To some extent, the "," operator could be used to determine which was correct, but functionally, there's incompatibility with the rest of the code that gets in the way of a simple solution.

But you've given me some good ideas and caused me to examine this problem from different perspectives... So when I write the macro command I'm going to consider then whether I can add an EQU handler in there.

The reason is that if a macro is called, and there's no Macro ( label is conventional, ie, terminates right after the value into the next label ) I was planning on generating an error. Since I already have the label, and its position has been located, and I'm about to give a fatal error anyway, it's not a bad place to do "one more check" for an EQU or a SET, then pick up the value, then write it since the label location is already found and can be stored by the MACRO code. Then I could support both ways... If I put it there, I've already done the rest.

The only downside I can see is that I may not be able to use LABEL EQU VALUE within a Macro since it would use Macro execution code while in a Macro record function.
 
LOCATION: EQU VARIABLE, $1234
and
VARIABLE: EQU $1234

would look the same to the EQU handler and it would install a label called $1234 into the table, then error out on the lack of a value. Any attempt then to do the first would expand out and whatever "VARIABLE" was would go into "LOCATION:" - It's a syntax handling conflict. There's no clear differentiator on which way it's supposed to go.
Right, which is why I keep suggesting just switching to the sym: EQU value syntax and saving all the extra code you currently use to deal with the other syntax that has to parse out the symbol name from a pair of expressions, rather than just parse one expression of the exact same type used by any other operand.

The reason is that if a macro is called, and there's no Macro ( label is conventional, ie, terminates right after the value into the next label ) I was planning on generating an error. Since I already have the label....
I'm going to suggest that your life will be much easier if you do not consider macros to be labels/symbols, since they are referenced in the instruction field of a line, not the operand field. Mixing such two different things would be like mixing instructions like CMP and labels/symbols.

Note that this does not stop you from storing your macro names and values in the same name-value heap as the labels/symbols, at a slight cost in search time. You just need to tag the type of each entry, which can easily be done with e.g. an unused bit in the name length field, as I suggested earlier.
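A hedged Python sketch of a shared heap with tagged entries: bit 7 of the header byte marks a macro, the low 7 bits hold the name length, label payloads are a fixed 2 bytes, and macro payloads carry their own length byte. All of these layout choices and function names are assumptions for illustration. The point is that a lookup for one type steps over entries of the other type using the same size rule that reads them.

```python
TYPE_LABEL, TYPE_MACRO = 0, 1


def add_label(heap: bytearray, name: bytes, value: int) -> None:
    heap += bytes([(TYPE_LABEL << 7) | len(name)]) + name + value.to_bytes(2, "little")


def add_macro(heap: bytearray, name: bytes, body: bytes) -> None:
    # Assumed macro payload: one length byte followed by the stored body.
    heap += bytes([(TYPE_MACRO << 7) | len(name)]) + name + bytes([len(body)]) + body


def payload_size(heap: bytearray, payload_pos: int, kind: int) -> int:
    """Size of an entry's payload; also tells the scanner how far to skip."""
    return 2 if kind == TYPE_LABEL else 1 + heap[payload_pos]


def lookup(heap: bytearray, kind: int, name: bytes):
    """Return the raw payload of the entry with this type and name, else None."""
    pos = 0
    while pos < len(heap):
        header = heap[pos]
        entry_kind, name_len = header >> 7, header & 0x7F
        entry_name = bytes(heap[pos + 1:pos + 1 + name_len])
        payload_pos = pos + 1 + name_len
        size = payload_size(heap, payload_pos, entry_kind)
        if entry_kind == kind and entry_name == name:
            return bytes(heap[payload_pos:payload_pos + size])
        pos = payload_pos + size
    return None


heap = bytearray()
add_label(heap, b"BDOS", 5)
add_macro(heap, b"BDOS", b"CALL 5")
print(lookup(heap, TYPE_LABEL, b"BDOS"), lookup(heap, TYPE_MACRO, b"BDOS"))
```

Because the type is part of the match, a macro and a label can share a name without colliding.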
 
Having thought this through from a few angles, it seems the best approach is;

* Default to no local variables and single-sub-pass-per-pass processing, without any deletion of variables.
(ie, Just include like it's a part of the calling source ).
* Allow "Modules" to decide what they require allowing;
. Two pass optional if global variables need to be completely accurate on pass 1.
. Local variables and masking higher-level variables if required, including retro-forcing this via the module itself.
. The option to make *all* variables and labels local and non-interfering with previous labels, regardless of pass (eg, default all variables and labels are local and are not specifically deleted, but do not interfere with other labels and unique labels become global by default ), but once again, are non-interfering, with the first declaration of a label having precedence over later declarations at any level, but still not interfering with later declarations.
. Being able to delete specified and local variables and labels after processing and leave global ones.

Taking this control from the main source and making it all optional means first-time users won't care or need to care about the added functionality, and experienced users could make modules that take advantage of all of these if required, but generally as long as modules are register-level entry/exit or use common memory, any code could be made into a module by removing the .ORG statements and adding in a source-switch command to establish local variables only for that routine.

This should address speed/memory/pass-conflict requirements.
 
The includes are all written, and allow any level of nesting, and both single and double-pass per includes depending on how they are written, with single-pass as default.

Labels can be either local, global, or a special label that is removed without affecting other labels.

Each nesting costs around 170 bytes of memory to store the current file handles so nesting of includes can occur for as long as memory is available. I didn't include making includes auto-test redundancy as that would break conditional code, and it's possible for the system to alert an include that it's already been included, so it can choose not to reassemble on subsequent calls or calls by other routines depending on how it's written.

Also, the label table now gets rewritten since it's a linked list, allowing new routines to "mask" all previous labels except system labels, and then reuse all label names. This is recursive in nature, so each nested level can create its own labels without worrying about duplicates, and the clearing of old labels is controlled by the source code (ie, not automatic on exiting a nesting level).
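The masking behaviour can be modelled roughly like this in Python. This is a dictionary-based sketch rather than the linked list described above, and the `LabelTable` class and its method names are hypothetical. Each nesting level gets its own scope, a scope may hide everything outside it except the system labels, and scopes are dropped only when the source asks for it.

```python
class LabelTable:
    def __init__(self, system_labels):
        self.system = dict(system_labels)   # always visible, never masked
        self.scopes = [({}, False)]         # list of (labels, masks_outer)

    def push_scope(self, mask_outer=False):
        """Open a nested level; mask_outer hides all labels defined outside it."""
        self.scopes.append(({}, mask_outer))

    def pop_scope(self):
        """Explicitly discard the innermost level and its labels."""
        if len(self.scopes) == 1:
            raise RuntimeError("cannot drop the outermost scope")
        self.scopes.pop()

    def define(self, name, value):
        self.scopes[-1][0][name] = value

    def lookup(self, name):
        # Search from the innermost level outward, stopping at a masking level.
        for labels, masks_outer in reversed(self.scopes):
            if name in labels:
                return labels[name]
            if masks_outer:
                break
        return self.system.get(name)


table = LabelTable({"BDOS": 5})
table.define("LOOP", 0x100)
table.push_scope(mask_outer=True)
table.define("LOOP", 0x200)
print(table.lookup("LOOP"), table.lookup("BDOS"))
```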

This means either code can be included as though it was in the source, or written so that it can be called at the start of a program and then all labels erased before moving on, can be called without erasing the labels, or can use markers to differentiate between local and global labels.

Now I need to include Macros... And these will consume system memory unless removed from memory, which is unlikely given what a macro is.

So two options present;

1) Store a macro as a series of bytes that get "stamped" into code when called or
2) Store a macro as a series of *instructions* that get *assembled* when called - eg, stripped down code to just the opcode mnemonics, values and labels without any wasted formatting space.

Option 1 makes for less use of system memory, since bytes are stored as bytes, allowing macros to be used to create new instructions I don't support, as long as they don't need to act on values.

Option 2 allows for more complex macros that can reference labels, potentially allowing for arguments to be inserted, and the macro might change depending on arguments, but would require creating system labels for arguments (eg, arg1, arg2 etc ) so that a Macro could set arguments and reference them separately each time it was called. The downside is that I have to store opcodes and variable names in system memory once defined, so it's around 10 times larger typically. Though I'm assuming macros aren't going to be very long, since if they were, it would make more sense to write includes.

Option 2 also means code size for a macro might change when called, especially if conditionals are embedded in the macros.

Any thoughts on the best way to implement macros?

Current program size is around 9K, so still fairly small, and the macro code itself should be reasonably small also, so both approaches would be valid.
 
Just want to quip good job so far, sounds like you're getting a lot out of the project.
 
Just want to quip good job so far, sounds like you're getting a lot out of the project.

Thanks for the encouragement :)

Writing an assembler turned out to be a fantastic way to learn new techniques. It begins with someone pointing out what assemblers do, then I try to figure out why people use that feature. In learning more about the feature, I end up covering a huge cross-section of the community and their techniques, and it's not an exaggeration to say I never knew 90% of them existed... And of the other 10% that I did know about, I probably only used about a quarter of the techniques.

The more I think on it, storing the mnemonics for a macro makes more sense over storing the bytes the mnemonics generate. A macro shouldn't be too long anyway, and should be something rare, so being able to process it as code that can reference other labels and arguments seems to be the way to go. It also allows for programmers to create their own mnemonics if the CPU calls for it, since that would just be a string of DBs with formulas following instead of specific bytes.

Unfortunately I've never programmed my own macros before so I don't know what limitations other assemblers impose on the macro and I'm making assumptions about what they should do from a few examples I've read.
 
You can think of Macros much like the Includes, just manifest differently. The key differences are simply the (potential) logic involved, and, perhaps, the granularity (i.e. single file with several macros in it, rather than 1 "big" "stupid" macro per file).

But much of the logic is the same: scope, context, local variables, nesting, etc. You could almost say there's nothing different between a fancy "include" and a macro save where the text is sourced from. Just treat the includes as one big macro.
 
You can read about the very complex macro features of the IBM 370 assembler, here:


Page 251 in the document (page 263 of the PDF).

Bill

That's a fascinating document, thank you.

It's heavy on architecture even if it's light on examples ( though most of that is probably me not being familiar with the different instruction set ) and is fantastic in terms of common definitions and other information that an assembler writer would want to consider.

It's interesting how, whether or not the writer of an assembler knew of these, the form of the assembler will often have similarities and constraints. Form follows function and function defines form.

Like most things, writing an assembler is better the second time around... I wish I had this to read before starting my project.

From examining how the macros work, much like my includes, "Option 2" is the way to go with the macros. If a longer macro is needed and memory isn't sufficient, calling it via includes is more practical and avoids using much system memory.

It seems the best solution is to force short macros, maybe 256 bytes maximum, which corresponds to around 16 to 32 bytes assembled. Anything longer than this would be better served as an include, while macros can be defined anywhere in my architecture, including within includes.

Then I only need macros to do what my include feature cannot - eg, defining new commands for specific processors that I didn't anticipate yet.

Thanks again for posting that link.
 
I'm glad you found it useful.
There are LOTS of IBM 370 assembler related manuals on Bitsavers.

The manuals with PLM in the name (Program Logic Manual) explain how they work.
(Most of this stuff was supplied in source code form back in the day.)

Getting a bit off track here...
Related to the size of your program bloating with additional features: many of the IBM programs existed in different versions, with a letter such as "Assembler Level E" or "FORTRAN H", etc. That letter code relates to the minimum core size machine needed to run the program.

The early 360 and 370 models were not big machines by modern standards.
Assembler level E for example needed a machine with at least 32K core.

IBM also used a lot of overlays.

[Attached image: ibm360-processor-size-model-letters.jpg]
 
Macros are now implemented.

Question: How many arguments should reasonably be provided to a Macro?

My architecture allows specific arguments such as labels, information, calculations, etc, to be passed to a macro as a comma separated list, each up to a 16-bit value.

These are saved in system variables, supporting something like this;

.org $100

MACRO BDOS
LD DE,ARG2
LD C,ARG1
CALL $0005
MEND

BDOS Console_Output,'#'
BDOS Print_String,STRING

RET

STRING: DB $0A,$0D,'THIS IS A STRING!.$'

; CPM Calls as EQU functions.
EQU Console_Output , 2
EQU Print_String , 9

.END

When executed as a COM file, the above would produce the output;

#
THIS IS A STRING!.

This allows arguments to be passed to a macro to affect the code it assembles, yet make it entirely predictable, and would allow creation of new "Opcodes" using commas and values, since even register representations are only functionally numerical symbols.

I've presently allowed for three arguments.

So
MACRONAME 1,2,3 is OK , and the values are;
ARG1=1
ARG2=2
ARG3=3

But

MACRONAME 1,2,3,4
Would give the error "Too Many Macro Arguments" as a fatal error.

So the question: How many arguments should be possible to transfer after a macro is called. Is three enough? Is it too many? What is a reasonable limit?

Also, I don't allow nested macros, but I do allow chained macros. ( Eg, Macros can call other macros, but it chains them - ie, assembly of the current macro ends, and assembly of the new macro begins. )

Thanks
David

p.s. Arguments are *not* mandatory. A macro can be called with 0,1,2 or 3 arguments.
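For anyone following along, the argument handling described in this post could be modelled roughly like this in Python. It is a text-substitution sketch, not the actual implementation (which stores the arguments in system variables); the `expand` function, the `MacroError` class, and the naive comma split (which would break on a quoted comma) are assumptions made for the example.

```python
import re

MAX_ARGS = 3  # the limit currently described; raising it only changes this constant


class MacroError(Exception):
    pass


def expand(body, call_operands: str):
    """Return the macro body lines with ARG1..ARGn replaced by the call's operands."""
    text = call_operands.strip()
    # Naive split: does not handle a comma inside a quoted character or string.
    args = [a.strip() for a in text.split(",")] if text else []
    if len(args) > MAX_ARGS:
        raise MacroError("Too Many Macro Arguments")

    def substitute(match):
        n = int(match.group(1))
        # Arguments are optional: an unsupplied ARGn is left untouched.
        return args[n - 1] if n <= len(args) else match.group(0)

    return [re.sub(r"\bARG(\d+)\b", substitute, line) for line in body]


bdos = ["LD DE,ARG2", "LD C,ARG1", "CALL $0005"]
for line in expand(bdos, "Console_Output,'#'"):
    print(line)
```

With this shape, the argument limit is a single constant, so trying a larger limit costs nothing beyond the storage for the extra argument slots.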
 
A reasonable limit would be 16 arguments. Macros are not only useful for code snippets, but also for table generation. Building a CP/M DPB requires more than 3 arguments.
 