I linked to a very long video on this topic in a previous post. Did you look at it?
I didn't watch it all - I just scanned through it and bookmarked it to have a more detailed look at later - but it's all in C ( I've written my assembler before in BASIC and it works OK - but writing in Assembly is very different when building an assembler... )
In BASIC I just matched bytes to tokens, while in assembly I'm assembling command bytes from different things, and trying to reuse code as much as possible... Which improves reliability also... And I'm making a lot of mistakes, but I started with the lexical analyzer, then built the mathematical process parts, then some basic assembler directives like ORG so I could test the maths, then I went on to the instructions, starting with LD. I'm about 75% of the way through the first page of opcodes now.
Also, my lexical analyzer doesn't permit a "peek" ahead, so I can't look at what's coming next to see what to do with what I have. Also, I don't store what I have unless it's related to a calculation.
This streamlines the code a lot. If a routine is called and not needed, it steps back and out of the way, so I call one of two token fetches -
1. Get the next token and stop on any named operator, or whitespace or EOL, that is encountered. Used for labels with a colon and for instructions and directives.
2. Same as 1 but ignore any and all whitespace. Used for anything after the instruction. This means LD A,B and LD A , B are read the same and return the same token.
Also, operators are not included in the token fields. The only exception to this is the ')' token when used with (IX+DISP), since the DISP can be a calculation, and ) is not a valid number, and is not supported in mathematical labels for this reason.
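A rough sketch of the two token fetches described above, in Python rather than Z80 just to show the idea. The operator set, the function name fetch_token and the choice to skip leading whitespace in mode 1 are all assumptions for illustration, not how the real assembler does it.

```python
# Assumed operator set; the real assembler's list may differ.
OPERATORS = set("+-*/\\,:()")

def fetch_token(line, pos, skip_whitespace):
    """Return (token, stop_char, new_pos) with no peek-ahead.

    skip_whitespace=False: mode 1, stop on operator, whitespace or EOL.
    skip_whitespace=True:  mode 2, same, but whitespace is ignored entirely.
    The stop character is returned separately, never as part of the token.
    """
    token = ""
    while pos < len(line):
        ch = line[pos]
        pos += 1
        if ch in " \t":
            # Assumption: leading whitespace is skipped in both modes.
            if skip_whitespace or not token:
                continue
            return token, ch, pos
        if ch in OPERATORS:
            return token, ch, pos
        token += ch
    return token, "", pos  # an empty stop_char marks EOL

def fetch_line(line):
    """Mode 1 for the instruction, then mode 2 for two operands."""
    first, stop1, pos = fetch_token(line, 0, False)
    second, stop2, pos = fetch_token(line, pos, True)
    third, stop3, pos = fetch_token(line, pos, True)
    return [(first, stop1), (second, stop2), (third, stop3)]

print(fetch_line("LD A,B"))
print(fetch_line("LD A , B"))
```

Both calls give the same tokens and stop characters, which is the point of the second fetch mode.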
When it comes to characters, I prefer the differentiation between strings (in double quotes) and characters (in single quotes). Only characters are useful for arithmetic. Generally, I am also a fan of rejecting invalid input and decent error messages. Simply using software should not require having the documentation in a binder next to the keyboard because the author decided to be cryptic.
The more I've been thinking of that, the more I realize I need to put better error reporting in... I'm writing it and I've already started to forget what the order of the number is while debugging.
But I've figured out I need to include which term in a line caused the issue, as knowing the line doesn't always mean I notice which part of the line is faulty. Errors cause a hard stop, but warnings will not affect assembly.
Also, I hate having to remember that " means string and ' means character, so I just decided " and ' both mean the same thing, which means I can do stuff like:
EQU DQUOTE,'"'
EQU SQUOTE,"'"
EQU CHAR,'A'
DB STRING,'THIS IS A STRING'
DB LEADING_BIT8_SET,$80+'STRING'
DB TRAILING_BIT8_SET,'STRING'+$80
DB SENTENCE,"Here's a sentence with an apostrophe."
DB SQUOTED,"'This would be single-quoted text'"
DB DQUOTED,'"This would be double-quoted text"'
But they must be closed by the same. For example, LD A,'A" would be invalid.
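As a minimal sketch of that quoting rule (Python, with a hypothetical read_quoted helper and made-up error wording): whichever quote opens the literal is the only one that can close it, so the other quote character is just ordinary text inside.

```python
def read_quoted(text, pos):
    """Read a quoted literal starting at text[pos].

    Returns (content, position just past the closing quote).
    ' and " are interchangeable, but the closer must match the opener.
    """
    opener = text[pos]
    if opener not in "'\"":
        raise ValueError(f"expected a quote at column {pos + 1}")
    end = text.find(opener, pos + 1)
    if end == -1:
        raise ValueError(
            f"literal opened with {opener} at column {pos + 1} "
            f"is never closed by {opener}"
        )
    return text[pos + 1:end], end + 1

print(read_quoted("\"Here's a sentence\"", 0))
print(read_quoted("'\"'", 0))
```

Calling read_quoted on the operand of LD A,'A" raises the error, because the " never closes a literal opened with '.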
So DB means Byte, Bytes, Strings, etc... Any sequence of 8 bits.
DW strictly means two bytes, little endian, value expansions only.
BLOCK means a number of reserved bytes of a specific character.
This keeps it very simple and versatile, and I don't think there's anything I can't do that would require bytes and a byte to be different directives.
How do you differentiate between label names clashing with instructions, i.e. "LD A, 0x33" vs "LD: LD A, 0x33"?
Oh, the colon differentiates it... And that's perfectly valid syntax and works most of the time. For example, I use instruction names for the strings to make it easier to read...
eg
LD: DB 2,'LD' ; The LD command as found by the lexical analyzer.
LD DE,LD
CALL TESTLOOP
JP Z,LD_COMMANDS
etc.
But for DE, I had to use something different, eg,
DE_REG: DB 2,'DE'
LD DE,DE_REG
Because clearly DE is a recognized register.
But LD DE,DE2 or anything else it doesn't recognize as a register is perfectly valid.
Also, unless a label is declared with a : as the first term, or with an EQU, then it's going to cause an error... So mistyping DE2 instead of DE isn't going to create a random variable. For that reason, I've had to go with two passes. But the first pass can just throw away the code writes.
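Roughly, the operand lookup rule above could be sketched like this in Python. The register list is partial, the label table is assumed to have been filled in during pass 1, and classify_operand and the label values are hypothetical names and numbers made up for the example.

```python
# Partial Z80 register list, for illustration only.
REGISTERS = {"A", "B", "C", "D", "E", "H", "L",
             "BC", "DE", "HL", "IX", "IY", "SP"}

def classify_operand(name, labels):
    """Decide what an operand name refers to during pass 2.

    Registers win over labels; anything that is neither a register nor a
    label declared in pass 1 is an error instead of a new variable.
    """
    if name in REGISTERS:
        return "register"
    if name in labels:
        return "label"
    raise KeyError(f"undefined label: {name}")

# Assumed result of pass 1: labels declared with ':' or EQU (values made up).
labels = {"LD": 0x8000, "DE_REG": 0x8003, "DE2": 0x8006}

print(classify_operand("DE", labels))      # register
print(classify_operand("DE2", labels))     # label
print(classify_operand("LD", labels))      # label
```

An instruction name like LD can still be a label here because only the operand position is being checked, and the colon marks the declaration.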
Also, this is how I'm going to implement MACROs. I'll just have the assembler redirect the output from codespace, or filespace, to the contents of a label... Then the label will hold the code sequence for that macro.
eg
MSTART THREE_BITS_RIGHT
RRA
RRA
RRA
MEND
Then this can be called by
MACRO THREE_BITS_RIGHT
Or at least that's the plan at the moment... I still have to do a bit more research into macros to make sure I understand them.
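To make the redirect idea concrete, here is a small Python sketch under the assumption that a macro is nothing more than the already-assembled bytes stored against a label. The class and method names are hypothetical; the only real detail is that RRA assembles to the single byte $1F.

```python
class Assembler:
    """Toy model: output goes either to codespace or into a label."""

    def __init__(self):
        self.code = bytearray()   # normal codespace
        self.labels = {}          # label name -> recorded macro bytes
        self.target = self.code   # where emitted bytes currently go

    def emit(self, byte):
        self.target.append(byte & 0xFF)

    def mstart(self, name):
        # Redirect output into a fresh buffer held by the label.
        self.labels[name] = bytearray()
        self.target = self.labels[name]

    def mend(self):
        # Restore normal output to codespace.
        self.target = self.code

    def macro(self, name):
        # Expanding the macro is just copying the stored bytes out.
        self.target.extend(self.labels[name])

RRA = 0x1F  # Z80 opcode for RRA

asm = Assembler()
asm.mstart("THREE_BITS_RIGHT")
for _ in range(3):
    asm.emit(RRA)
asm.mend()
asm.macro("THREE_BITS_RIGHT")
print(asm.code.hex())
```

One limit of storing finished bytes is that a macro recorded this way cannot take parameters, so that may be worth checking while researching macros.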
I have used as few data structures as I can, but the ones I have used are very versatile.
Does your math handle operator precedence?
No it doesn't.
It's like reverse polish, except forward direction... An operator works left to right and only on the last and next value and while operators can be chained at times, there's no BIMDAS / BODMAS / ASMD or whatever it is now. So there's no need to think about operator precedence.
So, something like;
A+B/C * 12 \ 8 + %11100000 means Add A to B, then the result divided by C then the result multiplied by 12, then the result mod 8, then %11100000 (0xE0) added to the result.
It would expand to something like this in normal maths notation ( not how the assembler does it ). So values on the left have precedence over values on the right.
((((( a+b ) /c ) *12 ) \8 ) +%11100000 )
This means there is no number stack, which makes it particularly convenient to upgrade to 32 or even 64 bit maths later... It's incredibly simple. There's probably a name for this type of precedence-less maths, but I don't know what it is.
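A minimal Python sketch of that left-to-right evaluation, assuming integer division for / and that \ is mod as in the example. The function names, the label values and the lack of 16-bit wraparound are assumptions for illustration; the point is that one running result is all the state needed.

```python
import re

# Each operator acts on the running result and the next value only.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a // b,   # assumption: integer division
    "\\": lambda a, b: a % b,   # backslash is mod, as in the example
}

def parse_value(text, labels):
    """Turn one term into a number: %binary, $hex, decimal, or a label."""
    if text.startswith("%"):
        return int(text[1:], 2)
    if text.startswith("$"):
        return int(text[1:], 16)
    if text.isdigit():
        return int(text)
    return labels[text]  # KeyError here means an undefined label

def evaluate(expr, labels):
    """Evaluate strictly left to right, with no precedence and no stack."""
    parts = re.split(r"([+\-*/\\])", expr.replace(" ", ""))
    result = parse_value(parts[0], labels)
    for op, term in zip(parts[1::2], parts[2::2]):
        result = OPS[op](result, parse_value(term, labels))
    return result

labels = {"A": 10, "B": 20, "C": 3, "D": 4}  # made-up values
print(evaluate(r"A+B/C * 12 \ 8 + %11100000", labels))

# (A+B)*(C+D) done over two lines, using the label table as storage.
labels["CDSUM"] = evaluate("C+D", labels)
print(evaluate("A+B*CDSUM", labels))
```

With these made-up values the first expression works out as ((((10+20)/3)*12) mod 8) + 224, and the second gives the same answer as (A+B)*(C+D).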
It still supports complex stuff like (A+B)*(C+D) - but would ask the user to do it on multiple lines.
eg,
EQU CDSUM,C+D
EQU RESULT,A+B*CDSUM
or
EQU ABSUM,A+B
EQU CDSUM,C+D
EQU RESULT,ABSUM*CDSUM
Which is a little more complex since it's using the label table as a de facto mathematics stack.
But it's an assembler, not a mathematics program so I don't mind that. It's very easy to get your head around it and removes the problem that precedence changes every few years... Which is why we can't help our grandkids with their homework if we don't want them to fail.
Also, it supports *any* length of calculation, and quite complex calculations at that, but is never going to run out of stack.
Speaking of which the label table I made up is very simple but will support maths calculations, values, and even macros... It's quite versatile. And treats them the same way COBOL treats numbers... In the end it's all just a bit pattern.