
Visualizing the Intel 4004 - Opening the "black box" of the MCS-4 architecture

Jaime Clot

Hello everyone,

I've been a bit of a silent observer here, especially following the threads on MCS-4 based industrial systems. Being 63, I’ve spent a lifetime around micros, but the 4004 always had that "mysterious" aura for being the first.

I got tired of emulators that just show you a hex dump and a "Run" button. I wanted to actually see the 4-bit bus moving, the timing, and how the 4004/4001/4002/4003 chips dance together.

So, I’ve spent the last few months building what I call Quadium 4004 Workbench. It’s a visual workbench where every register and chip is exposed.
I even ended up writing a custom transpiler (QuadBasic) because I wanted to see how high-level logic actually translates into those specific 4-bit instructions in real-time.

Let's face it: the 4004 is strange. Really strange. If you come from modern architectures, its logic feels almost alien at first. But after a lot of head-scratching, you have that 'AHA!' moment where everything finally clicks...

I’d love to hear from anyone who has poked at the real hardware or struggled with the quirks of the MCS-4 family.

I’m wide open to your feedback, suggestions, or any 'you should have done it this way' comments—I'd much rather have the technical truth than a pat on the back.

Cheers,

Jaime Clot

workbench.png
 
That looks beautiful. The 4004 / MCS-4 is a really fun chipset. I love the idea of a BASIC compiler for the 4004. Is this on GitHub?
 
Hi dfnr2,

Thank you so much for the feedback! I’m really glad you like the visual approach; it’s encouraging to hear that.

Regarding GitHub, I don’t have plans to open-source the code for now. I’m currently leaning towards a more formal, standalone release once the system is fully stable and polished.

I'm glad you liked the idea of the compiler! I ended up creating QuadBasic specifically to handle the 4004's architecture. It’s a 4-bit high-level language that handles things like nibble-range variables and direct mapping to the 4001/4002 memory banks.

I’ve attached the draft manuals for the Workbench and the language. Please keep in mind they are very much a work in progress; in fact, the layout and readability are currently quite horrible, and they still contain some rough edges and inconsistencies, but they should give you a good idea of the internal logic.

To be honest, even though I started tinkering with Z80 machine code years ago, I must confess the 4004 left me completely baffled for a while. I remember those first days thinking... Wait, to load data into RAM I have to do... what? Or looking for the XOR instruction and realizing it just isn't there! And don't even get me started on the stack.

It took quite a bit of head-scratching to stop thinking in 'standard' logic and embrace the 4004's quirks. But once you finally get into that 4-bit mindset, you start to see how incredibly elegant the whole 'machine' and its timing really are. It’s a strange dance, but a beautiful one.

Cheers!
 

Attachments

Glad you liked it!
Right now the 4004 is pretty much eating up all my time, though I've definitely thought about the 4040 for the future.

The 8008 is a whole different beast, obviously... but hey, who knows? ;-)
 
I have a couple of 4004 machines. One is the early development board, with the 1702A programming card and connector backplane. I also have a recreation of an early project done with the 4004. It is a form of calculator called a maneuver board. It would be used to do things like compute the closest point of approach, and by ships to keep track of other ships and the shoreline. The original project was done by two students at the Naval Postgraduate School in Monterey, CA. They were working under an instructor, Gary Kildall, who would later create CP/M. I created the hardware based on the listing and operational description. I created my own simulator to fix issues with the PDF listing, which was poorly printed on an ASR-33 with a rutted platen.
Years ago, I worked at Intel and was responsible for keeping up the manufacturing of the UPP PROM programmer, which uses a 4040 processor. While there I created hardware to test the 4040 PROMs used for the UPP. The most common problem was capacitor legs put in the wrong holes (on the bus).
Most simulators available on the web are not very useful. They simply execute code (sometimes with errors, misunderstanding the stack being the most common).
I wanted a simulator that I could connect an entry keyboard to, along with a simulated display and the specific RAM used in the maneuver board program. I also used the same simulator to work on code I used on the development board I had. This includes simulation of the serial TTY inputs and outputs, which requires keeping track of execution time.
I call this process of simulating real hardware "instrumenting the simulator". It must do more than just execute instructions. It must keep track of simulation cycles, read data input files, write output files, allow debug traps at various steps, and essentially work like the machine it is simulating.
I never did a visual display, as my main intent was for it to be used interactively, as something more than just executing instructions. I had several keyboard commands to print out things like registers, ports and cycle times. Also, for things like the maneuver board, it allowed keyboard entry and result display of what the maneuver board was calculating.
Your display simulation is cool but I never got into having too much displayed and was more into debugging hardware or code that wasn't working.
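
The kind of instrumentation described in this post can be sketched in a few lines of Python. This is only an illustration, not Dwight's simulator: the `InstrumentedSim` class, the trap callbacks and the four-instruction straight-line program are hypothetical, and instruction semantics are deliberately left out so that only the bookkeeping is visible. It assumes the commonly quoted 4004 timing of 8 clock periods per machine cycle at a 740 kHz clock, with two-byte instructions such as FIM taking two machine cycles.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

CLOCK_HZ = 740_000        # assumed nominal 4004 clock
CLOCKS_PER_CYCLE = 8      # assumed clock periods per machine cycle


@dataclass
class InstrumentedSim:
    # addr -> (mnemonic, size in bytes, machine cycles); hypothetical layout
    program: Dict[int, Tuple[str, int, int]]
    pc: int = 0
    cycles: int = 0
    traps: Dict[int, Callable[["InstrumentedSim"], None]] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)

    def elapsed_us(self) -> float:
        """Simulated time in microseconds, derived from the cycle counter."""
        return self.cycles * CLOCKS_PER_CYCLE / CLOCK_HZ * 1e6

    def step(self) -> None:
        # A debug trap fires before the instruction at its address is "executed".
        if self.pc in self.traps:
            self.traps[self.pc](self)
        _mnemonic, size, machine_cycles = self.program[self.pc]
        self.cycles += machine_cycles          # time bookkeeping only
        self.pc = (self.pc + size) & 0xFFF     # straight-line code, no jumps


def report(sim: InstrumentedSim) -> None:
    sim.log.append(f"trap at 0x{sim.pc:03X} after {sim.cycles} cycles")


# Hypothetical straight-line program: three one-byte and one two-byte instruction.
sim = InstrumentedSim(
    program={
        0x000: ("LDM 5", 1, 1),
        0x001: ("XCH R0", 1, 1),
        0x002: ("FIM P0", 2, 2),
        0x004: ("NOP", 1, 1),
    },
    traps={0x002: report},
)
for _ in range(4):
    sim.step()

print(sim.log, sim.cycles, round(sim.elapsed_us(), 3))
```

Because every step adds to `cycles`, anything that depends on time (a serial TTY bit, a keyboard scan) could be scheduled against that counter rather than against wall-clock time.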
 
Hi Dwight,

I’m honestly a bit floored by the level of feedback here. Having Dwight, Paul and Dave review my work is a real privilege.

Dwight, your Intel background and the 'maneuver board' story are truly fascinating. What really hits home is hearing about those capacitor legs. My background is in industrial electronics, and your mention of the ASR-33 triggered some serious flashbacks. I still remember my early days with a custom-built 'computer' box connected to a modified teletype that was my 'paper 4K monitor' back then, using punched tape for loading programs. Once you’ve lived through that, you never see hardware the same way again.

This whole project started as a mix of nostalgia and curiosity. Later on, I figured that for most modern devs, the MCS-4 is difficult to visualize through a terminal alone. I wanted to bridge that gap visually. That’s the reason behind QuadBasic: if they can write in a familiar syntax and see it translate into 4-bit logic in real-time, the lightbulb finally goes on.

Regarding the simulation itself, I made a conscious choice to focus on the instruction level rather than cycle-accurate emulation. The simulator steps through instructions one at a time, consistent with the 4004 ISA model. It’s designed to be a fast, practical workbench for debugging and teaching, not for workloads that depend on exact cycle-by-cycle hardware timing.

And about those stack issues you've seen in other emulators... Stack? What stack? ;-) Debugging a 4004 stack overflow usually means checking who called whom… exactly three calls ago. ;-)

Thanks again for the warm welcome and for sharing these gems of history. It's an honor to talk shop with the guys who were actually there.

Best regards,

Jaime
 
Hi cruff,

The primary target platform is Windows. That said, given how the Workbench is structured, it should behave correctly under Wine/Proton on Linux. I haven’t had the chance to test that configuration myself yet, so I can’t say with complete certainty.

The Workbench is already in its final testing cycle, but on the development side I’m still wrestling with a couple of architectural decisions. Right now I’m extending the transpiler to support an “8-bit Basic” layer.

Betraying my own principles just this once, and purely to add a bit more visual flavor, I’m also experimenting with a custom virtual VDP. It’s loosely inspired by the Texas Instruments TMS9918, very much from a different era, of course, but it’s strictly for entertainment and aesthetic purposes. Just something fun for users who want a bit more visual feedback. And since all virtual peripherals in the Workbench are fully configurable, each user can decide which ones to add and wire them exactly the way they prefer, whether they want to stay historically strict with a few LEDs or jump ahead a bit and attach an LCD. The final choice is entirely theirs.

My biggest ongoing battle with the transpiler, unsurprisingly, is ROM space. Some things never change. I keep an entire page reserved for core helpers such as logical operations and PRINT routines. Even though the transpiler only injects those helpers when the user’s program actually requires them, I still find myself fighting for every last byte. Much tighter than I’d prefer, but that’s the nature of generating compact 4‑bit code. Of course, an experienced user can always bypass the transpiler entirely and write raw assembly to squeeze every single nibble out of the machine, but for the high-level layer, it’s a constant puzzle.
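
To make the "inject helpers only when needed" idea concrete, here is a minimal Python sketch. It is not how the QuadBasic transpiler is implemented: the helper names, their byte sizes and their dependencies are all invented for illustration. The only assumption taken from the hardware is that one ROM page is 256 bytes.

```python
PAGE_SIZE = 256  # assumed size of one ROM page in bytes

# name -> (size in bytes, helpers it calls); every entry here is hypothetical
HELPERS = {
    "AND4": (22, ()),
    "OR4": (20, ()),
    "XOR4": (12, ("AND4", "OR4")),
    "PRINT_DIGIT": (30, ()),
    "PRINT_NUM": (40, ("PRINT_DIGIT",)),
}


def select_helpers(used):
    """Return the helpers a program needs, including transitive dependencies."""
    seen = set()
    pending = list(used)
    while pending:
        name = pending.pop()
        if name in seen:
            continue
        seen.add(name)
        pending.extend(HELPERS[name][1])
    return sorted(seen)


def page_usage(names):
    """Total bytes the selected helpers occupy in the reserved page."""
    return sum(HELPERS[name][0] for name in names)


# A program that only uses XOR pulls in AND4 and OR4, but no PRINT routines.
needed = select_helpers({"XOR4"})
used_bytes = page_usage(needed)
free_bytes = PAGE_SIZE - used_bytes
print(needed, used_bytes, free_bytes)
```

A check like `page_usage(...) <= PAGE_SIZE` at compile time would let the transpiler report an overflow of the helper page before any code is emitted.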

In short, trying to write a compiler that produces elegant, efficient machine code gives you a renewed respect for the constraints you all worked under back in the day. And while there are days when I catch myself wondering why on earth I decided to take on something this demanding, the truth is that the whole project is turning out to be both demanding and genuinely enjoyable.

Cheers,
Jaime
 
Hello everyone,

I've been testing how the hardware handles stack overflows during the unwinding phase.

I'm planning to include the asm below as a built-in educational example:

Code:
ORG 0000H

; =============================================================================
; MAIN PROGRAM
; =============================================================================
MAIN:
        ; --- STEP 1: INITIAL JUMP ---
        ; JMS pushes the return address (the line "SET_A_TO_5") into the stack.
        ; Hardware Register Status:
        ; Slot 1: [Address of SET_A_TO_5] -> Occupied
        ; Slot 2: Empty
        ; Slot 3: Empty
        JMS SUB_LEVEL_1        

        ; CRITICAL POINT:
        ; This instruction is NEVER executed because its return pointer stored
        ; inside Slot 1 is physically overwritten during the 4th nested JMS.
SET_A_TO_5:
        LDM 5                  
        XCH R0                  ; R0 = 5 (Expected final value, but lost)

HALT_LOOP:
        JUN HALT_LOOP           ; Infinite halt loop

; =============================================================================
; SUBROUTINES
; =============================================================================

SUB_LEVEL_1:
        ; --- STEP 2: NESTED JUMP ---
        ; JMS pushes the return address (the line "BBL 1" below) into the stack.
        ; Hardware Register Status:
        ; Slot 1: [Address of SET_A_TO_5]
        ; Slot 2: [Address of SUB_LEVEL_1 BBL 1] -> Occupied
        ; Slot 3: Empty
        JMS SUB_LEVEL_2        
       
        ; --- STEP 7: THE TRAP CLOSES ---
        ; Execution lands here after SUB_LEVEL_2's BBL finishes.
        ; This BBL pops the stack. The circular hardware pointer moves to Slot 1.
        ; Expected: [Address of SET_A_TO_5].
        ; Actual: [Address of SUB_LEVEL_3 BBL 3] (Overwritten by the overflow).
        ; Result: The CPU jumps, returning to the exit line of SUB_LEVEL_3!
        ; NOTE FOR BEGINNERS: The '1' is data loaded into the Accumulator,
        ; NOT the return destination. Return destination comes ONLY from the Stack.
        BBL 1                  

SUB_LEVEL_2:
        ; --- STEP 3: NESTED JUMP (STACK FULL) ---
        ; JMS pushes the return address (the line "BBL 2" below) into the stack.
        ; Hardware Register Status:
        ; Slot 1: [Address of SET_A_TO_5]
        ; Slot 2: [Address of SUB_LEVEL_1 BBL 1]
        ; Slot 3: [Address of SUB_LEVEL_2 BBL 2] -> Occupied (STACK IS NOW FULL)
        JMS SUB_LEVEL_3        
       
        ; --- STEP 6: UNWINDING CONTINUES ---
        ; Execution lands here after SUB_LEVEL_3's BBL finishes.
        ; This BBL pops the stack. The hardware pointer moves to Slot 2.
        ; Result: It reads [Address of SUB_LEVEL_1 BBL 1] and jumps there.
        ; NOTE FOR BEGINNERS: The '2' is data loaded into the Accumulator.
        BBL 2                  

SUB_LEVEL_3:
        ; --- STEP 4: THE STACK OVERFLOW POINT ---
        ; The 4004 uses a 2-bit pointer over 4 address registers (1 PC + 3 Stack slots).
        ; This 4th nested call wraps around circularly and overwrites the OLDEST slot.
        ; Hardware Register Status (Corrupted):
        ; Slot 1: [Address of SUB_LEVEL_3 BBL 3] -> OVERWRITTEN! MAIN IS LOST.
        ; Slot 2: [Address of SUB_LEVEL_1 BBL 1]
        ; Slot 3: [Address of SUB_LEVEL_2 BBL 2]
        JMS SUB_LEVEL_4        
       
        ; --- STEP 5: FIRST REBOUND ---
        ; Execution lands here immediately after SUB_LEVEL_4's BBL executes.
        ; This BBL pops the stack. The hardware pointer moves to Slot 3.
        ; Result: It reads [Address of SUB_LEVEL_2 BBL 2] and jumps there.
        ; NOTE FOR BEGINNERS: The '3' is data loaded into the Accumulator.
        BBL 3                  

SUB_LEVEL_4:
        ; --- STEP 4b: DETAILED ANALYSIS OF THE LAST BBL ---
        ; This is the deepest execution point. When this BBL is reached:
        ; 1. The CPU executes a 'POP' operation to find out where to return.
        ; 2. It reads the last register targeted by the circular pointer (Slot 1).
        ; 3. Due to the overflow, Slot 1 contains [Address of SUB_LEVEL_3 BBL 3].
        ; Result: The CPU executes a corrupted return, jumping directly to the
        ; exit line of SUB_LEVEL_3, bypassing MAIN entirely.
        ;
        ; IMPORTANT DISAMBIGUATION FOR NOVICES:
        ; The literal value '4' is simply loaded into the Accumulator (ACC=4)
        ; as a return value. It has absolutely NO influence on where the CPU
        ; jumps. The destination is forced blindly by the broken Hardware Stack.
        BBL 4                  

; =============================================================================
; THE DESTRUCTIVE INFINITE LOOP FLOW
; =============================================================================
; Because the exit path to MAIN was permanently erased from the chip,
; the CPU is trapped in a circular chain reaction, jumping exclusively between
; the exit lines of the subroutines in this exact repeating sequence:
;
; SUB_LEVEL_4 BBL 4  ==> Jumps to SUB_LEVEL_3 BBL 3
; SUB_LEVEL_3 BBL 3  ==> Jumps to SUB_LEVEL_2 BBL 2
; SUB_LEVEL_2 BBL 2  ==> Jumps to SUB_LEVEL_1 BBL 1
; SUB_LEVEL_1 BBL 1  ==> Jumps back to SUB_LEVEL_3 BBL 3 ... (Infinite Loop)
; =============================================================================



Here is a quick video showing the actual execution loop:



Looking at the execution loop, do the overall behavior and the unwinding path seem correct to you?

I also have a couple of questions regarding the UI representation (top-left window):

  • The "NEST" counter: This is just a software helper showing pending returns. It doesn't map to any real internal register. Do you think keeping this helps or does it just confuse newcomers learning the architecture?

  • The Stack layout: I discarded a modern "L1, L2, L3" list because it implies a push/pop shifting mechanism that the real silicon doesn't do. Instead, I went with a horizontal register array and implemented a visual pointer that follows the real 2-bit counter:

Captura de pantalla 2026-05-19 003349.png

Does this approach feel right and make sense from an educational standpoint to you?


Thanks,
Jaime
 
The easiest way to think about the stack of the 4004 is to do what the processor does. One of the four registers is the instruction pointer. The other three are then the return pointers. If your software emulation just keeps a pointer into the four 12-bit registers as the current instruction pointer, it will work exactly the same as a real 4004. If you have a separate value as the current instruction pointer, you'll always be wondering if you are using it right. This means that each instruction fetch goes indirectly through one of the four locations. It won't be the most efficient instruction pointer you could have for execution, but it will always work correctly.
As for timing: the 4004 was intended to control hardware, and that includes timing. Keeping track of how many clock cycles are used for each instruction means that you can stimulate all kinds of simulated actions.
Current processors are so speedy that the overhead of emulating real hardware is a tiny price to pay. If you want the simulation to run at the real speed, you already have a handle on it if each instruction keeps track of how many clocks it would take.
Dwight
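
The unified model Dwight describes can be written down directly. The Python sketch below is an illustration of that description, not a verified model of the silicon: the class name `AddressRegisters` and the call and return addresses are hypothetical. Whichever of the four registers the 2-bit pointer selects acts as the program counter, so a JMS leaves the return address in the old slot and a BBL simply steps the pointer back.

```python
class AddressRegisters:
    """Four 12-bit registers indexed by a 2-bit pointer; the selected one is the PC."""

    def __init__(self, start_pointer=0):
        self.regs = [0, 0, 0, 0]
        self.ptr = start_pointer & 0b11

    @property
    def pc(self):
        return self.regs[self.ptr]

    @pc.setter
    def pc(self, value):
        self.regs[self.ptr] = value & 0xFFF

    def jms(self, target, return_addr):
        self.pc = return_addr               # old slot keeps the return address
        self.ptr = (self.ptr + 1) & 0b11    # 2-bit pointer wraps after 3
        self.pc = target                    # new slot becomes the PC

    def bbl(self):
        self.ptr = (self.ptr - 1) & 0b11
        return self.pc                      # execution resumes here


ar = AddressRegisters()
# Four nested calls as (target, return address); all addresses hypothetical.
for target, ret in [(0x100, 0x002), (0x200, 0x102), (0x300, 0x202), (0x400, 0x302)]:
    ar.jms(target, ret)

after_calls = list(ar.regs)   # slot 0 no longer holds 0x002
ar.pc = 0x401                 # PC has stepped past the one-byte BBL at 0x400
returns = [ar.bbl() for _ in range(4)]
print([hex(r) for r in after_calls], [hex(r) for r in returns])
```

In this model the first three returns unwind normally, and the fourth one resumes at the address just past the innermost BBL, because the slot that once held the main return address was reused as the program counter by the fourth call.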
 
Thanks for the four-register perspective, one 12-bit address for fetch, three for returns, with fetch going through whichever slot is active in your unified model.
That’s the picture I had in mind for the overflow trace; in the workbench we still expose fetch as PC plus the three return latches below.

In the screenshot from my previous post, you may have only seen the return addresses (the three hex values, NEST 3, and the yellow marker). I’m attaching the full CPU WATCH window here: the PC sits above that row.

returnadd.png

The system uses the PC for the current fetch address, while the three boxes below are the hardware stack slots. The arrow simply marks where the next JMS will write on that three-slot circular ring, not which address is currently executing.

Separating the PC from the three stack slots in the UI, rather than drawing four equal slots with a moving fetch arrow, was a deliberate display choice. Certainly not a claim about how one would lay out the physical die.
I’m curious whether you think this visual split is clear enough, or if it risks hiding the fourth 12-bit address when people only see a partial view.

On timing: noted. Implementing cycle-accurate timing is currently out of my reach for this phase, but it is definitely something I would love to tackle in a future version once the core system is fully stable.

If anything in that overflow loop looks incorrect compared to the behavior of a real part, please let me know.

Cheers,
Jaime
 
Years ago, I worked at Intel and was responsible for keeping up the manufacturing of the UPP PROM programmer, which uses a 4040 processor. While there I created hardware to test the 4040 PROMs used for the UPP. The most common problem was capacitor legs put in the wrong holes (on the bus).
I have one of those :)

I bought it off a scrapper who unfortunately already scrapped the thick aluminum case top.
 
I'm deeply envious! Here in Spain, vintage Intel treasures like that are almost mythical.

Nothing beats touching real silicon history.

Of course, what I have can't even compete with that, but what do you think of these "high-density" chips I dug up?

foto.png
 
Hi
Just a quick update. I've cleaned them up and moved all three manuals (Hardware, QuadBasic, and VDP guide) to a dedicated GitHub repository so they are much easier to read and track.

You can check them out here:
https://github.com/Quadium4004/Quadium-4004-Workbench

If you see any mistakes or typos, please let me know. And of course, I'm always up for discussing any of the architectural choices or constraints in there right here in the thread.

Jaime
 
As long as a 4th JMS actually overwrites the stack. That is actually used by the assembler Tom Pittman wrote. He used it as a way to change the flow of the program from the first pass to the second pass. Not good programming practice, but it did squeeze the assembler into four 1702s.
Dwight
 
Hi Dwight,

Thank you for the guidance. I’ve been giving this a lot of thought. I always try to strike a balance between making the tool educational and keeping it accurate, but you're absolutely right: reality is reality, and the silicon behaves the way it behaves. Even if it looks a bit unusual at first glance, it has to be represented correctly. I have already reorganized how those four registers are displayed.

stack00.png

Your note about Pittman using a 4th JMS on purpose is pretty curious. I’d never have thought to use the stack that way.

Anyway, I ran a stack test in the workbench: four nested JMS, where the fourth overwrites the oldest return address. Stepping through it, the final BBL lands on MARK instead of the main return, so we end up with ACC = 15 rather than the 0 that LDM 0 would have loaded.

Code:
        ORG 0000H

        JMS DEPTH1
        LDM 0
        JUN IDLE

DEPTH1:
        JMS DEPTH2
        BBL 1

DEPTH2:
        JMS DEPTH3
        BBL 2

DEPTH3:
        JMS LEAF               
        BBL 3

LEAF:
        BBL 4

MARK:
        LDM 15                 

IDLE:
        JUN IDLE

        END

Code:
Addr    Code
0x000    JMS DEPTH1
0x002    LDM 0 ← main return
0x003    JUN IDLE
0x005    DEPTH1: JMS DEPTH2
0x007    BBL 1
0x008    DEPTH2: JMS DEPTH3
0x00A    BBL 2
0x00B    DEPTH3: JMS LEAF ← 4th JMS
0x00D    BBL 3
0x00E    LEAF: BBL 4
0x00F    MARK: LDM 15 ← success
0x010    IDLE: JUN IDLE


Thanks again for your explanations. Without your insights, I probably would have ended up with a rather lazy implementation of the stack.

Cheers,
Jaime
 
The reason he did this is that he'd reached the end of the first pass but also needed to unroll some of the state and I/O operations before starting the second pass. So, three levels of stack depth had to be unrolled. Normal software would set a flag and then do the returns. At the lowest level one would check the flag, at the additional cost of checking it each time it returned to the beginning level. Instead, the next sequence would start after all the other operations had completed. I'll admit, it is not a good practice.
To add to that, the later 4040, which is otherwise code compatible, would fail without code modifications. It would not clean the stack as needed. I'd thought about how one might change things. There were quite a few locations in ROM that looked to be unused, where one could add some patch code. I thought about adding more JMS instructions at the transition JMS, but then the code wouldn't run on the 4004 anymore. It then occurred to me that the best place to put it would be at the original beginning of all the code. I could fill the stack with the desired address first. It would then still run on the 4004.
Technically the 4004 does not do a power-up reset of the stack pointer and could start at any location, since all locations are valid. That means one less piece of unneeded hardware. A 2-bit register will always start with a valid value, even if not the same one as before any reset.
Dwight
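
Both points in this post can be checked against the DEPTH1/DEPTH2/DEPTH3/LEAF listing earlier in the thread with a small Python sketch. It is a hedged illustration only: the `run` function and the tuple encoding of the program are made up for this example, the 4040 is assumed to have eight address registers (program counter plus a seven-level stack), and all address registers are assumed to read zero after reset.

```python
# The test program from the listing, keyed by address: (mnemonic, operand, size).
PROGRAM = {
    0x000: ("JMS", 0x005, 2),
    0x002: ("LDM", 0, 1),       # main return
    0x003: ("JUN", 0x010, 2),
    0x005: ("JMS", 0x008, 2),   # DEPTH1
    0x007: ("BBL", 1, 1),
    0x008: ("JMS", 0x00B, 2),   # DEPTH2
    0x00A: ("BBL", 2, 1),
    0x00B: ("JMS", 0x00E, 2),   # DEPTH3, the 4th JMS
    0x00D: ("BBL", 3, 1),
    0x00E: ("BBL", 4, 1),       # LEAF
    0x00F: ("LDM", 15, 1),      # MARK
    0x010: ("JUN", 0x010, 2),   # IDLE
}
IDLE = 0x010


def run(num_regs, start_ptr=0, max_steps=100):
    """Run until IDLE and return ACC; the selected register is the PC."""
    regs = [0] * num_regs           # assumption: all registers zero after reset
    ptr = start_ptr % num_regs      # the pointer may start anywhere
    acc = 0
    for _ in range(max_steps):
        pc = regs[ptr]
        if pc == IDLE:
            return acc
        op, arg, size = PROGRAM[pc]
        regs[ptr] = (pc + size) & 0xFFF     # advance past this instruction
        if op == "LDM":
            acc = arg
        elif op == "JUN":
            regs[ptr] = arg
        elif op == "JMS":
            ptr = (ptr + 1) % num_regs      # old slot keeps the return address
            regs[ptr] = arg
        elif op == "BBL":
            ptr = (ptr - 1) % num_regs
            acc = arg
    raise RuntimeError("did not reach IDLE")


acc_4004 = [run(4, start_ptr=p) for p in range(4)]   # every possible start pointer
acc_8_regs = run(8)
print(acc_4004, acc_8_regs)
```

With four registers the result is the same for every starting pointer value, which matches the remark that no power-up reset of the pointer is needed. With eight registers the fourth JMS no longer wraps, so the same code returns to the main LDM 0, which is the kind of behavior difference that would break a program relying on the overwrite.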
 