
How can it be this fast, assuming it is

alank2

I've finally got the throttling code implemented in the 4004 emulation code. I can run it throttled or unthrottled, and below is unthrottled to see how fast it _can_ run. This is on a Windows 10 box, as a win32 console application. Task Manager reports the CPU at 1.69 GHz because I have my power mode set to balanced. The 4004 code is a single instruction that just jumps to itself, and that jump takes two instruction cycles. I am attaching the code so you can see how it measures and reports this. For the jump instruction, it increments a variable called icycles (for instruction cycles) by two each time it executes: after the first instruction icycles is 2, after the second it is 4, and so on as it runs.
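For readers without the attachment, here is a minimal sketch of the kind of measurement described above. It is not the code from i4004.zip: `step_jump_to_self`, `run_unthrottled` and `emulated_hz` are hypothetical names, and the "emulator" is reduced to the icycles counter. It assumes the standard C `clock()` timer, which is coarse, so only long runs give a meaningful rate.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the emulator state; only the counter matters.
   volatile keeps an optimizing compiler from folding the loop below into
   a single addition, which would make the measured rate meaningless. */
static volatile uint64_t icycles = 0;

/* Hypothetical step: emulate one jump-to-self, a two-cycle instruction. */
static void step_jump_to_self(void)
{
    icycles += 2;
}

/* Run `steps` instructions with no throttling. Returns the number of
   instruction cycles executed and stores elapsed CPU seconds. */
static uint64_t run_unthrottled(uint64_t steps, double *seconds_out)
{
    assert(seconds_out != NULL);
    uint64_t start_cycles = icycles;
    clock_t t0 = clock();
    for (uint64_t i = 0; i < steps; i++)
        step_jump_to_self();
    clock_t t1 = clock();
    *seconds_out = (double)(t1 - t0) / CLOCKS_PER_SEC;
    return icycles - start_cycles;
}

/* Emulated clock rate in Hz; 0 if the run was too short to time. */
static double emulated_hz(uint64_t cycles, double seconds)
{
    return seconds > 0.0 ? (double)cycles / seconds : 0.0;
}
```

Calling `run_unthrottled` with a large step count and passing its result to `emulated_hz` gives the unthrottled figure. In a sketch this small, an optimizing compiler could otherwise collapse the whole loop, so that is one thing worth ruling out when a result looks too fast.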

When compiled for DOS and run on my Toshiba T1100 Plus (8086 @ 7.16 MHz), the unthrottled speed is about 130000 Hz, so it takes about 55 8086 cycles to emulate one 4004 cycle. I would have liked a bit more performance here; it doesn't compare to the nearly 10:1 that I was able to obtain on an AVR for the 8080. That was obtained, however, by using the AVR's registers to directly hold many of the 8080 internals. Anyway, for DOS on the T1100 Plus we are looking at 55:1. This makes sense to me.

What does not make sense to me is the WIN32 result. See below: over 1 GHz emulated, running on a 1.69 GHz CPU. How can this be possible, assuming there isn't something going on that I am missing? A ratio of 1.7:1 does not make sense. I've pasted some CPU step windows to show that it does have to go through all the instructions as expected. I just don't see how it can do it so quickly.


[Attached screenshots: 1704646111327.png, 1704646117021.png, 1704645880594.png]
 

Attachments

  • i4004.zip
    3.4 KB · Views: 1
Modern CPUs can change clock speed on the fly to meet demand, regardless of your power plan setting. You will need to lock the CPU to a specific speed for an accurate benchmark. The easiest way is to temporarily disable SpeedStep, Cool'n'Quiet, Turbo Boost, Turbo Core, etc. in the BIOS.
 
Could this primarily be a CPU cache issue?

Even if I run the DOS version of my program on a Pentium 166 MHz booted into DOS, it does about 40 MHz emulated speed. It does about the same running WIN32 console under NT 4.0 on it.

How many clock cycles do instructions take if they can load the data they want from the cache?
 
Also, most modern CPUs are not designed to jump around in such short code segments. The CPU has to keep refilling the pipeline.
You also should not be writing such tiny pieces of code in C. The compiler does a poor job of using the registers properly for such short bursts of execution.
It would clearly be several times faster if written in assembly.
Dwight
 
When I ran an 8080 exerciser through my 8080 emulator, I noticed interesting behaviour. An older Intel Core i7 ran the test at an emulated ~1 GHz, which seemed about reasonable. But a much newer Core i5 ran the same code almost 40x faster, at the same ~1.6 GHz clock rate.

My 8080 emulator was originally written in AVR assembly and later ported to C, using "computed goto" for speed. The 8080 exerciser dynamically generates the instructions at runtime, keeping the code small as well. The whole code ran exclusively from L1 cache and almost without stalls.
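For anyone unfamiliar with "computed goto", here is a minimal sketch of that dispatch style. It is not the emulator described above: the three opcodes and the `run_program` function are hypothetical, chosen only to show the technique. It relies on the "labels as values" extension (`&&label` and `goto *ptr`), which GCC and Clang support but which is not standard C, so it will not build with every Windows compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy opcodes; these are not real 8080 encodings. */
enum { OP_HALT = 0, OP_INC = 1, OP_DEC = 2, OP_COUNT };

/* Interpret `len` bytes of `code` and return the accumulator.
   Each handler jumps straight to the next handler through the table,
   instead of returning to a central switch statement. */
static int run_program(const unsigned char *code, size_t len)
{
    /* GCC/Clang extension: a table of label addresses, indexed by opcode. */
    static void *const dispatch[OP_COUNT] = { &&do_halt, &&do_inc, &&do_dec };
    int acc = 0;
    size_t pc = 0;

    assert(code != NULL);

/* Fetch the next opcode and jump to its handler. Stops cleanly if the
   program runs off the end or contains an unknown opcode. */
#define NEXT() do { \
        if (pc >= len || code[pc] >= OP_COUNT) return acc; \
        goto *dispatch[code[pc++]]; \
    } while (0)

    NEXT();

do_inc:
    acc++;
    NEXT();

do_dec:
    acc--;
    NEXT();

do_halt:
    return acc;

#undef NEXT
}
```

For example, the program `{ OP_INC, OP_INC, OP_DEC, OP_HALT }` leaves the accumulator at 1. Because every handler ends with its own indirect jump, the branch predictor sees one jump per handler rather than a single shared one, which is the usual argument for this layout. How much it helps depends on the compiler and the CPU.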

You can't easily predict the performance on modern CPUs. Performance may differ in major ways between generations. Your CPU emulation is single-threaded, which probably causes the CPU to step up the frequency of a single core at the cost of the other cores, possibly only for short bursts depending on the heat budget. Treat the task manager's number as unreliable and rely on performance counters instead.

If you want to match frequencies (not a good idea), force a specific host frequency and disable all boosting options. Make sure that you don't accidentally throttle your executing core (again, performance counters should note whether it happened).
 
8086 instructions take a relatively large number of cycles: at least 4 cycles for any memory access (and you must include opcode fetches there too), and dozens of cycles for a multiply.

On a 486 most instructions will run in 1-3 cycles. On a Pentium MMX you have branch prediction and can also issue two instructions simultaneously. You say the 166 MHz CPU was able to emulate at 40 MHz, so now you're at about 4.15 host cycles per emulated cycle, compared to 55 on the 8086. You're also running 32-bit code, whereas the 8086 would've been running 16-bit code, which is likely a bit uglier.
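The ratios quoted in this thread are just host clock divided by emulated clock. A small sketch of that arithmetic, using the figures posted above: the function and table names are hypothetical, and the Windows 10 row assumes exactly 1 GHz emulated, since the original post only says "over 1 GHz".

```c
#include <assert.h>
#include <stdio.h>

/* Host clock cycles spent per emulated cycle. */
static double host_cycles_per_emulated_cycle(double host_hz, double emulated_hz)
{
    assert(emulated_hz > 0.0);
    return host_hz / emulated_hz;
}

struct measurement {
    const char *host;
    double host_hz;      /* nominal host clock as reported in the thread */
    double emulated_hz;  /* unthrottled emulated speed from the thread   */
};

/* Figures as posted; the last emulated rate is an assumed lower bound. */
static const struct measurement thread_figures[] = {
    { "8086 @ 7.16 MHz (DOS)",     7.16e6, 130e3 },
    { "Pentium 166 MHz (DOS/NT4)", 166e6,  40e6  },
    { "Windows 10 @ 1.69 GHz",     1.69e9, 1.0e9 },
};

/* Print one line per host with its cycles-per-emulated-cycle ratio. */
static void print_ratios(void)
{
    size_t n = sizeof thread_figures / sizeof thread_figures[0];
    for (size_t i = 0; i < n; i++) {
        const struct measurement *m = &thread_figures[i];
        printf("%-28s %6.2f : 1\n", m->host,
               host_cycles_per_emulated_cycle(m->host_hz, m->emulated_hz));
    }
}
```

The Windows 10 ratio is only as good as the 1.69 GHz figure. If the core was actually boosting above that while the test ran, the real ratio would be higher than the table suggests.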

Modern CPUs convert x86 instructions into micro-ops and can issue several micro-ops at the same time. I'd recommend checking Agner Fog's website if you want to go deep into how modern CPUs achieve high performance: https://www.agner.org/optimize/
 