
Benchmarking design and questions

Trixter
Veteran Member · Joined Aug 31, 2006 · Chicagoland, Illinois, USA
I’ve got an interesting design problem, and was hoping those familiar with x86 assembler and PC architecture could help me work out a few things; I apologize for the lengthy background, but it’s necessary to understand the problem I’m wrestling with:

Background: One problem that has cropped up recently in this hobby is the lack of a way to properly identify and benchmark systems — not only “real” machines (i.e. true IBM PC/XT, AT, etc.) but, more so, unmarked clones. There is also a need for something like this among emulator writers, so that they can attempt to get their code cycle-exact, and among regular people who, for example, just want to play games at the right speed in DOSBox.

Everyone is familiar with the old Norton SI and Landmark CPU Speed benchmarks, but they are horribly misleading and generally incomplete test suites. Other benchmarks, such as C&T’s MIPS.COM, are much better, but they aren’t realtime (a test takes 30 seconds) and offer only three machine classes to compare against. So, I have volunteered to write a benchmark that would meet the above needs. The goals would be relatively simple:

  1. Take a performance measurement of a machine and store it locally in a tiny database that accompanies the program
  2. Allow comparison of the current machine’s metric to the database, and bring up close matches for comparison
  3. Perform the measurement/comparison continuously, so that running it inside an emulator would allow you to immediately see the results of tuning the emulator speed. (For example, this would allow people to “dial” the speed of the emulator to match a target machine.)

Now the problems:

I’m having trouble coming up with a decent metric and/or way of profiling a machine that not only works on ANY PC (i.e. even a PC/XT, where there is no RTC or RDTSC available, only the 8253) but also works as high up as, say, a Pentium @ 166 MHz (but not much higher, as there is no target audience for this benchmark above that platform).

The basic idea I had was to run through every single 808x-compatible instruction (except POP CS, which would hang a 286 or later, and AAD/AAM with a custom divisor, which the NEC V20/V30 handle differently) and time it; then perform some memory moves/fills in system RAM, then the same in video adapter RAM; and then print out the closest matches in the database for all three measurements. Optionally, also output some sort of combined score (like a “fingerprint” for the machine) so that one generic clone can be compared to other generic clones and/or to known machine performance profiles.

I was planning on using the 8253 at full resolution to perform the timing, using Abrash’s Zen timer code, which I am very familiar with. The problem with this method, as far as I can tell, is that once I hit the 486 and later, L1 caching becomes a problem – not because caching is an “unfair” speed boost (if anything, I definitely WANT caching to affect speed as a true test of how fast a system is), but rather because of how small the test suite is -- it would fit entirely in cache and, coupled with pipelining on the Pentium and later, would execute faster than the 1.193 MHz 8253 would be able to detect! I.e. the entire test suite could execute in a single tick of the 8253.
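To make the resolution worry concrete: the 8253 counts at 1.193182 MHz, so one tick is about 0.838 µs, and a 166 MHz Pentium gets roughly 139 clock cycles per tick. A sketch of the arithmetic in Python (a simplified model, not the Zen timer itself; the counter math assumes no full wraparound between the two reads of the 16-bit down-counter):

```python
PIT_HZ = 1_193_182  # 8253/8254 input clock frequency

def elapsed_ticks(start, stop, modulus=65536):
    """Ticks elapsed between two reads of a 16-bit down-counter,
    assuming less than one full wraparound between the reads."""
    return (start - stop) % modulus

def ticks_to_us(ticks):
    """Convert PIT ticks to microseconds."""
    return ticks * 1_000_000 / PIT_HZ

# One tick of timing resolution, in microseconds:
print(round(ticks_to_us(1), 3))        # → 0.838

# CPU clock cycles available per PIT tick on a Pentium-166:
print(round(166_000_000 / PIT_HZ))     # → 139
```

At ~139 cycles per tick, a small cached-and-pipelined loop really can finish inside a single tick, which is exactly the fear described above.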

Questions:

  • Is this a reasonable fear, or am I overestimating how much pipelining and cache will speed things up? (Remember, there is no target audience for this benchmark beyond a Pentium)
  • Should I look into some sort of alternate timing method, such as running the test suite multiple times in a certain time period? If so, what would a reasonable time boundary be? (No more than a full second, I hope... Remember, one of the primary goals of the benchmark is to run “realtime”, such that adjusting an emulator, or popping a “turbo” button on/off, would be immediately noticeable; I'm also worried about having interrupts turned off for a long period of time.)
  • Has this problem already been solved by a benchmark utility I am not yet aware of?
  • Is cutting the benchmark off at the Pentium reasonable, or will people be strangely compelled to benchmark their quad-core Xeon against an IBM PC/XT? (i.e. should I even worry about machines above the Pentium?)
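One way to frame the second question: instead of timing a single pass, count how many passes complete inside a fixed timer window, so the one-tick resolution error shrinks as the window grows. A hedged Python sketch (the clock is stubbed out here; on real hardware it would be the 8253 count or the BIOS tick):

```python
def passes_per_window(run_pass, now_ticks, window_ticks):
    """Count how many test-suite passes complete within a fixed window.

    run_pass:  executes one pass of the test suite.
    now_ticks: returns the current timer count (monotonic here).
    """
    deadline = now_ticks() + window_ticks
    passes = 0
    while now_ticks() < deadline:
        run_pass()
        passes += 1
    return passes

# Simulated clock: each pass "costs" 7 ticks on this fake machine.
t = [0]
def fake_now(): return t[0]
def fake_pass(): t[0] += 7
print(passes_per_window(fake_pass, fake_now, 100))  # → 15
```

A faster machine completes more passes in the same window, so the count itself becomes the speed metric, and the window length trades accuracy against the "realtime" responsiveness goal.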

Thanks for reading this far :) Any and all thoughts regarding this are appreciated.
 
It's a tough problem to solve. Here are some ideas.

There are ways to determine what class of processor you are running on. Use this to select code that gets you in the ballpark of what you are trying to measure.

The 80486 has cache control instructions (INVD and WBINVD) that can be used to invalidate the cache, though they flush the whole cache rather than individual lines. Even if you don't want to mess with specific instructions like that, you can design the benchmark to touch more than 8K of instructions. The 486's L1 cache is pretty small: 8 KB on most parts (16 KB on the DX4), and it's a unified cache rather than separate instruction and data caches.
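Sizing the "touch more than 8K of instructions" idea is simple arithmetic — a sketch, where the 16 KB cache size, the average instruction length, and the safety margin are all assumed values:

```python
def unroll_count(cache_bytes, avg_instr_bytes, safety=2):
    """Instructions needed so an unrolled test body overflows the
    L1 cache, with a safety multiplier. avg_instr_bytes is a rough
    estimate of mean encoded instruction length."""
    return safety * cache_bytes // avg_instr_bytes

# Worst case assumed here: 16 KB L1, ~3-byte average instruction,
# 2x safety margin so the body is twice the cache size.
print(unroll_count(16 * 1024, 3))  # → 10922
```

So on the order of ten thousand unrolled instructions would guarantee the body cannot sit entirely in a 486-class L1.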

L2 caches are harder to deal with because they are generally much larger - 64K to 256K is common. You will have to explore the cache invalidate instructions in order to clear these properly.

Another fun thought - even some 386s have L2 caches. My primary development machine is an AMD 386-40 with a 128K external L2 cache. The 386 itself has no cache management instructions, so you might have to do port I/O to an external cache controller.

Caching is definitely a problem. Normal memory runs from 210 ns down to 60 ns for machines of that era. L2 cache memory generally responds in about 10 ns, and the on-chip cache of a 486 is going to respond in 1 or 2 cycles.
 
The thing is, I don't necessarily want to screw with the cache -- because that's not representative of normal use of the machine. 99.9% of all programs run on a 486 don't explicitly disable the cache, so testing with cache disabled is not a fair test of the speed of the machine.
 
If you are trying to characterize the machine to identify what it is, you almost certainly need to try some runs with the cache disabled/flushed.

The broader goal of characterizing a machine is just too hard. You can come up with a set of benchmarks that quantify memory movement performance, floating point performance, raw cycle speed, etc., but how that relates to the performance of an application in an emulator depends on what the application does (ie, the instruction mix).

Remember, a test measures what a test measures. Nothing more.
 
It's a tough problem to solve. Here are some ideas.

There are ways to determine what class of processor you are running on. Use this to select code that gets you in the ballpark of what you are trying to measure...

I'm with Mike on this one - Jeff Prosise had an assembler routine in his "DOS x Techniques & Utilities" that figures out the CPU class (Pentium and below). It's the most accurate I have seen at determining CPU MHz, and it even makes a good estimate of the cache present.
 
If you are trying to characterize the machine to identify what it is, you almost certainly need to try some runs with the cache disabled/flushed.

To determine the effective speed of the machine, or identify the processor class and speed? I agree with the latter, but not the former, since that's simply not representative of the normal usage of the machine. My 386-40 has 64K L2 cache; yours has 128K L2 cache. If I disable the cache while benchmarking, our boards would appear to have identical performance which is obviously not true.

The broader goal of characterizing a machine is just too hard. You can come up with a set of benchmarks that quantify memory movement performance, floating point performance, raw cycle speed, etc., but how that relates to the performance of an application in an emulator depends on what the application does (ie, the instruction mix).

I disagree that it's too hard, but then again I haven't written a line of code yet ;-)
 
I'm with Mike on this one - Jeff Prosise had an assembler routine in his "DOS x Techniques & Utilities" that figures out the CPU class (Pentium and below). It's the most accurate I have seen at determining CPU MHz, and it even makes a good estimate of the cache present.

I would love to see this routine, if you have it on disk somewhere. I have my own routines, augmented with information from The Undocumented PC, but I'd love to compare notes.
 
I would love to see this routine, if you have it on disk somewhere. I have my own routines, augmented with information from The Undocumented PC, but I'd love to compare notes.

Whoops, you are more correct than me. I was thinking of the Undocumented PC routines. Jeff did have some information in his books too, so I'll see what I can dig up.
 
BTW looking at the Undocumented PC routines, they are quite accurate, yes? Guess how they were written:

...wait for it...

He ran the same routine on several machine classes to come up with values that produced accurate results for that processor class. In other words: He benchmarked each CPU ;) and the magic values are embedded in the source code.

I mention this because I think a few people would consider that "cheating" :)
 