
RFC on new benchmarking tool

Trixter · Veteran Member
Joined: Aug 31, 2006 · Messages: 7,478 · Location: Chicagoland, Illinois, USA
I've decided to spend a little time working on my benchmarking tool (the one I keep threatening to write every so often) and would like to know if anyone has any thoughts or criticism on the idea. The idea behind the benchmark is something that can be run on any 8088-80486 PC, take a "fingerprint" of its performance, and then store the results in a simple database. The current machine can then be compared to any other in the database. Since the calculations run in realtime, you can also use the benchmark to adjust DOSBox as it is running to "dial in" a certain machine's speed.
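A rough sketch of what that database comparison might look like; the machine names, metric names, and numbers below are all invented for illustration, not the tool's actual format:

```python
# Hypothetical sketch: comparing a machine "fingerprint" against a
# small database of known systems.  Each fingerprint maps a metric
# name to a measured score (higher = faster, 5150 = 1.0).
DATABASE = {
    "IBM PC 5150 (4.77 MHz 8088)": {"cpu": 1.0, "mem": 1.0, "video": 1.0},
    "IBM AT 5170 (8 MHz 80286)":   {"cpu": 4.4, "mem": 3.1, "video": 1.6},
    "386DX-33":                    {"cpu": 19.0, "mem": 8.5, "video": 3.2},
}

def closest_machine(sample, database):
    """Return the database entry whose metrics best match the sample,
    using ratio-based distance so no single metric dominates."""
    def distance(ref):
        return sum(abs(sample[k] / ref[k] - 1.0) for k in sample)
    return min(database, key=lambda name: distance(database[name]))

sample = {"cpu": 4.2, "mem": 3.0, "video": 1.5}
print(closest_machine(sample, DATABASE))  # → IBM AT 5170 (8 MHz 80286)
```

The same nearest-match lookup is what would let you nudge DOSBox's cycle count up or down until the live fingerprint lands on the target machine's entry.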

Here's a draft of the documentation:

http://www.oldskool.org/pc/benchmark

If anyone could read the above and offer criticism, comments, etc. I'd appreciate it. I'd like to get the design correct before coding it.
 
Sounds like a very nice project. I've always wondered how to twiddle dosbox to play Hi-Octane at the right speed.

When I started reading, I wanted to ask you about alternatives, but reading on I realize you have a pretty specific goal for your project and have already investigated existing tools.

There is only one concern I have: you mention "interactive reports". Isn't there a slight risk that they will affect results? If you accept submissions of "fingerprints", perhaps you should make sure that they were run in some kind of standard mode, and perhaps also make sure that TSR programs and what-not are not running. (I'm slightly out of my depth here; I haven't worked much with DOS in a long while.)
 
I think you have the big three metrics covered:

  • CPU instructions
  • CPU to main memory transfers
  • CPU to video memory transfers

Disk I/O was always interesting to me, but there are so many variations now that comparisons are meaningless unless you spec every part out. And even then, so what? But perhaps sequential DOS reads/writes and random DOS reads/writes would make sense when running on non-emulated hardware.

(I ask because this is how I test the throughput of my various adapters. I'm always looking for the fastest solution for parallel port based adapters so I had to write something that used DOS calls, not BIOS calls.)
 
When I started reading, I wanted to ask you about alternatives, but reading on I realize you have a pretty specific goal for your project and have already investigated existing tools.

The only benchmark that ever impressed me was MIPS.COM which you can get in either xtfiles.rar or tanfiles.rar in ftp.oldskool.org/pub/misc. It was the seed that has been growing in my head for a while.

There is only one concern I have: you mention "interactive reports". Isn't there a slight risk that they will affect results? If you accept submissions of "fingerprints", perhaps you should make sure that they were run in some kind of standard mode, and perhaps also make sure that TSR programs and what-not are not running.

It would take a background running process (formatting a floppy, doing a zmodem transfer, etc.) to really affect results badly, and most people would hopefully not do that when running a benchmark :)

There is a way to guarantee exclusive access to the system while running something to be timed, but the maximum length of time you can do this is 55ms before you lose track of how much time has gone by. And the shorter the sampling period, the less granular the results get on faster and faster machines. So I am not planning on disabling interrupts unless I run into problems.
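For reference, the 55ms figure falls straight out of the PIT arithmetic (a quick sanity check, assuming the standard 8253/8254 input clock):

```python
# The 8253/8254 PIT counts down a 16-bit value at 1,193,182 Hz
# (the 14.31818 MHz crystal divided by 12).  With interrupts off
# you can only measure until the counter rolls over.
PIT_HZ = 14_318_180 / 12                   # ≈ 1,193,182 Hz input clock
TICK_NS = 1e9 / PIT_HZ                     # one PIT tick ≈ 838 ns
MAX_INTERVAL_MS = 65536 / PIT_HZ * 1000    # full 16-bit rollover

print(f"tick = {TICK_NS:.0f} ns, max interval = {MAX_INTERVAL_MS:.1f} ms")
```

So each tick is about 838 ns of resolution, and the counter wraps after roughly 54.9 ms, which is where the 55ms ceiling comes from.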
 
Disk I/O was always interesting to me, but there are so many variations now that comparisons are meaningless unless you spec every part out.

I thought of doing a simple BIOS read sector (to bypass DOS's double-buffering) suite, but the benchmark is definitely slanted towards testing timing-specific issues that affect games. Games load everything into memory and then the hard drive is rarely touched again, and a slow hard drive does not affect how a game plays, so I ditched the idea. I figure if someone really needs to run a DOS application like a database or something as fast as possible nowadays, they'll just do it in an emulator where disk speed is not an issue.

And even then, so what? But perhaps sequential DOS reads/writes and random DOS reads/writes would make sense when running on non-emulated hardware.

(I ask because this is how I test the throughput of my various adapters. I'm always looking for the fastest solution for parallel port based adapters so I had to write something that used DOS calls, not BIOS calls.)

The BUFFERS= line in CONFIG.SYS directly dictates how many 512-byte sector buffers are allocated for DOS read operations. For any device with BIOS support (i.e. hard drives), any read goes into one of those buffers first and is then copied to the user's target buffer. This double-buffering means you can never quite get the "full speed" of the hard drive when doing a DOS read.
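A toy model of that double-copy, just to illustrate why DOS reads can't hit the raw device speed (this is a simplification, not actual DOS internals):

```python
# Each 512-byte sector is read into an internal DOS buffer first,
# then copied to the caller's buffer, so every byte crosses memory
# twice.  We count bytes moved to make the overhead visible.
SECTOR = 512

def dos_read(device_sectors, count):
    """Simulate INT 21h-style reads through an internal sector buffer."""
    user_buffer = bytearray()
    bytes_moved = 0
    for i in range(count):
        internal = device_sectors[i]     # device -> DOS buffer
        user_buffer += internal          # DOS buffer -> caller
        bytes_moved += 2 * SECTOR        # every sector is moved twice
    return bytes(user_buffer), bytes_moved

disk = [bytes([i]) * SECTOR for i in range(4)]
data, moved = dos_read(disk, 4)
print(len(data), moved)  # → 2048 4096
```

Reading 2048 bytes costs 4096 bytes of memory traffic, which is the overhead a direct BIOS (or direct-sector) read would avoid.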

One of my ideas to speed up the 8088flex player was to parse the FAT and figure out every sector a file used, then read the sectors directly, but I'll leave that up to someone else.
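A sketch of that FAT-walking idea, with a fabricated FAT12 chain and made-up geometry, purely to show the technique:

```python
# Walk a file's FAT12 cluster chain to list every sector it occupies,
# so the sectors could then be read directly (bypassing DOS buffering).
# The FAT, cluster size, and data-area start below are all fabricated.
FAT = {2: 3, 3: 4, 4: 7, 7: 0xFFF}   # cluster -> next cluster
SECTORS_PER_CLUSTER = 2
FIRST_DATA_SECTOR = 33               # data area starts here (example value)

def file_sectors(start_cluster):
    sectors, cluster = [], start_cluster
    while cluster < 0xFF8:           # 0xFF8..0xFFF marks end-of-chain in FAT12
        base = FIRST_DATA_SECTOR + (cluster - 2) * SECTORS_PER_CLUSTER
        sectors.extend(range(base, base + SECTORS_PER_CLUSTER))
        cluster = FAT[cluster]
    return sectors

print(file_sectors(2))  # → [33, 34, 35, 36, 37, 38, 43, 44]
```

Note the chain is not necessarily contiguous (cluster 4 jumps to 7 here), which is exactly why you have to parse the FAT rather than just reading sequentially from the first sector.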
 
1. Consider benchmarking new hardware as well. For me, there are already a few good-enough benchmarks for 8088..80486, e.g. CheckIt and Landmark, but they crash on modern hardware, and I would really like to know how much faster today's Xeon boxes are compared to an IBM 5150. Of course, I understand that benchmarking modern CPUs using 8086 instructions has some limitations, but I still hope for some not-quite-useless common benchmark for *ALL* PCs...

2. Consider benchmarking FPU as another test suite. Perhaps along with some emulation, to make it visible how much slower floating-point calculations are without the FPU.

3. Consider benchmarking not-quite-compatible PCs. Eg. I would like to run this benchmark on Atari Portfolio, a palmtop with 40x8 screen, and some other oddities...
 
I like your idea of staying very focused. Trying to extend the benchmark to modern systems is almost impossible, what with multicore CPUs, GPUs with gigaflop performance, special instruction set additions, etc.

As far as pure compute power for single-processor machines, don't we already have standard benchmarks, such as Whetstone (floating point) or Dhrystone (fixed point) or Linpack? Granted they're not GUI, but they've been around for many years.

Trying to provide meaningful benchmarks for peripherals can be quicksand. There's still COREtest lingering around somewhere for hard disks, though I don't know if it was ever reliable.
 
1. Consider benchmarking new hardware as well. For me, there are already a few good-enough benchmarks for 8088..80486, e.g. CheckIt and Landmark, but they crash on modern hardware, and I would really like to know how much faster today's Xeon boxes are compared to an IBM 5150. Of course, I understand that benchmarking modern CPUs using 8086 instructions has some limitations, but I still hope for some not-quite-useless common benchmark for *ALL* PCs...

I promise that I will test and verify operation on my Core i7. The numbers will be quite comical, but I promise it will not crash.

How you run a DOS executable in Windows 7 and/or Vista, however, is your problem :)

BTW, Landmark, CheckIt, etc. weren't that good, and neither was Norton SI. They spent too much time trying to figure out how to estimate a "relative to 5150" value while still taking advantage of 286, 386, etc. instructions, had inaccurate results and/or crashed when the machine got too fast, etc. It is my frustration with those benchmarks that partially prompted me to start my own.

2. Consider benchmarking FPU as another test suite. Perhaps along with some emulation, to make it visible how much slower floating-point calculations are without the FPU.

I thought about that, but there are hardly any DOS games (the impetus of the benchmark) that use floating point. Also, floating-point benchmarks already exist (whetstone, etc.). Finally, the only emulation library I could compare with would be Norbert Juffa's lib (the one I use with my development environment) so I'm not sure how useful the comparison would be.

3. Consider benchmarking not-quite-compatible PCs. Eg. I would like to run this benchmark on Atari Portfolio, a palmtop with 40x8 screen, and some other oddities...

The benchmark will require 80x25 or better to run the main display, but I do plan on including command-line options for taking a system's metrics and storing them in the database using only the console. You could snapshot your Portfolio using command-line options, then take the database to another machine to compare to other systems.

The benchmark will do all timing through the 8253 timer, but since its counter rolls over every 55ms, I will most likely use the BIOS tick count variable as well. If anything is running that screws with the BIOS tick count, the results will not be accurate...

...actually, you've just given me an idea -- I can rely on ONLY the 8253 timer using interrupt-on-terminal-count and that way I can guarantee an accurate run without having to rely on the BIOS tick variable. Awesome, thanks for the idea :)
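A sketch of the one-shot arithmetic that idea implies (the port I/O is omitted since it only works on real hardware or an emulator, and the counter values below are made up):

```python
# Mode 0 (interrupt on terminal count) one-shot timing: load the
# counter with 0xFFFF, run the code under test, then latch and read
# what remains.  Since the counter counts DOWN, elapsed ticks are
# simply start minus latched value -- no BIOS tick variable needed.
PIT_HZ = 14_318_180 / 12   # ≈ 1,193,182 Hz

def elapsed_us(start_count, latched_count):
    """Convert a PIT countdown delta to elapsed microseconds."""
    return (start_count - latched_count) / PIT_HZ * 1e6

# Example: counter latched at 0x8000 means roughly half the ~55 ms
# window has elapsed (about 27.5 ms).
print(f"{elapsed_us(0xFFFF, 0x8000):.1f} us")
```

The tradeoff is the same 55ms ceiling per measurement, but within that window the result depends only on the 8253 itself.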
 
I like your idea of staying very focused. Trying to extend the benchmark to modern systems is almost impossible, what with multicore CPUs, GPUs with gigaflop performance, special instruction set additions, etc.

Right, it wouldn't be practical, and trying to come up with a common number that applies to everything would be difficult if not impossible.

As far as pure compute power for single-processor machines, don't we already have standard benchmarks, such as Whetstone (floating point) or Dhrystone (fixed point) or Linpack? Granted they're not GUI, but they've been around for many years.

This is true, but they can be subjective (ie. subject to the nature of how the compiler assembles the code). I will come up with a synthetic machine instruction opcode test (probably a few hundred lines long) that is restricted to 8086 opcodes (ie. no MOV EAX,EBX) that are also completely forward-compatible (ie. no POP CS).

Trying to provide meaningful benchmarks for peripherals can be quicksand. There's still COREtest lingering around somewhere for hard disks, though I don't know if it was ever reliable.

Yeah, I'm not touching peripherals except for video card memory speed, since it is a definitive factor in gaming performance.
 
Yeah, I'm not touching peripherals except for video card memory speed, since it is a definitive factor in gaming performance.

For systems running at 8MHz or below, shouldn't video memory performance be the same as system memory performance? I understand that XT and AT class machines treat RAM access the same whether the RAM is on the motherboard or on the expansion bus. IBM PCs & XTs generally used 200ns memory and IBM ATs & XT/286s used 150ns memory. I checked my IBM video cards and they seem to use 120ns memory. In that case, the CPU should be able to access the memory without extra wait states, so the read/write speed should be equal at these speeds. Above 8MHz, video memory access should be slower because the CPU can only reach video RAM at bus speed, and the video RAM may be slower than the system RAM.

If I am horribly wrong here, please let me know.
 
If I am horribly wrong here, please let me know.

Video memory is always slower than the main system RAM because of contention between the CPU and video circuitry (the PCjr probably being the most well-known example of this).
 
Well, it depends. The PCjr video circuitry actually tried to arbitrate between video logic and CPU, and the video logic has higher priority.

On a normal PC all system RAM is the same. The video RAM is at best as fast as the system RAM, and at worst quite a bit slower. It turns out that CGA is probably as fast as main memory is on a PC because the video circuitry does not stop CPU accesses - which gives you the wonderful snow effect.

Keep in mind there are 8086 machines out there with 16 bit paths to memory, but possibly 8 bit paths to video. Or machines with dual ported video memory that can allow both video controller and CPU access simultaneously. It is hard to generalize.
 
Well, it depends. The PCjr video circuitry actually tried to arbitrate between video logic and CPU, and the video logic has higher priority.

Ditto for 128k Tandy 1000s.

On a normal PC all system RAM is the same. The video RAM is at best as fast as the system RAM, and at worst quite a bit slower. It turns out that CGA is probably as fast as main memory is on a PC because the video circuitry does not stop CPU accesses - which gives you the wonderful snow effect.

Ah, yes. Since I have my 5150, I've learned all about the wonders of snow. I tried running WordPerfect 5.0 on it the other day, and it looked like a blizzard was falling. Turns out that WP writes text directly to the video buffer for maximum speed.
 
How do the other adapters solve the "snow" problem? The IBM PC/XT Technical Reference Manual seems to give the answer when talking about the Monochrome and Printer Display Adapter:

There are 4K bytes of static memory on the adapter which is used for the display buffer. This buffer has two ports and may be accessed directly by the processor.

This seems to imply that RAM on the MDA is dual ported. Is that the same for the EGA & VGA cards?

Why is snow only a problem on the CGA 80-column text modes and not on the 40-column or graphics modes?
 
This seems to imply that RAM on the MDA is dual ported. Is that the same for the EGA & VGA cards?

Yes. As far as I know, only the IBM CGA card has single-ported RAM (it's possible that some cheap Asian cards do as well).

Why is snow only a problem on the CGA 80-column text modes and not on the 40-column or graphics modes?

The 80-column text mode uses more bandwidth than the other modes.
 
Yes. As far as I know, only the IBM CGA card has single-ported RAM (it's possible that some cheap Asian cards do as well).

But is it? From the IBM PC Technical Reference Manual on the IBM Color/Graphics Display Adapter:

A dual-ported implementation allows the processor and the graphics control unit to access the buffer. The processor and the CRT control unit have equal access to this buffer during all modes of operation, except in the high-resolution alphanumeric mode. In this mode, only the processor should access this buffer during the horizontal-retrace intervals. While the processor may write to the required buffer at any time, a small amount of display interference will result if this does not occur during the horizontal-retrace intervals.

This text implies that the RAM on the CGA card is dual ported.

The 80-column text mode uses more bandwidth than the other modes.

How precisely does this affect this card and not the other cards? The RAM is no faster on the CGA card.
 
This text implies that the RAM on the CGA card is dual ported.

I don't know. Everything I've read says that the memory is single-ported.

How does this precisely affect this card and not the other cards? The RAM is no faster on the CGA card.

I believe it's a combination of the single-ported memory plus the higher bandwidth of 80-column text mode. The 32k graphics modes on the Tandy/PCjr are high-bandwidth as well, but since they don't have single-ported memory, they don't suffer from snow.
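A back-of-the-envelope calculation of why 80-column text is the problem child, using the standard CGA character clocks:

```python
# During active display the CRTC fetches 2 bytes (character + attribute)
# per character clock.  The 80-column clock is twice the 40-column one,
# so the fetch bandwidth doubles.
CRYSTAL_HZ = 14_318_180
fetch_rate = {
    "40-col text": CRYSTAL_HZ / 16 * 2,  # ~0.9 M chars/s * 2 bytes
    "80-col text": CRYSTAL_HZ / 8 * 2,   # ~1.8 M chars/s * 2 bytes
}
for mode, bps in fetch_rate.items():
    print(f"{mode}: {bps / 1e6:.2f} MB/s of video RAM bandwidth")

# In 80-column mode the CRTC needs roughly every available memory
# cycle, so an unsynchronized CPU access steals a fetch and the missed
# data shows up as "snow".  The lower-bandwidth modes leave spare
# cycles for the CPU, so they stay clean.
```

That's about 3.58 MB/s of fetch traffic in 80-column text versus 1.79 MB/s in 40-column text, which matches the "more bandwidth" explanation above.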
 