
Tandy 1000 A/EX/HX DMA speed-up

... Now some possibly interesting Topbench results.

As mentioned earlier, I recently built an expansion card that backfills a Tandy 1000 EX to 640k using a SRAM chip and does *not* include a DMA controller. Today I upgraded the machine to a V-20 CPU. (There's a lame thread I updated about my half-hearted attempt to see if it could be trivially overclocked; that's why I was doing tons of reading on the timing chips in the EX and found the bit about wait states and video memory.) After upgrading I ran Topbench, and I found a curious result. Here are my results (copied from database.ini):

[UID9890119E98]
MemoryTest=1843
OpcodeTest=1110
VidramTest=1129
MemEATest=1428
3DGameTest=1030
Score=8
CPU=NEC V20
CPUspeed=7.16 MHz
BIOSinfo=Tandy 1000
BIOSdate=19860714
BIOSCRC16=9890
VideoSystem=CGA
VideoAdapter=Tandy 1000
Machine=Tandy 1000 EX - V20
Description=Tandy 1000 EX w/V20, Custom SRAM RAM card, no DMA
Submitter=Eudimorphodon@VCfed forums

(Apparently I'm right on the bubble between being awarded a "7" or an "8"; the batch run that wrote it to the database gave me an 8, while a dynamic run gives me a 7, for what appears to be a two-microsecond difference.)

[Attachment: running.jpg]

The reason I was possessed to dig up this thread was this result from the database for a 1000 HX with a V-20 and 640k. (I presume that 640k is from a Tandy DMA RAM card.)

[UIDF9D031C]
MemoryTest=2033
OpcodeTest=1231
VidramTest=1265
MemEATest=1600
3DGameTest=1142
Score=7
CPU=NEC V20
CPUspeed=7.16 MHz
BIOSinfo=Copyright (C) 1984,1985,1986,1987 (06/01/87, rev. 100)
BIOSdate=19870601
BIOSCRC16=F9D0
VideoSystem=CGA
VideoAdapter=Tandy 1000
Machine=Tandy 1000 HX
Description=640kb memory, ROM 2.00.00, with a 2400 baud dialup modem. Go ahead - laugh - it came with the machine, dangit!!
Submitter=Maverik1978 (VCF)

My machine actually seems to run substantially faster than the one with the DMA card; 114% as fast according to the database comparison:

[Attachment: comparison.jpg]
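For what it's worth, the raw microsecond totals from the two entries can be checked with straight arithmetic. This is just a sanity check; Topbench's own comparison display presumably weights things differently, hence the 114% figure:

```python
# Per-test microsecond timings from the two database entries quoted
# above (lower is faster).  Ratios computed by plain division.
sram_ex = {"MemoryTest": 1843, "OpcodeTest": 1110, "VidramTest": 1129,
           "MemEATest": 1428, "3DGameTest": 1030}
dma_hx  = {"MemoryTest": 2033, "OpcodeTest": 1231, "VidramTest": 1265,
           "MemEATest": 1600, "3DGameTest": 1142}

for name in sram_ex:
    ratio = dma_hx[name] / sram_ex[name]
    print(f"{name}: HX took {ratio:.1%} of the EX's time")

total_ratio = sum(dma_hx.values()) / sum(sram_ex.values())
print(f"Overall: {total_ratio:.1%}")   # roughly 111% on the straight totals
```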

I am banging my head against the wall for not comparing in detail the results to the 1000 EX in the database *before* swapping CPUs. I *really* don't want to swap them back. My assumption is that all things being equal an EX and an HX should score identically in Topbench. Does anyone have an EX with the Tandy RAM board and a V-20 to test? If these results are correct then I have to assume that the DMA card must at least sometimes induce wait states that my dead-stupid SRAM card doesn't. QED, a Tandy 1000 (EX) with no DMA is *faster* than one with it?
 
Anyone ever try a word processor that allows you to keep editing while printing a document (to a common printer that has a small or no buffer)? That would be a good test of DMA.

It's not a good test. Printers either have buffering or are slow enough that it doesn't matter. I recall background printing utilities for WordStar that used neither interrupts per se nor DMA. The periodic timer tick was good enough to keep the printer busy.

I recall that on an x80 compiler I worked on, it was easiest to print the symbol table cross-reference using a selection sort, because printing was so slow that a fancier sort algorithm wouldn't have made a significant difference.
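The trick is that selection sort produces its output incrementally, so each O(n) scan overlaps with the much slower printing. A minimal sketch of the idea (in Python, obviously not the original x80 code):

```python
def print_sorted(symbols, emit):
    """Selection sort that emits each symbol as soon as it is selected.

    Each outer pass scans the remaining entries for the minimum and
    hands it straight to the printer callback; the O(n^2) scanning
    effectively happens for free while the printer catches up.
    """
    remaining = list(symbols)
    while remaining:
        smallest = min(remaining)      # one selection pass
        remaining.remove(smallest)
        emit(smallest)                 # send to printer immediately

out = []
print_sorted(["zeta", "alpha", "mid"], out.append)
print(out)  # ['alpha', 'mid', 'zeta']
```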
 
I realize this is fairly grotesque thread necromancy, but I think I may have some relevant observations to share based on some stuff I've been working on recently.

Keep being a necromancer! I never put two and two together to get four before (always 2.5). But that actually makes sense. I discovered the 'expansion ram starts at zero' after I built a 2MB EMS/CMS card for my 1000A - which was after this thread.

So posit..
  • We've observed the machine gets faster when inserting a memory expansion card with a DMA controller.
  • I've contended it can't be because of DMA as the only thing DMA could be accelerating is memory refresh which is a small contributor (~5% - smaller than the benchmark increases)
  • What is probably happening is in a non-expanded system, the program is running from planar RAM which has shared arbitration/contention between the general accesses and video frame buffer
  • In an expanded system, the expanded memory is mapped to address 0 and planar RAM is remapped above it. The BDA's available-memory size is adjusted downward to reflect the frame buffer reservation, and the frame buffer starting address is pointed at that upper planar memory reservation.
  • The system is faster because there is no longer arbitration/contention between the two memory banks.
  • This could be proven by re-testing RAM in the expanded area vs. not. E.g., in a system with a 128KB system board and a 128KB expansion, test above and below the 128K mark.
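That remapping conjecture can be sketched as a toy model. The addresses and the frame-buffer carve-out follow my reading of the posts above, not the chipset documentation:

```python
# Toy model of the conjectured memory map before and after adding a
# 128KB expansion card to a 128KB system board.  Purely illustrative;
# the registers and exact boundaries involved are assumptions.
def memory_map(planar_kb, expansion_kb, video_kb=16):
    if expansion_kb == 0:
        # Unexpanded: CPU and video both arbitrate for planar RAM.
        return {"0K..": ("planar (shared with video)", planar_kb)}
    # Expanded: expansion RAM mapped at 0, planar RAM above it,
    # frame buffer carved out of the top of planar RAM.
    return {
        "0K..": ("expansion (no video contention)", expansion_kb),
        f"{expansion_kb}K..": ("planar", planar_kb - video_kb),
        f"{expansion_kb + planar_kb - video_kb}K..": ("video frame buffer", video_kb),
    }

print(memory_map(128, 0))     # everything contends with video
print(memory_map(128, 128))   # program runs from contention-free RAM at 0
```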

So I'm going to reiterate my original post's point: "ADDING DMA DOES NOT SPEED UP A TANDY 1000". I will concede that adding expanded conventional memory - any expanded conventional memory - will. And adding DMA MIGHT speed up floppy transfers, but that would have more to do with the slightly faster DMA controller (5 MHz) and cycle timings than anything. And I'm really not convinced that floppy transfers are sped up at all. As I have stated before, I have disassembled a few Tandy 1000 BIOSes now and have found zero code that programs an 8237A to date. Someone please find me a smoking gun (or stop saying DMA speeds up a 1000).
 
Keep being a necromancer! I never put two and two together to get four before (always 2.5). But that actually makes sense. I discovered the 'expansion ram starts at zero' after I built a 2MB EMS/CMS card for my 1000A - which was after this thread.

Does your EMS card replace the normal base memory card you'd have in a 1000, or is it an add-on (and therefore you're still running with a DMA chip present)?

This morning I pulled out my old dog-eared copy of "The Indispensable PC Hardware Book, 3rd Edition", and I think I found the explanation for why my SRAM card might run faster than the DMA board. The section about memory refresh in the PC/XT architecture says that (on a PC; I assume it's similar on a 1000 with the board) counter 1 of the PIT is set up for a 66 kHz square wave, which is used to trigger a dummy DMA cycle every 15 microseconds. For the duration of that cycle the DMA controller is the bus master asserting MEMR, which of course forces the CPU to wait. (I didn't quite understand how this worked when I mentioned the DMA RAM card possibly having "wait states"; now I get it.) So indeed, the mere presence of a DMA controller issuing refresh cycles is going to occupy a small percentage of bus cycles that would otherwise be open if you didn't need memory refresh.
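The back-of-the-envelope math works out, too. These are stock PC/XT numbers; I'm assuming the Tandy DMA board is programmed the same way:

```python
# Rough estimate of the bus time eaten by dummy refresh DMA cycles on
# a PC/XT-class machine.  Stock PC/XT figures; the cycle count the
# 8237 holds the bus is an approximation.
PIT_CLOCK_HZ = 1_193_182       # PIT input clock, ~1.19318 MHz
DIVISOR = 18                   # counter 1 programming on the PC/XT
BUS_CLOCK_HZ = 4_772_727       # 4.77 MHz bus clock
CYCLES_PER_REFRESH = 4         # approx. bus clocks per refresh cycle

refresh_period_us = DIVISOR / PIT_CLOCK_HZ * 1e6
overhead = CYCLES_PER_REFRESH * (1 / BUS_CLOCK_HZ) / (DIVISOR / PIT_CLOCK_HZ)

print(f"refresh every {refresh_period_us:.2f} us")   # ~15.09 us
print(f"bus overhead ~{overhead:.1%}")               # ~5-6%
```

That lands right around the ~5% refresh contribution mentioned earlier in the thread.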

It'd still be neat if someone has an EX with a V-20 to verify just in case there's some tiny change in the HX architecture that fundamentally makes it run a bit slower than an EX, but I think that's pretty unlikely.
 
My card has 8 512K SRAMs and 2 512K flash chips with a CPLD, a '245 data buffer, and a DIP switch block. You can set it to backfill some portion of conventional 640K (with a Tandy mode switch to start at 0), and all 320 16K pages can be mapped into the EMS page frames. The driver looks at the DIP switch settings, determines which pages are free for the EMS pool and which provide CMS backfill, and configures the available EMS pool accordingly.

I don't have Int 13h support for flash remapping yet. It does not have an 8237A.
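If it helps picture the driver's bookkeeping, here's a rough model of the page partitioning. The function name and parameters are invented for illustration; the real CPLD/driver interface surely differs:

```python
# Rough model: split the card's 16KB pages between conventional-memory
# backfill and the EMS pool, driven by a backfill size that would come
# from the DIP switch block.  Names and encoding are hypothetical.
def partition(total_pages=320, backfill_kb=384, page_kb=16):
    """Return the page counts reserved for backfill vs. the EMS pool."""
    backfill_pages = backfill_kb // page_kb
    assert backfill_pages <= total_pages
    return {"backfill_pages": backfill_pages,
            "ems_pool_pages": total_pages - backfill_pages}

# e.g. backfilling a 256KB machine up to 640KB takes 384KB = 24 pages,
# leaving the rest for the EMS driver's free pool.
print(partition(backfill_kb=384))
```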
 
My card has 8 512K SRAMs and 2 512K flash chips with a CPLD, a '245 data buffer, and a DIP switch block. You can set it to backfill some portion of conventional 640K (with a Tandy mode switch to start at 0)

It might be interesting for you to run Topbench with your card set for backfill and the original card (assuming you have one lying around) in your 1000/A and see if there's also a measurable difference in scores that could be accounted for by the lack of refresh cycles. That would pretty much clinch it.

If my theory is correct, then the slowest memory in a Tandy 1000 would be the backfill portion of the planar memory in a 1000 *with* the DMA board installed, since the CPU could potentially contend with both the DMA controller's busmaster cycles (which will pause the CPU no matter where it's looking, since the bus doesn't support simultaneous busmasters) *and* the wait states generated by contention with video output. I'm kind of wondering if there's some way I could fill up the base 384k of RAM on my EX and then run the Topbench stub, thereby forcing it to execute from the planar RAM despite the memory card being installed. That would presumably make the machine either roughly tie or be a little slower than the HX+V20+DMA entry in the database.
 
I will eventually. I just have a dozen balls in the air right now and don't have the time to pull my 1000A out of the basement to test things.
 
Not to derail the conversation too much, but I'm publishing a video this weekend that proves the Tandy 1000 (the original) can be just a hair slower than the IBM PC, found when doing a software-controlled sound output test. The program in question used software loops for timing, and a lot of port 61h writes. The audio is audibly slower/lower in pitch than when run on the IBM PC. I'm unable to account for this discrepancy.
 
Not to derail the conversation too much, but I'm publishing a video this weekend that proves the Tandy 1000 (the original) can be just a hair slower than the IBM PC, found when doing a software-controlled sound output test. The program in question used software loops for timing, and a lot of port 61h writes. The audio is audibly slower/lower in pitch than when run on the IBM PC. I'm unable to account for this discrepancy.

Most plausible theory is that the 1000 shared video RAM in much the same way as the Jr. However, I suspect the extra CPU waits due to video access were handled more efficiently than by the Jr's video gate array, as the 1000 was an evolution. And obviously the 1000 mapped video RAM at the end of main memory, so there wasn't the nasty hole.

If you need a memory expansion board for the 1000 to test some of this thread's theories, I can mail you a few today.
 
Not to derail the conversation too much, but I'm publishing a video this weekend that proves the Tandy 1000 (the original) can be just a hair slower than the IBM PC, found when doing a software-controlled sound output test. The program in question used software loops for timing, and a lot of port 61h writes. The audio is audibly slower/lower in pitch than when run on the IBM PC. I'm unable to account for this discrepancy.

Does your T1000 have a RAM card installed, or is this part of a demo meant to run on an un-expanded machine? Access to the shared planar RAM is probably slower than on a PC and faster than on a PCjr, but from reading the manual I don't think it's contention-less.

If you want a data point I'd be willing to run your software loop on my SRAM EX in slow mode, but maybe that won't tell you anything useful. (especially since I have a V-20.)
 
Most plausible theory is that the 1000 shared video RAM in much the same way as the Jr. However, I suspect the extra CPU waits due to video access were handled more efficiently than by the Jr's video gate array, as the 1000 was an evolution. And obviously the 1000 mapped video RAM at the end of main memory, so there wasn't the nasty hole.

A good theory, however the PCjr's slowdown at very high interrupt rates (16 KHz) is much worse than on a Tandy 1000. So the Tandy 1000 is slower, but only by 1%, just enough to be audible. On an expanded PCjr, where the main code runs out of the expansion but the interrupt is reading 4 bytes from the first 128K at 16 KHz, it's much more pronounced.
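A rough estimate of why a 16 kHz interrupt is so punishing, and why wait-stated RAM in the ISR path makes it visibly worse. The cycle counts here are guesses for illustration, not measurements:

```python
# Estimate the CPU time consumed by a 16 kHz sample-playback interrupt,
# with and without extra wait states on the memory the ISR touches.
# Both ISR cycle counts are invented for illustration.
CPU_HZ = 4_772_727     # 4.77 MHz 8088
INT_RATE = 16_000      # interrupts per second

def overhead(isr_cycles):
    """Fraction of total CPU cycles spent inside the ISR."""
    return INT_RATE * isr_cycles / CPU_HZ

fast_ram_isr = 120     # guessed ISR cost in fast expansion RAM
slow_ram_isr = 160     # same ISR hitting wait-stated shared RAM

print(f"fast RAM: ~{overhead(fast_ram_isr):.0%} of CPU")   # roughly 40%
print(f"slow RAM: ~{overhead(slow_ram_isr):.0%} of CPU")
```

Even a modest per-interrupt penalty multiplies into a very visible slowdown at that rate, which fits the "much more pronounced" behavior on the stock PCjr.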
 
A good theory, however the PCjr's slowdown at very high interrupt rates (16 KHz) is much worse than on a Tandy 1000. So the Tandy 1000 is slower, but only by 1%, just enough to be audible. On an expanded PCjr, where the main code runs out of the expansion but the interrupt is reading 4 bytes from the first 128K at 16 KHz, it's much more pronounced.

Paging through my copy of the DATABASE.INI for Topbench, it doesn't look like there are results for an original 1000/1000A in there; the closest thing is a 1000 SX in slow mode. And comparing the scores for that to the scores UnknownK posted in the thread earlier suggests the original 1000 may *be* slightly slower than later models. 1000SX from database.ini:

[UID989015B1]
MemoryTest=3742
OpcodeTest=1770
VidramTest=2082
MemEATest=1962
3DGameTest=1883
Score=4
CPU=Intel 8088
CPUspeed=4.77 MHz
BIOSinfo=
BIOSdate=19860714
BIOSCRC16=9890
VideoSystem=CGA
VideoAdapter=Tandy 1000
Machine=Tandy 1000 SX (4.77MHz)
Description=
Submitter=Great Hierophant

UnknownK's score from a 1985 Tandy 1000 with a DMA memory card installed:

[UID85086DB6]
MemoryTest=3823
OpcodeTest=1833
VidramTest=2148
MemEATest=2028
3DGameTest=1922
Score=4
CPU=Intel 8088
CPUspeed=4.77 MHz
BIOSinfo=unknown
MachineModel=0000
BIOSdate=19850305
BIOSCRC16=8508
VideoSystem=CGA
VideoAdapter=Tandy 1000
Machine=Tandy 1000

It's not a lot but it's measurably slower in every test, which is a little odd since you'd think they should be equivalent. The SX score is almost dead on with a 5150 from the database. Perhaps an original 1000 actually *is* slower. Do you have another 8088 model (EX/HX/SX) to run your test on?

As an aside, I noticed right under the 1000SX score in the database is a score from an expanded IBM PCjr (Description=IBM PCjr running in the faster RAM of the memory expansion). The scores from that PCjr are better than the 1000SX's in every category but the VidramTest, by about as much as my SRAM EX beats the DMA/DRAM HX. DMA expansions were rare for PCjrs, so... I wonder if likewise that RAM expansion doesn't burden the system with waiting for refresh cycles.
 
I wonder if likewise that RAM expansion doesn't burden the system with waiting for refresh cycles.

The more common IBM 128KB memory side-cars did use Dynamic RAM which would require refresh. But at least the video wasn't in contention. But even with 1, 2, or even 3 128KB side-cars, you wouldn't want to block out the entire 128KB from use by DOS unless you were specifically doing this benchmark. Each KB was $$ ($$$$ at 1980s exchange rates). I'm guessing it was something like a JR-IDE which uses SRAM and runs a hair faster than DRAM due to no refresh.
 
Also, re: how comparable the Tandy 1000's video ASIC is to the PCjr's, I think these scores:

;Data collected by: TOPBENCH | Benchmark and detection stub | Version 0.97e
;This file contains fingerprinting information about your computer. Please
;email this file to trixter@oldskool.org with a subject line of "Benchmark" to
;help test these routines and seed the TOPBENCH database.

[UID85086477]
MemoryTest=4232
OpcodeTest=2103
VidramTest=2236
MemEATest=2375
3DGameTest=2152
Score=4
CPU=Intel 8088
CPUspeed=4.77 MHz
BIOSinfo=unknown
MachineModel=0000
BIOSdate=19850305
BIOSCRC16=8508
VideoSystem=CGA
VideoAdapter=Tandy 1000
Machine=Tandy 1000

DOS 2.11 Boot disk, 128KB no DMA.

Compared to these:

[UID7F5C71D]
MemoryTest=5926
OpcodeTest=3584
VidramTest=3373
MemEATest=4392
3DGameTest=3490
Score=2
CPU=Intel 8088
CPUspeed=4.77 MHz
BIOSinfo=COPR. IBM 1981,1983 (06/01/83, rev. 86)
BIOSdate=19830601
BIOSCRC16=7F5C
VideoSystem=CGA
VideoAdapter=IBM PCjr
Machine=IBM PCjr
Description=Stock, 128KB RAM. This score is accurate -- it is slower because the first 128KB of PCjr RAM is also display RAM and has an additional wait state.
Submitter=trixter@oldskool.org

Settle that. It doesn't look like it's completely wait-state free (and Unknown_K's scores support that with the aforementioned 10%-ish hit compared to having a RAM card installed) but it is clearly a massive improvement. (The video chip does have 16-bit access to RAM and separate video and CPU data latches, so contention would at least be substantially reduced from just that. It also looks like it may be able to generate wait states with better granularity for when conflicts do happen, although I can't say that without knowing more about how the Jr. works.)
 
I'm guessing it was something like a JR-IDE which uses SRAM and runs a hair faster than DRAM due to no refresh.

Yeah, that's what I was suspecting. A Jr-IDE should be roughly equivalent to SRAM in a Tandy for expansion RAM speed, and would be another data point for "your machine will be faster with no DMA controller unless you need it for something".
 
I do, but I'm swamped until about 1-2 weeks after VCFMW is over.

So... I noticed something last night, and I was wondering if you could give me some insight on how TopBench assigns the aggregate speed number when you add a new system to the database. I was mucking around with a keyboard-less 1000 HX motherboard last night, driving it via PC Anywhere (which I don't know is actually a factor), and out of curiosity related to this "DMA speed-up" thread I ran TopBench on it to look for more confirmation/denial of the theory that one of these machines with a static RAM board is faster than the DMA board. I did the "add this machine to the database" routine, and then I ran system comparison to see how the HX with the static board compared to the EX in the database with the Radio Shack board. (The HX in the stock database has a V-20 so the EX should be the closer comparison.) Here's a screenshot of the curious result. (If you'd like the actual database numbers I can transfer them over. The screen looks like a VGA screen because it's the PC Anywhere view.)

[Attachment: 20191001_184217.jpg]

I'm curious why the result that is faster on every individual test rated a "5" instead of a "6", and was thus declared "20% slower". (Based on the "total" times my calculator says it should actually be around 8% faster?) Is the "speed number" based on some arithmetic mean of the existing scores (and thus only really valid if the whole database is re-crunched?), or a wall-clock time that might have been influenced by PC Anywhere running in the background?

I was already sort of curious about this because my other machine was assigned an "8" when it went into the database, but always gets a "7" when the test is run interactively. The scores will be identical, or near enough I can't imagine it mattering. (Literally a microsecond or two either way.)

(I apologize if this has already been answered elsewhere.)
 
These are all great questions, and I'm sorry if the documentation doesn't do a good job of explaining it. The details are in https://github.com/MobyGamer/TOPBENCH/blob/master/BTSUITES.PAS but the gist is this: There are five sections of code that are run, each testing a different area of performance (memory speed, general instruction exercise, etc.). Each section returns the number of microseconds (usecs) it took to run, so if you see those, lower numbers are better. However, those are not part of the Score -- they are recorded as a convenience to emulator authors, or if you want to see exactly why one Score is slightly different from another. The Score itself is the number of times the complete test suite ran in a 50ms period.

The goal of the Score was to keep it simple -- just a synthetic integer you could use to compute relative performance between systems. It was a reaction to the (IMO misleading and confusing) Landmark "XX.XX MHZ AT SPEED" measurement. I still think how I did it was the right decision, but where it breaks down is when you have an edge case where a very slight modification on a very slow system lets the Score calculation run just one extra loop, which is what you're seeing. One extra loop is no big deal on a faster system, but when the Score is single-digits, it can look like (in your case) a 12% jump in speed when it might really only be a 1% jump that pushed it over the edge. For such edge cases, you can determine the actual % speed difference by looking at the actual microsecond timings and comparing those instead.
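To make the quantization concrete, here's a sketch of that edge case. The 50 ms window is from the explanation above; the suite durations are invented:

```python
# Score = number of complete suite runs that fit in the timing window.
# The two suite durations below are made up to show how a ~1% speed
# change can cross an integer boundary on a slow machine.
WINDOW_US = 50_000   # 50 ms window used by TOPBENCH

def score(suite_us, window_us=WINDOW_US):
    return window_us // suite_us

print(score(6_260), score(6_200))   # 7 vs 8: ~1% faster, looks like a 14% jump
# A longer window would soften the quantization on slow systems:
print(score(6_260, 100_000), score(6_200, 100_000))   # 15 vs 16
```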

This has prompted me to make a small change to TOPBENCH where I can put the % speed difference for the microsecond timings in that same display in your screenshot. I'll make that code change by the end of the weekend and then update this thread when it's uploaded. Hopefully that will give you more insight.

When I designed TOPBENCH I didn't think people would be making such small changes to very slow systems -- people usually made larger changes, like 8088->NEC V20, or 6MHz to 8MHz. Had I known people would be making tiny changes, I would probably have changed the score such that it would be # of iterations in a 100ms or 200ms period, rather than 50ms. I suppose I could release a new version that does that, and adjusts the scores to match so that they're all the same relative to each other, but that's disingenuous and inaccurate for systems that actually exhibit these edge cases, so I'm afraid the benchmark's core operation is forever frozen.
 
There are five sections of code that are run, each testing a different area of performance (memory speed, general instruction exercise, etc.). Each section returns the number of microseconds (usecs) it took to run, so if you see those, lower numbers are better. However, those are not part of the Score -- they are recorded as a convenience to emulator authors, or if you want to see exactly why one Score is slightly different from another. The Score itself is the number of times the complete test suite ran in a 50ms period...

I still think how I did it was the right decision, but where it breaks down is when you have an edge case where a very slight modification on a very slow system lets the Score calculation run just one extra loop, which is what you're seeing. One extra loop is no big deal on a faster system, but when the Score is single-digits, it can look like (in your case) a 12% jump in speed when it might really only be a 1% jump that pushed it over the edge. For such edge cases, you can determine the actual % speed difference by looking at the actual microsecond timings and comparing those instead.

Are the microsecond timings that are recorded in the end the average for each of the loops over the 50ms run, or just the amount of time a given (first,last?) run of that individual code tree took? That's where I'm still a little puzzled, because in this bizarre edge case the times for *all* the individual tests were lower (faster), but the aggregate score was lower (less total runs executed).

I've noticed that I can significantly ding the interactive Topbench score (at least on these bottom-of-the-barrel machines) by, for instance, grabbing the mouse and whirling it around. I assume that happens because it's possible for interrupts to impinge in the middle of the 50ms loops and knock an individual execution out? If that's the case then I imagine PC Anywhere's diddling with the serial port could be responsible for the lower synthetic score and the result will make more sense after I work out how the heck I'm going to put a keyboard on this thing?
 