Tandy 1000 A/EX/HX DMA speed-up

eeguru

I just read yet another post today from someone claiming that adding the DMA memory expansion card to a 1000 {A, EX, HX} made the machine 'faster'. I would really like someone to explain this one to me at a technical level, as I don't understand how this was verified. A couple of theories:

1) With an 8237A, DRAM refresh could be off-loaded from a CPU timer interrupt. The Tandy variation of the 8-bit slot has HOLD/HLDA and 20 bits' worth of addressing, so it could, in theory, do automatic scheduled DRAM refresh for all lower system memory. However, has anyone ever found evidence (BIOS fragment, benchmark, etc.) that this is the case?

2) With an 8237A, floppy sector transfers could be offloaded to the DMA controller instead of being done via PIO (move + in/out + loop). However, the BIOS routines are synchronous: the CPU has nothing to do but continuously poll status registers or marshal the result. PIO loops of 6 bytes or less may not generate any instruction fetches. But the wait states on the data register I/O instruction could be saved per byte via DMA, and the floppy DMA req/ack lines are also present on the ISA slot. But again, has anyone ever found evidence (BIOS fragment, disk I/O benchmark, etc.) that the BIOS floppy transfer routines really do auto-detect an 8237A and adjust?

3) Other known speed up mechanisms?

I find it *very* hard to believe that the 1000/1000A BIOS writers, at least, were so forward-thinking that they would have included advanced support for #1 or #2. Maybe on the EX and HX. However, I've disassembled much of that BIOS in the past and don't recall ever finding 8237A-related code. I think it's more plausible that they changed the IBM 8-bit slot pins slightly so that DMA support could be added in the future for an as-yet undefined peripheral use case; there is still one unused channel ack/req set on the Tandy 8-bit bus.

If you know for a fact how the machine is sped up and can point me to either benchmark data or a BIOS location, please educate me! Thanks! :)
 
I believe the confusion arises because the PCjr does appear to be much slower without DMA due to DRAM refresh overhead, but Tandy 1000 systems handled RAM refresh with dedicated motherboard hardware that had little to no performance impact.

*edit*
http://www.vcfed.org/forum/showthread.php?58833-Experience-of-a-DMA-less-Tandy-1000
*/edit*
 
That isn't the reason. DRAM refresh cycles do not have a large impact (~5% maybe). The PCjr was slow because it had single-ported RAM, and the video graphics controller had to arbitrate and interleave its own accesses (tracing out the current frame buffer) with general CPU accesses. It wasn't because of the lack of an 8237A.
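As a rough sanity check on the "~5%" figure, here is a back-of-envelope sketch. It assumes stock IBM PC numbers rather than anything measured on a Tandy or PCjr: a refresh request every 18 timer ticks, 4 CPU clocks per timer tick, and about 4 CPU clocks lost per refresh cycle. All three constants are assumptions for illustration.

```python
# Back-of-envelope DRAM refresh overhead, using assumed stock IBM PC figures.
# None of these constants were measured on a Tandy 1000 or PCjr.
PIT_TICKS_PER_REFRESH = 18     # assumed timer reload value for refresh requests
CPU_CLOCKS_PER_PIT_TICK = 4    # 4.77 MHz CPU clock vs. 1.19 MHz timer clock
CLOCKS_LOST_PER_REFRESH = 4    # assumed cost of one refresh cycle, in CPU clocks


def refresh_overhead(pit_ticks=PIT_TICKS_PER_REFRESH,
                     clocks_lost=CLOCKS_LOST_PER_REFRESH):
    """Fraction of CPU clocks lost to refresh under the assumptions above."""
    period_clocks = pit_ticks * CPU_CLOCKS_PER_PIT_TICK
    return clocks_lost / period_clocks


if __name__ == "__main__":
    print(f"Estimated refresh overhead: {refresh_overhead():.1%}")
```

Under those assumptions the loss is 4 clocks out of every 72, which is in the same ballpark as the estimate above.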
 
Tandy 1000s all have single-ported RAM as well. The TX and TL series have dedicated video RAM as an option, which provides a huge speed increase on processor-intensive work if installed.

The CPU has to continually poll the keyboard on the PCjr but not on the Tandy; if I remember correctly, Tandy included extra circuitry to avoid this.

DMA probably did have a greater speed-up effect on the PCjr.

Has anyone ever tried a word processor that allows you to keep editing while printing a document (to a common printer that has a small buffer or none)? That would be a good test of DMA.

I just sold my Tandy 1000EX with DMA; I wish I had kept it to run a few benchmarks/tests.
 
This should be easy to test since the DMA was provided by the RAM upgrade:

1. Find any Tandy 1000 or 1000A
2. Does it have 128KB? Perform your testing.
3. Does it have more than 128KB? Rip the memory board out, then perform your testing.

Of course, now you're going to ask me to get my 1000 out of storage. I'll be in a position to do that in a few weeks; I'll try to remember.
 
I have a T1K I can test with and without DMA. Does someone have a boot disk image with a benchmarking program that they can mail me? I'll run it.
 
The PCjr's keyboard reading routines were slow in part because there is no serial-to-parallel shift register to handle the work of forming a complete byte from the keyboard for the CPU to read. So the PCjr's processor itself has to read the port more often and perform the work of deserializing the reads, which the PC's processor did not have to do.

Tandy came up with the ingenious idea of arraying memory that was controlled by the video display controller as a 16-bit array. This gave the controller a very high video bitrate, allowing the video to keep pace with the system instead of being a bottleneck. The array looked to be 8-bit to the 8088 CPU.

The video display controller can handle the refresh of 128KB of DRAM in the Tandy 1000/A/HD and 256KB of DRAM in the Tandy 1000 EX & HX. The remainder is handled by a DMA chip when added to the system with a RAM upgrade. Alternatively, there have been some solutions recently which have used SRAM, which is easier to implement and does not need constant memory refresh.

This thread may be useful in this discussion : http://www.vcfed.org/forum/showthread.php?58833-Experience-of-a-DMA-less-Tandy-1000
 
I'll make a bootable 360K image of the TOPBENCH stub program when I get home and reply back. (Even if we prove it with a different benchmark, I'd like the results for the database.)
 
No bootable image, but I did make the program stub small enough that it will run in only 53K of RAM so you can use it with any boot disk you want. Stub is here: ftp://ftp.oldskool.org/pub/TOPBENCH/TSTUB97E.ZIP

Instructions are simple, just run it from the floppy disk and it will create OUTPUT.INI with the results of various benchmarks. Run it again and it will append to OUTPUT.INI. Post both results here and, if DMA does or doesn't speed up CPU/memory/video operation, it will be obvious.
 
;Data collected by: TOPBENCH | Benchmark and detection stub | Version 0.97e
;This file contains fingerprinting information about your computer. Please
;email this file to trixter@oldskool.org with a subject line of "Benchmark" to
;help test these routines and seed the TOPBENCH database.

[UID85086DB6]
MemoryTest=3823
OpcodeTest=1833
VidramTest=2148
MemEATest=2028
3DGameTest=1922
Score=4
CPU=Intel 8088
CPUspeed=4.77 MHz
BIOSinfo=unknown
MachineModel=0000
BIOSdate=19850305
BIOSCRC16=8508
VideoSystem=CGA
VideoAdapter=Tandy 1000
Machine=Tandy 1000

Tandy 1000A with an MFB1000 512KB RAM card with DMA enabled (640K total). I used a DOS 6.22 boot disk for the test. I could not run it with DMA off, because that leaves me with only the base 128KB, which is too low for DOS 6.
 
Let us know what the 128K non-DMA test shows once you can run it with DOS 3.3 or lower. (Assuming you have the ability to write a boot disk, since you were able to transfer the program?)
 
;Data collected by: TOPBENCH | Benchmark and detection stub | Version 0.97e
;This file contains fingerprinting information about your computer. Please
;email this file to trixter@oldskool.org with a subject line of "Benchmark" to
;help test these routines and seed the TOPBENCH database.

[UID85086477]
MemoryTest=4232
OpcodeTest=2103
VidramTest=2236
MemEATest=2375
3DGameTest=2152
Score=4
CPU=Intel 8088
CPUspeed=4.77 MHz
BIOSinfo=unknown
MachineModel=0000
BIOSdate=19850305
BIOSCRC16=8508
VideoSystem=CGA
VideoAdapter=Tandy 1000
Machine=Tandy 1000

DOS 2.11 Boot disk, 128KB no DMA.
 
Thanks for this! Interesting: The pure memory test was 10% slower without DMA. So let's see how that affected the rest of the metric timings:

OpcodeTest=14% slower (tests a mixture of every executable opcode)
VidramTest=4% slower (copies system RAM to video RAM and vice versa)
MemEATest=17% slower (tests the CPU's effective memory address calculations)
3DGameTest=12% slower (tests a mixture of instructions found in 8086-era 3-D game inner loops)

Despite the measurable slowdown without DMA, the overall synthetic Score of 4 is unaffected, which means it wouldn't really be noticeable to the naked eye.

The MemEATest slowdown being more than the others is interesting to me since it's just a mixture of stuff like "mov al,es:[bx+si+disp16]" in various combinations.

For anyone wondering if the system is slowed/affected while the floppy drive is rotating, that's not a factor since 1. the entire binary loads into RAM before it starts and 2. I explicitly wait for the floppy motor to stop turning before I start the calculations.
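For anyone who wants to check the percentages listed above, here is a minimal sketch that recomputes them from the two OUTPUT.INI dumps posted earlier in the thread. It assumes the metric values are timings (higher means slower); the dictionary and function names are made up for illustration.

```python
# Metric values copied from the two OUTPUT.INI results posted in this thread.
# Assumption: each value is a timing, so a larger number means a slower run.
with_dma = {        # 1000A, 512KB card with DMA enabled (640K total)
    "MemoryTest": 3823, "OpcodeTest": 1833, "VidramTest": 2148,
    "MemEATest": 2028, "3DGameTest": 1922,
}
without_dma = {     # same machine, base 128KB, no DMA card
    "MemoryTest": 4232, "OpcodeTest": 2103, "VidramTest": 2236,
    "MemEATest": 2375, "3DGameTest": 2152,
}


def slowdown_percent(fast, slow):
    """How much slower `slow` is than `fast`, as a percentage of `fast`."""
    return (slow - fast) / fast * 100


slowdowns = {
    name: slowdown_percent(with_dma[name], without_dma[name])
    for name in with_dma
}

if __name__ == "__main__":
    for name, pct in slowdowns.items():
        print(f"{name}: {pct:.1f}% slower without the DMA card")
```

The results land within about a percentage point of the figures quoted above; the small differences come down to rounding.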
 
I don't believe it's a valid test for comparing speed. Unless the code runs out of the same memory controller (MB RAM or Expansion card RAM), the wait states might not be the same. The lower 128K could be slower than the upper 512K. Is there any way to restrict the test to absolute memory address ranges?
 
I'm not sure I follow; the stub is small enough (53K total memory usage) that it ran in the lower 128K in both tests; isn't that a fair comparison? Or am I misunderstanding you?

I could adjust the code to run out of a specific memory location, but I'm hesitant to do that for several reasons not worth going into. I could write a small program that benchmarks the entire lower 768K (for Tandys that can go that high) of memory read and write speeds in 64k chunks though, would that help?
 
If it only uses the execution segment memory for the test (not just for the instructions, but for the sources/targets of your MOVs), then that is an apples-to-apples comparison. I didn't know how the benchmark worked. Though I'm still at a loss to explain why there is a difference. There is no way the external DMA controller is refreshing the lower 128K. It technically could, but a) it's redundant, and b) I doubt there is BIOS support for a dynamic switch.
 
The memory test uses the memory directly after execution segment memory. Because the stub is small, it falls into the first 128KB.

As for what the memory test is actually doing, it's just stressing the string instructions: https://github.com/MobyGamer/TOPBENCH/blob/master/METRICS/_MBLOCK.BOD

The "Vidram" test is different; as the name implies, it performs an instruction mix deliberately against video memory.
 
I realize this is fairly grotesque thread necromancy, but I think I may have some relevant observations to share based on some stuff I've been working on recently.

The memory test uses the memory directly after execution segment memory. Because the stub is small, it falls into the first 128KB.

Since this was left hanging and I know the answer now I'll fill in on why a test involving a Tandy 1000 with a DMA-equipped RAM card present or absent will produce misleading results:

The Tandy 1000 doesn't map RAM like either the PCjr (to my knowledge) or a regular PC. All memory in an unexpanded 1000/A/EX/HX is controlled by the video ASIC. During initialization, a small amount of the onboard RAM is mapped into the B000 segment and tested. Then the machine starts testing for expansion memory at 00000 hex, counting up in 128k blocks up to 512k (Tandy with 128k onboard) or 384k (EX/HX). If expansion memory is found, the video-controlled memory, minus 16k off the top, is mapped *after* the expansion RAM. Because of this behavior, if you run a benchmark that fits in a 128k Tandy 1000 on a machine that has *any* expansion memory (the minimum amount allowed is 128k), your benchmark will be running from expansion RAM, not the built-in RAM.
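To make the consequence concrete, here is a small sketch of the mapping as described above. It is only a model of that description, not code taken from the BIOS; the function names and the 100k example address are hypothetical.

```python
def memory_map(onboard_kb, expansion_kb):
    """Model the conventional-memory layout as (start_kb, end_kb, owner) tuples.

    Based only on the description in this post: expansion RAM is probed in
    128k blocks starting at address 0, and the video-controlled onboard RAM,
    minus 16k off the top, is mapped after it.
    """
    if expansion_kb < 0 or expansion_kb % 128 != 0:
        raise ValueError("expansion RAM is probed in 128k blocks")
    if expansion_kb == 0:
        # Unexpanded machine: everything is onboard RAM behind the video chip.
        # The 16k adjustment is only described for the expanded case, so it is
        # not modeled here (assumption).
        return [(0, onboard_kb, "onboard")]
    onboard_visible_kb = onboard_kb - 16
    return [
        (0, expansion_kb, "expansion"),
        (expansion_kb, expansion_kb + onboard_visible_kb, "onboard"),
    ]


def owner_of(layout, addr_kb):
    """Return which RAM backs the given address (in k), or None if unmapped."""
    for start_kb, end_kb, owner in layout:
        if start_kb <= addr_kb < end_kb:
            return owner
    return None


if __name__ == "__main__":
    BENCH_ADDR_KB = 100  # hypothetical address used by a small benchmark, below 128k
    for expansion_kb in (0, 512):
        layout = memory_map(onboard_kb=128, expansion_kb=expansion_kb)
        print(expansion_kb, layout, owner_of(layout, BENCH_ADDR_KB))
```

With no expansion the low address is backed by onboard RAM, while with a 512k card the same address is backed by the card, so the two benchmark runs earlier in the thread were exercising different RAM.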

The reason I'm positive about this is that I've just completed a build of a DMA-less RAM card that backfills an EX to 640k. This mapping behavior is really poorly documented unless you read the *right* Tandy 1000 manual (the original Service manual, not the "Technical" manual), and I *almost* sent off a PCB implementing it the wrong way. Caught it just at the last moment.

Anyway, I haven't looked in the original 1000 manual to check the situation there, but the EX manual says that the "Light Blue" timing chip can generate a variable number of wait states on the CPU when RAM behind "Big Blue", the video chip, is accessed. I don't know if the DMA RAM card also implements wait states(*), but it's very likely that you'll get more of them from video memory. So the results saying that a 128k Tandy 1000 is slower than one with the DMA card plugged in may well have nothing to do with DMA per se.

(*) I will get back to this next post.
 