
Disk Caching - EMS vs XMS

maxtherabbit

Veteran Member
Joined
Apr 23, 2019
Messages
2,153
Location
VA, USA
So under 100% equal conditions, EMS would always be faster, since it doesn't require the memory contents to be copied to conventional memory before they can be read by DOS. But what about the somewhat uncommon case of EMS that is limited to 8-bit memory accesses to the page frame? (Due to the LA lines only having 128kB resolution, and the page frame often sharing a 128kB region with a physical 8-bit device.)

Would 8-bit EMS still be faster on a zero-wait-state system? It's twice as many bus cycles plus the added wait states, versus having to copy the XMS data back to conventional memory (a 16-bit, zero-wait operation).
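To make the comparison concrete, here's a rough C sketch of the two access paths, assuming a Turbo/Borland-style real-mode DOS compiler (dos.h, int86, MK_FP); the function names and parameters are made up for illustration, and the XMS side is only sketched as the structure the driver expects rather than the actual far call.
Code:
#include <dos.h>
#include <stdio.h>

/* EMS path: ask the EMM for the page frame segment, map a logical page
   into it, and the cached data is directly addressable -- no copy needed. */
void show_ems_path(unsigned handle, unsigned logical_page)
{
    union REGS r;
    unsigned char far *frame;

    r.h.ah = 0x41;                       /* EMS: get page frame segment  */
    int86(0x67, &r, &r);
    frame = (unsigned char far *)MK_FP(r.x.bx, 0);

    r.h.ah = 0x44;                       /* EMS: map handle page         */
    r.h.al = 0;                          /* physical page 0 of the frame */
    r.x.bx = logical_page;               /* logical page within handle   */
    r.x.dx = handle;
    int86(0x67, &r, &r);

    printf("first cached byte: %02X\n", frame[0]);  /* read it in place */
}

/* XMS path: the cached data lives above 1MB, so it must first be copied
   into a conventional-memory buffer via XMS function 0Bh (Move Extended
   Memory Block). The driver entry point (from INT 2Fh, AX=4310h, returned
   in ES:BX) is far-called with AH=0Bh and DS:SI pointing at this struct. */
struct xms_move {
    unsigned long length;         /* byte count to copy (must be even)   */
    unsigned      src_handle;     /* 0 = source offset is a far pointer  */
    unsigned long src_offset;
    unsigned      dst_handle;     /* 0 = dest offset is a far pointer    */
    unsigned long dst_offset;
};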
 
I guess no one wants to hazard a guess. What disk performance benchmark would be most valid for testing the speed of a system-RAM-based cache? DISKTEST for DOS?
 
8 bit ISA timing is not the same as 16-bit, if that's what you are getting at. I would say accessing content directly off the 8-bit device would still be quicker than (possibly evicting a page then) copying from XMS then reading again.
 

I know. That's my whole point. Two 8-bit access cycles with all their extra associated wait states vs. a 16-bit zero-wait access with a data copy required.
 
My guess is that XMS is faster. Would this QBASIC benchmark do the job?
Code:
' create a 16 KB test file
buffer$ = SPACE$(16384)
OPEN "tempfile.tmp" FOR BINARY AS #1
PUT #1, , buffer$: CLOSE #1

' read it back repeatedly for 4 seconds
timeout = TIMER + 4
OPEN "tempfile.tmp" FOR BINARY AS #1
filepos = 1: count = 0
DO
    GET #1, filepos, buffer$: count = count + 1
LOOP UNTIL TIMER >= timeout
CLOSE #1
KILL "tempfile.tmp"         ' remove the temp file

' each GET reads 16 KB, so over 4 seconds the rate is count * 16 / 4 KB/s
PRINT count * 4; "KB/second"
 
I remember that PC Magazine had a comparison of several caches in different configurations, which would have been a good reference for choosing an optimal setup. I can't find an index saying which issue the article was in.

I think EMS was faster than XMS unless one is running DOS-extender programs. For those, copying between extended memory blocks takes less time than shifting back into real mode, switching the EMS page, sending that data over to the extended memory addresses that need it, and then resuming protected-mode operation.
 
My gut feeling is you'd need a machine significantly faster than a 286 (i.e., something where ISA's limitations seriously start turning into a pinch point) and a very specialized application to actually tell the difference either way.

Not that I can test it; I only have XT-class Tandy 1000s to play with, and experimentation with them shows that, according to the DISKTEST benchmark, their XT-CF-lite adapters are effectively as fast as a RAMDISK in EMS memory. Disk cache software just slows them down.
 
This question has multiple answers based on the operating environment. You can't just say "which is faster, EMS or XMS" and leave it at that. The answer changes based on the hardware and use case.

Would 8-bit EMS still be faster on a zero-wait-state system? It's twice as many bus cycles plus the added wait states, versus having to copy the XMS data back to conventional memory (a 16-bit, zero-wait operation).

But this isn't a practical configuration. If you have a 286, you'll be adding a 16-bit memory board, and you'd either have 16-bit EMS or 16-bit extended memory. The only reason this configuration would have existed in the past is if someone upgraded their hardware, didn't want to spend the money on upgrading their memory board, and just moved it over - but we don't have those kinds of limitations or situations today.

Different applications use EMS and/or XMS in different ways. The only way to truly answer your question is to maybe find an application that supports both, such as Norton Cache 2, and run some disk access benchmarks with the cache configured both ways.
 
That's not the case, at least with 5162/70s and their faithful clones.

Because IBM decided to latch the MEMCS16# signal on ALE, the ONLY way to get a 16-bit memory cycle is to assert it based solely on the LA[17:23] pipelined address, which only gives 128kB resolution. Therefore, if you want your 16-bit memory card to provide a 16-bit-accessible EMS page frame in the UMA, you have to have nothing but 16-bit memory devices in its 128kB region. If you are using any 8-bit video card, that rules out the A/B segments. If you are using an 8-bit EGA, that rules out C/D. And E is off the table since IBM decided to map it to ROM sockets on the motherboard.

So basically a stock AT w/EGA cannot have 16-bit EMS, which is what prompted this whole exercise
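To illustrate the 128kB granularity, here's a small, purely illustrative C snippet (not from any chipset documentation) that just computes which 128kB region - i.e. address >> 17, the value visible on LA[17:23] - some familiar UMA addresses fall into. It shows the common D000h page frame landing in the same region as 8-bit option ROM space at C800h.
Code:
#include <stdio.h>

int main(void)
{
    unsigned long addrs[] = { 0xB8000UL,    /* CGA/EGA text buffer        */
                              0xC8000UL,    /* typical 8-bit option ROM   */
                              0xD0000UL,    /* common EMS page frame      */
                              0xE0000UL };  /* 5170 motherboard ROM area  */
    int i;

    for (i = 0; i < 4; i++)
        printf("%05lX -> 128kB region %lu (%05lX-%05lX)\n",
               addrs[i],
               addrs[i] >> 17,                          /* LA[17:23] value */
               (addrs[i] >> 17) << 17,                  /* region start    */
               ((addrs[i] >> 17) << 17) + 0x1FFFFUL);   /* region end      */
    return 0;
}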
 

Either this isn't correct or I don't follow what's being said here.

The '286 will request a 16-bit operation with A0=0 and BHE#=0. Regardless of the pipeline evaluation, the expansion card 'replies' with M16 (or IO16) - or not, if it's 8-bit - only on the falling edge of BALE, *which happens after all address bits are present*. This is used by both the '286 and the bus steering logic: if it is not asserted on a read, the bus logic will transfer and latch the low byte, then, holding cpuready (i.e. inserting a CPU wait state), change SA0 to 1 and transfer the high byte, while enabling a copy from the low to the upper data path for the odd address, and re-present the latched byte, so providing the '286 with both.

I don't think there is any co-existence issue here with 8- and 16-bit devices within the same 128KB window or any other window for that matter.
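For what it's worth, here's a toy C model (purely illustrative, not real chipset logic) of the split described above: when MEMCS16# isn't asserted for a word read, the controller runs two 8-bit cycles and steers the second byte from the low data path up to the high one.
Code:
#include <stdio.h>
#include <stdint.h>

/* the 8-bit card only ever drives D0-D7; the value is arbitrary test data */
static uint8_t card_low_lane(uint32_t sa_addr) { return (uint8_t)(sa_addr * 7u); }

/* what the '286 sees on D0-D15 for a word read (A0=0, BHE#=0) when the
   card never asserted MEMCS16#: two 8-bit cycles plus byte steering */
static uint16_t word_read_split(uint32_t addr)
{
    uint8_t low  = card_low_lane(addr);       /* cycle 1: SA0 = 0, low lane */
    uint8_t high = card_low_lane(addr + 1);   /* cycle 2: SA0 forced to 1   */
    return (uint16_t)(low | (high << 8));     /* high byte copied to D8-D15 */
}

int main(void)
{
    printf("word at 0D0000h: %04X\n", word_read_split(0xD0000UL));
    return 0;
}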
 

No, what I said was 100% correct. Are you expecting the expansion card to have time to decode A[0:19] and assert MEMCS16# all within the span of the ALE pulse? If the card doesn't pull MEMCS16# low BEFORE ALE falls, you get an 8-bit bus cycle.
 

There is a timing diagram here that suggests that's exactly what happens. Or, sort of, anyway. It actually shows M16 getting sampled twice, with this footnote:

[4] M16 is sampled a second time, in case the adapter card did not activate the signal in time for the first sample (usually because the memory device is not monitoring the LA bus for early address information, or is waiting for the falling edge of BALE).

It also says this:

SBHE will be pulled low by the system board, and the adapter card must respond with IOCS16 or MEMCS16 at the appropriate time, or else the transfer will be split into two separate 8 bit transfers. Many systems expect IO16 or M16 before the command lines are valid. This requires that IO16 or M16 be pulled low as soon as the address is decoded (before it is known whether the cycle is I/O or Memory). If the system is starting a memory cycle, it will ignore IO16 (and vice-versa for I/O cycles and M16).

So, yes, there's a requirement that the card decode the address *fast*, but do note that according to the diagram LA17-23 *are* valid before the BALE pulse even starts, so there *is* time for the card to sort out whether it's the target before it has to answer.

This interpretation agrees with this second source; I guess I could yank out my old paper "ultimate PC hardware reference" and get a third opinion.

EDIT... Wait, now the light bulb is coming on. Only LA17 and higher are ready before BALE, so, yeah, now I'm wondering if you're right about that 128k granularity at least some of the time? The lesser address lines are only ready at BALE. This would explain why the standard specifies IOCS16 should be sampled later than MEMCS16. But the one source also says that some boards violate that and probe MEMCS16 and IOCS16 at the same (earlier) time, which would mean that they expect card decoders to be able to give a yes within the time span of the BALE pulse. Now I'm legit confused; if your card has fast enough decoding *can* you specify MEMCS16 on a sub128k-size block or not?!
 
The diagrams in the Mindshare book 'ISA System Architecture (3rd ed.)' show CS16 being sampled twice: sample one on the falling edge and sample two 62 ns later. As I interpret the text, it does support decoding on LA[] only, as you say... though "if" and "some" do feature. Have you scoped it? I'd be interested to see what's going on in reality. There is one additional 125 ns cycle in I/O, btw, after the falling edge.
 
Earlier in the above-mentioned text, referencing the ISA design including only LA[23..17], it says this was decided because a bank of 16 x 4164 64kbit DRAMs is 128KB (64K x 16 bits = 128KB), hence LA[23..17] provides enough to select the right bank, and all I/O addresses fall below the first 128KB boundary, hence the extra wait state. So yeah, 128KB boundaries for sure.
 
The diagrams in the Mindshare book 'ISA System Architecture (3rd ed.)' show CS16 being sampled twice: sample one on the falling edge and sample two 62 ns later. As I interpret the text, it does support decoding on LA[] only, as you say... though "if" and "some" do feature. Have you scoped it? I'd be interested to see what's going on in reality. There is one additional 125 ns cycle in I/O, btw, after the falling edge.

Yes, I've tested it. You can also see it from the schematics of the 5162/5170: they latch MEMCS16# on ALE; there is no second sample point. I've also confirmed this behaviour on the VLSI VL82C100 and VL82C200 series chipsets.

I also tested a 440BX based board and confirmed it DOES sample MEMCS16# a second time after ALE.
 
Thanks - I always learn something (my wife would say I didn't need to know) every time I visit this site ;)
 
I also tested a 440BX based board and confirmed it DOES sample MEMCS16# a second time after ALE.

I wonder how common the second-sample behavior is versus strict 5170 compatibility. Strictly speaking there never was an "official", universally ratified, non-draft specification for ISA; they *kind* of cooked one up in the process of creating EISA, which I imagine most cloners at least looked at when slapping together their systems and chipsets, but that was obviously several years after the fact.
 
if your card has fast enough decoding *can* you specify MEMCS16 on a sub128k-size block or not?!
I was unable to beat the clock, so to speak. Even a super basic address decoder consisting of nothing but a 688 decoding A[14:19] and a 74S05 driver for MEMCS16# was unable to beat the latch on the system board and get a 16-bit cycle.

This is of course on the original IBM style design. It worked fine on the 440BX based test system.
 