
Trixter: Preventing CGA Snow

mbbrutman

I know you're an expert here ... ;-0 I have the new version of my IRC client that does split screen console IO, and it is fast, but it is snowy too.

I think I need to be monitoring 3DA bit 0 to see when the coast is clear. Bit 0 indicates whether a vertical or horizontal retrace is happening.

- I think that vertical retrace time is 1.25ms. A rough calculation says that I have about 10000 cycles in that time to do my memory moves to avoid snow. How long is the safe time for horizontal retrace?

I'm assuming that my memcpy routine uses something reasonable, like a REP MOVSW. What else might a library have done for an efficient memcpy? (I need to try to predict what they did so that I know how much memory to move during the retrace times.)


Thx,
Mike
 
You're correct in that you need to monitor 3da for vertical or horizontal retrace to see if it's "safe" to write a character without snow. How you want to do this is up to your method, though: Do you write directly to screen memory, or do you write to a virtual buffer and then want to update that buffer to screen memory?

If the former, it is safe to monitor 3da for both horizontal and vertical retrace (it's a single bit) and just write one character/attribute when any sort of retrace is in progress. In my tests, copying a virtual off-screen buffer to screen memory, I was able to copy 200 16-bit words (character+attribute) during screen paint (one per scanline obviously) and another 470 16-bit words during vert. retrace.

For perfect snow handling writing directly to the screen, you pre-load everything since you've got very little time: Load DX with 3da, pre-fetch your character/attribute into AX (or just character into AL with attribute already in AH) and then do this:

Code:
## BH holds attribute, BL holds character
## DX = 3DA
## ES = b800 (or b000 mono)
## DI points to offset where you want to store the character in screen ram
## c_display_enable = 1 (bit 0 of the 3DA status byte, set during retrace)

  cli                      {I hate doing this, but interrupts screw up timing}
@WDN:                      {wait until we're out of some random retrace we may have started in}
  In   AL,DX               {grab status bits}
  test al,c_display_enable {are we in some random horizontal sync cycle?}
  jnz  @WDN                {if so, keep waiting}
@WDR:                      {wait until we're in either vert or horiz retrace}
  In   AL,DX               {grab status bits}
  shr  al,1                {shift bit into carry -- were we in retrace?}
  jnc  @WDR                {if not, keep waiting}
  xchg bx,ax               {get character and attribute into AX -- this is a 1-byte opcode}
  stosw                    {write it out - stosw is a 1-byte opcode; we need speed because otherwise we will see snow!}
  sti

Whether or not you think the rest of the system can handle the cli/sti is up to you.
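For anyone following along in C rather than assembler, here is a minimal sketch of the same two-phase wait: first get out of any retrace already in progress, then catch the start of the next one. The `port_reader` hook, `wait_retrace_start`, and the `fake_port` simulator are names made up for this illustration (they are not from any library). On real hardware the hook would wrap whatever port-input routine your compiler provides, and a C loop like this will be slower than the hand-tuned code above. The mask picks bit 0 (any retrace) or bit 3 (vertical retrace only).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CGA_STATUS_PORT  0x3DA
#define DISPLAY_ENABLE   0x01  /* bit 0: set during horizontal or vertical retrace */
#define VERTICAL_RETRACE 0x08  /* bit 3: set during vertical retrace only */

/* Hypothetical port-read hook, so the loop can run against real hardware
   or against a simulator. */
typedef uint8_t (*port_reader)(uint16_t port, void *ctx);

/* Two-phase wait: leave any retrace already in progress, then catch the
   start of the next one. Returns how many polls said "keep waiting". */
unsigned wait_retrace_start(port_reader in, void *ctx, uint8_t mask)
{
    unsigned polls = 0;
    assert(in != NULL);
    while (in(CGA_STATUS_PORT, ctx) & mask)      /* phase 1: already in retrace */
        polls++;
    while (!(in(CGA_STATUS_PORT, ctx) & mask))   /* phase 2: not in retrace yet */
        polls++;
    return polls;
}

/* Simulated status register: replays canned samples in order. */
typedef struct {
    const uint8_t *samples;
    size_t count;
    size_t next;
} fake_port;

uint8_t fake_read(uint16_t port, void *ctx)
{
    fake_port *p = (fake_port *)ctx;
    (void)port;
    if (p->next < p->count)
        return p->samples[p->next++];
    /* Out of samples: alternate so both loops always terminate. */
    return (p->next++ & 1) ? (DISPLAY_ENABLE | VERTICAL_RETRACE) : 0;
}
```

Skipping phase 1 is the "good enough" variant: it can catch the tail end of a retrace, which trades some snow for less waiting.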

There are lots of tradeoffs you can make to speed things up (don't disable interrupts, etc.) but they usually result in a tiny bit of snow along the left side of the screen. I figure if someone is bothered by snow, they want it GONE.

If you have an entire buffer to copy, you can do it in about 63ms without snow (copying during horizontal and vertical retrace takes about 4 screen refresh cycles, for a maximum update rate of 16fps). If you would like THAT routine, let me know and I'll post it here.
 
I don't mind a little snow, and if I can put in a just a little effort to reduce it I'm willing to do that. I don't need it perfect.

I'm scrolling 21 lines of the screen upward, so there is a fair amount of memory movement going on. I'm doing it with a sequence of memcpy calls, each one is moving 160 bytes. My quick and dirty plan is to put the poll for the vertical retrace before each call to memcpy, so the memcpy only fires when I know it is safe.

Unfortunately, I think this is going to slow things down quite a bit. If each vertical retrace takes 1.25ms, then at worst case I'll have to wait 1.25ms before doing each memcpy. That means a 26ms screen refresh. Something tells me that is slow ..

How else does one do vertical scrolling? I really want to use the existing memcpy routines and not write my own .. it's not bothering me that much.
 
I'm scrolling 21 lines of the screen upward, so there is a fair amount of memory movement going on. I'm doing it with a sequence of memcpy calls, each one is moving 160 bytes. My quick and dirty plan is to put the poll for the vertical retrace before each call to memcpy, so the memcpy only fires when I know it is safe.

You should be able to do two or three lines at a time that way, easily.

Unfortunately, I think this is going to slow things down quite a bit. If each vertical retrace takes 1.25ms, then at worst case I'll have to wait 1.25ms before doing each memcpy. That means a 26ms screen refresh. Something tells me that is slow ..

It is, so do two or three lines at a time. Also, don't read from screen memory and then write to screen memory, unless you're really really low on RAM and targeting a 64KB PC. I would maintain a buffer in system RAM, and update the system buffer (scrolling, etc.) and then just memcpy the updated portions (ie. the top 21 lines or whatever you mentioned).

How else does one do vertical scrolling? I really want to use the existing memcpy routines and not write my own .. it's not bothering me that much.

If you don't have any columns on the side of the screen, the obvious way to do scrolling on CGA is to have a "double-high" screen RAM buffer and set the start address down a few lines when you want to scroll. Scrolling is then instantaneous without any screen writes :) and you just write to the new lines, and copy the new line to the top of the buffer (so that when you reach the end of your double-high-page and snap back to the beginning, the previously-scrolled lines are there). However, this doesn't work with mono (hercules cards have multiple pages but mono has only one single 80x25 page), and you expressed a desire to KISS, so I think the memcpy of three lines at a time will work for you.

BTW, do you have a system that exhibits snow so that you can test your code? If so, is it 4.77MHz? If it's anything faster, your timings may not reflect the machines where snow handling is most needed.
 
You should be able to do two or three lines at a time that way, easily.

If that's the case, then my worst case screen refresh is down to 13ms or so. Isn't that about what a 60Hz refresh rate does anyway?

It is, so do two or three lines at a time. Also, don't read from screen memory and then write to screen memory, unless you're really really low on RAM and targeting a 64KB PC. I would maintain a buffer in system RAM, and update the system buffer (scrolling, etc.) and then just memcpy the updated portions (ie. the top 21 lines or whatever you mentioned).

I was being cheap with the RAM and I'm using some of the C library functions to do output, so I don't have a separate buffer in RAM that I'm using as a source. I could burn that extra 4K of memory, but then I would have to give up on using most of the C library functions for output and do everything in that RAM buffer.


If you don't have any columns on the side of the screen, the obvious way to do scrolling on CGA is to have a "double-high" screen RAM buffer and set the start address down a few lines when you want to scroll. Scrolling is then instantaneous without any screen writes :) and you just write to the new lines, and copy the new line to the top of the buffer (so that when you reach the end of your double-high-page and snap back to the beginning, the previously-scrolled lines are there). However, this doesn't work with mono (hercules cards have multiple pages but mono has only one single 80x25 page), and you expressed a desire to KISS, so I think the memcpy of three lines at a time will work for you.

I'm dense here .. I'm talking about the case where I'm adding new input at the bottom of the screen, and I need to push prior input up. I'm not even worried about having a backscroll buffer yet. Indulge my denseness and elaborate on this a little bit for me ...

BTW, do you have a system that exhibits snow so that you can test your code? If so, is it 4.77MHz? If it's anything faster, your timings may not reflect the machines where snow handling is most needed.

I'm on a 4.77MHz XT with almost everything original. I don't mind the snow .. just looking to cut it a little bit.
 
If that's the case, then my worst case screen refresh is down to 13ms or so. Isn't that about what a 60Hz refresh rate does anyway?

About (16ms), but there's no way you're updating 21 of the 25 lines in 13ms. There's not enough time for that at 4.77MHz.
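To put rough numbers on that, here is a small arithmetic sketch. It assumes a refresh of about 60Hz (so one vertical retrace roughly every 16.7ms) and a simple model where each copy waits for its own vertical retrace. Both are assumptions for illustration, and the function names are invented for this example.

```c
#include <assert.h>

#define FRAME_US 16667ul  /* assumed ~60Hz refresh: one vertical retrace per frame */

/* Retraces (frames) needed to copy `lines` rows when each retrace carries
   `lines_per_retrace` rows. This is a ceiling division. */
unsigned frames_needed(unsigned lines, unsigned lines_per_retrace)
{
    assert(lines_per_retrace > 0);
    return (lines + lines_per_retrace - 1) / lines_per_retrace;
}

/* Scroll time in microseconds under the one-copy-per-retrace model. */
unsigned long scroll_time_us(unsigned lines, unsigned lines_per_retrace)
{
    return (unsigned long)frames_needed(lines, lines_per_retrace) * FRAME_US;
}
```

For 21 lines this gives 21, 11, and 7 frames at one, two, and three lines per copy, or roughly 350ms, 183ms, and 117ms. A loop that only checks "am I in retrace right now?" can sometimes squeeze a second copy into the same retrace, so treat these as the pessimistic end.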

I was being cheap with the RAM and I'm using some of the C library functions to do output, so I don't have a separate buffer in RAM that I'm using as a source. I could burn that extra 4K of memory, but then I would have to give up on using most of the C library functions for output and do everything in that RAM buffer.

If you're really concerned about snow handling, you'll be doing that anyway :) If you're using the C library functions, don't they have a snow-less screen scroll?


I'm dense here .. I'm talking about the case where I'm adding new input at the bottom of the screen, and I need to push prior input up. I'm not even worried about having a backscroll buffer yet. Indulge my denseness and elaborate on this a little bit for me ...

I was being slightly facetious in that nobody would be willing to go through the trouble for perfect scrolling. For "perfect" scrolling, you don't actually move anything but instead change the memory address CGA starts reading the display contents from. For example, your screen is 80x25, but on CGA you have 16K of video ram. So the screen is a "window" on top of all that RAM. To scroll everything UP, you move the window DOWN. No memcpy required; you just paint the next two lines or whatever you're scrolling by.

The "gotcha" comes when you're at the end of screen memory. Then what? Well, to scroll when there's nowhere left to scroll, you can just copy everything visible to the top of RAM again, then move the window up to the top of RAM. Since the slight delay doing this is noticeable, you can instead be copying a single line or two as necessary so that, when it's time to jump up again, the lines are already there.

Like I said, this is a hassle and most programs don't do this.
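A minimal sketch of the "move the window" step, for 80x25 text mode on CGA. The start address lives in the 6845's R12/R13 registers, reached through ports 3D4 (index) and 3D5 (data), and it counts character+attribute words rather than bytes. The `port_writer` hook, `set_top_row`, and the `port_log` recorder are hypothetical names for this example. On real hardware the hook would wrap your compiler's port-output routine. The copy-back bookkeeping described above is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define COLS          80u
#define VISIBLE_ROWS  25u
#define VRAM_WORDS    8192u  /* 16K of CGA RAM as character+attribute words */
#define CRTC_INDEX    0x3D4  /* 6845 index register on CGA */
#define CRTC_DATA     0x3D5  /* 6845 data register on CGA */
#define R_START_HI    12     /* 6845 R12: start address, high byte */
#define R_START_LO    13     /* 6845 R13: start address, low byte */

/* Hypothetical port-write hook, so the logic can be tried off-hardware. */
typedef void (*port_writer)(uint16_t port, uint8_t value, void *ctx);

/* Last top row for which all 25 visible rows still fit in video RAM. */
unsigned max_top_row(void)
{
    return VRAM_WORDS / COLS - VISIBLE_ROWS;
}

/* Point the visible window at `top_row`; returns the word offset programmed. */
uint16_t set_top_row(unsigned top_row, port_writer out, void *ctx)
{
    uint16_t start;
    assert(out != NULL && top_row <= max_top_row());
    start = (uint16_t)(top_row * COLS);
    out(CRTC_INDEX, R_START_HI, ctx);
    out(CRTC_DATA, (uint8_t)(start >> 8), ctx);
    out(CRTC_INDEX, R_START_LO, ctx);
    out(CRTC_DATA, (uint8_t)(start & 0xFF), ctx);
    return start;
}

/* Recorder standing in for real port writes. */
typedef struct {
    uint16_t port[4];
    uint8_t value[4];
    unsigned n;
} port_log;

void log_write(uint16_t port, uint8_t value, void *ctx)
{
    port_log *log = (port_log *)ctx;
    if (log->n < 4) {
        log->port[log->n] = port;
        log->value[log->n] = value;
        log->n++;
    }
}
```

With these constants the window can slide down 77 rows before it runs out of RAM, which is the point where the pre-copied lines at the top of the buffer have to be ready.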

If you'd like this behavior for "free", do this:

  • Use 100% BIOS routines for screen writing AND scrolling
  • Tell people if they don't like the speed to run NNANSI which hooks the BIOS routines and makes them much faster, as well as performing the aforementioned hardware scrolling trick

I'm on a 4.77MHz XT with almost everything original. I don't mind the snow .. just looking to cut it a little bit.

I don't mind snow either, but it drives some people batsh*t. So, try doing the bare minimum (ie. wait for vert retrace before moving stuff) or, and I would seriously consider this, use 100% BIOS routines (which your C library can be configured to do, I'm sure) and package the NNANSI distribution with it.
 
I did a little reading, a little disassembling, and a little math. Here is the outcome so far.

If the compiler is using some form of REP and MOVSW (and it is) then it will take approximately 25 cycles per word to move, not counting the setup time to get everything into the correct registers. I'm appalled at how slow this is per word .. Assuming that I interpreted things correctly, that's about 2000 cycles to move 160 bytes. I've got about 10,000 cycles if I hit the vertical refresh dead on. I decided to move 320 bytes per memcpy, which should consume 4000 cycles - that leaves me plenty of slop time for the call to the memcpy routine and the setup time. I can probably do four lines at a time.
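As a sanity check on the budget, here is the same arithmetic as a short sketch. It uses only the estimates quoted in this thread (1.25ms of vertical retrace, 25 cycles per word) plus the 4.77MHz clock, and it ignores call overhead and setup time. The function names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

#define CPU_HZ          4772727ul  /* 4.77MHz 8088 clock */
#define RETRACE_US      1250ul     /* the thread's 1.25ms vertical retrace estimate */
#define CYCLES_PER_WORD 25ul       /* the thread's REP MOVSW estimate */
#define BYTES_PER_LINE  160ul      /* 80 columns x (character + attribute) */

/* CPU cycles available in one vertical retrace (64-bit intermediate so the
   multiplication cannot overflow a 32-bit long). */
unsigned long retrace_cycles(void)
{
    return (unsigned long)((uint64_t)CPU_HZ * RETRACE_US / 1000000ul);
}

/* Cycles to move `bytes` bytes, one 16-bit word at a time. */
unsigned long cycles_for_bytes(unsigned long bytes)
{
    return bytes / 2 * CYCLES_PER_WORD;
}

/* Whole text lines that fit into one vertical retrace. */
unsigned long lines_per_retrace(void)
{
    return retrace_cycles() / cycles_for_bytes(BYTES_PER_LINE);
}
```

Under those figures the window works out to about 5,965 cycles rather than 10,000: two lines (4,000 cycles) fit and four lines (8,000 cycles) do not. That lines up with the test result reported further down, where two lines copy cleanly and four leave traces of snow.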

The check for the vertical retrace was a tight little inline assembler loop .. read from 0x3DA, AND with 8, jump to the top of the loop if the result of the AND is zero. The call to memcpy is probably expensive, but small compared to the overhead of the memory movement.

I'm a little disappointed because there is a noticeable lag compared to what I had, but it is not a problem. I won't know if I was successful until I test on the XT tomorrow. (The PCjr doesn't get snow due to circuitry differences.)

And wow, memory movement on an 8088 class processor sucks badly. I knew this before and I've tried to minimize memory movement whenever possible, and now I remember why.


Mike
 
If the compiler is using some form of REP and MOVSW (and it is) then it will take approximately 25 cycles per word to move, not counting the setup time to get everything into the correct registers. I'm appalled at how slow this is per word .. Assuming that I interpreted things correctly, that's about 2000 cycles to move 160 bytes. I've got about 10,000 cycles if I hit the vertical refresh dead on. I decided to move 320 bytes per memcpy, which should consume 4000 cycles - that leaves me plenty of slop time for the call to the memcpy routine and the setup time. I can probably do four lines at a time.

It is indeed very slow per word. Welcome to the dirty little secret of why a lot of people hate the 8088, and why the 6502/6809/Z80 8-bit computers are competitive at lower clock speeds -- they need only 1 cycle per memory access, while 8088 needs four. So 4.77MHz 8088 / 4 = barely an advantage over a C64. The trick to optimizing for a PC is to use the advantage that the internal registers are 16-bit and you can perform 16-bit operations on them, and also to use the "special" instructions as much as possible (xlat, rep xxxx, etc.). Only then can you start to pull ahead of the other micros.

The check for the vertical retrace was a tight little inline assembler loop .. read from 0x3DA, AND with 8, jump to the top of the loop if the result of the AND is zero.

I hope there's a loop before that loop that checks to make sure you're not already in retrace. Otherwise, your check ("I'm in retrace? Good, let's go!") could happen at ANY point of the retrace, including the very bottom, which means you'd do your copy and see snow.

I'm a little disappointed because there is a noticeable lag compared to what I had, but it is not a problem. I won't know if I was successful until I test on the XT tomorrow.

Just leave it as an option. I'm so used to CGA snow it doesn't bother me any more, so I usually make it a point to turn it off for speed. In fact, I've patched a few programs to JMP around the CGA snow code because they were unbearably slow to work with otherwise (very very bad screen handling). In some cases, CGA snow shows odd programming practices: One of my favorite text editors, Aurora, "shows" me through snow that it updates the ENTIRE screen on EVERY keypress. This is odd and somewhat irritating, but since it's the fastest editor with functional undo that I've ever used, I still use it every week.

Just remember: If you're copying FROM screen ram TO screen ram, that is even more of an unnecessary speed hit; the proper thing to do is copy from system RAM to screen RAM. If you don't want to write your own routines, then consider this trick: During an idle time period (or maybe immediately after a scroll operation), copy screen RAM to an internal buffer. When it comes time to scroll, do the 21-lines copy (or however much you're scrolling by) from system ram instead. This turns your screen-to-screen rep movsw @ 160KB/s into a system-to-screen rep movsw @ 240KB/s.
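Here is a minimal sketch of that trick using plain memcpy, so no custom copy routine is needed. `snapshot_rows` and `scroll_up_from_shadow` are hypothetical names, and the screen is passed in as an ordinary pointer. On a real DOS build it would be a far pointer to B800:0000 with the far-memory copy routine your compiler supplies, and each memcpy would still sit behind whatever retrace wait you settle on.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define COLS 80u  /* character+attribute words per text row */

/* Idle-time step: refresh the system-RAM copy from screen RAM. */
void snapshot_rows(uint16_t *shadow, const uint16_t *screen, unsigned rows)
{
    assert(shadow != NULL && screen != NULL);
    memcpy(shadow, screen, (size_t)rows * COLS * sizeof(uint16_t));
}

/* Scroll step: write rows 1..rows-1 of the shadow over rows 0..rows-2 of
   the screen, so every read comes from system RAM. The shadow is shifted
   as well to keep both copies in step. The last row is left untouched for
   the caller to overwrite with new text. */
void scroll_up_from_shadow(uint16_t *screen, uint16_t *shadow, unsigned rows)
{
    size_t bytes;
    assert(screen != NULL && shadow != NULL && rows > 0);
    bytes = (size_t)(rows - 1) * COLS * sizeof(uint16_t);
    memcpy(screen, shadow + COLS, bytes);   /* system RAM -> screen RAM */
    memmove(shadow, shadow + COLS, bytes);  /* overlapping move inside the shadow */
}
```

Anything written to the screen through the C library after the scroll makes the shadow stale, so the snapshot has to be taken again before the next scroll.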

And wow, memory movement on an 8088 class processor sucks badly. I knew this before and I've tried to minimize memory movement whenever possible, and now I remember why.

Yep :-( Maybe when I show you my full-screen rotozoomer running at 9fps it will seem slightly more impressive :)
 
I was being slightly facetious in that nobody would be willing to go through the trouble for perfect scrolling. For "perfect" scrolling, you don't actually move anything but instead change the memory address CGA starts reading the display contents from.

DOS Plus does this. It does mean that on my copy of DOSEMU, which doesn't emulate the CGA CRTC, the screen doesn't scroll at all, and I have to switch to mono mode.
 
Since we're talking about it... did any non-IBM video hardware ever have "snow", or was it a phenomenon specific to the IBM CGA board? I've personally never seen a clone with snow... not even a really cheap Emerson 8088. (A really huge dot pitch on the Emerson CGA monitor, yes, but no snow.)
 
Since we're talking about it... did any non-IBM video hardware ever have "snow", or was it a phenomenon specific to the IBM CGA board? I've personally never seen a clone with snow... not even a really cheap Emerson 8088. (A really huge dot pitch on the Emerson CGA monitor, yes, but no snow.)

AT&T PC 6300 had the same flaw. It was less noticeable, because it was 400 lines instead of 200, and even less so on a monochrome monitor because half of the emulated CGA colors were somewhat dim. But it was there.
 
AT&T PC 6300 had the same flaw. It was less noticeable, because it was 400 lines instead of 200, and even less so on a monochrome monitor because half of the emulated CGA colors were somewhat dim. But it was there.

Hmm, there was a PC 6300 (with the color monitor) in one of my high school classes and I don't distinctly remember the snow, so I guess it is less noticeable than on IBM's CGA. One thing I do remember is that the thick anti-glare(?) coating on the screen gave it strangely muted, pastel-like colors. The IBM 5153 monitor seemed to have the most vibrant colors to me, even compared to the otherwise excellent Tandy and Zenith RGB monitors.
 
Just remember: If you're copying FROM screen ram TO screen ram, that is even more of an unnecessary speed hit; the proper thing to do is copy from system RAM to screen RAM. If you don't want to write your own routines, then consider this trick: During an idle time period (or maybe immediately after a scroll operation), copy screen RAM to an internal buffer. When it comes time to scroll, do the 21-lines copy (or however much you're scrolling by) from system ram instead. This turns your screen-to-screen rep movsw @ 160KB/s into a system-to-screen rep movsw @ 240KB/s.

I'm being dense here, but walk me through it.

The REP MOVSW loop is going to take 25 cycles per word no matter where I copy to or from. The memory on the CGA card is accessed at the same speed as the expansion or motherboard memory, so screen-to-screen and system-to-screen transfers take the same time.

The only difference will be that I might run into less snow because, if I time things right, I'll be accessing the screen memory half as much. But I still only have the time during a vertical retrace to draw a line, and I still have to move 160 bytes per line, no matter what the source and target are. And that 160 bytes takes the same time to transfer, so I don't see how the performance will improve or the snow will decrease.

(Unless I'm wrong about how slow the CGA memory is - if it is slower than expansion or main memory on the motherboard, then it makes perfect sense.)


I hope there's a loop before that loop that checks to make sure you're not already in retrace. Otherwise, your check ("I'm in retrace? Good, let's go!") could happen at ANY point of the retrace, including the very bottom, which means you'd do your copy and see snow.

Nope, I just dropped out of the loop as soon as I saw it was in vertical retrace. Yes, I might have caught it on the tail end of the retrace and thus I'll get snow. I just want to knock the snow down a bit, not totally crush the screen update performance.

I'll make the snow code optional. I don't need it on my Jr, or my EGA/VGA class machines so it really is just unnecessary punishment. But wow, the CGA cards really suffer badly ..

With the snow code in I can copy 2 lines at a time perfectly. If I copy 4 lines at a time there are varying traces of snow. Performance suffers quite a bit in either case. The C runtime routines are pretty good, probably because they are making use of the horizontal retrace too. The BIOS routines? Forget it .. unusable.

And really the problem is only when you do something that causes a lot of spew, like the MOTD when you first sign onto the IRC server, or doing a 'names' command. For normal chatting any of the methods work.
 
The REP MOVSW loop is going to take 25 cycles per word no matter where I copy to or from. The memory on the CGA card is accessed at the same speed as the expansion or motherboard memory, so screen-to-screen and system-to-screen transfers take the same time.

Wrong. :) The RAM on the CGA card is slower than system RAM.

My point is: If you are scrolling by moving RAM around, it is faster to REP MOVSW from system RAM to screen RAM than it is from screen RAM to screen RAM.

I tested the app tonight and the scrolling was more than acceptable, so I wouldn't worry about it too much.

(Unless I'm wrong about how slow the CGA memory is - if it is slower than expansion or main memory on the motherboard, then it makes perfect sense.)

Bingo.

For normal chatting any of the methods work.

I agree.
 
Eh, so what's the technical reason for the CGA memory being slower than the system memory? Naively I would think that it's a bank of memory chips on the expansion bus with some addressing logic that also happens to be shared with the video controller. Is the CGA card generating wait states when this memory is accessed?
 
Eh, so what's the technical reason for the CGA memory being slower than the system memory? Naively I would think that it's a bank of memory chips on the expansion bus with some addressing logic that also happens to be shared with the video controller. Is the CGA card generating wait states when this memory is accessed?

It's exactly the same reason why the first 128KB of RAM of a PCjr is slow, but RAM provided by expansions isn't. Same principles.
 
Ah, but the PCjr has extra hardware that arbitrates access to that memory and thus there is no snow. The CGA just seems to be slower but still snowy. :)
 
Ah, but the PCjr has extra hardware that arbitrates access to that memory and thus there is no snow. The CGA just seems to be slower but still snowy. :)

Okay, smarty-pants :) The PCjr inserted a wait state for the display hardware because it couldn't keep up, right? Same for the CGA card. The speed is indeed the same (slower than system ram), and the CGA is indeed "snowy-er" ;-)

What causes the snow is that early CGA and clones (AT&T PC 6300, etc.) did *not* use dual-ported RAM, unlike every other card that got it right. So when you write to CGA RAM at the same time the card is trying to read from CGA RAM, the designers correctly forced the card to lose that decision (otherwise, if your write lost, your byte would not get written). When the CGA card loses, it draws garbage for that particular character cell. The attribute usually makes it through, though -- you'll notice that your "snow" is "colored" the same as the attributes already in memory.
 