
PDP-8/e Extended Memory & Checkerboard Test trouble

thunter0512 (Veteran Member; joined Sep 27, 2020; 839 messages; Perth in Western Australia)
I have been restoring and fixing my new PDP-8/e, leaving core until last, after first verifying that everything else works using Roland's 32k memory board and the acceptance tests I have.

The 8/e had two MM8-EJ 8k core memory board sets one of which was faulty.
I also had a third set from Jack which too had a fault.
Both faults were on the G111 Sense/Inhibit boards.
I fixed both boards and the 3 x 8k cores were fully functional when exercised via the front panel.

To verify that it is robust I first ran the "PDP-8/e Memory Extension and Time Share Control Test" (file name "maindec-8e-d1ha-pb") with all 6 fields selected for a few minutes and all was well.
I then ran the "PDP-8/e Extended Memory & Checkerboard Test" from Vince's site (file name "maindec-08-dhkma-d-pb") with all 6 fields selected for several hours without a single error.

Then I added the front panel plastic & bezel and the case lid and reran the "PDP-8/e Extended Memory & Checkerboard Test".
It failed within a few minutes and has been failing ever since.


Here are the steps I tried to resolve the problem:
  • checked all power rails (5.12V, -15.08V, 15.12V) - all are fine
  • removed the front panel plastic & bezel and case lid and let the system cool to ambient temperature - it still fails (all subsequent tests were done without the front panel plastic and case lid).
  • checked all power rails with oscilloscope while the test was running - no significant "wiggle" or noise.
  • replaced all core with Roland's 32k memory - it still fails.
  • one-by-one I replaced all CPU boards (M8330, M8300, M8310, M837) and bus terminator (M8320) with known good spares - it still fails.
  • replaced the front-panel PCB - it still fails.
  • replaced the M8655 UART board - it still fails.
  • consolidated everything into the first Omnibus (removing the two M935s and moving the terminator (M8320) into the last slot of the first Omnibus) - it still fails.
The only boards which remained throughout are the first Omnibus board and Roland's M847 boot loader board.

Finally I ran the "PDP-8/e Extended Memory & Checkerboard Test" just on the first MM8-EJ core board set servicing fields 1 and 2 (but leaving the core boards for field 3/4 and 5/6 plugged in) and I got no errors.
Trying again with fields 1 - 6 it almost immediately fails in field 1 which just tested good.

I have not yet retried with just Roland's 32k board, limiting the test to fields 1 and 2.

I also tried on my LAB-8/e and the "PDP-8/e Extended Memory & Checkerboard Test" runs without any problems.

I am reluctant to one by one swap the LAB-8/e boards with the PDP-8/e boards - meaning trying to break the LAB-8/e with the PDP-8/e boards.
The LAB-8/e is too nice and precious to stuff around with.

I could understand that the case lid and the front-panel plastic changes the airflow causing some localised overheating and subsequent fault, but I think I have eliminated almost everything.

What am I missing here?

I would appreciate any help/advice.


Thanks
Tom
 
I ran the "PDP-8/e Extended Memory & Checkerboard Test" just on the first MM8-EJ core board set servicing fields 1 and 2 (but leaving the core boards for field 3/4 and 5/6 plugged in) and I got no errors.
Trying again with fields 1 - 6 it almost immediately fails in field 1 which just tested good.
That seems like a useful clue. Your references to the memory fields seem a bit off. The first MM8-EJ board set should be for fields 0 & 1. You shouldn't have a field 6 with only three 8K board sets. Field 6 would be the 7th field (of 8 total) if you had a fourth MM8-EJ board set. Are you confident you have the 3 EMA jumpers set correctly on each G111?
I'd try running the test on fields 0&1, then 2&3, then 4&5 with all 6 of the core memory boards installed.
Or try narrowing down the range of memory over which the test runs OK. For example: can you run fields 2,3,4,5 OK? Just 4&5 OK? Does 0&1 run OK if you have the board set for fields 4&5 out? etc... There are enough combinations to try to keep you busy for a while.
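
One way to keep track of those combinations is to enumerate them up front and tick them off. A minimal sketch in Python, assuming hypothetical labels (set A, set B, set C) for the three MM8-EJ board sets jumpered for fields 0-1, 2-3 and 4-5, and assuming the set holding field 0 always stays installed because the diagnostic is loaded there:

```python
from itertools import combinations

# Hypothetical labels; each 8K board set is assumed to cover two 4K fields.
BOARD_SETS = {"set A": (0, 1), "set B": (2, 3), "set C": (4, 5)}


def test_plan(board_sets):
    """Yield (installed board sets, fields to test) for each combination."""
    names = sorted(board_sets)
    for n_installed in range(len(names), 0, -1):
        for installed in combinations(names, n_installed):
            installed_fields = {f for name in installed for f in board_sets[name]}
            # Assumption: the diagnostic lives in field 0, so skip any
            # configuration in which field 0 has no memory installed.
            if 0 not in installed_fields:
                continue
            for n_tested in range(1, len(installed) + 1):
                for tested in combinations(installed, n_tested):
                    fields = sorted(f for name in tested for f in board_sets[name])
                    yield installed, fields


plan = list(test_plan(BOARD_SETS))
for installed, fields in plan:
    print(f"installed: {', '.join(installed):<20} test fields: {fields}")
```

Recording pass or fail next to each printed line makes it easier to see whether a failure follows a particular board set or a particular combination.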
 
Your references to the memory fields seem a bit off.

Sorry for my confusing use of fields. Of course you are right about the off-by-one field numbering. I mis-wrote while sitting at my PC, away from my electronics lab with the PDP-8/e.
What I meant is 1st and 2nd field (field numbers 0 and 1) and 3rd and 4th field (field numbers 2 and 3) and 5th and 6th field (field numbers 4 and 5).
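
The corrected numbering can be written down as a small mapping. This is only an illustrative sketch in Python; the function name is made up, and it assumes the EMA jumpers place the board sets in consecutive order starting at field 0:

```python
FIELD_WORDS = 4096   # one PDP-8 memory field holds 4K 12-bit words
FIELDS_PER_SET = 2   # one 8K MM8-EJ board set spans two fields


def fields_for_board_set(ordinal):
    """Return the field numbers served by the Nth board set (1-based).

    Assumes board sets are jumpered consecutively from field 0.
    """
    first = (ordinal - 1) * FIELDS_PER_SET
    return list(range(first, first + FIELDS_PER_SET))


for ordinal in (1, 2, 3):
    fields = fields_for_board_set(ordinal)
    words = len(fields) * FIELD_WORDS
    print(f"board set {ordinal}: field numbers {fields} ({words} words)")
```

With three board sets the highest field number is 5, which matches the point made in the reply above that field 6 would need a fourth board set.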

Everything else I wrote (I hope) was correct.

As to the 3 EMA jumpers, they are correct. Keep in mind that the system ran the checkerboard test for several hours without any problem.
Only after I installed the plastic front panel & bezel and the lid did it fail.
So likely something died due to reduced or different air flow. Unfortunately my search for the "something" hasn't yielded any result so far.

I have reluctantly decided to "sacrifice" my little untouchable precious "princess" - the LAB-8/e - and start testing boards there. :)
I hope I won't regret it. There is a sad old story to this ... maybe another day.

Thanks
Tom
 
Progress of sorts!

I made a surprising discovery. I tried the 32k SRAM board (Roland's version) out of the LAB-8/e in the PDP-8/e and it works perfectly when I run the "PDP-8/e Extended Memory & Checkerboard Test".
I used an identical second board previously and it failed in the same way as the real MM8-EJ core boards.

At first I was puzzled, but when looking at the two boards side-by-side I realised that on the "good" board I used Cypress CY62256NLL-70PXC SRAM I bought on Ebay, and the "bad" board I used Lyontek LY62256PL-55LL I bought from Farnell (Element14 in Australia).

I now remember that I was looking for a non-Ebay supplier and found the Lyontek SRAM on Farnell with seemingly better specs and tried it. The Lyontek SRAM worked perfectly fine running OS/8 but I never did a checkerboard test on it until now.

So I went down a terrible rabbit hole on the bad assumption that my 32k SRAM was working perfectly.
Once the checkerboard test failed on the "bad" 32k SRAM board I incorrectly thought that core must be good and the fault is somewhere else.
I spent too much time and thought on identifying the culprit when it was a failed core all along.

I have since swapped the Lyontek SRAM in the "bad" board for a spare set of Cypress SRAM, and the "bad" board has now also been passing the checkerboard test for the past 45 minutes.
At least now we know that the cheap Lyontek LY62256PL-55LL from Farnell is not suitable (although it seemed stable with OS/8).

Oh well - an intermittent core fault will be challenging to diagnose and fix.
All my core fixes until now were on boards which had permanent faults on individual bits.

Interestingly the 3 core memory board sets worked perfectly for hours until I put the case lid and front-panel plastic on after which something changed/broke permanently.

It will require some head scratching to come up with a good fault finding strategy.

Tom
 
Oh well - an intermittent core fault will be challenging to diagnose and fix.
All my core fixes until now were on boards which had permanent faults on individual bits.
In a pre-Omnibus machine, I'd suspect a core thermistor. I haven't debugged Omnibus core enough to know what they do about temperature compensation.

Interesting that the Lyontek SRAM is failing. I was able to pass diagnostics with a sample I bought and a later (less noisy) prototype of the 32K card. (Only to find that while I was testing, the Lyontek parts also had become unobtainium.)

DHKMAD is apparently quite difficult to pass. I tried several "scaled down" versions in the hope of isolating the problems, but without much success in making unpredictable failures predictable.

Vince
 
Interesting that the Lyontek SRAM is failing. I was able to pass diagnostics with a sample I bought and a later (less noisy) prototype of the 32K card. (Only to find that while I was testing, the Lyontek parts also had become unobtainium.)

Element14/Farnell/Avnet still sell the part with 549 global stock at this moment:


Of course with it not passing checkerboard it is not very useful for the 8/e (although I ran OS/8 with it for a little while).

I now wonder if the problem is that the Lyontek part is 55 ns versus the Cypress being a 70 ns part.
Maybe the Lyontek is too fast and picks up garbage/noise.
 
My best guess is that the issue with the RAM chips is that there are several nanoseconds of shoot-through current during memory cycles. Shoot-through is when multiple gate outputs are tied together and some of the gates are actively driving high while others are actively driving low. In this case it is caused by the bus transceivers driving into the RAM chip outputs. You can see this pretty easily by looking at the power supply at the RAM chips and seeing the voltage jump around and the ground pin bouncing. Why it does not affect the Cypress chips is a mystery. I will be investigating this after I finish up the console serial disk project.
 
DHKMAD is apparently quite difficult to pass. I tried several "scaled down" versions in the hope of isolating the problems, but without much success in making unpredictable failures predictable.

Vince
Hmmm - I have now tried various permutations and now all seem to work (again). I am confused. VERY confused.

I am curious about your statement "DHKMAD is apparently quite difficult to pass.".
Where did the "apparently" come from? Is "DHKMAD" a known fragile (broken?) diagnostic?

When it works, it works perfectly for hours.
When it fails, it fails consistently.

Other than maybe varying Omnibus slots I cannot see a pattern.
I plug the boards roughly back into the same Omnibus slots while testing different combinations, but did not yet mark them to be sure they are exactly the same every time.
I did not change anything on the Omnibus or on any boards when I fitted the case lid and front panel, but it worked perfectly before (for many hours) and failed consistently after ... until now.

Temperature is another possibility, although there is no more than maybe 2 or 3 degrees Celsius variation from day to day at the moment in my electronics lab.

Maybe I should start another more predictable hobby ... knitting ... gardening ... beer drinking ... :)
 
I am curious about your statement "DHKMAD is apparently quite difficult to pass.".
Where did the "apparently" come from? Is the "DHKMAD" a known fragile (broken?) diagnostics?
Any of my memory boards seem to pass the other diagnostics, but only a few can pass DHKMAD, especially for an hour or two.

I haven't had any particular problem with the reliability of the diagnostic itself. OTOH, when it fails it sometimes chews itself up a bit, so that it can't go on properly. That doesn't seem surprising in a memory diagnostic.

Vince
 
Other than maybe varying Omnibus slots I cannot see a pattern.
Temperature is another possibility, although there is no more than maybe 2 or 3 degrees Celsius variation from day to day at the moment in my electronics lab.
I'd note the slots in use during failure and success, as I have seen defective and/or dirty slots before.

If that's not it, I'd look at the temperature thing, as I think that's what DHKMAD was originally designed for. I'd also do the usual bit of checking the supply rails for voltage, noise, etc.

Could also be a weak gate or bad solder joint in address decode, bus buffers, etc. Those usually generate patterns in the failure data, though.

I don't have a lot of experience debugging Omnibus core. Most of mine is boxed up, with the SRAM boards in the 8/E. I do plan to re-assemble an 8/A soon, and that might run core briefly. The plan is to test boot/ram prototypes in it, though, so that would likely be only long enough to convince me that the 8/A is working again.

Vince
 
Are you mixing core planes with sense/driver/decoder boards?
The MM8-EJ boards are matched sets so I don't mix them.
Each of the 3 boards in a set is clearly marked (by me) to belong to that set so that I won't mix them up with other boards.

Last night the checkerboard diagnostics (maindec-08-dhkma-d-pb) started failing again in the 3rd set of MM8-EJ core (field numbers 4 and 5).

I subsequently ran the extended address diagnostics (maindec-8e-d1fb-pb) and it failed too:

EA8-E EXT MEM ADDR TEST

SETUP SR & CONT
6 STACKS IN THIS SYSTEM
STACKS SEL'D ARE 5 4
NO RELOCATION, PROG IN STACK 0
PR LOC ADDR GOOD BAD TEST
01303 50001 0001 0000 1
01303 50011 0011 0000 1
01303 50021 0021 0000 1
01303 50031 0031 0000 1
01303 50041 0041 0000 1
01303 50051 0051 0000 1
01303 50061 0061 0000 1
01303 50071 0071 0000 1
01303 50101 0101 0000 1
01303 50111 0111 0000 1
01303 50121 0121 0000 1
01303 50131 0131 0000 1
01303 50141 0141 0000 1
01303 50151 0151 0000 1
01303 50161 0161 0000 1
01303 50171 0171 0000 1
01303 50201 0201 0000 1
01303 50211 0211 0000 1
01303 50221 0221 0000 1
01303 50231 0231 0000 1
01303 50241 0241 0000 1
01303 50251 0251 0000 1
...

This seems to say that all addresses ending with 1 (and only those) are not written (or written in the wrong place).
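
That reading of the printout can be checked mechanically. A minimal sketch in Python, assuming the columns are PR LOC, ADDR, GOOD, BAD and TEST, that all numbers are octal, and that the five-digit ADDR is a 3-bit field number followed by a 12-bit address; only the first few lines of the printout are included:

```python
# Excerpt of the diagnostic printout (columns: PR LOC, ADDR, GOOD, BAD, TEST).
LOG = """\
01303 50001 0001 0000 1
01303 50011 0011 0000 1
01303 50021 0021 0000 1
01303 50031 0031 0000 1
01303 50101 0101 0000 1
01303 50251 0251 0000 1
"""


def parse_line(line):
    """Split one error line into octal-decoded fields (assumed layout)."""
    pr_loc, addr, good, bad, test = line.split()
    ext = int(addr, 8)                  # 15-bit extended address
    return {
        "pr_loc": int(pr_loc, 8),
        "field": ext >> 12,             # top 3 bits select the 4K field
        "addr": ext & 0o7777,           # low 12 bits address within the field
        "good": int(good, 8),
        "bad": int(bad, 8),
        "test": int(test),
    }


records = [parse_line(line) for line in LOG.splitlines() if line.strip()]
fields = {r["field"] for r in records}
low_digits = {r["addr"] & 0o7 for r in records}

print("fields seen:", sorted(fields))
print("low octal digit of failing addresses:", sorted(low_digits))
print("expected data equals address:", all(r["good"] == r["addr"] for r in records))
print("data read back is zero:", all(r["bad"] == 0 for r in records))
```

On this excerpt every error is in field 5 at an address whose last octal digit is 1, the expected data equals the address, and the data read back is 0000. Feeding the complete printout through the same parser would show whether any line breaks the pattern.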

Has anyone seen this type of fault before?
Does this point to a fault in the X/Y drivers?


Thanks
Tom
 
Confusingly, this evening the core memory started working perfectly again - it passes both address and checkerboard diagnostics.
All I did was swap in the H212 core board from the LAB-8/e, which worked fine, and subsequently swap back Jack's H212, which failed last night and this morning.
Now Jack's H212/G111/G233 work again passing address and checkerboard diagnostics.
Nothing changed. The boards are in the same Omnibus slots as before and I was very careful to seat the top-connectors always in the same order.

In desperation I gently tapped the 3 boards with the back of an insulated screw driver hoping to trigger the problem again, but checkerboard is rock solid again.
The assumption was that maybe seating and unseating the boards caused & fixed some issue, but the gentle tapping did nothing.
 
These kinds of failures are very hard to find.
They can only be found through very careful investigation.

For instance, look at the board fingers. I had a PDP-8 card with a tiny copper failure right at the end of the strip.
The solder had been pushed slightly upwards by deep insertion, and this had caused a crack.
When the card was inserted fully to the bottom there was no problem, but when it was not quite fully inserted it did not work.

Or look for any other tiny crack on the boards.
Tapping the boards could indeed reveal such cracks, but of course, to really stress them, you have to bend the boards.
Within limits, that is! Be careful.

Another common problem is a crack in the solder around a component lead that forms a ring.
These are hard to see, but once you have noticed one you will find them more easily next time.
On the slope of the solder fillet rising towards the component lead you can spot a dark ring, usually about halfway up.
Simply resoldering with fresh solder fixes that kind of issue.

In short, inspect those boards very carefully.

I saw a YouTube video of an IBM System/36 that had knobs for varying the several supply voltages within a small range.
These were installed to help find components that were half faulty.
That prompted me to keep this in mind for the next troubleshooting session with old TTL.

As there are several power traces on those boards, also measure the voltages at the end of each trace.
 
For instance, look at the board fingers. I had a PDP-8 card with a tiny copper failure right at the end of the strip.
I've had several colleagues who've experienced top block failures, so I'd inspect the heck out of those, too. Bad solder, cracked traces, dirty or corroded connector pins, etc.

Vince
 
Thanks for all the suggestions.

I have carefully checked all the edge connectors on the 3 MM8-EJ boards and the top blocks for any trouble using magnifying goggles.
All looked fine to me. The top connector solder joints are clean and neat. No oxidization or cold solder joints.

I think I have worked out the diode array on the H212 core board which could be causing the fault in the addresses below, but it would help if someone could sanity check my conclusion.
The suspect diode array could be E1 or the PCB tracks going to JC1 or JD2 or something further upstream.

The addresses failing are:

0001 0011 0021 0031 0041 0051 0061 0071 0101 0111 0121 0131 0141 0151 0161 0171 0201 0211 0221 ...
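
To see which address lines these failures have in common, the listed addresses can be reduced to the bits that never change. A minimal sketch in Python, using DEC bit numbering (bit 0 is the most significant, bit 11 the least significant); the `MA` prefix is used here only as a label. Note that the list above is truncated, so the high-order bits appear constant only because no higher addresses are listed:

```python
# Failing addresses as listed in the post (octal, truncated in the original).
FAILING = [int(a, 8) for a in (
    "0001 0011 0021 0031 0041 0051 0061 0071 "
    "0101 0111 0121 0131 0141 0151 0161 0171 "
    "0201 0211 0221"
).split()]

WIDTH = 12  # PDP-8 address width within one field


def invariant_bits(addresses, width=WIDTH):
    """Return {DEC bit number: value} for bits identical in every address."""
    all_ones = (1 << width) - 1
    always_one = all_ones
    always_zero = all_ones
    for a in addresses:
        always_one &= a                 # bits set in every address
        always_zero &= ~a & all_ones    # bits clear in every address
    result = {}
    for dec_bit in range(width):
        mask = 1 << (width - 1 - dec_bit)   # DEC bit 0 is the MSB
        if always_one & mask:
            result[dec_bit] = 1
        elif always_zero & mask:
            result[dec_bit] = 0
    return result


result = invariant_bits(FAILING)
for bit, value in sorted(result.items()):
    print(f"MA{bit} = {value}")
```

For the listed addresses the three low-order bits are fixed at 0, 0, 1 (bits 9, 10, 11) while bits 6 to 8 take every value, so the fault follows one particular combination of the three least significant address bits. Whether that combination corresponds to diode array E1 on the H212 depends on the MM8-EJ decode schematics, which this sketch does not model.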

Does anyone agree that a faulty E1 (or the connections to JC1 or JD2) on the H212 could cause the faults in the address pattern above?

I checked E1 in circuit using a multimeter and the diodes seemed fine when I tested. Of course the problem is intermittent (long periods of working and long periods of failing), so my E1 diodes testing good means not much.

Today checkerboard and address tests have been running for many hours without failure. :(
 