Memory failure modes and RAM test patterns

jonathanjo · Apr 1, 2024

Hello Friends

I've been reading about memory failure modes, especially for dynamic RAM, and wanted to hear from people about what they've experienced most on DEC equipment.

Specifically I'm interested in whether there are many "cross-chip errors", or whether most are "single chip".

The system I'm debugging (main story here) uses three blocks of 18 41256 DRAM chips, giving 1.5 MByte arranged as 16-bit words with byte parity. The chips' internal arrays are variously described as 256 rows by 1024 columns (Toshiba), 256x256x4 with a multiplexer (Siemens, Texas) or left unspecified (Samsung).
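As a quick sanity check on those numbers, here is the arithmetic as a small Python sketch. It assumes each block of 18 chips is 16 data chips plus 2 parity chips (one per byte), which is my reading of "byte parity" and not something taken from a schematic.

```python
# Capacity check for the memory described above.
# Assumption: each block of 18 chips = 16 data bits + 2 parity bits (one per byte).
BITS_PER_CHIP = 256 * 1024       # a 41256 is 256K x 1 bit
CHIPS_PER_BLOCK = 18
DATA_CHIPS_PER_BLOCK = 16        # assumed: one chip per data bit of the 16-bit word
BLOCKS = 3

parity_chips = CHIPS_PER_BLOCK - DATA_CHIPS_PER_BLOCK
words_per_block = BITS_PER_CHIP  # each chip holds one bit of every word in its block
data_bytes = BLOCKS * words_per_block * DATA_CHIPS_PER_BLOCK // 8

print(parity_chips)                  # parity chips per block
print(words_per_block)               # 16-bit words per block
print(data_bytes / (1024 * 1024))    # data capacity in MiB
```

Each data bit of a word lives in a different chip, which is why the question of cross-chip versus single-chip errors matters for how a fault shows up to the CPU.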

Following diagram from Toshiba datasheet, link below.

As well as all-1s and all-0s, memory tests very commonly write various bit patterns across bytes, words or long words, such as walking 1s or 0s, or these from AK6DN's MEMX program:

Code:

    .word    ^b0000000000000000    ;\
    .word    ^b1111111111111111    ; \
    .word    ^b0000000011111111    ; |
    .word    ^b1111111100000000    ; |
    .word    ^b0000111100001111    ; | -- table of patterns
    .word    ^b1111000011110000    ; |
    .word    ^b0011001100110011    ; |
    .word    ^b1100110011001100    ; |
    .word    ^b0101010101010101    ; /
    .word    ^b1010101010101010    ;/

For this to detect anything more than bits stuck-at-1 or stuck-at-0, the patterns would have to trigger some kind of interference from one RAM chip to another. This would obviously be the case for shorted data lines, for example, or some kind of broken driver chip -- certainly problems which occur in the field.
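To make that concrete, here is a toy Python model (not MEMX itself) of a 16-bit data path with an injected fault. The fault functions and the wired-AND behaviour of a short are assumptions for illustration; a real short might behave differently depending on the drivers.

```python
# Toy model: apply a data-path fault to each constant pattern and see which ones notice.
PATTERNS = [0x0000, 0xFFFF, 0x00FF, 0xFF00, 0x0F0F, 0xF0F0,
            0x3333, 0xCCCC, 0x5555, 0xAAAA]   # same values as the MEMX table

def stuck_at(bit, value):
    """Fault: data bit `bit` always reads back as `value` (0 or 1)."""
    def apply(word):
        return (word | (1 << bit)) if value else (word & ~(1 << bit))
    return apply

def shorted(bit_a, bit_b):
    """Fault: two data lines shorted together, modelled as a wired-AND (assumption)."""
    def apply(word):
        both = (word >> bit_a) & (word >> bit_b) & 1
        word &= ~((1 << bit_a) | (1 << bit_b))
        return word | (both << bit_a) | (both << bit_b)
    return apply

def failing_patterns(fault):
    """Return the patterns whose read-back differs from what was written."""
    return [p for p in PATTERNS if fault(p) != p]

print([hex(p) for p in failing_patterns(stuck_at(3, 0))])
print([hex(p) for p in failing_patterns(shorted(0, 1))])
```

In this model all-0s and all-1s never see a short, because both lines carry the same value. The striped patterns do, and since the four stripe widths (8, 4, 2, 1) amount to the four bits of the bit-position number, any pair of data lines differs in at least one of them.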

But my question is this: is that common in comparison to within-chip interference? If errors were spread approximately evenly across silicon area, we'd expect many more inter-row and inter-column effects, as suggested by Sridharan & Liberty's 2012 paper (ref below). There just isn't very much circuitry shared between chips of the same word, and yet they say multibit errors are approximately half of all errors. (Their study has 64kx4 chips, which allows multiple errors per 4-bit word; a "rank" is a given DIMM of 18 chips.)

If we knew the system-address to row- and column-address mapping, we should perhaps be looking for patterns across word addresses, rather than across bits. With our example 41256, if we suppose the simplest 256x1024 arrangement with RA0-7 mapped to SA0-7, and CA0-9 mapped to SA8-17, then we'd have to write to a bit in an adjacent column or row to trigger an error, as system address word xx-xxxx-xxx0-yyyy-yyy0 would have neighbours xx-xxxx-xxx1-yyyy-yyy0 and xx-xxxx-xxx0-yyyy-yyy1. I.e., we write 0x0000 and 0xffff to, say, system word 0 and look for interference at word addresses 0x01 and 0x100. An error would manifest as a single-bit error in the word, as far as the CPU sees.
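Here is that neighbour calculation as a minimal Python sketch. It uses exactly the supposed mapping above (RA0-7 = SA0-7, CA0-9 = SA8-17), and it further assumes that consecutive row and column numbers are physically adjacent on the die, which the datasheets don't promise. The function names are just mine.

```python
# Assumed mapping: row address RA0-7 = SA0-7, column address CA0-9 = SA8-17.
ROW_BITS, COL_BITS = 8, 10

def split(sa):
    """Split an 18-bit system word address into (row, column)."""
    return sa & ((1 << ROW_BITS) - 1), (sa >> ROW_BITS) & ((1 << COL_BITS) - 1)

def join(row, col):
    """Rebuild the system word address from (row, column)."""
    return (col << ROW_BITS) | row

def neighbours(sa):
    """Word addresses of the cells one row or one column away (edges have fewer)."""
    row, col = split(sa)
    out = []
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + dr, col + dc
        if 0 <= r < (1 << ROW_BITS) and 0 <= c < (1 << COL_BITS):
            out.append(join(r, c))
    return sorted(out)

print([hex(a) for a in neighbours(0x000)])
print([hex(a) for a in neighbours(0x101)])
```

For word 0 this gives 0x1 and 0x100, matching the addresses above. With a different mapping only `split` and `join` would need changing.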

Does anybody have appropriate experience from the field?

Jonathan.

V. Sridharan and D. Liberty, "A study of DRAM failures in the field," SC '12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, Salt Lake City, UT, USA, 2012, pp. 1-11, doi: 10.1109/SC.2012.13. Full text at archive.org

Ken Shirriff's excellent blog article about insides of RAM chips link

Minus Zero's exhaustive 41256 comparison chart and collection of datasheets

daver2 · Apr 1, 2024

The MARCH-C memory test is pretty good for detecting DRAM faults. There are various documents covering the failure modes for this test.

The 'walking' patterns do not necessarily provoke chip-to-chip faults. What they can catch is the setting or clearing of a bit within a single DRAM chip causing a bit flip elsewhere within the same device.

Dave

AK6DN · Apr 1, 2024

"Extended March C-" is what I based the test strategy of my MEMX memory diagnostic on.
In the attached pdf, see pp. 48-53 for a discussion of MARCH testing algorithms.
The rest of the presentation is pretty useful too, as an overview of memory failure characterization and testing strategies.

Code:

PDP-11 simulator V3.12-4
Disabling XQ
CPU, 11/44, FPP, NOCIS, autoconfiguration enabled, idle disabled, 4088KB

Memory Exerciser v1.41

Detected memory size is 3840KB (17000000)

Memory Control registers:  <none>

Test1: constant data patterns
Test1a: data pattern 000000 (0000000000000000)
Test1b: data pattern 177777 (1111111111111111)
Test1c: data pattern 000377 (0000000011111111)
Test1d: data pattern 177400 (1111111100000000)
Test1e: data pattern 007417 (0000111100001111)
Test1f: data pattern 170360 (1111000011110000)
Test1g: data pattern 031463 (0011001100110011)
Test1h: data pattern 146314 (1100110011001100)
Test1i: data pattern 052525 (0101010101010101)
Test1j: data pattern 125252 (1010101010101010)

Test2: unique physical block select

Test3: unique physical block address

Test4: extended march c- data test
Test4a: u(w0) - address ascending; write zero
Test4b: u(r0,w1,r1) - address ascending; read zero, write one, read one
Test4c: u(r1,w0) - address ascending; read one, write zero
Test4d: d(r0,w1) - address descending; read zero, write one
Test4e: d(r1,w0) - address descending; read one, write zero
Test4f: d(r0) - address descending; read zero

End pass 1. errors 0.
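For anyone who wants to play with the idea, the six march elements listed under Test4 can be sketched in Python against a toy memory. This is not the MEMX code: the one-bit-per-address `FaultyRAM` class and its coupling fault (a 0-to-1 write on an "aggressor" cell flips a "victim" cell) are assumptions made up for illustration.

```python
# Toy one-bit-per-address memory with an optional coupling fault (illustrative only).
class FaultyRAM:
    def __init__(self, size, aggressor=None, victim=None):
        self.cells = [0] * size
        self.aggressor, self.victim = aggressor, victim

    def read(self, addr):
        return self.cells[addr]

    def write(self, addr, value):
        rising = self.cells[addr] == 0 and value == 1
        self.cells[addr] = value
        if addr == self.aggressor and rising:
            self.cells[self.victim] ^= 1   # 0->1 write on aggressor flips the victim

# The six elements as listed under Test4: (direction, operations).
MARCH = [
    ("up",   ["w0"]),
    ("up",   ["r0", "w1", "r1"]),
    ("up",   ["r1", "w0"]),
    ("down", ["r0", "w1"]),
    ("down", ["r1", "w0"]),
    ("down", ["r0"]),
]

def march(ram, size):
    """Run all elements in order; return (element index, address, expected, got) tuples."""
    errors = []
    for n, (direction, ops) in enumerate(MARCH):
        addrs = range(size) if direction == "up" else range(size - 1, -1, -1)
        for addr in addrs:
            for op in ops:
                bit = int(op[1])
                if op[0] == "w":
                    ram.write(addr, bit)
                elif ram.read(addr) != bit:
                    errors.append((n, addr, bit, ram.read(addr)))
    return errors

print(march(FaultyRAM(8), 8))                          # fault-free memory
print(march(FaultyRAM(8, aggressor=5, victim=2), 8))   # victim below aggressor
print(march(FaultyRAM(8, aggressor=2, victim=5), 8))   # victim above aggressor
```

In this toy model the fault is caught whichever side of the aggressor the victim sits on, because the elements sweep both ascending and descending. The errors are reported at the victim address, not the aggressor.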
