jonathanjo
Hello Friends
I've been reading about memory failure modes, especially for dynamic RAM, and wanted to hear from people about what they've experienced most on DEC equipment.
Specifically I'm interested in whether there are many "cross-chip errors", or whether most are "single chip".
The system I'm debugging (main story here) uses three blocks of 18 41256 DRAM chips, giving 1.5 MByte arranged in 16-bit words with byte parity. The chips are variously described as 256 rows by 1024 columns (Toshiba), 256x256x4 with a multiplexer (Siemens, Texas) or unspecified (Samsung).
Following diagram from Toshiba datasheet, link below.
As well as all-1s and all-0s, memory tests usually write various bit patterns across bytes, words and long words, such as walking 1s or 0s, or these from AK6DN's MEMX program:
Code:
.word ^b0000000000000000 ;\
.word ^b1111111111111111 ; \
.word ^b0000000011111111 ; |
.word ^b1111111100000000 ; |
.word ^b0000111100001111 ; | -- table of patterns
.word ^b1111000011110000 ; |
.word ^b0011001100110011 ; |
.word ^b1100110011001100 ; |
.word ^b0101010101010101 ; /
.word ^b1010101010101010 ;/
For these patterns to detect anything more than bits stuck-at-1 or stuck-at-0, they would have to trigger some kind of interference from one RAM chip to another. This would obviously be the case for shorted data lines, for example, or some kind of broken driver chip -- certainly problems which occur in the field.
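To make the point above concrete, here is a minimal Python sketch (not part of MEMX) that models a 16-bit word with two data lines shorted together and checks which of the table's patterns would notice. The wired-AND fault model and the names `shorted_read` and `detecting_patterns` are assumptions for illustration only; a real short or a bad driver may behave differently.

```python
# Patterns copied from the MEMX table above.
PATTERNS = [
    0b0000000000000000,
    0b1111111111111111,
    0b0000000011111111,
    0b1111111100000000,
    0b0000111100001111,
    0b1111000011110000,
    0b0011001100110011,
    0b1100110011001100,
    0b0101010101010101,
    0b1010101010101010,
]


def shorted_read(value, bit_a, bit_b):
    """Hypothetical fault model: data lines bit_a and bit_b are shorted,
    so both read back as the AND of the two bits that were written."""
    a = (value >> bit_a) & 1
    b = (value >> bit_b) & 1
    both = a & b
    mask = (1 << bit_a) | (1 << bit_b)
    return (value & ~mask) | (both << bit_a) | (both << bit_b)


def detecting_patterns(bit_a, bit_b):
    """Return the patterns whose read-back differs from what was written."""
    return [p for p in PATTERNS if shorted_read(p, bit_a, bit_b) != p]


if __name__ == "__main__":
    for pair in ((0, 1), (7, 8)):
        hits = detecting_patterns(*pair)
        print(pair, [f"{p:016b}" for p in hits])
```

Under this model all-0s and all-1s never catch the short, because both lines always carry the same value; only patterns that put different values on the two lines do. That is exactly a cross-bit (and so cross-chip) effect, and it says nothing about interference between cells inside one chip.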
But my question is this: is that common in comparison with within-chip interference? If errors were approximately evenly distributed across silicon area, we'd expect many more inter-row and inter-column effects, as suggested by Sridharan & Liberty's 2012 paper (ref below). There just isn't very much circuitry shared between chips of the same word; nonetheless they say multibit errors are approximately half of all errors. (Their study has 64kx4 chips, which allows multiple errors per 4-bit word; a "rank" is a given DIMM of 18 chips.)
If we knew the mapping from system address to row and column address, we should perhaps be looking for patterns across word addresses, rather than across bits. With our example 41256, if we suppose the simplest 256x1024 arrangement, with RA0-7 mapped to SA0-7 and CA0-9 mapped to SA8-17, then we'd have to write to a bit in an adjacent column or row to trigger an error: system address word xx-xxxx-xxx0-yyyy-yyy0 would have neighbours xx-xxxx-xxx1-yyyy-yyy0 and xx-xxxx-xxx0-yyyy-yyy1. I.e., we write 0x0000 and 0xffff to, say, system word 0 and look for interference at word addresses 0x01 and 0x100. An error would manifest as a single-bit error in the word, as far as the CPU sees.
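A rough Python sketch of that idea follows, assuming the hypothetical mapping described above (RA0-7 from SA0-7, CA0-9 from SA8-17) and word addresses rather than byte addresses. The helper names `split`, `join`, `neighbours` and `disturb_test`, and the `read`/`write` callbacks, are made up for illustration; a real board's address multiplexing, and any address scrambling inside the chip, could well differ.

```python
ROW_BITS = 8   # RA0-7 taken from SA0-7 (assumed mapping)
COL_BITS = 10  # CA0-9 taken from SA8-17 (assumed mapping)


def split(word_addr):
    """Split a system word address into (row, column) under the assumed mapping."""
    row = word_addr & ((1 << ROW_BITS) - 1)
    col = (word_addr >> ROW_BITS) & ((1 << COL_BITS) - 1)
    return row, col


def join(row, col):
    """Inverse of split(): rebuild the system word address."""
    return (col << ROW_BITS) | row


def neighbours(word_addr):
    """Word addresses one row or one column away, staying inside the array."""
    row, col = split(word_addr)
    found = []
    for d_row, d_col in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + d_row, col + d_col
        if 0 <= r < (1 << ROW_BITS) and 0 <= c < (1 << COL_BITS):
            found.append(join(r, c))
    return sorted(found)


def disturb_test(read, write, target, background=0x0000):
    """Fill the neighbours with a background value, write 0x0000 then 0xFFFF
    to the target, and report any neighbour bits that changed."""
    near = neighbours(target)
    for addr in near:
        write(addr, background)
    for value in (0x0000, 0xFFFF):
        write(target, value)
    errors = {}
    for addr in near:
        got = read(addr)
        if got != background:
            errors[addr] = got ^ background  # mask of the flipped bits
    return errors


if __name__ == "__main__":
    print([hex(a) for a in neighbours(0)])
```

Here `read` and `write` stand in for whatever access the test harness has to the memory under test. A fuller test would presumably repeat the run with the background inverted (0xFFFF) and visit every target address; each reported mask should contain a single bit if the fault is confined to one chip.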
Does anybody have appropriate experience from the field?
Jonathan.
V. Sridharan and D. Liberty, "A study of DRAM failures in the field," SC '12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, Salt Lake City, UT, USA, 2012, pp. 1-11, doi: 10.1109/SC.2012.13. Full text at archive.org
Ken Shirriff's excellent blog article about insides of RAM chips link
Minus Zero's exhaustive 41256 comparison chart and collection of datasheets