I want to define a standard format to store punched cards and decks

RetroAND

Hello,

I have only found raw formats that are dumps of the cards, or plain strings if the card uses some encoding. I haven't found a format suitable for long-term archival storage, because the existing ones lack metadata and verification mechanisms. For this reason I would like to propose what I have in mind.

Format PCD

The extension stands for "Punched Card Deck". It would basically be a zip with the extension changed. In the root of that file we would find a picture of the deck in JPEG format, an XML file with the metadata related to the deck and a number of PCF files. The XML should contain a list of the PCFs as well as a CRC code to verify each of them, among other things. The order of the cards should also be specified, and if there's a sequence pattern, it should be noted there too.
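To make the idea concrete, here is a minimal sketch of such a container in Python. The file names (deck.jpg, deck.xml), the element and attribute names, and the choice of CRC-32 are all assumptions for illustration; nothing here is a settled part of the proposal.

```python
import io
import xml.etree.ElementTree as ET
import zipfile
import zlib


def build_pcd(cards: dict[str, bytes], picture: bytes) -> bytes:
    """Pack card files and a deck picture into a zip-based deck archive.

    `cards` maps (hypothetical) PCF file names to their bytes, in deck order.
    """
    deck = ET.Element("deck", version="0")
    for position, (name, payload) in enumerate(cards.items(), start=1):
        # One manifest entry per card: file name, position and CRC-32.
        ET.SubElement(
            deck,
            "card",
            file=name,
            order=str(position),
            crc32=f"{zlib.crc32(payload):08x}",
        )
    manifest = ET.tostring(deck, encoding="utf-8", xml_declaration=True)

    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("deck.jpg", picture)
        archive.writestr("deck.xml", manifest)
        for name, payload in cards.items():
            archive.writestr(name, payload)
    return buffer.getvalue()


def verify_pcd(blob: bytes) -> list[str]:
    """Return the names of card files whose CRC does not match the manifest."""
    mismatched = []
    with zipfile.ZipFile(io.BytesIO(blob)) as archive:
        deck = ET.fromstring(archive.read("deck.xml"))
        for card in deck.iter("card"):
            name = card.get("file")
            actual = f"{zlib.crc32(archive.read(name)):08x}"
            if actual != card.get("crc32"):
                mismatched.append(name)
    return mismatched
```

Note that zip already keeps a CRC-32 per entry, so the manifest CRC mainly helps when a card file travels outside its deck archive.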

Format PCF

The extension stands for "Punched Card File". It would also be another zip with the extension changed. It should contain a scan of the card in JPEG, an XML for the metadata and a binary file standing for the contents of the card. The binary should be in SIMH format (160 bytes) or in packed format (120 bytes). The XML should contain any information to describe the card, including the format, platform, sequence number if present, etc. There should also be a CRC for verifying the binary file.
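The two sizes follow from the card geometry: 80 columns of 12 rows is 960 bits, which is 120 bytes when packed tightly and 160 bytes when each column gets its own 16-bit word. The sketch below shows both layouts. The bit placement (12 bits in the low end of a little-endian word, and two columns per three big-endian bytes) is an assumption for illustration; the actual SIMH layout should be checked against its documentation before relying on this.

```python
import struct

COLUMNS = 80
MAX_COLUMN = (1 << 12) - 1  # 12 punch rows per column


def _check(columns: list[int]) -> None:
    if len(columns) != COLUMNS:
        raise ValueError("a card has exactly 80 columns")
    if any(not 0 <= value <= MAX_COLUMN for value in columns):
        raise ValueError("each column must fit in 12 bits")


def to_words(columns: list[int]) -> bytes:
    """One 16-bit little-endian word per column: 80 columns -> 160 bytes."""
    _check(columns)
    return struct.pack("<80H", *columns)


def to_packed(columns: list[int]) -> bytes:
    """Two columns per three bytes: 80 columns -> 120 bytes."""
    _check(columns)
    out = bytearray()
    for first, second in zip(columns[0::2], columns[1::2]):
        out += ((first << 12) | second).to_bytes(3, "big")
    return bytes(out)


def from_packed(blob: bytes) -> list[int]:
    """Inverse of to_packed: 120 bytes -> 80 column values."""
    if len(blob) != 120:
        raise ValueError("a packed card is exactly 120 bytes")
    columns = []
    for index in range(0, len(blob), 3):
        pair = int.from_bytes(blob[index:index + 3], "big")
        columns += [pair >> 12, pair & MAX_COLUMN]
    return columns
```

Both layouts are lossless, so a card stored one way can always be converted to the other.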

As you can see, I have already thought out the structure and the parts. However, I would need help with the XMLs. I don't know how to define XML schemas, and it would also be great if someone could give me some hints about fields that could help describe the cards.

Thank you very much!
 
The metadata will vary greatly depending on factors such as the system or software using the cards and other context. A good start might just be to include a freeform description that the archiver could fill in with the information they believe relevant. If there was a way for other historians to add to or correct the metadata, as exists with Github or Wikis, that would be ideal but a minimal solution is to provide updates to the archivist who could update the description file.

XML is great for standardizing terms and automated searches, but establishing a useful taxonomy across all system types and software applications is going to be a herculean and perhaps unsolvable problem IMO. I applaud your goal of improving the information available with archived card decks. Having pictures of markings on the deck is a great idea. Decks have markings across the top and sometimes writing on cards that are lost when only the pattern of the holes is recorded.

It is very helpful to have a lossless copy of all 960 hole positions on the card. Some cards may be encoded in a format such as Hollerith, which could be converted to ASCII; for other formats it makes no sense to attempt that conversion. Even cards which are punched in Hollerith using only the keys on a keypunch present some challenges - the logical not and the cent sign characters do not exist in ASCII, so you would have to record the text in Unicode with more than one byte per character.
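As a rough illustration of why the raw holes should be kept next to any decoded text, here is a sketch that decodes only the digits, letters and blank, which use the same Hollerith punches across vendors. A column is given as the set of punched rows (12, 11, 0 to 9); that representation and the function names are assumptions for the example. Special characters are deliberately left out because their punch codes vary between machines, so they come out as the Unicode replacement character.

```python
def build_table() -> dict[frozenset[int], str]:
    """Map punched-row sets to characters for digits, letters and blank."""
    table = {frozenset(): " "}
    for digit in range(10):
        table[frozenset({digit})] = str(digit)
    # Letters combine one zone punch (12, 11 or 0) with one digit punch.
    zones = ((12, "ABCDEFGHI", 1), (11, "JKLMNOPQR", 1), (0, "STUVWXYZ", 2))
    for zone, letters, first_row in zones:
        for offset, letter in enumerate(letters):
            table[frozenset({zone, first_row + offset})] = letter
    return table


HOLLERITH = build_table()


def decode(card: list[set[int]], unknown: str = "\ufffd") -> str:
    """Decode columns to text, marking anything outside the table as unknown."""
    return "".join(HOLLERITH.get(frozenset(column), unknown) for column in card)
```

Anything that decodes to the unknown marker can still be recovered later from the lossless copy once the right encoding for that deck is identified.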
 
Thank you for joining the debate.

Of course this would end up being big and complicated, but we could work by versioning: start with something minimal, then add more fields as required.

The deck format should contain the target computer in its metadata, if known.
Also, the form factor of the card should be referenced in the card file.

But other than maybe descriptions, sequence numbers or type of card (text/binary) I don't know much about cards and for this reason I started the debate.

The binary files present in the card format should be a lossless representation of the card.

I hope this answers some things... in any case, thank you for your feedback!
 
Look at the Library of Congress "bagit" format, rather than inventing your own container.


also, if you use zip, store the contents uncompressed
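For reference, storing without compression is straightforward with Python's standard zipfile module. This is only a sketch; the helper name is made up for the example.

```python
import io
import zipfile


def pack_stored(files: dict[str, bytes]) -> bytes:
    """Write files into a zip archive with no compression (method 'stored')."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_STORED) as archive:
        for name, payload in files.items():
            archive.writestr(name, payload)
    return buffer.getvalue()
```

With stored entries the original bytes sit unmodified inside the archive, so they can still be located even if the zip structure itself is damaged.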
 
Bagit sounds good, but there's something I don't understand about it. Is it then zipped or something? What extension does it use? Is it widely supported? Sorry to ask so much, I just don't know much about it.

In any case, thank you very much for the input!
 
It looks like it's a directory structure and not a single file. You could pack the directory structure into a zip if you like, I guess? But from what I can tell from looking very quickly, the RFC does not prescribe a way to pack up the directory tree into a single file.

I'm ambivalent about zipping things. For punched cards, it seems unlikely that you will ever amass so many that compression of the data on the cards itself is necessary. Uncompressed data has some advantages: you can grep it more easily, for example, although character encoding may be a challenge. I'd be curious to know if this has anything to do with why @Al Kossow recommended using uncompressed contents in ZIPs.

ETA: note that the RFC lists the following as a bagit feature:

Code:
2.  Direct file access.  Because BagIt specifies an actual filesystem
    hierarchy rather than a serialized representation of one, files
    can be accessed using standard operating system utilities,
    implementations do not need to process a potentially large
    archival file to extract a subset of data, and the format imposes
    no size limits for either individual files or a bag.
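To show how little is needed, here is a sketch that writes a minimal bag by hand: a bagit.txt declaration, a data/ payload directory and a SHA-256 payload manifest. The function name and the flat file naming are assumptions for the example, and a real project would probably use an existing BagIt library and add a bag-info.txt.

```python
import hashlib
from pathlib import Path


def make_bag(root: Path, payload: dict[str, bytes]) -> None:
    """Write a minimal BagIt-style directory tree under `root`.

    `payload` maps flat file names (no subdirectories) to file contents.
    """
    data = root / "data"
    data.mkdir(parents=True)
    manifest_lines = []
    for name, content in sorted(payload.items()):
        (data / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}\n")
    # The bag declaration: version and encoding of the tag files.
    (root / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (root / "manifest-sha256.txt").write_text(
        "".join(manifest_lines), encoding="utf-8"
    )
```

Because the result is an ordinary directory, each card file stays reachable with standard tools, which is the feature quoted above.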
 
I had assumed so, but for compressing those images, you would probably want to rely on an image format with compression instead of ZIP's generic DEFLATE algorithm or what have you. The cards' data contents, meanwhile, I'd imagined you'd store in separate uncompressed files for convenience. You could either store a representation of the data on the cards (e.g. as strings of eight-bit bytes, assuming that cards were punched in a way that mapped a subset of the possible punches of 12 holes into 8-bit bytes), or you could store the raw bits themselves. The former method would not be able to represent cards that were punched in some other encoding, but it would give you a much easier time processing the data with programs on a modern computer. The latter method can represent any card that wasn't folded, spindled, or mutilated, but would be a pain to grep.
 
I would do raw representation, as I have binary cards myself. The encoding method, if any, should be declared in the card metadata, in my opinion. My current deck has cards with more than a single encoding.
 
Okay, let's have an update.

I have defined a set of metadata fields. I can provide a link to anybody interested.

I also got permission to use some Python code that identifies the cards, and I will use it to generate both more metadata and the binaries.

However, I would like to put the emphasis on the descriptive elements of the format rather than the functional ones.
 
Here is a sample of the pcf file, for card PTM-10014 of the 1130 deck I have. I have left the zip file extension unchanged.
 

Punched cards! Buddy and I taking the same engineering class and working at the same factory on a night shift decided to go punch up our homework using the IBM keypunch machines in the office area. We punched up a couple hundred cards and felt real happy with ourselves ... took them to the University to run/compile on a CDC6400 and guess what ... all of the special characters were messed up! .... Seems that IBM punches and CDC punches were different with regard to the special characters. First introduction of many to not-standard "standards".
 
The plan is to document as much as possible of the deck and rely on undecoded data to store the content, so encoding conflicts like this won't happen.
 
I see discussion here about directories, metadata, filesystem "nativeness", etc.

Is there anything in the metadata of the tar or cpio formats that would satisfy the need here? Either format then compressed with a single-file compressor like gzip or bzip2 is standard for multi-file directory hierarchies, but I'm not sure if there is enough metadata allowance to store all the stuff you need to store to properly preserve a deck as filesystem objects.

There's also RJE, it may be possible old UNIX RJE bits have some pointers on handling a card deck as a first class data primitive, although RJE may never have had to solve this problem, I'm not familiar with the inner workings. Either way I'd be curious if anything canonical in the UNIX ecosystem could act as inspiration.
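On the tar question: the POSIX pax variant of tar does allow extra key/value records per file in its extended headers, and Python's tarfile module exposes them. The sketch below attaches card metadata that way. The "PCD." key prefix and the helper name are made up for the example, and whether this is enough for a whole deck description is an open question.

```python
import io
import tarfile


def add_card(
    archive: tarfile.TarFile,
    name: str,
    payload: bytes,
    metadata: dict[str, str],
) -> None:
    """Add one card file with metadata stored in pax extended headers."""
    info = tarfile.TarInfo(name)
    info.size = len(payload)
    # Hypothetical vendor-style keys; pax permits custom keywords.
    info.pax_headers = {f"PCD.{key}": value for key, value in metadata.items()}
    archive.addfile(info, io.BytesIO(payload))


buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w", format=tarfile.PAX_FORMAT) as archive:
    add_card(archive, "card0001.bin", b"\x00" * 160, {"encoding": "binary"})
```

Most tar tools will simply ignore keys they don't recognise, so the metadata would only be visible to software that knows to look for it.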
 
I have been offered an alternative to my previous approach, which is to employ a single PDF stuffed with metadata. This second approach seems good, but I find it difficult to define and place my metadata schema in it. It would also require more specialized tools than I initially wanted.

In any case, the card files shall contain a picture or a scan for software recognition (because most people just don't have readers anymore). It's true that the space ratio between the image and the extracted data will be very unequal, but we all knew beforehand that punched cards are a very low density format which took a lot of space in comparison with other media...

I don't know of anything canonical in the UNIX ecosystem that could work in this case, but that is down to my lack of knowledge.

In any case, I would need a lot of help, not only to correct or modify my current metadata model but also to find information about the encodings. So please, if anybody knows of a company-, country- or model-specific encoding, I would like to ask that person to share the knowledge.

In any case, thank you all for your interest and help!
 
Well, what may help is defining here what metadata you are trying to capture. For me, if this file is supposed to be the contents of each card plus information about the deck as a whole, some items come to mind:

- Punch used to create the cards
- Vendor of the cards
- Count
- Date
- Computing System Involved (Hardware, OS/Monitor, expected card reader)
- Bitness/Endianness (which may simply be a property of the system)
- Encoding? (ASCII vs EBCDIC vs Hollerith vs...etc)
- Width (supporting non-80-character cards)
- Type of card (May or may not be important, for instance IBM ASM vs FORTRAN vs COBOL vs RPG vs...etc)
- Ordering information (for decks where the sequence number was not punched or is otherwise unavailable)

Is this the sort of metadata you had in mind?
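As a starting point for discussion, the list above could be written down as a simple record and serialised to XML. This is only a sketch: every field name, type and default here is an assumption, not an agreed schema.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class DeckMetadata:
    """One field per item in the list above; all names are provisional."""

    punch: str = ""
    card_vendor: str = ""
    count: Optional[int] = None
    date: str = ""
    system: str = ""
    word_format: str = ""  # bitness/endianness, if not implied by the system
    encoding: str = ""
    width: int = 80
    card_type: str = ""
    ordering: str = ""


def to_xml(meta: DeckMetadata) -> str:
    """Serialise the filled-in fields as child elements of <deck>."""
    root = ET.Element("deck")
    for field in fields(meta):
        value = getattr(meta, field.name)
        if value in ("", None):
            continue  # leave unknown fields out rather than guessing
        ET.SubElement(root, field.name).text = str(value)
    return ET.tostring(root, encoding="unicode")
```

Calling to_xml(DeckMetadata(system="IBM 1130", count=2)) produces a deck element with count, system and width children; unknown fields are simply absent.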
 
This is rapidly turning into something where people with some knowledge of the subject
need to define the fields.
I can see most of what you are describing to be unreasonable from a cataloging/archiving point of view
especially a scan of EVERY card in the deck.
A box of cards has 2000 of them in it, and CHM has hundreds of boxes.
What you are proposing doesn't scale, and we use card readers, not document scanners.
 
Hello again Al,

With all due respect, I would gladly hear from the experts. I want to learn from them, after all. That's why I have asked them for help.

But we are facing various problems here...
The first is that we are in 2025: about 40 years have passed since cards were displaced by floppies and other media, and there's still no unified way to store them digitally.

I think you are not understanding my approach at all. While the basic description unit is the card, the metadata system would inherit from the deck, so most metadata elements would be already filled and would only require modification in certain cases.
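In code terms the inheritance could be as simple as the following sketch, where a card only records the fields that differ from its deck. The field names are placeholders.

```python
from collections import ChainMap


def effective_metadata(deck: dict[str, str], card: dict[str, str]) -> dict[str, str]:
    """Card-level values override deck-level ones; everything else is inherited."""
    return dict(ChainMap(card, deck))


deck = {"system": "IBM 1130", "encoding": "hollerith", "width": "80"}
card = {"encoding": "binary", "sequence": "0014"}
merged = effective_metadata(deck, card)
```

Here merged keeps the deck's system and width, takes the card's own encoding, and adds the card's sequence number, so a deck with mixed encodings only needs overrides on the cards that differ.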

I am fully aware of the volume. You have stated the number of cards the CHM has on multiple occasions. However, this is 2025 and most card readers have either been discarded or are not in working order. It is much easier to find cards and decks than readers. I don't know about America, but in Europe the situation is very bleak. For this reason, the standards should be lowered if we want to preserve what remains of this kind of software. Of course, I don't expect to propose this and then see everything spring up like daisies; it will take a lot of time and I am aware of it.

The scans serve two purposes. The first is to store any information not expressed in punches, especially handwriting. The second is to extract the information on the cards by means of software. The scans must also be kept for future verification, because the software could produce erroneous results at any time.

This is a proposal, a draft. The intention is to preserve this software somehow. And its development should be constructive. Maybe we should add more metadata fields in order to benefit from still working card readers. What I want to define is not only a format to store them, but also the procedures to generate them. I don't see why the same format couldn't be generated in two different ways. The resulting weight would be different, but would enable everybody to save their cards, which is my main goal.

I invite you to participate.
 