
IBM PC Character Set Information (Code Page 437)

mwalden

Hello,

About a month ago I posted the following five related web pages on the character set included in the IBM PC and its descendant computers.

In the first link, I ask IBM engineer Dr. David J. Bradley questions about the character set and some other topics of interest to computer historians.

If you have any interest in the IBM PC's character set (Code Page 437), then have a look at these links. I think you will find value in their content.

Dr. David J. Bradley on IBM PC's Character Set and More
https://mw.rat.bz/djb

The IBM PC Character Set Confusion Clarified
https://mw.rat.bz/confusion

IBM PC Code Page 437 to Unicode Mapping Table
https://mw.rat.bz/cp437map

IBM PC Technical Reference Character Set (00-FF) Quick Reference
https://mw.rat.bz/ibmpctr

IBM PC MDA ROM Font Character Table
https://mw.rat.bz/mdarom

At present, these pages have received fewer than a thousand user hits, so chances are that you have not visited them yet.

I hope that you find them useful and interesting.

I look forward to any feedback you might have here.

Cheers,
- Michael Walden
 
Nice work on sorting the CP437/Unicode mapping. It warms my pedantic little heart. But good luck getting everybody to switch to it. :-(
 
Hello Michael,

Regarding your question about the Datamaster designation, I found in the docs that IBM referred to it as "Datamaster" and not "DataMaster".

Thank you for helping me get in contact with Dr. Bradley!

Greetings
 
Interesting

I would think that the most important thing is to look at what functionality was intended for each character, and how it was later used.
For example, mapping the CP437 comma "," character to anything other than the common Unicode/ASCII comma would just cause problems.

I would think that what you map to Greek beta is intended to act as a German eszett character and that is likely the functionally correct mapping.

Your suggestions are great for displaying something as close to codepage 437 as possible without having a copy of any IBM PC ROM font at hand though.

Also note that Unicode includes some bad things, like two different ways of writing some characters. This has resulted in weird problems. For a while, when someone posted the characters åäöÅÄÖ (used in Swedish) on Instagram using an Apple device and had automatic reposting to Facebook set up, those posts would show up as aaoAAO on Android and/or Windows devices, even though the same user posting directly on Facebook with their Apple device would result in åäöÅÄÖ showing up correctly for Windows/Android users.

I don't know if there are any other cases where the same characters can be encoded in multiple ways in Unicode, but I wouldn't be surprised if that is the case.

The åäöÅÄÖ example is relevant as these characters are part of codepage 437, and thus there is an "Apple" and an "Everyone else" way of mapping codepage 437 to Unicode.
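As a rough illustration of the two encodings described above, and assuming they correspond to Unicode's precomposed (NFC) and decomposed (NFD) normalization forms, here is a small Python sketch using only the standard library. The variable names are made up for the example, and dropping the combining marks is only one plausible way the "aaoAAO" symptom could arise, not a confirmed diagnosis of the Instagram/Facebook case.

```python
import unicodedata

text = "åäöÅÄÖ"

# NFC: each letter is a single precomposed code point (e.g. U+00E5 for å).
nfc = unicodedata.normalize("NFC", text)
# NFD: each letter is a base letter followed by a combining mark
# (e.g. U+0061 'a' + U+030A COMBINING RING ABOVE for å).
nfd = unicodedata.normalize("NFD", text)

print(len(nfc), len(nfd))
print([f"U+{ord(c):04X}" for c in nfd[:2]])

# The precomposed form maps directly onto CP437, one byte per letter.
cp437_bytes = nfc.encode("cp437")

# The combining marks have no CP437 equivalent. If a converter silently
# drops what it cannot encode, only the bare base letters survive.
stripped = nfd.encode("cp437", errors="ignore")
print(cp437_bytes.hex(" "), stripped)
```

Under that assumption, normalizing to NFC before converting to CP437 avoids the problem regardless of which device produced the text.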
 
I would think that what you map to Greek beta is intended to act as a German eszett character and that is likely the functionally correct mapping.
But then why would it be in the "Greek letters" section of the charset, right between α and Γ, instead of in the European letters area of the charset?

Also note that Unicode includes some bad things, like two different ways of writing some characters.
This is not a bad thing; this is a good thing. Without this Unicode couldn't support round-trip format conversion. I've had experience with situations where I had to store things in multiple charsets because conversions couldn't round-trip; it's super-nice to be able to convert all input to Unicode, store it, and know for sure that if I convert it back to the original charset the user will get the original input, not something that's been changed.

I don't know if there are any other cases where the same characters can be encoded in multiple ways in Unicode
There are plenty, because plenty of source charsets may have a different idea from Unicode about whether two characters are "the same." A classic example is U+03A9 'Ω' GREEK CAPITAL LETTER OMEGA and U+2126 'Ω' OHM SIGN. (Unicode recommends you always use the former, even when writing things such as 'a 12Ω resistor'.)
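The omega/ohm pair can be checked with Python's standard `unicodedata` module. This is just a sketch to show the two code points and how normalization folds one into the other; the variable names are illustrative, and the CP437 line reflects Python's bundled `cp437` codec, which is one mapping among several.

```python
import unicodedata

omega = "\u03a9"  # GREEK CAPITAL LETTER OMEGA
ohm = "\u2126"    # OHM SIGN

print(unicodedata.name(omega), "/", unicodedata.name(ohm))

# They are distinct code points, so a plain comparison says "different".
distinct = omega != ohm

# OHM SIGN has a canonical decomposition to the Greek letter, so
# normalizing (NFC or NFD) replaces it with U+03A9.
folded = unicodedata.normalize("NFC", ohm)

# Python's cp437 codec decodes byte 0xEA to the Greek letter, not the ohm sign.
cp437_omega = b"\xea".decode("cp437")
print(distinct, folded == omega, cp437_omega == omega)
```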
 
@mwalden: welcome, nice to see ya here!

But then why would it be in the "Greek letters" section of the charset, right between α and Γ, instead of in the European letters area of the charset?

Likely because IBM itself never really made up its mind. It clearly looks like a Greek beta in the CGA, MDA, EGA, and PC Convertible fonts, and the 1986 'rev. 0' of the PS/2 MCGA/VGA firmware still had that representation, but the following year's rev. 1 looks like it was changed quite deliberately to a German eszett.

Maybe it had something to do with PC-DOS 3.2's code page support, which made IBM realize that cp437 was one character short of supporting German properly, while Greek was already getting a codepage of its own (cp851). *shrugs*
 
I think it's obvious that IBM did not hire a graphical artist or typography expert to create the character set. It was probably just some engineer working from what may have even been a hand-drawn list of characters. That's why a lot of the shapes aren't quite right and/or are inconsistent with other similar characters.

But at least the PC got separate characters for 1 and l, which the DataMaster didn't!
 
This is not a bad thing; this is a good thing. Without this Unicode couldn't support round-trip format conversion. I've had experience with situations where I had to store things in multiple charsets because conversions couldn't round-trip; it's super-nice to be able to convert all input to Unicode, store it, and know for sure that if I convert it back to the original charset the user will get the original input, not something that's been changed.
This depends on how you view it. With these two representations of the åäöÅÄÖ characters in Unicode, you can't do round trip format conversion to CP437 and back to Unicode unless the source happens to use whichever of the two Unicode representations you then use when you convert CP437 back to Unicode.

Also I don't know of any other character encoding that has two different encodings for åäöÅÄÖ. I get why you want to differentiate between a character that is a-zA-Z with an "appendix" vs. a character that is certainly not a-zA-Z but looks like a-zA-Z with an "appendix", to allow for how different languages treat different characters. But I don't know of any language specifically using åäöÅÄÖ that does not treat them as separate characters. (I know that, for example, ë in Citroën and ï in naïve indicate that the letter should be pronounced separately from the adjacent o/a, but that isn't a correct way to use äöÄÖ in, for example, Swedish, especially since Swedish by default pronounces Citroen and naive (spelled naiv in Swedish) correctly without the ë or ï.)

Are there any languages that use any of the åäöÅÄÖ characters as a version of a and/or o?
Likely because IBM itself never really made up its mind. It clearly looks like a Greek beta in the CGA, MDA, EGA, and PC Convertible fonts, and the 1986 'rev. 0' of the PS/2 MCGA/VGA firmware still had that representation, but the following year's rev. 1 looks like it was changed quite deliberately to a German eszett.

Maybe it had something to do with PC-DOS 3.2's code page support, which made IBM realize that cp437 was one character short of supporting German properly, while Greek was already getting a codepage of its own (cp851). *shrugs*
I would think that before VGA IBM just weighed the pros and cons of including whatever they could fit in above the standard ASCII characters (and put some fun symbols below 32), and they might have thought it was good enough to have the same code for both a Greek and German character if they looked similar enough?
It for sure was an improvement over the older ISO 646 with separate 7-bit encodings for each language region. (The only exception where ISO 646 would be better would be if someone from Sweden or Finland had grown up under a rock and was tasked with reading text written in Norwegian or Danish, or the other way around, as the Danish/Norwegian counterparts to äöÄÖ use the same ISO 646 codes as the Swedish/Finnish (and German) äöÄÖ. This is an exception that would never happen IRL though.)

Re code pages: TBH I hated those. The only use for CP850 for Swedish is for people writing pretentious texts that would be beyond what is taught during the first 11-13 years of school, at least to exaggerate things a bit. But for some dumb reason, starting with some DOS version, PC-DOS and MS-DOS didn't allow loading a Swedish keyboard layout without also having some of those MODE CON PREPARE lines and whatnot. Before this there were separate small binaries for each keyboard layout, like keybsv.com and so on. Going off on a tangent, I've read on the os2museum blog that in some countries it was common to use third-party keyboard layout programs...
 
This depends on how you view it. With these two representations of the åäöÅÄÖ characters in Unicode, you can't do round trip format conversion to CP437 and back to Unicode
Which is fine. Unicode has never claimed to do Unicode → other charset → Unicode, for the very good reason that if that could be done for a particular other charset, that charset would have to be semantically the same as Unicode. (I.e., the exact same set of characters, with the only difference being that the code points might be different.)

This is a logical consequence of Unicode being able to do other charset → Unicode → other charset round trips.
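To make the asymmetry concrete, here is a minimal Python sketch of both directions. It relies on Python's bundled `cp437` codec, which is only one possible mapping (it treats bytes 0x00-0x1F and 0x7F as control codes rather than as the PC's graphic glyphs), and the variable names are invented for the example.

```python
import unicodedata

# Direction 1: other charset -> Unicode -> other charset.
# Every one of the 256 byte values survives the trip unchanged.
all_bytes = bytes(range(256))
as_unicode = all_bytes.decode("cp437")
bytes_round_trip = as_unicode.encode("cp437") == all_bytes

# Direction 2: Unicode -> other charset -> Unicode.
# Start from decomposed text; it must be composed (NFC) before CP437
# can hold it, and decoding hands back the composed form.
original = unicodedata.normalize("NFD", "åäö")
via_cp437 = unicodedata.normalize("NFC", original).encode("cp437").decode("cp437")
unicode_round_trip = via_cp437 == original

print(bytes_round_trip, unicode_round_trip)
```

The text that comes back in the second case is still canonically equivalent to the input, so renormalizing it to NFD recovers the original code points; what is lost is only the knowledge of which form the source used.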

Also I don't know of any other character encoding that has two different encodings for åäöÅÄÖ.
That's not surprising. I know of no other charset that claims to support other → itself → other round tripping, which is pretty much the only reason you'd need that.

But something seems weird about this. Explain again in which charset the letter 'ä' is correctly translated to 'a' in Unicode? What does 'a' in the original charset translate to in Unicode? Or does it not have an 'a' character?

...and they might have thought it was good enough to have the same code for both a Greek and German character if they looked similar enough?
Or they simply decided not to support 'ß' at all. And someone later decided that they would use 'β' where they needed an 'ß' because they looked kind of similar, and in their particular application they didn't use 'β', or didn't care if the software couldn't distinguish the difference between the two. (There's plenty of software out there that doesn't care about the difference, because they let the user input neither, so they don't need to distinguish them.)
 
Or they simply decided not to support 'ß' at all. And someone later decided that they would use 'β' where they needed an 'ß' because they looked kind of similar, and in their particular application they didn't use 'β', or didn't care if the software couldn't distinguish the difference between the two. (There's plenty of software out there that doesn't care about the difference, because they let the user input neither, so they don't need to distinguish them.)
I am more inclined to believe that it was intended for dual-use all along, because the average American engineer speaks neither German nor Greek and hence nobody saw the difference during the PC's rushed design phase.

I mean: it is a fact that, firstly, Germany was a CP437 market during the CGA era; secondly, IBM's own keyboards for the German market have always had Ä, Ö, Ü and ß keys; and thirdly, IBM's own keyboard drivers mapped the ß key to that code point.
I am not aware of any keyboard layout with a β (beta) key mapped to that CP437 code point.
Most importantly, however, IBM's own CP437 documentation from 1984 describes that code point as "Sharp s Small", i.e. ß (eszett/sharp s): Graphic Character Sets and Code Pages - 00437
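For reference, this is how one widely used mapping treats the code point in question (byte 0xE1, sitting between α at 0xE0 and Γ at 0xE2). The sketch below uses Python's bundled `cp437` codec; that codec reflects a later mapping decision and says nothing about what the original font designers intended. The variable names are illustrative.

```python
import unicodedata

# Decode the three adjacent bytes from the "Greek letters" part of the table.
greek_row = bytes([0xE0, 0xE1, 0xE2]).decode("cp437")
names = [unicodedata.name(c) for c in greek_row]

for byte, char, name in zip((0xE0, 0xE1, 0xE2), greek_row, names):
    print(f"0x{byte:02X} -> U+{ord(char):04X} {name}")

# Going the other way, the German sharp s encodes to that same byte.
sharp_s_byte = "ß".encode("cp437")
```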
 
Most importantly, however, IBM's own CP437 documentation from 1984 describes that code point as "Sharp s Small", i.e. ß (eszett/sharp s): Graphic Character Sets and Code Pages - 00437
This documentation was written years after the charset was designed, and we have no indication that the person or people who wrote it had any connection with the original designers or design. I find it very interesting that you take this as far more probative than what one of the actual creators of the PC said.
 
Since it's in the middle of a group of Greek characters, it's pretty clear that it was originally intended to be the Greek letter β, but German speakers found it close enough to ß to use it for that purpose -- enough so that by the time of VGA, it was changed to look a lot more like ß.
 
That decision, however, was made by IBM and not by random German users, presumably as soon as it occurred to IBM's sales department that, as far as their most important European market was concerned, they may just have produced the equivalent of an English character set with 25 letters.
Regardless of the font designer's original thoughts, the question remains: Was there ever an IBM keyboard driver mapping a beta key to that code point?
If not, that explains why IBM's document from 1984 says what it says.
 
...as far as their most important European market was concerned, they may just have produced the equivalent of an English character set with 25 letters.
Hardly. German orthography has long condoned replacing 'ß' with 'ss' if the former isn't available. Not having it on your computer might be more annoying than losing, say, 'ſ' (the long s), but not as annoying as losing 'c'. (And, while 'ſ' started falling out of use in English in the 19th century, it was in use in German into the 20th century.)

Was there ever an IBM keyboard driver mapping a beta key to that code point?
Was there ever an IBM keyboard with a 'β' key on a system that used CP437 and not a Greek code page when using that key?
 
The change also coincided with the Greek alphabet no longer being required learning as part of STEM courses, just as learning Latin as a language fell out of favor.

Plus, certainly by the era of EGA, if accurate representation of Greek characters was needed, programs like MathCAD would simply use graphics mode to do it.
 
Nice work on sorting the CP437/Unicode mapping. It warms my pedantic little heart. But good luck getting everybody to switch to it. :-(
Thanks for the compliment. As you can tell, I am into details (pedantic) too. I did my part, now it's time for the rest of the world to join in! ;-) Wishful thinking on my part, I guess.
 
Hello Michael,

Regarding your question about the Datamaster designation, I found in the docs that IBM referred to it as "Datamaster" and not "DataMaster".

Thank you for helping me get in contact with Dr. Bradley!

Greetings
For forum readers, RetroAND is referring to an email conversation we had. Thanks for finding the correct spelling and proving that Dr. David J. Bradley misremembered the name as "DataMaster."

Yes, I am happy to help you and I am happy you are happy too. :-)

-Michael
 
But then why would it be in the "Greek letters" section of the charset, right between α and Γ, instead of in the European letters area of the charset?


This is not a bad thing; this is a good thing. Without this Unicode couldn't support round-trip format conversion. I've had experience with situations where I had to store things in multiple charsets because conversions couldn't round-trip; it's super-nice to be able to convert all input to Unicode, store it, and know for sure that if I convert it back to the original charset the user will get the original input, not something that's been changed.


There are plenty, because plenty of source charsets may have a different idea from Unicode about whether two characters are "the same." A classic example is U+03A9 'Ω' GREEK CAPITAL LETTER OMEGA and U+2126 'Ω' OHM SIGN. (Unicode recommends you always use the former, even when writing things such as 'a 12Ω resistor'.)
@cjs, I agree with your reply to @MiaM. Thanks for supporting me in the "Greek letters" over "European letters" understanding.
 
@mwalden: welcome, nice to see ya here!



Likely because IBM itself never really made up its mind. It clearly looks like a Greek beta in the CGA, MDA, EGA, and PC Convertible fonts, and the 1986 'rev. 0' of the PS/2 MCGA/VGA firmware still had that representation, but the following year's rev. 1 looks like it was changed quite deliberately to a German eszett.

Maybe it had something to do with PC-DOS 3.2's code page support, which made IBM realize that cp437 was one character short of supporting German properly, while Greek was already getting a codepage of its own (cp851). *shrugs*
Yes, it is nice to see you here too!

I agree with you on the possibility that later (after the original IBM character set) IBM may have decided to change the Greek beta to the German eszett because the eszett had been omitted and serves a more important purpose than a Greek beta. CP437 is not an adequate encoding for Greek, except in a physics and mathematics context.
 
I think it's obvious that IBM did not hire a graphical artist or typography expert to create the character set. It was probably just some engineer working from what may have even been a hand-drawn list of characters. That's why a lot of the shapes aren't quite right and/or are inconsistent with other similar characters.

But at least the PC got separate characters for 1 and l, which the DataMaster didn't!
What you say about the characters being done by an engineer sounds likely.

The Datamaster (as @RetroAND confirmed above, the m is lowercase!) did make a bad decision there in my opinion too; using an "l" for a "1" is prone to confusion. In the Datamaster, the Japanese character set has a "1" for the number one since there is no "l" to use for it. Strange...

Lastly, I have enjoyed several of your YouTube videos. Keep up the good work!
 