MS-DOS 4.0(1) Opensource on Github

Yeah, I think that Git is getting way more blame there than it should, and I made an extended comment about that. Summary: there is a really annoying Git issue that you need to be aware of, but it's easily fixed; ZIP files should not be trusted sources of timestamps either; and in general Git will provide better support than ZIP files for making sure you have the source as actually released, and not some random modified version. Oh, and I'm pretty familiar with how to deal with these issues (and already have ad-hoc tools for, e.g., handling archival timestamps), and I'm happy to help anybody who wants to try to make a proper, replicable archive copy of the code.
 
I found this YouTube video explaining how to do it. I would like to reproduce it, but I'm not a Linux user. At 01:50 a sed command is used to replace some bytes with something else, but I don't understand with what exactly. Can anybody explain what it does, please?

The issue being addressed is that someone (very incorrectly) ran the original code through a character set/encoding conversion tool, which changed a few of the bytes. That's bad enough, but worse yet, some of the original tools used to assemble this code are using that original character set/encoding and consider some of the new byte sequences to be invalid, and presumably barf on them.

Ideally the way to recover from this would be to go back and get the original files, before they were run through that conversion tool. This doesn't seem possible at the moment, so in an attempt to recover anyway, the presenter is running yet another conversion tool (of sorts, it's basically a hack) to at least partially reverse the conversion process, enough that the original tools will now accept this new source code.

The sed command in particular is as follows:

[Image: 1714622920134.png, screenshot of the sed command from the video]

The for loop takes each of those filenames in turn, assigned to $f, and runs the sed command with the filename substituted in src/$f.

Sed itself is invoked with three options, albeit not in the most obvious way. The first is -i, which asks it to overwrite the input file instead of writing its edited output to the standard output. So src/$f will be replaced with the changed version of the file that sed makes.

The second option is combined with the third; a clearer way of stating it would be -r -e ... (just don't fold the -i in as well: GNU sed treats anything attached to -i as a backup-file suffix). But anyway, the important thing is the -r, which is much better known as -E, and which says to use extended regular expressions rather than "standard" (basic) regular expressions. Essentially, this is just selecting the exact regular expression language that sed will use; it matters here because the | alternation in the pattern is an extended-syntax feature.

The third option is where the real work happens: -e command gives sed a command to run. (This can be used multiple times to give it multiple commands or, if you have enough of them, you might put them in a file and ask sed to read commands from that.)

The command itself is of the form s/match/replacement/g, which says to process each line in the file and, anywhere you find match, replace it with replacement. This would normally be done only once per line, but the g at the end says to repeat this as many times as needed on that line until every copy of match is replaced.

The particular match they're using uses the \xNN hex character notation, where \x41 would be an ASCII or UTF-8 'A' and so on. The two vertical bars in the match pattern separate alternatives, so AB|CD would match the characters "AB" or the characters "CD". So here they're matching any one of three different possible sequences of characters, which are (in my usual Motorola hex notation):
  • $EF $BF $BD
  • $C4 $BF
  • $C4 $B4
The replacement part is much simpler: it's a straight '#' character. So, anywhere sed finds one of the three sequences above, it will be replaced with the single character '#'.
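
For anyone not on Linux, roughly the same substitution can be sketched in Python. This is only my reading of the sed command as described above, not the presenter's script; the names patch_bytes and patch_file and the sample line are made up for illustration.

```python
import re

# The three byte sequences from the sed pattern, joined with "|" (alternation).
PATTERN = re.compile(rb"\xEF\xBF\xBD|\xC4\xBF|\xC4\xB4")

def patch_bytes(data: bytes) -> bytes:
    """Replace every match with a single '#', like s/.../#/g in sed."""
    return PATTERN.sub(b"#", data)

def patch_file(path: str) -> None:
    """Overwrite the file with its patched contents, like sed -i."""
    with open(path, "rb") as fh:
        data = fh.read()
    with open(path, "wb") as fh:
        fh.write(patch_bytes(data))

# Hypothetical sample line, not taken from the real source files.
sample = b"; \xEF\xBF\xBD box \xC4\xBF corner \xC4\xB4\r\n"
print(patch_bytes(sample))
```

Calling patch_file once per filename would play the role of the for loop. Unlike sed, this works on the whole file at once rather than line by line, which makes no difference here because the pattern never spans a line break.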

Those character sequences above are various characters in UTF-8 ('�', 'Ŀ' and 'Ĵ', I think) but they don't really have any meaning because they're the result of a nonsense conversion, as far as I can tell. So basically, the sed script just gets rid of them; presumably '#' is a comment character or within a comment or something else ignored or at least not invalid at the particular points in the file where the above strings occur.

Another question: what are these bytes doing in an ASM file at all?

Per above, someone apparently incorrectly ran a conversion program on the files that they ought not have run.

(Personally, I bet it was non-Vim users, since Vim users would let Vim do the appropriate character set conversion on the fly, keeping the file in the original encoding but converting to UTF-8 for display systems that require that. :-))
 
$EF $BF $BD is U+FFFD, which some editor or other tool configured for UTF-8 put in to replace what it considered invalid characters. This is the only sequence that needs to be replaced by a single character, and only in a few files; the others aren't mangled.

$C4, $B4 and $BF are the original line drawing characters: since some sequences of them can also be valid UTF-8, they weren't always replaced.
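
Here is a small Python sketch of how this kind of damage can arise. It assumes the files were originally in code page 437 (the thread doesn't say which code page, so that part is a guess) and were pushed through a lossy UTF-8 decode; the sample bytes are invented.

```python
# Three line-drawing bytes as they might appear in a comment box.
original = b"\xC4\xC4\xBF"

# Read as code page 437, these are box-drawing characters.
print(original.decode("cp437"))

# A tool that assumes UTF-8 rejects the first $C4 (the byte after it is not
# a continuation byte) but accepts $C4 $BF as a valid two-byte character.
mangled_text = original.decode("utf-8", errors="replace")
mangled = mangled_text.encode("utf-8")

print([hex(ord(ch)) for ch in mangled_text])  # code points after the lossy decode
print(mangled.hex(" "))                       # bytes written back out
```

The rejected byte comes back as $EF $BF $BD and its original value is gone, while the $C4 $BF pair passes through byte-for-byte. That would explain why only the U+FFFD sequences need manual repair.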

I've converted all the text files to DOS format, replaced the UTF-8 sequence with $FE (▪), and then manually corrected all the files where it appeared. The fixedsrc.zip file contains the hopefully correct result, with no other modifications from what is on GitHub (except for the also corrected paths in SETENV.BAT, so that it will build).
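
The two mechanical steps described above could be sketched in Python like this. It is an illustration of the idea only, not the tool that was actually used, and the function name to_dos is made up.

```python
import re

def to_dos(data: bytes) -> bytes:
    """Normalise line endings to CR LF and mark each damaged character."""
    # Turn bare LF into CR LF, leaving existing CR LF pairs alone.
    data = re.sub(rb"(?<!\r)\n", b"\r\n", data)
    # Swap the UTF-8 encoding of U+FFFD for the single byte $FE as a marker.
    return data.replace(b"\xEF\xBF\xBD", b"\xFE")

# Hypothetical input mixing both kinds of line ending.
print(to_dos(b"; \xEF\xBF\xBD\nRET\r\n"))
```

The $FE bytes are only placeholders: the original byte at each of those positions has been lost, so every marked spot still has to be corrected by hand.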

I've also made an experimental change to BIOS\MSINIT.ASM to support more than one primary partition per drive, like newer DOS versions do. The original code would only choose one (the first it finds, even if it isn't the active one). It works with 4 primary partitions on one drive, but I haven't tested how it interacts with extended partitions yet!
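
For context, the four primary partition entries sit near the end of a drive's first sector. Below is a minimal Python sketch of reading them, assuming the classic MBR layout (table at offset 446, four 16-byte entries, $55 $AA signature at offset 510). It only illustrates the data structure and says nothing about how MSINIT.ASM itself is written; the function name and the sample sector are made up.

```python
import struct

ENTRY = "<B3xB3xII"  # status, CHS start (skipped), type, CHS end (skipped), first LBA, sector count

def primary_partitions(sector: bytes):
    """Yield (index, active, type, first_lba, sector_count) for each used entry."""
    if len(sector) < 512 or sector[510:512] != b"\x55\xAA":
        raise ValueError("not an MBR sector")
    for i in range(4):
        entry = sector[446 + 16 * i : 446 + 16 * (i + 1)]
        status, ptype, first_lba, count = struct.unpack(ENTRY, entry)
        if ptype != 0:  # type 0 marks an unused slot
            yield i, status == 0x80, ptype, first_lba, count

# Invented sample: two primary partitions, the second one flagged active.
sector = bytearray(512)
sector[510:512] = b"\x55\xAA"
sector[446:462] = struct.pack(ENTRY, 0x00, 0x06, 63, 65520)
sector[462:478] = struct.pack(ENTRY, 0x80, 0x06, 65583, 65520)

for part in primary_partitions(bytes(sector)):
    print(part)
```

In this sample the first used entry is not the active one, which is the situation where picking "the first it finds" and picking the active partition give different answers.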

EDIT: there's a problem with 5 files in CMD\GRAFTABL. working on fixing that
EDIT2: done!
 

Attachments

  • MULTPART.ZIP (29.7 KB)
  • fixedsrc.zip (5.6 MB)
It would be worth sending in a PR with that, as well as making it available from a fork repo on GitHub if the PR isn't accepted. That will make it dead easy to see exactly what you changed, without having to unzip files, make sure you understand where your two separate sets of files are and what they are, and run diff -r on the two directory trees.
 
I don't have a GitHub account and have no idea how any of this works. I'm fine with someone else doing that; as I said, the code should be the same as the original except for charset and line endings (which as far as I understand would appear in a diff as every line being changed...)
 
Ok, fair enough. At least we've still made some progress here! Hopefully someone will take this up. (Not me; I have little time for and less interest in DOS 4.0. I just feel compelled to comment when someone screws up preservation of historical files.) They're going to have to maintain a fork repo, though, since I just noticed that:

[Image: 1714652327760.png, screenshot]

so you can't file PRs against it.

...as I said the code should be the same as the original except for charset and line endings (which as far as I understand would appear in a diff as every line being changed...)
This isn't a new problem, and git diff has loads of useful options for dealing with this kind of thing, including options like --ignore-cr-at-eol. That said, I'd still probably split this into two commits: one to fix the unfortunate change in line endings (oh why oh why) and another to fix the even more unfortunate re-coding problem (triple-oh-why-oh-why), just so that the two changes are more easily reviewed and verified.
 
Well, your suggestion has been helpful anyway: I found 5 files where my (DOS-based, written in assembly) conversion tool had eaten a line for some reason. It's the font data for GRAFTABL; nothing else was affected. I'm working on fixing them right now.

edit: uploaded new zip file. "diff --strip-trailing-cr" now shows only lines where I fixed the special characters, and the two paths in setenv that had to be changed to get it to build.
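
For anyone without GNU diff or git to hand, a comparison that ignores the CR part of line endings can be sketched in Python. This only approximates what diff --strip-trailing-cr does and is not a replacement for it; the function name diff_ignoring_cr and the sample inputs are made up.

```python
import difflib

def diff_ignoring_cr(old: bytes, new: bytes) -> list:
    """Return unified-diff lines, treating CR LF and LF line endings as equal."""
    def lines(data: bytes) -> list:
        # latin-1 maps every byte to exactly one character, so no byte is rejected.
        return [line.rstrip("\r") for line in data.decode("latin-1").split("\n")]
    return list(difflib.unified_diff(lines(old), lines(new), "old", "new", lineterm=""))

# Same text with different line endings: no differences are reported.
print(diff_ignoring_cr(b"MOV AX,1\nRET\n", b"MOV AX,1\r\nRET\r\n"))

# A real change still shows up.
for line in diff_ignoring_cr(b"MOV AX,1\nRET\n", b"MOV AX,2\r\nRET\r\n"):
    print(line)
```

Decoding as latin-1 is a deliberate choice here: it keeps the comparison byte-exact for files that contain code-page bytes which are not valid UTF-8.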
 
I have now tested the BIOS\MSINIT.ASM change in QEMU, and of course it doesn't work with extended partitions. One of the system structures this code uses being named BDSM is only fitting :)
 
I believe the folks at Microsoft responsible for open-sourcing this are aware of the issue and intend to fix it when they've got time.
 