The real downer was that there were languages that used two consecutive nulls as an operator or part of one. You can imagine the kludges that resulted from that (can you say "equivalent digraph"?)
Sadly, such conventions continue in character encoding to this day. It's called UTF-8... you want a kludge, there it is. That it falls back to a single byte for standard 7-bit ASCII is nice and all, but once you get into more complex non-Latin-1 languages it becomes a bloated mess and you're better off using UTF-16... and UTF-8 is a pain in the ASS to process at the per-byte level since a character can have a variable byte length. To be fair, the continuation bytes all have the high bit set, so handy control codes like \r and \n can't actually land mid-character the way they could in older multi-byte encodings -- but you still can't index to the Nth character or split a string at an arbitrary byte without walking the whole thing. I've reached the point in a number of programs where internally I run UTF-16 regardless of what character encoding I'm reading in.
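To illustrate the per-byte pain: even the simplest operations mean decoding a sequence length from each lead byte. A minimal sketch in C -- the function names are mine, and it assumes well-formed input (a real decoder also has to validate continuation bytes and reject overlong forms):

```c
#include <stddef.h>

/* Length of a UTF-8 sequence, judged from its lead byte. */
static size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)           return 1;  /* plain 7-bit ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;                             /* continuation or invalid byte */
}

/* Counting characters means walking every byte -- no O(1) indexing. */
static size_t utf8_strlen(const char *s)
{
    size_t count = 0;
    while (*s) {
        size_t n = utf8_seq_len((unsigned char)*s);
        s += (n ? n : 1);   /* resync one byte at a time on bad input */
        count++;
    }
    return count;
}
```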
... Running UTF-16 internally seems to be part of WebKit and Blink's handling of it as well. Makes sense, really, to decode once on read rather than having to brute-force check every character during string processing. "Day job"-wise I'm working on an Electron (similar to nw.js) web crapplet right now, and we're "stuck" using JSON or XML on data that would be SO much simpler and faster to just use FS, GS, RS, and US on, stripping the control codes we don't want from the data, instead of screwing around escaping the data with entities -- taking already-oversized formats and making the data bigger still.
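For reference, FS/GS/RS/US are ASCII 0x1C through 0x1F. Since they can never legally occur inside the payload, parsing needs no escaping at all -- just a byte scan. A quick hypothetical sketch in C (names and sample data made up for illustration):

```c
#include <stdio.h>
#include <string.h>

#define RS 0x1E  /* record separator */
#define US 0x1F  /* unit (field) separator */

/* Walk a buffer of RS/US-delimited data, printing each field.
 * No escaping needed: the separators cannot occur inside the data,
 * unlike quotes and angle brackets in JSON or XML. */
static void walk(const char *buf, size_t len)
{
    size_t rec = 0, fld = 0, start = 0;
    for (size_t i = 0; i <= len; i++) {
        if (i == len || buf[i] == RS || buf[i] == US) {
            printf("record %zu field %zu: %.*s\n",
                   rec, fld, (int)(i - start), buf + start);
            start = i + 1;
            if (i < len && buf[i] == RS) { rec++; fld = 0; }
            else fld++;
        }
    }
}

int main(void)
{
    const char data[] = "Jones\x1F" "42\x1E" "Smith\x1F" "17";
    walk(data, sizeof data - 1);
    return 0;
}
```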
... and yeah, I remember the headaches of 6-bit character encodings from my dalliance with DiBOL and importing data from older systems to new ones... and I imagine it was only worse the further back you go.
Ugh, DiBOL, now there's a language I don't miss.
If you look at old communications about the IBM 1620, you'll find similar concerns expressed by Dijkstra about how badly character handling was accomplished by the hardware. For example, it's possible to read or write a "numeric blank" character, but not to test for the presence of one -- or do any arithmetic or logical operation on one. You can move one around, but that's it. Trying to add one to a number got you an error stop.
That had to make importing data from other systems FUN; pre-processing it before sending it to that platform was likely the best approach.
When writing a business BASIC compiler on an x80 architecture, it was obvious that null-terminated strings were not the way to go.
... and really, that's the conclusion I think Microsoft came to when working on Windows. I know when the culture clash between IBM and M$ came to a head over OS/2, that was very much the type of thing IBM seemed to love shoving their head up their backside to smell their own farts over, insisting on it "because it's how all our other 'big business' client systems work". They were so resistant to change and stuck in the mud when it came to efficient code vs. "this is how we've always written software".
Again though, it's what one can expect from the "paid by the k-loc" scam artists that made up a lot of business programming -- especially in the big iron world -- at the time. Something made all the more scary when you consider that so much of the software then ran on interpreters.
Instead, I settled on a "descriptor" system wherein the size, length, and dimension of a character array were tracked, which made string handling very straightforward. For example, if one defined "+" as a concatenation operator, one could write "A$ = A$+B$+C$+D$" and handle the strings optimally.
Because you could easily sum the space to allocate for the new value from the dimension.
In C, this simple operation involves a lot of "we need to find the length of a string" stuff. But the difference is that strings are a part of the BASIC language, unlike C, where they're mostly a notational convenience for handling arrays of characters. Were strings part of standard K&R C, you'd likely have seen a descriptor (or other metadata) system used.
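Just to make the contrast concrete, here's a rough sketch of the descriptor idea in C -- the struct layout and names are hypothetical, not from any particular BASIC runtime:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical descriptor: length is tracked alongside the data, so
 * concatenation never has to scan for a terminating NUL. */
typedef struct {
    size_t len;   /* current length     */
    size_t cap;   /* allocated size     */
    char  *data;  /* not NUL-terminated */
} dstr;

/* A$ = B$ + C$ + ... : sum the lengths up front, allocate once, copy. */
static dstr dstr_concat(const dstr *parts, size_t n)
{
    dstr out;
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += parts[i].len;          /* O(1) per operand */

    out.len = out.cap = total;
    out.data = malloc(total ? total : 1);

    for (size_t i = 0, pos = 0; i < n; i++) {
        memcpy(out.data + pos, parts[i].data, parts[i].len);
        pos += parts[i].len;
    }
    return out;
}
```

Chained strcat() calls, by comparison, re-scan the growing result for its NUL terminator on every append; the descriptor version sums the lengths once and does a single allocation.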
Which is really part of why I find C extremely limited. If anything, to use your word, which describes it best -- strings in C are a kludge. Kind of like how objects in C++ feel shoe-horned in there any-old-way.
Though at least it's not pervasively object based like Java... more like pervertedly object based.
C is what it is -- basically a semi-portable assembly language.
Except for where it totally isn't, but that's the problem of being TOO portable. Architectures are often a bit too different for a generic language to "one size fits all". As the Air Force quickly learned at the dawn of the jet age, it's the flaw of averages: an average size actually fits nobody.
As far as overhead goes in x86 C compilers, it all depends on what you're after. If you write your prologue in assembly to fit your needs...
... at which point I might as well just write the whole thing in assembly. I was hoping for an easy, off-the-shelf solution to give me something better than what I was getting out of TP7, one I could compile from the host OS instead of from inside DOSBox. Trying the different C compilers, they just weren't giving me that -- at best a step sideways, at worst several steps backwards. It's actually kind of a laugh that MOST of my "problems" disappeared when I gave up on high-level languages altogether. At least two-thirds of Paku Paku 2.0's codebase was already in assembler, so porting it to all be "near" (except the timer ISR, of course) wasn't a big deal... and stripping out the overhead of variable passing and far calls is netting me that extra wedge of speed I need for the fastest game level to run full-out on an unexpanded Junior.
I'm relatively certain most of the issues I have with C for my current set of x86 projects would evaporate if I upped the minimum target to a 16MHz 386SX... but with a 128k DOS 2.x PCjr as the minimum target? Not so much. Even less so if I drop that to a 64k cassette-based machine. No matter what I do it's not going to fit the 40k free after booting DOS 2.x on a 64k rig. Shame... though I'm tempted to see if I can restrict myself to DOS 1.1 compatibility, since the only real offender now would be the file operations for the high-scores table, as I've gone monolithic on the executable -- no more separate data files in the distro for anything but scores.
-- and I'm even tinkering with making it a self-modifying file to get rid of that too!
Real fun is going to be figuring out the best way to load it from tape -- does cassette BASIC blow up in your face if you try to blindly append machine language to a BASIC program (given there's no CLOADM), or am I just best off loading a .BAS that, when run, does a DEF SEG, BLOADs a .M, and CALLs it? Pretty sure DATA and POKE ain't gonna cut it on what's looking to be a ~32k executable.
People often scoffed at tape on the PC, but it was in use on other platforms far past the PC's introduction -- I always felt it would have gotten a lot more use if they had provided a proper way to simply load machine language programs from BASIC in one easy pass. I think that lack would have sunk the Commodore side too if the trick of appending the machine language raw to the .BAS file hadn't caught on early. Once I kitbash a cable for the Junior (assuming I can track down a proper connector) I'm going to try that method first just to see what it does, since done "properly" it would be a single load-and-run. Of course, CREATING that .BAS file is gonna be tons of fun... the 5150 and Junior are tape-file compatible with each other, right?
Cassette based PC software is really not something I've seen a lot of people try to do anything with, hence my interest in doing it.
Either way, I'm likely going to pull support for the fancier sound cards from the "tape" version -- just stick to Junior/Tandy/speaker... especially with my 120Hz arpeggio two-voice speaker sound being a lot nicer than the priority-based single voice I was using. I can axe the high-score saving code for that build too, since where would you put it? Still working on the Junior-only build as well, using the linear 160x100 mode instead of the tweaked text mode; that one can have a much smaller executable when I'm done and will have even more overhead to spare -- that's the version I'm tinkering with the idea of making a cartridge out of (assuming I ever get that far).
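For anyone who hasn't tried the arpeggio trick: it's just reprogramming PIT channel 2 on every tick and alternating which voice owns the speaker. A rough sketch, assuming a Borland-style dos.h with outportb()/inportb() -- the function names and note values here are mine:

```c
#include <dos.h>  /* Borland-style outportb()/inportb(); adjust for your compiler */

#define PIT_CTRL 0x43
#define PIT_CH2  0x42
#define PPI_B    0x61
#define PIT_HZ   1193182UL

/* Program PIT channel 2 (mode 3, square wave) to the given frequency. */
static void speaker_tone(unsigned freq)
{
    unsigned divisor = (unsigned)(PIT_HZ / freq);
    outportb(PIT_CTRL, 0xB6);                /* ch2, lo/hi byte, mode 3 */
    outportb(PIT_CH2, divisor & 0xFF);
    outportb(PIT_CH2, divisor >> 8);
    outportb(PPI_B, inportb(PPI_B) | 0x03);  /* gate + speaker data on  */
}

/* Called ~120 times a second (e.g. from a reprogrammed timer ISR):
 * alternate the two voices fast enough and the ear hears both. */
static unsigned voice[2] = { 440, 554 };     /* hypothetical A4 + C#5   */
static unsigned which;

void arpeggio_tick(void)
{
    speaker_tone(voice[which]);
    which ^= 1;
}
```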
Though I'm also working on a proper 1.5-bit speaker driver for systems fast enough to handle it... it doesn't need to be too fancy, as for Paku Paku and most of the games I'm working on for said target platforms I really don't need more than two voices anyway.
Crazy idea: has anyone ever tried using the cassette port on the 5150 or Jr. as a second voice, the way TRS-80 users used theirs for audio? Might be fun... though I don't think I'd have the time to service the timer at that speed if it's a strictly on/off affair.
Oh, and @Scali, interesting result... not what I got here, but I've moved on from C and high-level languages for this purpose entirely. In a way, what Chuck(g) said about languages is the problem: it really doesn't matter which high-level language I try, they all give me the same results, and those results don't line up with my needs... even though TP7 remains the closest in size and speed compared to the alternatives I've tried... at least not without brute-forcing large sections of the build process, at which point, again, I might as well just use assembly... and lots of macros.
Again: at best, C ended up being a step sideways from what I had, not forwards. As always,
ever forward.