Some of the Pocket Computers used a pair of 4 bit CPUs.
https://www.floodgap.com/retrobits/tpm/138.html has a description of how one worked.
Nitpick: Floodgap's description is pretty vague and not really accurate. The TL;DR is that one of the chips was the "CPU" running user programs and the other was an I/O processor handling the keyboard and display. This is not an unusual architecture in and of itself; plenty of computers have multiple "CPUs" once you count I/O handlers, and it doesn't count as some kind of "merged" CPU. (For instance, most computers with detached keyboards, and plenty without, have a microcontroller in there handling the matrix scanning.) It seems a *little* weird to us today, maybe, to have I/O processors that are the same (or nearly the same) as the "main" CPU, but in the 8 bit era this was par for the course.
For more examples of this you don't have to look any further than your typical terminal; "hardwired" serial terminals without CPUs were pretty much extinct for new designs by 1977 or so because at this point it became cheaper (and a *lot* more flexible) to drop a CPU or microcontroller onto your circuit board than it was to implement anything but the most rudimentary text handling and cursor controls with hardware state machines. Therefore probably the majority of 8-bit S100 computers had a CPU handling the display and keyboard that was of comparable power to the one in the box. (Heck, some "video cards" for S100 machines were effectively just terminals on a card eliminating the serial connection, and had their own Z80 CPU on them. Machines like the Heath/Zenith H/Z-89 have dual Z80 CPUs for this very reason; the Z-89 is just a Z80-powered Z-19 terminal with a second PCB containing the "computer" stuffed into it.)
Possibly a better example of two CPUs working in extremely tight lockstep might be how Commodore's old IEEE disk drives (2040/4040/8050/etc) worked. These drives contained not one but *two* 6502-family CPUs (a 6502 and a 6504, I believe?) clocked 180 degrees out of phase with each other, allowing them to share a common memory buffer with no contention. One CPU ran the "DOS" and communicated with the host computer; the other was a dedicated disk controller running code to handle the actual disk mechanics, etc. These CPUs ran at the same 1 MHz as the 6502 in your Commodore PET, so technically buying one of these drives tripled the theoretical power of your computer. I think it may have even been technically possible to manipulate the memory of the "DOS" processor in the drive to run arbitrary code, like people do with the C64's 1541, so in a pinch you might even be able to use one of them as a "coprocessor" for user code, although whether anyone ever actually did that or not I have no idea.
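If it helps to picture the out-of-phase trick, here's a toy Python sketch. It is not the real drive firmware or timing, just an illustration of two CPUs each owning one half of every clock cycle on a shared RAM. The `simulate` function, the "job number" mailbox at address 0, and the buffer size are all made up for the example.

```python
# Toy model, not the real drive firmware: two CPUs share one RAM by each
# owning one half of every clock cycle, so they never collide on the bus.

def simulate(cycles):
    ram = [0] * 4          # tiny stand-in for the shared buffer RAM
    bus_log = []           # (cycle, phase, cpu) for every bus access
    seen = []              # what the controller CPU read each cycle

    for cycle in range(cycles):
        # Phase 0: the "DOS" CPU gets the bus and posts a (hypothetical) job number.
        ram[0] = cycle + 1
        bus_log.append((cycle, 0, "dos"))

        # Phase 1: the controller CPU gets the bus and picks the job up.
        seen.append(ram[0])
        bus_log.append((cycle, 1, "controller"))

    return bus_log, seen


bus_log, seen = simulate(3)
slots = [(cycle, phase) for cycle, phase, _ in bus_log]
# Every half-cycle slot has exactly one owner, so no arbitration is needed.
print(len(slots) == len(set(slots)))
print(seen)
```

The point is only that neither side ever has to wait or arbitrate: the clock phase itself decides who owns the memory at any instant.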
That said, though, the two CPUs in a Commodore 4040 are still doing separate "things", they're not somehow merged into a "super-6502".
There is also a smart contraption named KimKlone which adds registers to a regular 6502.
FWIW, this is kind of reminiscent of how the 8087 math coprocessor for the 8086 works. (And also the fairly rarely used 8089 I/O coprocessor.) These chips run in lockstep with the main CPU, snooping instruction fetch transactions, grabbing relevant opcodes directly off the bus and, when necessary, grabbing the bus by instigating a DMA request. Obviously it's not quite the same thing because Intel specifically assigned an "escape" bit sequence to be used to prefix coprocessor opcodes (thereby allowing a software handler to respond appropriately if these opcodes came up without the coprocessor installed, as opposed to whatever undefined behavior might happen when an "illegal" opcode is thrown at a CPU that doesn't know how to trap it), but it was once upon a time a "legit" technique. Intel moved away from this idea to a "private" coprocessor connection with the 286 and later because having to duplicate all this bus interface circuitry in every coprocessor is pretty stupidly inefficient.
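To make the "escape" idea a bit more concrete, here's a rough Python sketch. The one hard fact in it is that the 8086 ESC opcodes are the ones whose first byte looks like 11011xxx (0xD8 through 0xDF). Everything else is simplified for illustration: the stream is already split into instructions, and the `is_escape` and `snoop` names are mine, not anything from Intel. A real 8087 tracks the CPU's prefetch queue, and the 8086 still decodes the ESC itself to work out any memory operand address.

```python
ESC_MASK = 0xF8     # look at the top five bits of the first opcode byte
ESC_PATTERN = 0xD8  # 11011xxx: the 8086 ESC (coprocessor escape) range


def is_escape(first_byte):
    """True if this opcode byte falls in the ESC range 0xD8..0xDF."""
    return (first_byte & ESC_MASK) == ESC_PATTERN


def snoop(instructions):
    """Split a pre-decoded instruction stream the way a bus-snooping
    coprocessor would: it claims ESC opcodes and ignores everything else."""
    cpu_work, coprocessor_work = [], []
    for instr in instructions:
        if is_escape(instr[0]):
            coprocessor_work.append(instr)
        else:
            cpu_work.append(instr)
    return cpu_work, coprocessor_work


# Two ordinary one-byte opcodes mixed with two ESC-prefixed ones.
stream = [(0x90,), (0xD9, 0xE8), (0x40,), (0xD8, 0xC1)]
cpu_work, coprocessor_work = snoop(stream)
print([bytes(i).hex() for i in coprocessor_work])
```

Because both chips watch the same fetches, no explicit "hand this to the FPU" step is needed; the bit pattern alone decides who acts on an instruction.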
A subset of this "trapping" technique was sometimes used to patch errors in system ROMs, back when memory was expensive enough to actually justify slapping a PCB with several chips into a system to salvage the existing ROM vs. just replacing it.
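I don't have a schematic for any particular patch board handy, so take this as a guess at the general shape rather than a description of real hardware: something watches the address bus and, on a match, supplies its own byte instead of the ROM's. The `make_patched_read` helper and the sample addresses below are hypothetical.

```python
def make_patched_read(rom, patches):
    """Return a read function that overrides selected ROM addresses.

    rom     -- the original (buggy) ROM contents
    patches -- {address: replacement_byte}, standing in for the patch board
    """
    def read(addr):
        if addr in patches:        # address comparator "hit": board drives the bus
            return patches[addr]
        return rom[addr]           # otherwise the original ROM answers
    return read


rom = bytes([0x10, 0x20, 0x30, 0x40])
read = make_patched_read(rom, {2: 0x99})   # pretend the byte at address 2 is the bug
print([hex(read(a)) for a in range(len(rom))])
```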
Extended words in CPU with only ADC/SBC/shifts/rotations by reducing the data path by two bits to generate carry out as carry in the next CPU.
... so how is that supposed to work exactly? I mean, honestly curious if this is a thing you've seen.
I would definitely say that bit-slice "CPUs" do not count here. They are, by definition, not "complete" CPUs on their own, and are explicitly designed to be used in arbitrarily ganged configurations.