Not to drag this any further off topic, but looking at the Amstrad "forced code" I think it's a bit clearer. The force-fed code does not look at the address bus, but simply feeds the instruction stream to the CPU. I think that is the reason they do 256 LDI instructions instead of an LDIR. The LDIR operates by holding the PC at the instruction start and effectively repeating the LDIR instruction until BC==0. But that won't work if the force-feed mechanism isn't watching the PC. Basically, until the OUT 0F8H instruction executes, the PC is irrelevant. I'm not sure how they are differentiating between the instruction byte fetches and the memory accesses of the LDIs, though. The M1 signal identifies the opcode bytes, but not the operand fetches for OUT, LD DE, and JP. So mysteries remain. Since this forced code is coming from the main ASIC directly connected to the CPU (while the code it loads is apparently coming from the printer controller ROM), it's possible that the ASIC is aware of the exact byte stream to fetch and thus differentiates instruction bytes from data bytes that way (possibly "seeing" the LDI and enabling a separate mechanism for the subsequent data read - i.e. the next /MREQ+/RD after fetching the LDI is always to the printer controller ROM).
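To make the LDI-versus-LDIR point concrete, here is a toy Python model. It is only a sketch under assumptions: the `BlindFeeder` class and `run` function are made up for illustration, the feeder is assumed to hand out the next byte of a fixed stream on every opcode fetch without ever looking at the PC, and the two zero bytes after the LDIR are just a placeholder for whatever code would follow.

```python
# Toy model, not the real PCW logic: a feeder that ignores the address bus.
LDI = (0xED, 0xA0)
LDIR = (0xED, 0xB0)


class BlindFeeder:
    """Hands out opcode bytes strictly in order, never looking at PC (assumption)."""

    def __init__(self, stream):
        self.stream = list(stream)
        self.pos = 0

    def fetch(self):
        byte = self.stream[self.pos]
        self.pos += 1
        return byte


def run(stream, count):
    """Execute two-byte block-copy opcodes from a blind feeder; return bytes copied."""
    feeder = BlindFeeder(stream)
    bc = count
    copied = 0
    while feeder.pos + 1 < len(feeder.stream):
        op = (feeder.fetch(), feeder.fetch())
        if op == LDI:
            copied += 1
            bc = (bc - 1) & 0xFFFF
        elif op == LDIR:
            copied += 1
            bc = (bc - 1) & 0xFFFF
            # A real LDIR with BC != 0 sets PC back by 2 and fetches itself again.
            # The blind feeder never sees PC, so the refetch simply receives the
            # next bytes in the stream and the repeat is lost.
        else:
            break  # any other opcode ends this toy run
    return copied


copied_with_ldi = run(LDI * 256, 256)          # 256 separate LDI instructions
copied_with_ldir = run(LDIR + (0x00, 0x00), 256)  # one LDIR, then placeholder bytes
```

In this model the 256 LDIs copy all 256 bytes, while the single LDIR copies one byte and then gets fed something else on its refetch.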
It's easier to understand by thinking of LDI not as LDI, but as "Read from RAM, Write to RAM" - HL is ignored, and as you note, BC is irrelevant, so neither has to be acknowledged.
M1+RD+MREQ indicates it's time to load the "LDI" onto the data bus. !M1+RD+MREQ indicates it's time to load program data onto the bus. It could use the address bus, but as it's not connected to the system address bus, it must use a counter that would exist internally.
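The decode described above can be sketched in Python as follows. This is a guess at the behaviour, not the actual gate array logic: `ForceFeeder`, `read_cycle` and `simulate_ldi_run` are hypothetical names, the signals are modelled as booleans meaning "asserted" (they are active-low on the real bus), and only the run of LDIs is covered, not the OUT, LD DE or JP operand fetches.

```python
LDI_OPCODE = (0xED, 0xA0)  # both bytes of LDI are fetched with M1 asserted


class ForceFeeder:
    """Hypothetical force-feed logic driven only by M1, MREQ and RD."""

    def __init__(self, payload):
        self.payload = bytes(payload)
        self.counter = 0       # internal byte counter standing in for the address bus
        self.opcode_phase = 0  # 0 -> drive ED next, 1 -> drive A0 next

    def read_cycle(self, m1, mreq, rd):
        """Return the byte to drive onto the data bus, or None to stay off the bus."""
        if not (mreq and rd):
            return None  # not a memory read (e.g. the LDI write to (DE))
        if m1:
            byte = LDI_OPCODE[self.opcode_phase]  # opcode fetch: feed the LDI
            self.opcode_phase ^= 1
            return byte
        byte = self.payload[self.counter]  # plain read: feed the next program byte
        self.counter += 1
        return byte


def simulate_ldi_run(payload):
    """Walk the bus cycles of one LDI per payload byte; return (opcodes, bytes written)."""
    feeder = ForceFeeder(payload)
    opcodes, written = [], []
    for _ in range(len(payload)):
        opcodes.append(feeder.read_cycle(True, True, True))  # M1 fetch of ED
        opcodes.append(feeder.read_cycle(True, True, True))  # M1 fetch of A0
        data = feeder.read_cycle(False, True, True)          # the "read from (HL)"
        feeder.read_cycle(False, True, False)                # write cycle: feeder idle
        written.append(data)                                 # CPU stores it at (DE)
    return opcodes, bytes(written)


opcodes, copied = simulate_ldi_run(b"BOOT")
```

Note that the address never appears anywhere in the model, which is the point: the counter alone decides which byte comes next.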
But the printer IC can't see any of the signals it needs to do this.
ref:
https://www.retroisle.com/amstrad/pcw/Technical/Hardware/pcw_cpu.gif
There needs to be something to maintain state - but the printer chip doesn't see M1, so it's clear this is happening within the gate array, and the gate array is also preventing RAM data being sent to the data bus while reading the bootstrap. It looks like extra signals are sent to the printer chip to set it up to transfer its data sequentially. Given there are bidirectional data bus signals involved and both ICs are active during the bootstrap, it's entirely possible and quite likely that the LDI command isn't coming from the printer chip itself, but is a response from the gate array since it's controlling state. The 256 bytes of transfer might also piggyback on some other counter within the gate array that is necessary for something else - which potentially eliminates the need for an additional counter dedicated to the bootstrap, though there must still be a counter in the printer IC to track which byte it's sending in the bootstrap code.
Also, the gate array would disable RAM reads, because it is the interface to the DRAM, effectively presenting the RAM data to the CPU - there is no direct connection between the RAM and the CPU - so tristating itself when it knows the printer chip is sending information isn't a huge imposition.
I always used to think the gate array ran the full bootstrap, but it's interesting to note they used the printer controller IC for the data rather than the gate array.
Amstrad was always looking to save a few dollars, so this might have been little more than a way to avoid paying for an EPROM when they had extra chip-based real-estate in the printer IC. It also would make it difficult to copy the PCW, which would have been an Amstrad concern at the time.
Though that's just conjecture.
On the topic of interesting use of LDI and block commands, and to get back to the original topic, when mapping RAM to DISK space through hardware via I/O, I mentioned earlier that you can only use OTDR and INDR and not OTIR and INIR -
This is because the counter is then provided by the Z80, and you can use the upper address lines A8 to A15 to transfer up to 256 bytes of memory using the Z80 block I/O commands -
The reason for this is that A8 to A15 reflect the state of the B register (C appears on A0 to A7), and B is decremented during block I/O commands. If B is being used as an index into the RAMDISK, you need to ensure the memory pointer (HL) stays locked to the B register. OTDR and INDR decrement HL in step with B, whereas OTIR and INIR increment it, so the two move in opposite directions.
This method contrasts with the Amstrad method by using the Z80 address bus to indicate which byte in RAM is being selected through I/O commands.
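Here is a small Python sketch of what the address bus does during the block output commands. On the Z80, block I/O puts B on A8 to A15 and C on A0 to A7, and OUTI/OUTD decrement B before the port address is driven. The function name `block_out_trace`, the port number 0x40 and the start addresses are made up for illustration, and the model ignores flags and timing.

```python
def block_out_trace(b, c, hl, step):
    """Trace OTIR (step=+1) or OTDR (step=-1).

    Returns one (A8-A15 value, L register) pair per byte transferred.
    """
    trace = []
    while True:
        b = (b - 1) & 0xFF          # B is decremented before BC goes on the bus
        io_addr = (b << 8) | c      # B on A8-A15, C on A0-A7
        trace.append((io_addr >> 8, hl & 0xFF))
        hl = (hl + step) & 0xFFFF   # OTIR increments HL, OTDR decrements it
        if b == 0:
            break
    return trace


# Hypothetical RAMDISK port 0x40, four bytes each way.
otdr = block_out_trace(4, 0x40, 0x8003, -1)
otir = block_out_trace(4, 0x40, 0x8000, +1)

# With OTDR the upper address byte and L count down together, so the offset
# between them stays fixed; with OTIR they move in opposite directions.
otdr_offsets = {hi - lo for hi, lo in otdr}
otir_offsets = {hi - lo for hi, lo in otir}
```

With the decrementing form the RAMDISK index on A8 to A15 tracks the low byte of the memory pointer one for one, which is what keeps the two sides of the transfer aligned.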