I thought that the CPU will always perform a data transfer equal to its bus width, for practical reasons. Since each port/memory address is DEFINED to be 8-bit word size in the IBM PC and beyond, wouldn't a 16-bit transfer always fetch two memory addresses'/ports' worth of data and shift/mask away the unnecessary bits? What am I misunderstanding? I know I'm probably missing something, I'm just not sure what.
...
The technique you described won't work, for two reasons: for memory writes, because the CPU doesn't know what should go into the "unused" parts of the word, and for I/O, because an unintended access (even a read) can affect the state of an I/O device.
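To make the I/O hazard concrete, here is a minimal sketch in C, assuming a hypothetical memory-mapped device (the base address, register layout, and FIFO behaviour are all invented for illustration) where a read of the 8-bit data register pops a byte from the receive FIFO:

```c
#include <stdint.h>

/* Hypothetical device layout: data register at +0 (each read pops one
 * byte from the receive FIFO), status register at +1. Address made up. */
#define UART_BASE 0xA0000000u

static volatile uint8_t  *const uart_status = (volatile uint8_t  *)(UART_BASE + 1);
static volatile uint16_t *const uart_pair   = (volatile uint16_t *)UART_BASE;

uint8_t read_status_safe(void)
{
    /* 8-bit bus cycle: only the status register is touched. */
    return *uart_status;
}

uint8_t read_status_broken(void)
{
    /* 16-bit bus cycle covers both registers: the data register is read
     * too, and one byte of received data is silently lost. */
    return (uint8_t)(*uart_pair >> 8);
}
```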
Some 16-bit architectures access memory and I/O ports strictly in 16-bit words (the PDP-11, for example). But that is not the case at all for the x86 architecture: x86 can do 8-bit, 16-bit, and 32-bit (386 and later) memory and I/O accesses, as well as 64-bit (Pentium and above*) memory accesses.
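As a quick illustration, each C load below performs a memory access of exactly the requested width; the x86 instructions in the comments are the typical code a compiler emits, not something the C standard guarantees:

```c
#include <stdint.h>

uint8_t  load8 (volatile uint8_t  *p) { return *p; }  /* typically: mov al,  [p] */
uint16_t load16(volatile uint16_t *p) { return *p; }  /* typically: mov ax,  [p] */
uint32_t load32(volatile uint32_t *p) { return *p; }  /* typically: mov eax, [p] (386+) */
```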
To perform this, x86 CPUs have "byte enable" control lines that determine which bytes are to be addressed within one word (a word being 2 bytes on the 8086, 286, and 386SX; 4 bytes on the 386DX and 486; 8 bytes on the Pentium)...
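Here is a rough model of that decoding, assuming a 4-byte-wide bus as on the 386DX/486 and an access that fits within one physical word (the function and its bit convention are mine; the real BE0#..BE3# pins are active-low):

```c
#include <stdint.h>

/* Bit n set in the result means byte lane n of the 32-bit bus is driven.
 * Valid only when the access does not cross the 4-byte word boundary;
 * the crossing case is covered below. */
uint8_t byte_enables(uint32_t addr, unsigned size /* 1, 2, or 4 */)
{
    unsigned lane = addr & 3;                    /* offset within the word */
    return (uint8_t)(((1u << size) - 1u) << lane);
}

/* byte_enables(0x1001, 1) == 0x2  -> lane 1 only
 * byte_enables(0x1002, 2) == 0xC  -> lanes 2 and 3 */
```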
On x86 it is possible for the addressed "logical" word (16-bit, or 32-bit on the 386 and later) to cross a physical word boundary; in that case the CPU performs two access cycles, one for the lower part of the word and one for the higher part.
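A small sketch of the rule (my own formulation, not a datasheet algorithm): count how many physical words the access touches, which is the number of bus cycles the CPU has to run:

```c
#include <stdint.h>

/* Number of bus cycles needed for an access of `size` bytes at `addr`
 * on a bus that is `bus_bytes` wide (2 for the 8086, 4 for the 486). */
unsigned bus_cycles(uint32_t addr, unsigned size, unsigned bus_bytes)
{
    uint32_t first_word = addr / bus_bytes;
    uint32_t last_word  = (addr + size - 1) / bus_bytes;
    return (unsigned)(last_word - first_word + 1);
}

/* bus_cycles(0x1000, 2, 2) == 1   aligned 16-bit access on an 8086
 * bus_cycles(0x1001, 2, 2) == 2   odd address: the access is split */
```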
Additionally, the IBM AT and compatibles include data bus steering logic that handles the case where the I/O or memory data path is narrower than the CPU's data path. This logic splits one CPU access cycle into two (or four, in the 32-bit case) access cycles. For example, if an 8-bit extension ROM is connected to a 386 motherboard and a program tries to read a 32-bit word from it, the read is split into four successive 8-bit read accesses.
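A sketch of what that steering amounts to, with a made-up 8-bit read callback standing in for the 8-bit ROM:

```c
#include <stdint.h>

/* Hypothetical handler for one 8-bit bus cycle on the narrow device. */
typedef uint8_t (*read8_fn)(uint32_t addr);

/* The steering logic runs four successive 8-bit cycles and places each
 * byte on the CPU data lane it belongs to (x86 is little-endian). */
uint32_t steered_read32(read8_fn dev_read, uint32_t addr)
{
    uint32_t value = 0;
    for (unsigned i = 0; i < 4; i++)
        value |= (uint32_t)dev_read(addr + i) << (8u * i);
    return value;
}
```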
*The Pentium and some later x86 CPUs, while technically 32-bit CPUs, use a 64-bit data path for memory accesses to achieve better throughput.