* Fix issue where multiple clusters would be loaded 0x800 apart rather than
contiguously.
* Get rid of some global variables, saving space.
* si now points to the start LBA rather than the partition table entry,
saving a bit of space as we no longer need to use offsets.
* In read_cluster_chain, drop pushad/popad as they in fact do not
save the upper 16 bits of the registers. Instead, just push
and pop the registers that are used.