Matthias Kramm's Blog

Freespin's loader

Fast loading code on a Commodore disk drive has been an esoteric art for decades. There is a plethora of loaders, and the current state of the art loaders (LFT's spindle and Krill's loader) even manage to decode disk data on the fly, and transfer an entire track in three rotations of the floppy.

As such, it's odd that we couldn't find any existing loader that was suitable for our floppy-only demo, freespin. Let me explain.

The Floppy Drive

A Commodore 1541 floppy drive has a 6502 processor running at 1 Mhz, two 6522 I/O chips, 2048 bytes of RAM. It has two motors: One for stepping the read head and one for rotating the disk. Apart from the read head, it has only one sensor: A phototransitor detects the notch in a 5 1/4 disk that indicates the disk is writable.

The read head moves over the disk at a speed of 4.97 rotations per second, and can be positioned radially at ~80 different positions (tracks), 36 of which are typically used.

The low-level encoding stores a one bit as a switch of magnetic polarity, and a zero bit as no switch. One track can store between 48-60k bits (reference) this way.

For redundancy (and because more than two zeros in a row are more difficult to detect), data on a disk is GCR encoded. A group of 4 bits (a nibble) is expanded to 5 bits (a quintuple). While the more advanced drives (1571 etc.) have hardware support for decoding GCR data, on a 1541 the decoding has to be done by the main CPU.

This is an example for the data that's read, and the data that was originally stored:

GCR Data:      01111 10010 01110 10110 01111 01001 10010 01011
Original Data: 0101  0010  0100  0110  0101  1000  0010  0001

Because GCR quintuples are five bits, they don't align quite as nicely with byte boundaries. Using the data from the example above, in the bytes we read from the disk, the quintuples end up at bit positions 7, 2, 5, 0, 3, 6, 1, 4.

Bit            76543210 76543210 76543210 76543210 76543210
GCR Data:      01111100 10011101 01100111 10100110 01001011
               ^    ^     ^    ^     ^     ^    ^     ^
Start of       |    |     |    |     |     |    |     |
quintuple      0    1     2    3     4     5    6     7

As you can see, four of those quintuples (the ones at 7, 6, 5, 4) even span across two bytes.

The freespin loader

Other loader systems use lookup tables and specialized code for every position modulo 5, and thus consume most of the 2kb of RAM.

That wouldn't fly for our freespin demo. We need to load those 2kb of RAM to display effects, not have all of them contain loader code.

Loader goals:

  1. Load a new part (~1.7kb of data) in a very short time, preferrably one disk rotation.
  2. Consume a very small amount of resident RAM (preferrably, less than 256b)
  3. Always take the same amount of time for loading, to keep the music in sync.
  4. Be able to load the standard GCR encoding, so we can distribute the demo using a run-of-the-mill .d64 image.

Now, the first item rules out using the floppy's ROM, which can only read data into some parts of the 2kb memory, and needs multiple disk rotations to do so. The second item rules out all existing loaders. Even the smallest need at least 512 bytes of RAM.

So we're in a pickle: We need to decode on the fly, we need to gobble everything in in one rotation, but we have a preciously little amount of memory to do so.

But, there's one thing unique to 1541tros (like freespin) that we can use: Because the parts are all really tiny (1.7kb per part) we need to store only about 80kb of data on the disk. Disks normally hold 160kb. (Another thing unique to freespin: We never actually have to transfer the data anywhere. An advantage of being right next to the metal.)

2:1 Encoding

As explained earlier, normal GCR encoding stores data as 5:4. Four data bits get turned into five bits on disk. What if we relax that ratio? Say, turn only one data bit into two bits on disk, for a 2:1 ratio?

As mentioned, we want to keep the default encoding, so we have to work with what GCR gives us. There are 96 possible combinations for the first byte of a GCR sequence:

01010-001 01010-010 01010-011 01010-101 01010-110 01010-111
01011-001 01011-010 01011-011 01011-101 01011-110 01011-111
10010-001 10010-010 10010-011 10010-101 10010-110 10010-111
10011-001 10011-010 10011-011 10011-101 10011-110 10011-111
01110-001 01110-010 01110-011 01110-101 01110-110 01110-111
01111-001 01111-010 01111-011 01111-101 01111-110 01111-111
10110-001 10110-010 10110-011 10110-101 10110-110 10110-111
10111-001 10111-010 10111-011 10111-101 10111-110 10111-111
01001-001 01001-010 01001-011 01001-101 01001-110 01001-111
11001-001 11001-010 11001-011 11001-101 11001-110 11001-111
11010-001 11010-010 11010-011 11010-101 11010-110 11010-111
11011-001 11011-010 11011-011 11011-101 11011-110 11011-111
01101-001 01101-010 01101-011 01101-101 01101-110 01101-111
11101-001 11101-010 11101-011 11101-101 11101-110 11101-111
11110-001 11110-010 11110-011 11110-101 11110-110 11110-111
10101-001 10101-010 10101-011 10101-101 10101-110 10101-111

For the second byte, we have 96 combinations, as well, but only 32 of those can be selected since one of the GCR quintuples of the first byte overlaps the second byte:

For partial quintuples 01001, 11001, 01101, 11101, and 10101:
01-01010-0 01-01010-1
01-01011-0 01-01011-1
01-10010-0 01-10010-1
...
01-11101-0 01-11101-1
01-11110-0 01-11110-1
01-10101-0 01-10101-1

For partial quintuples 01010, 10010, 01110, 10110, 11010, and 11110:
10-01010-0 10-01010-1
10-01011-0 10-01011-1
10-10010-0 10-10010-1
...
10-11101-0 10-11101-1
10-11110-0 10-11110-1
10-10101-0 10-10101-1

For partial quintuples 01011, 10011, 01111, 10111, and 11011:
11-01010-0 11-01010-1
11-01011-0 11-01011-1
11-10010-0 11-10010-1
...
11-11101-0 11-11101-1
11-11110-0 11-11110-1
11-10101-0 11-10101-1

So in other words, there are 96*32 = 3072 possible combinations for the first two GCR bytes. You would expect that you can easily find some kind of mapping from that to, say, a single byte, and indeed, a simple XOR betweeen those first two bytes yields all numbers from 0 to 255:

0 = 0x55 ^ 0x55  // 01010-101 ^ 01-01010-1
1 = 0x55 ^ 0x54  // 01010-101 ^ 01-01010-0
...

Which leads us to this remarkably simple piece of 2:1 decoding logic:

decoded = gcr0 ^ gcr1

Or, in code:

clv
bvc *      ; wait for first GCR byte
lda $1c01  ; read first GCR byte
clv
bvc *      ; wait for second GCR byte
eor $1c01  ; read second GCR byte
sta decoded, x

Again, we're not actually "decoding" GCR here. What we originally encoded in GCR is different from what we decode above. But the point is that we can control what is going to get decoded. We just have to jump through some hoops when generating a d64 disk image.

Turns out that the XOR method works nicely (and is, in fact, used in one place in freespin), but the ratio of 2:1 is a bit too low, after all. One sector (of originally 256 bytes) now only holds 160 bytes. 1.7kb would encode as 14 blocks, so more than half of what a track can store. (A track can hold between 17 and 21 blocks) This means loading might take a whole disk rotation, which is ~11 frames.

However, for freespin, we'd like to load a new part in at most seven frames, since every time we load, we'll also have a gap in the music. (Keep in mind that the motor also generates the soundtrack. Most of the music runs at 107bpm, so a quarter note lasts exactly seven frames)

If we find an encoding that takes up less than half a track, we can store everything twice, thus making due with half a disk rotation (~6 frames)

3:2 Encoding

Turns out we improve the ratio, while still keeping the decoding code simple. We just need a slightly more fancy decoding formula. For example:

decoded0 = (gcr0 + C0) ^ (gcr1 + C1)
decoded1 = (gcr0 + C2) ^ (gcr2 + C3)

This decodes three GCR bytes into two data bytes, so a sector decodes into ~213 bytes.

Possible constants that seem to work are e.g. C1, C2, C3, C4 = 133, 102, 68, 80.

Here's the code:

clv
bvc *
lda $1c01
adc #C0
sta tmp1
adc #C2
sta tmp2
lda #C1
clv
bvc *
adc $1c01
eor tmp1
sta decoded, x
lda #C3
clv
bvc *
adc $1c01
eor tmp2
sta decoded+1, x

Note that this code is not an entirely faithful translation of the above formula. We're not doing any "clc"! In other words, the "adc" operations will occasionally overshoot by one, depending on whether the previous "adc" overflowed. That's fine and doesn't hurt, it just means the encoder needs to be aware of it and has to massage the data so everything comes out correctly at the end.

Actually, it turns out that leaking the carry from one adc to the next actually helps the encoder. It's easier to find an encoding for a block with the leaking than without. (Without it, we regularily saw blocks we couldn't encode, requiring us to either change the block in question, or adjust the constants. This still happens, but much less frequently now)

We used finite domain constraint solving to generate the encoded data. Encoding one data block takes about 50ms. To derive "good" constants, we had a desktop machine spend a few weeks encoding random data. The constants 133, 102, 68, 80 can encode, on average, 3521 out of 3522 blocks. There are formulas with better ratios. Read on.

A small tweak

If you're not scared to make the encoder do a bit more backtracking, then the below formula gives you a much better chance at always finding an encoding for a given block. With the right constants (try: 216, 115, 121, 139), it never failed to failed to encode any block for us. (We tried several millions)

clv
bvc *
lda $1c00
adc #C0
adc tmp1
sta tmp1
adc #C1
sta tmp2
// Omit clv/bvc *. The adcs might have cleared the V flag.
lda #C2
adc $1c00
eor tmp1
sta decoded, x
clv
bvc *
lda $1c00
adc #C3
eor tmp2
sta decoded+1, x

Note that this also makes tmp1 contain a sum of all third encoded bytes at the end (plus a contant, plus a value that depends on how often the "adc #C3" overflowed). This can be used as the first step towards a checksum.

The freespin loader

In the freespin demo, all of this, together with timekeeping logic, is distilled into 254 bytes of loading code, which is resident at $0702.

The loader is called 17 times during the demo. Whenever it is called, the screen goes blank for exactly 7 frames (a quarter of a music beat) while the memory from $0010 - $0702 is refreshed, after which the new part starts at the start address that got loaded into $0700/$0701.

(We initially had discardable parts of the loader tail code in the stack page ($0100-$01ff), since most parts overwrite that memory. That allowed us to free up some of $0700-$07ff. But in the end, it proved more useful and tidy to have it all in one place.)

We also always load almost the entire zero page, since it usually contains code. With both zero page and stack coming from the loader, parts have to do no initialization at all - they can start rendering to the screen as soon as they're called.

Loader source code

Is any of this stuff useful for C64 demos?

The only remaining question is whether this is useful for anything beyond the special case of doing a drive-only demo. Could traditional demo fastloaders profit from this, since they don't need the floppy RAM for anything except for loading?

One thing that comes to mind is preloading. Sometimes, a part in a demo is just not ready yet to receive new data, but data is flying by on the disk. With more floppy RAM available, you can store seven, maybe even eight, sectors in advance, then burst-transfer them when the time is right.

Storing slightly fewer bytes per sector also has interesting implications for interleaving: Serial transfer is just slightly too slow to send 256 bytes in the time an "odd" block is flying by, but you can transfer 213 bytes.

Finally, maybe you do need the floppy RAM for something else, even in a demo. It's not unheard of to use the floppy drive as a coprocessor. :)