Writing Tight 6502 Series Machine Code

Last Updated on November 29, 2015 by Dave Farquhar

This article appeared in the final issue of Twin Cities 128/64, published by Parsec, Inc. of Salem, Mass., sometime after April 1994. Parsec never paid for the article, so under the terms of Parsec’s contract, all rights reverted back to me 30 days after Parsec failed to remit payment.

So now I’m re-asserting my rights to the article. You’ll find the editing poor–all my semicolons appear to have been replaced by commas, for instance–and the writing full of cliches. But I would have been 16 or 17 when I wrote it, and I don’t think it’s a bad effort for a 17-year-old. And the article had some pretty clever tricks. I have to admit I’d forgotten 90% of what was in the article, but I recognize my own writing when I see it.

I’d like to thank Mark R. Brown, former managing editor of INFO magazine, for finding the article and bringing it to my attention. And one final word: Although I wrote this with the Commodore 128 in mind, the same tricks apply to any computer or console based on a 6502 or derivative.

No language in existence can match the balance of speed and efficiency of pure, hand-written machine code. Interpreted BASIC programs are often shorter, and C programs can often come close to matching the speed of machine language (ML), but ML is many times faster than interpreted BASIC, and much more compact than C or compiled BASIC.

The “bag of tricks” possessed by the 6502 family of processors, including the 8502 found in the C-128, is somewhat limited, but its contents are often rather unique and effective. Before I start presenting tricks, let me give you a warning. During the actual coding process, it is best to ignore the majority of these tricks and program traditionally, for the purposes of debugging. Only after you are satisfied with the integrity of your code should you “open the bag,” so to speak.

This first trick is older than the 6502 family itself, but still well worth repeating. Never follow a JSR instruction with an RTS. Instead, replace the pair with a single JMP. This only saves you one byte of code and nine clock cycles, but with the limited speed and memory capacity of the 8502, it all adds up. The reasoning behind this is simple: all subroutines end in RTS anyway, so why RTS to yet another RTS instruction, when one could do the job? This will also save stack space, which may or may not matter to your program.
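As a minimal sketch of the substitution (the labels `slow`, `fast`, and `redraw` are hypothetical, chosen only for illustration), the two fragments below do the same job:

```asm
; Before: 4 bytes, and the JSR/RTS pair costs 12 cycles
slow    jsr redraw    ; 3 bytes, 6 cycles
        rts           ; 1 byte,  6 cycles

; After: 3 bytes, 3 cycles
fast    jmp redraw    ; redraw's own RTS returns to fast's caller
```

The difference is the one byte and nine cycles quoted above, plus the two bytes of stack that the JSR no longer pushes.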

Another old but often neglected trick involves compare operations (CMP, CPX, and CPY) to zero. All commands which affect the Z flag of the processor’s status register perform the function of a compare to zero already, as a byproduct. The actual compare instruction performed depends on the instruction: an INY will imply a CPY #$00, while an LDA implies a CMP #$00. Because of this, I always comment out such instructions, unless the register being compared is not the one the previous instruction touched, as in a statement such as:
LDA #$0A:CPY #$00.

Do not delete the instructions, however; leaving them in as comments makes your original intent clearer, and makes them easy to find in those rare cases when the instructions are necessary.
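A short fragment may make the distinction concrete. This is only a sketch, not a complete program, and the labels `count`, `empty`, and `yzero` are hypothetical:

```asm
        ldx count     ; LDX sets Z if the value loaded is zero
;       cpx #$00      ; redundant, kept as a comment to show intent
        beq empty

        lda #$0a      ; the flags now describe .A, not .Y
        cpy #$00      ; so this compare really is needed
        beq yzero
```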

This last trick spawns another: loops. They are very common, so consequently, they should be made as efficient as possible. Many people simply load one register with zero, perform an operation, increment the register, and compare it to the limit. But, because of the above trick, it would be shorter and faster to load the register with the limit, decrement it, and exit via the implied compare to zero.
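The two loop styles can be sketched as follows, assuming an 8-byte copy between two hypothetical buffers labelled `src` and `dst`:

```asm
; Counting up: needs an explicit compare
        ldx #$00
up      lda src,x
        sta dst,x
        inx
        cpx #$08      ; 2 bytes, 2 cycles on every pass
        bne up

; Counting down: the DEX sets the flags for the branch
        ldx #$08
down    lda src-1,x   ; .X runs 8..1, so offset the base by one
        sta dst-1,x
        dex
        bne down      ; falls through when .X reaches zero
```

The second loop walks the buffer from the end to the start, so it only suits jobs where the order of processing does not matter.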

What do you do when you have two subroutines that are identical except for the first instruction? Consider the two following examples:

prpl lda .asc "+"
     inc $d020
     inc $d021
     jmp $ffd2

prmi lda .asc "-"
     inc $d020
     inc $d021
     jmp $ffd2

Here are two ways to combine them:

prpl lda .asc "+"
     bne scr ;.A not equal to 0, so jump is
             ; unconditional
prmi lda .asc "-"
scr  inc $d020
     inc $d021
     jmp $ffd2
----
prpl lda .asc "+"
     .byt $2c
prmi lda .asc "-"
     inc $d020
     inc $d021
     jmp $ffd2

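The “.byt $2c” instruction hides the second LDA instruction from PRPL, but calls to PRMI are unaffected. $2C is the op code for BIT absolute, so when execution arrives from PRPL the processor takes the two bytes of the second LDA as the BIT operand and skips over them. To hide a one-byte instruction, such as a TYA, use “.byt $24” (BIT zero page). Counting bytes, this method is 1 byte shorter than the BNE version (14 bytes against 15) and 8 bytes shorter than the two separate routines (22 bytes). A call to PRPL runs 4 cycles slower than the separate routine and 1 cycle slower than the BNE version.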
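To see why this works, it may help to look at the bytes themselves. The annotated listing below is a sketch that assumes both loads are immediate-mode and that the code is assembled at $1300, an address picked purely for illustration:

```asm
; Hypothetical load address $1300; "+" = $2b and "-" = $2d
;
; Bytes in memory:
;   $1300  a9 2b      prpl lda #$2b
;   $1302  2c              .byt $2c
;   $1303  a9 2d      prmi lda #$2d
;   $1305  ee 20 d0        inc $d020
;
; What the processor sees when entered at prpl ($1300):
;   $1300  a9 2b           lda #$2b
;   $1302  2c a9 2d        bit $2da9   ; the second LDA is now an operand
;   $1305  ee 20 d0        inc $d020
;
; What the processor sees when entered at prmi ($1303):
;   $1303  a9 2d           lda #$2d
;   $1305  ee 20 d0        inc $d020
```

BIT leaves .A alone but does change the N, V, and Z flags, and it performs a real read of the address formed by the hidden bytes. Neither matters here, but it is worth checking when the flags are needed afterward or when that address happens to be an I/O register.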

Nearly all programs need an area of memory to temporarily store their variables. Many simply use a few bytes of memory immediately following their code. However, by using zero-page locations instead (check a memory map; there is a lot of free space peppered around that range, including $fb-$fe), you can save at least one byte and one clock cycle EVERY time the variable is accessed. This can be a real boon to speed-intensive applications such as graphics or high-speed I/O routines.
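The saving per access looks like this. The names `temp` and `var` are hypothetical, and $fb is assumed to be free, which should be confirmed against a memory map for the machine in question:

```asm
temp    = $fb         ; assumed free; confirm against a memory map

        lda temp      ; zero page: 2 bytes, 3 cycles
        sta temp      ; zero page: 2 bytes, 3 cycles

        lda var       ; absolute:  3 bytes, 4 cycles
        sta var       ; absolute:  3 bytes, 4 cycles
        rts

var     .byt 0        ; variable stored right after the code
```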

The 65C02 processor has a BRanch Always (BRA) op code: a relative branching operation similar to JMP, with a limited range, but relocatable and requiring one byte less. It is not present in the 8502, but it can be simulated with the branching op codes it does have. For instance, after an LDA/STA pair or equivalent sequence (STA leaves the flags alone), you can safely use BNE as an unconditional jump, unless the value you loaded was 0, in which case you should use BEQ. In other cases, I like to simply try what branch operations are available, using a sequence similar to this code fragment:

  pha
  beq +
  brk
+ inc $d021
  dec $d020
  brk

If the branch did not work, the program simply exits to the ML monitor, but if it did work, I am alerted by the change of screen colors before breaking into the monitor. If none of them work, use the sequence SEC:BCS x instead. You lose some speed, and the code is no longer shorter than it would have been with JMP, but it is still relocatable.
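The options can be sketched as three separate fragments. The labels `flag` and `next` are hypothetical, and the CLV:BVC pair is an alternative not mentioned in the article, offered here only because it leaves the carry flag untouched:

```asm
; 1. "Free" branch after a load of a known nonzero value
        lda #$01      ; any nonzero value clears Z
        sta flag      ; STA does not change the flags
        bne next      ; so this branch is always taken

; 2. Forced branch from the article
        sec           ; 1 byte, 2 cycles: force carry set
        bcs next      ; 2 bytes, 3 cycles: always taken

; 3. Forced branch that preserves carry
        clv           ; 1 byte, 2 cycles: force overflow clear
        bvc next      ; 2 bytes, 3 cycles: always taken
```

Either forced pair is 3 bytes and 5 cycles, against 3 bytes and 3 cycles for JMP, so the only thing gained is relocatability.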

This article does not even scratch the surface of what the C-128 has to offer. The MMU has a lot of tricks up its sleeve, and the Z-80 CAN be activated in 128 mode, which offers some interesting possibilities; for instance, it can effortlessly copy an area of memory in only 3 instructions. Hopefully I can cover some more of these in a future article.

Dave Farquhar

David Farquhar is a computer security professional, entrepreneur, and author. He has written professionally about computers since 1991, so he was writing about retro computers when they were still new. He has been working in IT professionally since 1994 and has specialized in vulnerability management since 2013. He holds Security+ and CISSP certifications. Today he blogs five times a week, mostly about retro computers and retro gaming covering the time period from 1975 to 2000.

2 thoughts on “Writing Tight 6502 Series Machine Code”

Paul
May 24, 2011 at 2:27 pm
Quite a bit out of my element here, but I’m curious. Were you ever familiar with Allwrite, a word processor written for the TRS-80? My understanding was that it was written in machine language by the author and his son. I (we) found it a very good program at the time that took advantage of 128 K of memory.
Paul
Dave Farquhar (Post author)
May 24, 2011 at 7:22 pm
I’m not familiar with that particular program, but it sounds completely plausible. A large percentage of commercial software for the 8-bit computers was written in machine language, often by one or two people. I probably wrote this very article with a word processor called Speedscript, which was written in machine language and published as a magazine type-in. It was about 6K in length. By today’s standards it was little more than a text editor, but it packed a lot of power into 6K. Notepad is 176K, and Speedscript did a handful of things that Notepad doesn’t.
I never got to be good enough to write something as ambitious as a word processor. And unfortunately, unlike a number of the people I was learning this stuff with, I never figured out how to apply any of what I knew about 8-bit machines to programming PCs. I couldn’t wrap my mind around object-oriented programming.


The Silicon Underground
