The source code can be found on GitHub.
In my previous post on neural networks we went through the basic concepts and simple mathematics involved in training and running a neural network. It's so simple that it gave me an idea: could I run this on an 8-bit micro? Namely, a BBC Micro.
I don't think it's really foreshadowing to say that yes, you can run a neural network on a BBC Micro, and behold, below:
BBC Micro emulator powered by jsbeeb
Unsurprisingly it's rather slow, and this is the "fast" 6502 version running at 4x speed - so imagine the BBC Basic version running at 1x speed! If you refresh the page, the network is seeded with random weights and biases, so you can watch different convergences.
So, how does it work? In exactly the same way.
The mathematics and setup are literally what we saw before, just converted into 6502. I actually did this in two stages. Stage 1 was converting it to run in BBC Basic. I didn't really worry about performance, but it was torturously slow, so I made the obvious optimisation: a lookup table for the sigmoid function. That made it faster, but I knew going in I'd have to use 6502.
And so stage 2 was to take this and port it over to 6502. If you want to take a look at the source then the key routines are:
| Routine | Description |
|---|---|
| neg16 | Two's complement negation of a 16-bit value at a zero-page address. EORs both bytes with 0xFF and adds 1. |
| mulq | Q4.11 fixed-point multiply. Extracts the result sign, makes both operands positive, runs a 16×16→32 shift-and-add loop, then shifts right by 11 to re-normalise - a few more details are below. |
| sigmd | Sigmoid activation via 256-entry LUT. Clamps out-of-range inputs, otherwise offsets and indexes into a pre-built table of Q4.11 sigmoid values. |
| fwdps | Forward pass. Computes weighted sums plus biases for all four hidden neurons through sigmoid, then the output neuron the same way. |
| bwdps | Backpropagation. Computes the output delta via the sigmoid derivative, propagates the error back through the output weights to get the hidden deltas, then updates all weights and biases using the learning rate. |
| train | Training loop over all four XOR samples. Loads each input/target pair, calls fwdps then bwdps, and stores the output predictions. |
| render | Flicker-free connection redraw. Waits for VSYNC, erases all lines in black, then redraws each connection colour-coded by weight magnitude (green/red/white). |
| plotln | Draws a single line between two screen coordinates read from the connection table via indirect indexed addressing. |
| setgcl | Sets the graphics foreground colour by writing VDU 18,0,n via OSWRCH. |
| advcp | Advances the connection table pointer by 10 bytes (one entry). |
| rstcp | Resets the connection table pointer back to the start of the table. |
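To make the train/fwdps/bwdps structure concrete, here's a sketch of the same 2-4-1 XOR network in plain floating-point Python - the names mirror the routines above, but all the fixed-point and 6502 detail is stripped out, and the weight layout and learning rate are my own choices, not taken from the source:

```python
import math
import random

random.seed(1)
sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))

# weights: input->hidden (4 neurons x 2 inputs), hidden->output (4), plus biases
wih = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(4)]
bh  = [random.uniform(-1, 1) for _ in range(4)]
who = [random.uniform(-1, 1) for _ in range(4)]
bo  = random.uniform(-1, 1)

samples = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
lr = 0.5

def fwdps(x):
    # weighted sums plus biases for the four hidden neurons, through sigmoid,
    # then the output neuron the same way
    h = [sigmoid(sum(w * xi for w, xi in zip(wih[j], x)) + bh[j])
         for j in range(4)]
    return h, sigmoid(sum(who[j] * h[j] for j in range(4)) + bo)

def bwdps(x, h, out, target):
    global bo
    d_out = (out - target) * out * (1 - out)       # output delta via sigmoid derivative
    for j in range(4):
        d_h = d_out * who[j] * h[j] * (1 - h[j])   # hidden delta (uses old weight)
        who[j] -= lr * d_out * h[j]
        for i in range(2):
            wih[j][i] -= lr * d_h * x[i]
        bh[j] -= lr * d_h
    bo -= lr * d_out

def loss():
    return sum((fwdps(x)[1] - t) ** 2 for x, t in samples)

before = loss()
for _ in range(2000):            # training loop over all four XOR samples
    for x, t in samples:
        h, out = fwdps(x)
        bwdps(x, h, out, t)
print(loss() < before)           # the loss has dropped
```

The per-sample update order matters: each hidden delta is computed from the *old* output weight before that weight is overwritten.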
When you do something like this you immediately have to consider a key question: how are you going to do floating-point maths in 8-bit registers? Fixed-point maths was common back in the day, and you have to decide how many bits to use for the integer portion and how many for the fractional portion. I went with a Q4.11 scheme: 1 sign bit, 4 integer bits and 11 fractional bits. I picked this just from my observations of the neural net running in TypeScript - it felt like the right balance of integer range and precision.
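In case the Q4.11 representation is unfamiliar, here's a sketch of it in Python (the helper names are mine, not from the 6502 source): a value is stored as a 16-bit two's complement integer equal to the value times 2^11.

```python
SCALE = 1 << 11  # 2048 fractional steps per unit in Q4.11

def q_encode(x: float) -> int:
    """Convert a float to a 16-bit two's complement Q4.11 value."""
    n = round(x * SCALE)
    # clamp to the representable range [-16.0, 16.0)
    n = max(-(1 << 15), min((1 << 15) - 1, n))
    return n & 0xFFFF

def q_decode(n: int) -> float:
    """Convert a 16-bit Q4.11 value back to a float."""
    if n & 0x8000:        # sign bit set: undo two's complement
        n -= 1 << 16
    return n / SCALE

print(q_encode(1.5))      # 3072, i.e. 0x0C00
print(q_decode(0x0C00))   # 1.5
```

The smallest representable step is 1/2048 ≈ 0.0005, which is plenty of precision for weights that mostly live between -4 and 4.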
Addition and subtraction just work; we don't need to do anything special. Multiplication is a bit more complex: if you multiply two Q4.11 numbers together you get a Q8.22 result, and you need to shift it right by 11 bits to get back to Q4.11.
The mulq routine handles this in stages. First it stashes the result sign (XOR of the two sign bits) and makes both operands positive — the 6502 has no signed multiply, so it’s easier to work in magnitudes and reapply the sign at the end. Then it does a standard 16×16→32-bit shift-and-add loop: 16 iterations, shifting the multiplier right one bit each time, conditionally adding the multiplicand into the upper half of the result.
The fixed-point correction is the clever bit. The 32-bit result sits in mr[0..3] (little-endian). The code takes bytes 1 and 2 (skipping byte 0), which is an implicit 8-bit right shift. Then three explicit LSR / ROR cascades shift right by another 3 bits. That gives 8 + 3 = 11 total, exactly the Q4.11 fractional width. The result lands in t0 ready for use.
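The same trick can be sketched in Python (variable names here are mine, not the ones in the 6502 source). Taking the middle two bytes of the little-endian 32-bit product is an implicit right shift by 8; three more single-bit shifts make 11:

```python
def mulq(a: float, b: float) -> float:
    """Q4.11 multiply, mirroring the sign/magnitude + byte-skip approach."""
    sign = (a < 0) ^ (b < 0)                 # result sign = XOR of operand signs
    ma, mb = round(abs(a) * 2048), round(abs(b) * 2048)
    prod = ma * mb                           # 16x16 -> 32-bit product (Q8.22)

    # take bytes 1 and 2 of the little-endian result: an implicit >>8 ...
    mid16 = (prod >> 8) & 0xFFFF
    # ... then shift right three more times: 8 + 3 = 11
    q = mid16 >> 3

    # sanity check: identical to a plain 11-bit shift of the product
    assert q == (prod >> 11) & 0x1FFF

    return -q / 2048 if sign else q / 2048

print(mulq(1.5, -2.0))   # -3.0
print(mulq(0.5, 0.5))    # 0.25
```

On the 6502 the byte-skip is free - you just read from a different address - so only the three explicit shift cascades cost any cycles.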
I can’t take the credit for any of that cleverness - these were techniques established by the incredible coders of the day.
The nice thing about the code is that it makes use of one of the great features of the BBC: you could mix and match BBC Basic and 6502. You could gradually convert a program over to 6502 a piece at a time and, if you wanted, leave the bits best suited to BASIC in BASIC and the bits that needed to run fast in 6502. It really was an amazing machine to code on. In this example the sigmoid lookup table is still built in BASIC - it only runs once at the start, so there's no need to break our brains converting it to 6502.
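For reference, building that 256-entry Q4.11 sigmoid table is only a few lines in any language. Here's a Python sketch - note the input range of [-8, 8) is my assumption for illustration, not taken from the BASIC source:

```python
import math

LO, HI, ENTRIES = -8.0, 8.0, 256         # assumed input range covered by the LUT
STEP = (HI - LO) / ENTRIES

# each entry is sigmoid(x) encoded in Q4.11, i.e. scaled by 2048
table = [round(1.0 / (1.0 + math.exp(-(LO + i * STEP))) * 2048)
         for i in range(ENTRIES)]

def sigmd(x: float) -> float:
    """Sigmoid via LUT: clamp, offset, index - mirroring the 6502 routine."""
    if x <= LO:
        return table[0] / 2048
    if x >= HI - STEP:
        return table[ENTRIES - 1] / 2048
    return table[int((x - LO) / STEP)] / 2048

print(sigmd(0.0))   # ~0.5
```

Since sigmoid saturates hard outside roughly ±8, the clamp at the table edges loses almost nothing.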
I have to say Claude Code was invaluable in getting this to work in 6502 without spending crazy amounts of time on it. It wasn’t perfect but by breaking the problem down into parts it was pretty fast to get going and helped me prove that yes, you can run a neural network on an 8-bit micro.
The emulation is provided by the amazing jsBeeb.