Great project! I’m amazed by the ingenuity of running a CNN on a TI-84 Plus CE.
Given the eZ80 CPU's limitations with floating-point operations, do you think fixed-point arithmetic could significantly speed up inference without adding too much complexity?
One funny topic is "why did it take us so long to develop advanced neural networks?" Part of the current advance is about improving hardware performance, but I'd imagine that if you went back to the 1980s with the knowledge we have now, you could have gotten much more out of neural nets than we did then.
As for fixed point vs. floating point, I got into an argument with a professor in grad school about it. I'd argue that floating point often gets used out of familiarity and convenience; if you look at fields such as signal processing it is very common to do algorithms like FFT in fixed point. It seems absurd that an activation in a neural net could range from (say) 10^-6 to 10^6 and have the first digit be meaningful the whole way. If everything was properly scaled and normalized you shouldn't need to waste bits on exponents but maybe when you have a complex network with a lot of layers that normalization is easier said than done.
> why did it take us so long to develop advanced neural networks?
I have a pet theory (with some empirical support) that Marvin Minsky put somewhat of a multi-year kibosh on "perceptron"/neural-net research due to his (true, but arguably irrelevant) assertion that trained neural nets were essentially "nondeterministic black boxes" which was basically heretical.
That's not just a pet theory. I'll tell you though that
https://en.wikipedia.org/wiki/Perceptrons_(book)
is a brilliant book from the early days of computer science, one that asks questions about the fundamentals of algorithms and what you can and can't do with different methods (in the same spirit as the result that any comparison sort needs at least N log N comparisons to sort N items).
> I'd argue that floating point often gets used out of familiarity and convenience
You're probably right. I can't tell you how often I've seen floats used in places they really shouldn't have been, or where nobody bothered to even write an algorithm that got the dynamic range for _floats_ (much less fixed-points) correct.
> if you look at fields such as signal processing it is very common to do algorithms like FFT in fixed point
FFT is almost uniquely suited to a fixed-point implementation. All the multiplications are with powers of roots of unity, and a typical problem size in that domain only involves O(10k) additions, keeping the result firmly in an appropriately scaled fixed-point input's dynamic range.
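To make that concrete, here's a minimal Q15 sketch of the kind of butterfly a fixed-point FFT uses (my own illustration, not anything from the project; the types and helper names are made up). Because the twiddle factor is a unit-magnitude root of unity and each stage scales by 1/2, the values never leave the input's dynamic range:

```c
#include <stdint.h>

/* Q15 fixed point: value = raw / 32768, range [-1, 1). */
typedef struct { int16_t re, im; } cq15_t;

/* Multiply by a twiddle factor w with |w| = 1, so the product's
 * magnitude never exceeds the input's.  Assumes arithmetic right
 * shift for negative values, as on essentially every current target. */
static cq15_t cmul_q15(cq15_t a, cq15_t w) {
    cq15_t r;
    r.re = (int16_t)(((int32_t)a.re * w.re - (int32_t)a.im * w.im) >> 15);
    r.im = (int16_t)(((int32_t)a.re * w.im + (int32_t)a.im * w.re) >> 15);
    return r;
}

/* One radix-2 butterfly.  Halving at every stage keeps the running
 * sums inside Q15 no matter how many stages follow. */
static void butterfly_q15(cq15_t *x, cq15_t *y, cq15_t w) {
    cq15_t t = cmul_q15(*y, w);
    cq15_t a = *x;
    x->re = (int16_t)((a.re + t.re) >> 1);
    x->im = (int16_t)((a.im + t.im) >> 1);
    y->re = (int16_t)((a.re - t.re) >> 1);
    y->im = (int16_t)((a.im - t.im) >> 1);
}
```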
> If everything was properly scaled and normalized you shouldn't need to waste bits on exponents but maybe when you have a complex network with a lot of layers that normalization is easier said than done.
It's much harder to accomplish that even in vanilla neural networks (compared to FFT), much less if they have anything "interesting" (softmax, millions of inputs, ...) going on.
Take a look at a single matmul operation, or even just a dot product, with f32 vs i32 weights. Unless you do something in the architecture to prevent large weights from ever being lined up with large inputs (which would be an odd thing to attempt in the first place; the whole point of a dot-product in a lot of these architectures is to measure similarity between two pieces of information, and forcing inputs to be less similar to the weights in that way doesn't sound intuitively helpful), you'll eventually have to compute some product `x y` where both are roughly as large as you've allowed with whatever normalization scheme you've used up to that point. The resulting large activation will be one of the more important pieces of information from that dot product, so clipping the value is not a good option. With i32 inputs, you're left with one of:
1. Cap the inputs to 15 bits of precision (as opposed to 23 with f32)
2. Compute the outputs as i64, perhaps even larger (or still partially restricting input precision) when you factor in the additions and other things you still have to handle
3. Rework the implementation of your dot product to perhaps, somehow, avoid large multiplications (not actually possible in general though; only when the result fits in your target dynamic range)
The first option is just straight-up worse than an f32 implementation. The third isn't general-purpose enough (in that you again start having to design a network amenable to fixed points). The second might work, but to keep the bit count from exploding through the network you immediately require some form of normalization. That's probably fine if you have a layer-norm after every layer, but it still complicates the last layer, and it doesn't work at all if you have any fancy features at that particular matmul precluding layer normalization (like enforcing a left-unitary Jacobian at each composite layer (including any normalization and whatnot)).
You might, reasonably, counter that you don't need more than 15 bits of precision anyway. That's a fair counter (mostly, other than that the argument you're making is that by not wasting space on the exponent you have room for more real data), but all the common <f32 floating-point implementations have a higher precision than the required reduction in precision from an equally-sized fixed-point. Plus, you still need _some_ kind of normalization after the matmul to keep the dynamic range correct, and doing that by tweaking the model architecture precludes some features you might otherwise reasonably want to include, so you need to do something like separately keep track of a multiplicative scaling factor at each layer, which itself screws with how most activation functions behave, ....
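To make option 2 concrete, here's a rough sketch (my own illustration, nothing from the project; the names and the `shift` parameter are made up) of the widened accumulation plus the explicit requantization step for a single dot product. That `shift` is exactly the per-layer scale bookkeeping a float's exponent would otherwise handle for you:

```c
#include <stdint.h>
#include <stddef.h>

/* "Option 2" above: widen the products to 64 bits, then requantize.
 * `shift` is the per-layer scale you now have to track by hand. */
static int32_t dot_fixed(const int32_t *x, const int32_t *w,
                         size_t n, unsigned shift)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        /* A single product of two full-range int32s already needs ~62
         * bits; long sums can still overflow even a 64-bit accumulator
         * unless the inputs are partially restricted, as noted above. */
        acc += (int64_t)x[i] * (int64_t)w[i];
    }
    /* Requantize (assumes arithmetic right shift for negative values).
     * Choose `shift` too small and you clip exactly the large,
     * informative activations; too large and you discard precision. */
    acc >>= shift;
    if (acc > INT32_MAX) acc = INT32_MAX;
    if (acc < INT32_MIN) acc = INT32_MIN;
    return (int32_t)acc;
}
```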
Plus, at a hardware level, almost nobody provides operations like fmadd for integers/fixed-point yet. As a collective action problem, you're starting off at least twice as slow if you try to popularize fixed-point networks.
Not to mention, when you use fixed-points you generally start having to build larger computational blocks rather than working in smaller, re-usable units. As a simple example, the order of operations when computing a linear projection almost doesn't matter for floats, but for fixed-points you have all the same precision problems I mentioned for multiplications earlier, with the added anti-benefit of the internal division actively eating your precision. You can't have a component that just computes similarity and then applies that similarity without also doubling the bits of the return type and multiplying the result by something like (1<<32) if the divisor is large and some other hackery if it's small. Since that component is unwieldy to write and use, you start having to code in larger units like computing the entire linear projection.
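As a tiny illustration of that division problem (again my own sketch, not the project's code, using a hypothetical Q16.16 format): to divide one fixed-point value by another without discarding the fractional bits of the quotient, you have to pre-scale the numerator into a wider type, which is the same flavor of hackery described above:

```c
#include <stdint.h>

/* Divide two Q16.16 fixed-point values (result also Q16.16).
 * Without the pre-scale, the integer division would throw away all
 * 16 fractional bits of the quotient; with it, the numerator needs
 * up to ~48 significant bits, so the intermediate must be 64-bit.
 * Caller must ensure den != 0. */
static int32_t div_q16(int32_t num, int32_t den) {
    /* Multiply instead of shifting so negative num stays well-defined. */
    int64_t wide = (int64_t)num * 65536;
    return (int32_t)(wide / den);  /* can still overflow if the real quotient >= 32768 */
}
```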
I've probably gone on more than long enough. Floats are a nice middle ground when you're doing a mix of nearly unconstrained additions and multiplications, and they allow you to focus a bit more on the actual ML. For a fixed-point network to have advantages, I think it'd need to be built specifically to take advantage of fixed-point's benefits, especially since so many aspects of these networks instead explicitly play into its drawbacks.
Great!
Mostly unrelated, but I remember typing in a simple calculator program (hex codes I think) from a magazine for my Apple IIe that would then speak the answer aloud. I was fascinated by the fact you could type in things that then created the sound of a voice. Googling isn't yielding much but I'll dig around and see if I can find it.
Maybe you had a speech card for your IIe, like https://en.wikipedia.org/wiki/Echo_II_(expansion_card)
Not necessarily. There was a lot of experimentation at the time, and programs that did hacky things like blipping the mono speaker at different frequencies to produce interesting sounds were not out of the question.
I remember a demo at my Commodore 64 users group one day... a guy loaded a bunch of stuff from a disk and played back (with sound quality not that much worse than a car's crappy AM radio speaker) the first 10 or 15 seconds of "Kung Fu Fighting." Of course, the C64 had the SID chip, but it was still freaking impressive.
You’re right - this was used to great effect in Castle Wolfenstein, where the guards would shout “Achtung!” in German.
If you accessed (either read or write) memory address $C030 it would move the cone of the speaker in or out. Do it fast enough and you can create tones and voices.
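For anyone curious what that looks like from C, here's a rough cc65-style sketch (my own, and the delay loop is an uncalibrated count you'd tune by ear) of making a tone by repeatedly touching that soft switch:

```c
#include <stdint.h>

/* Apple II speaker soft switch: any access to $C030 clicks the cone. */
#define SPEAKER (*(volatile uint8_t *)0xC030)

/* Crude square-wave tone: click the speaker, busy-wait, repeat.
 * half_period is an arbitrary loop count controlling the pitch. */
static void tone(unsigned clicks, unsigned half_period) {
    unsigned i;
    volatile unsigned j;
    uint8_t dummy;
    for (i = 0; i < clicks; i++) {
        dummy = SPEAKER;                /* the read itself toggles the cone */
        for (j = 0; j < half_period; j++) {
            /* busy wait; tune by ear for a given pitch */
        }
    }
    (void)dummy;
}
```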
I remember having the 1541 disk drive for my C64 play tunes.
I love these 8-bit projects. They force you to focus on only the most essential elements of your program; there isn’t room for anything else.
On one hand this made everything simpler; on the other hand, you needed to have deep knowledge so you could create clever workarounds for things like not having floating-point hardware.
Sometimes it feels like a superpower having grown up with that generation of computers, but I also feel it sometimes holds me back - always that little voice in your head saying “it’s wasteful to copy that whole string when you just need to change that one character in the middle” :-)
Interesting project, but there's an obvious issue here - you could just use a small microcontroller with hardware floating-point support and get much better performance. The TI-84's eZ80 CPU is really not suited for this kind of computation.
I did a similar project on an STM32F4 and got inference times under 100ms. The hardware FPU makes a massive difference compared to software float emulation.
That said, it's cool to see what you can push these old calculators to do. Repurposing the video RAM as a secondary heap was clever.
I first started coding in high school on the TI-84 calculators. My first language was TI-BASIC, my second was Z80 assembly - quite a big step - and I quit when I faced some tricky bugs that my teenage self could not figure out :-) Back then, I don't believe they had a C/C++ toolchain. Some time later, I tried the Small Device C Compiler (SDCC), but encountered several compiler bugs, which I couldn't fix but duly reported. Great to see there is such excellent tooling nowadays.
That's wild. Since it focuses on the much narrower problem of digit recognition, I wonder now whether hardware using op-amps could be designed to tackle this smaller problem in the analog domain.
Cool project! And what a suitable domain name you got there Kerm :)
This dude has so many pixels on his calculator. Jealous!!