Hi everyone! I'm an Italian programmer, very passionate about computer architecture. Some time ago I realized that some architectures, potentially much superior to the classic von Neumann design, have been forgotten because there were no reliable technical solutions to implement them.
One of these architectures particularly struck me: the "dataflow architecture", in which there is no program counter and instructions are executed truly concurrently, as soon as their operands are available.
In trying to find a technical solution that implements this type of architecture effectively (spoiler: yes, maybe I succeeded, but I still have to test that it really works correctly!) I realized that this solution works more than well with three logical states. At that point I discovered that it is perhaps better to implement a CPU with three states (ternary) instead of a binary one. This is a first step towards realizing my dataflow solution.
Obviously a world has opened up to me regarding ternary CPUs, I have discovered not only that they are the best solution implemented for a computing device, but also that an increasing number of university and researcher papers are dealing with them in recent times.
This is also thanks to the enormous potential compared to normal CPUs; lower circuit complexity (and also lower consumption and lower heat production) but at the same time the truly formidable information representation capabilities compared to binary counterparts!
So here I am creating this ternary CPU and all the hardware and software needed to use it right away. At the link you will find some more details, but for questions, suggestions or anything else you can of course leave a comment here!
Very cool to see that you've achieved a hardware prototype. Looks great.
If you're more interested in exploring your ideas about ternary dataflow computing, is there a reason you didn't start with an FPGA instead? The gap between idea and prototype is much smaller and there's even dataflow DSLs like Tapa if you don't want to write the HDL:
Thanks!
I would've expected some information on the architecture, maybe even a simulator. That should surely already have existed before any working hardware?
Let's try to make sense of that code shown:
  jsr println
  anyi r3 r0 #1324
  jsr println
  anyi r3 r0 #1412
  jsr println
  hlt

r0 appears to be the zero register, "anyi" combines that with an immediate to load a pointer to the string to print into r3. The immediate field is 12 trits, so we have a maximum address range of 531441 (trytes?)...
The encoding for the "jsr println" is always the same, so it's using absolute addresses, range is 3^20 or slightly less than 3.5G. Or would be if we were allowing negative addresses, since it appears to be balanced ternary (+--+00 = 144 decimal)
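The numbers quoted above can be checked with a short Python sketch. The helper names `balanced_ternary_to_int` and `balanced_range` are made up for this illustration and are not part of the project's tooling.

```python
def balanced_ternary_to_int(s: str) -> int:
    """Convert a balanced-ternary string ('+', '0', '-'), most significant trit first."""
    digit = {"+": 1, "0": 0, "-": -1}
    value = 0
    for ch in s:
        value = value * 3 + digit[ch]
    return value


def balanced_range(trits: int) -> tuple[int, int]:
    """Smallest and largest value representable in `trits` balanced-ternary digits."""
    half = (3 ** trits - 1) // 2
    return -half, half


print(balanced_ternary_to_int("+--+00"))  # 144
print(3 ** 12)                            # 531441
print(3 ** 20)                            # 3486784401
print(balanced_range(12))                 # (-265720, 265720)
```

With balanced ternary, a 12-trit field covers 531441 distinct values, but they are split symmetrically around zero rather than starting at 0.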
Next bit of code is println, the address matches the one from jsr.
  println:
    ld r2 0(r3)             ;get char
    jeq r2 r0 exit_println  ;exit if nul
    out 50(r0) r2           ;output to console0?
    out 60(r0) r2           ;output to console1?
    addi r3 r3 #4           ;next char - why add 4?
    jmp println             ;loop
  exit_println:
    anyi r4 r0 #10          ;ASCII LF
    out 60(r0) r4           ;output to console1?
    anyi r4 r0 #13          ;ASCII CR
    out 60(r0) r4           ;output to console1?
    jr r26                  ;return (r26 = link register set by jsr)

I'm not sure if this is supposed to be a word-addressed machine. Or would be better if it actually was. The addressing unit appears to be 6-trit "trytes", but loads only work when aligned to 4? Which seems a bit (no pun intended) inconvenient for a ternary computer! Is there no "load tryte" instruction?
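Read literally, the println listing behaves roughly like the Python model below. This is only a sketch of the reading given in this comment: it assumes `ld` fetches one character per 4-tryte word, models memory as a dict and the two I/O ports (50 and 60) as plain lists, and none of the names come from the actual toolchain.

```python
def println(memory: dict[int, int], addr: int,
            console0: list[int], console1: list[int]) -> None:
    """Model of the println routine: one character stored per 4-tryte word."""
    while True:
        ch = memory.get(addr, 0)   # ld r2 0(r3)
        if ch == 0:                # jeq r2 r0 exit_println
            break
        console0.append(ch)        # out 50(r0) r2
        console1.append(ch)        # out 60(r0) r2
        addr += 4                  # addi r3 r3 #4
    console1.append(10)            # anyi r4 r0 #10 / out 60(r0) r4  (LF)
    console1.append(13)            # anyi r4 r0 #13 / out 60(r0) r4  (CR)


# Hypothetical string "Hi" placed at address 1324, one character per word.
base = 1324
memory = {base + 4 * i: ord(c) for i, c in enumerate("Hi")}
c0: list[int] = []
c1: list[int] = []
println(memory, base, c0, c1)
print(c0)  # [72, 105]
print(c1)  # [72, 105, 10, 13]
```

Note that in this reading the line terminator only goes to the second port, and it is sent as LF followed by CR.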
Jump instructions are encoded relative to the next instruction, with a 12-trit immediate. Seems fine, but a bit inconsistent with how jsr uses absolute addresses. Also again every instruction is 4 trytes, so we could drop the low bits and shift... ah, I'm forgetting that it's ternary, never mind.
Separate I/O space, seems a bit un-RISCy. I personally happen to like x86, but even there it's considered a legacy feature nowadays.
---
This may sound harsh, but to me the architecture looks both fairly conventional, and at the same time like a bit of a mess, which would be fine for a hobby project, but not for something with a website making grandiose claims and clearly intending to be a commercial product.
That code in the figure is an old code, honestly it's an old screenshot that I don't even remember what it corresponds to, but your analysis is basically correct, even though I used absolute addresses before while now everything is relative to the current address. I'll try to answer a bit of everything:
- Yes, R0 is always set to 0.
- ANYI is a ternary function (ANY) that uses an immediate (ANYI). It does nothing but load the immediate value into the register given as the first operand.
- Yes, it is balanced ternary, and negative addresses are of course interpreted correctly.
>out 50(r0) r2 ;output to console0?
>out 60(r0) r2 ;output to console1?
- Exactly. 50 and 60 are the addresses of the two serial ports on the motherboard.
- R26 is the return register set by JSR
- Addressing is at the tryte level and an instruction is always 4 trytes long. You are right that it seems a bit strange, but it is a compromise that I had to accept and that I think was the best solution to implement. (I am happy to see that your doubt was also our strong doubt during the design phase; it means we are considering the same problems...)
- The size of the data to operate on is given as a suffix on the instruction (as on the Motorola 68000), so there is LD.T (load tryte), LD.S (load short) and LD.W (load a whole word of 24 trits). There are also instructions that load/store by addressing memory as if it were a stack, for example PUSH.W (R20),R4
>Also again every instruction is 4 trytes, so we could drop the low bits and shift... ah, I'm forgetting that it's ternary, never mind.
You think I didn't think about it? ;)
- The IO space is separate, in the end it cost me almost nothing and I did it.
The RISC features that I have kept are essentially 2:
- Constant instruction length
- Memory access only with Load/Store
In fact, one of the RISC things I could drop in a future version is precisely the restriction of memory access to LD/ST. It is true that this is a defining feature of RISC, but it can also hurt code density. That is something I will look at on the 48-trit architecture.
>This may sound harsh, but to me the architecture looks both fairly conventional, and at the same time like a bit of a mess, which would be fine for a hobby project, but not for something with a website making grandiose claims and clearly intending to be a commercial product.
-That it is "conventional" ok, (even if none of the conventional architectures are ternary, but whatever...) that it is messy, what do you mean? The website says what the project actually is. I'm glad they are grandiose things, but that's it. And yes, we want to become a commercial product soon. If you have suggestions or other specific questions, I'm here to answer you ;)
If there's a LD.T instruction, it would've been the obvious choice to use that to load a character, instead of wasting an entire word on each. But maybe it didn't exist yet when that code was written.
The console I/O also seems to be dependent on external hardware to delay until the last character has been actually sent, or maybe your prototype was clocked slow enough that this wasn't necessary?
So, if the external data bus is always 24 trits, however addressing is in units of 6 trits, how does the hardware (CPU->memory interface) handle that? Is the address bus using binary instead of ternary?
What I meant by "conventional" is that other than using base 3, it isn't that different from other computer architectures, certainly not the idea you describe in the OP (that must have come later, since it isn't mentioned at all on your website). Some ternary operations might be useful for neural nets, but that wouldn't be an application for a general-purpose processor like this.
And some of the "messiness" might have been fixed, but again, there is absolutely no specification of what this architecture is supposed to be, so I was going from the one snippet of code that you have made public. And maybe you're still making changes, or do you intend to market a commercial microprocessor with absolutely no published specification?
Yes, of course to load a character I use LD.T
Hmm... actually the OUT instruction completes right after its normal cycle. The character is sent to the mainboard, which reads it, puts it in a buffer and sends it on from there. (It's more complex than that; we can discuss it if you want.)
The external bus is 24 trits. The motherboard converts it by encoding it into binary, and the data is stored in 48-bit RAM. We don't have native ternary RAM yet, unfortunately, but it was the only way to have some memory and use the processor.
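A 24-trit word stored in 48 bits works out to 2 bits per trit. Below is a minimal sketch of such an encoding. The specific bit patterns (`00` for 0, `01` for +, `10` for -) are an assumption made for illustration; the thread does not say which assignment the mainboard actually uses.

```python
# Hypothetical 2-bit code per trit; the real board's assignment is not stated.
TRIT_TO_BITS = {-1: 0b10, 0: 0b00, 1: 0b01}
BITS_TO_TRIT = {bits: trit for trit, bits in TRIT_TO_BITS.items()}


def encode_word(trits: list[int]) -> int:
    """Pack 24 trits (most significant first) into a 48-bit integer."""
    if len(trits) != 24:
        raise ValueError("expected 24 trits")
    word = 0
    for t in trits:
        word = (word << 2) | TRIT_TO_BITS[t]
    return word


def decode_word(word: int) -> list[int]:
    """Unpack a 48-bit integer back into 24 trits; the pattern 0b11 is invalid."""
    return [BITS_TO_TRIT[(word >> shift) & 0b11] for shift in range(46, -2, -2)]


trits = [1, -1, 0] * 8
word = encode_word(trits)
print(word < 2 ** 48)              # True
print(decode_word(word) == trits)  # True
print((3 ** 24 - 1).bit_length())  # 39
```

The last line shows the trade-off: a dense binary encoding of 24 trits would need only 39 bits, while the 2-bits-per-trit scheme spends 48 in exchange for trivial per-trit conversion.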
I don't quite understand what idea I describe in the OP you're referring to. And I don't understand what ternary operations might be useful for neural networks (we've had contact with Korean researchers asking about this processor for neural networks).
The specifications obviously exist, but I don't think it's a good idea to put them here on a generic site! I think the marketing phase will take a while longer and as you may have read on the site, we're still fixing things here and there and we need a marketing apparatus.
The comment/question about OUT was more speculation on how it worked on the prototype. But maybe this buffer is part of the architecture, in which case you'd have to consider what happens if it gets completely filled (could happen easily with just a few MHz CPU, at standard serial port speeds).
Re. memory: of course with no ternary RAM, you have to somehow encode the trits into bits. But I was more curious about the address bus. So if your memory is organized like this:
  ---  word #0
  --0
  --+
  -0-
  -00  word #1
  -0+
  -+-
  -+0
  -++  word #2

how does that map to binary addresses for the RAM chip(s)? What if you had ternary RAM, would that change how you address it? Does address --0 map to the second lowest/highest (depending on endianness) tryte in word #0, or something entirely different?
Maybe you thought about this a lot more than I have and figured out an ingenious solution. But I would have either gone with word addresses, or power-of-3 units.
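One straightforward way to read the layout above is as a linear mapping: shift the balanced-ternary address so the lowest one becomes 0, then split by 4 trytes per word. The sketch below assumes exactly that; `map_address` is a hypothetical name and nothing here is confirmed as what the mainboard does.

```python
def bt_to_int(s: str) -> int:
    """Balanced-ternary string ('+', '0', '-') to int, most significant trit first."""
    digit = {"+": 1, "0": 0, "-": -1}
    value = 0
    for ch in s:
        value = value * 3 + digit[ch]
    return value


def map_address(addr: str, trytes_per_word: int = 4) -> tuple[int, int]:
    """Return (word index, tryte offset within the word) for a tryte address."""
    half = (3 ** len(addr) - 1) // 2   # the lowest address is -half
    linear = bt_to_int(addr) + half    # 0-based tryte index
    return divmod(linear, trytes_per_word)


for a in ["---", "--0", "--+", "-0-", "-00", "-++"]:
    print(a, map_address(a))
# --- (0, 0)
# --0 (0, 1)
# --+ (0, 2)
# -0- (0, 3)
# -00 (1, 0)
# -++ (2, 0)
```

The division by 4 is the awkward part: it is a simple bit shift on a binary address, but it does not correspond to dropping trits from a ternary one, which is why word addresses or power-of-3 units would make the hardware mapping simpler.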
And how all of this ties into non-Von Neumann architectures, I still have no clue.
Yes, the serial buffer is included in the motherboard. The problem is not so much for OUT but for IN, if the buffer fills up.
As for the bus addresses, the program counter starts from the lowest address. The mainboard obviously knows how much RAM is present and performs a translation to the first available low address. Nothing transcendental.
I did not understand the connection with von Neumann architectures: this is a classic von Neumann architecture, only that the information is in base 3...
Anyway you seem to be really very expert, certainly more than me (I am mainly a programmer and I had to learn these things at low level practically by myself), you could actively help the project...
In your OP, you wrote that this ternary computer is really a "first step" towards a completely different dataflow-based architecture. The main page of your website proclaims some sort of revolutionary "Third Millenium Computing", with applications to AI and "research on algorithms", whatever that is supposed to mean.
>The mainboard obviously knows how much RAM is present and performs a translation to the first available low address.
How exactly is this done in hardware? I can't figure it out, so you must be the expert on that. Unless it's like a separate microcontroller doing div/mod in a loop to convert between the bases for every memory access, it couldn't be that, right? Right?
Ok, maybe that's not clear, probably my not-so-perfect English is a problem. The dataflow architecture was the initial idea. What I'm talking about now is the ternary processor, a normal von Neumann processor with ternary data and ternary arithmetic. This may eventually form the basis for the dataflow processor, but right now I'm talking about and building the ternary processor. (Of course dataflow architectures are quite different and unconventional, but that's in the future, I'm not talking about it now.)
As for address management, as I said the mainboard does it all, but I didn't care to go into that much detail; it's all a simple VHDL function in an FPGA. It already arrives at the FPGA as "ternary encoded binary" from external circuits.
While not quite as terrible as my first idea, I suspect this simple function expands into much more logic than is reasonable, both in terms of size and propagation delay - the rest of the system is just small and low speed enough that you don't notice it.
The way I see it, cool hobby project, except you've already created a website promising next generation AI supercomputer chips, and then basically admit that you don't even know what goes on at the level of logic gates. And seem to avoid giving any technical details at all.
Designing a high-performance CPU is difficult enough to do in conventional binary logic, and is generally done by teams of people who know much more than you or I about all sorts of details on how to pipeline instruction execution efficiently, with branch prediction and speculative execution etc., and also the constraints imposed by manufacturing processes and physics itself.
You can't just assume someone can magically turn your ideas into such a CPU. And if they could, they could probably do it without you and whatever intellectual property you seem to be wanting to keep secret. Also, ternary being considered more efficient in some mathematical way doesn't necessarily mean an actual hardware implementation will be similarly efficient.
OK, now I can say for sure: you didn't read the site well at all. You didn't even read my answers well. Otherwise you would have understood that MOSFET-level design is already happening, but it only concerns the CPU (if you want to deal with the mainboard, fine, you will have read that it is open hardware). I don't understand what you need such low-level technical details for; it is obvious that I am not here to share them. If you want to make a CPU, do it too, and in the end we will see who did it better. Now excuse me, I have some serious work to do.
Maybe it's a problem with my reading comprehension (or my web browser), but there seems to be no information about this open hardware (or really anything technical, other than the CPU's bus width and register count) on your site or forum?
Super cool.
With the recent paper that apparently shows that ternary data used for AI can be faster and more energy efficient, I am intrigued that someone is experimenting with ternary in hardware.
I am a layman — can a conventional computer architecture be taken and ternary data and instructions grafted on to it? Or does the whole architecture need to be rebuilt from the ground up?
I guess I don't see, for example, any advantage to ternary addressing, only ternary data and operations.
Thanks! Ternary data is definitely more efficient than binary, regardless of your use, not just for AI. I honestly don't know if it's faster, can you give me information on where you read that?
I imagine it's possible even if quite difficult to merge the two architectures, but maybe it's really easier to start everything from scratch as far as hardware is concerned.
What do you mean by ternary addressing? The address bus is also in ternary so even in this case fewer wires and more addresses available!
Probably this one: https://arxiv.org/abs/2407.12327
But I suppose searching on "ternary LLM" would find others.
Wondering how negative voltage is routed around in hardware. Does it complicate things more or less than the increased data/addressing efficiency?
I think it is a positive voltage which is interpreted as a negative number.
This is an interesting project. Two questions: one, are you selling something? Can I buy a dev board? Can I look at your operating system? Or is this all closed source? If it's closed that's OK just wondering.
Second question, what do the logic gates look like in a ternary system? Is there a list of them somewhere?
Thanks! We sold about 30 CPUs of the old version, then we had to stop because the main component was not available for years, right after the pandemic. Now we use an alternative component that is also less expensive and we have added several new features. The current development board is not fully functional, we are working on ETH, RTC and also fixing a few other things. We expect it to be fully functional and available by April 2025. The OS is actually just very simple; it boots an image from SD, sets up the interrupt table, initializes a list of free memory and that's it for now. This part will be OpenSource so that we can take inspiration to make something better. The development board (mainboard) will also be open hardware and should be listed on GitHub. The macro assembler is freeware, but only because (currently) the code is really a bit messy!
For ternary logic functions you can refer to this site that inspired us: https://homepage.cs.uiowa.edu/~jones/ternary/
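As a rough illustration of what ternary logic functions look like, here is a sketch of three commonly described balanced-ternary operations: negation, minimum (the analogue of AND) and maximum (the analogue of OR). Which functions this CPU actually implements is not stated in the thread, so the function names below are illustrative only.

```python
TRITS = (-1, 0, 1)
SYMBOL = {-1: "-", 0: "0", 1: "+"}


def t_neg(a: int) -> int:
    """Ternary negation: swaps + and -, leaves 0 unchanged."""
    return -a


def t_min(a: int, b: int) -> int:
    """Ternary minimum, the usual generalisation of AND."""
    return min(a, b)


def t_max(a: int, b: int) -> int:
    """Ternary maximum, the usual generalisation of OR."""
    return max(a, b)


# Print the 3x3 truth table of t_min, one row per value of a.
for a in TRITS:
    print(SYMBOL[a], " ".join(SYMBOL[t_min(a, b)] for b in TRITS))
# - - - -
# 0 - 0 0
# + - 0 +

print(3 ** 3)  # 27 possible one-input functions
print(3 ** 9)  # 19683 possible two-input functions
```

For comparison, binary logic has only 4 one-input and 16 two-input functions, which is part of why lists of ternary gates are much longer.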
Hey! You can run native Malbolge programs: https://en.wikipedia.org/wiki/Malbolge !
How did you manufacture your integrated circuits? I am very curious to learn about that process since it seems like it must be quite custom, or is there a neat way to reuse existing circuits?
We do not currently have any integrated circuits; all the circuits (and PCBs) are entirely designed by me. What I said is that we are using a manufacturing process to do the VLSI layout that will integrate everything into a single component, but that has not happened yet.
So, forgive me if I am wrong, but from my primitive YouTube "research", people said that the beauty of binary is that it's relatively easy to see whether or not there is voltage, even with fluctuations of power. How does your processor compensate for that, meaning how does it measure out 3 different values? I could be wrong, and am willing to learn. But I enjoy your idea, and I am willing to help you, especially when it comes to software.
Just as simple: it's easy to have positive voltage, negative voltage or no voltage ;)
Thanks for the help, maybe we need someone to write the basic software right now!
Oh I see how it would work thank you. And ofc I am willing to work, just need to know the ISA and how you want it, maybe even a primitive emulator.
Do you use the idea from the previous post of storing 5 ternary digits in one eight-bit byte?
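The packing idea works because 3^5 = 243 fits within the 256 values of a byte. A minimal sketch, assuming balanced trits are shifted to digits 0..2 before packing; the names `pack5` and `unpack5` are illustrative and not taken from the linked post or from this project.

```python
def pack5(trits: list[int]) -> int:
    """Pack 5 balanced trits (-1, 0, 1), most significant first, into 0..242."""
    if len(trits) != 5 or any(t not in (-1, 0, 1) for t in trits):
        raise ValueError("expected 5 trits in {-1, 0, 1}")
    value = 0
    for t in trits:
        value = value * 3 + (t + 1)   # shift each trit to a digit 0..2
    return value


def unpack5(byte: int) -> list[int]:
    """Inverse of pack5; byte values 243..255 are unused and rejected."""
    if not 0 <= byte < 243:
        raise ValueError("not a valid packed value")
    digits = []
    for _ in range(5):
        byte, d = divmod(byte, 3)
        digits.append(d - 1)
    return digits[::-1]


print(pack5([1, 1, 1, 1, 1]))  # 242
print(pack5([0, 0, 0, 0, 0]))  # 121
print(unpack5(242))            # [1, 1, 1, 1, 1]
```

This uses about 99% of the byte's capacity, compared with 2 bits per trit, where 5 trits would take 10 bits.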
Discussed a few days ago here: https://news.ycombinator.com/item?id=42329307
Didn't the Russians develop a ternary computer system during the cold war?
Setun (1958) and Setun-70 (1970)
Dataflow processors have not been forgotten - there is one inside every high end CPU core. The code in main memory in the form of conventional x86 or ARM machine language gets translated first into "micro ops" in the decoder and then into dataflow machine language in the register rename + reservation stations + completion buffers (collectively implementing "out of order" execution).
But, as the old saying goes, "out of sight is out of mind".
Oh, I don't think at all that this method of operation (Tomasulo?) is remotely close to a real dataflow in terms of efficiency and complexity. But as I wrote, dataflow is another story ;)
I'm curious what the assembly looks like. Is it just normal binary Von Neumann running on your ternary processor, or is it something very different?
Either way, great work. This is very fascinating.
Thanks! Of course it can't be binary; it reads purely ternary electrical signals. You can see an example of the assembly in one of the figures on the web page (even if it's an old listing).
Expect all the normal operations a binary computer/processor does (load, store, jumps, subroutines, etc.) with the addition of ternary logic functions. It's clear that the whole thing is encoded in balanced ternary :)
Beautiful! Is the concept of balanced ternary semiconductor gates elaborated somewhere?
I have updated the site with a references section. Here you will find some links to recent studies on native ternary semiconductors.
This is something I'm convinced we need for integer weights in LLMs.
Interesting! I am not familiar with how integer weights work for LLMs, can you give me more details?
This is what is being referenced: https://arxiv.org/html/2410.16144v2
Thanks!
The use case for ternary logic I’m familiar with is TCAMs in networking gear.
https://en.wikipedia.org/wiki/Content-addressable_memory#Ter...