This seems like an inevitability. That is, eventually, "AI" will be used to create adaptive interfaces for whatever the HCI user wants: graphical, immersive, textual, voice, and so on.
I'm spoiled by the "search the menu" thing in VS Code, IntelliJ, and DevTools. I wanted that in Photopea today because I couldn't find the darn paint bucket tool. If everything had built-in AI (which seems to be happening, whether it's good or bad), I could probably just say or type "paint bucket", which I guess is only marginally better than a fuzzy search... but the iterative process of AIs, where you can kind of lazily ask for things and then correct them when they're wrong, sure is nice.
macOS has this by default in every app because it indexes all menu actions and makes them accessible via `cmd + shift + ?`.
Oh that's cool. I'm not a Mac user so I didn't know. Seems like a good thing to do at the OS level when possible!
Author seems to not engage with a core problem: humans rely on muscle memory and familiar patterns. Dynamic interfaces that change every session would force constant relearning. That's death by a thousand micro-learning curves, no matter how "optimal" each generated UI might be.
The solution is user interfaces that are stable, but infinitely customizable by the user for their personal needs and preferences, rather than being fixed until a developer updates it.
And make configs shareable across users.
If they're complicated GUIs, sure. A handful of options is probably fine. And I don't think we'd take away the ability to just type or shout at the AI if it doesn't provide the option you're looking for: "Give me a slider for font size", "give me more shades of red", "OK, let's focus on the logo. Give me some options for that".
>Acting like a computer means producing a graphical interface. In place of the charmingly teletype linear stream of text provided by ChatGPT, a model-as-computer system will generate something which resembles the interface of a modern application: buttons, sliders, tabs, images, plots, and all the rest. This addresses key limitations of the standard model-as-person chat interface:
Oh boy I can't wait for GPT Electron, so I can wait 60 seconds for the reply to come back and then another 60 seconds for it to render a sad face because I hit some guard rail.
Not forgetting the computing power required to generate that single sad face.
I appreciated the thought given in this piece. However, in the age of LLMs these "what if we look at problems this way..." pieces seem obsolete. Instead of asking the question, just use an LLM to help you build the proof of concept and see if it works.
Back in the pre-LLM days these kinds of thought pieces made sense as a call to action, because the economics of creating sophisticated proofs of concept was beyond the abilities of any one person. Now you can create implementations and iterate at nearly the speed of thought. Instead of telling people about your idea, show people your idea.
But LLMs are nowhere near being able to do what you suggest, for anything that one person wouldn't have been able to do beforehand.
If I cared enough about GUIs I could implement what the OP said in two months by myself, with unlimited access to a good coding model, something like QwQ.
The issue is training a multimodal model that can make use of said GUI.
I don't believe that there is a better general interface than text however so I won't bother.
They absolutely are. I'm somewhat non-technical but I've been using Claude to hack MVPs together for months now.
No amount of repeating is ever, unfortunately, going to get this across; LLMs are founding a new kind of alchemy.
Not in my experience. Used properly, an LLM is an immense accelerator. Every time this comes up on HN we get the same debate: one side says LLMs are time-wasting toys, the other says they are transformative. You need to know how to critically ask questions and critique answers to use a search engine effectively. The same is true for an LLM. Once you learn how to pose your questions, and at the correct level, it is a massive accelerator.
If we are never going to take the time to write, articulate, or even think about things anymore, how can we still feel like we have the authority or skills or even context to evaluate what we generate?
I'm kind of with you in that you could build something like it based on a fast LLM. But what they are actually talking about is a new cutting-edge ML model that takes a huge amount of data and compute to train.
I see your point, but that's not what I took away from the article. To me it seems like an alternate way to use existing models. In any case I think you could make a PoC that touched on the main idea using an existing model.
Yes you can, and there is at least one example: a web application where you enter a URL and the LLM generates the page on the fly, links included; you click a link, and the LLM fills that page in too. I can't remember the name of it.
But they mention things like Oasis in the article that use a specialized model to generate games frame-by-frame.
> because the economics of creating sophisticated proofs of concept was beyond the abilities of any one person
What?
Are you trying to say it’s too expensive for a single worker to make a POC, or that one person can’t make a POC?
Either way that’s not true at all…
There have been one person software shops for a long long time.
Why stop there? Let it figure out how to please us without need for sliders etc. We'll just relax. Now that's paradigm shift.
That was my thought: is model-as-computer the best we can do?
Isn't that limiting our perspective of AI models to being computers, so that whatever computers can't do, the model can't do either?
> That was my thought: is model-as-computer the best we can do?
Nah, there's a better option. Instead of a computer, we could... go for treating it as a person.
Yes, that's inverting the whole point of the article/discussion here, but think about it: the main limitation of a computer is that we have to tell it step-by-step what to do, because it can't figure out what we mean. Well, LLMs can.
Textual chat interface is annoying, particularly the way it works now, but I'd say the models are fundamentally right where they need to be - it's just that a human person doesn't use a single thin pipe of a text chat to communicate with the world; they may converse with others explicitly, but that's augmented by orders of magnitude more of contextual inputs - sights, sounds, smells, feelings, memory, all combining into higher-level memories and observations.
This is what could be the better alternative to "LLM as computer": double down on tools and automatic context management, so the user inputs are merely the small fraction of data that's provided explicitly; everything else, the model should watch on its own. Then it might just be able to reliably Do What I Mean.
So what are the other models we could use?
Perhaps "metaphor" would be better terminology than "model".
AI as an animal we are trying to tame. Why does it have to be a machine metaphor?
Perhaps AI is an ecosystem in which we all interact at the same time. The author pointed out that the one-on-one interaction is too slow for the AI; perhaps a many-to-one metaphor would be more appropriate.
I agree with the author that we are using the wrong metaphors when interacting with AI, but personally I think we should go beyond repeating the mistakes of the past by just extending our current state, i.e. going from a physical desktop to a virtual "desktop".
How about PowerPoint as a metaphor? The challenge we face is how to explain something complex. But don't we also get into the issue that the medium is the message? Just by using voice rather than an image, don't we change the meaning? And is that necessarily bad?
> And is that necessarily bad?
Selecting a metaphor implies that one's imagination is - at least partially - constrained by the metaphor. AI as a PowerPoint would make using AI for anything other than presentations seem unusual, since that's what PowerPoint is used for.
Also, when the original author says "models as computers", what does "computer" represent? A mainframe the size of a small apartment, a smartphone, a laptop, a Turing machine, or some collection of server racks? Even the term "computer" is broad enough to include many forms of interaction: I interact with my smartphone visually but with my server rack textually, yet both are computers.
At least initially, AI seems to be something completely different, almost god-like in its ability to provide us with insightful answers and creative suggestions. God-like meaning that, judged from the outside, AI can provide comforting support in times of need, which is one characteristic of a god-like entity.
PowerPoint wasn't built to be a god-like provider of answers to the most important questions. It would indeed be surprising if a PowerPoint presentation made the same impact as religious scriptures on thousands or millions of people (not referring to individual experiences).
The only computations that an LLM does are backprops and forward passes. It can not run any arbitrary program description. Yes, it will hallucinate your program's output if you feed it some good enough starting prompt. But that's it.
An LLM with chain of thought and unbounded compute/context can run any program in PTIME: https://arxiv.org/abs/2310.07923 , which is a huge class of programs.
Note that this is an expressibility (upper) bound on transformers granted intermediate decoding steps. It says nothing about their learnability, and modern LLMs are not near that level of expressive capacity.
The authors also introduce projected pre-norm and layer-norm hash to facilitate their proofs, another sense in which it is an upper-bound on the current approach to AI, since these concepts are not standard. Nonetheless, the paper shows how allowing a number of intermediate decoding steps polynomial in input size is already enough to run most programs of interest (which are in P).
There are additional issues. This work relies on the concept of saturated attention; however, as context length grows in real-world transformers, self-attention deviates from this model as it becomes noisier, with unimportant indices getting undue focus (IIUC, due to precision issues and how softmax assigns non-zero probability to every token). Finally, it's worth noting that the more under-specified your problem is, and the more complex the problem representation, the more quickly the induced probabilistic inference problem becomes intractable. Unless you're explicitly (and wastefully) programming a simulated Turing machine through the LLM, this will be far from real-time interactive. Users should expect a Prolog-like experience of spending most of their time working out how to help the search.
Trivia: softmax also introduces another problem. Because attention must always assign importance to some tokens, focus often gets dumped on semantically unimportant tokens like whitespace. That overemphasis can induce spurious correlations on whitespace, which then propagate through the network with possibly unexpected negative downstream effects.
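For what it's worth, the reason softmax can never assign exactly zero attention follows straight from its definition, since every exponential is strictly positive:

    \mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}} > 0 \quad \text{for every } i

so some attention mass always has to land somewhere, whether or not any token deserves it.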
> "An LLM with unbounded compute/context"
This isn't a thing we have, or will have.
It's like saying that a computer with infinite memory, CPU and power can certainly break SHA-256 and bring the world's economy down with it.
No, it's like this exchange: "A computer can't crack SHA hashes, it can only add and subtract numbers." "A computer can crack any SHA hash." "Yes, given infinite time."
The fact that you need infinite time for some of the stuff doesn't mean you can't do any of the stuff.
I mean, it doesn't need to compute all programs in a humanly reasonable amount of time.
It just needs to be able to compute enough programs to be useful.
Even our current infrastructure of precisely defined programs and compilers isn't able to compute all programs.
It seems reasonable that in the future you could give an LLM the Python language specification and a Python program, and it would iteratively return the answer.
If it's executing a program, then the easiest way to make it more efficient is to ditch the LLM and just execute the program. The LLM in this case is only (very, very inefficiently) approximating the very CPU it's running on. Just use the CPU to execute the program: you won't be running it on an approximated processor, you'll be running it on a deterministic, reliable one that will never give the wrong answer (given the right program and correct input, of course, and assuming no hardware failures, which would affect LLMs too).
They're neural networks like any other: universal function approximators. That doesn't mean approximating a function which executes a particular program won't require an intractably large neural network.
Woah super interesting, I didn't know about this. Will def read it! Seems like I was wrong?
Genuinely, what's the point of this comment? Are you allergic to cool stuff? Honestly curious as to what you were trying to achieve.
Nowhere in this post does the author say that it's ready with the current state of models, or that he'd use a foundation model for this. Why the hate?
> communicating complex ideas in conversation is hard and lossy
True, but...
> instead of building the website, the model would generate an interface for you to build it, where every user input to that interface queries the large model under the hood
This to me seems wildly _more_ lossy though, because it is by its nature immediately constraining. Whereas conversation at least has the possibility of expansiveness and lateral step-taking. I feel like mediating via an interface might become too narrow too quickly maybe?
For me, conversation, although linear and lossy, melds well with how our brain works. I just wish the conversational UXs we had access to were less rubbish, less linear. E.g. I'd love Claude or any of the major AI chat interfaces to have a 'forking' capability so I can go back to a certain point in time in the chat and fork off a new rabbit hole of context.
> nobody would want an email app that occasionally sends emails to your ex and lies about your inbox. But gradually the models will get better.
I think this is a huge impasse tho. And we can never make models 'better' in this regard. What needs to get 'better' - somehow - is how to mediate between models and their levers into the computer (what they have permission to do). It's a bad idea to even have a highly 'aligned' LLM send emails on our behalf without having us in the loop. The surface area for problems is just too great.
Yeah, forked-conversation UX is definitely one of my most desired features.
>I'd love Claude or any of the major AI chat interfaces to have a 'forking' capability so I can go back to a certain point in time in the chat and fork off a new rabbit hole of context.
ChatGPT has this feature: forking occurs by editing an old message. It will retain the entire history, which can still be navigated and interacted with. The UX isn’t perfect, but it gets the job done.
I think Cerebras and Groq would be fun to experiment with for generating interfaces on the fly with normal LLMs, since they are so fast.
What's the cost difference between Groq/Cerebras vs. using something else for inferencing open-source models? I'm guessing the speed comes at a cost?
$0.60/$1 per M tokens on Groq/Cerebras vs. $0.30 per M tokens on DeepInfra (for Llama 3.3 70B).
But note the free tiers for groq and cerebras are very generous.
I don't know off the top of my head, only played with it a little not seriously.
fair enough
We're going to have to move past considering an LLM to just be a model.
It's a database. The WYSIWYG example would require different object types to have different UI components. So if you change what a container represents in the UI, all its children should be recomputed.
Need direct association between labels in the model space and labels in the UI space.
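A minimal sketch of what that association could look like (all of the names below are made up for illustration, not from the article):

    # Hypothetical mapping from model-space labels to UI-space components.
    UI_FOR_LABEL = {
        "color":  "ColorPicker",
        "number": "Slider",
        "text":   "TextField",
    }

    def rebuild_children(container, model):
        # If what the container represents changes, every child's UI
        # component has to be recomputed from its model-space label.
        for child in container.children:
            child.component = UI_FOR_LABEL.get(model.label_of(child), "TextField")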
Databases don't hallucinate.
Correct, they just don't return anything. Which is the right behavior sometimes and the wrong behavior others.
> But gradually the models will get better. Even as they push further into the space of brand new experiences, they will slowly become reliable enough to use for real work.
This is an article of faith. Posts like this almost always boil down to one or two sentences like this, on which the entire rest of the post rests. It's postulated as a concrete fact, but it's really not.
We don't know how much better models will get, we don't know that they will get good enough to accomplish the tasks the author is talking about, hell, we don't even know if we'll ever see another appreciable increase in model quality ever again, or if we've already hit a local maximum. They MIGHT, but they also might not; we don't know.
This post would be a lot more honest if it started with "Hey, wouldn't it be neat if..."
I don't think models even need to get much, if any, better to accomplish this.
After every response, ask the model for some alternatives, things that could be adjusted, etc., and have it return a JSON response. Build the UI from that. Clicking on an element just generates a prompt and sends it back into the model.
It's like smart replies or the word suggestions popping up on my virtual keyboard now, but with a richer UI. It's not perfect, but I think it would be an improvement for many things.
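A rough sketch of that loop, assuming a hypothetical `call_model` wrapper and a made-up JSON shape (nothing here is from the article):

    import json

    def call_model(prompt: str) -> str:
        """Hypothetical wrapper around whatever chat-completion API you use."""
        raise NotImplementedError

    def propose_controls(last_response: str) -> list[dict]:
        # Ask the model to suggest adjustable options for its own response,
        # as JSON the client can render into buttons, sliders, etc.
        raw = call_model(
            "List 3-5 ways the response below could be adjusted, as a JSON "
            'array of {"label": ..., "prompt": ...} objects.\n\n' + last_response
        )
        return json.loads(raw)

    def on_click(option: dict) -> str:
        # Clicking a rendered control just turns back into another prompt.
        return call_model(option["prompt"])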
On a related line of enquiry, both gemini-2-flash and sonnet-3.5-original can act like computers, interpreting and responding to instructions written in code. These two models are the only ones to do it reliably.
Here's a thread https://x.com/xundecidability/status/1867044846839431614
And an example function for Gemini written in shell, where the system prompt is the function definition that interacts with the model. https://github.com/irthomasthomas/shelllm.sh/blob/main/shelp...
Very interesting paradigm shift.
Tangentially, I have considered the possible impact of thermodynamic computing in its application to machine learning models.
If (big if) we can get thermodynamic compute wells to work at room temperature or with cheap microcryogenics, it's foreseeable that we could have flash-scale AI accelerators (thermodynamic wells could be very simple in principle, like a flash cell).
That could give us the capability to run Tera-parameter models on drive-size devices using 5-50 watts of power. In such a case, it is foreseeable that it might become more efficient and economical to simulate deterministic computing devices when they are required for standard computing tasks.
My knee jerk reaction is “probably not” but still , it’s a foreseeable possibility.
Hard to say what the ramifications of that might be.
This is the wrong direction; it is retrograde to try to shoehorn NATURAL LANGUAGE UNDERSTANDING into existing GUI metaphors.
Instead of showing a "discoverable" palette of buttons and widgets, which is limited by screen space, just ASK the model what it can do and make sure it can answer. People obviously don't know to do that yet, so a simple on-screen prompt to the user will be necessary.
Yes, we should have access to "sliders" and other controls for fine-tuning the output or maintaining a desired setting across generations, but those are secondary to the models' ability to make sweeping, cohesive changes and to provide many alternatives for the user to CHOOSE from before they get to the stage of making fine-grained adjustments.
Amateur question: could there be a point where an LLM uses less compute to calculate a certain formula than regular computation does?
When the LLM knows how to simplify/solve the formula and the person using it doesn't, it could be much more efficient than directly running the brute-force/inefficient version provided by the user. A simple example would be summing all numbers from 0 to a billion; if you ask o1 to do this, it uses the O(1) analytical solution, rather than the naive brute-force O(n) approach.
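For concreteness, a sketch of the two approaches for that example (the closed form is just n(n+1)/2):

    # Summing 0..1e9: closed form vs. brute force.
    n = 1_000_000_000
    closed_form = n * (n + 1) // 2      # O(1): 500000000500000000
    # brute_force = sum(range(n + 1))   # O(n): same answer, ~1e9 additions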
Though even in this case, it is enormously more efficient to simply sum the first billion integers rather than find an analytic solution via a 405b parameter LLM...
Yes, an LLM could do it, since it can predict the next token for pretty much anything. But what's the error margin you are ready to tolerate?
I think lots of apps are going to go in the adaptive/generated-UI direction, even if it starts a lot simpler than generating the code.
Perhaps a UI based on a Salvador Dali painting - perhaps we should also be questioning our UI concepts of sliders, buttons, windows, and co.
I have always held the opinion that the so-called LUI is a fad. We humans are terrible at communicating; we want crutches / guard rails to help us get the message through. That is to say, I agree with the author to a large extent.
I think the proof that this is a good article is that people's reactions to it are taking them in so many different directions. It might or might not be very actionable this year (I for one... would like to see a lower level of hallucination and mansplaining in LLM output before it starts to hide itself behind a dynamically generated UI) but it seems, for sure, good to think with.
Sometimes it feels like we've taught computers to do long division by simulated hand just a billion times less efficiently.
I've gotten to a point where I have a visceral reaction to any intersection of AI and psychological thought. As a human, it dependably makes me feel sick. We're going to see a lot of changes that are good, and not so good.
I wish we had some kind of Central Processing Unit to do this instead of relying on hallucinating remote servers that need a subscription.
Can it run DOOM yet?
“Wherever they use AI as a tool they will, in the end, do the same with human beings.”
Why is that in quote marks? I couldn't find any matches in TFA nor elsewhere.
And as to the sentence itself, I'm unclear on what exactly it's saying; people have been using other people as tools since before recorded history. Leaving aside slavery, what is it that you would say HR departments and capitalism in general do?