OK, AI playing video games is cool. But you know what's really really cool? It looks like SIMA 2 is controlling the mouse and reading the screen at something approaching 30+fps. WANT. Computer use agents are so slow right now, this is really something. I wonder what the architecture is for this.
I desperately want an AI agent that can use my phone for me. Just something that takes an instruction for each screen and executes it.
"Open Chrome"
"Go to xyz.com"
"open hamburger menu"
"Click login"
etc. etc.
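Something like this loop, maybe (a very rough sketch; it assumes adb access, and find_target is a placeholder I made up for whatever vision model or accessibility-tree lookup would map the instruction to a tap point, not a real API):

    import subprocess

    def screenshot() -> bytes:
        # Grab the current screen as a PNG via adb.
        return subprocess.run(
            ["adb", "exec-out", "screencap", "-p"],
            capture_output=True, check=True,
        ).stdout

    def tap(x: int, y: int) -> None:
        # Tap the screen at pixel coordinates (x, y).
        subprocess.run(["adb", "shell", "input", "tap", str(x), str(y)], check=True)

    def find_target(png: bytes, instruction: str) -> tuple[int, int]:
        # Placeholder: ask a vision model (or walk the accessibility tree)
        # for the tap point matching the instruction.
        raise NotImplementedError

    def run(instructions: list[str]) -> None:
        for step in instructions:
            x, y = find_target(screenshot(), step)
            tap(x, y)

    run(["Open Chrome", "Go to xyz.com", "open hamburger menu", "Click login"])

The hard part is obviously find_target; everything around it is plumbing.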
Isn't that what voice a11y tools have been doing for years? Why do you need AI for that?
https://support.google.com/accessibility/android/answer/6151...
Droidrun did a Show HN recently. It's exactly that.
It's even cooler if humans find something to be excited about in this world, since AI is replacing everything we do.
The gap between high-level and low-level control of robots is closing. Right now thousands of hours of task-specific training data is being collected and trained on to create models that can control robots to execute specific tasks in specific contexts. This essentially turns the operation of a robot into a kind of video game, where inputs are only needed in a low-dimensional abstract form, such as "empty the dishwasher" or "repeat what I do" or "put your finger in the loop and pull the string". This will be combined with high-level control agents like SIMA 2 to create useful real-world robots.
I work on a much easier problem (physics-based character animation) after spending a few years in motion planning, and I haven’t really seen anything to suggest that the problem is going to be solved any time soon by collecting more data.
Why? The physics of large discrete objects (such as a robot) isn't very complicated.
I thought it was fast, accurate OCR that's holding everything back.
The problem becomes complicated once the large discrete objects are not actuated. Even worse if the large discrete objects are not consistently observable because of occlusions or other sensor limitations. And almost impossible if the large discrete objects are actuated by other agents with potentially adversarial goals.
Self driving cars, an application in which physics is simple and arguably two dimensional, have taken more than a decade to get to a deployable solution.
I hope we can get some (ideally local) version of this we can use as a "gaming minion". There's a lot of games where I probably would have played more if I could delegate the grind. If they're not that competent, it adds to the fun a little even.
I've always wanted an AI that can play my video games for me, so that I can spend my time doing more fun and fulfilling things, like cleaning the toilet, folding my laundry, washing my dishes, taking out the garbage. Now I will no longer have to worry about the annoying chores in life, like drawing art, writing poetry, or playing video games
This is what the wow bots were. They had a crazy level of agency even without AI.
Sorry, this is kind of nuts to me. You want something to play video games for you because the video game isn't fun? Just play a game that is fun. The point of the game is to play it.
It could be fun in a Factorio sense. Maybe the whole game becomes delegating to a bunch of smart robots and handling the organization, etc.
I mean, that's literally an RTS?
One thing I do with games is automate the grind. To me, that is part of the fun. I have built Lego robots to press a sequence of buttons repeatedly, and programmed microcontrollers using CircuitPython to press a series of keys or click the mouse at given intervals to grind various in-game currency and such. It's so common for me to do these kinds of things that I now instinctively look for places in gameplay that I can automate. I haven't done anything as complicated as using computer vision to look at the screen and respond to it, but I did see that Anthony Sottile did this to catch shiny Pokémon https://youtu.be/-0GIY5Ixgkk and doing something like this has been on my horizon.
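For anyone curious, the CircuitPython version of this kind of auto-presser is only a few lines with the adafruit_hid library (just a generic sketch; the keycode and interval obviously depend on the game):

    import time
    import usb_hid
    from adafruit_hid.keyboard import Keyboard
    from adafruit_hid.keycode import Keycode

    # Present the board to the host as a USB keyboard.
    kbd = Keyboard(usb_hid.devices)

    while True:
        kbd.send(Keycode.E)   # whatever the game's "interact"/grind key is
        time.sleep(2.5)       # spacing between presses

Flash it to any board with native USB and the game just sees a very patient keyboard.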
I would love Minecraft with more intelligent villagers I could boss around to mine and build for me.
You should look into modding. There have got to be a ton of automation and NPC scripting mods out there without any sort of AI model necessary.
So factorio?
No.
Agree. It would be cool to populate my Valheim server with a bunch of agents that are in competition.
>We’ve observed that, throughout the course of training, SIMA 2 agents can perform increasingly complex and new tasks, bootstrapped by trial-and-error and Gemini-based feedback.
>In subsequent training, SIMA 2’s own experience data can then be used to train the next, even more capable version of the agent. We were even able to leverage SIMA 2’s capacity for self-improvement in newly created Genie environments – a major milestone toward training general agents across diverse, generated worlds.
Pretty neat. I wonder how that works with Gemini; I suppose SIMA is a model (agent?) that runs on top of it?
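My guess at the loop, purely from reading between the lines of the post (all names here are placeholders, nothing from DeepMind):

    KEEP_THRESHOLD = 0.8  # made-up cutoff for keeping a rollout

    def self_improvement_round(sima, gemini, environments):
        experience = []
        for env in environments:
            task = gemini.propose_task(env.describe())    # Gemini sets a goal in plain text
            trajectory = sima.attempt(env, task)          # SIMA 2 acts from pixels + keyboard/mouse
            score = gemini.evaluate(task, trajectory)     # Gemini grades the attempt
            if score > KEEP_THRESHOLD:
                experience.append((task, trajectory))     # keep the good rollouts
        return sima.finetune(experience)                  # becomes the next, more capable agent

If that's roughly right, the interesting part is that the reward signal is just another Gemini call.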
That’s what it sounded like to me, a plain text interface between two distinct systems.
That’s what Claude Plays Pokémon is.
It's like the factorio moment where you unlock the roboport. No more manual changes to the world, drone swarms to build housing, roads, bridges, parks etc. so exciting.
I get why they do it, they are a business. I just wish Google would come down from their ivory tower and build in the open more like they used to (did they? maybe I'm misremembering...).
They've acquired this bad habit of keeping all their scientific experiments closed by default and just publishing press releases. I wish it was open-source by default and closed just when there's a good reason.
Don't get me wrong, I suppose this is more of a compliment. I really like what they are doing and I wish we could all participate in these advances.
Same! I want to play with this so bad!
Dreamer v3 was open, v4 coming soon?
This is obviously just a research project, but I do wonder about the next steps:
* After exploring and learning about a virtual world, can anything at all be transferred to an agent operating in the real world? Or would an agent operating in the real world have to be trained exclusively or partially in the real world?
* These virtual worlds are obviously limited in a lot of important ways (for example, character locomotion in a game is absolutely nothing like how a multi-limbed robot moves). Does there eventually need to be more sophisticated virtual worlds that more closely mirror our real world?
* Google seems clearly interested in generalized agents and AGI, but I'm actually somewhat interested in AI agents in video games too. Many video games have companion NPCs that you can sort of give tasks to, but in almost all cases, the companion NPCs are nearly uncontrollable and very limited in what they can actually do.
The end goal is to marry the lessons learned about HOW to learn in a virtual world with a high-fidelity world model that's currently out of reach for this generation of AI. In a year or two, once we have a world model that's realistic enough and fast enough, robots will be trained there and then (hopefully) generalize easily to the real world. This is groundwork trying to understand how to do that without having the models required to do it for real.
Look into the sim2real problem in robotics
At 0:52 in their demo video, there is a grammatical inconsistency in the agent's text output. The annotations in the video therefore appear to have been created by humans after the fact. Is Google up to their old marketing/hyping tricks again?
> SIMA 2 Reasoning:
> The user wants me to go to the ‘tomato house’. Based on the description ‘ripe tomato’, I identify the red house down the street.
The scene just before the one you describe has the user write "ripe tomato" in the description; you can see it in the video. The summary elides it, but the "ripe tomato" instruction is also clearly part of the context.
I can't speak to the content of the actual game being played, but it wouldn't surprise me if there was an in-game text prompt:
> "The house that looks like a ripe tomato!"
that was transformed into a "user prompt" in a more instructional format
> "Go to the tomato house"
And both were used in the agent output. At least the Y-axes on the graphs look more reasonable than some other recent benchmarks.
Would be cool to see if they could make it play StarCraft too and pit it against AlphaStar.
From what I see, SIMA only focuses on games where you control a single avatar from a 1st/3rd person perspective, and would assume that switching to a non-embodied game where you need to control the whole army at once would require significant retraining.
I'm almost 100% confident AlphaStar would win that match, but I'd love to watch it.
Isn't most of this demo No Man's Sky? The voiceover doesn't make it clear that the world is not generated by SIMA.
It's hard to keep up with the many different models and pace of progress.
Genie 3 is Google's world generating model: https://deepmind.google/blog/genie-3-a-new-frontier-for-worl...
This is not a world generating model.
It is a game playing model.
And my post is saying that if you don't really know better, from the narration, you'd think google also generated the world. At least that was my impression, and I'm vaguely familiar with these things.
If it can get through those lengthy, glitchy NMS story-mission tutorials quickly, it's already a superintelligence.
As much as some AI annoys me, this would be great for making games more accessible.
Yet another blog post that looks super impressive, until you get to the bottom and see the charts assessing held-out task performance on ASKA and MineDojo and see that it's still a paltry 15% success rate. (Holy misleading chart, Batman!) Yes, it's a major improvement over SIMA 1, but we are still a long way from this being useful for most people.
To be fair, it's 65% on all tasks (with a 75% human baseline) and 15% on unseen environments. They don't provide a human baseline for that, but I'd imagine it's much more than 15%.
It really feels like we are determined to simulate every possible task in every possible environment instead of building true intelligence.
I personally am extremely impressed about it reaching 15% on unseen environments. Note that just this year, we were surprised that LLMs became capable of making any progress whatsoever in GBA Pokemon games (that have significantly simpler worlds and control schemes).
As for "true intelligence" - I honestly don't think that there is such a thing. We humans have brains that are wired based on our ancestors evolving for billions of years "in every possible environment", and then with that in place, each individual human still needs quite a few years of statistical learning (and guided learning) to be able to function independently.
Obviously I'm not claiming that SIMA 2 is as intelligent as a human, or even that it's on the way there, but based on recent progress, I would be very surprised if we don't see humanoid robots using approaches inspired by this navigating our streets in a decade or so.
I'm curious what your definition of "true intelligence" is.
[flagged]
Could you please stop posting flamebait and breaking the site guidelines? You've unfortunately been doing it repeatedly, including this dreadful thread from a couple weeks ago: https://news.ycombinator.com/item?id=45781981. I realize the other person was doing it also, but you (<-- I don't mean you personally, but all of us) need to follow the rules regardless of what other people are doing.
Comments like what your account has been posting are not what this site is for, and destroy what it is for, so if you wouldn't mind reviewing https://news.ycombinator.com/newsguidelines.html and taking the intended spirit of the site more to heart, we'd be grateful.
It seems pretty clear to me that they're trying to develop AGI humanoid assistants/workers without the messy and expensive real-world hardware. Basically approaching the problem from the opposite end to a company like Tesla, which built a robot and is now trying to figure out how to make a computer drive it without needing constant hand-holding.