I'm currently making 2bit to 8bit GGUFs for local deployment! Will be up in an hour or so at https://huggingface.co/unsloth/Qwen3-Coder-480B-A35B-Instruc...
Also docs on running it in a 24GB GPU + 128 to 256GB of RAM here: https://docs.unsloth.ai/basics/qwen3-coder
Looks like the docs have a typo:

> Recommended context: 65,536 tokens (can be increased)

That should be the recommended token output, as shown in the official docs:

> Adequate Output Length: We recommend using an output length of 65,536 tokens for most queries, which is adequate for instruct models.
Oh thanks - so the output can be any length you like - I'm actually also making 1 million context length GGUFs as well! https://huggingface.co/unsloth/Qwen3-Coder-480B-A35B-Instruc...
Do 2bit quantizations really work? All the ones I've seen/tried were completely broken even when 4bit+ quantizations worked perfectly. Even if it works for these extremely large models, is it really much better than using something slightly smaller on 4 or 5 bit quant?
Oh the Unsloth dynamic ones are not 2bit at all - it's a mixture of 2, 3, 4, 5, 6 and sometimes 8bit.
Important layers are in 8bit, 6bit. Less important ones are left in 2bit! I talk more about it here: https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs
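For intuition only, here's a toy sketch of what such a mixed-precision layout amounts to - the layer patterns and bit choices below are made-up illustrations, not Unsloth's actual recipe:

    # Toy illustration: assign more bits to sensitive tensors, fewer to bulky expert FFNs.
    # The name patterns follow GGUF-style tensor names; the thresholds are hypothetical.
    def pick_bits(tensor_name: str) -> int:
        if "embd" in tensor_name or tensor_name.startswith("output"):
            return 8   # embeddings / output head: keep near-lossless
        if "attn" in tensor_name:
            return 6   # attention projections: fairly sensitive
        if "exps" in tensor_name:
            return 2   # MoE expert FFNs: most of the size, most tolerant
        return 4       # everything else: middle ground

    layout = {name: pick_bits(name) for name in [
        "token_embd.weight",
        "blk.0.attn_q.weight",
        "blk.0.ffn_up_exps.weight",
        "output.weight",
    ]}
    print(layout)  # {'token_embd.weight': 8, 'blk.0.attn_q.weight': 6, ...}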
Not an AI researcher here, so this is probably common knowledge for people in this field, but I saw a video about quantization recently and wondered exactly that: whether it's possible to compress a net by using more precision where it counts and less precision where it's not important. I also wondered how one would go about deciding which parts count and which don't.
Great to know that this is already a thing and I assume model "compression" is going to be the next hot topic
Yes you're exactly thinking correctly! We shouldn't quantize a model naively to 2bit or 4bit, but we should do it smartly!
How do you pick which layer should be 2-bit, which should be 4-bit, etc.? Is this secret sauce, or something open?
Oh I wrote about it here: https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs We might provide some scripts for them in the future!
Thanks! But, I can't find any details on how you "intelligently adjust quantization for every possible layer" from that page. I assume this is a secret?
I am wondering about the possibility that different use cases might require different "intelligent quantization", i.e., quantization for LLM for financial analysis might be different from LLM for code generation. I am currently doing a postdoc in this. Interested in doing research together?
Oh we haven't published about it yet! I talk about it in bits and pieces - we might do a larger blog on it!
Yes, different use cases will be different - oh interesting! Sorry, I doubt I can be of much help in your research - I'm mainly an engineering guy, so less research focused!
How do you decide which layers are the important ones?
I wrote approximately in the blog about it and linked some papers! I also wrote about it here - https://unsloth.ai/blog/dynamic-4bit - one has to inspect the activation and weight quantization errors!
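To make "inspect the quantization errors" concrete, here is a minimal sketch of the idea (random stand-in weights, not the actual Unsloth pipeline): round-trip each tensor through a crude k-bit quantizer and rank layers by relative error, so the hardest-hit layers become candidates for higher bit widths. In practice the error would also be weighted by calibration activations (imatrix-style) rather than looking at weights alone.

    import numpy as np

    def fake_quantize(w, bits):
        # Crude symmetric round-trip quantization, only used to probe sensitivity.
        levels = 2 ** (bits - 1) - 1
        scale = np.abs(w).max() / levels
        return np.round(w / scale) * scale

    def relative_error(w, bits):
        wq = fake_quantize(w, bits)
        return float(np.linalg.norm(w - wq) / (np.linalg.norm(w) + 1e-12))

    # Stand-in tensors; a real run would iterate over the actual model weights.
    rng = np.random.default_rng(0)
    tensors = {
        "blk.0.attn_q.weight": rng.normal(0.0, 0.02, (256, 256)),
        "blk.0.ffn_up_exps.weight": rng.normal(0.0, 0.02, (256, 256)),
    }
    for name, w in sorted(tensors.items(), key=lambda kv: -relative_error(kv[1], 2)):
        print(f"{name}: 2-bit relative error = {relative_error(w, 2):.3f}")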
So you are basically looking at "fMRI" of the "brain" while it's doing a wide range of things and cutting out the things that stay dark the most?
Oh that's a good analogy! Yes that sounds right!
> The key reason to use Unsloth quants is because of our deep involvement in fixing critical bugs across major models
sounds convincing, eh ... /s
On a less cynical note, the approach does look interesting, but I'd also like to understand how and why it works, if it works at all.
Oh we actually fixed bugs! We fixed a few bugs in Gemma - see https://news.ycombinator.com/item?id=39671146, a gradient accumulation bug see https://news.ycombinator.com/item?id=41859037, Phi bugs, Llama bugs and more! See https://unsloth.ai/blog/reintroducing for more details!
What does your approach with dynamic weights have to do with those bugs? All those bugs seem unrelated to the technique.
Oh apologies I got confused - it's because when we calculate our dynamic quants, we have to do it on the fixed model!
In Phi 3, for example, the end-of-sentence token was wrong - if we used it, then our quants would be calibrated incorrectly, since chatting with the model will use the actual correct token.
Another is Llama 4 - https://github.com/ggml-org/llama.cpp/pull/12889 in which I fixed a RoPE issue - if we didn't fix it first, then again the calibration process would be incorrect.
Ok, this then goes to say that your approach doesn't work without applying the fixes to the vanilla models first. What I'm trying to understand is the approach itself: why and how does it work?
Oh I wrote a bit about it in https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs and https://unsloth.ai/blog/deepseekr1-dynamic if that helps!
If you don't mind divulging, what resources and time did it take to dynamically quantize Qwen3-Coder?
It takes a few hours to compute the imatrix on a calibration dataset, since we use more than 1-3 million tokens of high quality data. Then we have to decide which layers to quantize to higher bits or not, which takes more time. The quantization creation also takes some hours, and uploading takes time as well. Overall maybe 8 hours minimum?
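For reference, the mechanical half of that pipeline maps onto llama.cpp's stock tools (the dynamic per-layer bit selection sits on top of this); file names and the calibration file below are placeholders:

    import subprocess

    # 1) Build an importance matrix over a calibration corpus - this is the multi-hour step.
    subprocess.run([
        "llama-imatrix",
        "-m", "Qwen3-Coder-480B-A35B-Instruct-BF16.gguf",  # placeholder path
        "-f", "calibration.txt",                           # high-quality calibration text
        "-o", "imatrix.dat",
    ], check=True)

    # 2) Quantize, letting the imatrix guide which weights can tolerate fewer bits.
    subprocess.run([
        "llama-quantize",
        "--imatrix", "imatrix.dat",
        "Qwen3-Coder-480B-A35B-Instruct-BF16.gguf",
        "Qwen3-Coder-480B-A35B-Instruct-Q2_K.gguf",
        "Q2_K",
    ], check=True)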
What cluster do you have to do the quantizing? I'm guessing you're not using a single machine with a 3090 in your garage.
Oh definitely not! I use some spot cloud instances!
But you can get one of these quantized models to run effectively on a 3090?
If so, I'd love detailed instructions.
The guide you posted earlier goes over my (and likely many others') head!
Oh yes definitely! Oh wait is the guide too long / wordy? This section https://docs.unsloth.ai/basics/qwen3-coder-how-to-run-locall... shows how to run it on a 3090
Kind of you to respond! Thanks!
I have pretty bad ADHD. And I've only run locally using kobold; dilettante at DIY AI.
So, yeah, I'm a bit lost in it.
Oh sorry - for Kobold - I think it uses llama.cpp under the hood? I think Kobold has some guides on using custom GGUFs
Thanks Daniel. Do you recommend any resources showing the differences between different quantizations?
Oh our blog https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs compares the accuracy differences for each different quantization method for Llama 4 Scout and also Gemma 3 27B - they should apply to other quants (like Qwen 3 Coder)
I had given up a long time ago on self-hosted transformer models for coding because the SOTA was definitely in favor of SaaS. This might just make me give it another try.
Would llama.cpp support multiple (rtx 3090, no nvlink hw bridge) GPUs over PCIe4? (Rest of the machine is 32 CPU cores, 256GB RAM)
How fast you run this model will strongly depend on whether you have DDR4 or DDR5 RAM.
You will be mostly using 1 of your 3090s. The other one will be basically doing nothing. You CAN put the MoE weights on the 2nd 3090, but it's not going to speed up inference much, like <5% speedup. As in, if you lack a GPU, you'd be looking at <1 token/sec speeds depending on how fast your CPU does flops; if you have a single 3090 you'd be doing 10 tokens/sec, but with 2 3090s you'll still just be doing maybe 11 tokens/sec. These numbers are made up, but you get the idea.
Qwen3 Coder 480B is 261GB for IQ4_XS, 276GB for Q4_K_XL, so you'll be putting all the expert weights in RAM. That's why your RAM bandwidth is your limiting factor. I hope you're running off a workstation with dual cpus and 12 sticks of DDR5 RAM per CPU, which allows you to have 24 channel DDR5 RAM.
1 CPU, DDR4 ram
How many channels of DDR4 ram? What speed is it running at? DDR4-3200?
The (approximate) equation for milliseconds per token is:
Time for token generation = (number of params active in the model)*(quantization size in bits)/8 bits*[(percent of active params in common weights)/(memory bandwidth of GPU) + (percent of active params in experts)/(memory bandwidth of system RAM)].
This equation ignores prefill (prompt processing) time. This assumes the CPU and GPU is fast enough compute-wise to do the math, and the bottleneck is memory bandwidth (this is usually true).
So for example, if you are running Kimi K2 (32b active params per token, 74% of those params are experts, 26% of those params are common params/shared expert) at Q4 quantization (4 bits per param), and have a 3090 gpu (935GB/sec) and an AMD Epyc 9005 cpu with 12 channel DDR5-6400 (614GB/sec memory bandwidth), then:
Time for token generation = (32b params)*(4bits/param)/8 bits*[(26%)/(935 GB/s) + (74%)/(614GB/sec)] = 23.73 ms/token or ~42 tokens/sec. https://www.wolframalpha.com/input?i=1+sec+%2F+%2816GB+*+%5B...
Notice how this equation explains how the second 3090 is pretty much useless. If you load up the common weights on the first 3090 (which is standard procedure), then the 2nd 3090 is just "fast memory" for some expert weights.

If the quantized model is 256GB (rough estimate, I don't know the model size off the top of my head), and common weights are 11GB (this is true for Kimi K2, I don't know if it's true for Qwen3, but this is a decent rough estimate), then you have 245GB of "expert" weights. Yes, this is generally the correct ratio for MoE models, Deepseek R1 included. If you put 24GB of that 245GB on your second 3090, you have 935GB/sec speed on... 24/245 or ~10% of each token.

In my Kimi K2 example above, you start off with 18.08ms per token spent reading the model from RAM, so even if your 24GB on your GPU was infinitely fast, it would still take... about 16ms per token reading from RAM. Or in total about 22ms/token, or about 45 tokens/sec. That's with an infinitely fast 2nd GPU - you get a speedup of merely 3 tokens/sec.
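For reference, the formula is only a few lines of code; plugging in the Kimi K2 numbers above reproduces the ~42 tokens/sec figure (all inputs are the example values from this comment, and RAM bandwidth is roughly channels x MT/s x 8 bytes):

    def tokens_per_sec(active_params_b, bits, frac_common, gpu_bw_gbs, frac_experts, ram_bw_gbs):
        # Memory-bandwidth-bound decode estimate; ignores prefill and assumes compute keeps up.
        gb_per_token = active_params_b * bits / 8
        seconds = gb_per_token * (frac_common / gpu_bw_gbs + frac_experts / ram_bw_gbs)
        return 1.0 / seconds

    # Kimi K2 example: 32B active params, Q4, 3090 (935 GB/s) + 12-channel DDR5-6400
    # (12 channels * 6400 MT/s * 8 bytes = ~614 GB/s).
    print(round(tokens_per_sec(32, 4, 0.26, 935, 0.74, 614), 1))  # ~42.1 tok/sec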
Inspired me to write this, since it seems like most people don't understand how fast models run:
https://unexcitedneurons.substack.com/p/how-to-calculate-hom...
Thank you for that writeup!
In my case it is a fairly old system I built from cheap eBay parts. Threadripper 3970X with 8x32GB dual channel 2666MHz DDR4.
Oh yes llama.cpp's trick is it supports any hardware setup! It might be a bit slower, but it should function well!
Thanks for the uploads! Was reading through the Unsloth docs for Qwen3-Coder before I found the HN thread :)
What would be a reasonable throughput level to expect from running 8-bit or 16-bit versions on 8x H200 DGX systems?
Oh 8*H200 is nice - for llama.cpp definitely look at https://docs.unsloth.ai/basics/qwen3-coder-how-to-run-locall... - llama.cpp has a high throughput mode which should be helpful.
You should be able to get 40 to 50 tokens/s at minimum. High throughput mode + a small draft model might get you 100 tokens/s generation
Thank you for your work, does the Qwen3-Coder offer significant advantage over Qwen2.5-coder for non-agentic tasks like just plain autocomplete and chat?
Oh it should be better, especially since the model was specifically designed for coding tasks! You can disable the tool calling parts of the model!
I've been reading about your dynamic quants, very cool. Does your library let me produce these, or only run them? I'm new to this stuff.
Thank you! Oh currently not sadly - we might publish some stuff on it in the future!
Would a version of this ever be possible to run on a machine with a 16GB gpu and 64gb RAM?
What will be the approx token/s prompt processing and generation speed with this setup on RTX 4090?
I also just made IQ1_M which needs 160GB! If you have 160-24 = 136-ish GB of RAM as well, then you should get 3 to 5-ish tokens per second.
If you don't have enough RAM, then < 1 token / s
Any idea if there is a way to run on 256gb ram + 16gb vram with usable performance, even if barely?
Yes! 3bit maybe 4bit can also fit! llama.cpp has MoE offloading, so your GPU holds the active experts and non-MoE layers, thus you only need 16GB to 24GB of VRAM! I wrote about how to do it in this section: https://docs.unsloth.ai/basics/qwen3-coder#improving-generat...
awesome documentation, I'll try this. thank you!
Cool, thanks! I'd like to try it
It just got uploaded! I made some docs as well on how to run it at https://docs.unsloth.ai/basics/qwen3-coder
hello sir
hi!
> Qwen3-Coder is available in multiple sizes, but we’re excited to introduce its most powerful variant first
I'm most excited for the smaller sizes because I'm interested in locally-runnable models that can sometimes write passable code, and I think we're getting close. But since for the foreseeable future, I'll probably sometimes want to "call in" a bigger model that I can't realistically or affordably host on my own computer, I love having the option of high-quality open-weight models for this, and I also like the idea of "paying in" for the smaller open-weight models I play around with by renting access to their larger counterparts.
Congrats to the Qwen team on this release! I'm excited to try it out.
> I'm most excited for the smaller sizes because I'm interested in locally-runnable models that can sometimes write passable code, and I think we're getting close.
Likewise, I found that the regular Qwen3-30B-A3B worked pretty well on a pair of L4 GPUs (60 tokens/second, 48 GB of memory) which is good enough for on-prem use where cloud options aren't allowed, but I'd very much like a similar code specific model, because the tool calling in something like RooCode just didn't work with the regular model.
In those circumstances, it isn't really a comparison between cloud and on-prem, it's on-prem vs nothing.
30B-A3B works extremely well as a generalist chat model when you pair with scaffolding such as web search. It's fast (for me) using my workstation at home running a 5070 + 128GB of DDR4 3200 RAM @ ~28 tok/s. Love MoE models.
Sadly it falls short during real world coding usage, but fingers crossed that a similarly sized coder variant of Qwen 3 can fill in that gap for me.
This is my script for the Q4_K_XL version from unsloth at 45k context:
llama-server.exe --host 0.0.0.0 --no-webui --alias "Qwen3-30B-A3B-Q4_K_XL" --model "F:\models\unsloth\Qwen3-30B-A3B-128K-GGUF\Qwen3-30B-A3B-128K-UD-Q4_K_XL.gguf" --ctx-size 45000 --n-gpu-layers 99 --slots --metrics --batch-size 2048 --ubatch-size 2048 --temp 0.6 --top-p 0.95 --min-p 0 --presence-penalty 1.5 --repeat-penalty 1.1 --jinja --reasoning-format deepseek --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn --no-mmap --threads 8 --cache-reuse 256 --override-tensor "blk\.([0-9][02468])\.ffn_.*_exps\.=CPU"
I love Qwen3-30B-A3B for translation and fixing up transcripts generated by automatic speech recognition models. It's not the most stylish translator (a bit literal), but it's generally better than the automatic translation features built into most apps, and it's much faster since there's no network latency.
It has also been helpful (when run locally, of course) for addressing questions-- good faith questions, not censorship tests to which I already know the answers-- about Chinese history and culture that the DeepSeek app's censorship is a little too conservative for. This is a really fun use case actually, asking models from different parts of the world to summarize and describe historical events and comparing the quality of their answers, their biases, etc. Qwen3-30B-A3B is fast enough that this can be as fun as playing with the big, commercial, online models, even if its answers are not equally detailed or accurate.
> good faith questions
yep, when you hire an immigrant software engineer, you don't ask them if Israel has a right to exist, or whether Vladivostok is part of China. Unless you are a DoD vendor, in which case there won't be an interview anyway.
Give devstral a try, fp8 should fit in 48GB, it was surprisingly good for a 24B local model, w/ cline/roo. Handles itself well, doesn't get stuck much, most of the things work OK (considering the size ofc)
I did! I do think Mistral models are pretty okay, but even the 4-bit quantized version runs at about 16 tokens/second, more or less usable but a biiiig step down from the MoE options.
Might have to swap out Ollama for vLLM though and see how different things are.
> Might have to swap out Ollama for vLLM though and see how different things are.
Oh, that might be it. Using GGUF is slower than, say, AWQ if you want 4bit, or fp8 if you want the best quality (especially on Ada arch, which I think your GPUs are).
edit: vLLM is better for Tensor Parallel and also better for batched inference; some agentic stuff can do multiple queries in parallel. We run devstral fp8 on 2x A6000 (old, not even Ada) and even with Marlin kernels we get ~35-40 t/s gen and 2-3k pp on a single session, with ~4 parallel sessions supported at full context. But in practice it can work with 6 people using it concurrently, as not all sessions get to the max context. You'd get 1/2 of that for 2x L4, but should see higher t/s in generation since you have Ada GPUs (native support for fp8).
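For reference, a minimal vLLM sketch (model id, context length and sampling values below are placeholders - swap in whatever fp8/AWQ checkpoint you actually use) looks something like:

    from vllm import LLM, SamplingParams

    # Placeholder model id; tensor_parallel_size=2 splits the weights across both GPUs.
    llm = LLM(
        model="mistralai/Devstral-Small-2505",  # placeholder - use an fp8 or AWQ build
        quantization="fp8",                     # on-the-fly fp8; drop if the checkpoint is pre-quantized
        tensor_parallel_size=2,
        max_model_len=32768,
    )

    out = llm.generate(
        ["Write a Python function that parses an ISO-8601 timestamp."],
        SamplingParams(temperature=0.2, max_tokens=512),
    )
    print(out[0].outputs[0].text)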
Currently, the goal of everyone is creating one master model to rule them all, so we haven't seen too much specialization. I wonder how much more efficient smaller models could be if we created language specialized models.
It feels intuitively obvious (so maybe wrong?) that a 32B Java Coder would be far better at coding Java than a generalist 32B Coder.
I’ll fill the role to push back on your Java coder idea!
First, Java code tends to be written a certain way, and for certain goals and business domains.
Let’s say 90% of modern Java is a mix of:

* students learning to program and writing algorithms
* corporate legacy software from non-tech focused companies
If you want to build something that is uncommon in that subset, it will likely struggle due to a lack of training data. And if you wanted to build something like a game, the majority of your training data is going to be based on ancient versions of Java, back when game development was more common in Java.
Comparatively, including C in your training data gives you exposure to a whole separate set of domain data for training, like IoT devices, kernels, etc.
Including Go will likely include a lot more networking and infrastructure code than Java would have had, which means there is also more context to pull from in what networking services expect.
Code for those domains follow different patterns, but the concepts can still be useful in writing Java code.
Now, there may be a middle ground where you could have a model that is still general for many coding languages, but given extra data and fine-tuning focused on domain-specific Java things — like more of a “32B CorporateJava Coder” model — based around the very specific architecture of Spring. And you’d be willing to accept that model to fail at writing games in Java.
It’s interesting to think about for sure - but I do feel like domain-specific might be more useful than language-specific
Don't we also find with natural languages that focusing on training data from only a single language doesn't actually result in better writing in the target language, either?
JetBrains have done this with their Mellum models that they use for autocompletion, https://ollama.com/JetBrains
fine tuned rather than created from scratch though.
Been using ggerganov’s llama vscode plugin with the smaller 2.5 models and it actually works super nice on a M3 Max
I'm on an M1 Max with 64GB RAM, but I've never used this VSCode plugin before. Should I try it?
Is this the one? https://github.com/ggml-org/llama.vscode It seems to be built for code completion rather than outright agent mode
What languages do you work in? How much code do you keep? Do you end up using it as scaffolding and rewriting it, or leaving most of it as is?
Languages: JS/TS, C/C++, shader code, some ESP Arduino code. Not counting all the boilerplate and CSS that I don't care about too much.
It very much reminds of tabbing autocomplete with IntelliSense step by step, but in a more diffusion-like way.
But my tool-set is a mixture of agentic and autocomplete, not 100% of each. I try to keep a clear focus on the architecture, and actually own the code by reading most of it, keeping the parts of the code straight the way I like.
small models can never match bigger models, the bigger models just know more and are smarter. The smaller models can get smarter, but as they do, the bigger models get smarter too. HN is weird because at one point this was the location where I found the most technical folks, and now for LLMs I find them at Reddit. Tons of folks are running huge models - get to researching and you will find out you can realistically host your own.
> small models can never match bigger models, the bigger models just know more and are smarter.
They don't need to match bigger models, though. They just need to be good enough for a specific task!
This is more obvious when you look at the things language models are best at, like translation. You just don't need a super huge model for translation, and in fact you might sometimes prefer a smaller one because being able to do something in real-time, or being able to run on a mobile device, is more important than marginal accuracy gains for some applications.
I'll also say that due to the hallucination problem, beyond whatever knowledge is required for being more or less coherent and "knowing" what to write in web search queries, I'm not sure I find more "knowledgeable" LLMs very valuable. Even with proprietary SOTA models hosted on someone else's cloud hardware, I basically never want an LLM to answer "off the dome"; IME it's almost always wrong! (Maybe this is less true for others whose work focuses on the absolute most popular libraries and languages, idk.) And if an LLM I use is always going to be consulting documentation at runtime, maybe that knowledge difference isn't quite so vital— summarization is one of those things that seems much, much easier for language models than writing code or "reasoning".
All of that is to say:
Sure, bigger is better! But for some tasks, my needs are still below the ceiling of the capabilities of a smaller model, and that's where I'm focusing on local usage. For now that's mostly language-focused tasks entirely apart from coding (translation, transcription, TTS, maybe summarization). It may also include simple coding tasks today (e.g., fancy auto-complete, "ghost-text" style). I think it's reasonable to hope that it will eventually include more substantial programming tasks— even if larger models are still preferable for more sophisticated tasks (like "vibe coding", maybe).
If I end up having a lot of fun, in a year or two I'll probably try to put together a machine that can indeed run larger models. :)
> Even with proprietary SOTA models hosted on someone else's cloud hardware, I basically never want an LLM to answer "off the dome"; IME it's almost always wrong! (Maybe this is less true for others whose work focuses on the absolute most popular libraries and languages, idk.)
I feel like I'm the exact opposite here (despite heavily mistrusting these models in general): if I came to the model to ask it a question, and it decides to do a Google search, it pisses me off as I not only could do that, I did do that, and if that had worked out I wouldn't be bothering to ask the model.
FWIW, I do imagine we are doing very different things, though: most of the time, when I'm working with a model, I'm trying to do something so complex that I also asked my human friends and they didn't know the answer either, and my attempts to search for the answer are failing as I don't even know the terminology.
> I feel like I'm the exact opposite here (despite heavily mistrusting these models in general): if I came to the model to ask it a question, and it decides to do a Google search, it pisses me off as I not only could do that, I did do that, and if that had worked out I wouldn't be bothering to ask the model.
When a model does a single web search and emulates a compressed version of the "I'm Feeling Lucky" button, I am disappointed, too. ;)
I usually want the model to perform multiple web searches, do some summarization, refine/adjust search terms, etc. I tend to avoid asking LLMs things that I know I'll find the answer to directly in some upstream official documentation, or a local man page. I've long been and remain a big "RTFM" person; imo it's still both more efficient and more accurate when you know what you're looking for.
But if I'm asking an LLM to write code for me, I usually still enable web search on my query to the LLM, because I don't trust it to "remember" APIs. (I also usually rewrite most or all of the code because I'm particular about style.)
>you might sometimes prefer a smaller one because being able to do something in real-time, or being able to run on a mobile device, is more important than marginal accuracy gains for some applications.
This reminds me of ~”the best camera is the one you have with you” idea.
Though large models are an HTTP request away, there are plenty of reasons to want to run one locally. Not the least of which is getting useful results in the absence of internet.
All of these models are suitable for translation, and that is what they are most suitable for. The architecture inherits from seq2seq, and the original transformer was created for Google's machine translation.
For coding though it seems like people are willing to pay a lot more for a slightly better model.
The problem with local vs remote isn't so much about paid. It is about compliance and privacy.
For me, the sense of a greater degree of independence and freedom is also important. Especially when the tech world is out of its mind with AI hype, it's difficult to feel the normal tinkerer's joy when I'm playing with some big, proprietary model. The more I can tweak at inference time, the more control I have over the tools in use, the more I can learn about how a model works, and the closer to true open-source the model is, the more I can recover my child-like joy at playing with fun and interesting tech-- even if that tech is also fundamentally flawed or limited, over-hyped, etc.
> HN is weird because at one point this was the location where I found the most technically folks, and now for LLM I find them at reddit.
Is this an effort to chastise the viewpoint advanced? Because his viewpoint makes sense to me: I can run biggish models on my 128GB MacBook but not huge ones - even 2-bit quantized ones suck up too many resources.
So I run a combination of local stuff and remote stuff depending upon various factors (cost, sensitivity of information, convenience/whether I'm at home, amount of battery left, etc ;)
Yes, bigger models are better, but often smaller is good enough.
The large models are using tools/functions to make them useful. Sooner or later open source will provide a good set of tools/functions for coding as well.
I'd be interested in smaller models that were less general, with a training corpus more concentrated. A bash scripting model, or a clojure model, or a zig model, etc.
Well yes tons of people are running them but they're all pretty well off.
I don't have $10-20k to spend on this stuff, which is about the minimum to run a 480B model with huge quantisation. And pretty slow at that, because for that price all you get is an old Xeon with a lot of memory or some old Nvidia datacenter cards. If you want a good setup it will cost a lot more.
So small models it is. Sure, the bigger models are better but because the improvements come so fast it means I'm only 6 months to a year behind the big ones at any time. Is that worth 20k? For me no.
The small model only needs to get as good as the big model is today, not as the big model is in the future.
There's a niche for small-and-cheap, especially if they're fast.
I was surprised in the AlphaEvolve paper how much they relied on the flash model because they were optimizing for speed of generating ideas.
Not really true. Gemma from Google with quantization-aware training does an amazing job.
Under the hood, the way it works is that when you have the final probabilities, it really doesn't matter if the most likely token is selected with 59% or 75% - in either case it gets selected. If the 59% case gets there with a smaller amount of compute, and that holds across the board for the training set, the model will have similar performance.
In theory, it should be possible to narrow down models even smaller to match the performance of big models, because I really doubt that you do need transformers for every single forward pass. There are probably plenty of shortcuts you can take in terms of compute for sets of tokens in the context. For example, coding structure is much more deterministic than natural text, so you probably don't need as much compute to generate accurate code.
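A tiny illustration of that point (toy logits, greedy decoding assumed): whether the winning token gets ~75% or ~58% of the probability mass, the argmax - and therefore the emitted token - is the same.

    import numpy as np

    def softmax(logits):
        e = np.exp(logits - np.max(logits))
        return e / e.sum()

    big_model_logits   = np.array([3.0, 1.5, 0.7])   # top token ends up ~75%
    small_model_logits = np.array([1.2, 0.4, -0.1])  # same top token, only ~58%

    for name, logits in [("big", big_model_logits), ("small", small_model_logits)]:
        p = softmax(logits)
        print(name, p.round(2), "-> picks token", int(np.argmax(p)))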
You do need a big model first to train a small model though.
As for running huge models locally, it's not enough to run them; you need good throughput as well. If you spend $2k on a graphics card, that is way more expensive than realistic usage with a paid API, and slower output as well.
> small models can never match bigger models, the bigger models just know more and are smarter
Untrue. The big important issue for LLMs is hallucination, and making your model bigger does little to solve it.
Increasing model size is a technological dead end. The future advanced LLM is not that.
> and now for LLM I find them at reddit. tons of folks are running huge models
Very interesting. Any subs or threads you could recommend/link to?
Thanks
join us at r/LocalLlama
Basically just run Ollama and run the quantized models. Don't expect high generation speeds though.
which sub-reddits do you recommend?
The "qwen-code" app seems to be a gemini-cli fork.
https://github.com/QwenLM/qwen-code https://github.com/QwenLM/qwen-code/blob/main/LICENSE
I hope these OSS CC clones converge at some point.
Actually it is mentioned on the page:

> we’re also open-sourcing a command-line tool for agentic coding: Qwen Code. Forked from Gemini Code
Also, kudos to the Gemini CLI team for making it open source (unlike Claude), and for making it easily tunable to new models like Qwen.
It would be great if it starts supporting other models too natively. Wouldn't require people to fork.
What seems to be typical these days is that big companies ship the first tool very fast, in poor condition (applies to Gemini CLI as well), and then let the OSS ecosystem fix the issues. The backend is closed, so the app is their best shot. Then after some time the company gets most of the credit, not the contributors.
I tried to use JetBrains' official Kotlin MCP SDK recently and it couldn't even serve the MCP endpoint on a URL different from the expected default...
They had made a bunch of hard-coded assumptions
> They had made a bunch of hard-coded assumptions
Or they simply did that because it is much faster. Adding configuration options requires more testing and input handling. Later on, they can accept a PR when someone needs it a lot, saving their own time.
> then let the OSS ecosystem fix the issues
That's precisely half of the point of OSS and I am pretty much okay with that.