Kimi Released Kimi K2.5, Open-Source Visual SOTA-Agentic Model (kimi.com)
219 points by nekofneko 6 hours ago | 70 comments
  • Tepix 4 hours ago

    Huggingface Link: https://huggingface.co/moonshotai/Kimi-K2.5

    1T parameters, 32b active parameters.

    License: MIT with the following modification:

    Our only modification part is that, if the Software (or any derivative works thereof) is used for any of your commercial products or services that have more than 100 million monthly active users, or more than 20 million US dollars (or equivalent in other currencies) in monthly revenue, you shall prominently display "Kimi K2.5" on the user interface of such product or service.

    • endymi0n an hour ago | parent

      One. Trillion. Even on native int4 that’s… half a terabyte of vram?!

      Technical awe aside at this marvel that cracks the 50th percentile of HLE, the snarky part of me says there’s only half the danger in giving something away nobody can run at home anyway…
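
      Back-of-the-envelope, the arithmetic checks out (a sketch; the 1T figure is from the model card, int4 assumed throughout):

        params = 1.0e12        # 1T total parameters
        bytes_per_param = 0.5  # int4 = 4 bits = half a byte
        print(f"~{params * bytes_per_param / 1e9:.0f} GB for the weights alone")  # ~500 GB, before KV cache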

    • dheera 3 hours ago | parent

      > or more than 20 million US dollars (or equivalent in other currencies) in monthly revenue, you shall prominently display "Kimi K2.5" on the user interface of such product or service.

      Why not just say "you shall pay us 1 million dollars"?

      • vessenes 2 hours ago | parent

        ? They prefer the branding. The license just says you have to say it was them if you make > $240mm a year on the model.

      • viraptor 2 hours ago | parent

        Companies with $20M revenue will not normally have a spare $1M available. They'd get more money by charging reasonable subscriptions than by using lawyers to chase sudden company-ending fees.

        • laurentb 5 minutes ago | parent

          it's monthly :) $240M-revenue companies will absolutely find a way to fork over $1M if they need to. Kimi most likely sees the eyeballs of free advertising as more profitable in the grander scheme of things.

      • clayhacks 3 hours ago | parent

        I assume this allows them to sue for different amounts. And not discourage too many people from using it.

    • Imustaskforhelp 3 hours ago | parent

      Hey, have they open-sourced all the Kimi K2.5 variants (thinking, instruct, agent, agent swarm [beta])?

      Because I feel like they mentioned that agent swarm is available on their API, and that made me feel as if it wasn't open (weights)? Please let me know whether all of them are open source or not.

      • XenophileJKO an hour ago | parent

        I'm assuming the swarm part is all harness. Well, I mean a harness and a way of thinking that the weights have just been fine-tuned to use.

  • bertili 2 hours ago

    The "Deepseek moment" is just one year ago today!

    Coincidence or not, let's just marvel for a second at this amount of magic/technology being given away for free... and at how liberating and different this is from OpenAI and others that stayed closed to "protect us all".

  • jumploops 5 hours ago

    > For complex tasks, Kimi K2.5 can self-direct an agent swarm with up to 100 sub-agents, executing parallel workflows across up to 1,500 tool calls.

    > K2.5 Agent Swarm improves performance on complex tasks through parallel, specialized execution [..] leads to an 80% reduction in end-to-end runtime

    Not just RL on tool calling, but RL on agent orchestration, neat!

    • mohsen1 16 minutes ago | parent

      Parallel agents are such a simple, yet powerful hack. Using it in Claude Code with TeammateTool and getting lots of good results!

  • jdeng 23 minutes ago

    Glad to see open-source models catching up and treating vision as a first-class citizen (a.k.a. a native multimodal agentic model). GLM and Qwen take a different approach, having a base model and a separate vision variant (glm-4.6 vs glm-4.6v).

    I guess after Kimi K2.5, other vendors will go the same route?

    Can't wait to see how this model performs on computer automation use cases like VITA AI Coworker.

    https://www.vita-ai.net/

  • simonw an hour ago

    Pretty cute pelican https://tools.simonwillison.net/svg-render#%3Csvg%20viewBox%...

    • mythz an hour ago | parent

      doesn't work, looks like the link or SVG was cropped.

  • teiferer 13 minutes ago

    Can we please stop calling these models "open source"? Yes, the weights are open. So "open weights", maybe. But the source isn't open: the thing that allows you to re-create it. That's what "open source" used to mean. (Together with a license that allows you to use that source for various things.)

  • Barathkanna an hour ago

    A realistic setup for this would be a 16× H100 80GB with NVLink. That comfortably handles the active 32B experts plus KV cache without extreme quantization. Cost-wise we are looking at roughly $500k–$700k upfront or $40–60/hr on-demand, which makes it clear this model is aimed at serious infra teams, not casual single-GPU deployments. I’m curious how API providers will price tokens on top of that hardware reality.

    • reissbaker 39 minutes ago | parent

      Generally speaking, 8xH200s will be a lot cheaper than 16xH100s, and faster too. But both should technically work.

    • bertili an hour ago | parent

      The other realistic setup is $20k, for a small company that needs a private AI for coding or other internal agentic use: two Mac Studios connected over Thunderbolt 5 RDMA.

      • Barathkanna an hour ago | parent

        That won’t realistically work for this model. Even with only ~32B active params, a 1T-scale MoE still needs the full expert set available for fast routing, which means hundreds of GB to TBs of weights resident. Mac Studios don’t share unified memory across machines, Thunderbolt isn’t remotely comparable to NVLink for expert exchange, and bandwidth becomes the bottleneck immediately. You could maybe load fragments experimentally, but inference would be impractically slow and brittle. It’s a very different class of workload than private coding models.

        • zozbot234 an hour ago | parent

          If "fast" routing is per-token, the experts can just reside on SSD's. the performance is good enough these days. You don't need to globally share unified memory across the nodes, you'd just run distributed inference.

          Anyway, in the future your local model setups will just be downloading experts on the fly from experts-exchange. That site will become as important to AI as downloadmoreram.com.

        • bertili an hour ago | parent

          People are running the previous Kimi K2 on 2 Mac Studios at 21 tokens/s, or 4 Macs at 30 tokens/s. It's still premature, but not a completely crazy proposition for the near future, given the rate of progress.
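
          For what it's worth, a rough memory-bandwidth roofline lands in the same ballpark (a sketch; the bandwidth figure is an assumption, not a measurement):

            # Decode is memory-bound: each generated token streams the active weights.
            bandwidth_gb_s = 800  # assumed unified-memory bandwidth of one Mac Studio
            active_gb = 32e9 * 0.5 / 1e9  # 32B active params at int4 -> ~16 GB per token
            print(f"upper bound: ~{bandwidth_gb_s / active_gb:.0f} tokens/s per machine")  # ~50

          Routing and interconnect overhead eat into that, so ~21 tokens/s across two machines is believable.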

          • NitpickLawyer 28 minutes ago | parent

            > 2 Mac Studios at 21 tokens/s or 4 Macs at 30 tokens/s

            Keep in mind that most people posting speed benchmarks try them with basically 0 context. Those speeds will not hold at 32/64/128k context length.

      • zozbot234 an hour ago | parent

        That's great for affordable local use but it'll be slow: even with the proper multi-node inference setup, the thunderbolt link will be a comparative bottleneck.

      • embedding-shape an hour ago | parent

        I'd love to see the prompt processing speed difference between 16× H100 and 2× Mac Studio.

        • zozbot234 an hour ago | parent

          Prompt processing/prefill can even get some speedup from local NPU use most likely: when you're ultimately limited by thermal/power limit throttling, having more efficient compute available means more headroom.

        • Barathkanna an hour ago | parent

          I asked GPT for a rough estimate to benchmark prompt prefill on an 8,192-token input:

          - 16× H100: 8,192 / (20k to 80k tokens/sec) ≈ 0.10 to 0.41s

          - 2× Mac Studio (M3 Max): 8,192 / (150 to 700 tokens/sec) ≈ 12 to 55s

          These are order-of-magnitude numbers, but the takeaway is that multi-H100 boxes are plausibly ~100× faster than workstation Macs for this class of model, especially for long-context prefill.
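
          The same estimate as straight arithmetic, if you want to rerun it with your own numbers (the throughput ranges are the assumption; the rest follows):

            tokens = 8192
            for name, lo, hi in [("16x H100", 20_000, 80_000),
                                 ("2x Mac Studio", 150, 700)]:
                print(f"{name}: {tokens / hi:.2f}s to {tokens / lo:.1f}s prefill")
            # 16x H100: 0.10s to 0.4s; 2x Mac Studio: 11.70s to 54.6s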

  • Reubend 4 hours ago

    I've read several people say that Kimi K2 has a better "emotional intelligence" than other models. I'll be interested to see whether K2.5 continues or even improves on that.

    • storystarling 3 hours ago | parent

      Yes, though this is highly subjective - it 'feels' like that to me as well (compared to Gemini 3, GPT 5.2, Opus 4.5).

  • vinhnx 2 hours ago

    One thing that caught my eye is that besides the K2.5 model, Moonshot AI also launched Kimi Code (https://www.kimi.com/code), evolved from Kimi CLI. It is a terminal coding agent; I've been using it for the last month with a Kimi subscription, and it is a capable agent with a stable harness.

    GitHub: https://github.com/MoonshotAI/kimi-cli

  • Alifatisk an hour ago

    Have you all noticed that the latest releases from Chinese companies (Qwen3 Max Thinking, now Kimi K2.5) are benching against Claude Opus now, and not Sonnet? They are truly catching up, almost at the same pace.

    • zozbot234 an hour ago | parent

      The benching is sus; it's way more important to look at real usage scenarios.

  • throwaw12 an hour ago

    Congratulations, great work Kimi team.

    Why is it that Claude is still at the top in coding? Are they heavily focused on training for coding, or is their general training so good that it performs well in coding?

    Someone please beat Opus 4.5 in coding, I want to replace it.

  • zmmmmm 4 hours ago

    Curious what would be the most minimal reasonable hardware one would need to deploy this locally?

    • NitpickLawyer 3 hours ago | parent

      I parsed "reasonable" as having reasonable speed to actually use this as intended (in agentic setups). In that case, it's a minimum of $70-100k for hardware (8x 6000 PRO + all the other pieces to make it work). The model comes with a native INT4 quant, so ~600GB for the weights alone. An 8x 96GB setup would give you ~160GB for KV caching.

      You can of course "run" this on cheaper hardware, but the speeds will not be suitable for actual use (i.e. minutes for a simple prompt, tens of minutes per turn for high-context sessions).
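
      The budget for that box works out as follows (a sketch using the same numbers):

        gpus, vram_gb = 8, 96  # 8x RTX 6000 PRO class cards
        weights_gb = 600       # native INT4 checkpoint, weights only
        print(f"{gpus * vram_gb - weights_gb} GB left over")  # 168 GB for KV cache, minus runtime overhead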

    • simonw an hour ago | parent

      Models of this size can usually be run using MLX on a pair of 512GB Mac Studio M3 Ultras, which are about $10,000 each so $20,000 for the pair.

    • tosh an hour ago | parent

      I think you can put a bunch of Apple silicon Macs with enough RAM together,

      e.g. in an office or coworking space.

      800-1000 GB of RAM perhaps?

  • pu_pe 2 hours ago

    I don't get this "agent swarm" concept. You set up a task and they boot up 100 LLMs to try to do it in parallel, and then one "LLM judge" puts it all together? Is there anywhere I can read more about it?

    • vessenes 2 hours ago | parent

      You can read about this basically everywhere - the term of art is agent orchestration. Gas town, Claude’s secret swarm mode, or people who like to use phrases like “Wiggum loop” will get you there.

      If you’re really lazy - the quick summary is that you can benefit from the sweet spot of context length and reduce instruction overload while getting some parallelism benefits from farming tasks out to LLMs with different instructions. The way this is generally implemented today is through tool calling, although Claude also has a skills interface it has been trained against.

      So the idea would be for software development, why not have a project/product manager spin out tasks to a bunch of agents that are primed to be good at different things? E.g. an architect, a designer, and so on. Then you just need something that can rectify GitHub PRs and bob’s your uncle.

      Gas town takes a different approach and parallelizes on coding tasks of any sort at the base layer, and uses the orchestration infrastructure to keep those coders working constantly, optimizing for minimal human input.
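
      If it helps, the general shape in code is something like this minimal sketch (call_llm is a hypothetical stand-in, not any particular vendor's API):

        import asyncio

        async def call_llm(role: str, task: str) -> str:
            # Stand-in for a real model call; each sub-agent gets its own
            # narrow instructions, keeping per-agent context small.
            await asyncio.sleep(0)
            return f"[{role}] {task}: done"

        async def orchestrate(goal: str) -> str:
            # 1. A planner/PM agent decomposes the goal into specialized tasks.
            plan = [("architect", f"outline {goal}"),
                    ("designer", f"mock up {goal}"),
                    ("coder", f"implement {goal}")]
            # 2. Sub-agents run in parallel, each in its own context window.
            results = await asyncio.gather(*(call_llm(r, t) for r, t in plan))
            # 3. A final agent reconciles the partial results into one answer.
            return await call_llm("integrator", " | ".join(results))

        print(asyncio.run(orchestrate("the feature")))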

      • IanCal an hour ago | parent

        I'm not sure whether parts of this are done for Claude, but those other ones are layers on top of the usual LLMs we see. This seems to be a bit different, in that there's a model trained specifically for splitting up and managing the workload.

    • Rebuff5007 an hour ago | parent

      I've also been quite skeptical, and I became even more skeptical after hearing a tech talk from a startup in this space [1].

      I think the best way to think about it is that it's an engineering hack to deal with a shortcoming of LLMs: for complex queries, LLMs are unable to directly compute a SOLUTION given a PROMPT, but are instead able to break the prompt down into intermediate solutions and eventually solve the original prompt. These "orchestrator" / "swarm" agents add some formalism to this, let you distribute compute, and then also use specialized models for some of the sub-problems.

      [1] https://www.deepflow.com/

    • rvnx 2 hours ago | parent

      You have a team lead that establishes a list of tasks that are needed to achieve your mission

      then it creates a list of employees, each of them specialized for a task, and they work in parallel.

      Essentially hiring a team of people who each specialize in one problem.

      Do one thing and do it well.

    • jonkoops 2 hours ago | parent

      The datacenters yearn for the chips.

  • spaceman_2020 5 hours ago

    Kimi was already one of the best writing models. Excited to try this one out

    • Alifatisk 2 hours ago | parent

      To me, Kimi has been the best at writing and conversing, it's way more human-like!

  • striking 4 hours ago

    https://archive.is/P98JR

  • Topfi 3 hours ago

    K2 0905, and K2 Thinking shortly after it, did impressively well in my personal use cases and were severely slept on. Faster, more accurate, less expensive, more flexible in terms of hosting, and available months before Gemini 3 Flash; I really struggle to understand why Flash got such positive attention at launch.

    Interested in the dedicated Agent and Agent Swarm releases, especially in how that could affect third party hosting of the models.

    • msp26 3 hours ago | parent

      K2 Thinking didn't have vision, which was a big drawback for my projects.

  • hmate9 an hour ago

    About 600GB needed for the weights alone, so on AWS you need a p5.48xlarge (8× H100), which costs $55/hour.

  • monkeydust 2 hours ago

    Is this actually good, or just optimized heavily for benchmarks? I am hopeful it's the former based on the writeup, but I need to put it through its paces.

  • Jackson__ 3 hours ago

    As your local vision nut, their claims about "SOTA" vision are absolutely BS in my tests.

    Sure, it's SOTA at standard vision benchmarks. But on tasks that require proper image understanding (see for example BabyVision [0]), it appears very much lacking compared to Gemini 3 Pro.

    [0] https://arxiv.org/html/2601.06521v1

  • pplonski86 4 hours ago

    There are so many models; is there any website with a list of all of them and a comparison of their performance on different tasks?

    • Reubend 4 hours ago | parent

      The post actually has great benchmark tables inside of it. They might be outdated in a few months, but for now, it gives you a great summary. Seems like Gemini wins on image and video perf, Claude is the best at coding, ChatGPT is the best for general knowledge.

      But ultimately, you need to try them yourself on the tasks you care about and just see. My personal experience is that right now, Gemini Pro performs the best at everything I throw at it. I think it's superior to Claude and all of the OSS models by a small margin, even for things like coding.

      • Imustaskforhelp 3 hours ago | parent

        I like Gemini Pro's UI over Claude's so much, but honestly I might start using Kimi K2.5 if it's open source & just +/- Gemini Pro/ChatGPT/Claude, because at that point I feel like the differences are negligible and we are getting SOTA open source models again.

        • wobfan an hour ago | parent

          > honestly I might start using Kimi K2.5 if it's open source & just +/- Gemini Pro/ChatGPT/Claude, because at that point I feel like the differences are negligible and we are getting SOTA open source models again.

          Me too!

          > I like Gemini Pro's UI over Claude's so much

          This I don't understand. I mean, I don't see a lot of difference between the two UIs. Quite the opposite: apart from some animations, rounded corners, and color grading, they seem to look very alike, no?

    • coffeeri 4 hours ago | parent

      There is https://artificialanalysis.ai

      • pplonski86 3 hours ago | parent

        Thank you! Exactly what I was looking for

  • DeathArrow 5 hours ago

    Those are some impressive benchmark results. I wonder how well it does in real life.

    Maybe we can get away with something cheaper than Claude for coding.

    • oneneptune 4 hours ago | parent

      I'm curious about the "cheaper" claim -- I checked Kimi pricing, and it's a $200/mo subscription too?

    • NitpickLawyer 4 hours ago | parent

        On OpenRouter, K2.5 is at $0.60/$3 per Mtok. That's Haiku pricing.

        • storystarling 3 hours ago | parent

          The unit economics seem tough at that price for a 1T parameter model. Even with MoE sparsity you are still VRAM bound just keeping the weights resident, which is a much higher baseline cost than serving a smaller model like Haiku.
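
          As a quick sanity check on the shape of the math (every number below is an assumption, purely for illustration):

            node_cost_hr = 50.0  # assumed $/hr for an 8-GPU node that fits the weights
            agg_tok_s = 5_000    # assumed aggregate decode throughput across batched users
            print(f"~${node_cost_hr / (agg_tok_s * 3600 / 1e6):.2f} per Mtok")  # ~$2.78

          Whether $0.60/$3 clears that bar depends almost entirely on how hard a provider can batch.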

      • mrklol 3 hours ago | parent

        They also have a $20 and $40 tier.

  • lrvick 4 hours ago

    Actually open source, or yet another public model, which is the equivalent of a binary?

    URL is down so cannot tell.

    • typ 3 hours ago | parent

      The label 'open source' has become a reputation-reaping and marketing vehicle rather than an informative term since the Hugging Face benchmark race started. With the weights only, we cannot actually audit whether a model is a) contaminated by benchmarks, b) built with deliberate biases, or c) trained on copyrighted/private data, let alone allow other vendors to replicate the results. Anyway, people still love free stuff.

      • Der_Einzige 2 hours ago | parent

        Just accept that IP laws don't matter and the old "free software" paradigm is dead. Aaron Swartz died so that GenAI may live. RMS and his model of "copyleft" are so Web 1.0 (not even 2.0). No one in GenAI cares AT ALL about the true definition of open source. Good.

        • duskdozer 7 minutes ago | parent

          Good?

    • Tepix 4 hours ago | parent

      It's open weights, not open source.

  • mangolie 5 hours ago

    they cooked

  • billyellow 5 hours ago

    Cool

  • rvz 3 hours ago

    The chefs at Moonshot have cooked once again.