Hello HackerNews!
I’m excited to share what we’ve been working on at nCompass Technologies: an AI inference* platform that gives you a scalable and reliable API to access any open-source AI model, with no rate limits. We can skip rate limits because the optimizations we made to our AI model serving software let us support a high number of concurrent requests without degrading quality of service for you as a user.
If you’re thinking, "Well, aren’t there a bunch of these already?", so were we when we started nCompass. When using other APIs, we found that they weren’t reliable enough to use open-source models in production environments. To resolve this, we're building an AI inference engine that enables you, as an end user, to reliably use open-source models in production.
Underlying this API, we’re building optimizations at the hosting, scheduling and kernel levels with the single goal of minimizing the number of GPUs required to maximize the number of concurrent requests you can serve, without degrading quality of service.
We’re still building a lot of our optimizations, but we’ve released what we have so far via our API. We currently keep time-to-first-token (TTFT) 2-4x lower than vLLM at an equivalent concurrent request rate. You can check out a demo of our API here:
https://www.loom.com/share/c92f825ac0af4ab18296a16546a75be3
As a result of the optimizations we’ve rolled out so far, we’re releasing a few unique features on our API:
1. Rate-Limits: we don’t have any
Most other APIs out there have strict rate limits and can be rather unreliable. We don’t want APIs for open-source models to remain a solution for prototypes only. We want people to use these APIs like they do OpenAI’s or Anthropic’s and actually build production-grade products on top of open-source models.
2. Underserved models: we have them
There are a ton of models out there, but not all of them are readily available for people to use if they don’t have access to GPUs. We envision our API becoming a system where anyone can launch any custom model of their choice with minimal cold starts and run the model as a simple API call. Our cold starts for any 8B or 70B model are only 40s and we’ll keep improving this.
Towards this goal, we already have models like `ai4bharat/hercule-hi` hosted on our API to support non-English language use cases and models like `Qwen/QwQ-32B-Preview` to support reasoning-based use cases. You can find the other models that we host here: https://console.ncompass.tech/public-models for public ones, and https://console.ncompass.tech/models for private ones, which works once you've created an account.
We’d love for you to try out our API by following the steps here: https://www.ncompass.tech/docs/llm_inference/quickstart. We provide $100 of free credit on sign up to run models, and like we said, go crazy with your requests, we’d love to see if you can break our system :)
We’re still actively building out features and optimizations and your input can help shape the future of nCompass. If you have thoughts on our platform or want us to host a specific model, let us know at hello@ncompass.tech.
Happy Hacking!
* it's called inference because the process of taking a query, running it through the model and providing a result is referred to as "inference" in the AI / machine learning world. It's as opposed to "training" or "finetuning" which are processes used to actually develop the AI models that you then run "inference" on.
What are the trade-offs you've made to achieve this?
We focused mainly on the scheduling side of things. We essentially prioritize prefills over decodes. To do this correctly, we have to monitor KV cache usage, and whenever it's close to running out of memory, we switch back to scheduling more decodes.
This means you end up either having many decodes wait for prefills to complete, or scheduling decodes together with prefills. Both scenarios result in slower decodes, which is why we're seeing an increase in inter-token latency (ITL). This is the main trade-off we've made.
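To illustrate the idea, here is a minimal toy sketch of prefill-first scheduling with a KV cache watermark. It is not our actual scheduler: the class names, the `high_watermark` parameter and the token-count model of KV usage are all assumptions made for the example, and it ignores preemption and mixed prefill/decode batches.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    prompt_tokens: int    # tokens processed during prefill
    max_new_tokens: int   # tokens to generate during decode
    generated: int = 0

    @property
    def kv_tokens(self) -> int:
        # KV cache entries held once the request has been prefilled
        return self.prompt_tokens + self.generated


class PrefillFirstScheduler:
    """Toy policy: admit prefills first, decode when the KV cache is nearly full."""

    def __init__(self, kv_capacity: int, high_watermark: float = 0.9):
        self.kv_capacity = kv_capacity
        self.high_watermark = high_watermark  # hypothetical tuning knob
        self.waiting = deque()  # requests that have not been prefilled yet
        self.running = []       # prefilled requests that are still decoding

    def kv_used(self) -> int:
        return sum(r.kv_tokens for r in self.running)

    def submit(self, req: Request) -> None:
        self.waiting.append(req)

    def step(self):
        """Run one scheduling step and return (kind, request ids)."""
        budget = int(self.kv_capacity * self.high_watermark)
        # Prefer a prefill whenever the next prompt still fits under the watermark.
        if self.waiting and self.kv_used() + self.waiting[0].prompt_tokens <= budget:
            req = self.waiting.popleft()
            self.running.append(req)
            return "prefill", [req.req_id]
        # Otherwise decode one token for every running request.
        if self.running:
            batch = list(self.running)
            for r in batch:
                r.generated += 1
            # Finished requests leave the batch and free their KV cache.
            self.running = [r for r in batch if r.generated < r.max_new_tokens]
            return "decode", [r.req_id for r in batch]
        # Nothing runnable (or the next prompt is too large for the budget).
        return "idle", []
```

With three 40-token prompts and a budget of 90 tokens, the first two requests are prefilled back to back, and the third waits until decodes have freed cache space. That waiting is where the slower decodes for already-running requests come from.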
So, while time to first token is lower, throughput might also be lower in most cases?
Per-user throughput might be lower at the moment, yes. We're working on GPU kernel-level optimizations now to fix that.
But across all users on our system, the throughput is better, because doing more prefills or a large number of grouped decodes makes better use of the GPU.
The idea is that this works for someone who wants to build a product that is consistent across users in terms of initial response but can trade off some end-to-end (E2E) latency. It ensures that no one is waiting for a long time before getting the first response.
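If you want to check this trade-off on your own workload, the three numbers in question can be computed from token arrival times. The sketch below is a generic helper, not part of our API; the function name and the idea of collecting timestamps from a streamed response are assumptions for illustration.

```python
from statistics import mean


def latency_metrics(request_sent: float, token_times: list) -> dict:
    """Compute TTFT, mean ITL and E2E latency (in seconds) for one streamed response.

    request_sent: timestamp at which the request was sent
    token_times:  timestamps at which each output token arrived, in order
    """
    if not token_times:
        raise ValueError("need at least one token timestamp")
    ttft = token_times[0] - request_sent  # time to first token
    # Gaps between consecutive tokens; empty if only one token was produced.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    itl = mean(gaps) if gaps else 0.0     # mean inter-token latency
    e2e = token_times[-1] - request_sent  # end-to-end latency
    return {"ttft": ttft, "itl": itl, "e2e": e2e}
```

A prefill-first policy should show up as a lower `ttft` and a somewhat higher `itl` for the same request, with `e2e` depending on how many tokens are generated.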
Since you're calling out your support for underserved models, can I request you support some SOTA embeddings models? Support for embeddings is poor from other providers with only a handful of outdated models and poor latency.
Hey, great that you mentioned this. We actually had BAAI/bge-m3 on our list of models to put up in the near future to see if people had use for it over an API. It's great to hear that this is something you're looking for. If you could let us know if there was a specific model you wanted to run, we can look into getting that put up soon.
Unrelated: During the dot-com boom, there was a company called nCompass Labs that developed one of the first content management systems (https://en.wikipedia.org/wiki/NCompass_Labs_Inc). Microsoft bought them in 2001. Their product was, "a plug-in for hosting ActiveX controls in Netscape Navigator named ScriptActive." ActiveX itself was a novelty, using C++ templates to define reusable and _downloadable_ web components.
All of this crap was happily replaced with JavaScript frameworks in later years. Yes, back in the early-2000s, your browser might literally download executable code just to render a custom button.
It now makes sense that when we tested the domain ncompass.com it took us to a Microsoft home page, which is why we're ncompass.tech :)
https://console.ncompass.tech/models has no models on it, just a "Get in Touch" button.
Hey, I'm one of the co-founders, thanks for letting us know! I've just run it a few times and seems fine on my end. It does take a second to load the models, but feel free to let me know if this persists.
Same here. Waited 10 seconds. Then gave up. If the list of models takes so long to load, why should I trust you with loading the models themselves? :)
Hey, that link corresponds to the private models list, which only works once you've created an account. If you'd like to see the public models page, please check it out here: https://console.ncompass.tech/public-models.
We've put a wrong hyperlink on the website, but we've fixed that now, thanks for letting us know.
Whether we can reliably host models versus set up a website largely comes down to our technical background. All of us are hardware engineers, so front-end work is not our strong suit :). But our experience as hardware engineers makes us confident in hosting the models themselves. Both 8B and 70B models, if cached, do actually load in exactly 40s, but please feel free to try out the system and see for yourself!
Looks like you put the private models link in your Show HN post text as well - it's worth fixing.
Are you planning to support any image or video generation models, or focusing on text for now?
Thanks for letting us know, we've updated it now!
Although we're currently only supporting text models, we definitely have image and video generation models on our roadmap. These are very compute-intensive models, so they would benefit greatly from optimizations. We'd love to hear more about any specific models you're hoping to run! Please feel free to message us with further details (diederik.vink@ncompass.tech).
I've edited the text to include both links and explain the difference between them. Thanks!
What is Groq (rate limited) missing that you aren't?
That's a great question, but it's hard to get enough insight into how Groq is serving models to properly know what's missing.
If I had to hazard a guess, it would be that their system architecture (# of chips and chip architecture itself) might not be designed for a high concurrency situation.