Hello HackerNews!

I’m excited to share what we’ve been working on at nCompass Technologies: an AI inference* platform that gives you a scalable and reliable API to access any open-source AI model, with no rate limits. We can skip rate limits because the optimizations we’ve made to our model-serving software let us support a high number of concurrent requests without degrading quality of service for you as a user.

If you’re thinking, "aren’t there a bunch of these already?", so were we when we started nCompass. When we used other APIs, we found they weren’t reliable enough to run open-source models in production environments. To fix this, we’re building an AI inference engine that enables you, as an end user, to reliably use open-source models in production.

Underlying this API, we’re building optimizations at the hosting, scheduling, and kernel levels with a single goal: maximize the number of concurrent requests each GPU can serve without degrading quality of service.

We’re still building out many of our optimizations, but we’ve released what we have so far via our API. At an equivalent concurrent request rate, we currently keep time-to-first-token (TTFT) 2-4x lower than vLLM. You can check out a demo of our API here:

https://www.loom.com/share/c92f825ac0af4ab18296a16546a75be3
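For context, TTFT is simply the delay between sending a request and getting the first streamed token back. If you want to measure it yourself against any streaming chat endpoint, a minimal sketch looks like the following; the URL, model name, and key are all placeholders, and we’re assuming an OpenAI-style streaming (server-sent-events) response:

    import time
    import requests

    # Placeholders for whichever endpoint you want to measure; see the
    # quickstart linked below for our actual base URL and auth details.
    URL = "https://api.example.com/v1/chat/completions"
    HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}
    BODY = {
        "model": "meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": True,  # streaming is what makes TTFT observable client-side
    }

    start = time.monotonic()
    with requests.post(URL, headers=HEADERS, json=BODY, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:  # first non-empty SSE line ~= first token arriving
                print(f"TTFT: {time.monotonic() - start:.3f}s")
                break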

As a result of the optimizations we’ve rolled out so far, we’re releasing a few unique features on our API:

1. Rate limits: we don’t have any

Most other APIs out there have strict rate limits and can be rather unreliable. We don’t want APIs for open-source models to remain a prototypes-only solution. We want people to use these APIs the way they use OpenAI’s or Anthropic’s, and actually build production-grade products on top of open-source models.

2. Underserved models: we have them

There are a ton of models out there, but not all of them are readily available to people who don’t have access to GPUs. We envision our API becoming a system where anyone can launch any custom model of their choice with minimal cold starts and run it through a simple API call. Our cold starts for any 8B or 70B model are only 40s, and we’ll keep improving this.

Towards this goal, we already host models like `ai4bharat/hercule-hi` to support non-English language use cases and models like `Qwen/QwQ-32B-Preview` to support reasoning-based use cases. You can find the rest of the models we host at https://console.ncompass.tech/public-models (public models) and https://console.ncompass.tech/models (private models, available once you’ve created an account).

We’d love for you to try out our API by following the steps here: https://www.ncompass.tech/docs/llm_inference/quickstart. We provide $100 of free credit on sign-up to run models, and, like we said, go crazy with your requests; we’d love to see if you can break our system :)
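To give a flavour of what a first call can look like, here’s a minimal sketch using the OpenAI Python client pointed at an OpenAI-compatible chat completions endpoint. The base URL below is illustrative; defer to the quickstart above for the actual endpoint, auth flow, and supported parameters:

    from openai import OpenAI

    # Illustrative sketch: base_url is a placeholder; the quickstart linked
    # above has the real endpoint and auth details.
    client = OpenAI(
        base_url="https://api.ncompass.tech/v1",  # placeholder base URL
        api_key="YOUR_NCOMPASS_API_KEY",
    )

    # One of the hosted models mentioned above.
    response = client.chat.completions.create(
        model="Qwen/QwQ-32B-Preview",
        messages=[{"role": "user", "content": "Explain time-to-first-token in one sentence."}],
    )
    print(response.choices[0].message.content)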

We’re still actively building out features and optimizations and your input can help shape the future of nCompass. If you have thoughts on our platform or want us to host a specific model, let us know at hello@ncompass.tech.

Happy Hacking!

* It’s called "inference" because, in the AI / machine learning world, the process of taking a query, running it through the model, and producing a result is referred to as "inference". This is as opposed to "training" or "fine-tuning", which are the processes used to actually develop the AI models that you then run inference on.