Tanstack Start | Tell HN: I cut Claude API costs from $70/month to pennies

Tell HN: I cut Claude API costs from $70/month to pennies(undefined)

37 points by ok_orco a day ago | 21 comments

The first time I pulled usage costs after running Chatter.Plus - a tool I'm building that aggregates community feedback from Discord/GitHub/forums - for a day hours, I saw $2.30. Did the math. $70/month. $840/year. For one instance. Felt sick.

I'd done napkin math beforehand, so I knew it was probably a bug, but still. Turns out it was only partially a bug. The rest was me needing to rethink how I built this thing. Spent the next couple days ripping it apart. Making tweaks, testing with live data, checking results, trying again. What I found was I was sending API requests too often and not optimizing what I was sending and receiving.

Here's what moved the needle, roughly big to small (besides that bug that was costin me a buck a day alone):

- Dropped Claude Sonnet entirely - tested both models on the same data, Haiku actually performed better at a third of the cost

- Started batching everything - hourly calls were a money fire

- Filter before the AI - "lol" and "thanks" are a lot of online chatter. I was paying AI to tell me that's not feedback. That said, I still process agreements like "+1" and "me too."

- Shorter outputs - "H/M/L" instead of "high/medium/low", 40-char title recommendation

- Strip code snippets before processing - just reiterating the issue and bloating the call

End of the week: pennies a day. Same quality.

I'm not building a VC-backed app that can run at a loss for years. I'm unemployed, trying to build something that might also pay rent. The math has to work from day one.

The upside: these savings let me 3x my pricing tier limits and add intermittent quality checks. Headroom I wouldn't have had otherwise.

Happy to answer questions.

LTL_FTC a day ago
It sounds like you don’t need immediate llm responses and can batch process your data nightly? Have you considered running a local llm? May not need to pay for api calls. Today’s local models are quite good. I started off with cpu and even that was fine for my pipelines.
- ok_orco 14 hours ago |parent
  I haven't thought about that, but really want to dig in more now. Any places you recommend starting?
- kreetx a day ago |parent
  Though haven't done any extensive testing then I personally could easily get by with current local models. The only reason I don't is that the hosted ones all have free tiers.
- ydu1a2fovb 19 hours ago |parent
  Can you suggest any good llms for cpu?
  - R_D_Olivaw 17 hours ago |parent
    Following.
- queenkjuul a day ago |parent
  Agreed, I'm pretty amazed at what I'm able to do locally just with an AMD 6700XT and 32GB of RAM. It's slow, but if you've got all night...
toxic72 2 hours ago
consider this for addtl cost savings if local doesnt interest you - https://docs.cloud.google.com/vertex-ai/generative-ai/docs/m...
44za12 a day ago
This is the way. I actually mapped out the decision tree for this exact process and more here:
https://github.com/NehmeAILabs/llm-sanity-checks
- homeonthemtn 20 hours ago |parent
  That's interesting. Is there any kind of mapping to these respective models somewhere?
  - 44za12 19 hours ago |parent
    Yes, I included a 'Model Selection Cheat Sheet' in the README (scroll down a bit).
    I map them by task type:
    Tiny (<3B): Gemma 3 1B (could try 4B as well), Phi-4-mini (Good for classification). Small (8B-17B): Qwen 3 8B, Llama 4 Scout (Good for RAG/Extraction). Frontier: GPT-5, Llama 4 Maverick, GLM, Kimi
    Is that what you meant?
gandalfar a day ago
Consider using z.ai as model provider to further lower your costs.
- tehlike a day ago |parent
  This is what i was going to suggest too.
- viraptor a day ago |parent
  Or minimax - m2.1 release didn't make a big splash in the news, but it's really capable.
- ok_orco 14 hours ago |parent
  Will take a look!
- DANmode a day ago |parent
  Do they or any other providers offer any improvements on the often-chronicled variability of quality/effort from the major two services e.g. during peak hours?
deepsummer a day ago
As much as I like the Claude models, they are expensive. I wouldn't use them to process large volumes of data. Gemini 2.5 Flash-Lite is $0.10 per million tokens. Grok 4.1 Fast is really good and only $0.20. They will work just as well for most simple tasks.
joshribakoff a day ago
Have you looked into https://maartengr.github.io/BERTopic/index.html ?
dezgeg a day ago
Are you also adding the proper prompt cache control attributes? I think Anthropic API still doesn't do it automatically
arthurcolle a day ago
Can you discuss a bit more of the architecture?
- ok_orco a day ago |parent
  Pretty straightforward. Sources dump into a queue throughout the day, regex filters the obvious junk ("lol", "thanks", bot messages never hit the LLM), then everything gets batched overnight through Anthropic's Batch API for classification. Feedback gets clustered against existing pain points or creates new ones.
  Most of the cost savings came from not sending stuff to the LLM that didn't need to go there, plus the batch API is half the price of real-time calls.
DeathArrow a day ago
You also can try to use cheaper models like GLM, Deepseek, Qwen,at least partially.