I'm tasked with building a private AI assistant for a corpus of 10 million text documents (living in PostgreSQL). The goal is semantic search and chat, with a requirement for regular incremental updates.
I'm trying to decide between:
Bleeding edge: Implementing something like LightRAG or GraphRAG.
Proven stack: Standard Hybrid Search (Weaviate/Elastic + Reranking) orchestrated by tools like Dify.
For those who have built RAG at this scale:
What is your preferred stack for 2025?
Is the complexity of Graph/LightRAG worth it over standard chunking/retrieval for this volume?
How do you handle maintenance and updates efficiently?
Looking for architectural advice and war stories.
If it's < 100M vectors of size 1024, you could fit all of that in roughly 100G of memory (that assumes about a byte per dimension; plain float32 is closer to 400G). So maybe storing it in memory is an easy way to go about it. This ignores a lot of "database problems": if the docs are changing constantly, or you have other scalability concerns, you may be better off using a "proper" vector DB. There have been HN postings indicating that vector DB choice matters, so do your research there.
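For a rough sense of what that looks like, here is a back-of-the-envelope sketch plus a brute-force in-memory search in Python. It assumes float32, 1024-dimensional, L2-normalized embeddings; the array is a random stand-in, not anyone's actual corpus, and the sizes are illustrative only.

```python
import numpy as np

DIM = 1024
N_VECTORS = 10_000_000  # ~10M chunks; 100M would be roughly 10x these figures

# Memory footprint: float32 is 4 bytes per dimension.
bytes_f32 = N_VECTORS * DIM * 4
print(f"float32: {bytes_f32 / 2**30:.0f} GiB")        # ~38 GiB at 10M vectors
print(f"int8 quantized: {bytes_f32 / 4 / 2**30:.0f} GiB")  # ~10 GiB at 10M vectors

# Brute-force cosine similarity over unit-length embeddings (dot == cosine).
embeddings = np.random.rand(100_000, DIM).astype(np.float32)  # small stand-in corpus
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

def top_k(query: np.ndarray, k: int = 10) -> np.ndarray:
    """Return indices of the k most similar vectors, best first."""
    scores = embeddings @ query
    idx = np.argpartition(scores, -k)[-k:]      # unsorted top-k
    return idx[np.argsort(scores[idx])[::-1]]   # sorted by score, descending
```

As the comment says, the sizing is the easy part; it's constant document churn, durability, and filtering that push you toward a proper vector database.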
Ranked hierarchical pagination and intermediate context control. Also: is that text documents sitting in the database, or 10 million documents' worth of text data? If you OCR, why not cache the result? Imagine having to OCR and run an LLM at request time; I would do everything to avoid architecting a system like that.

Lucene whitespace tokenization is pretty good for dumb exact (or close-enough) matching to get a filtered result that fits the context window better. I'm not sure you're pointing the right end of the stick at the right problem: are you intending to max out your allowed context? What's going on here? You can usually extract a rough candidate set before you hit the LLM, so ideally you'd never exceed 50% of the context window (see the sketch below). How big do you expect the responses to be?

You have a lot of options; throw whatever is easiest to implement at the problem first and see what sticks. Make sure you have terminal access wherever you run this, for maximum flexibility. I obviously prefer ASP.NET with psql. What kind of data do you need indexed? Say you have something stupid like origins and destinations based on locations: now you need a geo index, maybe a zip-code database, and an intermediate step to calculate which assets fall within a radius, work out some distances, and make a decision (see the PostGIS sketch below). Adding geo to any problem is a nightmare, but fun, though only the first time; after that you know how to do it, but it takes so long you don't want to.

If you have terminal access and the source, you have enough room to maneuver on updates. It'll probably end up being a one-liner that runs an update, takes some time to rebuild your solution, and then slides it under the running app; I never experienced any problems with that. As for database schema changes, hold your production release until the rate of schema changes drops to something extreme, say under 5%, but be aware there can be schema changes that are hard to implement even later, and once you're in production it's much harder.
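On the "extract a rough set before you LLM" point, here is a minimal Postgres-flavoured sketch of the same idea the comment describes with Lucene: cheap lexical filtering that returns a small ranked candidate set before any LLM work. The documents table, column names, and the GIN-indexed tsvector column are assumptions for illustration, not something from the thread.

```python
import psycopg2

# Hypothetical schema: documents(id, title, body, body_tsv tsvector)
# with a GIN index on body_tsv.
conn = psycopg2.connect("dbname=corpus")

def rough_set(query: str, limit: int = 20) -> list[tuple[int, str]]:
    """Return a small ranked candidate set via full-text search,
    so the eventual prompt stays well under the context window."""
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, left(body, 2000)
            FROM documents
            WHERE body_tsv @@ plainto_tsquery('english', %s)
            ORDER BY ts_rank(body_tsv, plainto_tsquery('english', %s)) DESC
            LIMIT %s
            """,
            (query, query, limit),
        )
        return cur.fetchall()
```

Whether it's Lucene/Elastic or Postgres tsvector, the point is the same: trim the corpus to a handful of candidates cheaply, then spend the expensive tokens only on those.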
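And for the geo tangent, a hedged sketch of what the "assets within a radius" step tends to look like with PostGIS; the assets table and its geography column are hypothetical.

```python
# Hypothetical PostGIS setup: assets(id, name, geom geography(Point, 4326))
# with a GiST index on geom. ST_DWithin can use the index to filter by radius;
# ST_Distance on geography returns metres.
RADIUS_QUERY = """
    SELECT id, name,
           ST_Distance(geom, ST_MakePoint(%(lon)s, %(lat)s)::geography) AS metres
    FROM assets
    WHERE ST_DWithin(geom, ST_MakePoint(%(lon)s, %(lat)s)::geography, %(radius_m)s)
    ORDER BY metres
"""

def assets_within(cur, lon: float, lat: float, radius_m: float):
    """Return (id, name, distance in metres) for assets inside the radius."""
    cur.execute(RADIUS_QUERY, {"lon": lon, "lat": lat, "radius_m": radius_m})
    return cur.fetchall()
```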
Do you have an evaluation in place that actually necessitates the complex stuff? If not, I'd start simple with the proven stack and collect usage data to determine what's next.
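An "evaluation in place" doesn't have to be heavyweight. Something like the sketch below, with a small labelled set and a retrieve() stand-in for whichever stack is under test, is enough to compare hybrid search against GraphRAG-style approaches; the example pairs are made-up placeholders.

```python
# Minimal retrieval eval: labelled (question, expected doc id) pairs and hit-rate@k.
# `retrieve` is a stand-in for whatever stack is being tested and should return a
# ranked list of doc ids for a question.
from typing import Callable

LABELLED = [
    ("how do I reset my password", "doc_4812"),
    ("2023 expense policy for travel", "doc_901"),
    # ...a few dozen of these, ideally sampled from real user questions
]

def hit_rate_at_k(retrieve: Callable[[str, int], list[str]], k: int = 10) -> float:
    """Fraction of questions whose expected document appears in the top k results."""
    hits = sum(1 for question, expected in LABELLED if expected in retrieve(question, k))
    return hits / len(LABELLED)
```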