Hey everyone! Eric from Firecrawl (YC S22) here. I wanted to share some takeaways from building AI chat products for enterprises at Mendable!

A year ago, Generative AI-powered search seemed like the perfect quick win for companies taking their first steps with AI.

After building AI search for companies like MongoDB, Coinbase, and Snap at Mendable, we learned the nuances that make the difference between a demo and a production-ready system that actually drives value.

Before we came in, almost every company had a few engineers building prototypes, but most of those fell short on contact with real users.

It all comes back to a simple truth: the system is only as good as the data going in.

As you scale, here’s why the simple approaches break:

- Context crowding: Correct context for a given query gets crowded out by bad context. Take the Snap AR docs, for example: their developer documentation site covers four different products, and each one has its own getting started page. Ask a basic RAG chatbot a vague query like "how do I get started" and the answer will most likely be an incorrect amalgamation of the four guides.
- Outdated data: Information and processes constantly change, and documentation is not always maintained. One of our first customers, Spectro Cloud, was benchmarking our chatbot before going into prod and found that one answer in particular was not correct. At first we thought the model was hallucinating, but after manually searching the docs we found the outdated source info in an obscure part of the documentation.
- Data cleanliness: If data isn't clean, performance worsens and costs soar. We powered the chatbot on the LangChain documentation, and data cleanliness, specifically prompt injection, was a huge issue. Many of the LangChain pages had prompt examples embedded in them, which confused the model. We also noticed a lot of unnecessary extra content in our index, like the navigation menus that appear on every page (see the cleaning sketch after this list).
- Data access: Accessing a variety of data sources is often critical for companies, but it introduces a host of challenges. For example, at Firecrawl, we've seen that many large companies simply want to access web data from their own websites, but even this can involve complex permissioning, authentication, and data-fetching hurdles.
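To make the cleanliness problem concrete, here's a minimal pre-index cleaning sketch, assuming pages arrive as raw HTML and using BeautifulSoup. The tag list and the `main` fallback are generic assumptions, not a universal recipe; every site's markup needs its own handling.

```python
from bs4 import BeautifulSoup

# Hypothetical list of "page chrome" tags; adjust to your site's markup.
BOILERPLATE_TAGS = ["nav", "header", "footer", "aside", "script", "style"]

def clean_page(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Drop navigation menus and other chrome that would otherwise be
    # embedded on every page and crowd the index with near-duplicates.
    for tag in soup.find_all(BOILERPLATE_TAGS):
        tag.decompose()
    # Keep just the readable text of the main content area.
    main = soup.find("main") or soup.body or soup
    return main.get_text(separator="\n", strip=True)
```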

To mitigate these issues, companies building these apps should have a data strategy with the goal of curating and maintaining quality data. Here are some practical suggestions to guide your strategy:

- Metadata Management: Good metadata is your first defense against context crowding. Every piece of content should be tagged with essential details like product association, who created it, and who can access it. This enables advanced filtering and more accurate responses (see the retrieval sketch after this list).
- Data Maintenance: To keep data fresh and reliable, the teams that create content should be responsible for regular reviews and updates. When underlying information changes, the corresponding documentation needs to change with it.
- Data Sanitation: Raw data rarely arrives in ideal form. Before ingestion, strip away unnecessary formatting and information while preserving the essential details. While each content source requires different handling, tools like Unstructured can help standardize this process.
- Data Access & Integration: Build the infrastructure to access your data sources seamlessly. You'll need continuous data flow from knowledge bases, ticketing systems, websites, and more. Tools like Firecrawl can help build these pipelines and ensure high-quality data ingestion (a minimal example follows the retrieval sketch below).
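Here's a minimal sketch of what tagged ingestion plus filtered retrieval can look like, assuming chromadb as the vector store. The collection name, metadata fields, and the two Snap product values are hypothetical and purely illustrative.

```python
import chromadb

client = chromadb.Client()
collection = client.create_collection("docs")

# Tag every chunk with product, owner, and access level at ingestion time.
collection.add(
    ids=["lens-1", "camera-1"],
    documents=[
        "Getting started with Lens Studio ...",
        "Getting started with Camera Kit ...",
    ],
    metadatas=[
        {"product": "lens-studio", "owner": "ar-docs-team", "access": "public"},
        {"product": "camera-kit", "owner": "ar-docs-team", "access": "public"},
    ],
)

# At query time, a router (or a product picker in the UI) narrows the search,
# so a vague "how do I get started" only ever sees one product's guide.
results = collection.query(
    query_texts=["how do I get started"],
    n_results=3,
    where={"product": "camera-kit"},
)
```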
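And on the access side, a minimal ingestion sketch using Firecrawl's Python SDK. Method names and parameters have changed across SDK versions, so treat this as illustrative and check the current docs; the URL and API key are placeholders.

```python
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-YOUR-KEY")  # placeholder key

# Fetch one docs page as clean markdown, ready to chunk, tag, and embed.
page = app.scrape_url(
    "https://docs.example.com/getting-started",  # placeholder URL
    params={"formats": ["markdown"]},
)
markdown = page["markdown"]
# ...chunk, attach product/owner metadata, embed, upsert into your store...
```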

Startups like Glean, Unstructured, and our own Firecrawl have made some incredible progress on these problems, but no one has solved it all. No matter what tools emerge to make the process easier, having a robust data strategy is foundational to building production-ready Generative AI apps.

Thanks for reading and best of luck building AI apps!