Now that everybody and their mother are buzzing on social media about LLM agents and agentic LLM systems (or something), are there actual examples of live applications that are based on an agentic LLM process flow?
I'd be curious to know and see such examples in order to derive some inspiration from them.
We've been using them to find novel vulnerabilities in open source web apps. The past 4 posts here have details:
- Auth bypass/arbitrary file read in Scoold: https://xbow.com/blog/xbow-scoold-vuln/
- SSRF in 2FAuth: https://xbow.com/blog/xbow-2fauth-ssrf/
- Stored XSS in 2FAuth: https://xbow.com/blog/xbow-2fauth-xss/
- Path traversal in Labs.AI EDDI: https://xbow.com/blog/xbow-eddi-path/
Each of those has an associated agent trace so you can go read exactly what the agent did to find and exploit the vulnerability.
An anecdote that maybe helps you:
I do contracting work; we're building a text-to-SQL automated business analyst. It's quite well-rounded: it tries to recover from errors, automatically creates appropriate visualisations, and has a generic "FAQ" component to help the user understand how to use the tool. The tool is available to some 10,000 B2B users.
It's just a bunch of prompts conditionally slapped together in a call graph.
The client needed AGENTIC AI, without specifying exactly what this meant. I spent two weeks pushing back, stating that if you replace the hardcoded call graph with something that has """free will""", accuracy and interpretability go down whilst runtimes go up... but no, we must have agents.
So I did nothing and called the current setup "constrained agentic AI". The result: high fives all around, everyone is happy.
Make of that what you will... ai agents are at least 90% hype.
The hype of Agentic AI is to LLMs what an MBA is to business. Overcomplicating something with language that is pretty common sense.
I've implemented countless LLM-based "agentic" workflows over the past year. They are simple: a series of prompts that maintain state toward a targeted output.
The common association with "a floating R2D2" is not helpful.
They are not magic.
The core elements I'm seeing so far are: the prompt(s), a capacity for passing in context, a structure defining how to move through the prompts, integration of that context into the prompts, bridging the non-deterministic -> deterministic divide, and callbacks for what to do next.
The closest analogy that I find helpful is lambda functions.
What makes them "feel" more complicated is the non-deterministic bits. But, in the end, it is text going in and text coming out.
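A minimal sketch of that "series of prompts that maintain state" idea. Everything here is a hypothetical stand-in: `call_llm` stubs out whatever chat-completion API you'd actually use, and the step prompts are made up.

```python
def call_llm(prompt: str) -> str:
    # Stub: a real implementation would send the prompt to a model
    # and return its completion. Here we just echo a marker.
    return f"<answer to: {prompt[:30]}>"

def run_pipeline(user_question: str) -> dict:
    # The "state" is just a dict that each prompt step reads from
    # and writes into -- text in, text out, carried forward.
    state = {"question": user_question}

    # Step 1: classify the request so later steps can branch on it.
    state["intent"] = call_llm(f"Classify this request as 'sql' or 'faq': {state['question']}")

    # Step 2: generate the main output, conditioned on accumulated state.
    state["draft"] = call_llm(f"Intent: {state['intent']}\nAnswer: {state['question']}")

    # Step 3: a cleanup pass that bridges free-form text toward a fixed format.
    state["final"] = call_llm(f"Rewrite as one plain sentence: {state['draft']}")
    return state
```

Nothing here is magic: the "agentic" part is just the conditional order in which these calls fire and the state they share.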
Sounds awesome. :D For real, the anecdote is hilarious and I find it easy to believe but also sounds cool what you are working on.
Well, you work in the field for a while, and you accumulate anecdotes of colleagues dropping tactical sleep(5000)'s so they can shave some milliseconds off latency each week and keep the boss happy.
I love those stories but I could never do that with a straight face. However, the AI field is such an uphill battle against all the crap that LinkedIn influencers are pushing into the minds of the C-suite... I feel it's okay to get a bit creative to get a win-win here ;)
I've been doing a lot of work on semantic data architecture that better supports LLM analytics. Did you use any framework or methodology to decide how exactly to present the data/metadata in the LLM context to allow it to make decisions?
Love that. Reminds me of a time I was asked to build a "machine learning algorithm" driven recommendation system... and eventually I realized that delivering a recommendation system based on one big BM25 search query was fine, and the people asking for it to use "machine learning" didn't actually understand or care about the difference.
Haha yes, the LLM era is "data science is the hottest new job" all over again.
I guess everything with an algorithm in it is AI if you look at it from enough of a distance...
Am I the only one who finds these types of comments arrogant? I mean, we get it, you know better and have been doing this for a long time and so forth... Sometimes I feel like it's just about relativizing whatever tech is popular right now, only to come back two years later and say "oh well, I was telling people about this cool tech two years ago!"
Give a counter example then. I’ve been doing this for years: people want the hot new thing even if it’s the worst idea, you rebrand it, and everyone is happy. Then a few months later, people praise you for not having implemented that bad idea.
If you are looking for LLM agents that go off and do a bunch of work on their own, you will be supremely underwhelmed. Anyone who went straight to building agents without a human in some large loop found that they were trying to make the LLM do things it was extremely bad at.
The right approach to building toward agents is to start with something that gives pretty good responses to prompts, then build up an agentic mode that lets it do more and more in response to each prompt. Think of it as extending how much you get per prompt, by chaining together components you've already worked at making good.
Cursor (the LLM-powered VS Code fork) has an agentic mode, and they are doing this the right way. The normal chat window is good at producing changes to your code, applying them, looking at lints, suggesting terminal commands, and doing directory listings or RAG on your codebase. Agentic mode ties those together to do more of the work you want with fewer prompts from you.
As a side note, while I know of several language model based systems that have been deployed in companies, some companies don't want to talk about it:
1. It's still perceived as a matter of competitive advantage
2. There is a serious concern about backlash. The public's response to finding out that companies have used AI has often not been good (or even reasonable) -- particularly if there was worker replacement related to it.
It's a bit more complicated with "agents" as there are 4 or 5 competing definitions for what that actually means. No one is really sure what an 'agentic' system is right now.
There is a very simple and obvious definition: it's agentic if it uses tool calls to accomplish a task.
This is the only one that makes sense. People want to conflate it with their random vague conceptions of AGI or ASI or make some kind of vague requirement for a certain level of autonomy, but that doesn't make sense.
An agent is an agent and an autonomous agent is an autonomous agent, but a fully autonomous agent is a fully autonomous agent. An AGI is an AGI but an ASI is an ASI.
Somehow using words and qualifiers to mean different specific things is controversial.
The only thing I will say to complicate it, though: if you have a workflow and none of the steps give the system an option to select from more than one tool call, then that should be called an LLM workflow and not an agent, because you've removed the agency by not giving it more than one possible action to select from.
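The workflow-vs-agent distinction is small enough to sketch in a few lines. Everything here is a hypothetical stand-in: the two tools are toy lambdas and `pick_tool` stubs out the LLM's tool-choice call.

```python
# Hypothetical tools the system can use.
TOOLS = {
    "search_docs": lambda q: f"docs for {q}",
    "run_sql":     lambda q: f"rows for {q}",
}

def pick_tool(task: str, history: list) -> str:
    # Stand-in for an LLM tool-choice call; a real model would return
    # a tool name (or "done") based on the task and prior results.
    return "run_sql" if not history else "done"

# Workflow: the call sequence is fixed in code -- no agency.
def workflow(task: str) -> str:
    found = TOOLS["search_docs"](task)
    return TOOLS["run_sql"](found)

# Agent: the model chooses which tool to call next, in a loop.
def agent(task: str) -> str:
    history = []
    while (choice := pick_tool(task, history)) != "done":
        history.append(TOOLS[choice](task))
    return history[-1]
```

The only structural difference is who decides the next step: the code (workflow) or the model (agent).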
This has been my experience. Lots of companies are implementing LLMs but are not advertising it. There's virtually no upside to being public about it.
With all the agencies and YouTube demos of n8n and Make.com out there, agents should be everywhere.
I look at my workplace and I see places where they might fit in, but if the reliability isn't 99.5%, they won't be trusted, and I think that's a problem.
I made a toy in n8n that collects transactions in YNAB via API and matches them to Amazon orders in GMail. It then uses GPT-4o with vision to categorize the product pictures according to my budget’s categories but I have to add the order link to the transaction memo and add a flag for human review because it’s only 80% or so. It has sped up the workflow for sure but nowhere near good enough to set it and forget it.
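For what it's worth, the categorize-and-flag step in a setup like that can be sketched roughly as below. The vision call is stubbed out (`categorize_image` stands in for a GPT-4o-with-vision request), and the category names are made up.

```python
# Hypothetical budget categories.
CATEGORIES = ["Groceries", "Electronics", "Household", "Uncategorized"]

def categorize_image(image_url: str) -> str:
    # Stub for a vision-model call that maps a product photo to a
    # budget category; a real version would send image_url to GPT-4o.
    return "Household"

def process_transaction(txn: dict, order_link: str) -> dict:
    category = categorize_image(order_link)
    if category not in CATEGORIES:
        # Guard against the model inventing a category.
        category = "Uncategorized"
    # At ~80% accuracy, every result still gets flagged for human review.
    return {**txn, "memo": order_link, "category": category, "needs_review": True}
```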
Interesting! To me an 80% hit rate actually sounds pretty good, and awesome if it actually improves productivity, though understandably not something that could be left to its own devices.
I had no idea about Make.com or n8n, they seem interesting. Thanks for the tip! Will check them out.
We built several LLM-powered applications that collectively served thousands of users. The biggest challenge we faced was ensuring reliability: making sure the workflows were robust enough to handle edge cases and deliver consistent results.
In practice, achieving this reliability meant repeatedly:
1. Breaking down complex goals into simpler steps: composing prompts, tool calls, parsing steps, and branching logic.
2. Debugging failures: identifying which part of the workflow broke and why.
3. Measuring performance: assessing changes against real metrics to confirm actual improvement.
We tried some existing observability tools and agent frameworks, and they fell short on at least one of these three dimensions. So we built our own: https://github.com/PySpur-Dev/PySpur
1. Graph-based interface: we can lay out an LLM workflow as a node graph. A node can be an LLM call, a function call, a parsing step, or any logic component. The visual structure provides an instant overview, making complex workflows more intuitive.
2. Integrated debugging: when something fails, we can pinpoint the problematic node, tweak it, and re-run it on test cases right in the UI.
3. Node-level evaluation: we can assess how node changes affect performance downstream.
We hope it's useful for other LLM developers out there.
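The node-graph idea itself is easy to sketch: a workflow as a dict of named nodes, each a function of its upstream nodes' outputs, so a failing node can be isolated and re-run on its own. This is a generic sketch of the pattern, not PySpur's actual API.

```python
def run_graph(nodes: dict, order: list) -> dict:
    # nodes: name -> (list of dependency names, function over their results)
    # order: a topological sort of the graph, so deps run before dependents.
    results = {}
    for name in order:
        deps, fn = nodes[name]
        results[name] = fn(*(results[d] for d in deps))
    return results

# Toy three-node workflow: fetch -> LLM call (stubbed) -> parse.
nodes = {
    "fetch": ([], lambda: "raw text"),
    "llm":   (["fetch"], lambda text: f"summary of {text}"),
    "parse": (["llm"], lambda out: out.upper()),
}
result = run_graph(nodes, ["fetch", "llm", "parse"])
```

Because every intermediate result is kept per node, "pinpoint the problematic node and re-run it" amounts to swapping one entry in `nodes` and replaying from there.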
If we're going to have a conversation about agents or agentic it is really important we agree on which definition of those terms we are using for the purpose of this conversation.
If you ask two different people in the AI space to define "agent" you almost always get two slightly (or significantly) different definitions!
Here are just some of the definitions I've seen over time: https://news.ycombinator.com/item?id=42216217#42228364
For the purpose of this thread the most cynical definition, "LLMs that do something useful", might actually be the best fit!
The way I look at agentic systems is that there are tools an LLM can call out to and do work with.
Last Wednesday I participated in Anthropic's Model Context Protocol hackathon, and with my teammate Zia built a system that automatically searches for and finds restaurants matching your dietary preferences and group size.
It also automatically downloads social media of the restaurant to get a vibe for the place.
There's a video of it in action here: https://www.youtube.com/watch?v=c6vGrfHFyu8
And a Github repo here: https://github.com/zia-r/gotta-eat
I know of many, many LLM systems in production, since that's what I've been helping companies build since the start of the year. Mostly it's pretty rote automation work, but the cost savings are incredible.
Agentic workflows are a much higher bar and are just barely starting to work. I can't speak to their efficacy, but here are a few of the starter-level agents I've started seeing some companies adopt:
Cost saving as in...? Hopefully not saving through making human employees redundant.
We use LLM agents to do proofreading and editing of transcripts after they are edited by people. They are good at applying our customers' specific requirements (e.g. capitalization, formatting, etc.) without our folks having to worry about any of that. We use https://transcriberai.com or https://otter.ai/ (there are a bunch) to create the first transcript for our transcriptionists.
We're running an agentic LLM system in production that generates marketing strategy. As of now, we're up to about 60 agents and as we add functionality we'll add more. And yes, it's not easy to get them to stay on track and cooperate. https://www.goguma.io
You'd probably have to define agents first. What large / mega caps call agents is LLM + RAG + API Calls to read data and trigger jobs. And there are plenty of those online
Yes, that's how it seems to me also: often a RAG setup or similar is branded as an "agent". Though I personally understand an LLM agent as something that takes input x for LLM inference, then uses the output from that inference to build a new input for another LLM inference that includes the first output, and so on, repeating this more than once.
That's an LLM workflow and not an agent if it's on rails created by a predefined workflow and either doesn't make tool calls or has no choice in which tools to call. The tool calls are what give it agency.
Yeah. An agentic workflow is nothing but the execution of a bunch of tasks, where each task takes a little bit of help from the LLM. Honestly, I believe this is applicable to companies whose workflows have a lot of manual tasks; automating those workflows could be easier with the help of LLM agents.
The term "agent" is quite broad. In my definition, an LLM becomes an agent when it utilizes the tool usage option.
ChatGPT is a good example: you ask for an image, and you receive one; you ask for a web search, and the chatbot provides an answer based on that search.
In both cases, the chatbot has the ability to rewrite your query for that tool and is even able to call the tools multiple times based on the previous result.
I asked a similar question a few months ago: https://news.ycombinator.com/item?id=39886178
It seems the community has gotten more negative about agentic approaches since then, and it wasn’t pretty then.
We have a couple of systems at work that incorporate LLMs. There are a bunch of RAG chatbots for large documentation collections and a bunch of extract-info-from-email bots. I would call none of these an agent. The one thing that comes close to an agent is a bot that can query a few different SQL and API data sources. Given a user's text query, it decides on its own which tool(s) to use. It can also retry, or re-formulate its task. The agentic parts are mainly done in LangGraph.
Yesterday I recorded an example of an O'Reilly auto parts customer service agent to show how users can invoke them using RAG; it's the last part of this video: https://youtu.be/Qk_pVHtgcyA
There are plenty of RAG-capable LLMs in production, but still few products/UX oriented toward agentic work.
An AI product that can make purchases and API requests to external services (delivery drivers, calendars, etc.) is still needed to truly enable these "agents", which right now are basically read-only, domain-specific LLMs.
IMO "Agents" are a marketing term, they are simply software that use LLMs somewhere in the backend. Often daisy chained into a series of operations that may involve additional LLM calls or calls to other internal/external services.
One we've been using for meeting notes + action items works quite well https://fireflies.ai
no "Agents" have a specific technical meaning.. an engine is connected to tools.. simple example is a bash terminal environment.
Where did you get the idea that "LLM connected to tools" is the one true meaning of the term agents?
(I'm not trying to be accusatory here, just trying to understand how these opinions spread.)
It's the same definition I've converged on: agents are LLMs with agency, i.e. they can effect actions (by being connected to tools, APIs, etc.). I break the offerings into two groups: 1. those that are pre-built for specific tasks/areas (or can be no/low-code scripted using pre-built bits), and 2. those that can be custom-built to connect to your specific APIs/actions.
Another version of agents is described in the paper "Agent Hospital: A Simulacrum of Hospital with Evolvable Medical Agents"[0], where instead of one prompted LLM context you create many that specialize in different areas and have them communicate to distribute a task. The effect is that the outcome is higher quality than trying to use a single automated LLM.
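That multi-specialist pattern can be sketched minimally like this; the specialists here are stubbed functions standing in for separately-prompted LLM contexts, and the names and outputs are invented for illustration.

```python
# Each "specialist" stands in for an LLM with its own system prompt and
# its own isolated context, covering one area of expertise.
SPECIALISTS = {
    "triage":   lambda case: "likely flu",
    "pharmacy": lambda case: "rest and fluids",
}

def coordinator(case: str) -> str:
    # Each specialist sees only the case, not the others' contexts;
    # the coordinator collects and merges their short reports.
    reports = {name: fn(case) for name, fn in SPECIALISTS.items()}
    return "; ".join(f"{name}: {r}" for name, r in reports.items())
```

The claimed quality gain comes from each context staying small and focused, with only the distilled reports crossing between them.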
Dawn Song's final lecture of the LLM class at UC Berkeley, two weeks ago.
Devin is closest for me. I’ve had it implement additional language locales and add dark mode to our UI.
Support bots. Scrapers. Personal assistants. Search startups like perplexity. Scammer bots. Bots that spread political agenda. "AI" memecoins.
When does an LLM customer support bot that is based, for example, on a RAG architecture become an LLM agent?
My take is that if the LLM outputs text for humans to read, that's not an agent. If it's making API calls and doing things with the results, that's an agent. But given the way "AI" has stretched to become the new "radium" [1], I'm sure "agent" will shortly become almost meaningless.
The definition of agent is blurry. I prefer to avoid that term because it does not mean anything in particular. These are implemented as chat completion API calls + parsing + interpretation.
As soon as we admit to ourselves that "agent" is just another word for context isolation among coordinated LLM tasks.
Will agents still matter once models do a better job paying complete attention to large contexts?
They're ALL bullshit and there's a technical reason why.
Your Rube Goldberg contraption that you put together for your borderline-fraudulent pitch deck is NOT an assembly line, nor is it a product anyone's gonna buy. Why?
Because cosine similarity search mathematically sucks a*, and large context windows, while better, are nowhere close to being fast and practical (maybe with a small exception for the generic-sounding 1M-context summaries you now get from Gemini Flash 2.0 exp). You probably don't have any kind of CI/CD setup; no testing at all, zero; no benchmarking of accuracy; you probably can't even get lm_eval installed in the first place, so no troubleshooting methodology; no formal iteration pipeline; you're not putting out a new model every 2 weeks and iterating on it. And YOU at this point probably can't find your own way to your own fkin toilet seat without Cursor's GPS showing you where it is and then writing a whole factory just to open the toilet seat.
You look at the YouTube demos and it's just more investor slop to be sold to other sloppy investors. I even asked on uncle Elon's twitter if anyone had a demo of agents doing anything in real life, and after a quarter-million views the only things that worked AT ALL were spambots and Pliny's agent making a sh*tcoin. https://x.com/nisten/status/1808522547169763448
People cook something at home and immediately get delusional, thinking they now have an assembly line that's just going to print money... Have you ever actually looked at an industrial pasta-maker machine? Do YOU have the skills to make that? I'm sorry, but no amount of shrooms and microdosed-meth pills is gonna get you that.
Agents do not exist yet, they will sooner or later, but right now they're a concept more along the lines of scammy ledger-backed dbs.
You can always prove me wrong with a real-life demonstration of an automated tool doing a complex number of steps that you'd normally expect an average-ish worker to do for you, on a RELIABLE basis. I.e. doing your taxes like your accountant, or your 10-year-old, hopefully does.
"Agents do not exist yet, they will sooner or later"
Which definition of agents are you using there?
Top notch monday morning rant!
Not sure why that was downvoted TBH, seems accurate to me.
Windsurf IDE from Codeium. Still some rough edges, but they’ve beat the Claude UI and Cursor for coding. Their code search is also next-level. Crazy efficiency gains for me for small-to-medium sized projects. Apparently, they have a ton of enterprise customers and are doing fast iteration loops relative to user signals (e.g. accepting diffs).
Agentic workflows are great only for demos without real business cases. Each agent can hallucinate and will pass that hallucination on to the next agent. In the end, you have just garbage. But... it's better to stay silent; we still need to inflate this bubble.
There are workflows where the outputs are "narratives": customer support is one example, summarization of text is another. A characteristic of these use cases is that there is no one right answer. In these use cases agents fit in well.
The issue, however, is that the agents cannot be chained; chaining requires deterministic outputs, not narratives.
>characteristic of these use cases is that there is no one right answer
I think what you mean is that they work best in cases where it's very hard to measure how well they are working.
And where it's also hard to tell who is doing the work! I'm reminded here of psychics and cold readers. They can easily convince people that they have great mental powers by outputting ambiguous text and letting the consumers of it do most of the work. You'll see similar effects with Myers-Briggs tests and other sorts of business astrology: some people feel like they get a lot of value out of them, but rigorous tests don't back that up.
LLM agents: not so much.
Actual, real intelligent autonomous agents? Go to Mars and kick a rover... there's one. Go try front-running on the markets; you'll meet about 6,000 other ones trying to outrun you.