Enterprise AI Agents: Bringing LLMs to Enterprise Data
When we cut through the hype surrounding generative AI, one thing is clear: every generative AI company wants to be an “Agent AI” company, and every “Agent AI” company desperately wants to be an “Enterprise AI” company. Simply put, the enterprise is - and always has been - the most reliable and durable place to build a long-lasting company. The enterprise is where the sustainable revenues are, even though it is much harder to crack. Recent shotgun marriages between massively funded generative AI startups and FAANG companies, aimed at sharpening their focus on enterprise value, bear this out.
At the outset of building Flip AI, we had a deep conviction to focus on the enterprise, and a belief that the most meaningful use cases for generative AI would be found there. The tasks and processes of the software development lifecycle (“SDLC”) are a clear and compelling source of candidate use cases where generative AI can and will make inroads. That is largely because the underlying user persona - the enterprise developer - understands the problem at hand deeply and is willing to adopt new tools. The current market for generative AI powered SDLC tools is flooded with AI code assistants and agents. Several are seeing impressive levels of adoption, but the vast majority (if not all) depend on third-party, API-based foundation models. These have dubious data privacy and retention policies, leaving you wondering whether your data is being used to train and improve models for others. Hype, curiosity, and “fear of missing out” (“FOMO”) have resulted in widespread trials of these technologies, but almost always in isolated or non-critical environments (i.e., dev/test/staging or enterprise “labs”).
Bringing LLMs to enterprise data - rather than sending enterprise data out to a third-party model - is the real path forward for driving enterprise adoption of generative AI. Code copilots were always going to be the low-hanging fruit for many startups, given the abundance of training data. The bigger pain for enterprise developers - and therefore the bigger opportunity - is and always will be fighting fires in production: remediating systems back to health, preventing catastrophic events, and generally keeping production software available for your customers.
“Agentic AI systems can be described as AI systems that solve multi-step, goal-oriented workflows by taking actions based on reasoning, while interacting with external knowledge and data sources (CRM, Internet, Observability tools, Code Repository) and tools.”
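To make that definition concrete, here is a minimal sketch of the reason-act-observe loop it describes. Everything in it is hypothetical and for illustration only: the tool names (`query_metrics`, `search_logs`), the `checkout` service, and the fixed plan inside `choose_action`, which stands in for the reasoning an LLM would do. It is not a description of how Flip or any particular framework implements agents.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


# Hypothetical tools standing in for external data sources (observability, logs).
def query_metrics(service: str) -> str:
    return f"{service}: p99 latency rose from 120ms to 900ms"


def search_logs(service: str) -> str:
    return f"{service}: connection pool exhausted"


TOOLS: Dict[str, Callable[[str], str]] = {
    "query_metrics": query_metrics,
    "search_logs": search_logs,
}


@dataclass
class AgentState:
    goal: str
    observations: List[str] = field(default_factory=list)


def choose_action(state: AgentState) -> Optional[Tuple[str, str]]:
    """Stand-in for the LLM reasoning step: pick the next tool call, or stop.

    A real agent would prompt a model with the goal and the observations so
    far. This sketch follows a fixed two-step plan to stay deterministic.
    """
    plan = ["query_metrics", "search_logs"]
    step = len(state.observations)
    if step >= len(plan):
        return None  # the plan is exhausted, so the agent stops
    return plan[step], "checkout"


def run_agent(goal: str, max_steps: int = 5) -> AgentState:
    """Loop: reason about the next action, act through a tool, record the result."""
    state = AgentState(goal=goal)
    for _ in range(max_steps):
        action = choose_action(state)
        if action is None:
            break
        tool_name, argument = action
        state.observations.append(TOOLS[tool_name](argument))
    return state


state = run_agent("Why is checkout slow?")
for observation in state.observations:
    print(observation)
```

The `max_steps` cap is a deliberate design choice: an agent that decides its own next step needs a hard bound so that a bad reasoning step cannot loop forever.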
We define Enterprise AI Agents as having the following capabilities:
- AI that works on data internal to the enterprise
- Customizable within your organization
- No risk of data leakage or security exposure
- Integrates with your toolchain

What is the state of Agentic AI in software development?
Broadly speaking, any skill-based reasoning task performed by knowledge workers can be solved by agentic workflows, but let’s focus on software development. Coding is where most of the startup innovation in this space seems to be concentrated. Keeping in mind that the space has also been plagued by rumors and accusations of faked demos and vaporware, our conversations with several early adopters of code copilots yield the following observations:
- Despite steadily improving benchmarks, no LLM has pulled far enough ahead to be widely considered the clear leader, and the accuracy gap across the top 5 LLMs is very small.
- Copilots do well on isolated coding tasks, but navigating large codebases and executing complex, multi-step tasks is not their strength.

Despite the hype, why haven’t AI agents really cracked the code yet?
Every generative AI platform and LLM provider has stepped up its investment in agentic workflows by developing its own agent orchestration frameworks and data connectors, but by and large, agentic workflows remain repackaged prompt chains from 2023. Agent workflows have largely been assumed to be homogeneous, i.e., simple, uniform, and lacking in task diversity, but the reality could not be further from this faulty assumption. To solve real-world enterprise problems, agentic workflows must mimic collaboration among the many siloed areas of expertise across the enterprise. The popular strategies in the research community for improving agent success rates are some combination of the following:
- Fine-grained task decomposition
- Use of more LLM calls for consensus finding
- Guardrails for LLM outputs
- Larger LLMs or use of multiple LLMs
Most of these options have significant cost implications and are therefore impractical for most teams. We’ve seen companies respond to this reality with a disingenuous marketing hack: simply rebranding everything as “Agent AI,” formerly known as Generative AI, formerly known as ML, and not too long ago, “Big Data.” Regardless of which marketing term you choose, the key promise has always been the same: eliminate data silos so enterprises can reach the “promised land” of seamlessly finding timely and accurate insights in their data. Sadly, this promise continues to go largely unfulfilled, and building “agents” that pass still-siloed data into third-party, cloud-hosted LLMs is not going to fulfill it. For agentic workflows to be truly useful, you need agents that are experts in a specific domain within the enterprise, and you need to bring those agents to your data, not the other way around.
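To see where the cost comes from, consider the consensus-finding strategy from the list above. The sketch below is a minimal illustration under stated assumptions: `ask` is a hypothetical stand-in for a real model call, the canned answers are invented, and the token count and price passed to `estimated_cost` are placeholder numbers, not any provider's actual pricing. The point is only that the bill scales linearly with the number of calls made per decision.

```python
import itertools
from collections import Counter
from typing import Callable, Tuple


def consensus_answer(
    ask: Callable[[str], str], prompt: str, n_calls: int
) -> Tuple[str, float]:
    """Ask the same question n_calls times and return the majority answer.

    Also returns the share of calls that agreed with the winning answer.
    """
    if n_calls < 1:
        raise ValueError("n_calls must be at least 1")
    answers = [ask(prompt) for _ in range(n_calls)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_calls


def estimated_cost(
    n_calls: int, tokens_per_call: int, price_per_1k_tokens: float
) -> float:
    """Cost grows linearly with the number of calls used to reach consensus."""
    return n_calls * tokens_per_call / 1000 * price_per_1k_tokens


# Hypothetical model stub that cycles through canned answers.
canned = itertools.cycle(["db timeout", "db timeout", "bad deploy"])
answer, agreement = consensus_answer(
    lambda prompt: next(canned), "Why did checkout fail?", n_calls=5
)
print(answer, agreement)

# Placeholder numbers: five calls cost five times as much as one.
print(estimated_cost(1, 2000, 0.01), estimated_cost(5, 2000, 0.01))
```

An agent that makes many decisions per workflow, each backed by several calls, multiplies this again, which is why these strategies get expensive quickly.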
Circling back to software engineering tasks: while nobody has come close to cracking “end-to-end” software engineering agents, “AI developer” coding copilots have indeed added value for software developers. Beyond coding, though, the most consistent piece of customer commentary we hear is that generative AI can meaningfully aid developers in non-coding tasks - documentation, bug fixing, and most importantly, observability.

Why observability agents? Well, observability is a pain in the a$$
Interpret “a$$” how you’d like. For the enterprise developer, it’s quite literal: observability, incident management, and remediation are almost certainly a gigantic pain in the behind. And if you’re the buyer of observability tools, then the dollar signs are probably front and center - observability is usually the second biggest IT spend for large enterprises, trailing only their cloud spend.
Despite the plethora of tooling, observability today has largely been reduced to alerts and monitoring. Vendors’ business models are rooted in storage as the unit of consumption and utility, which is fundamentally misaligned with what customers ultimately expect - a reduction in MTTR (Mean Time to Resolve/Remediate). The result: observability incumbents are essentially big data companies that generate revenue primarily by charging customers to store observability data (most of which has a pretty short half-life), while providing basic visualizations such as time series charts and flame graphs. No observability incumbent has mastered every modality of the MELT data stack (Metrics, Events, Logs, Traces), as evidenced by the ubiquitous enterprise practice of adopting different observability tools for different data modalities. It is this very fragmentation that regularly causes the teams on the front lines of managing incidents and system performance - SRE, production engineering, DevOps, and software development - to suffer greatly. In this bargain, enterprises must also accept what they dread most when it comes to their business-critical systems - prolonged downtime.
At Flip AI, we take an automation-first approach, not a chatbot-first one. We surface the story embedded in all of your observability data. We contextualize the story within the bounds of an anomalous event, or an impending one. We tell you the beginning, the middle, and the end of this story so that developers don’t have to connect the dots manually. Imagine a world where the on-call engineer is paged with a critical alert, and by the time they open the alert to understand what might be going on, an automated report has already been embedded there - a report that articulates the “what,” “where,” and “why” of the incident or anomaly in question. Speaking as plainly as we can, this is no longer a thought exercise - Flip AI customers experience this very workflow every day. And each of our customers will tell you that this has resulted in a step-function reduction in MTTR.

How did Flip build Enterprise AI Agents for observability?
Flip was designed from the ground up to work autonomously in customer environments, which means that Flip’s subsystems have always had to support diverse architectures and programming languages, and to cope with frequent and unpredictable changes in data flows.
While most agentic systems are built by developers who treat the LLM as a black box, at Flip AI we think about the entire stack to make agents work. Flip trains its mixture-of-experts (MoE) LLM to be a domain expert by training on over 100 billion tokens of DevOps data. The LLMs are trained to understand multiple modalities of data (code, metrics, events, logs, traces). Tool use is treated as a first-class citizen during training, which is important for enabling zero-shot performance in novel tool usage scenarios. In addition, below are the tenets that have made Flip AI successful in building agentic workflows.
- Never run in silos: Agents that need to solve complex problems interact with other systems and agents. We have built cooperative agents (which we call “Actors”) that work toward a single goal, such as determining the root cause of an incident, while being able to generate multiple hypotheses and explore them independently.
- Dynamic discovery: In an enterprise, each service could have a different deployment architecture, language, infrastructure, etc. When Flip is deployed, it has no prior knowledge of architectures, languages, or configurations - all of these are dynamically discovered so that up-to-date information is used while debugging.
- Inconsistencies are everywhere: Most agentic workflows code only the happy path, but real systems can return bad data, incorrectly formatted data, and so on. At Flip, the Actors are able to backtrack and explore new hypotheses based on their observations.
- Rules don’t work: Production systems are not Newtonian, i.e., orderly and predictable, but instead conform to the chaos theory paradigm. Flip uses reinforcement learning from human feedback (RLHF) to train its model on large-scale, chaos-generated data to build up its knowledge of the complexity of systems.
- Observability is multimodal: Flip relies on multiple modalities to connect the dots on what happens in a production system.
- Fallback and adapt: When a certain modality of data is unavailable, Flip is able to deduce information from other modalities as a proxy. For example, when application tracing isn’t available, Flip is able to create traces purely based on logged interactions.
- Respect data governance: When working with sensitive data, you can’t store or replicate data in ways that violate compliance requirements. Flip doesn’t store any sensitive data and is able to query observability systems on demand, respecting data governance boundaries within the enterprise. The entire Flip stack, including the LLM itself, is designed to be installed deep in the enterprise, behind the corporate firewall, without the use of any third-party foundation models. In other words, Flip doesn’t require you to accept another data processing agreement.
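As a rough illustration of the “fallback and adapt” tenet, the sketch below shows one simple way traces could be approximated from logs when no tracing data exists. It is not a description of Flip’s internals. It assumes, hypothetically, that every log record carries a request (correlation) ID, a service name, and a millisecond timestamp; the `LogRecord` and `Span` types and the sample log lines are invented for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class LogRecord:
    timestamp_ms: int
    request_id: str  # assumed correlation ID shared across services
    service: str
    message: str


@dataclass(frozen=True)
class Span:
    service: str
    start_ms: int
    end_ms: int
    had_error: bool


def reconstruct_traces(records: List[LogRecord]) -> Dict[str, List[Span]]:
    """Approximate one trace per request ID from time-ordered log records.

    Each service that logged for a request becomes one span, running from
    its first to its last log line. Spans are listed in first-seen order.
    """
    by_request: Dict[str, List[LogRecord]] = defaultdict(list)
    for record in records:
        by_request[record.request_id].append(record)

    traces: Dict[str, List[Span]] = {}
    for request_id, request_records in by_request.items():
        ordered = sorted(request_records, key=lambda r: r.timestamp_ms)
        by_service: Dict[str, List[LogRecord]] = {}  # dicts keep insertion order
        for record in ordered:
            by_service.setdefault(record.service, []).append(record)
        traces[request_id] = [
            Span(
                service=service,
                start_ms=group[0].timestamp_ms,
                end_ms=group[-1].timestamp_ms,
                had_error=any("error" in r.message.lower() for r in group),
            )
            for service, group in by_service.items()
        ]
    return traces


# Invented sample logs, deliberately out of order.
logs = [
    LogRecord(1050, "req-1", "payments", "charge started"),
    LogRecord(1000, "req-1", "gateway", "request received"),
    LogRecord(1400, "req-1", "payments", "ERROR: upstream timeout"),
    LogRecord(1450, "req-1", "gateway", "responded 502"),
]
for span in reconstruct_traces(logs)["req-1"]:
    print(span)
```

A proxy like this is lossy: it cannot recover parent-child relationships or exact timings the way real tracing does, but it is often enough to show which service a failing request passed through and where the error first appeared.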
‍
So will Flip actually work for my enterprise?
The approach described in this blog post has enabled the Flip DevOps LLM to successfully execute its agentic workflows and significantly reduce MTTR for our customers, without any puppetry. Flip is currently deployed alongside at-scale production systems in diverse customer environments, ranging from monolithic, n-tier, .NET, and Java environments to modern, microservice-based Kubernetes cloud environments. Flip AI has been deployed in all the major clouds and in on-prem environments. Our team’s greatest joy has been solving the numerous challenges required to let our LLMs run autonomously in customers’ virtual private clouds - with all the enterprise security controls - while generating RCAs that bring MTTR down to minutes. Truly autonomous agent-actor AI is still in its early days, and we are excited to be leading the way. Contact us to get started on reducing your MTTR while increasing sleep.