Flip AI in Vegas! re:Invent 2023 Through the Lens of Observability & GenAI

Nate Slater

Dec 12, 2023

re:Invent 2023 has come and gone. A strong contingent of the Flip AI founding team attended, and our CTO, Sunil Mallya, presented a 400-level deep-dive talk on how Flip AI trains and deploys our DevOps LLM on AWS! I’ve been attending re:Invent for a long time: first as a customer in 2013, then as an AWS employee for almost a decade through 2022, and now again as a customer this year. One of my favorite things about re:Invent as an employee was co-presenting with customers who were building really cool solutions with AWS services. It was an amazing experience this year to be on the customer side of that arrangement!

Now that we’ve had a week to digest all of the announcements, we at Flip AI want to take this opportunity to cut through all of the hype and examine some of the product launches that may have flown under the radar but could have important implications in the observability space, particularly through the lens of the generative AI approach that Flip AI takes.

Let’s start with this one: CloudWatch Logs Anomaly Detection and Pattern Analysis. There’s a lot to unpack in this announcement, so let’s begin with the anomaly detection capability. At Flip AI, we avoid using the term “anomaly detection” because it can mean a lot of things and often connotes a traditional AIOps approach. We believe that approach has largely failed because AIOps technologies use ML models that predate LLMs, older models that require training on heavily sanitized customer data before they can do anything useful. This is not the case with the LLMs that Flip AI trains, which support zero-shot inference. So how does CloudWatch Logs anomaly detection work? From the documentation:

“After you create an anomaly detector for a log group, it trains using the past two weeks of log events in the log group for training. The training period can take up to 24 hours. After the training is complete, it begins to analyze incoming logs to identify anomalies, and the anomalies are displayed in the CloudWatch Logs console for you to examine.”

In other words, this is traditional AIOps. There’s nothing wrong with that, and for many customers, this could absolutely be useful. However, as any experienced DevOps expert would say, the most difficult anomalies are subtle and only appear to be anomalous after a second look. More often than not, these patterns appear almost out of nowhere - a training period of up to 24 hours isn’t going to cut it. Having said that, the Pattern Analysis capability is indeed interesting to us at Flip AI. According to the documentation, the patterns that CloudWatch Logs discovers in the logs look like this:

<*> <*> [INFO] Calling DynamoDB to store for resource id <*>

This should look very familiar to anybody who’s worked with log parsers like Drain, a long-standing tool in most AIOps-based log analysis pipelines. At Flip AI, we’ve found that techniques for clustering logs of similar structure and meaning are immensely useful when combined with an LLM like the one we have built. CloudWatch Logs will now discover these patterns in logs, and enabling this is free! The great news for Flip AI customers that are using CloudWatch Logs is that our models can use these types of discovered patterns - whether they are anomalous or not - to provide rich insights into what your logs are actually saying. For customers who aren’t using CloudWatch Logs, we still have your back, since being able to detect patterns in real time is foundational to Flip’s models and has been since we first developed them. We’re excited to try this feature with many of our enterprise customers.
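
To make the idea of a log pattern concrete, here is a minimal Python sketch of template discovery. It is not Drain itself and not how CloudWatch Logs or Flip AI implement pattern analysis; it is a simplified stand-in that groups lines by token count and replaces the positions that vary with `<*>`. The function names, the sample log lines, and the 0.5 similarity threshold are all assumptions made for illustration.

```python
from collections import defaultdict

WILDCARD = "<*>"


def similarity(template, tokens):
    """Fraction of positions where template and line hold the same literal token."""
    same = sum(1 for t, tok in zip(template, tokens) if t == tok)
    return same / len(tokens)


def merge(template, tokens):
    """Keep tokens that agree; replace positions that differ with a wildcard."""
    return [t if t == tok else WILDCARD for t, tok in zip(template, tokens)]


def discover_patterns(lines, threshold=0.5):
    """Return a list of (template, count) pairs in order of first appearance.

    Only lines with the same number of whitespace-separated tokens are
    compared with each other (a simplification borrowed from Drain-style parsers).
    """
    buckets = defaultdict(list)  # token count -> list of [template_tokens, count]
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        candidates = buckets[len(tokens)]
        best = max(candidates, key=lambda c: similarity(c[0], tokens), default=None)
        if best is not None and similarity(best[0], tokens) >= threshold:
            best[0] = merge(best[0], tokens)
            best[1] += 1
        else:
            candidates.append([tokens, 1])
    return [(" ".join(t), n) for group in buckets.values() for t, n in group]


if __name__ == "__main__":
    # Hypothetical log lines, invented for this example.
    sample = [
        "2023-12-01 req-1 [INFO] Calling DynamoDB to store for resource id 42",
        "2023-12-02 req-2 [INFO] Calling DynamoDB to store for resource id 97",
        "2023-12-02 req-3 [ERROR] Timeout while calling S3",
    ]
    for template, count in discover_patterns(sample):
        print(count, template)
```

With the sample lines, the two DynamoDB messages collapse into the same template shown in the documentation excerpt, while the error line stays as its own pattern. A production parser would also mask numbers and IDs up front and use a smarter search structure; this sketch leaves that out.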

Let’s move on to the next announcement that caught our attention: “Amazon Managed Service for Prometheus collector provides agentless metric collection for Amazon EKS.” With all the attention on the generative AI announcements, it can be easy to forget that what has made AWS so successful over the years has been eliminating undifferentiated heavy lifting for customers. That’s exactly what this service does. Anyone who’s had to install agents for scraping Prometheus metrics, like the OTel Collector, knows how hard it is to scale them horizontally within a Kubernetes cluster. Customers are emitting more Prometheus metrics than ever before, and eliminating this operational overhead within EKS is a huge win for them. Flip AI’s models are able to detect patterns in metrics and reason about what is going on based on these patterns. We’ve been huge fans of Prometheus, and love that AWS is making it easier for EKS customers to use it.

Since re:Invent 2022, we’ve demoed Flip AI’s application to nearly a hundred enterprise customers while continually iterating on it. The most common refrain we’ve heard from these customers is that their observability data is spread across platforms. So this next re:Invent 2023 launch announcement was no surprise to us: “Use Amazon CloudWatch to consolidate hybrid, multicloud, and on-premises metrics.” Given what we’ve heard from enterprise customers, we have no doubt that being able to consolidate metrics from different sources will be an important convenience for many of them. But what really caught our attention about this announcement was the ability to take data in RDBMS tables and convert it to metrics. The reason this is interesting is that another common refrain we’ve heard from customers is that they want to define KPIs that bridge observability data with business data, and they want Flip’s automated debugging playbooks to kick into gear when these business-leaning KPIs are breached. Since RDBMS tables are mostly used to store business data, this feature could be used to define these types of business-critical KPIs. In other words, imagine a future where next-generation observability technologies could debug whether a business-critical KPI was suffering due to an underlying infrastructure or application-related incident. (Yes, we’re aware that a handful of observability tools claim to be able to do this. In practice, this only works when you have every single piece of telemetry data stored in one tool, which we have not seen with a single enterprise customer we’ve spoken to.)

We have already begun experimenting with this new feature in partnership with some of our enterprise customers to discover the art of the possible, and the early signs are very promising.
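
As a rough illustration of what a business-leaning KPI derived from an RDBMS table might look like, here is a small Python sketch. It does not use the CloudWatch feature itself: an in-memory SQLite database stands in for the business database, and the `orders` table, its columns, the failure-rate KPI, and the 20% threshold are all hypothetical. In a real setup the computed value would be surfaced as a metric and the breach would trigger a debugging playbook, neither of which is shown here.

```python
import sqlite3


def order_failure_rate(conn, since_minute):
    """Return failed/total for orders at or after since_minute, or None if there are none."""
    total, failed = conn.execute(
        """
        SELECT COUNT(*),
               COALESCE(SUM(CASE WHEN status = 'failed' THEN 1 ELSE 0 END), 0)
        FROM orders
        WHERE minute >= ?
        """,
        (since_minute,),
    ).fetchone()
    return None if total == 0 else failed / total


def kpi_breached(value, threshold):
    """A KPI with no data is treated as not breached (an assumption of this sketch)."""
    return value is not None and value > threshold


if __name__ == "__main__":
    # Hypothetical business table, created in memory for the example.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, minute INTEGER, status TEXT)"
    )
    conn.executemany(
        "INSERT INTO orders (minute, status) VALUES (?, ?)",
        [(5, "ok"), (10, "ok"), (10, "failed"), (11, "ok"), (12, "ok")],
    )
    rate = order_failure_rate(conn, since_minute=10)
    if kpi_breached(rate, threshold=0.2):
        print(f"KPI breached: order failure rate is {rate:.0%}")
```

The point of the sketch is the shape of the workflow: a SQL query turns business rows into a single number, and a threshold on that number is what would start an investigation into the underlying infrastructure or application telemetry.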

If you watched the re:Invent keynotes, by now you’re probably wondering “why hasn’t Flip AI mentioned Q in this blog post yet?” It’s a valid question. Doesn’t Q (and related features like using natural language to query Amazon CloudWatch logs and metrics) contain a lot of the same capabilities as Flip? Not really, but sure - there is some overlap. Let’s look at this product launch in more detail.

Amazon Q looks like a promising kitchen-sink-like launch that claims to do a wide variety of tasks, ranging from generating code, writing RDBMS queries, and debugging code to automating business intelligence. You can ask questions about AWS services, and debug your Lambda function or network-related issues. This all seems useful. But these features seem disparate and, in true AWS fashion, tempting but incomplete for automating a real end-to-end production scenario like a debugging workflow. Flip automatically generates the queries used by its automated debugging playbooks for observability platforms like CloudWatch (and many others) without any prompting required by an engineer. Contrast that with AI agents (or “co-pilots” in Microsoft parlance or “GPTs” in OpenAI vernacular) that are designed to automate specific day-to-day tasks by responding to prompts and questions that the human must enter into a chat interface. Flip’s debugging playbooks, on the other hand, run completely autonomously as opposed to waiting for the right question to be asked.

It’s a profound difference. Using an agent to debug an issue requires that the engineer know which questions to ask. Flip’s models know how to construct a playbook completely automatically based on the knowledge the models have of the modern tech stack. When using Flip, an engineer is shown a complete RCA, broken down into a series of observations that detail the LLM reasoning and the data each observation is grounded in. To reproduce this with an agent, the engineer would have to instruct the agent to execute each of the steps that Flip generates and executes automatically. As a result, Flip can scale in a way that even an AI-agent-assisted human debugging workflow cannot. This doesn’t suggest that agents like Q have no place in automating workflows that engineers execute manually today - far from it! We firmly believe that the future will include both AI agents that provide an interactive AI-driven experience and fully autonomous AI like Flip’s.

The final announcement we’ll discuss today is one that pertains to storage. Far and away, the single biggest business-related pain point our customers tell us they have with observability is cost. As we’ve written in previous blog posts, observability platforms are big data platforms. However, the storage within these platforms is priced at a premium. That’s why we are excited about the newest S3 storage tier, Express One Zone. We’ve been having many conversations with customers about using Flip with observability data stored in inexpensive cloud object storage. Since Flip dynamically creates runbooks and can query data using any engine, including those that run on object storage such as S3, fast object storage tiers could be the basis for a new type of observability architecture. This is an incredibly exciting and potentially highly disruptive approach that could upend a lot of how observability incumbents butter their bread. Imagine all of that telemetry data sitting in a highly available, scalable and inexpensive data store like S3, a technology like Flip AI that interprets that data and provides accurate, timely root cause analysis for incidents, and you begin to ask yourself, “why am I paying out the nose for Observability Tool X?” It’s the right question to be asking, and while it’s still early days for storing observability data in inexpensive object storage and using AI to analyze it, we’re very excited about the possibilities, and are in fact experimenting with this approach with a subset of our customers.

As we roll out some of our experiments, particularly those leveraging some of the new features discussed above, we will share what we’ve learned. So please check this blog for updates!
