Beyond Big Data
What Exactly Is “Observability”?
One could make the case that the very first example of code instrumented for observability is the “Hello World!” application itself. This application, implemented in every programming language ever invented, prints a simple two-word message to the console. Change that message to something like “Line 1: Application Starting” and you have essentially emitted a log message. Now intersperse these messages throughout a more complex application, combine them with metrics produced by the hardware the application runs on, and you’re “observing” your application. Add distributed tracing for software components that run on different servers, plus events that track key activities like automated builds and deployments, and you’re fully instrumented with the key components of the modern observability stack. These data modalities are sometimes known by the acronym MELT (Metrics, Events, Logs, and Traces). Now picture a large, complex organization operating at tremendous scale, and the volume of telemetry emitted by the totality of its systems and software, and you can see why observability systems are really big data systems.
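To make that progression concrete, here is a minimal sketch using Python’s standard logging module: the same print-a-message instinct as “Hello World!”, upgraded into a timestamped log record plus a crude latency metric. The handle_request function, its order_id parameter, and the metric name are purely illustrative, not any particular system’s API.

```python
import logging
import time

# Configure a basic logger; in production a structured formatter
# (JSON or key=value) would typically replace this plain format.
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("app")

def handle_request(order_id: str) -> None:
    # Log: a timestamped record of a discrete step --
    # "Line 1: Application Starting" writ large.
    log.info("processing order %s", order_id)
    start = time.monotonic()
    # ... business logic would run here ...
    elapsed_ms = (time.monotonic() - start) * 1000
    # Metric: a numeric measurement. A real system would ship this to a
    # metrics backend rather than folding it into the log stream.
    log.info("order_processing_latency_ms=%.2f", elapsed_ms)

handle_request("A-1234")
```

Multiply this by every request, every service, and every host, and the telemetry volume story in the paragraph above writes itself.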
A Walk Down Big Data Lane
To fully explore how observability systems came to embody traditional big data systems, we must turn back the clock and understand how and why this came to pass. Let’s take a brief look at how the ever-increasing weight of telemetry data sought refuge in big data systems, and why we believe this approach is woefully inadequate for the modern enterprise.
We start our big data journey with Google’s seminal paper on the Google File System, which led to the formation of the Apache Hadoop project. Software developers had been instrumenting their code since the earliest days of computer programming, but they had to be very judicious about how much observability data they could store at any given time. Google’s paper introduced a new paradigm in system design: horizontal scalability. It rested on a simple yet extraordinarily powerful observation: at scale, any component that can fail ultimately will, so build for resilience across many machines rather than betting on one that never breaks.
Which leads us to the next stage of our observability-as-big-data journey ⏳…
Google’s observation about horizontal scalability profoundly changed system design, and it coincided almost perfectly with the rise of Infrastructure-as-a-Service (“IaaS”) cloud computing from Amazon Web Services (“AWS”), which eliminated the operational complexity of running horizontally scaling systems while simultaneously lowering costs. By the early 2010s, any company, from the smallest startup to the largest enterprise, could harness the power of big data thanks to cloud computing. This ushered in a slew of new databases designed for ingesting, storing, and querying data that had been notoriously difficult (or cost-prohibitive) to handle with a traditional relational database management system (“RDBMS”). This differentiation at the database layer was given the moniker “NoSQL,” and if you look at any observability platform today, you’ll see that horizontally scaling NoSQL databases are the foundations on which it is built.
Somewhat in parallel with the technological and economic impact cloud computing was having on observability, open source search technologies such as Lucene and Solr were born and fervently embraced by the observability ecosystem. Although the Hadoop ecosystem was mostly used for asynchronous data processing, its underlying concepts, large-scale streaming ingestion, redundant storage, and redundant computation over large volumes of data, made their way into observability systems. The combination of indexing engines and the big data ecosystem gave birth to purpose-built observability search and analytics engines like Elasticsearch and the ELK stack, further entrenching the “big data” mindset in the universe of observability.
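For a flavor of what that entrenchment looks like in practice, here is a hedged sketch of a log search against Elasticsearch’s REST _search API using Python’s requests library. The logs-* index pattern and the message and @timestamp fields are assumptions modeled on a typical ELK deployment, not any specific system.

```python
import requests

# Assumes a local Elasticsearch node with log documents indexed under
# "logs-*", each carrying "message" and "@timestamp" fields -- the
# conventional shape of an ELK ingest pipeline, but an assumption here.
ES_URL = "http://localhost:9200/logs-*/_search"

query = {
    "size": 20,
    "query": {
        "bool": {
            # Full-text match served by Lucene's inverted index.
            "must": [{"match": {"message": "timeout"}}],
            # Restrict to the last 15 minutes of telemetry.
            "filter": [{"range": {"@timestamp": {"gte": "now-15m"}}}],
        }
    },
    "sort": [{"@timestamp": {"order": "desc"}}],
}

resp = requests.post(ES_URL, json=query, timeout=10)
resp.raise_for_status()
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"].get("@timestamp"), hit["_source"].get("message"))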
Big Data, the Villain?
No, that’s not what we’re saying. Reviewing the history, the conditions, the motivations, and the impact of big data systems, it is clear they served a purpose. But here is an interesting thought experiment, one that partly drives our approach at Flip AI: would a more surgical and balanced approach to observability have emerged had we not been so captivated by the economies of scale of big data systems?
At its core, what really is the act of “observing” or “monitoring”? Its sole purpose must be to keep systems healthy, to quickly discover when they are not, and just as quickly to restore them to health. But what do we really have today? We have ever-growing, overwhelming mountains of telemetry and signal that make it humanly impossible to achieve true observability. Because of the ubiquity of observability-as-big-data systems, like Splunk, Datadog, New Relic, and Elastic, humans have become the slowest link in the incident-response assembly line, precisely because there is so much data to sift through, correlate, and analyze. And this was exactly the motivation behind building Flip AI, by asking ourselves a simple question: “Can’t large machine learning models do that stuff way faster and more accurately than people?”
A (Potentially) Pivotal Shift in Observability: Generative AI
We asked ourselves this question, but we already had a strong belief: that the step-function shift in capabilities brought by Large Language Models and Generative AI could rationalize and reason through large amounts of observability data and present developers with an accurate, evidence-backed hypothesis as to why an incident is happening and what they could do to fix it. We were also keenly aware of the disappointing ancestry of this space, broadly known as AIOps. The notion of applying statistical or machine learning techniques to observability data wasn’t new, but its inability to drive meaningful value and positive impact was palpable, something we heard from our enterprise customers over and over. AIOps largely failed its customers precisely because its tools came of age when training older generations of machine learning models required large quantities of clean, labeled data.
Generative AI changes all of this. Large language models (LLMs) learn from the structure of the text itself, meaning the human effort once required to label data has been replaced by far more automatable processes, such as splitting and structuring the training data so the algorithm can learn from it, as the sketch below illustrates. LLMs are able to understand concepts and reason through patterns that emerge in observability systems.
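Here is a minimal sketch of why this kind of training needs no human labels: the raw text supplies its own supervision, because each position’s “label” is simply the next token. The whitespace tokenizer is a toy stand-in for a real subword tokenizer, and the log line is invented for illustration.

```python
# The raw text itself supplies the supervision -- no human labeling step.
raw_log = "ERROR db connection pool exhausted retrying in 5s"

tokens = raw_log.split()  # toy tokenizer; real models use subword vocabularies

# Slide a window over the sequence: (context, next_token) pairs fall out
# of the data automatically. This is the "splitting and structuring"
# that replaces manual labeling.
pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in pairs:
    print(f"context={context!r} -> predict {target!r}")
```

Scaled up across billions of lines of text, including logs, traces, and code, this self-supervised objective is what lets a model internalize the patterns of observability data without anyone hand-labeling a single incident.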
At Flip AI, we have intentionally created a symbiotic experience that merges the power of our large language model with the product experience itself. Many players in the observability space, frankly including legacy enterprise observability companies, take the approach of slapping a chat-oriented widget, usually powered by a third-party generic LLM provider with dubious data processing agreements (an approach no large enterprise will stomach), onto their already-busy, data-rich interfaces. This is not helpful. It presumes the customer knows exactly what question to ask, all the time. Sometimes they will, but most of the time our customers just want us to surface evidence-backed, detailed root cause analysis. We are not naive enough to believe developers will wholesale change behaviors and abandon big data observability systems; those systems are entrenched, and institutional knowledge is a powerful stickiness factor. But we also know, through firsthand experience with large enterprise development teams, that developers are tired and are no faster at remediating incidents (in fact, they are getting slower), while observability spend continues to rise. Something is clearly broken, and we at Flip AI intend to show you a better way. #FreeDevelopers