What is observability?

Observability is the ability to understand what is happening inside a web app or API by analysing its external outputs - mainly logs, metrics, and traces. It is the practice that lets you reach into a running production system, ask "what is going on right now?", and get an answer that does not depend on guessing.

"Observability helps you detect, investigate, and fix problems by showing how your system behaves in production." - the working definition

The word matters. A system is observable when you can infer its internal state from outside, without attaching a debugger or restarting it in a special mode. The goal is to make that inference cheap and reliable - so the person on call at 3am can find a root cause in minutes, not hours.

The questions it answers

The best way to understand what observability is for is to look at the kind of question it makes answerable. These are everyday questions that a properly instrumented system can answer in seconds:

QUESTIONS A USABLE SYSTEM CAN ANSWER
  • Why is the API slow right now?
  • Where exactly did this customer's request fail?
  • Which service or dependency is causing the cascade?
  • Did the last deploy change p99 latency, and by how much?
  • Are we breaching the SLO, and how long do we have?

The questions look obvious. Answering them quickly, under pressure, in a system you may not have written, is the actual work. Observability is the engineering discipline that makes those answers fast.
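
As an illustration of the deploy question above, here is a minimal sketch that compares p99 latency before and after a deploy. It assumes you already have raw latency samples for both periods; the sample values, the nearest-rank percentile method, and the function names are illustrative choices, not something a particular tool prescribes.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def p99_change(before_ms, after_ms):
    """Return (p99 before, p99 after, difference) in milliseconds."""
    before = percentile(before_ms, 99)
    after = percentile(after_ms, 99)
    return before, after, after - before


# Made-up latency samples in milliseconds: 100 requests on each side of a deploy.
before_deploy = [20] * 98 + [180, 200]
after_deploy = [20] * 97 + [400, 450, 500]

before, after, delta = p99_change(before_deploy, after_deploy)
print(f"p99 before={before}ms after={after}ms change={delta:+}ms")
# prints: p99 before=180ms after=450ms change=+270ms
```

In practice a metrics backend would usually compute this from histograms rather than raw samples, so treat the arithmetic as the idea rather than the implementation.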

The three pillars

Observability is conventionally built on three kinds of telemetry your services emit. Each one answers a different shape of question. You need all three to cover the surface area of a modern application.

PILLAR 01

Logs

Detailed event records. The narrative of what happened, line by line, in a service or component.

Best for
What did this specific request do? What error message did the database return at 14:03:22?

PILLAR 02

Metrics

Numeric performance data, aggregated over time. Latency, error rate, CPU, queue depth - the dashboards.

Best for
How is the system behaving in aggregate? Is the trend up or down? Are we inside the SLO?

PILLAR 03

Traces

The end-to-end flow of a single request as it crosses every service, queue, and database that touched it.

Best for
Where in the call graph did time disappear? Which downstream service caused the upstream failure?

The pillars are complementary, not redundant. Metrics tell you something is wrong. Traces tell you where. Logs tell you what. A team that has invested in only one or two of the three will keep running into questions it cannot answer.
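
To make the division of labour concrete, here is a rough sketch of one request handler emitting all three signals, using only the Python standard library. The handler, the field names, and the in-memory lists are hypothetical stand-ins for a real telemetry pipeline; the point is the shared trace id that lets you move from a metric to a trace to a log line.

```python
import json
import time
import uuid
from collections import Counter

log_lines = []       # logs: one detailed event record per line
metrics = Counter()  # metrics: aggregated counts, no per-request detail
spans = []           # traces: timed operations that share a trace_id


def handle_request(path, fail=False):
    """Hypothetical handler that emits all three signals for one request."""
    trace_id = uuid.uuid4().hex  # 32 hex characters, ties the signals together
    start = time.perf_counter()
    status = 500 if fail else 200
    if fail:
        log_lines.append(json.dumps({
            "level": "error",
            "trace_id": trace_id,
            "path": path,
            "message": "database returned an error",
        }))
    duration_ms = (time.perf_counter() - start) * 1000
    spans.append({"trace_id": trace_id, "name": f"GET {path}",
                  "status": status, "duration_ms": duration_ms})
    metrics[("http_requests_total", path, status)] += 1
    return status, trace_id


handle_request("/orders")
status, trace_id = handle_request("/orders", fail=True)

# The metric says something is wrong: one of the two requests failed.
print(metrics[("http_requests_total", "/orders", 500)])
# The trace id of the failing span leads to the matching log line.
print([line for line in log_lines if trace_id in line])
```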

Observability vs monitoring

The two words sometimes get used interchangeably. They are not the same thing.

MONITORING

Watch known failure modes

You define what "healthy" looks like - a list of metrics, a list of thresholds, a list of alerts. The system checks whether reality matches the definition and shouts when it does not.

Best for failure modes you have already seen and decided to guard against.

OBSERVABILITY

Investigate unknown failure modes

You instrument the system richly enough that you can ask arbitrary questions about its behaviour - including questions you did not anticipate when you wrote the code.

Best for the long tail of weird issues that production always produces.

Monitoring is necessary; it is not sufficient. A modern distributed system fails in too many ways to enumerate ahead of time, which is why observability has become the broader, more important practice. Monitoring is the alarms. Observability is the ability to figure out what set them off.
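
The contrast can be sketched in a few lines. Assume each request produces one structured event; the events, field names, and threshold below are invented for illustration. The monitoring check answers a question fixed in advance, while the grouping function accepts whichever field turns out to matter during the investigation.

```python
from collections import defaultdict

# Hypothetical structured events, one per request.
events = [
    {"route": "/checkout", "status": 200, "region": "eu", "app_version": "2.4.0"},
    {"route": "/checkout", "status": 200, "region": "us", "app_version": "2.4.0"},
    {"route": "/search", "status": 200, "region": "eu", "app_version": "2.4.0"},
    {"route": "/search", "status": 200, "region": "us", "app_version": "2.4.0"},
    {"route": "/checkout", "status": 500, "region": "eu", "app_version": "2.4.1"},
    {"route": "/search", "status": 500, "region": "us", "app_version": "2.4.1"},
    {"route": "/checkout", "status": 200, "region": "us", "app_version": "2.4.0"},
    {"route": "/search", "status": 200, "region": "eu", "app_version": "2.4.0"},
]

ERROR_RATE_THRESHOLD = 0.05  # assumed alert threshold


def error_rate(evts):
    """Fraction of events with a 5xx status."""
    return sum(e["status"] >= 500 for e in evts) / len(evts)


def alert_fires(evts):
    """Monitoring: a known failure mode checked against a fixed threshold."""
    return error_rate(evts) > ERROR_RATE_THRESHOLD


def error_rate_by(evts, field):
    """Observability: break the error rate down by any field, chosen at investigation time."""
    groups = defaultdict(list)
    for e in evts:
        groups[e[field]].append(e)
    return {key: error_rate(group) for key, group in groups.items()}


print(alert_fires(events))                   # the alarm: something is wrong
print(error_rate_by(events, "region"))       # evenly spread, not a regional issue
print(error_rate_by(events, "app_version"))  # every failure is on one version
```

The alert only says the error rate is too high. The breakdown by `app_version` is the question nobody wrote an alert for, and it is the one that points at the cause.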

Why distributed systems need it

Observability has always been useful, but it became essential when applications stopped running as single processes on single machines. A monolith failure is usually local: the process crashes, the stack trace explains why, the engineer reads it and moves on. A distributed-systems failure rarely has that shape.

A request that fails today might have crossed five services, three queues, a cache, and two third-party APIs before something gave up. Without observability, the only honest answer to "where did it fail?" is "somewhere in the call graph". With observability - specifically with distributed tracing - the answer is "the third-party API returned 503 at 230ms into the request, and the retry loop in service B amplified it".
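
A simplified sketch of how a trace supports that kind of answer: each span records its parent, its timing, and its status, so the failure origin and the time spent in each hop can be computed. The span data below is invented to mirror the scenario above, and the model assumes child spans do not overlap in time, which real traces do not guarantee.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    start_ms: int
    end_ms: int
    status: str = "ok"


# Hypothetical trace for one failing request.
trace = [
    Span("1", None, "gateway", 0, 400, "error"),
    Span("2", "1", "service-a", 10, 390, "error"),
    Span("3", "2", "service-b", 20, 380, "error"),
    Span("4", "3", "cache", 25, 30),
    Span("5", "3", "third-party-api", 40, 230, "error"),
    Span("6", "3", "third-party-api (retry)", 240, 370, "error"),
]


def self_times(spans):
    """Time spent in each span itself, excluding its direct children."""
    totals = {s.span_id: s.end_ms - s.start_ms for s in spans}
    for s in spans:
        if s.parent_id is not None:
            totals[s.parent_id] -= s.end_ms - s.start_ms
    return totals


def first_failure(spans):
    """Earliest-finishing failed span that has no failed children."""
    failing_parents = {s.parent_id for s in spans if s.status == "error"}
    origins = [s for s in spans
               if s.status == "error" and s.span_id not in failing_parents]
    return min(origins, key=lambda s: s.end_ms, default=None)


origin = first_failure(trace)
print(f"first failure: {origin.name} at {origin.end_ms}ms")
print(self_times(trace))
```

The upstream spans are all marked as errors too, but only the leaf failures are origins; everything above them is the error propagating back up the call graph.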

Two pressures make this even more acute:

  • Scale. When traffic is high, intermittent failures stop being "one weird ticket" and start being "five customers per minute". You cannot debug them by re-running a single failing request.
  • Cloud-native architecture. Containers come and go. Pods restart. The exact machine that handled the failing request may not exist anymore by the time you start investigating. The telemetry has to be the source of truth.

Tools and OpenTelemetry

The ecosystem of observability tools is large, but it has converged on a single open standard for the telemetry layer: OpenTelemetry (OTel). OTel defines a vendor-neutral way for services to emit logs, metrics, and traces. Whatever you use to store and view the data - Grafana, Datadog, Honeycomb, New Relic, the Aspire Dashboard - it can ingest the same OTel signals, typically over the OpenTelemetry Protocol (OTLP).

That convergence matters because it means:

  • Adding observability to your code is a one-time investment in an open standard, not a lock-in to a specific vendor.
  • Switching between tools (or running multiple in parallel) does not require re-instrumenting your services.
  • Frameworks like .NET Aspire can ship "telemetry on" by default, and any compatible backend can read what they emit.
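
The decoupling behind those points can be illustrated with a toy model. This is not the OpenTelemetry API: the `Backend` protocol, the two backend classes, and the `Telemetry` facade are hypothetical names invented for this sketch. It only shows the shape of the idea, that application code talks to one neutral interface and backends are swapped or combined through configuration.

```python
import json
from typing import Protocol


class Backend(Protocol):
    """Hypothetical interface that every telemetry backend implements."""

    def export(self, span: dict) -> None: ...


class ConsoleBackend:
    """Stand-in for a local developer dashboard."""

    def export(self, span: dict) -> None:
        print(json.dumps(span))


class InMemoryBackend:
    """Stand-in for a hosted or self-hosted platform."""

    def __init__(self) -> None:
        self.spans = []

    def export(self, span: dict) -> None:
        self.spans.append(span)


class Telemetry:
    """Neutral facade: the only thing application code depends on."""

    def __init__(self, backends) -> None:
        self.backends = list(backends)

    def record_span(self, name: str, duration_ms: int) -> None:
        span = {"name": name, "duration_ms": duration_ms}
        for backend in self.backends:
            backend.export(span)


def checkout(telemetry: Telemetry) -> None:
    # Instrumented once; this function never names a vendor.
    telemetry.record_span("checkout", 42)


# Switching tools, or running two in parallel, only changes this wiring.
store = InMemoryBackend()
checkout(Telemetry([ConsoleBackend(), store]))
```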

Three categories of tools sit on top of OTel today:

  • Local developer dashboards. Aspire's Dashboard, OTel Collector + local Jaeger / Tempo. Optimised for the debugging loop on a single developer's machine.
  • Hosted observability platforms. Datadog, Honeycomb, New Relic, Grafana Cloud. Optimised for running production observability for a team.
  • Self-hosted stacks. Grafana + Loki + Mimir + Tempo, or the Elastic stack. Optimised for organisations that prefer to own the data plane.

When to invest

The honest answer for most teams running a real service is: now, if you have not already. The cost of adding OTel instrumentation has dropped dramatically in the last few years - many frameworks now ship instrumentation that is enabled with a configuration switch - and the return on the investment can be the difference between an outage that lasts 30 minutes and one that lasts six hours.

The maturity curve is roughly:

  • Logs only. The minimum viable state. Better than nothing, almost universally insufficient.
  • Logs + a few metrics. You can spot trends, but cannot diagnose anything specific.
  • Logs + metrics + distributed traces. The first state where production debugging stops being painful.
  • Logs + metrics + traces + SLOs + alerting on real signals. The state every team building a serious service should aim for.
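
For the last stage, the question "are we breaching the SLO, and how long do we have" comes down to error-budget arithmetic. The sketch below is a deliberately simple model that assumes steady traffic and a steady error rate; the function name, parameters, and example numbers are illustrative assumptions.

```python
def error_budget_status(slo_target, window_days, days_elapsed,
                        total_requests, failed_requests):
    """Estimate error-budget burn, assuming steady traffic and a steady error rate."""
    allowed_error_rate = 1 - slo_target
    observed_error_rate = failed_requests / total_requests
    # Burn rate 1.0 means the budget lasts exactly one SLO window.
    burn_rate = observed_error_rate / allowed_error_rate
    consumed = burn_rate * days_elapsed / window_days
    if consumed >= 1:
        days_left = 0.0
    elif burn_rate == 0:
        days_left = float("inf")
    else:
        days_left = (1 - consumed) * window_days / burn_rate
    return {
        "burn_rate": burn_rate,
        "budget_consumed": consumed,
        "breached": consumed >= 1,
        "days_until_exhausted": days_left,
    }


# Example: 99.9% SLO over 30 days; 10 days in, 2,000 of 1,000,000 requests failed.
status = error_budget_status(0.999, 30, 10, 1_000_000, 2_000)
print(f"burn rate {status['burn_rate']:.1f}x, "
      f"{status['budget_consumed']:.0%} of budget used, "
      f"about {status['days_until_exhausted']:.1f} days left")
```

In this example the service fails at twice the allowed rate, so a 30-day budget would last 15 days, of which 10 have already passed. Alerting on burn rate rather than on raw error counts is one way to alert on a real signal.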

Goal: quickly find root causes and understand system behaviour without guessing. Everything in this article is in service of that one sentence.