The 2 AM problem
Every engineering team has a version of this story. It is 2 AM on a Tuesday. Your on-call engineer gets paged. The payment service is throwing errors. Response times have spiked from 120ms to 14 seconds. Customers are seeing failures at checkout.
The engineer opens the logs. There are 47,000 lines. Three different services are reporting errors. Some look related. Some don't. The clock is ticking.
For the team at Payload CMS — an open-source headless CMS used by thousands of developers worldwide — this scenario was familiar. Their production stack spans multiple Node.js services, a PostgreSQL database, and integrations with several third-party APIs. When something goes wrong, the logs are dense and the root causes are rarely obvious from a quick scan.
They started using ErrorLens in October 2024. By December, their mean-time-to-resolve had dropped from 47 minutes to 13 minutes. Here's how they did it.
Before: the manual triage loop
Before ErrorLens, the incident response workflow at Payload looked like this:
- Get paged. Open the log dashboard in the browser.
- Scan for ERROR-level lines manually, usually with Ctrl+F and terms like "Exception", "Error", or "FATAL" (a minimal sketch of this step follows the list).
- Try to identify which service is the origin versus which are downstream effects.
- Search the codebase for the error message. Find 6 potential locations.
- Start adding console.log statements or open a debugger.
- Spend 20–40 minutes narrowing down before making the first code change.
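For a sense of why that scanning step hurts, here is a minimal sketch of what keyword filtering amounts to (the file name and pattern are illustrative, not Payload's actual tooling):

```ts
import { readFileSync } from "node:fs";

// Naive keyword triage over a dumped incident log (hypothetical file).
const lines = readFileSync("incident.log", "utf8").split("\n");

const suspicious = lines.filter((line) =>
  /\b(ERROR|FATAL|Exception)\b/.test(line)
);

console.log(`${suspicious.length} of ${lines.length} lines match`);
// On a 47,000-line incident log this still leaves thousands of
// candidates spread across services; separating origin errors from
// downstream noise is the part a keyword filter cannot do.
```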
The problem wasn't incompetence. Their engineers are excellent. The problem was the cognitive overhead of manual log triage at 2 AM. Reading tens of thousands of log lines while trying to reason about distributed-system interactions is genuinely hard, regardless of your skill level.
The integration
The Payload team set up ErrorLens in two ways. First, they connected it directly to their CI/CD pipeline using the GitHub Actions connector, so every failing build automatically sent the test runner logs to ErrorLens for analysis. Second, for production incidents, they built a lightweight Slack bot that lets any engineer paste a log snippet and receive an analysis result in the same channel within 10 seconds.
The Slack integration was particularly valuable. During an incident, engineers are often already in Slack coordinating. Being able to drop a log block directly into the channel and get an AI-generated root cause — without switching to another tool — removed one of the highest-friction steps in the response flow.
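A minimal sketch of such a bot, assuming Slack's Bolt framework and a hypothetical ErrorLens HTTP endpoint (the URL, payload shape, and environment variable names are illustrative, not Payload's actual code):

```ts
import { App } from "@slack/bolt";

// Tokens come from the Slack app configuration; names are conventional.
const app = new App({
  token: process.env.SLACK_BOT_TOKEN,
  signingSecret: process.env.SLACK_SIGNING_SECRET,
});

// React to messages that look like pasted log blocks.
app.message(async ({ message, say }) => {
  if (!("text" in message) || !message.text?.includes("ERROR")) return;

  // Hypothetical ErrorLens analysis endpoint; the real API may differ.
  const res = await fetch("https://api.errorlens.example/v1/analyze", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.ERRORLENS_API_KEY}`,
    },
    body: JSON.stringify({ logs: message.text }),
  });
  const { rootCause } = (await res.json()) as { rootCause: string };

  // Post the analysis back into the same channel, threaded on the
  // original paste so the incident timeline stays intact.
  await say({ text: rootCause, thread_ts: message.ts });
});

(async () => {
  await app.start(3000);
})();
```

Threading the reply on the pasted log block keeps the diagnosis attached to the evidence, which pays off later when writing the postmortem.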
A real incident: the database connection pool collapse
In November 2024, Payload experienced an incident where their API started returning 503 errors under load. The logs showed thousands of lines across three services. Manually, it would have taken 20–30 minutes to identify the root cause. With ErrorLens, the analysis came back in 9 seconds:
> Root cause: Database connection pool exhaustion. The pool size of 10 connections is insufficient for the current request volume. Connections are timing out after 5 seconds, causing upstream services to cascade-fail. The origin error is `Error: Connection pool timeout` in `src/database/pool.ts:47`, not the downstream 503s, which are secondary effects.
The fix — increasing the pool size from 10 to 50 and adding a queue with a backpressure limit — took 8 minutes to deploy. Total incident duration: 23 minutes from first alert to resolution, versus their previous average of 47 minutes for an incident of similar complexity.
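The shape of that fix, sketched with node-postgres (a minimal sketch under assumptions: the exact settings and the load-shedding helper are illustrative, not Payload's actual patch):

```ts
import { Pool } from "pg";

// Before: max 10 connections with a 5s timeout, exhausted under load.
// After: a larger pool plus an explicit backpressure limit so the
// service sheds load instead of letting every caller time out at once.
const pool = new Pool({
  max: 50,                       // was 10
  connectionTimeoutMillis: 5000, // fail fast instead of hanging
});

const MAX_WAITING = 100; // queue-depth limit before rejecting (assumed value)

export async function query(text: string, params?: unknown[]) {
  // pg exposes how many callers are queued waiting for a connection;
  // beyond the limit, reject immediately rather than cascading timeouts.
  if (pool.waitingCount > MAX_WAITING) {
    throw new Error("Database queue full, shedding load");
  }
  return pool.query(text, params);
}
```

Rejecting at the queue keeps the failure local to the database layer instead of surfacing as 503s three services away.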
Critically, the AI identified that the 503 errors visible in the API layer logs were secondary effects, not the root cause. A manual scan would likely have started debugging the API service first, wasting 10–15 minutes before tracing back to the database.
The 73% improvement in context
Across their first six weeks of using ErrorLens on production incidents, Payload measured:
- Average MTTR dropped from 47 minutes to 13 minutes (73% reduction)
- Time spent in manual log triage dropped from ~28 minutes per incident to ~4 minutes
- First diagnosis accuracy (identifying the correct root cause on the first attempt) improved from 61% to 89%
- Number of engineers pulled into an incident dropped from an average of 2.3 to 1.1
The last metric is worth pausing on. Every incident that previously required a second engineer to sanity-check the diagnosis now typically resolves with one. That is not just a time saving — it is a quality-of-life improvement for a team that values sustainable on-call culture.
What didn't change
ErrorLens doesn't replace engineering judgment. For incidents involving novel failure modes — patterns the AI had not seen before — the team still needed experienced engineers to reason through the problem. The tool is strongest when the failure has a recognizable signature: null pointer exceptions, connection pool exhaustion, authentication token expiry, rate limiting, out-of-memory conditions. These are exactly the incidents that make up the bulk of production issues.
For genuinely novel distributed systems failures — race conditions with specific timing characteristics, memory corruption, subtle data pipeline issues — the AI provides useful context but doesn't replace deep investigation.
Key takeaways
If you're evaluating AI log analysis tools for your own incident response, here is what the Payload team's experience suggests you should measure:
- Time to first diagnosis, not just MTTR. MTTR includes fix and deploy time, which AI can't control. Time to correct root-cause identification is the metric that AI actually moves (a short sketch after this list shows the distinction).
- Diagnosis accuracy. Starting the fix in the wrong place is worse than starting late. Measure how often your first diagnosis is correct.
- Context-switching cost. The value of integrating analysis into Slack or your existing incident channel is hard to quantify but genuinely significant.
- Incident breadth. The improvement is largest on incidents involving multiple services with cascading failures — exactly the incidents that hurt most.
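As a concrete illustration of the first two points, a small sketch of how the two times diverge (the incident record fields are assumptions, not a standard schema):

```ts
// Illustrative incident record; field names are assumed, not standard.
interface Incident {
  alertedAt: Date;
  correctDiagnosisAt: Date; // when the true root cause was identified
  resolvedAt: Date;
}

const minutes = (from: Date, to: Date) =>
  (to.getTime() - from.getTime()) / 60_000;

// MTTR bundles diagnosis, fix, and deploy time; time to first correct
// diagnosis isolates the part an analysis tool can actually move.
function report(incidents: Incident[]) {
  const avg = (xs: number[]) => xs.reduce((s, x) => s + x, 0) / xs.length;
  return {
    meanTimeToDiagnosis: avg(
      incidents.map((i) => minutes(i.alertedAt, i.correctDiagnosisAt))
    ),
    mttr: avg(incidents.map((i) => minutes(i.alertedAt, i.resolvedAt))),
  };
}
```

Tracked separately, a tool that cuts diagnosis time but leaves fix-and-deploy untouched still shows up clearly in the first number.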
The goal isn't to replace the on-call engineer. It's to make that engineer 3–4x more effective in the first 10 minutes of an incident, when the pressure is highest and the information is most overwhelming.