A dataset of production pain
When you build a tool that analyzes production errors, you accumulate an unusual dataset over time: the actual bugs and failures that matter to real engineering teams, not synthetic benchmarks or contrived examples.
With the explicit consent of customers on plans where anonymized data contributes to model improvement (customers on our privacy-first plan are excluded), we analyzed the error patterns from approximately 2 million log entries processed through ErrorLens between July and November 2024. What we found challenges several assumptions that developers commonly hold about production errors.
Note: All data in this analysis is fully anonymized. No customer-identifying information, proprietary code, or business logic is included. We analyze error type names, severity classifications, occurrence counts, language distributions, and temporal patterns only.
Finding 1: 80% of errors come from 3% of error types
The distribution of errors in production systems is extremely skewed. Across all analyses, the top 3% of error types by occurrence count accounted for 80% of all error occurrences. The bottom 50% of error types each appeared fewer than 3 times in the dataset.
This is a much more extreme Pareto distribution than most developers intuitively assume. Teams often treat error backlogs like a to-do list — systematically working through all errors. The data suggests a more focused approach: identify and eliminate the high-frequency errors first. In most codebases, fixing 3–5 recurring error types eliminates the majority of error volume.
The most common errors by occurrence count across the dataset:
- NullPointerException / AttributeError on NoneType — 34% of all occurrences
- Connection pool exhaustion / timeout errors — 18%
- Authentication token expiry / invalid token errors — 12%
- Rate limiting errors from third-party APIs — 9%
- Unhandled Promise rejections — 7%
Together, these five patterns account for 80% of all error occurrences. If your production system experiences all five, fixing them in order would eliminate 80% of your error volume before touching any other issue in your backlog.
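The arithmetic behind this prioritization is easy to reproduce on your own data. Below is a minimal sketch, assuming you can export one error-type name per logged occurrence; the sample list and its contents are illustrative, not an ErrorLens export format.

```python
from collections import Counter

# Hypothetical input: one error-type name per logged occurrence,
# exported from whatever error tracker you use.
occurrences = [
    "NullPointerException", "NullPointerException", "ConnectionPoolTimeout",
    "TokenExpired", "NullPointerException", "RateLimitExceeded",
    # ... millions more in a real dataset
]

counts = Counter(occurrences)
total = sum(counts.values())

# Sort error types by frequency, then find how few types
# cover 80% of all occurrences.
cumulative = 0
for rank, (error_type, count) in enumerate(counts.most_common(), start=1):
    cumulative += count
    print(f"{rank:>3}  {error_type:<30} {count:>8}  cum {cumulative / total:.0%}")
    if cumulative / total >= 0.80:
        print(f"Top {rank} of {len(counts)} types cover 80% of volume")
        break
```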
Finding 2: critical errors cluster in a 2-hour window after deployment
When we plotted error occurrence by time relative to deployment events (inferred from deployment-tagged analyses), a clear pattern emerged: 67% of critical-severity errors first appeared within 2 hours of a deployment, and 89% appeared within 6 hours.
This has an important implication for incident response strategy. The period immediately following a deployment is by far the highest-risk window. Teams that invest in intense post-deployment monitoring for the first 2 hours can catch nearly 90% of deployment-related critical errors before they become extended incidents.
The errors that appeared after 6 hours were predominantly three types:
- Timeout and connection issues that only manifest under sustained load (often first triggered when US East Coast users wake up following an overnight deployment)
- Memory leaks that take time to exhaust heap space
- Token or credential expiry issues that surface as users' sessions expire
Knowing these specific patterns lets teams set more targeted post-deployment monitoring rules rather than broad alerting that produces too much noise to act on quickly.
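One way to act on this is to route alerts by an error's age relative to the most recent deploy, rather than alerting uniformly. A minimal sketch, assuming you record deploy timestamps and each error's first-seen time; the function name, thresholds, and routing labels are illustrative.

```python
from datetime import datetime, timedelta

HIGH_RISK_WINDOW = timedelta(hours=2)   # 67% of critical errors first appear here
ELEVATED_WINDOW = timedelta(hours=6)    # 89% within this window

def post_deploy_risk(first_seen: datetime, last_deploy: datetime) -> str:
    """Classify an error's alerting priority by how soon after
    the most recent deployment it first appeared."""
    age = first_seen - last_deploy
    if age < timedelta(0):
        return "pre-deploy"          # error predates this deploy
    if age <= HIGH_RISK_WINDOW:
        return "page-on-call"        # highest-risk window: page immediately
    if age <= ELEVATED_WINDOW:
        return "alert-channel"       # still elevated: notify, don't page
    return "backlog"                 # likely a load, leak, or expiry pattern

# Example: an error first seen 45 minutes after a deploy
deploy = datetime(2024, 11, 5, 14, 0)
print(post_deploy_risk(datetime(2024, 11, 5, 14, 45), deploy))  # page-on-call
```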
Finding 3: 78% of recurring errors have the same root cause across occurrences
Many developers assume that the same error message appearing multiple times might have several different root causes — different code paths, different input states, different timing. Our data suggests this is rarely true.
When we analyzed groups of recurring errors (the same error type appearing 10 or more times in the same codebase), 78% of the groups had a single root cause that was consistent across all occurrences. Only 22% showed evidence of multiple contributing root causes.
The practical implication: when you see a recurring error, it's overwhelmingly likely that fixing one thing will eliminate all occurrences, not just the instance you're looking at. This is relevant for how you scope fixes — a recurring NPE in a payment service is almost certainly one missing null check, not 47 different null-check gaps.
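You can sanity-check the single-root-cause assumption on your own recurring errors by fingerprinting occurrences on their innermost stack frames: if one fingerprint dominates, you are almost certainly looking at one bug. A rough sketch with hypothetical occurrence records; real stack-frame parsing will depend on your log format.

```python
from collections import Counter

def fingerprint(stack_frames: list[str], depth: int = 3) -> tuple[str, ...]:
    """Fingerprint an occurrence by its innermost frames; occurrences
    sharing a fingerprint almost certainly share a root cause."""
    return tuple(stack_frames[:depth])

# Hypothetical occurrences of the same recurring error type
occurrences = [
    ["charge.py:validate:88", "charge.py:run:40", "api.py:post:12"],
    ["charge.py:validate:88", "charge.py:run:40", "api.py:post:12"],
    ["refund.py:lookup:17", "refund.py:run:9", "api.py:post:12"],
]

groups = Counter(fingerprint(o) for o in occurrences)
top_share = groups.most_common(1)[0][1] / len(occurrences)
print(f"{len(groups)} distinct fingerprints; "
      f"largest covers {top_share:.0%} of occurrences")
```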
Finding 4: Python and JavaScript codebases produce 3x more unique error types than Java or Go
When we controlled for codebase size (using the volume of log entries as a proxy), Python and JavaScript codebases produced significantly more distinct error types than Java or Go codebases of comparable size.
The error occurrence rate was similar across languages, at roughly 12–18 errors per 1,000 lines of executed code. But the error variety was starkly different: Python and JavaScript code produced approximately 3x more unique error type names per 1,000 occurrences than Java or Go.
This likely reflects the difference between dynamically and statically typed languages. In Java and Go, the type system catches at compile time a class of errors that Python and JavaScript only surface at runtime. The runtime errors in dynamically typed languages tend to be more varied in type, while Java/Go errors concentrate in a smaller set of runtime patterns like NullPointerException.
The implication for teams: Python and JavaScript projects benefit disproportionately from error categorization tools that group semantically similar errors, because the raw variety makes manual tracking harder. A Python service generating 50 distinct error types is not necessarily less healthy than a Java service generating 12 — the variety is partly structural.
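Even without a dedicated tool, a crude normalization pass collapses much of that structural variety. A sketch assuming raw error strings like the ones Python and JavaScript runtimes emit; the regexes are illustrative and would need tuning against your actual logs.

```python
import re

def normalize(error: str) -> str:
    """Collapse semantically similar error strings into one bucket
    by stripping values that vary per occurrence."""
    error = re.sub(r"0x[0-9a-fA-F]+", "<addr>", error)   # memory addresses
    error = re.sub(r"\d+", "<n>", error)                 # counts, ports, ids
    error = re.sub(r"'[^']*'", "'<value>'", error)       # quoted values
    return error

raw = [
    "AttributeError: 'NoneType' object has no attribute 'id'",
    "AttributeError: 'NoneType' object has no attribute 'name'",
    "TimeoutError: connection to 10.0.3.7:5432 timed out after 30s",
    "TimeoutError: connection to 10.0.3.9:5432 timed out after 30s",
]

print(len({normalize(e) for e in raw}), "buckets from", len(raw), "raw types")
# 2 buckets from 4 raw types
```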
Finding 5: most errors are not fixed — they're suppressed
Perhaps the most uncomfortable finding. When we analyzed error recurrence patterns over time, we found that 44% of errors that appeared in early analyses for a given codebase were still appearing in analyses six weeks later with no meaningful reduction in occurrence count.
Only 31% of identified errors showed a clear resolution pattern — a decline in occurrence count consistent with a fix having been deployed. The remaining 25% showed intermittent patterns consistent with conditional fixes (the error still occurs, but less often).
This tracks with what engineering teams report anecdotally: error backlogs tend to grow over time, not shrink. High-severity errors get fixed quickly. Medium and low severity errors get added to the backlog, triaged in sprint planning, deprioritized in favor of new features, and quietly accumulate.
The teams that showed the best error resolution rates in our dataset had one thing in common: they treated error analysis as part of their definition of done for each sprint, not as a separate maintenance track. Allocating even 10–15% of sprint capacity to error resolution — guided by occurrence-count data rather than gut feeling — had a measurable impact on the trend.
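Classifying your own backlog along these lines takes nothing more than weekly occurrence counts per error type. A rough sketch; the thresholds are illustrative, not the ones used in the analysis above.

```python
def classify_trend(weekly_counts: list[int]) -> str:
    """Label an error's six-week trajectory from weekly occurrence counts.
    Cutoffs are illustrative assumptions, not the analysis's actual rules."""
    first, last = weekly_counts[0], weekly_counts[-1]
    if last == 0:
        return "resolved"        # decline consistent with a deployed fix
    if last <= first * 0.5:
        return "intermittent"    # conditional fix: fewer occurrences, not gone
    return "persistent"          # no meaningful reduction

print(classify_trend([120, 118, 130, 125, 119, 122]))  # persistent
print(classify_trend([120, 90, 40, 10, 0, 0]))         # resolved
print(classify_trend([120, 80, 60, 55, 50, 52]))       # intermittent
```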
What this means for your error strategy
If the data from these 2 million log entries suggests anything actionable, it's these three practices:
Focus on frequency, not variety. Sort your error backlog by occurrence count, not by error type. The 3% of error types driving 80% of occurrences are your highest-leverage fixes.
Watch the 2-hour deployment window. Your highest-risk period is right after a deploy. Invest disproportionately in monitoring and rapid response in the first 2 hours after every deployment.
Treat recurring errors as single root causes. When the same error appears 20 times, assume it has one fix. Find that fix, deploy it, and verify the count drops to zero — not just to 15.
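Verification is the step teams most often skip. A minimal sketch of the check, assuming you can query occurrence counts for a window before and after the fix was deployed; the in-memory log and helper here are hypothetical stand-ins for whatever query your error tracker supports.

```python
from datetime import datetime, timedelta

# Hypothetical in-memory log: (error_type, timestamp) pairs standing in
# for a real error-tracker query interface.
LOG = [
    ("PaymentNPE", datetime(2024, 11, 1, 9, 30)),
    ("PaymentNPE", datetime(2024, 11, 2, 14, 5)),
    ("PaymentNPE", datetime(2024, 11, 4, 11, 0)),  # after the fix: still occurring
]

def count_occurrences(error_type: str, start: datetime, end: datetime) -> int:
    return sum(1 for e, ts in LOG if e == error_type and start <= ts < end)

def verify_fix(error_type: str, fix_deployed_at: datetime,
               window: timedelta = timedelta(days=3)) -> bool:
    """A fix for a single-root-cause error should drive the count to zero;
    'fewer' means a second root cause is still live."""
    before = count_occurrences(error_type, fix_deployed_at - window, fix_deployed_at)
    after = count_occurrences(error_type, fix_deployed_at, fix_deployed_at + window)
    print(f"{error_type}: {before} before fix, {after} after")
    return after == 0

print(verify_fix("PaymentNPE", datetime(2024, 11, 3)))  # False: count didn't hit zero
```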