Why microservices make error monitoring harder
Error monitoring in a monolith is tractable: one process, one log stream, one place to look. In a microservices architecture, a single user action might touch 5–8 services, and an error in one can manifest as an ambiguous failure in another. The error in your API gateway log says "timeout after 30 seconds," but the root cause is a null dereference in the order service, triggered by a race condition in the inventory service.
This guide covers how to set up error monitoring for a Node.js microservices system in a way that actually helps you debug problems rather than just generating more log noise to wade through.
The architecture we'll work with: a typical e-commerce backend with an API gateway, auth service, product service, order service, payment service, notification service, and a shared PostgreSQL database cluster. We'll use Express.js for the services, Winston for logging, and integrate with ErrorLens for AI-powered analysis.
Step 1: structured logging across all services
The first requirement for effective microservices error monitoring is that every service logs in the same structured format. The biggest mistake teams make is letting each service decide its own logging format — some use plain text, some use JSON, some use different field names for the same concepts.
Create a shared logging package that all your services import:
// packages/logger/index.js
const winston = require('winston');

module.exports = function createLogger(serviceName) {
  return winston.createLogger({
    level: process.env.LOG_LEVEL || 'info',
    format: winston.format.combine(
      winston.format.timestamp(),
      winston.format.json()
    ),
    // Attached to every log line emitted by this service
    defaultMeta: {
      service: serviceName,
      version: process.env.SERVICE_VERSION || '0.0.0',
      environment: process.env.NODE_ENV || 'development',
    },
    transports: [
      new winston.transports.Console(),
      // Error-level entries also go to a file for collection in Step 3
      new winston.transports.File({
        filename: '/var/log/app/error.log',
        level: 'error',
      }),
    ],
  });
};
Every service creates its logger with its service name:
// In order-service/src/app.js
const logger = require('@myapp/logger')('order-service');

// Every log line now includes: service, timestamp, version, environment
logger.error('Failed to create order', {
  error: err.message,
  stack: err.stack,
  userId: req.user?.id,
  orderId: newOrder?.id,
});
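For reference, an error entry from this logger comes out as a single JSON line, roughly like this (the values, field order, and timestamp format here are illustrative):

{"level":"error","message":"Failed to create order","service":"order-service","version":"0.0.0","environment":"development","timestamp":"2024-01-15T09:21:34.201Z","error":"Cannot read properties of null (reading 'id')","stack":"TypeError: Cannot read properties of null...","userId":"u_4821","orderId":null}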
Step 2: correlation IDs for cross-service tracing
Without correlation IDs, you cannot connect an error in your order service to the originating request that triggered it. A request comes in at the API gateway, which assigns it a unique ID, and that ID propagates through every service call made to handle the request.
// In api-gateway/src/middleware/correlation.js
const { v4: uuidv4 } = require('uuid');
const logger = require('@myapp/logger')('api-gateway'); // shared logger from Step 1

module.exports = function correlationMiddleware(req, res, next) {
  // Accept existing correlation ID from upstream, or create a new one
  req.correlationId = req.headers['x-correlation-id'] || uuidv4();
  res.setHeader('x-correlation-id', req.correlationId);
  // Attach a child logger so every log in this request carries the ID
  req.log = logger.child({ correlationId: req.correlationId });
  next();
};
When the API gateway calls downstream services, it forwards the correlation ID. Downstream services run the same middleware, so they adopt the incoming ID instead of generating a new one:
// When calling downstream services
const response = await axios.post(
  `${ORDER_SERVICE_URL}/orders`,
  orderPayload,
  {
    headers: {
      'x-correlation-id': req.correlationId,
      'Authorization': req.headers.authorization,
    },
  }
);
Now when an error occurs anywhere in the chain, you can filter all logs by the correlation ID and see the complete request journey across every service that touched it.
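Even a tiny script does that filtering for you. The helper below is hypothetical (not part of the shared packages in this guide); it pulls every JSON log line matching a correlation ID out of the collected files and orders them by timestamp:

// scripts/trace-request.js (illustrative helper)
// Usage: node scripts/trace-request.js <correlation-id> logs/*/error.log
const fs = require('fs');

const [correlationId, ...files] = process.argv.slice(2);

const entries = files
  .flatMap(f => fs.readFileSync(f, 'utf8').split('\n'))
  .filter(Boolean)
  // Skip any lines that aren't valid JSON
  .map(line => { try { return JSON.parse(line); } catch { return null; } })
  .filter(e => e && e.correlationId === correlationId)
  .sort((a, b) => new Date(a.timestamp) - new Date(b.timestamp));

entries.forEach(e => console.log(`${e.timestamp} [${e.service}] ${e.level}: ${e.message}`));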
Step 3: centralized error collection
With structured logs and correlation IDs in place, you need a way to collect all those logs in one place. For this guide, we'll mount each service's log directory under a single host tree that ErrorLens can process:
# docker-compose.yml
services:
  api-gateway:
    image: myapp/api-gateway
    volumes:
      - ./logs/api-gateway:/var/log/app
  order-service:
    image: myapp/order-service
    volumes:
      - ./logs/order-service:/var/log/app
  # ... other services
For production, you'd typically route these to a centralized log aggregator (Loki, ELK, CloudWatch Logs). The key is that all error logs end up in a place you can collect them from when an incident occurs.
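If you do ship logs over the network, Winston's built-in HTTP transport is the lowest-friction option; the host and path below are placeholders for whatever collector you run, and dedicated community transports (e.g. winston-loki for Loki) are usually a better fit for a specific backend:

// Added to the transports array in packages/logger/index.js
// (sketch; LOG_COLLECTOR_HOST and the /ingest path are assumptions)
new winston.transports.Http({
  host: process.env.LOG_COLLECTOR_HOST || 'log-collector.internal',
  port: 443,
  path: '/ingest',
  ssl: true,
  level: 'error', // only ship error-level entries off the box
}),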
Step 4: the ErrorLens integration for Node.js
Install the ErrorLens connector package:
npm install @errorlens/node
Create a shared error analysis utility for your services:
// packages/error-analysis/index.js
const fs = require('fs');
const { ErrorLens } = require('@errorlens/node');

const el = new ErrorLens({
  apiKey: process.env.ERRORLENS_API_KEY,
});

module.exports = async function analyzeIncident(logPaths, options = {}) {
  const logs = logPaths
    .filter(p => fs.existsSync(p))
    .map(p => fs.readFileSync(p, 'utf8'))
    .join('\n--- SERVICE BOUNDARY ---\n');
  return el.analyze({
    content: logs,
    filename: 'incident-logs.log',
    tags: ['incident', ...(options.tags || [])],
  });
};
When an incident occurs, collect error logs from all affected services and analyze them together:
// In your incident response script (run from the repo root, inside an async function)
const analyzeIncident = require('@myapp/error-analysis');

// Paths match the volume mounts from the docker-compose file in Step 3
const result = await analyzeIncident([
  './logs/api-gateway/error.log',
  './logs/order-service/error.log',
  './logs/payment-service/error.log',
]);

console.log('Root causes identified:', result.errors.length);
result.errors.forEach(err => {
  console.log(`[${err.severity}] ${err.errorType} (${err.occurrences}x)`);
  console.log('Root cause:', err.rootCause);
  console.log('Fix:', err.solution.join(' | '));
});
Step 5: automated analysis in CI/CD
To catch errors before they reach production, add ErrorLens analysis to your CI pipeline. The GitHub Actions workflow below runs after your test suite and analyzes any failures; it assumes your CI workflow uploads its test logs as an artifact named test-output:
# .github/workflows/errorlens.yml
name: ErrorLens Analysis
on:
  workflow_run:
    workflows: ["CI"]
    types: [completed]
jobs:
  analyze:
    if: github.event.workflow_run.conclusion == 'failure'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/download-artifact@v4
        with:
          name: test-output
          path: ./test-logs
          # Required to pull an artifact from the triggering CI run
          run-id: ${{ github.event.workflow_run.id }}
          github-token: ${{ secrets.GITHUB_TOKEN }}
      - uses: errorlens/analyze-action@v1
        with:
          api-key: ${{ secrets.ERRORLENS_API_KEY }}
          log-files: './test-logs/**/*.log'
          fail-on: critical
          post-comment: true
          github-token: ${{ secrets.GITHUB_TOKEN }}
This posts the ErrorLens analysis as a PR comment automatically when tests fail, so reviewers can see the root cause without leaving the code review interface.
Step 6: handling the most common microservices error patterns
Once your infrastructure is set up, you'll encounter the same recurring patterns in every Node.js microservices system. Here's how to structure your error handling for each:
Unhandled Promise rejections:
// Global handler: add to every service's entry point
process.on('unhandledRejection', (reason) => {
  logger.error('Unhandled Promise rejection', {
    reason: reason?.message || String(reason),
    stack: reason?.stack,
  });
  // Don't exit: log it and keep running
});
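Pair it with a handler for synchronous crashes. Unlike rejections, Node's documentation recommends exiting after an uncaught exception, since the process may be in an inconsistent state; the flush-then-exit delay below is a sketch, not a hard rule:

process.on('uncaughtException', (err) => {
  logger.error('Uncaught exception', {
    error: err.message,
    stack: err.stack,
  });
  // Exit after a short delay so the file transport can flush;
  // the process manager (Docker, PM2, systemd) restarts the service
  setTimeout(() => process.exit(1), 1000);
});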
Circuit breaker for downstream services:
const CircuitBreaker = require('opossum');

// callOrderService is the async function that makes the HTTP call to the order service
const orderServiceBreaker = new CircuitBreaker(callOrderService, {
  timeout: 5000,                 // consider a call failed after 5s
  errorThresholdPercentage: 50,  // open the circuit when half of calls fail
  resetTimeout: 30000,           // try a test request after 30s
});

orderServiceBreaker.on('open', () =>
  logger.warn('Order service circuit breaker opened', {
    service: 'order-service',
    action: 'circuit-open',
  })
);
orderServiceBreaker.on('fallback', (result) =>
  logger.error('Order service fallback triggered', {
    fallbackResult: result,
  })
);
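Calls then go through the breaker rather than straight to the service. With opossum, fire() forwards its arguments to the wrapped function, and a registered fallback answers when the circuit is open or the call fails. The route and degraded payload below are illustrative:

// Serve a degraded response instead of an error while the circuit is open
orderServiceBreaker.fallback(() => ({ status: 'degraded', orders: [] }));

app.post('/checkout', async (req, res) => {
  // Arguments to fire() are passed through to callOrderService
  const order = await orderServiceBreaker.fire(req.body, req.correlationId);
  res.json(order);
});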
Request timeout with context:
app.use((req, res, next) => {
  req.startTime = Date.now();
  res.setTimeout(30000, () => {
    logger.error('Request timeout', {
      correlationId: req.correlationId,
      path: req.path,
      method: req.method,
      elapsed: Date.now() - req.startTime,
    });
    // Guard against a response that has already started streaming
    if (!res.headersSent) {
      res.status(503).json({ error: 'Request timeout' });
    }
  });
  next();
});
Putting it together: a realistic incident response flow
With this setup in place, here's what incident response looks like:
1. PagerDuty fires. The API is returning 503s on checkout.
2. An engineer runs the incident analysis script with the last 30 minutes of error logs from the API gateway, order service, and payment service.
3. ErrorLens returns a result in 8 seconds: "Root cause is connection pool exhaustion in order-service/src/database/pool.js. The pool of 10 connections is saturated by slow payment API callbacks. 847 instances of PoolTimeoutError were counted. Secondary effect: 503s from the API gateway are downstream of the pool exhaustion, not an independent issue."
4. The engineer increases the pool size, deploys in 6 minutes, and the 503s stop.
The structured logs, correlation IDs, and centralized collection made step 2 fast. The AI analysis made step 3 accurate. The combination reduced the time from page to fix from 45 minutes (manual) to under 15 minutes (with tooling).
The full source code for the shared packages and example services in this guide is available as a downloadable package from the Connectors and SDKs page.