
Dead Letter Queues: Handling Failed Jobs Gracefully

Not every job succeeds. APIs go down, data is malformed, and bugs slip through. The question isn’t whether jobs will fail - it’s how you handle them when they do. bunqueue’s Dead Letter Queue (DLQ) gives you a systematic way to deal with failures.

What Is a Dead Letter Queue?

A DLQ is a holding area for jobs that have exhausted their retry attempts. Instead of being silently discarded, failed jobs are preserved with their full context so you can:

  • Inspect why they failed
  • Fix the underlying issue
  • Retry them after the fix
  • Purge jobs that are no longer relevant
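Each of these actions maps onto bunqueue's DLQ API, covered in detail in the sections below; as a quick preview (every call here reappears later in this article):

// Assuming `queue` is a bunqueue Queue instance (construction shown below)
const stats = queue.getDlqStats(); // inspect why entries failed
// ...fix the underlying issue in your worker code...
queue.retryDlq();                  // retry retriable entries after the fix
queue.purgeDlq();                  // purge entries that are no longer relevant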

When Jobs Enter the DLQ

Jobs move to the DLQ when they meet specific failure conditions:

Condition                Reason         Example
Max attempts exhausted   max_attempts   Job failed 3 times with exponential backoff
Processing timeout       timeout        Job exceeded its timeout setting
Stall limit reached      stalled        Worker died 3 times while processing this job
Manual discard           manual         Explicitly moved via API
For example, a job configured with three attempts moves to the DLQ once all of them fail:

import { Queue } from 'bunqueue'; // import path assumed from the package name

const queue = new Queue('payments', { embedded: true });

// Add a job with 3 retry attempts
await queue.add('charge', { amount: 99.99 }, {
  attempts: 3,
  backoff: { type: 'exponential', delay: 1000 },
  timeout: 30_000,
});

// If all 3 attempts fail, the job moves to the DLQ
// with reason: 'max_attempts'

Configuring the DLQ

DLQ behavior is configurable per queue:

queue.setDlqConfig({
  autoRetry: true,            // Automatically retry DLQ jobs
  autoRetryInterval: 300_000, // Every 5 minutes
  maxAutoRetries: 3,          // Max 3 auto-retry cycles
  maxAge: 604_800_000,        // Expire entries after 7 days
  maxEntries: 10_000,         // Cap at 10,000 entries
});

Inspecting Failed Jobs

Query the DLQ to understand what’s failing:

// Get DLQ statistics
const stats = queue.getDlqStats();
console.log(stats);
// {
//   total: 47,
//   byReason: { max_attempts: 30, timeout: 12, stalled: 5 },
//   retriable: 42,
//   expired: 5,
//   oldestEntry: 1707000000000
// }

// List DLQ entries with filtering
const entries = queue.getDlq({
  reason: 'timeout',                  // Only timeout failures
  olderThan: Date.now() - 86_400_000, // Older than 24h
  limit: 20,
  offset: 0,
});

for (const entry of entries) {
  console.log({
    jobId: entry.job.id,
    queue: entry.job.queue,
    data: entry.job.data,
    reason: entry.reason,
    error: entry.error,
    enteredAt: new Date(entry.enteredAt),
    attempts: entry.job.attempts,
  });
}
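To spot patterns quickly, here is a small sketch (assuming the entry shape above, and that getDlq accepts limit on its own) that tallies the most common error messages:

// Count DLQ entries by error message to surface the dominant failure
const counts = new Map<string, number>();
for (const entry of queue.getDlq({ limit: 1000 })) {
  counts.set(entry.error, (counts.get(entry.error) ?? 0) + 1);
}
const top = [...counts.entries()].sort((a, b) => b[1] - a[1]).slice(0, 5);
console.table(top);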

Retrying Failed Jobs

Once you’ve fixed the underlying issue, retry DLQ entries:

// Retry a specific job
queue.retryDlq('job-id-123');

// Retry all retriable jobs
queue.retryDlq();

// Retry with a filter
queue.retryDlqByFilter({
  reason: 'timeout',
  newerThan: Date.now() - 3_600_000, // Only last hour
});

When a job is retried from the DLQ (see the sketch after this list):

  1. Its attempt counter is reset
  2. It’s placed back in the waiting queue
  3. Its original data and options are preserved
  4. Workers will pick it up normally
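Putting that together, a typical fix-then-retry workflow might look like this (a sketch built from the calls above; step 2 stands in for whatever fix you deploy):

// 1. Inspect what failed
const timeouts = queue.getDlq({ reason: 'timeout' });
console.log(`${timeouts.length} jobs timed out`);

// 2. ...deploy a fix (e.g., raise the timeout or patch the slow code path)...

// 3. Retry only the entries affected in the last hour
queue.retryDlqByFilter({
  reason: 'timeout',
  newerThan: Date.now() - 3_600_000,
});
// Each retried job re-enters the waiting queue with a fresh attempt counter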

Purging the DLQ

For jobs that are no longer relevant:

// Purge all DLQ entries
queue.purgeDlq();

The maxAge and maxEntries config values also drive automatic cleanup during the DLQ maintenance cycle, which runs every 60 seconds.

Monitoring DLQ in Production

Watch for growing DLQ sizes as an early warning signal:

// Periodic health check
setInterval(() => {
  const stats = queue.getDlqStats();

  if (stats.total > 100) {
    console.warn(`DLQ growing: ${stats.total} entries`);
    // Send alert to monitoring system
  }

  // Log breakdown by reason
  for (const [reason, count] of Object.entries(stats.byReason)) {
    console.log(`DLQ ${reason}: ${count}`);
  }
}, 60_000);
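If you want that alert to reach an on-call channel, here is a minimal sketch that posts the stats to an incident webhook (the URL and payload shape are placeholders, not part of bunqueue); you could call it at the "Send alert" comment above:

// Hypothetical alerting hook: POST DLQ stats to a monitoring endpoint
async function alertDlqGrowth(total: number): Promise<void> {
  await fetch('https://alerts.example.com/hooks/dlq', { // placeholder URL
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ queue: 'payments', dlqTotal: total }),
  });
}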

DLQ + Webhooks

Combine DLQ with webhooks for real-time alerts:

const data = { orderId: 'order-42' }; // example payload, assumed for illustration

// Get notified when jobs enter the DLQ
await queue.add('critical-task', data, {
  attempts: 3,
  backoff: { type: 'exponential', delay: 2000 },
});

// Set up a webhook for failed events
// (via TCP protocol or HTTP API)
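On the receiving side, here is a minimal sketch of an HTTP endpoint that could accept those webhook deliveries (the payload fields are assumptions, not bunqueue's documented schema):

// Minimal webhook receiver in Bun; logs failure events as they arrive
Bun.serve({
  port: 3000,
  async fetch(req) {
    if (req.method !== 'POST') {
      return new Response('Method Not Allowed', { status: 405 });
    }
    const event = await req.json(); // assumed shape: { event, jobId, reason }
    if (event.event === 'failed') {
      console.warn(`Job ${event.jobId} failed: ${event.reason}`);
    }
    return new Response('ok');
  },
});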

Best Practices

  1. Always configure a DLQ - don’t let failed jobs vanish silently
  2. Set maxAge - old DLQ entries are rarely useful, expire them
  3. Monitor DLQ size - it’s your canary in the coal mine
  4. Use autoRetry for transient failures - API outages resolve themselves
  5. Set maxEntries to prevent unbounded growth
  6. Inspect before retrying - understand why jobs failed before blindly retrying them (see the sketch below)
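A sketch of that last practice, using the calls from earlier sections (assuming stalled failures are the class you want to review):

// Inspect, then retry selectively rather than blindly
const stalled = queue.getDlq({ reason: 'stalled', limit: 50 });
for (const entry of stalled) {
  console.log(entry.job.id, entry.error); // review each failure before acting
}

// Once the cause is understood, retry just that class of failure
queue.retryDlqByFilter({ reason: 'stalled' });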