
Dead Letter Queues: Handling Failed Jobs Gracefully

Not every job succeeds. APIs go down, data is malformed, and bugs slip through. The question isn’t whether jobs will fail - it’s how you handle them when they do. bunqueue’s Dead Letter Queue (DLQ) gives you a systematic way to deal with failures.

What Is a Dead Letter Queue?

A DLQ is a holding area for jobs that have exhausted their retry attempts. Instead of being silently discarded, failed jobs are preserved with their full context so you can:

  • Inspect why they failed
  • Fix the underlying issue
  • Retry them after the fix
  • Purge jobs that are no longer relevant
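Each of these actions maps onto bunqueue's DLQ API, covered in detail in the sections below; as a quick preview (every call here reappears later in this article):

// Assuming `queue` is a bunqueue Queue instance (construction shown below)
const stats = queue.getDlqStats(); // inspect why entries failed
// ...fix the underlying issue in your worker code...
queue.retryDlq();                  // retry retriable entries after the fix
queue.purgeDlq();                  // purge entries that are no longer relevant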

When Jobs Enter the DLQ

Jobs move to the DLQ when they meet specific failure conditions:

Condition                Reason         Example
Max attempts exhausted   max_attempts   Job failed 3 times with exponential backoff
Processing timeout       timeout        Job exceeded its timeout setting
Stall limit reached      stalled        Worker died 3 times while processing this job
Manual discard           manual         Explicitly moved via API
For example, a job configured with three attempts moves to the DLQ once all of them fail:

import { Queue } from 'bunqueue'; // import path assumed from the package name

const queue = new Queue('payments', { embedded: true });

// Add a job with 3 retry attempts
await queue.add('charge', { amount: 99.99 }, {
  attempts: 3,
  backoff: { type: 'exponential', delay: 1000 },
  timeout: 30_000,
});

// If all 3 attempts fail, the job moves to the DLQ
// with reason: 'max_attempts'

Configuring the DLQ

DLQ behavior is configurable per queue:

queue.setDlqConfig({
  autoRetry: true,            // Automatically retry DLQ jobs
  autoRetryInterval: 300_000, // Every 5 minutes
  maxAutoRetries: 3,          // Max 3 auto-retry cycles
  maxAge: 604_800_000,        // Expire entries after 7 days
  maxEntries: 10_000,         // Cap at 10,000 entries
});

Inspecting Failed Jobs

Query the DLQ to understand what’s failing:

// Get DLQ statistics
const stats = queue.getDlqStats();
console.log(stats);
// {
//   total: 47,
//   byReason: { max_attempts: 30, timeout: 12, stalled: 5 },
//   retriable: 42,
//   expired: 5,
//   oldestEntry: 1707000000000
// }

// List DLQ entries with filtering
const entries = queue.getDlq({
  reason: 'timeout',                  // Only timeout failures
  olderThan: Date.now() - 86_400_000, // Older than 24h
  limit: 20,
  offset: 0,
});

for (const entry of entries) {
  console.log({
    jobId: entry.job.id,
    queue: entry.job.queue,
    data: entry.job.data,
    reason: entry.reason,
    error: entry.error,
    enteredAt: new Date(entry.enteredAt),
    attempts: entry.job.attempts,
  });
}
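To spot patterns quickly, here is a small sketch (assuming the entry shape above, and that getDlq accepts limit on its own) that tallies the most common error messages:

// Count DLQ entries by error message to surface the dominant failure
const counts = new Map<string, number>();
for (const entry of queue.getDlq({ limit: 1000 })) {
  counts.set(entry.error, (counts.get(entry.error) ?? 0) + 1);
}
const top = [...counts.entries()].sort((a, b) => b[1] - a[1]).slice(0, 5);
console.table(top);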

Retrying Failed Jobs

Once you’ve fixed the underlying issue, retry DLQ entries:

// Retry a specific job
queue.retryDlq('job-id-123');

// Retry all retriable jobs
queue.retryDlq();

// Retry with a filter
queue.retryDlqByFilter({
  reason: 'timeout',
  newerThan: Date.now() - 3_600_000, // Only last hour
});

When a job is retried from the DLQ (see the sketch after this list):

  1. Its attempt counter is reset
  2. It’s placed back in the waiting queue
  3. Its original data and options are preserved
  4. Workers will pick it up normally
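Putting that together, a typical fix-then-retry workflow might look like this (a sketch built from the calls above; step 2 stands in for whatever fix you deploy):

// 1. Inspect what failed
const timeouts = queue.getDlq({ reason: 'timeout' });
console.log(`${timeouts.length} jobs timed out`);

// 2. ...deploy a fix (e.g., raise the timeout or patch the slow code path)...

// 3. Retry only the entries affected in the last hour
queue.retryDlqByFilter({
  reason: 'timeout',
  newerThan: Date.now() - 3_600_000,
});
// Each retried job re-enters the waiting queue with a fresh attempt counter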

Purging the DLQ

For jobs that are no longer relevant:

// Purge all DLQ entries
queue.purgeDlq();

The maxAge and maxEntries config values also drive automatic cleanup during the DLQ maintenance cycle, which runs every 60 seconds.

Monitoring DLQ in Production

Watch for growing DLQ sizes as an early warning signal:

// Periodic health check
setInterval(() => {
  const stats = queue.getDlqStats();

  if (stats.total > 100) {
    console.warn(`DLQ growing: ${stats.total} entries`);
    // Send alert to monitoring system
  }

  // Log breakdown by reason
  for (const [reason, count] of Object.entries(stats.byReason)) {
    console.log(`DLQ ${reason}: ${count}`);
  }
}, 60_000);
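If you want that alert to reach an on-call channel, here is a minimal sketch that posts the stats to an incident webhook (the URL and payload shape are placeholders, not part of bunqueue); you could call it at the "Send alert" comment above:

// Hypothetical alerting hook: POST DLQ stats to a monitoring endpoint
async function alertDlqGrowth(total: number): Promise<void> {
  await fetch('https://alerts.example.com/hooks/dlq', { // placeholder URL
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ queue: 'payments', dlqTotal: total }),
  });
}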

DLQ + Webhooks

Combine DLQ with webhooks for real-time alerts:

const data = { orderId: 'order-42' }; // example payload, assumed for illustration

// Get notified when jobs enter the DLQ
await queue.add('critical-task', data, {
  attempts: 3,
  backoff: { type: 'exponential', delay: 2000 },
});

// Set up a webhook for failed events
// (via TCP protocol or HTTP API)
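On the receiving side, here is a minimal sketch of an HTTP endpoint that could accept those webhook deliveries (the payload fields are assumptions, not bunqueue's documented schema):

// Minimal webhook receiver in Bun; logs failure events as they arrive
Bun.serve({
  port: 3000,
  async fetch(req) {
    if (req.method !== 'POST') {
      return new Response('Method Not Allowed', { status: 405 });
    }
    const event = await req.json(); // assumed shape: { event, jobId, reason }
    if (event.event === 'failed') {
      console.warn(`Job ${event.jobId} failed: ${event.reason}`);
    }
    return new Response('ok');
  },
});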

Best Practices

  1. Always configure a DLQ - don’t let failed jobs vanish silently
  2. Set maxAge - old DLQ entries are rarely useful, expire them
  3. Monitor DLQ size - it’s your canary in the coal mine
  4. Use autoRetry for transient failures - API outages resolve themselves
  5. Set maxEntries to prevent unbounded growth
  6. Inspect before retrying - understand why jobs failed before blindly retrying them (see the sketch below)
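A sketch of that last practice, using the calls from earlier sections (assuming stalled failures are the class you want to review):

// Inspect, then retry selectively rather than blindly
const stalled = queue.getDlq({ reason: 'stalled', limit: 50 });
for (const entry of stalled) {
  console.log(entry.job.id, entry.error); // review each failure before acting
}

// Once the cause is understood, retry just that class of failure
queue.retryDlqByFilter({ reason: 'stalled' });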