Retry & DLQ Strategy (Practical)

When a consumer fails to process a message, you can retry (same queue or a retry queue with delay) and eventually send the message to a dead-letter queue (DLQ) for manual inspection or alerting. This article gives practical patterns: retry count, backoff, and when to move to DLQ, with examples and a short table.

Overview

  • Retry: On failure (exception, NACK), redeliver the message. Limit retries (e.g. 3–5) to avoid an infinite retry loop. Optionally use a separate retry queue with TTL so the message is delayed before being requeued (exponential backoff).
  • DLQ: After max retries, send the message to a dedicated DLQ. Consumers do not process DLQ at full speed; operators inspect, fix data or code, and replay or discard. Prevents bad messages from blocking the main queue.
  • Idempotency: Because of retries, the same message can be processed more than once. Consumers must be idempotent (see Idempotency in Message Consumers).
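The idempotency point above can be sketched as a consumer that deduplicates by message ID, so a redelivered message causes its side effect at most once. This is a minimal in-memory sketch; the class and method names (`IdempotentConsumer`, `handle`) are illustrative, not from any broker client library:

```java
import java.util.HashSet;
import java.util.Set;

public class IdempotentConsumer {
    private final Set<String> processedIds = new HashSet<>();
    private int sideEffects = 0;  // stands in for "rows written", "emails sent", ...

    // Returns true if the message caused a side effect, false if deduplicated.
    public boolean handle(String messageId) {
        if (!processedIds.add(messageId)) {
            return false;          // already processed: retry becomes a safe no-op
        }
        sideEffects++;             // real work happens exactly once per ID
        return true;
    }

    public int sideEffects() { return sideEffects; }

    public static void main(String[] args) {
        IdempotentConsumer c = new IdempotentConsumer();
        c.handle("msg-1");
        c.handle("msg-1");                    // redelivery after a retry
        System.out.println(c.sideEffects());  // 1
    }
}
```

In production the seen-ID set would live in a durable store (DB table, Redis) rather than in memory, since the consumer itself can restart mid-retry.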

Example

Example 1: Simple retry — same queue

  • Consumer catches the exception and does not ack (or NACKs with requeue). The broker redelivers to the same (or another) consumer. With no retry limit in the broker, a poison message risks infinite redelivery — so either limit retries in the application (e.g. a header or counter) or use a DLQ.

Example 2: Retry with counter and DLQ (Kafka)

Java
// Sketch of the body of the poll loop; process, getRetryCount, sendToDLQ and
// sendToRetryTopic are application helpers (e.g. the retry count lives in a
// "retries" record header or an external store).
if (process(record)) {
    consumer.commitSync();                      // success: advance the offset
} else {
    int retries = getRetryCount(record);
    if (retries >= 3) {
        sendToDLQ(record);                      // give up: isolate the message
        consumer.commitSync();                  // commit so we don't reprocess forever
    } else {
        sendToRetryTopic(record, retries + 1);  // retry topic with delay (e.g. single partition + time-based consumer)
        consumer.commitSync();                  // main topic keeps moving
    }
}
  • The retry topic can be consumed with a delay (e.g. the consumer sleeps until message timestamp + backoff) or you can use a broker delay feature (RocketMQ delayed messages, RabbitMQ delayed-message exchange). After max retries, move the message to the DLQ and commit the offset so the main topic is not blocked.
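The "sleep until message timestamp + backoff" idea above amounts to computing a due time per record. A minimal sketch, assuming the retry count is carried with the record; `delayForRetryMs` and `waitMs` are illustrative helpers, not Kafka client API:

```java
public class DelayedRetry {
    // Exponential backoff: 1s, 2s, 4s, ... for retry 1, 2, 3, ...
    static long delayForRetryMs(int retryCount) {
        return 1000L << (retryCount - 1);
    }

    // How long the retry-topic consumer should sleep before handling this record.
    static long waitMs(long recordTimestampMs, int retryCount, long nowMs) {
        long dueAt = recordTimestampMs + delayForRetryMs(retryCount);
        return Math.max(0, dueAt - nowMs);
    }

    public static void main(String[] args) {
        // Record produced at t=10_000, second retry, clock now at t=11_000:
        System.out.println(waitMs(10_000, 2, 11_000)); // 1000 (2s backoff, 1s elapsed)
    }
}
```

Because records in a single partition are time-ordered, sleeping on the head record never delays a record that is already due behind it — which is why a single partition (or one partition per backoff tier) is the usual layout for this pattern.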

Example 3: RabbitMQ — DLX and TTL

  • Main queue: set x-dead-letter-exchange = DLX and x-message-ttl (or per-message TTL). On TTL expiry or NACK without requeue, the message goes to the DLX, which routes it to the DLQ.
  • Alternatively, use a separate retry queue: the consumer NACKs to the retry queue with a TTL; when the TTL expires, the message goes back to the main queue (or the next retry queue), or to the DLQ after max retries (e.g. by checking a retry-count header).
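Concretely, the DLX setup above is just an argument map passed at queue declaration. The x-argument keys below are RabbitMQ's standard ones; the exchange name, routing key, and TTL value are made-up examples:

```java
import java.util.Map;

public class DlxArgs {
    static Map<String, Object> mainQueueArgs() {
        return Map.of(
            "x-dead-letter-exchange", "dlx",     // where expired/rejected messages go
            "x-dead-letter-routing-key", "dlq",  // route them on to the DLQ
            "x-message-ttl", 30_000              // 30s TTL: expiry dead-letters the message
        );
    }

    public static void main(String[] args) {
        System.out.println(mainQueueArgs());
    }
}
```

With the amqp-client library this map is the last parameter of queue declaration, e.g. `channel.queueDeclare("main", true, false, false, mainQueueArgs())`.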

Example 4: Strategy summary

Step  Action
1     Consumer fails → do not commit (or NACK).
2     Retry in place or send to retry queue with delay (backoff).
3     Increment retry count (header or store).
4     If retries < max: requeue or republish to retry queue. If retries ≥ max: send to DLQ, then commit (so the main flow continues).
5     Monitor DLQ; fix and replay or discard.
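Step 5 ("fix and replay") is usually a small operational tool that drains the DLQ back into the main queue once the bug or bad data is fixed. A minimal sketch with in-memory deques standing in for broker queues:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

public class DlqReplay {
    // Moves every message from the DLQ back to the main queue; returns the count.
    static int replay(Deque<String> dlq, Deque<String> mainQueue) {
        int moved = 0;
        while (!dlq.isEmpty()) {
            mainQueue.addLast(dlq.pollFirst()); // republish in original order
            moved++;
        }
        return moved;
    }

    public static void main(String[] args) {
        Deque<String> dlq = new ArrayDeque<>(List.of("m1", "m2"));
        Deque<String> mainQ = new ArrayDeque<>();
        System.out.println(replay(dlq, mainQ)); // 2
    }
}
```

Against a real broker, "move" means consume-from-DLQ + republish-to-original-destination (the original topic/queue name should be stored with the dead-lettered message), ideally with a dry-run or batch limit so a buggy replay cannot flood the main queue.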

Core Mechanism / Behavior

  • Backoff: Delay before retry reduces load on a failing dependency (e.g. DB down). Use exponential backoff (1s, 2s, 4s) or fixed delay. Implement with retry queue + TTL or with a delay topic/plugin.
  • Poison message: A message that always fails (bad format, bad data). Without DLQ and max retries, it would be retried forever and can block the consumer. DLQ isolates it.
  • Order: If you move to a retry topic and then back to main, order for that key can be lost. Accept reorder or use a single retry path that preserves partition/key.
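The poison-message behavior above can be demonstrated end to end: a message that always fails is moved to the DLQ after max retries, while healthy messages keep flowing. The queues, retry map, and the "starts with poison" failure check are all illustrative stand-ins, not broker API:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PoisonIsolation {
    static final int MAX_RETRIES = 3;

    // Processes every message; returns the successfully handled ones,
    // appending poison messages to dlq after MAX_RETRIES attempts.
    static List<String> run(List<String> input, List<String> dlq) {
        List<String> done = new ArrayList<>();
        Deque<String> queue = new ArrayDeque<>(input);
        Map<String, Integer> attempts = new HashMap<>();
        while (!queue.isEmpty()) {
            String msg = queue.pollFirst();
            boolean ok = !msg.startsWith("poison");   // stand-in for process(msg)
            if (ok) {
                done.add(msg);
            } else {
                int n = attempts.merge(msg, 1, Integer::sum);
                if (n >= MAX_RETRIES) dlq.add(msg);   // isolate: the queue keeps moving
                else queue.addLast(msg);              // requeue for another attempt
            }
        }
        return done;
    }

    public static void main(String[] args) {
        List<String> dlq = new ArrayList<>();
        System.out.println(run(List.of("a", "poison-1", "b"), dlq)); // [a, b]
        System.out.println(dlq);                                     // [poison-1]
    }
}
```

Without the `n >= MAX_RETRIES` branch this loop never terminates — which is exactly the "retried forever and can block the consumer" failure mode described above.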

Key Rules

  • Set a max retry count and move to DLQ after that. Commit (or ack) after moving to DLQ so the main queue is not stuck. Monitor DLQ depth and alert.
  • Prefer delayed retry (retry queue with TTL or delay topic) over immediate requeue so that transient failures (e.g. DB timeout) have time to recover. Make consumers idempotent so retries are safe.
  • Store retry count in message header or external store so you can enforce max retries across redeliveries. Include original topic and offset in DLQ message for replay.
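The last rule — include origin metadata in the DLQ message — amounts to wrapping the payload in a small envelope when publishing to the DLQ. The field names below are an illustrative convention, not a standard:

```java
import java.util.Map;

public class DlqEnvelope {
    // Everything a replay tool or an operator needs to triage and republish.
    static Map<String, Object> envelope(String topic, int partition, long offset,
                                        int retries, String error, String payload) {
        return Map.of(
            "origin.topic", topic,        // where to replay to
            "origin.partition", partition,
            "origin.offset", offset,      // pinpoints the exact source record
            "retry.count", retries,
            "error.message", error,       // last failure, for triage
            "payload", payload
        );
    }

    public static void main(String[] args) {
        System.out.println(
            envelope("orders", 3, 42L, 3, "timeout talking to DB", "{...}"));
    }
}
```

With Kafka these fields would typically travel as record headers rather than inside the payload, so the original message body stays byte-for-byte intact for replay.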

What's Next

See Idempotency in Message Consumers for safe retries. See Kafka Delivery Semantics for commit behaviour. See RabbitMQ Exchange/Queue for DLX and queue options.