Core settings

  • Set max.poll.interval.ms above the worst-case time to process one poll() batch; cap the batch size with max.poll.records.
  • Raise fetch.min.bytes and fetch.max.wait.ms to favor throughput; lower them to favor latency.
  • Set enable.auto.commit=false; commit (sync or async) only after the batch has been processed.
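
  The sizing rule above can be sketched as simple arithmetic. The helper name, the per-record timing, and the safety factor of 2 are illustrative assumptions, not recommended values; the property names are the ones used in the bullets.

```python
import math


def suggest_consumer_config(per_record_ms: float,
                            max_poll_records: int,
                            safety_factor: float = 2.0) -> dict:
    """Derive max.poll.interval.ms from an estimated per-record processing time.

    Hypothetical helper: safety_factor is an assumed headroom multiplier.
    """
    if per_record_ms <= 0 or max_poll_records <= 0 or safety_factor < 1:
        raise ValueError("inputs must be positive and safety_factor >= 1")
    # Worst case: every record in a full batch takes per_record_ms.
    worst_case_batch_ms = per_record_ms * max_poll_records
    return {
        "max.poll.records": max_poll_records,
        "max.poll.interval.ms": math.ceil(worst_case_batch_ms * safety_factor),
        "enable.auto.commit": False,
    }


# 500 records at roughly 20 ms each, with 2x headroom.
config = suggest_consumer_config(per_record_ms=20, max_poll_records=500)
```

  With these assumed inputs a full batch takes about 10 s, so the interval comes out at 20 s. Measure real processing time before settling on a value.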

Concurrency

  • Scale out by adding consumer instances to the group (up to the partition count) rather than setting a very large max.poll.records.
  • For CPU-bound steps, hand work off to a bounded executor; avoid blocking the poll thread.

Ordering & retries

  • Keep partition affinity when ordering matters; route poison messages to a dead-letter topic (DLT).
  • Backoff with jitter on retries; limit attempts per message.
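
  The retry policy above could look roughly like this. It is a sketch under assumptions: the "full jitter" variant of exponential backoff, a limit of 3 attempts, and a caller-supplied send_to_dlt callback are all illustrative choices. Retrying in place preserves order within the partition, but the total sleep time has to stay well below max.poll.interval.ms.

```python
import random
import time
from typing import Any, Callable


def backoff_delay(attempt: int, base_s: float = 0.1, cap_s: float = 10.0) -> float:
    """Full jitter: uniform delay in [0, min(cap_s, base_s * 2**attempt)]."""
    return random.uniform(0.0, min(cap_s, base_s * (2 ** attempt)))


def process_with_retries(record: Any,
                         handler: Callable[[Any], None],
                         send_to_dlt: Callable[[Any, Exception], None],
                         max_attempts: int = 3,
                         sleep: Callable[[float], None] = time.sleep) -> bool:
    """Try handler up to max_attempts times, then hand the record to the DLT.

    Returns True if the handler succeeded, False if the record was dead-lettered.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Exception = RuntimeError("unreachable")
    for attempt in range(max_attempts):
        try:
            handler(record)
            return True
        except Exception as exc:
            last_error = exc
            # Sleep only if another attempt will follow.
            if attempt + 1 < max_attempts:
                sleep(backoff_delay(attempt))
    send_to_dlt(record, last_error)
    return False
```

  Injecting sleep makes the policy testable without real delays; in production the default time.sleep applies.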

Observability

  • Metrics: lag per partition, commit latency, rebalances, processing time, error rates.
  • On errors, log the topic, partition, and offset; record batch sizes per poll.
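
  As a rough illustration of the lag metric and the error log line: lag per partition is the log end offset minus the committed offset (the committed offset being the next record to read). The function names and the sample offsets are made up for this sketch; how the offsets are fetched depends on the client library and is left out.

```python
import logging
from typing import Dict, Optional, Tuple

TopicPartition = Tuple[str, int]  # (topic, partition)

log = logging.getLogger("consumer")


def partition_lag(end_offsets: Dict[TopicPartition, int],
                  committed: Dict[TopicPartition, int]
                  ) -> Dict[TopicPartition, Optional[int]]:
    """Lag per partition; None when no committed offset is known."""
    lag: Dict[TopicPartition, Optional[int]] = {}
    for tp, end in end_offsets.items():
        position = committed.get(tp)
        lag[tp] = None if position is None else max(0, end - position)
    return lag


def log_processing_error(topic: str, partition: int, offset: int,
                         exc: Exception) -> None:
    # Include enough context to find the exact record again.
    log.error("processing failed topic=%s partition=%d offset=%d error=%r",
              topic, partition, offset, exc)


# Sample offsets, invented for illustration.
lag = partition_lag(
    end_offsets={("orders", 0): 120, ("orders", 1): 80},
    committed={("orders", 0): 100},
)
```

  Here partition 0 has a lag of 20 and partition 1 reports None because nothing has been committed for it yet.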

Checklist

  • Poll loop never blocks; work delegated to bounded pool.
  • Commits after successful processing; DLT in place.
  • Lag and rebalance metrics monitored.