
Kafka Reliability Playbook
Producers acks=all, min.insync.replicas>=2, idempotent producer on; enable transactions for exactly-once pipelines. Tune batch.size/linger.ms for throughput; cap max.in.flight.requests.per.connection for ordering. Handle retries with backoff; surface delivery errors. Consumers enable.auto.commit=false; commit after processing; DLT for poison messages. Size max.poll.interval.ms to work time; bound max.poll.records. Isolate heavy work in worker pool; keep poll loop fast. Topics & brokers RF ≥ 3; clean-up policy fit (delete vs compact); segment/retention sized to storage. Monitor ISR, under-replicated partitions, controller changes, request latency, disk usage. Throttle large produce/fetch; use quotas per client if needed. Checklist Producers idempotent, acks=all, MISR set. Consumers manual commit + DLT. RF/retention sized; ISR/URP monitored. Alerts on broker latency/disk/replication health.