Reliability

Go HTTP Timeouts & Resilience Defaults

Client defaults
- Set Timeout on http.Client as an overall cap per request; configure the Transport with a DialContext timeout (e.g., 3s), TLSHandshakeTimeout (3s), ResponseHeaderTimeout (5s), IdleConnTimeout (90s), and MaxIdleConns/MaxIdleConnsPerHost.
- Retry only idempotent methods, with backoff + jitter; cap attempts.
- Use context.WithTimeout per request; call cancel on exit.

Server defaults
- Set ReadHeaderTimeout (e.g., 5s) to mitigate slowloris.
- Set ReadTimeout/WriteTimeout to bound how long reading the request and writing the response may take (align with business SLAs).
- Set IdleTimeout to recycle idle keep-alive connections; prefer HTTP/2 when available.

Patterns
- Wrap handlers with middleware that enforces a deadline and logs when a timeout hits.
- For upstreams, expose metrics: connect latency, TLS handshake, TTFB, retries.
- Prefer connection reuse; avoid per-request clients.

Checklist
- Timeouts set on both client and server.
- Retries limited to idempotent verbs, with jitter.
- Connection pooling tuned; idle connections reused.
- Metrics for latency stages and timeouts.

Kafka Reliability Playbook

Producers
- acks=all, min.insync.replicas>=2 (a topic/broker setting), idempotent producer on; enable transactions for exactly-once pipelines.
- Tune batch.size/linger.ms for throughput; cap max.in.flight.requests.per.connection to preserve ordering (at most 5 with idempotence enabled).
- Handle retries with backoff; surface delivery errors.

Consumers
- enable.auto.commit=false; commit after processing; route poison messages to a dead-letter topic (DLT).
- Size max.poll.interval.ms to the worst-case processing time of one poll; bound max.poll.records.
- Isolate heavy work in a worker pool; keep the poll loop fast.

Topics & brokers
- RF ≥ 3; cleanup.policy chosen to fit the use case (delete vs compact); segment/retention sized to storage.
- Monitor ISR, under-replicated partitions (URP), controller changes, request latency, disk usage.
- Throttle large produce/fetch; use quotas per client if needed.

Checklist
- Producers idempotent, acks=all, min.insync.replicas set.
- Consumers use manual commit + DLT.
- RF/retention sized; ISR/URP monitored.
- Alerts on broker latency/disk/replication health.

DevOps Incident Response Playbook

During incident
- Roles: incident commander, comms lead, ops/feature SMEs, scribe.
- Declare severity quickly; open shared channel/bridge; timestamp actions.
- Stabilize first: roll back, feature-flag off, scale up, or shed load.

Runbooks & tooling
- Prebuilt runbooks per service: restart/rollback steps, dashboards, logs, feature flags.
- One-click access to dashboards (metrics, traces, logs), recent deploys, and toggles.
- Paging rules with escalation; avoid noisy alerts.

Comms
- Single source of truth: incident doc; external status page if needed.
- Regular updates with impact, scope, mitigation, ETA.

After incident
- Blameless postmortem; timeline, root causes, contributing factors.
- Action items with owners/deadlines; track to completion.
- Add tests/alerts/runbook updates; reduce time-to-detect and time-to-recover.

Circuit Breakers with Resilience4j

Core settings
- Sliding window (count- or time-based), failure rate threshold, slow-call threshold, minimum number of calls.
- Wait duration in open state; permitted calls in half-open state; automatic transition from open to half-open.

Patterns
- Wrap HTTP/DB/queue clients; combine with timeouts/retries/bulkheads.
- Tune per dependency; differentiate fast-fail vs. tolerant paths.
- Provide a fallback only when it is safe/idempotent.

Observability
- Export metrics: state changes, calls/success/failure/slow, not-permitted count.
- Log state transitions; add exemplars linking to traces.
- Alert on frequent open/half-open oscillation.

Checklist
- Per-downstream breaker with tailored thresholds.
- Timeouts and retries composed correctly (innermost to outermost: timeout → breaker → retry).
- Metrics/logs/traces wired; alerts on open rate.

Hardening gRPC Services in Go

Deadlines & retries
- Require client deadlines; honor the request context on the server and handle codes.DeadlineExceeded.
- Configure retry/backoff on idempotent calls; avoid retry storms with jitter + max attempts.

Interceptors
- Unary/stream interceptors for auth, metrics (Prometheus), logging, and panic recovery.
- Use per-RPC circuit breakers and rate limits for critical dependencies.

TLS & auth
- Enable TLS everywhere; prefer mTLS for internal services.
- Rotate certs automatically; watch expiry metrics.
- Add authz checks in interceptors; propagate identity via metadata.

Resource protection
- Limit concurrent streams and max message sizes.
- Use bounded worker pools for handlers performing heavy work.
- Tune keepalive to detect dead peers without flapping.

Observability
- Metrics: latency, error codes, message sizes, active streams, retries.
- Traces: annotate methods, peer info, attempt counts; sample smartly.
- Logs: structured fields for method, code, duration, peer.

Checklist
- Deadlines required; retries only for idempotent calls, with backoff.
- Interceptors for auth/metrics/logging/recovery.
- TLS/mTLS enabled; cert rotation automated.
- Concurrency and message limits set; keepalive tuned.