Observability

CI/CD Pipeline Observability & Guardrails

Metrics Lead time, MTTR, change failure rate, deploy frequency. Stage timing (queue, build, test, deploy); flake rate; retry counts. Tracing & logs Trace pipeline executions with build SHA, branch, trigger source; annotate stage spans. Structured logs with status, duration, infra node; keep artifacts linked. Guardrails Quality gates (tests, lint, security scans) per PR; fail fast on criticals. Retry budget per job to avoid infinite flake loops. Rollback hooks + auto-stop on repeated failures. Ops Parallelize where safe; cache dependencies; pin tool versions. Alert on SLA breaches (queue time, total duration) and rising flake rates. Keep dashboards per repo/team; trend regressions release to release.

Spring Boot Observability: Metrics, Traces, Logs

Metrics Use Micrometer + Prometheus: management.endpoints.web.exposure.include=prometheus,health,info. Add JVM+Tomcat/db pool meters; set percentiles for latencies. Create SLIs: request latency, error rate, saturation (threads/connections), GC pauses. Traces Spring Boot 3 ships with OTel starter: add spring-boot-starter-actuator + micrometer-tracing-bridge-otel + exporter (OTLP/Zipkin/Jaeger). Propagate headers (traceparent); ensure async executors use ContextPropagatingExecutor. Sample smartly: lower rates on noisy paths; raise for errors. Logs Use JSON layout; include traceId/spanId for correlation. Avoid verbose INFO in hot paths; keep payload size bounded. Dashboards & alerts Latency/error SLO dashboards per endpoint. DB pool saturation, thread pool queue depth, GC pause, heap used %, 5xx rate. Alerts on SLO burn rates; include exemplars linking metrics → traces → logs. Checklist Actuator endpoints secured and exposed only where needed. OTLP exporter configured; sampling tuned. Trace/log correlation verified in staging. Dashboards + alerts reviewed with oncall.

Go Profiling in Production with pprof

Capturing safely Expose /debug/pprof behind auth/VPN; or run curl -sK -H "Authorization: Bearer ...". CPU profile: go tool pprof http://host/debug/pprof/profile?seconds=30. Heap profile: .../heap; Goroutines: .../goroutine?debug=2. For containers: kubectl port-forward then grab profiles; avoid prod CPU throttle when profiling. Reading CPU profiles Look at flat vs. cumulative time; identify hot functions. Flamegraph: go tool pprof -http=:8081 cpu.pprof. Check GC activity and syscalls; watch for mutex contention. Reading heap profiles Compare live allocations vs. in-use objects; watch large []byte and map growth. Look for leaks via rising heap over time; diff profiles between runs. Goroutine dumps Spot leaked goroutines (blocked on channel/lock/I/O). Common culprits: missing cancel, unbounded worker creation, stuck time.After. Best practices Add pprof only when needed in prod; default on in staging. Sample under load close to real traffic. Keep artifacts: store profiles with build SHA + timestamp; compare after releases. Combine with metrics (alloc rate, GC pauses, goroutines) to validate fixes.