SRE AI Copilot is a Python backend service that provides AI-driven incident response for Kubernetes. It receives alerts from Prometheus Alertmanager via webhooks and runs each one through an agent pipeline: analysis, hypothesis generation, critique, fix proposal, and risk assessment. The system is built with FastAPI and Celery, uses Large Language Models (LLMs), and enforces Kubernetes namespace guardrails for safety.
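As a rough illustration of the webhook-to-pipeline flow, the sketch below filters an Alertmanager webhook body down to firing alerts in permitted namespaces and then walks the stages in order. It uses only the standard library rather than FastAPI or Celery. The `ALLOWED_NAMESPACES` allow-list, the `Incident` class, and both function names are hypothetical, and the stage labels merely paraphrase the paragraph above; only the payload fields (`alerts`, `status`, `labels`, `fingerprint`) come from Alertmanager's webhook format.

```python
from dataclasses import dataclass

# Hypothetical allow-list: the project's real guardrail configuration is not shown here.
ALLOWED_NAMESPACES = frozenset({"staging", "payments"})

# Stage labels paraphrased from the description; the real stage implementations are not shown.
PIPELINE_STAGES = ("analysis", "hypothesis", "critique", "fix_proposal", "risk_assessment")


@dataclass(frozen=True)
class Incident:
    fingerprint: str
    alertname: str
    namespace: str


def incidents_from_webhook(payload: dict) -> list[Incident]:
    """Keep only firing alerts whose namespace label passes the guardrail."""
    incidents = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved alerts do not start a pipeline in this sketch
        labels = alert.get("labels", {})
        namespace = labels.get("namespace", "")
        if namespace not in ALLOWED_NAMESPACES:
            continue  # guardrail: ignore namespaces outside the allow-list
        incidents.append(
            Incident(
                fingerprint=alert.get("fingerprint", ""),
                alertname=labels.get("alertname", "unknown"),
                namespace=namespace,
            )
        )
    return incidents


def run_pipeline(incident: Incident) -> list[str]:
    """Placeholder pipeline: returns a trace of the stages in execution order."""
    return [f"{stage}:{incident.alertname}" for stage in PIPELINE_STAGES]
```

In a real deployment the stage loop would presumably be dispatched as background tasks rather than run inline; the trace list here only makes the ordering visible.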
Key features include fingerprint deduplication, which prevents the pipeline from re-running for alerts that are still firing; flapping detection for recurring issues; and a `DiagnosticsEngine` that produces a typed `FactStore` before any LLM call. The service enriches incident context with cluster-wide health snapshots, Jira tickets, TeamCity deploys, and VictoriaMetrics data. For remediation, it generates `ExecutionIntent` JSON objects, which can be dry-run against the Kubernetes API server and, after human approval via Discord, executed with full OpenTelemetry audit trails.
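One way fingerprint deduplication and flapping detection can fit together is sketched below. This is an assumption about the general technique, not the project's implementation: the `AlertTracker` class, its thresholds, and the returned decision strings are all hypothetical. An alert that is already active is treated as a duplicate, and an alert that re-fires too many times inside a sliding window is flagged as flapping.

```python
from collections import deque


class AlertTracker:
    """Hypothetical in-memory tracker keyed by alert fingerprint."""

    def __init__(self, flap_threshold: int = 3, flap_window_s: float = 600.0):
        self.flap_threshold = flap_threshold  # fire events within the window that count as flapping
        self.flap_window_s = flap_window_s    # sliding window length in seconds
        self._active: set[str] = set()
        self._fire_times: dict[str, deque] = {}

    def observe(self, fingerprint: str, status: str, now: float) -> str:
        """Return 'run', 'duplicate', 'flapping', or 'resolved' for one alert event."""
        if status == "resolved":
            self._active.discard(fingerprint)
            return "resolved"
        if fingerprint in self._active:
            return "duplicate"  # still firing: do not start another pipeline
        self._active.add(fingerprint)
        times = self._fire_times.setdefault(fingerprint, deque())
        times.append(now)
        # Drop fire events that have fallen out of the sliding window.
        while times and now - times[0] > self.flap_window_s:
            times.popleft()
        if len(times) >= self.flap_threshold:
            return "flapping"  # fired repeatedly within the window
        return "run"
```

A production version would likely keep this state in a shared store so that several workers see the same history; the in-memory set and dict are only for illustration.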
The latest release, v0.11.0 (Wave 7), focuses on 'Topology Expansion': adding new sources to the Knowledge Graph. These are runtime correlation between PodEvents and ServiceEdges, declarative parsing of Kubernetes Service and Ingress topology, and a parser that extracts NATS subjects from a monorepo. Together they are intended to make the system's model of the Kubernetes environment both more accurate and broader in coverage.
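To make the declarative topology idea concrete, here is a minimal sketch of how Service and Ingress manifests could be turned into graph edges. It assumes manifests have already been loaded into dictionaries (for example from YAML) and follows the `networking.k8s.io/v1` Ingress layout. The function name, the node naming scheme, and the `selects` / `routes_to` relation labels are hypothetical and do not reflect the project's actual Knowledge Graph schema.

```python
def edges_from_manifests(manifests: list[dict]) -> set[tuple[str, str, str]]:
    """Derive (source, relation, target) edges from Service and Ingress manifests."""
    edges: set[tuple[str, str, str]] = set()
    for manifest in manifests:
        kind = manifest.get("kind")
        meta = manifest.get("metadata", {})
        namespace = meta.get("namespace", "default")
        name = meta.get("name", "")
        spec = manifest.get("spec", {})

        if kind == "Service":
            # A Service points at the pods matched by its label selector.
            selector = spec.get("selector") or {}
            if selector:
                labels = ",".join(f"{k}={v}" for k, v in sorted(selector.items()))
                edges.add((f"Service/{namespace}/{name}", "selects", f"Pods/{namespace}/{labels}"))

        elif kind == "Ingress":
            # An Ingress routes to every Service named in its HTTP path backends.
            for rule in spec.get("rules", []):
                for path in (rule.get("http") or {}).get("paths", []):
                    service = (path.get("backend") or {}).get("service") or {}
                    if service.get("name"):
                        edges.add(
                            (f"Ingress/{namespace}/{name}", "routes_to",
                             f"Service/{namespace}/{service['name']}")
                        )
    return edges
```

Returning a set of plain tuples keeps the sketch independent of any graph library; the same edges could be loaded into whatever store backs the Knowledge Graph.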