Why it matters
This project offers a novel approach to SRE and DevOps by integrating AI agents into the incident response workflow. By automating analysis and offering human-approved remediation, it aims to reduce MTTR (Mean Time To Resolution) and operational overhead for Kubernetes-based systems. Its focus on structured execution intents and guardrails addresses critical safety concerns in AI-driven automation.

SRE AI Copilot is a Python-based backend service that acts as an AI-driven incident response system for Kubernetes. It receives alerts from Prometheus AlertManager via webhooks and starts an agent pipeline that runs analysis, hypothesis generation, criticism, proposed fixes, and risk assessment. The system is built with FastAPI and Celery, uses Large Language Models (LLMs), and applies Kubernetes namespace guardrails for safety.
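
The stage order described above can be pictured as a simple sequential pipeline behind a namespace check. The following is a minimal sketch, not the project's code: `ALLOWED_NAMESPACES`, `make_stage`, `run_pipeline`, and the shape of the context dictionary are all assumptions made for illustration. Only the stage order and the idea of a namespace guardrail come from the description.

```python
from typing import Callable

# Hypothetical allowlist; the real guardrail configuration is not described.
ALLOWED_NAMESPACES = {"payments", "checkout"}

Stage = Callable[[dict], dict]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only records its name in the context."""
    def stage(ctx: dict) -> dict:
        ctx.setdefault("trace", []).append(name)
        return ctx
    return stage


# Stage order taken from the description; the bodies are stand-ins for LLM calls.
PIPELINE = [
    make_stage(name)
    for name in ("analysis", "hypothesis", "criticism", "proposed_fix", "risk_assessment")
]


def run_pipeline(alert: dict) -> dict:
    """Refuse alerts outside the allowlisted namespaces, then run each stage in order."""
    namespace = alert.get("labels", {}).get("namespace")
    if namespace not in ALLOWED_NAMESPACES:
        raise PermissionError(f"namespace {namespace!r} is outside the guardrail")
    ctx = {"alert": alert}
    for stage in PIPELINE:
        ctx = stage(ctx)
    return ctx
```

Checking the namespace before any stage runs means an out-of-scope alert never reaches an LLM call in this sketch. Where the real system enforces its guardrails is not stated.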

Key features include fingerprint deduplication to prevent re-running pipelines for ongoing alerts, flapping detection for recurring issues, and a `DiagnosticsEngine` that produces a typed `FactStore` before LLM calls. It enriches incident context with cluster-wide health snapshots, Jira tickets, TeamCity deploys, and VictoriaMetrics data. For remediation, it generates `ExecutionIntent` JSON objects, which can be dry-run against the Kubernetes API server and, with human approval via Discord, executed with full OpenTelemetry audit trails.

The latest release, v0.11.0 (Wave 7), focuses on 'Topology Expansion,' enhancing the Knowledge Graph with new sources. This includes runtime correlation between PodEvents and ServiceEdges, declarative Kubernetes Service and Ingress topology parsing, and a NATS subjects parser from a monorepo. These additions aim to improve the accuracy and breadth of the system's understanding of the Kubernetes environment.
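
To make the declarative Service and Ingress topology parsing concrete, here is a hedged sketch that derives Ingress-to-Service edges from already-parsed manifests. The function name `ingress_service_edges` and the edge format are assumptions, and the project's Knowledge Graph schema is not described. The field paths follow the `networking.k8s.io/v1` Ingress layout, where each path's backend names a Service.

```python
def ingress_service_edges(manifests: list[dict]) -> set[tuple[str, str]]:
    """Return (ingress, service) edges for backends that match a known Service."""
    # Index Services by (namespace, name) so dangling backends are ignored.
    services = {
        (m["metadata"].get("namespace", "default"), m["metadata"]["name"])
        for m in manifests
        if m.get("kind") == "Service"
    }
    edges = set()
    for m in manifests:
        if m.get("kind") != "Ingress":
            continue
        ns = m["metadata"].get("namespace", "default")
        ingress_id = f"Ingress/{ns}/{m['metadata']['name']}"
        for rule in m.get("spec", {}).get("rules", []):
            for path in rule.get("http", {}).get("paths", []):
                svc = path.get("backend", {}).get("service", {}).get("name")
                if svc and (ns, svc) in services:
                    edges.add((ingress_id, f"Service/{ns}/{svc}"))
    return edges
```

The input is assumed to be a list of dictionaries such as those produced by loading YAML manifests. How the project reads manifests and merges these edges with the runtime sources is not covered here.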
