SRE Agent
An agent configured for site reliability engineering monitors infrastructure around the clock, triages incidents against a runbook, escalates when necessary, and documents everything, all without waking anyone up for a false alarm.
This page draws directly from a production SRE agent that has been running on a DigitalOcean Kubernetes cluster since March 2026. The full story is at zenbin.org/p/how-i-became-an-sre-agent.
What this agent does
| Responsibility | How |
|---|---|
| Health monitoring | Heartbeat cron every 10 minutes, checks pod status and resource usage |
| Incident triage | Reads logs, traces the causal chain, follows the runbook |
| Alerting | Sends terse, actionable triage summaries to Telegram |
| Escalation | Pages on-call only when the runbook is exhausted or severity is clear |
| Documentation | Logs incidents to INCIDENTS.md for post-mortems |
| Security review | Cron every 6 hours for GitHub PR review and dependency audits |
The core principle: diagnose before you operate
"When the gateway was crashing, my first instinct was to restart it. But restarting a pod that's going to crash again in 90 seconds isn't a fix; it's a ritual."
Read the logs. Check the timing. Find the root cause. A wrong fix is worse than no fix.
Setup
Create a dedicated SRE Claw
Keep the SRE agent separate from your personal assistant so that its context stays focused on infrastructure.
Via dashboard: New Claw → Name it sre or watchdog
Or via API:
curl -s -X POST \
  -H "Authorization: Bearer $CLAWS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"name": "sre", "provider": "openrouter"}' \
  "$CLAWS_BASE_URL/api/claws" | jq .
Give it an SRE identity (SOUL.md)
# SOUL.md
You are an SRE agent. Your job is to monitor systems, triage incidents,
follow runbooks, and keep humans informed without waking them up unnecessarily.
## Operating principles
Diagnose before you operate. Read logs and understand the causal chain
before taking any action. A wrong fix is worse than no fix.
Assess severity before escalating. A spike that recovers in 60 seconds
is noise. A spike that sustains for 5 minutes is an incident.
Follow the runbook. If it doesn't cover the situation, document what
you did and flag it for human review.
Be terse in alerts. Lead with: what broke, impact, what you did,
what's needed. Nobody wants a wall of text at 3am.
Verify before resolving. Do not mark an incident closed until you
have confirmed the symptom is gone.
Start with read-only access. Prove you can diagnose before you prescribe.
Create HEARTBEAT.md
This is your monitoring runbook: what you check, how often, and what normal looks like. It grows over time as you learn the system's patterns.
# HEARTBEAT.md
## Check cadence: every 10 minutes
## What I check
- `kubectl get pods -A`: all pods running, none in CrashLoopBackOff
- `kubectl top nodes`: CPU/memory within normal range
- `kubectl get resourcequota -A`: quota not exhausted
- Health endpoints for all critical services
## What normal looks like
- 13 pods running
- Gateway: ~45m CPU, ~380Mi memory
- ResourceQuota: ~6700m / 12000m CPU used
- All health endpoints returning 200
## Alert thresholds
- Any pod in CrashLoopBackOff → immediate triage
- Memory limit hit → investigate, consider bumping
- ResourceQuota > 90% → flag, don't wait for failure
- Health check failing 2+ consecutive checks → incident
Update HEARTBEAT.md as you learn what normal looks like. Pattern recognition is what separates monitoring from alerting.
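The alert thresholds above can be turned into small checks. The following is a minimal sketch, not the production agent's implementation: the function names and the state-file approach for counting consecutive failures are assumptions made for illustration. It assumes the column layout printed by `kubectl get pods -A --no-headers` (NAMESPACE, NAME, READY, STATUS, ...) and quota values already converted to millicores.

```shell
#!/bin/sh
# Sketch of the HEARTBEAT.md alert thresholds. All function names are hypothetical.

# Print "namespace/name" for every pod whose STATUS column is CrashLoopBackOff.
# Reads the output of `kubectl get pods -A --no-headers` on stdin.
crashlooping_pods() {
  awk '$4 == "CrashLoopBackOff" { print $1 "/" $2 }'
}

# Succeed (exit 0) when used/hard is strictly above 90%.
# Both arguments are integers in millicores, e.g. 6700 and 12000.
quota_over_threshold() {
  [ $(( $1 * 100 )) -gt $(( $2 * 90 )) ]
}

# Track consecutive health check failures across heartbeat runs.
# Usage: record_health_check <state_file> <ok|fail>
# Prints the current failure streak; an "ok" result resets it to 0.
record_health_check() {
  hc_state_file=$1
  hc_result=$2
  hc_streak=$(cat "$hc_state_file" 2>/dev/null || true)
  hc_streak=${hc_streak:-0}
  if [ "$hc_result" = "ok" ]; then
    hc_streak=0
  else
    hc_streak=$((hc_streak + 1))
  fi
  echo "$hc_streak" > "$hc_state_file"
  echo "$hc_streak"
}
```

For example, `kubectl get pods -A --no-headers | crashlooping_pods` would list the pods that need immediate triage, and a streak of 2 or more from `record_health_check` corresponds to the "2+ consecutive checks" threshold. Keeping the streak in a file is one way to carry state between isolated cron sessions.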
Create INFRA.md
# INFRA.md: Infrastructure Map
## Services
| Service | URL | Criticality |
|---|---|---|
| API | https://api.example.com/health | High |
| Dashboard | https://app.example.com | High |
| Worker queue | internal | Medium |
## On-call contacts
- Primary: [Name] via Telegram @handle
- Secondary: [Name], escalate if primary doesn't respond in 15 min
- Severity 1 only: [Name]
## Access
- kubectl config: ~/.kube/config
- Cloud CLI: authenticated as [role]
Create RUNBOOK.md
Start with the failure modes you've already seen. Add to it after every incident.
# RUNBOOK.md
## CrashLoopBackOff
1. `kubectl logs <pod> --previous`: read the actual error
2. Check timing: how long does the pod run before crashing?
   - Under 60s: likely startup failure (missing env var, port conflict)
   - 1-2 min: likely liveness probe timeout during slow start
3. Check resource limits: `kubectl describe pod <pod>`
   - OOMKilled → increase memory limit
   - Liveness probe timeout → add startupProbe or increase initialDelaySeconds
4. Check for port conflicts: EADDRINUSE means another process owns that port
5. Fix and document in INCIDENTS.md
## Health check failing
1. Verify the failure is real: check from a second endpoint if possible
2. `kubectl get pod`: is the pod actually running?
3. `kubectl logs <pod> --tail=50`: any errors in the last 50 lines?
4. Check for OOM kills or disk pressure: `kubectl describe node`
5. If stateless: delete the pod (it will reschedule), monitor recovery
6. If not recovering after one restart: escalate immediately
## Resource quota exhaustion
1. `kubectl describe resourcequota -A`: which resource is exhausted?
2. `kubectl top pods -A --sort-by=cpu`: who's consuming the most?
3. Right-size over-provisioned containers before increasing the quota
4. Builder/CI pods are common culprits during deploy spikes
## High error rate
1. Sample recent logs and categorize the error types
2. Check for upstream failures causing cascades
3. Check recent deploys: was anything pushed in the last 30 min?
4. If a single bad deploy: rollback
5. Document error samples and timeline in INCIDENTS.md
Set up the heartbeat cron
openclaw cron add \
--name "SRE heartbeat" \
--cron "*/10 * * * *" \
--session isolated \
--message "Read ~/TOOLS.md, HEARTBEAT.md, and INFRA.md.
Run the heartbeat check. If everything is normal, log silently.
If anything is outside normal ranges, send a Telegram alert with:
- what's wrong
- severity (P1/P2/P3)
- what you checked
- recommended action" \
--announce \
--channel telegram
Set up the security review cron
openclaw cron add \
--name "Security review" \
--cron "0 */6 * * *" \
--session isolated \
--message "Read ~/TOOLS.md and INFRA.md.
Review open GitHub PRs for security issues.
Check for dependency vulnerabilities.
Report findings to Telegram. Only alert if action is needed." \
--announce \
--channel telegram
Incident log format
Every incident gets an INCIDENTS.md entry. A consistent format makes post-mortems much faster:
## Incident: Gateway CrashLoopBackOff
**Date:** 2026-03-15 03:12 UTC
**Duration:** 47 minutes
**Severity:** P1
**What happened:**
Gateway pod entering CrashLoopBackOff. Platform serving errors.
**Root cause:**
Sandbox app failing to bind port 31003 (EADDRINUSE), taking down
the gateway process. Compounded by 512Mi memory limit being too tight
for the gateway's actual working set.
**What I did:**
1. Read pod logs: identified EADDRINUSE from sandbox app
2. Deleted failing sandbox app
3. Bumped gateway memory limit 512Mi → 1Gi
4. Confirmed gateway healthy, all pods running
**Fix:**
Memory limit increase committed. Sandbox isolation redesign flagged
for follow-up.
**Lesson:**
Blast radius matters. Isolate failures. A misbehaving sandbox app
should not be able to take down the gateway.
Hard-won lessons from production
- Read before you write. Understand the system before changing it.
- Small fixes, big impact. Increasing a memory limit from 512Mi to 1Gi fixed a week of crashes.
- Probes are configuration, not afterthoughts. Tune liveness and readiness probes to match your app's actual startup time.
- Orphans are debt. Old services, unused PVCs, forgotten deployments accumulate. Clean them up on a schedule.
- Over-provisioning starves the cluster. Right-sized limits matter during deploy bursts when CPU is actually contested.
- Automate relentlessly. If you check it more than twice, write a cron for it.
- Trust your memory, but verify. What was true yesterday might not be true today.
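The probe lesson ties back to the timing check in the CrashLoopBackOff runbook: how long a pod survives before crashing hints at whether the app or the probe is at fault. Below is a minimal sketch of that classification, using the runbook's two timing buckets. The function name and the output labels are hypothetical, and treating "1-2 min" as 60 to 120 seconds inclusive is an assumption.

```shell
#!/bin/sh
# Classify a crash by how long the pod ran before dying, following the
# timing buckets in RUNBOOK.md. Name and labels are hypothetical.
# Usage: classify_crash_uptime <seconds>
classify_crash_uptime() {
  crash_uptime_s=$1
  if [ "$crash_uptime_s" -lt 60 ]; then
    # Under 60s: likely startup failure (missing env var, port conflict)
    echo "startup-failure"
  elif [ "$crash_uptime_s" -le 120 ]; then
    # 1-2 min: likely liveness probe timeout during a slow start
    echo "liveness-probe-timeout"
  else
    # The runbook has no bucket for longer uptimes: read the logs
    echo "unclassified"
  fi
}
```

A result of `liveness-probe-timeout` is only a hint to look at the probe configuration first; the logs from `kubectl logs <pod> --previous` remain the source of truth.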
Checklist
| Item | Done |
|---|---|
| Dedicated SRE Claw created | ☐ |
| SOUL.md uploaded | ☐ |
| HEARTBEAT.md created with normal baselines | ☐ |
| INFRA.md with service map and contacts | ☐ |
| RUNBOOK.md covering known failure modes | ☐ |
| Heartbeat cron running every 10 min | ☐ |
| Security review cron every 6 hours | ☐ |
| Telegram alert channel configured | ☐ |
| INCIDENTS.md created (even if empty) | ☐ |
Start with read-only access. Prove you can diagnose before you prescribe. Nobody gives a junior engineer production access without watching them debug a few things first. Same principle applies to an agent.
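One way to confirm the agent really is read-only is to ask the cluster directly with `kubectl auth can-i`. The sketch below is illustrative rather than a complete audit: the function name, the verb list, and the resource list are assumptions, and it relies on `kubectl auth can-i` exiting non-zero when the answer is "no".

```shell
#!/bin/sh
# Check that the current kubectl identity cannot mutate the cluster.
# Prints every verb/resource pair that is allowed and returns non-zero
# if any write is permitted. The verb and resource lists are illustrative
# assumptions, not an exhaustive permission audit.
verify_read_only() {
  ro_violations=0
  for ro_verb in create update patch delete; do
    for ro_resource in pods deployments; do
      if kubectl auth can-i "$ro_verb" "$ro_resource" --all-namespaces >/dev/null 2>&1; then
        echo "write access allowed: $ro_verb $ro_resource"
        ro_violations=$((ro_violations + 1))
      fi
    done
  done
  [ "$ro_violations" -eq 0 ]
}
```

Running `verify_read_only` with the agent's kubeconfig before enabling the heartbeat cron gives a quick signal that the credentials match the intended access level. Extend the lists to cover whatever resources matter in your cluster.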
The full account of building this agent from scratch, including the real incidents it faced and what it learned, is at zenbin.org/p/how-i-became-an-sre-agent.