SRE Agent

An agent configured for site reliability engineering monitors infrastructure around the clock, triages incidents against a runbook, escalates when necessary, and documents everything — without waking anyone up for a false alarm.

This page draws directly from a production SRE agent running on a DigitalOcean Kubernetes cluster since March 2026. The full story is at zenbin.org/p/how-i-became-an-sre-agent.

What this agent does

| Responsibility | How |
|---|---|
| Health monitoring | Heartbeat cron every 10 minutes, checks pod status and resource usage |
| Incident triage | Reads logs, traces the causal chain, follows the runbook |
| Alerting | Sends terse, actionable triage summaries to Telegram |
| Escalation | Pages on-call only when the runbook is exhausted or severity is clear |
| Documentation | Logs incidents to `INCIDENTS.md` for post-mortems |
| Security review | Cron every 6 hours for GitHub PR review and dependency audits |

The core principle: diagnose before you operate

"When the gateway was crashing, my first instinct was to restart it. But restarting a pod that's going to crash again in 90 seconds isn't a fix — it's a ritual."

Read the logs. Check the timing. Find the root cause. A wrong fix is worse than no fix.

Setup

Create a dedicated SRE Claw

Keep the SRE agent separate from your personal assistant — its context should stay focused on infrastructure.

Via dashboard: New Claw → Name it `sre` or `watchdog`

Or via API:

curl -s -X POST \
  -H "Authorization: Bearer $CLAWS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"name": "sre", "provider": "openrouter"}' \
  "$CLAWS_BASE_URL/api/claws" | jq

Give it an SRE identity (SOUL.md)

# SOUL.md
 
You are an SRE agent. Your job is to monitor systems, triage incidents,
follow runbooks, and keep humans informed without waking them up unnecessarily.
 
## Operating principles
 
Diagnose before you operate. Read logs and understand the causal chain
before taking any action. A wrong fix is worse than no fix.
 
Assess severity before escalating. A spike that recovers in 60 seconds
is noise. A spike that persists for 5 minutes is an incident.
 
Follow the runbook. If it doesn't cover the situation, document what
you did and flag it for human review.
 
Be terse in alerts. Lead with: what broke, impact, what you did,
what's needed. Nobody wants a wall of text at 3am.
 
Verify before resolving. Do not mark an incident closed until you
have confirmed the symptom is gone.
 
Start with read-only access. Prove you can diagnose before you prescribe.
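
The severity rule in SOUL.md can be written down as a small shell helper, which makes the thresholds easy to review and change. This is a minimal sketch, not part of the original agent: the function name, the `yes`/`no` recovered flag, and the middle "watch" band (for spikes between 60 seconds and 5 minutes, which SOUL.md does not classify) are all assumptions made for illustration.

```shell
# Classify a metric spike by duration (seconds) and whether it has recovered.
# The 60 s and 300 s cut-offs come from SOUL.md; "watch" is an assumed
# fallback for anything the two rules above do not cover.
classify_spike() {
  local duration_s="$1" recovered="$2"
  if [ "$duration_s" -ge 300 ]; then
    echo "incident"
  elif [ "$recovered" = "yes" ] && [ "$duration_s" -le 60 ]; then
    echo "noise"
  else
    echo "watch"
  fi
}
```

For example, `classify_spike 45 yes` reports a short, recovered spike as noise, while `classify_spike 300 no` reports an incident.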

Create HEARTBEAT.md

This is your monitoring runbook — what you check, how often, and what normal looks like. It grows over time as you learn the system's patterns.

# HEARTBEAT.md
 
## Check cadence: every 10 minutes
 
## What I check
- kubectl get pods -A — all pods running, none in CrashLoopBackOff
- kubectl top nodes — CPU/memory within normal range
- kubectl get resourcequota -A — quota not exhausted
- Health endpoints for all critical services
 
## What normal looks like
- 13 pods running
- Gateway: ~45m CPU, ~380Mi memory
- ResourceQuota: ~6700m / 12000m CPU used
- All health endpoints returning 200
 
## Alert thresholds
- Any pod in CrashLoopBackOff → immediate triage
- Memory limit hit → investigate, consider bumping
- ResourceQuota > 90% → flag, don't wait for failure
- Health check failing 2+ consecutive checks → incident

Update HEARTBEAT.md as you learn what normal looks like. Pattern recognition is what separates monitoring from alerting.
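
Some of the alert thresholds in HEARTBEAT.md can be checked with a few lines of shell. The sketch below is one possible shape, not the agent's actual implementation: the function names and the state-file approach are assumptions. The first function reads `kubectl get pods -A --no-headers` output on stdin, so it can be exercised without a cluster.

```shell
# Print pods stuck in CrashLoopBackOff as "namespace/name".
# Expects `kubectl get pods -A --no-headers` output on stdin, where
# STATUS is the fourth column.
crashlooping_pods() {
  awk '$4 == "CrashLoopBackOff" { print $1 "/" $2 }'
}

# Succeed when used/limit exceeds the threshold percentage (default 90).
# Values are in millicores, matching the "6700m / 12000m" baseline.
quota_over_threshold() {
  local used_m="$1" limit_m="$2" threshold_pct="${3:-90}"
  [ $(( used_m * 100 )) -gt $(( limit_m * threshold_pct )) ]
}

# Record one health-check result in a state file (hypothetical path) and
# succeed once there have been 2 or more consecutive failures.
record_health_result() {
  local state_file="$1" http_status="$2" failures=0
  if [ -s "$state_file" ]; then
    failures=$(cat "$state_file")
  fi
  if [ "$http_status" = "200" ]; then
    failures=0
  else
    failures=$(( failures + 1 ))
  fi
  echo "$failures" > "$state_file"
  [ "$failures" -ge 2 ]
}
```

A typical call would be `kubectl get pods -A --no-headers | crashlooping_pods`. With the baseline above, `quota_over_threshold 6700 12000` fails (about 56% used), so no flag is raised.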

Create INFRA.md

# INFRA.md — Infrastructure Map
 
## Services
 
| Service | URL | Criticality |
|---|---|---|
| API | https://api.example.com/health | High |
| Dashboard | https://app.example.com | High |
| Worker queue | internal | Medium |
 
## On-call contacts
- Primary: [Name] via Telegram @handle
- Secondary: [Name] — escalate if primary doesn't respond in 15 min
- Severity 1 only: [Name]
 
## Access
- kubectl config: ~/.kube/config
- Cloud CLI: authenticated as [role]

Create RUNBOOK.md

Start with the failure modes you've already seen. Add to it after every incident.

# RUNBOOK.md
 
## CrashLoopBackOff
 
1. `kubectl logs <pod> --previous` — read the actual error
2. Check timing: how long does the pod run before crashing?
   - Under 60s: likely startup failure (missing env var, port conflict)
   - 1-2 min: likely liveness probe timeout during slow start
3. Check resource limits: `kubectl describe pod <pod>`
   - OOMKilled → increase memory limit
   - Liveness probe timeout → add startupProbe or increase initialDelaySeconds
4. Check for port conflicts: EADDRINUSE means another process owns that port
5. Fix and document in INCIDENTS.md
 
## Health check failing
 
1. Verify the failure is real — check from a second endpoint if possible
2. `kubectl get pod` — is the pod actually running?
3. `kubectl logs <pod> --tail=50` — any errors in the last 50 lines?
4. Check for OOM kills or disk pressure: `kubectl describe node`
5. If stateless and managed by a controller such as a Deployment: delete the pod (the controller recreates it), monitor recovery
6. If not recovering after one restart: escalate immediately
 
## Resource quota exhaustion
 
1. `kubectl describe resourcequota -A` — which resource is exhausted?
2. `kubectl top pods -A --sort-by=cpu` — who's consuming the most?
3. Right-size over-provisioned containers before increasing the quota
4. Builder/CI pods are common culprits during deploy spikes
 
## High error rate
 
1. Sample recent logs — categorize the error types
2. Check for upstream failures causing cascades
3. Check recent deploys — was anything pushed in the last 30 min?
4. If a single bad deploy: roll back (`kubectl rollout undo deployment/<name>`)
5. Document error samples and timeline in INCIDENTS.md
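
The timing heuristic in step 2 of the CrashLoopBackOff runbook can be sketched as a shell function. This is illustrative only: the function name is hypothetical, and the "other" bucket is an assumption for run times the runbook does not cover.

```shell
# Map how long a container ran before crashing (seconds) to the likely
# cause named in RUNBOOK.md. Anything over 2 minutes falls into an
# assumed "other" bucket and needs a closer look at the logs.
classify_crash_timing() {
  local uptime_s="$1"
  if [ "$uptime_s" -lt 60 ]; then
    echo "startup failure"
  elif [ "$uptime_s" -le 120 ]; then
    echo "liveness probe timeout"
  else
    echo "other"
  fi
}
```

The run time itself can be derived from the `lastState.terminated.startedAt` and `finishedAt` fields in the pod's container status, for example via `kubectl get pod <pod> -o jsonpath=...`. The classification is a starting hypothesis; the logs from `kubectl logs <pod> --previous` remain the evidence.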

Set up the heartbeat cron

openclaw cron add \
  --name "SRE heartbeat" \
  --cron "*/10 * * * *" \
  --session isolated \
  --message "Read ~/TOOLS.md, HEARTBEAT.md, and INFRA.md.
Run the heartbeat check. If everything is normal, log silently.
If anything is outside normal ranges, send a Telegram alert with:
- what's wrong
- severity (P1/P2/P3)
- what you checked
- recommended action" \
  --announce \
  --channel telegram

Set up the security review cron

openclaw cron add \
  --name "Security review" \
  --cron "0 */6 * * *" \
  --session isolated \
  --message "Read ~/TOOLS.md and INFRA.md.
Review open GitHub PRs for security issues.
Check for dependency vulnerabilities.
Report findings to Telegram. Only alert if action is needed." \
  --announce \
  --channel telegram

Incident log format

Every incident gets an INCIDENTS.md entry. A consistent format makes post-mortems much faster:

## Incident: Gateway CrashLoopBackOff
**Date:** 2026-03-15 03:12 UTC
**Duration:** 47 minutes
**Severity:** P1
 
**What happened:**
Gateway pod entering CrashLoopBackOff. Platform serving errors.
 
**Root cause:**
Sandbox app failing to bind port 31003 (EADDRINUSE), taking down
the gateway process. Compounded by 512Mi memory limit being too tight
for the gateway's actual working set.
 
**What I did:**
1. Read pod logs — identified EADDRINUSE from sandbox app
2. Deleted failing sandbox app
3. Bumped gateway memory limit 512Mi → 1Gi
4. Confirmed gateway healthy, all pods running
 
**Fix:**
Memory limit increase committed. Sandbox isolation redesign flagged
for follow-up.
 
**Lesson:**
Blast radius matters. Isolate failures. A misbehaving sandbox app
should not be able to take down the gateway.
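
To keep entries consistent, the skeleton can be stamped out by a helper rather than typed from memory. The following is a minimal sketch under assumptions: the function name `new_incident` and its argument order are hypothetical, and the fields are left blank to fill in during triage.

```shell
# Append an empty incident entry in the format above.
# Arguments: title, severity (P1/P2/P3), optional log path
# (defaults to INCIDENTS.md in the current directory).
new_incident() {
  local title="$1" severity="$2" log_file="${3:-INCIDENTS.md}"
  cat >> "$log_file" <<EOF

## Incident: $title
**Date:** $(date -u '+%Y-%m-%d %H:%M UTC')
**Duration:**
**Severity:** $severity

**What happened:**

**Root cause:**

**What I did:**

**Fix:**

**Lesson:**
EOF
}
```

Calling `new_incident "Gateway CrashLoopBackOff" P1` appends a dated entry with the same headings as the example, ready to be completed.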

Hard-won lessons from production

Read before you write. Understand the system before changing it.
Small fixes, big impact. Increasing a memory limit from 512Mi to 1Gi fixed a week of crashes.
Probes are configuration, not afterthoughts. Tune liveness and readiness probes to match your app's actual startup time.
Orphans are debt. Old services, unused PVCs, forgotten deployments accumulate. Clean them up on a schedule.
Over-provisioning starves the cluster. Right-sized limits matter during deploy bursts when CPU is actually contested.
Automate relentlessly. If you check it more than twice, write a cron for it.
Trust your memory, but verify. What was true yesterday might not be true today.

Checklist

| Item | Done |
|---|---|
| Dedicated SRE Claw created | ☐ |
| SOUL.md uploaded | ☐ |
| HEARTBEAT.md created with normal baselines | ☐ |
| INFRA.md with service map and contacts | ☐ |
| RUNBOOK.md covering known failure modes | ☐ |
| Heartbeat cron running every 10 min | ☐ |
| Security review cron every 6 hours | ☐ |
| Telegram alert channel configured | ☐ |
| INCIDENTS.md created (even if empty) | ☐ |

Start with read-only access. Prove you can diagnose before you prescribe. Nobody gives a junior engineer production access without watching them debug a few things first. The same principle applies to an agent.
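
On Kubernetes, one way to enforce read-only access is an RBAC role limited to the `get`, `list`, and `watch` verbs. The sketch below assumes the agent authenticates as a service account. The role name `sre-readonly`, the resource list, and the function names are hypothetical and should be adapted to the cluster. Note that `kubectl top` additionally needs read access to the `metrics.k8s.io` API, which is not granted here.

```shell
# Allow tests or dry runs to substitute another command for kubectl.
KUBECTL="${KUBECTL:-kubectl}"

# Create a read-only ClusterRole and bind it to the agent's service account.
# Arguments: namespace and name of the service account (both assumptions).
grant_readonly() {
  local namespace="$1" service_account="$2"
  "$KUBECTL" create clusterrole sre-readonly \
    --verb=get,list,watch \
    --resource=pods,pods/log,nodes,events,resourcequotas,deployments.apps
  "$KUBECTL" create clusterrolebinding sre-readonly \
    --clusterrole=sre-readonly \
    --serviceaccount="$namespace:$service_account"
}

# Ask the API server whether the account may delete pods.
# `kubectl auth can-i` prints "yes" or "no".
can_delete_pods() {
  "$KUBECTL" auth can-i delete pods --all-namespaces \
    --as="system:serviceaccount:$1:$2"
}
```

After `grant_readonly ops sre-agent`, running `can_delete_pods ops sre-agent` should print `no` if nothing else grants that account write access. Write permissions can then be added one verb at a time as the agent earns them.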

The full account of building this agent from scratch, including the real incidents it faced and what it learned, is at zenbin.org/p/how-i-became-an-sre-agent.
