Building a Local Agentic Ops Monitor - Hourly Log Triage That Opens Issues (and Sometimes PRs)

Production logs tell you something is wrong. Dashboards tell you how wrong. What they don’t do is open the ticket, attach the right context, email you when it’s actually urgent, or draft a fix.

We already built read-only AI investigations for when a human asks “what happened to order X?” That pattern is great for ad-hoc RCA, but it doesn’t run on a schedule, and it’s deliberately locked down so the agent can’t touch the repo.

This post is about the complement: a local agentic loop that runs every hour on a developer machine. It queries Grafana (Loki + Prometheus), including the Kubernetes cluster data you already ship there, triages findings into Linear, emails on critical severity, and, once you trust it, opens guarded pull requests for low-risk fixes.

The whole thing is a thin shell script plus a prompt file. No new microservice. No Hangfire job. Just cursor agent --print, MCP, and macOS launchd.

Why local instead of cloud?

We considered three options:

Approach	Pros	Cons
Grafana alerting only	Fast, native	Alerts ≠ triage; no codebase context; noisy
Cursor Cloud Automation	Always on	Less repo/shell control; MCP catalog constraints
Local agent loop	Full git + bash; your MCP config; phased rollout	Mac must be awake; you own the scheduler

We picked local because the valuable end state includes write-capable fixes in a specific GitHub repo, with explicit allowlists and blocklists. A headless Cursor Agent on your laptop gets the same tools you use in the editor—Grafana MCP, Linear MCP, gh, Mailgun—without standing up another runner.

The trade-off is honest: if your laptop sleeps, you miss an hour. That’s acceptable when each run looks back 60–90 minutes anyway.

Architecture at a glance

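Six pieces, all in your application repo. The scripts, prompt, and state live under something like scripts/OpsHealthAgent/; the Cursor rule and MCP config live under .cursor/ at the repo root: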

run-hourly.sh — loads env, verifies git remote, enables MCPs, invokes the agent
prompt.md — the playbook: which queries to run, how to classify severity, what to do next
.cursor/rules/ops-health-agent.mdc — persistent constraints (PR blocklist, Linear workflow, phased rollout)
.cursor/mcp.json — minimal MCP config (Grafana + Linear) for headless CLI
state.json — local dedup fingerprints (gitignored)
send-critical-email.sh — small Mailgun helper the agent calls via bash

Optional seventh piece: a launchd plist for hourly scheduling after you’ve validated manual runs.

What each run actually does

Every hour (or whenever you invoke the script), the agent:

1. Collects signals from the last 60–90 minutes via Grafana MCP:
   - Application logs: structured Error/Fatal volume vs baseline, domain-specific patterns (database, jobs, payment, auth)
   - Kubernetes container logs: Promtail labels (namespace, pod, container); error-like stdout/stderr cluster-wide and in platform namespaces
   - Kubernetes metrics: restarts, OOM, CrashLoopBackOff, Pending/Failed/Not Ready pods, CPU/memory vs requests, node pressure
   - Application metrics: HTTP 5xx, health-check dashboards
   - Managed service metrics: e.g. RavenDB scrape health, index errors, disk I/O (throughput/IOPS + attribution), disk space and license pressure
   - Monitoring stack self-health: Loki/Grafana/Prometheus errors
2. Classifies severity: Critical / High / Medium / Low with explicit criteria
3. Dedupes: fingerprint = normalized error + service or namespace/pod; same fingerprint within 24h → comment on the existing Linear issue, don’t spam new ones
4. Files Linear issues with Grafana deeplinks, assignee, and labels (we use a dedicated OPS_HEALTH label)
5. Emails on Critical (phase 2+), with 4-hour dedup so you don’t get paged twice for the same fire
6. Opens PRs (phase 3+), only for allowlisted paths; never for payment or auth code, and never for infra manifests, which live in a separate repo

That last point matters: observability infra (Kubernetes manifests, Loki config, database scrape jobs) often lives in a different repo from application code. The agent still creates Linear issues for infra and platform problems, but auto-fix PRs target one application repo only.

What we monitor (beyond app logs)

If you run on Kubernetes, you probably already ship cluster telemetry into Grafana: Promtail for container logs, kube-state-metrics and cAdvisor for pod/node state, plus app-specific scrape jobs. The agent reads all of that through Grafana MCP—it never talks to the API server directly.

We split Step 1 of the prompt into layers:

Application layer (Loki + app Prometheus)

JSON level=Error|Fatal by service source label
Targeted LogQL for database, background jobs, payment, auth, sync failures
Sift find_error_pattern_logs for elevated patterns vs yesterday
ASP.NET (or equivalent) 5xx rates and health-check dashboard summaries

Kubernetes layer (Loki + kube Prometheus)

Logs — container stdout/stderr via Promtail labels, not just structured app JSON:

{namespace=~".+"} |~ "(?i)(error|fatal|panic|crashloop|back.?off|oom|evicted|failed)"

Extra pass on platform namespaces (kube-system, ingress, monitoring).

Metrics — same signals you’d put on a pod-resources dashboard:

Signal	Example PromQL direction
Restarts in last hour	`increase(kube_pod_container_status_restarts_total[1h]) > 0`
OOMKilled	`kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}`
CrashLoopBackOff	`kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}`
Pending / Failed / Not Ready	`kube_pod_status_phase`, `kube_pod_status_condition`
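CPU >90% of request	`sum by (namespace, pod, container) (rate(container_cpu_usage_seconds_total[5m])) / sum by (namespace, pod, container) (kube_pod_container_resource_requests{resource="cpu"}) > 0.9`
Memory >90% of request	`sum by (namespace, pod, container) (container_memory_working_set_bytes) / sum by (namespace, pod, container) (kube_pod_container_resource_requests{resource="memory"}) > 0.9`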
Node pressure	`kube_node_status_condition{condition="MemoryPressure"}` etc.

Each run also calls get_dashboard_summary on a Kubernetes pod resources dashboard and cross-checks PromQL hits against its panels.

Severity guidance we encoded: resource pressure trending up → Medium; pressure plus restarts/OOM → High; widespread pod failure or node NotReady → Critical.
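
The guidance lives in the prompt, but writing the same rules as a small shell function is a handy way to sanity-check them or to spot-check the agent’s labels. This is a minimal sketch, not the post’s implementation: the function name, the argument layout, and the WIDESPREAD_PODS cutoff for “widespread” are all assumptions, and anything the guidance does not mention falls through to Low as a placeholder.

```shell
# Hypothetical helper: map hourly Kubernetes signals to a severity label.
# Arguments (assumed layout; flags are 0 or 1):
#   $1 resource pressure seen     $2 restarts or OOM kills seen
#   $3 number of failing pods     $4 node NotReady seen
# WIDESPREAD_PODS is an illustrative cutoff, not a value from the post.
WIDESPREAD_PODS="${WIDESPREAD_PODS:-5}"

classify_k8s() {
  local pressure="$1" restarts_or_oom="$2" failing_pods="$3" node_not_ready="$4"
  if [ "$node_not_ready" -eq 1 ] || [ "$failing_pods" -ge "$WIDESPREAD_PODS" ]; then
    echo "Critical"   # widespread pod failure or node NotReady
  elif [ "$pressure" -eq 1 ] && [ "$restarts_or_oom" -eq 1 ]; then
    echo "High"       # pressure plus restarts/OOM
  elif [ "$pressure" -eq 1 ]; then
    echo "Medium"     # pressure only
  else
    echo "Low"        # placeholder for everything the guidance leaves open
  fi
}
```

Checking Critical first means a node outage is never downgraded because resource pressure happens to be present too.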

Managed services layer (optional Prometheus jobs)

If you scrape a managed database or other PaaS metrics into Prometheus, add a third block to the prompt. Those findings go to Linear only, because the fixes belong in the infra repo, same as K8s manifest fixes.

Example: RavenDB Cloud

Most of our RavenDB incidents show up first as Azure Disk Sum Bytes/Sec and Disk IOPS Sum spikes—not as application errors. We wired that into the same Grafana stack the agent already reads, so hourly triage catches database pressure without a separate on-call channel.

Getting metrics in. RavenDB Cloud exposes a Prometheus endpoint at /admin/monitoring/v1/prometheus on each cluster node. It requires HTTPS with a client certificate (Operator clearance minimum). In-cluster Prometheus scrapes all three nodes with mTLS; the cert lives in a K8s secret, not in git.

We run two scrape jobs to balance cardinality vs drill-down:

Job	What it scrapes	Interval
`premier-ravendb-cloud`	Server + database aggregates; indexes/collections skipped	30s
`premier-ravendb-cloud-indexes`	Index metrics only (maps/sec, errors)	60s

The main job keeps the TSDB lean (~1.7k series). The index-only job adds ~1.8k series so we can name the index during an I/O spike without pulling the full 12k-metric export.

Metric naming gotcha. RavenDB prefixes every metric with ravendb_ and uses database_name as the database label, whereas their docs sometimes show names without the prefix. We learned this the hard way when the first dashboard showed “No data.” Always validate against a live export from the /admin/monitoring/v1/prometheus endpoint before writing PromQL.
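
One way to make that validation repeatable is to pipe a live export through a check that lists any metric names your queries expect but the export lacks. The sketch below assumes a hypothetical function name, and RAVEN_NODE_URL, client.pem, and client.key in the usage comment are placeholders for one node URL and the client certificate.

```shell
# Report which expected metric names are absent from a Prometheus text export.
# Reads the export on stdin; metric names are passed as arguments.
# Prints each missing name and returns non-zero if any are missing.
missing_metrics() {
  local export_text name status=0
  export_text="$(cat)"
  for name in "$@"; do
    # A sample line starts with the metric name followed by "{" or a space.
    if ! printf '%s\n' "$export_text" | grep -Eq "^${name}(\{| )"; then
      echo "$name"
      status=1
    fi
  done
  return "$status"
}

# Usage (placeholders: RAVEN_NODE_URL, client.pem, client.key):
#   curl -fsS --cert client.pem --key client.key \
#     "$RAVEN_NODE_URL/admin/monitoring/v1/prometheus" \
#     | missing_metrics ravendb_server_storage_queue_length \
#                       ravendb_database_indexes_errored_count
```

Because the match is anchored at the start of the line, a name written without the ravendb_ prefix is reported as missing, which is exactly the mistake described above.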

What the agent checks each hour:

Signal	Severity	Example PromQL
Scrape target down	Critical	`up{job="premier-ravendb-cloud"} == 0`
Errored indexes	High	`sum(ravendb_database_indexes_errored_count{...}) > 0`
Stale indexes	Medium	`sum(ravendb_database_indexes_stale_count{...}) > 5`
Low disk space	High	`min(ravendb_server_disk_remaining_storage_space_percentage{...}) < 15`
License expiring	High	`min(ravendb_license_expiration_left_seconds{...}) < 604800`
High disk throughput	High	read + write bytes/sec > 100 MiB/s
High disk IOPS	High	read + write ops/sec > 1000
Disk queue elevated	Medium	`max(ravendb_server_storage_queue_length{...}) > 5`

Disk throughput and IOPS align with the Azure metrics that were firing first:

# Disk sum bytes/sec per node
ravendb_server_storage_read_throughput_bytes{job="premier-ravendb-cloud"}
  + ravendb_server_storage_write_throughput_bytes{job="premier-ravendb-cloud"}

# Disk IOPS sum per node
ravendb_server_storage_io_read_operations{job="premier-ravendb-cloud"}
  + ravendb_server_storage_io_write_operations{job="premier-ravendb-cloud"}
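
The throughput and IOPS rows in the table give their thresholds in prose. As a sketch, the same limits can be turned into alert-style expressions by appending a comparison to the sums above. The function and variable names are hypothetical, and it assumes the metrics are already per-second values, as the queries above treat them.

```shell
# Thresholds from the table, converted to base units.
MIB=$((1024 * 1024))
THROUGHPUT_LIMIT=$((100 * MIB))   # 100 MiB/s expressed in bytes per second
IOPS_LIMIT=1000                   # operations per second
JOB='premier-ravendb-cloud'

# Print one expression per line. In PromQL "+" binds tighter than ">",
# so each line compares the read+write sum against the limit and
# returns only the nodes that are over it.
disk_threshold_queries() {
  cat <<EOF
ravendb_server_storage_read_throughput_bytes{job="$JOB"} + ravendb_server_storage_write_throughput_bytes{job="$JOB"} > $THROUGHPUT_LIMIT
ravendb_server_storage_io_read_operations{job="$JOB"} + ravendb_server_storage_io_write_operations{job="$JOB"} > $IOPS_LIMIT
EOF
}
```

Keeping the unit conversion in one place avoids the classic MB vs MiB mismatch between a prompt and an alert rule.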

Spike attribution is the part dashboards alone don’t give you. RavenDB’s per-database ravendb_database_storage_* metrics duplicate the server total on every database—they can’t tell you which DB caused the spike. When disk alerts fire, the agent runs follow-up queries:

# Which databases were busiest?
topk(5, max_over_time(ravendb_database_statistics_requests_per_second{job="premier-ravendb-cloud"}[1h]))

# Which indexes were mapping hardest?
topk(10, max_over_time(ravendb_index_mapped_per_second{job="premier-ravendb-cloud-indexes"}[1h]))

The Grafana dashboard (pp-ravendb-health) has a Disk I/O and spike attribution row with these as panels—bytes/sec, IOPS, queue depth, top databases by request/write/indexing rate, and a top-indexes table. The agent calls get_dashboard_summary on that dashboard each run and cross-checks PromQL hits against the panels.

Grafana-managed alert rules cover the same thresholds (throughput, IOPS, queue). Infra changes live in a separate repo; the agent files Linear issues there and never auto-PRs scrape config or dashboard YAML into the application repo.

Phased rollout (don’t turn on the flamethrower on day one)

We use an environment variable OPS_HEALTH_PHASE:

Phase	Behavior
1 (default)	Observe + Linear triage only
2	+ critical email alerts
3	+ guarded auto-fix PRs

Phase 1 lets you tune LogQL/PromQL and severity rules without sending email or touching git. Phase 2 validates alert quality. Phase 3 is where you allowlist safe fix types (defensive null checks, config tweaks) and blocklist payment flows, schema migrations, and secrets.
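
The prompt tells the agent which phase it is in, but helper scripts can enforce the same gate themselves so a misread prompt cannot send email or open a PR early. A minimal sketch, assuming a hypothetical phase_allows helper and action names (triage, email, pr) that are not from the post:

```shell
# Return success when the current phase permits an action.
# Minimum phases follow the table above; action names are assumptions.
phase_allows() {
  local action="$1" phase="${OPS_HEALTH_PHASE:-1}" required
  case "$action" in
    triage) required=1 ;;
    email)  required=2 ;;
    pr)     required=3 ;;
    *)      return 1 ;;   # unknown actions are denied
  esac
  # Empty or non-numeric phase values fall back to phase 1 (the default).
  case "$phase" in
    ''|*[!0-9]*) phase=1 ;;
  esac
  [ "$phase" -ge "$required" ]
}

# Usage inside a helper script:
#   phase_allows email || { echo "email disabled below phase 2" >&2; exit 0; }
```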

This is the difference between “AI ops” as a demo and AI ops you can leave running.

The orchestrator script

The shell script is intentionally boring. It resolves the git root, verifies you’re in the right remote, sources secrets from .env, prevents overlapping runs, and calls Cursor Agent in print mode:

#!/usr/bin/env bash
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(git -C "$SCRIPT_DIR" rev-parse --show-toplevel)"

# Verify this is the repo you intend to fix (customize the match string).
# -C pins the check to the repo, whatever directory the scheduler starts in.
url="$(git -C "$REPO_ROOT" remote get-url origin 2>/dev/null || true)"
if [[ "$url" != *"your-org/your-app"* ]]; then
  echo "Unexpected origin remote: ${url:-<none>}" >&2
  exit 1
fi

# Load secrets (Grafana token, Mailgun key, etc.).
# set -a exports everything sourced, so the agent and helper scripts inherit it
# even if .env contains plain KEY=value lines.
set -a
source "$REPO_ROOT/.env" 2>/dev/null || true
source "$SCRIPT_DIR/.env" 2>/dev/null || true
set +a

export OPS_HEALTH_PHASE="${OPS_HEALTH_PHASE:-1}"
export OPS_HEALTH_STATE="$SCRIPT_DIR/state.json"
export OPS_HEALTH_REPO_ROOT="$REPO_ROOT"

# macOS-friendly lock (flock isn't always available)
LOCK_DIR="$SCRIPT_DIR/.run.lockdir"
mkdir "$LOCK_DIR" 2>/dev/null || { echo "Run in progress"; exit 0; }
trap 'rmdir "$LOCK_DIR" 2>/dev/null || true' EXIT

cursor agent mcp enable grafana 2>/dev/null || true
cursor agent mcp enable Linear 2>/dev/null || true

cursor agent --print --approve-mcps --force \
  --workspace "$REPO_ROOT" \
  --output-format stream-json \
  -- "$(cat "$SCRIPT_DIR/prompt.md")"

Notes that saved us time:

--approve-mcps is required for headless MCP access
--workspace must point at the repo root where .cursor/mcp.json lives
Env is sourced every run — change .env, next run picks it up; nothing cached
cursor agent login (or CURSOR_API_KEY) is separate from Grafana/Mailgun env vars
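
One gap in a plain mkdir lock is worth knowing about: if the machine reboots or the script is killed hard mid-run, the EXIT trap never fires, the lock directory stays behind, and every later run exits with “Run in progress.” A sketch of one way to handle that is to record the owner PID inside the lock directory and reclaim the lock when that process is gone. The function names and the pid file are assumptions, not part of the script above.

```shell
# Acquire a directory lock; reclaim it if the recorded owner is no longer running.
# Returns 0 when this process holds the lock, 1 when another run is active.
acquire_lock() {
  local lock_dir="$1" owner
  if mkdir "$lock_dir" 2>/dev/null; then
    echo "$$" > "$lock_dir/pid"
    return 0
  fi
  owner="$(cat "$lock_dir/pid" 2>/dev/null || true)"
  # kill -0 sends no signal; it only tests whether the process exists.
  if [ -n "$owner" ] && kill -0 "$owner" 2>/dev/null; then
    return 1
  fi
  # Stale lock: the previous owner is gone (crash, forced reboot). Reclaim it.
  rm -rf "$lock_dir"
  mkdir "$lock_dir" 2>/dev/null || return 1
  echo "$$" > "$lock_dir/pid"
}

# Remove the lock and its pid file (plain rmdir would fail on a non-empty directory).
release_lock() {
  rm -rf "$1"
}
```

In run-hourly.sh this would stand in for the mkdir line, with the trap calling release_lock. PID reuse can in theory make a stale lock look live, which is an acceptable risk for an hourly job on one laptop.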

The prompt is the product

The script is glue. The prompt.md is where you encode ops judgment: which LogQL queries matter, what “3x baseline” means, which error substrings map to payment vs auth vs database, when to skip noise.

A simplified excerpt of the structure:

## Step 1 — Collect signals (last 60–90 minutes)

### Loki — applications
- Error volume by service vs prior hour
- Domain-specific searches: database, jobs, payment, auth, sync

### Loki — Kubernetes
- Container logs by namespace/pod (Promtail labels)
- Platform namespace pass: kube-system, ingress, monitoring

### Prometheus — Kubernetes
- Restarts, OOM, CrashLoop, Pending/Failed/Not Ready
- CPU/memory >90% of request; node pressure
- Dashboard summary: k8s-resources

### Prometheus — application
- HTTP 5xx, health-check dashboard

### Prometheus — RavenDB Cloud (optional)
- Scrape `up`, errored/stale indexes, disk space, license, memory
- Disk I/O: throughput (bytes/sec), IOPS, queue length
- On I/O alert: top databases (requests, doc puts, index maps) + top indexes
- Dashboard summary: pp-ravendb-health
- Infra findings → Linear only (separate repo)

## Step 2 — Classify and dedupe
Fingerprint = `<normalized-message>|<service-or-namespace/pod>`

## Step 3 — Linear triage
- save_issue with assignee, OPS_HEALTH label, Grafana deeplink
- save_comment if fingerprint seen in last 24h

## Step 4 — Critical email (phase >= 2)
## Step 5 — Auto-fix PR (phase >= 3, allowlist only)
## Step 6 — Update state.json; comment summary on tracking issue

Pair the prompt with a Cursor rule (.mdc) for constraints that shouldn’t be repeated every run: repo scope, PR blocklist, “never mark Linear issues Done automatically,” commit message format.

MCP configuration

Headless Cursor Agent reads MCP from the workspace’s .cursor/mcp.json. We keep a minimal config—just Grafana and Linear:

{
  "mcpServers": {
    "Linear": {
      "url": "https://mcp.linear.app/mcp"
    },
    "grafana": {
      "command": "uvx",
      "args": ["mcp-grafana"],
      "envFile": ".env",
      "env": {
        "GRAFANA_URL": "${env:GRAFANA_URL}",
        "GRAFANA_SERVICE_ACCOUNT_TOKEN": "${env:GRAFANA_SERVICE_ACCOUNT_TOKEN}"
      }
    }
  }
}

Grafana MCP gives you query_loki_logs, query_prometheus, find_error_pattern_logs (Sift), generate_deeplink, and dashboard summaries—exactly what you’d click through manually in Explore, but scriptable for an agent.

Linear MCP gives you save_issue and save_comment so findings become tracked work instead of chat output lost at the end of the run.

Enable them once for the CLI:

cursor agent mcp enable grafana
cursor agent mcp enable Linear
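
A headless run with a missing Grafana token tends to fail quietly inside the agent, so a preflight check in the orchestrator is cheap insurance. A minimal sketch, assuming a hypothetical require_env helper called after .env has been sourced:

```shell
# Print the names of required variables that are unset or empty.
# Returns non-zero if any are missing.
require_env() {
  local name value missing=0
  for name in "$@"; do
    # Look up the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage in run-hourly.sh, after sourcing .env:
#   require_env GRAFANA_URL GRAFANA_SERVICE_ACCOUNT_TOKEN || exit 1
```

The variable names passed in are the two that the mcp.json above references. Only pass literal names you control, since the lookup uses eval.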

Dedup state

Without dedup, an hourly agent creates duplicate tickets for the same recurring error. We store a gitignored state.json:

{
  "lastRunUtc": "2026-06-29T12:00:00Z",
  "fingerprints": {
    "timeout-exceeded|api-service": {
      "linearIssueId": "ENG-123",
      "lastSeenUtc": "2026-06-29T12:00:00Z",
      "lastEmailUtc": null,
      "severity": "high"
    },
    "crashloop-backoff|production/api-deployment-7f8c": {
      "linearIssueId": "ENG-124",
      "lastSeenUtc": "2026-06-29T13:00:00Z",
      "lastEmailUtc": "2026-06-29T13:00:00Z",
      "severity": "critical"
    }
  }
}
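
The keys in the example follow the `<normalized-message>|<service-or-namespace/pod>` shape. The post leaves the normalization to the agent; the sketch below shows one plausible set of rules so that the same error with different request IDs or durations collapses to one key. The function name and the rules themselves are assumptions.

```shell
# Build a fingerprint in the "<normalized-message>|<scope>" shape.
# Illustrative normalization (an assumption, not the post's exact rules):
#   lowercase, drop any token containing a digit (IDs, durations, ports),
#   then collapse everything that is not a letter into single hyphens.
fingerprint() {
  local message="$1" scope="$2" normalized
  normalized="$(printf '%s' "$message" \
    | LC_ALL=C tr '[:upper:]' '[:lower:]' \
    | LC_ALL=C sed -E \
        -e 's/[^[:space:]]*[0-9][^[:space:]]*//g' \
        -e 's/[^a-z]+/-/g' \
        -e 's/^-+//' -e 's/-+$//')"
  printf '%s|%s\n' "$normalized" "$scope"
}

# Usage:
#   fingerprint 'Timeout exceeded' 'api-service'
#   # → timeout-exceeded|api-service
```

The scope is passed through unchanged. Whether to strip rollout hashes from pod names before fingerprinting is a separate decision this sketch does not make.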

Rules:

Same fingerprint within 24 hours → comment, don’t create
Critical email → no resend within 4 hours unless severity escalates

The agent reads and writes this file during each run. It’s crude compared to Redis, but crude is fine on a single machine and easy to inspect when something looks wrong.
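
The two rules reduce to window checks on timestamps. The sketch below works on Unix epoch seconds to stay portable between macOS and Linux `date`; state.json stores ISO 8601 strings, so a real helper would convert first. All function names are hypothetical.

```shell
# Return success when `last` falls inside the window that ends at `now`.
# Both timestamps are Unix epoch seconds (assumption, see lead-in).
within_window() {
  local last="$1" now="$2" window_seconds="$3"
  [ $((now - last)) -lt "$window_seconds" ]
}

ISSUE_WINDOW=$((24 * 3600))   # same fingerprint within 24 hours: comment, don't create
EMAIL_WINDOW=$((4 * 3600))    # critical email: no resend within 4 hours

# Decide the Linear action for a fingerprint. Empty last_seen means never seen.
linear_action() {
  local last_seen="$1" now="$2"
  if [ -n "$last_seen" ] && within_window "$last_seen" "$now" "$ISSUE_WINDOW"; then
    echo "comment"
  else
    echo "create"
  fi
}

# Decide whether to email. `escalated` is 1 when severity rose since the last email.
should_email() {
  local last_email="$1" now="$2" escalated="$3"
  if [ -z "$last_email" ] || [ "$escalated" -eq 1 ]; then
    return 0
  fi
  ! within_window "$last_email" "$now" "$EMAIL_WINDOW"
}
```

Escalation bypasses the email window on purpose, matching the “unless severity escalates” rule.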

Critical email via Mailgun

For phase 2+, we use a tiny bash helper that POSTs to Mailgun. The agent writes a markdown body to /tmp/alert.md and shells out:

./scripts/OpsHealthAgent/send-critical-email.sh \
  --subject "[CRITICAL] Production: database connection pool exhausted" \
  --body-file /tmp/alert.md

Keep email Critical-only. Everything else lives in Linear where it belongs.
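
For reference, a helper with that interface can be very small. The sketch below is not the post’s script: the environment variable names (MAILGUN_API_KEY, MAILGUN_DOMAIN, ALERT_EMAIL_FROM, ALERT_EMAIL_TO) and the DRY_RUN switch are assumptions. It posts to Mailgun’s messages endpoint with HTTP basic auth; EU-region accounts use a different API host.

```shell
# Sketch of send-critical-email.sh. Env var names and DRY_RUN are assumptions.
send_critical_email() {
  local subject="$1" body_file="$2"
  if [ ! -r "$body_file" ]; then
    echo "body file not readable: $body_file" >&2
    return 2
  fi
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print what would be sent instead of calling the API.
    printf 'to=%s\nsubject=%s\nbytes=%s\n' \
      "${ALERT_EMAIL_TO:-}" "$subject" "$(wc -c < "$body_file" | tr -d ' ')"
    return 0
  fi
  # --form-string sends values literally; "text=<file" reads the field from the file.
  curl -fsS --user "api:${MAILGUN_API_KEY}" \
    "https://api.mailgun.net/v3/${MAILGUN_DOMAIN}/messages" \
    --form-string "from=${ALERT_EMAIL_FROM}" \
    --form-string "to=${ALERT_EMAIL_TO}" \
    --form-string "subject=${subject}" \
    -F "text=<${body_file}"
}

# Parse the two flags used in the invocation above.
main() {
  local subject="" body_file=""
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --subject)   subject="${2:-}";   shift; [ "$#" -gt 0 ] && shift ;;
      --body-file) body_file="${2:-}"; shift; [ "$#" -gt 0 ] && shift ;;
      *) echo "unknown argument: $1" >&2; return 2 ;;
    esac
  done
  if [ -z "$subject" ] || [ -z "$body_file" ]; then
    echo "usage: send-critical-email.sh --subject TEXT --body-file PATH" >&2
    return 2
  fi
  send_critical_email "$subject" "$body_file"
}
```

A real script would end with `main "$@"`. Running it with DRY_RUN=1 during phase 1 shows what the agent would have sent without paging anyone.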

Auto-fix PRs (phase 3): allowlist beats optimism

This is the scary part, so we constrain it hard in the Cursor rule:

Allowlisted:

Application code under known project folders
App config, logging config
Obvious defensive fixes with a stack trace match

Blocklisted:

Payment and authentication flows
Database schema/migrations
Multi-tenant config without feature flags
Secrets and PII
Infra manifests in a separate repo

Workflow when allowed:

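git checkout main && git pull origin main
git checkout -b fix/eng-123-short-desc
# minimal change, then commit and push so gh has a branch to open the PR from
git commit -am "fix: short description (ENG-123)"
git push -u origin fix/eng-123-short-desc
# --fill takes the title and body from the commit; gh cannot prompt in a headless run
gh pr create --repo your-org/your-app --base main --fill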

Then comment the PR URL on the Linear issue and set status to In Review—never Done.
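
The Cursor rule states the allowlist and blocklist in prose. A mechanical check on the changed paths before `gh pr create` gives a second line of defense that does not depend on the agent following instructions. In this sketch the path patterns (src/, k8s/, deploy/, appsettings*.json) are placeholders, and the function names are hypothetical.

```shell
# Decide whether a changed file may go into an automated PR.
# Patterns are placeholders (assumptions); replace them with your own layout.
# The blocklist is checked first so it always wins over the allowlist.
path_allowed_for_autofix() {
  case "$1" in
    *[Pp]ayment*|*[Aa]uth*|*[Mm]igrations/*|*.env|*[Ss]ecret*|k8s/*|deploy/*) return 1 ;;
  esac
  case "$1" in
    src/*|appsettings*.json|*/appsettings*.json) return 0 ;;
  esac
  return 1   # default deny: anything not allowlisted is rejected
}

# Succeeds only if every path on stdin (one per line) is allowed.
all_paths_allowed() {
  local path
  while IFS= read -r path; do
    if [ -z "$path" ]; then continue; fi
    if ! path_allowed_for_autofix "$path"; then
      echo "blocked: $path" >&2
      return 1
    fi
  done
  return 0
}

# Usage before opening the PR:
#   git diff --name-only main...HEAD | all_paths_allowed || exit 1
```

The patterns deliberately over-block (an `Author` folder matches `*Auth*`). For auto-fix PRs, a false rejection costs one manual PR, while a false acceptance touches code nobody meant to automate.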

Scheduling with launchd

After manual runs look good, install a plist:

<key>StartInterval</key>
<integer>3600</integer>
<key>RunAtLoad</key>
<false/>
<key>ProgramArguments</key>
<array>
  <string>/absolute/path/to/your/repo/scripts/OpsHealthAgent/run-hourly.sh</string>
</array>
<key>WorkingDirectory</key>
<string>/absolute/path/to/your/repo</string>

Use absolute paths. Keep secrets in .env sourced by the script, not in the plist.
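
The keys above are a fragment; launchd also needs the plist wrapper and a Label. As a sketch, a small generator can render the complete file from the repo path. The function name and the label com.example.ops-health-agent are placeholders, and paths containing XML-special characters such as & are not escaped here.

```shell
# Print a complete LaunchAgent plist around the keys shown above.
# $1 = absolute repo path, $2 = optional label (placeholder default).
render_plist() {
  local repo_root="$1" label="${2:-com.example.ops-health-agent}"
  cat <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>${label}</string>
  <key>StartInterval</key>
  <integer>3600</integer>
  <key>RunAtLoad</key>
  <false/>
  <key>ProgramArguments</key>
  <array>
    <string>${repo_root}/scripts/OpsHealthAgent/run-hourly.sh</string>
  </array>
  <key>WorkingDirectory</key>
  <string>${repo_root}</string>
</dict>
</plist>
EOF
}

# Usage on macOS (paths are examples):
#   render_plist "$HOME/code/your-app" > ~/Library/LaunchAgents/com.example.ops-health-agent.plist
#   plutil -lint ~/Library/LaunchAgents/com.example.ops-health-agent.plist
#   launchctl load ~/Library/LaunchAgents/com.example.ops-health-agent.plist
```

launchd starts jobs with a minimal PATH, so if cursor, uvx, or gh live under Homebrew or a user-local directory, run-hourly.sh may need to set PATH itself before calling them.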

How this differs from our Investigations feature

We now have two agent patterns in production:

	Investigations (Cloud Agent)	Ops Health Agent (local loop)
Trigger	Human asks about one entity	Hourly schedule
Mode	Read-only (`plan`)	Write-capable (phased)
Context	Audit trail + repo	Logs/metrics (app + K8s + optional managed services) + repo
Output	Markdown RCA	Linear issues + optional PR
Runner	Hangfire in the app	`launchd` on a dev machine

Same underlying bet: structured observability data + code-aware agent + tight guardrails. Different trigger and different permissions.

Build your own: checklist

Grafana with Loki (app + K8s container logs) and Prometheus (kube-state-metrics, cAdvisor, app scrapes, optional managed-service jobs)
Dashboards for app overview, log investigation, Kubernetes pod resources, and any managed-service health you scrape (e.g. RavenDB disk I/O + attribution)
Grafana service account token in .env
Linear MCP connected; create an OPS_HEALTH (or similar) label
scripts/OpsHealthAgent/ with run-hourly.sh, prompt.md, state.json.example
.cursor/mcp.json + .cursor/rules/ops-health-agent.mdc — encode K8s severity rules and infra-vs-app PR boundaries
cursor agent login, then manual phase 1 runs until false positives are tolerable
Phase 2 — Mailgun (or your email provider) for Critical only
Phase 3 — PR allowlist after 1–2 weeks of clean triage
launchd (or cron on Linux) when you’re confident

What we’d do differently next time

Run phase 1 longer than you think you need to. Severity tuning is most of the work, especially separating app log noise from K8s platform noise.
Treat infra and app repos separately from day one. Agents love to “fix” the wrong repository; K8s and database scrape config belong in Linear, not auto-PRs.
Ship K8s data to Grafana first, then teach the prompt where it lives. Promtail namespace labels and kube-state-metrics metric names should match what’s actually in your cluster before the agent runs.
Validate managed-service metric names from a live export. PaaS vendors document idealized names; production exports often differ (prefixes, label keys). We burned a cycle on RavenDB panels showing “No data” until we compared PromQL against the actual /prometheus response.
Split high-cardinality scrapes when you need attribution. One lean job for alerts, a second slower job for drill-down (e.g. per-index maps/sec) keeps TSDB growth predictable.
Parent tracking issue in Linear for the automation itself—each run comments a one-line summary so you can audit what it did without reading JSONL logs.
Consider UptimeRobot or similar MCP as a cross-check for external availability—not a replacement for logs, but a good sanity signal.

If you’re already using Cursor with MCP for development, an hourly ops agent is mostly workflow design, not infrastructure. The hard parts are severity rules, dedup, and knowing when not to let the agent commit code. Get those right and you’ve got a cheap, code-aware on-call assistant that actually leaves a paper trail.

Alexandru Puiu

Engineer / Security Architect

Systems Engineering advocate, Software Engineer, Security Architect / Researcher, SQL/NoSQL DBA, and Certified Scrum Master with a passion for Distributed Systems, AI, and IoT.