ChrisHuber1/siem-automation
GitHub: ChrisHuber1/siem-automation
Stars: 0 | Forks: 0
# SIEM Automation
Automated monitoring, triage, and alerting for a Wazuh SIEM deployment. Built after my Wazuh manager crashed and stayed down for 3 days without anyone noticing ; including me.
The system has three parts: a watchdog that detects and auto-restarts failed SIEM services, an AI-powered triage bot that classifies alerts and sends daily reports, and a CVE validation pipeline that checks whether flagged vulnerabilities actually apply to the affected hosts.
## Why This Exists
I run Wazuh across a home lab with multiple Linux hosts. In May 2026, the Wazuh manager crashed after a reboot ; stale PID files in `/var/ossec/var/run/` prevented the service from restarting. No alerts were generated because the system that generates alerts was the thing that was down.
I found out 3 days later when I happened to check manually. That's unacceptable.
## Components
### 1. SIEM Watchdog
Runs on a cron schedule, SSHs into the SIEM host, and checks Wazuh service health.
- Detects stale PID files and clears them before restart
- Auto-restarts crashed services (max 2 attempts per 24 hours to avoid restart loops)
- Successful restart = HIGH finding; failed restart = CRITICAL + phone notification
- Tracks restart attempts in a JSON state file so it doesn't retry endlessly
### 2. AI Triage Bot
Fetches recent Wazuh alerts over SSH, classifies them with Claude, and sends a daily summary.
- Pre-flight health check: if the Wazuh manager is down, sends an urgent alert immediately instead of trying to fetch alerts from a dead service
- Zero alerts + healthy manager gets flagged as anomalous (not "quiet day")
- SSH failures are handled gracefully ; no more crashes from unreachable hosts
- Sends reports via ntfy push notification
### 3. CVE Validation
Wazuh flags kernel CVEs aggressively. Most of them don't actually apply to the running kernel version.
- Pulls flagged CVEs from Wazuh alerts
- Checks each one against the NVD API for affected version ranges
- Compares against the actual running kernel version on each host
- Generates a report: confirmed real vs. false positive, with reasoning
- False positives get Wazuh rule overrides (level 0 suppression) so they stop generating alerts
**Example finding:** Wazuh flagged CVE-2026-31461 on a host running kernel 6.8. The NVD affected range starts at 6.13. That's a false positive ; the host isn't running a vulnerable version. Added a suppression rule.
### 4. Alerting
- ntfy push notifications on all CRITICAL findings
- File-based cooldown (1 hour per agent) to prevent alert spam from cron runs
- Audible "Master we need your input" via Windows SAPI for human-in-the-loop decisions
## Decisions and Tradeoffs
**File-based cooldown over rate limiting:** A rate limiter would be more elegant, but a file with a timestamp is debuggable. I can see when the last alert fired by reading a file. If cooldown breaks, I can fix it by deleting a file.
**Max 2 restarts per 24 hours:** Infinite restart loops are worse than a down service. If the service won't stay up after 2 attempts, something is fundamentally wrong and a human needs to look at it.
**Claude for triage, not for detection:** Wazuh handles detection. Claude classifies and prioritizes the results. Using an LLM for detection would miss the rule-based patterns that Wazuh is purpose-built for.
**Custom Wazuh rules over modifying defaults:** False positive overrides use custom rules (100100+ range) in `local_rules.xml`, never modifications to the default ruleset. This survives Wazuh upgrades and makes it easy to see exactly what's been tuned.
## False Positive Tuning
| Rule ID | What It Suppresses | Why |
|---|---|---|
| 100101 | Root SSH from ops host | Cron jobs SSH as root for health checks ; not unauthorized access |
| 100160 | CVE-2026-31461 | NVD affected range starts at 6.13; host runs 6.8 |
| 100170 | Sudo alerts on SIEM host | Cloud-init gives the service account NOPASSWD ALL by default |
## Architecture
Cron (hourly)
|
v
+-------------------+ +------------------+
| SIEM Watchdog |--SSH-->| SIEM Host |
| (check health, | | (Wazuh manager) |
| auto-restart) | +------------------+
+-------------------+
|
v (findings)
+-------------------+ +------------------+
| Triage Bot |--SSH-->| Wazuh API |
| (Claude classify, | | (fetch alerts) |
| daily report) | +------------------+
+-------------------+
|
v (alerts)
+-------------------+
| ntfy push |----> Phone notification
| Windows SAPI |----> Audible alert
+-------------------+
## Current State
Running in production on my home lab. The watchdog has caught and auto-resolved two Wazuh manager crashes since deployment. CVE validation eliminated 10 false critical alerts per scan cycle. The triage bot sends daily summaries to my phone.
## What I'd Do Differently
- Add Prometheus metrics for SIEM health instead of relying solely on the SSH-based check. Would give me historical uptime data and alerting integration with Grafana.
- The CVE validation could cache NVD responses to avoid repeated API calls for the same CVE across scan cycles.