ricardomaraschini/oomhero

GitHub: ricardomaraschini/oomhero

Stars: 113 | Forks: 11

# OOMHero A lightweight Kubernetes sidecar that monitors process resource usage and pressure metrics, sending configurable signals to applications before resource exhaustion occurs. ## Overview OOMHero runs alongside your application containers in Kubernetes pods, continuously monitoring memory usage, memory pressure, I/O pressure, and CPU pressure. When processes cross configurable thresholds (defined as expressions), OOMHero sends Unix signals to enable proactive remediation before the OOMKiller terminates your application. ## Features - **Expression-based thresholds**: Define complex triggers using any combination of memory, OOM score, and pressure metrics - **Signal-based notifications**: Sends customizable Unix signals (default: `SIGUSR1` for warning, `SIGUSR2` for critical) - **Cooldown periods**: Prevents signal spam with configurable intervals between notifications - **Low overhead**: Minimal resource footprint (typically 1m CPU, 32Mi memory) ## How It Works OOMHero operates in pods with `shareProcessNamespace: true`, enabling it to monitor all processes within the pod. It continuously scans processes at configurable intervals, evaluating their resource usage against defined threshold expressions. When a process matches an expression: 1. **Warning expression**: Sends SIGUSR1 (or custom signal) to the process 2. **Critical expression**: Sends SIGUSR2 (or custom signal) to the process Applications implement signal handlers to take corrective action such as: - Flushing caches to disk - Shedding non-critical workloads - Triggering graceful degradation - Dumping diagnostics for post-mortem analysis - Initiating controlled restarts ### Threshold Expressions OOMHero uses the [fasteval](https://github.com/likebike/fasteval) library to evaluate threshold expressions. You can combine various metrics using standard operators: - **Logical**: `&&` (and), `||` (or), `!` (not) - **Comparison**: `>`, `<`, `>=`, `<=`, `==`, `!=` - **Algebraic**: `+`, `-`, `*`, `/`, `%` (modulo), `^` (power) #### Available Variables | Variable | Type | Description | |----------|------|-------------| | `memory_usage` | `f64` | Current memory usage as a percentage of the limit (%) | | `memory_current` | `f64` | Current memory usage in bytes | | `memory_max` | `f64` | Memory limit in bytes | | `oom_score` | `f64` | Current OOM score | | `oom_score_adj` | `f64` | OOM score adjustment | | `{resource}_pressure_{severity}_{window}` | `f64` | Pressure metrics | **Pressure Metric Components:** - **Resource**: `memory`, `io`, `cpu` - **Severity**: `some`, `full` - **Window**: `avg10`, `avg60`, `avg300`, `total` *Example*: `memory_pressure_full_avg10 > 20` ## Metrics OOMHero exposes Prometheus metrics on port `9000` by default. These metrics provide real-time visibility into the resource usage and pressure of all processes being monitored. | Metric Name | Type | Labels | Description | |-------------|------|--------|-------------| | `memory_usage` | Gauge | `pid`, `cmdline` | Current memory usage as a percentage of the limit | | `oom_score` | Gauge | `pid`, `cmdline` | Current OOM score (including adjustment) | | `memory_pressure` | Gauge | `pid`, `cmdline`, `severity_level`, `severity_window` | Memory pressure stall information | | `io_pressure` | Gauge | `pid`, `cmdline`, `severity_level`, `severity_window` | I/O pressure stall information | | `cpu_pressure` | Gauge | `pid`, `cmdline`, `severity_level`, `severity_window` | CPU pressure stall information | ### Metric Labels - `pid`: Process ID - `cmdline`: The command line of the process - `severity_level`: Either `some` or `full` - `severity_window`: One of `avg10`, `avg60`, `avg300`, or `total` Metrics have an idle timeout of 1 minute; if a process (identified by pid and cmdline) is not seen for 1 minute, its metrics will be removed. ## Requirements ## Installation ### Using Pre-built Container apiVersion: v1 kind: Pod metadata: name: my-application spec: shareProcessNamespace: true containers: - name: app image: your-app:latest resources: limits: memory: "512Mi" cpu: "500m" - name: oomhero image: ghcr.io/ricardomaraschini/oomhero:latest args: - "--warning=memory_usage > 75" - "--critical=memory_usage > 90" - "--loop-interval=100ms" - "--cooldown-interval=30s" resources: limits: cpu: "1m" memory: "32Mi" securityContext: capabilities: add: - SYS_PTRACE ### Building from Source # Clone the repository git clone https://github.com/yourusername/oomhero cd oomhero # Build release binary make release # Run locally ./target/release/oomhero --warning "memory_usage > 75" --critical "memory_usage > 90" ## Usage ### Basic Memory Monitoring oomhero \ --warning "memory_usage > 75" \ --critical "memory_usage > 90" \ --loop-interval 100ms \ --cooldown-interval 30s ### Comprehensive Resource Monitoring oomhero \ --warning "memory_usage > 70 || memory_pressure_full_avg60 > 50" \ --critical "memory_usage > 85 || memory_pressure_full_avg60 > 80" \ --loop-interval 200ms \ --cooldown-interval 30s ### Custom OOM Score Logic oomhero \ --warning "oom_score > 500" \ --critical "oom_score > 800" ### Custom Signals oomhero \ --warning "memory_usage > 75" \ --critical "memory_usage > 90" \ --warning-signal SIGHUP \ --critical-signal SIGTERM ## Configuration Options | Option | Description | Default | |--------|-------------|---------| | `--warning` | Expression for warning signal | (empty) | | `--critical` | Expression for critical signal | (empty) | | `--loop-interval` | Process scanning frequency | 100ms | | `--cooldown-interval` | Minimum time between repeated signals | 30s | | `--warning-signal` | Signal sent at warning threshold | SIGUSR1 | | `--critical-signal` | Signal sent at critical threshold | SIGUSR2 | | `--version` | Display version information | false | **Note**: Both `--warning` and `--critical` expressions must be provided for OOMHero to run. ## Important Considerations ### Memory Limits vs Requests OOMHero operates based on container **limits**, not requests. If only resource requests are specified without limits, OOMHero cannot calculate meaningful usage percentages. ### Performance Impact OOMHero scans all processes at the configured interval. Use CPU limits to control scan frequency and resource consumption. ## Troubleshooting ### OOMHero exits with "invalid expression: ..." Ensure both `--warning` and `--critical` expressions are valid `fasteval` expressions and provided when starting OOMHero. Example: --warning "memory_usage > 75" --critical "memory_usage > 90" ### Signals not being received by application 1. Verify `shareProcessNamespace: true` is set on the pod 2. Confirm OOMHero has `SYS_PTRACE` capability 3. Check application has signal handlers registered 4. Review OOMHero logs for signal delivery errors ### High CPU usage Reduce scan frequency by increasing `--loop-interval` or set lower CPU limits to throttle OOMHero's execution rate. ## License Licensed under the Apache License, Version 2.0. See [LICENSE](LICENSE) for details.
标签:通知系统