blueguy23/ghostrunner

# ghostrunner

Self-hosted GitHub Actions runner infrastructure for [bill-tracker](https://github.com/blueguy23/bill-tracker). Two ephemeral runners + shared MongoDB in Docker Compose, designed for WSL2 on a single machine.

## Why this exists

**GitHub-hosted runners can't reach a local MongoDB.** The bill-tracker CI pipeline runs E2E tests against a real database, not mocks. GitHub-hosted runners would need either a cloud-hosted MongoDB (cost) or service containers that reset on every run (complexity). A self-hosted runner on the same Docker network as MongoDB is simpler and free.

**WSL2 introduces failure modes that don't exist elsewhere.** Clock drift after host sleep/wake breaks TLS. TCP keepalives default to 2 hours, so dead connections go undetected. DNS silently breaks after VPN cycles. Every sidecar in this stack exists to handle a specific WSL2 failure mode; see [INCIDENTS.md](INCIDENTS.md) for the ones we've actually hit.

## Architecture

    ┌─────────────────────────────────────────────────────────────────────┐
    │  docker compose                                                     │
    │                                                                     │
    │  ┌──────────────┐  ┌──────────────┐                                 │
    │  │ ci-runner-1  │  │ ci-runner-2  │  Ephemeral runners              │
    │  │              │  │              │  Register → run 1 job → repeat  │
    │  │ entrypoint   │  │ entrypoint   │                                 │
    │  │  ├ chronyd   │  │  ├ chronyd   │  Clock sync (WSL2 drift)        │
    │  │  ├ watchdog  │  │  ├ watchdog  │  Kill stuck retry loops         │
    │  │  ├ token-wt  │  │  ├ token-wt  │  PAT expiry detection           │
    │  │  └ run.sh    │  │  └ run.sh    │  GitHub runner binary           │
    │  └──────┬───────┘  └──────┬───────┘                                 │
    │         │                 │                                         │
    │         └────────┬────────┘                                         │
    │                  │                                                  │
    │         ┌────────▼────────┐                                         │
    │         │    ci-mongo     │  MongoDB 7.0, shared test database      │
    │         │  (mongo:27017)  │  CI jobs connect here, not localhost    │
    │         └─────────────────┘                                         │
    │                                                                     │
    │  ┌──────────────┐  ┌──────────────┐                                 │
    │  │ autoheal     │  │ disk-watcher │  Sidecars                       │
    │  │ Restarts     │  │ Prunes Docker│                                 │
    │  │ unhealthy    │  │ when disk    │                                 │
    │  │ containers   │  │ < 10GB free  │                                 │
    │  └──────────────┘  └──────────────┘                                 │
    └─────────────────────────────────────────────────────────────────────┘

### Lifecycle of a single job

1. `entrypoint.sh` starts as root
2. `preflight-check.sh` validates PAT, scopes, binaries
3. `chronyd` starts, initial clock step, DNS fallback written
4. `gosu` drops to runner user
5. Background loops start (clock monitor, PAT re-validation)
6. Runner registers as ephemeral via GitHub API
7. `run.sh` picks up one job from the queue
8. Job completes → runner auto-deregisters (ephemeral)
9. Loop back to step 6

### Watchdog + circuit breaker

WSL2 silently drops long-poll connections. The runner enters a "Retrying until reconnected" loop that never self-recovers. The watchdog detects this pattern in the runner log and kills the process so the main loop re-registers fresh.

The circuit breaker prevents infinite restart loops: if the watchdog fires more than `WATCHDOG_MAX_FIRES` times within a 1-hour window, it backs off for 10 minutes before resetting. This stops thrashing when the root cause isn't recoverable by restarting (e.g., expired PAT, sustained network outage).

### Healthcheck layers

| Layer | What it checks | Action on failure |
|-------|---------------|-------------------|
| **Docker healthcheck** (`deep-healthcheck.sh`) | Queries GitHub API: is this runner actually "online"? | Marks container unhealthy after 3 consecutive failures |
| **Autoheal sidecar** | Watches for unhealthy containers | Restarts the container |
| **Watchdog** (in entrypoint) | Watches runner log for "Retrying until reconnected" | Kills runner process → main loop re-registers |
| **Token watch** (background loop) | Re-validates PAT against GitHub API every 6 hours | Writes `/tmp/token-invalid` sentinel → healthcheck reports it |
| **Clock monitor** (background loop) | Reads chronyd offset every 60 seconds | Warns if drift > 2 seconds (chronyd auto-corrects via `makestep 1.0 -1`) |

## Decision log

### Ephemeral over persistent runners

Persistent runners accumulate state: stale credentials files, leftover build artifacts, environment pollution between jobs.
Ephemeral runners re-register after every job, so each CI run starts clean. The tradeoff is ~5 seconds of registration overhead per job, negligible against a 4-minute pipeline.

### Chrony over ntpdate

`ntpdate` is a one-shot sync: it corrects the clock once at boot and never again. WSL2 drifts every time the host sleeps, which can happen multiple times per day. `chronyd` runs as a daemon and auto-steps when drift exceeds 1 second (`makestep 1.0 -1`).

The background loop only *monitors* drift; it doesn't correct it. See [INC-001](INCIDENTS.md#inc-001-clock-sync-loop-silently-failing-since-inception) for what happened when we tried to correct from userspace.

### Circuit breaker thresholds (5 fires / 1 hour / 10 min backoff)

These were tuned from real incidents. A healthy runner fires the watchdog 0-1 times per day (brief network blips). 5 fires in an hour means something systemic is wrong: clock drift, expired PAT, GitHub outage. The 10-minute backoff is long enough for transient GitHub issues to clear but short enough that the runner recovers within a reasonable window.

### SYS_TIME capability + Docker socket mount

Both are security tradeoffs documented as `RISK ACCEPTED` in `docker-compose.yml`:

- **SYS_TIME** is required for chrony to step the system clock. No alternative exists that preserves clock correction. Acceptable for a single-user local runner; do not replicate to shared environments.
- **Docker socket** grants effective root on the host. Acceptable because this runner only executes trusted code (our own repo). If the runner ever processes untrusted PRs (forks, external contributors), replace with `tecnativa/docker-socket-proxy`.

### TCP keepalive tuning (60/10/6)

WSL2's default keepalive is 7200 seconds, so dead connections go undetected for 2 hours. The runner's long-poll to GitHub silently dies, and the process hangs until the OS timeout. `60/10/6` detects a dead connection within ~2 minutes (60s initial + 6 probes × 10s).
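The `60/10/6` arithmetic can be checked with a small shell helper. This is a minimal sketch, not code from this repo: `keepalive_detect_seconds` is a hypothetical name, and it assumes the three numbers map to the Linux sysctls `net.ipv4.tcp_keepalive_time`, `net.ipv4.tcp_keepalive_intvl`, and `net.ipv4.tcp_keepalive_probes`, in that order.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repo).
# Worst-case seconds before the kernel gives up on a silent peer:
#   idle time before the first probe + (probe interval x probe count)
# Arguments: <idle_seconds> <interval_seconds> <probe_count>
keepalive_detect_seconds() {
  echo $(( $1 + $2 * $3 ))
}

# Tuned values from this README (60/10/6).
keepalive_detect_seconds 60 10 6

# Stock Linux defaults (7200/75/9), for comparison.
keepalive_detect_seconds 7200 75 9
```

The first call prints `120`, the ~2 minutes quoted above; the second prints `7875`, a little over two hours. To compare against the live values inside a runner, something like `docker exec ci-runner-1 sysctl net.ipv4.tcp_keepalive_time` should work, assuming `sysctl` is installed in the image.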
## Setup

### Prerequisites

- Docker and Docker Compose
- A GitHub PAT with `repo` scope
- The runner binary tarball (`actions-runner.tar.gz`) in the project root

### Quick start

    # 1. Download the runner binary
    curl -fsSL https://github.com/actions/runner/releases/download/v2.322.0/actions-runner-linux-x64-2.322.0.tar.gz \
      -o actions-runner.tar.gz

    # 2. Configure environment
    cp .env.example .env
    # Edit .env: set GITHUB_PAT, REPO_OWNER, REPO_NAME, DOCKER_GID

    # 3. Build and start
    docker compose build
    docker compose up -d

    # 4. Verify runners are online
    docker compose logs -f runner-1 runner-2
    # Look for: "Runner registered. Waiting for a job..."

### Environment variables

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `GITHUB_PAT` | Yes | none | PAT with `repo` scope |
| `REPO_OWNER` | Yes | none | GitHub username or org |
| `REPO_NAME` | Yes | none | Repository name |
| `DOCKER_GID` | Yes | `1001` | GID of `/var/run/docker.sock` on host (`stat -c '%g' /var/run/docker.sock`) |
| `DOCKERHUB_USERNAME` | No | none | Prevents anonymous pull rate limits |
| `DOCKERHUB_TOKEN` | No | none | Docker Hub access token |
| `PRUNE_THRESHOLD_GB` | No | `10` | Disk watcher prunes below this |
| `RUNNER_CPUS` | No | `3.0` | CPU cap per runner container |
| `SESSION_CONFLICT_WAIT` | No | `30` | Seconds to wait on session conflict before retry |
| `WATCHDOG_MAX_FIRES` | No | `5` | Watchdog restarts before circuit breaker trips |

## Operations

    # Start everything
    docker compose up -d

    # View logs (both runners)
    docker compose logs -f runner-1 runner-2

    # View single runner
    docker compose logs -f runner-1

    # Restart stuck runners (not `up -d`, which only starts stopped containers)
    docker compose restart runner-1 runner-2

    # Stop (deregisters runners from GitHub automatically)
    docker compose down

    # Check runner health
    docker inspect --format='{{.State.Health.Status}}' ci-runner-1

    # Check clock drift inside a runner
    docker exec ci-runner-1 chronyc tracking

### Upgrading the runner binary

    curl -fsSL -o actions-runner.tar.gz
    docker volume rm runner_runner-1-config runner_runner-2-config
    docker compose build && docker compose up -d

Config volumes must be deleted because the old runner binary caches its version in the config directory.

### Shared volumes

| Volume | Purpose | Shared? |
|--------|---------|---------|
| `runner-N-config` | Runner binary + registration state | Per-runner |
| `runner-N-work` | Job workspace (`_work/`) | Per-runner |
| `playwright-cache` | Chromium binaries for E2E tests | Shared |
| `pnpm-store` | pnpm content-addressable store | Shared |
| `mongo-data` | MongoDB data directory | Shared |

Playwright and pnpm caches are shared to avoid downloading ~400MB of binaries on every job. `docker volume prune` is intentionally omitted from the disk watcher because it would wipe these caches.

## File map

    ghostrunner/
    ├── Dockerfile              # Ubuntu 22.04 + Node 22 + runner binary + Playwright deps
    ├── docker-compose.yml      # 2 runners + MongoDB + autoheal + disk watcher
    ├── entrypoint.sh           # Root setup → gosu → ephemeral runner loop + watchdog
    ├── chrony.conf             # Aggressive NTP sync for WSL2 clock drift
    ├── deep-healthcheck.sh     # Queries GitHub API to verify runner is actually online
    ├── disk-watch.sh           # Sidecar: prunes Docker artifacts when disk is low
    ├── .env.example            # All configuration variables with descriptions
    ├── INCIDENTS.md            # Operational incident log
    └── scripts/
        ├── registration.sh     # Token fetch, runner register/deregister (sourced)
        ├── background-loops.sh # Clock monitor + PAT re-validation loops (sourced)
        └── preflight-check.sh  # Validates PAT scopes, binaries, env vars before start