signal-layer-labs/workflow-failure-library
# Workflow Failure Library
A practical library of failure modes in AI workflows, automation, and production systems.
This repository documents recognizable ways that operational workflows fail once they are running in the real world. It is intended to help builders diagnose incidents, improve system design, and share practical lessons without turning every problem into a new framework.
## Who this is for
This library is for builders and operators working on:
- AI workflows and agentic flows
- automation and orchestration
- tool-calling systems
- human-in-the-loop operations
- production APIs and external providers
- reliability and observability practices
The focus is operational clarity: what breaks, how it shows up, why it happens, and what teams can do about it.
## Why workflow failures matter
Modern workflows often span models, tools, queues, databases, user interfaces, approvals, and external services. A small failure in one step can become hard to diagnose when it is retried, hidden, delayed, or interpreted incorrectly by later steps.
Failure modes give teams shared language for incident reviews and design discussions. They help turn vague statements like "the agent failed" into concrete observations such as "the workflow retried a timed-out tool call without idempotency and created duplicate downstream actions."
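The duplicate-action example above can be sketched in a few lines of Python. This is a minimal illustration, not code from this repository: `FlakyTool`, `send`, `call_with_retry`, and the `idempotency_key` parameter are all hypothetical names, and the tool simulates a response that is lost after the action has already completed.

```python
import uuid


class FlakyTool:
    """Hypothetical downstream tool whose first response is lost."""

    def __init__(self):
        self.actions = []    # side effects that were actually performed
        self._results = {}   # idempotency key -> stored result
        self._calls = 0

    def send(self, payload, idempotency_key=None):
        self._calls += 1
        # A repeated key returns the stored result instead of acting again.
        if idempotency_key is not None and idempotency_key in self._results:
            return self._results[idempotency_key]
        self.actions.append(payload)  # the side effect happens here
        result = f"sent:{payload}"
        if idempotency_key is not None:
            self._results[idempotency_key] = result
        if self._calls == 1:
            # The action completed, but the caller never sees the response.
            raise TimeoutError("response lost after the action completed")
        return result


def call_with_retry(tool, payload, use_key, attempts=3):
    # One key per logical action, reused across every retry of that action.
    key = str(uuid.uuid4()) if use_key else None
    for _ in range(attempts):
        try:
            return tool.send(payload, idempotency_key=key)
        except TimeoutError:
            continue
    raise RuntimeError("tool call failed after retries")


without_key = FlakyTool()
call_with_retry(without_key, "invoice-42", use_key=False)

with_key = FlakyTool()
call_with_retry(with_key, "invoice-42", use_key=True)

print(len(without_key.actions), len(with_key.actions))
```

Without a key, the retry repeats the side effect. With a key, the retry receives the stored result and the action runs once. Real systems usually need the key store to be durable and shared across workers, which this sketch does not attempt.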
## How this differs from Production AI Checklists
- Checklists help you prepare before shipping.
- Failure modes help you recognize and diagnose what breaks after systems are running.
- The two repositories are meant to cross-reference each other as they grow.
`production-ai-checklists` is about readiness, prevention, and what to verify before release. This repository is about diagnosis, incident learning, and the operational patterns that appear after systems meet real users, real data, and real dependencies.
## How to use the library
Use the entries in [`failures/`](failures/) during:
- architecture reviews
- incident analysis
- workflow design
- observability planning
- support and escalation reviews
- postmortems
- production readiness discussions
Each failure mode includes a summary, symptoms, causes, example scenario, operational impact, mitigations, prevention checklist, observability signals, and related failure modes.
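If you draft a new entry, a small check like the one below can flag sections that are still missing. It is a sketch under assumptions: it presumes each section appears as a markdown heading whose text matches the names listed above, which may not be exactly how the files in `failures/` are written, and `missing_sections` is a hypothetical helper rather than part of this repository.

```python
REQUIRED_SECTIONS = [
    "Summary",
    "Symptoms",
    "Causes",
    "Example scenario",
    "Operational impact",
    "Mitigations",
    "Prevention checklist",
    "Observability signals",
    "Related failure modes",
]


def missing_sections(markdown_text):
    """Return the required section names that have no matching heading."""
    # Collect heading text, ignoring the heading level and letter case.
    headings = {
        line.lstrip("#").strip().lower()
        for line in markdown_text.splitlines()
        if line.startswith("#")
    }
    return [name for name in REQUIRED_SECTIONS if name.lower() not in headings]


draft = """# Example Failure Mode
## Summary
Short description.
## Symptoms
- Something observable.
"""

print(missing_sections(draft))
```

For the draft above, the check reports every section after "Symptoms" as missing.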
## Challenges
Challenges are short operational scenarios designed to help builders practice failure analysis, observability thinking, and workflow reliability design.
Work through them before the same failures show up in production:
- [001 - Retry Storms](challenges/001-retry-storms.md)
- [002 - Human Handoff Failure](challenges/002-human-handoff-failure.md)
- [003 - Missing Audit Trail](challenges/003-missing-audit-trail.md)
## Current failure modes
| Failure mode | Theme | Short description | Link |
| --- | --- | --- | --- |
| Context Drift | Context and state | Workflow decisions use stale, accumulated, or changed context. | [View](failures/context-drift.md) |
| State Desynchronization | Context and state | UI, database, queues, tools, or external systems disagree about status. | [View](failures/state-desynchronization.md) |
| Retry Storms | Retries and timeouts | Uncontrolled retries create duplicate work, cost spikes, or downstream pressure. | [View](failures/retry-storms.md) |
| Tool Timeout Cascades | Retries and timeouts | Timed-out tool calls trigger retries or dependent failures across steps. | [View](failures/tool-timeout-cascades.md) |
| Human Handoff Failures | Human review | Escalation to a person is unclear, late, missing, or poorly structured. | [View](failures/human-handoff-failures.md) |
| Approval Loop Breakdowns | Human review | Approval flows get skipped, stuck, duplicated, or misinterpreted. | [View](failures/approval-loop-breakdowns.md) |
| Missing Audit Trail | Observability | Teams cannot reconstruct what happened, why, or what triggered an action. | [View](failures/missing-audit-trail.md) |
| Silent Failure Propagation | Observability | Hidden failures cause later steps to act on bad assumptions. | [View](failures/silent-failure-propagation.md) |
| Ambiguous Tool Selection | Providers and dependencies | Agents or workflows choose the wrong tool because boundaries are unclear. | [View](failures/ambiguous-tool-selection.md) |
| Provider Instability | Providers and dependencies | External AI, API, or service instability degrades workflow behavior. | [View](failures/provider-instability.md) |
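As one illustration of the Retry Storms row, the sketch below bounds retries with a fixed attempt limit and capped exponential backoff with jitter. It shows a common mitigation pattern, not the specific guidance in `failures/retry-storms.md`; the function name `backoff_delays` and its default values are assumptions chosen for the example.

```python
import random


def backoff_delays(max_attempts=5, base=0.5, cap=8.0, rng=random.random):
    """Return the wait in seconds before each retry.

    The ceiling doubles per retry up to `cap`, and the actual wait is a
    random fraction of that ceiling so clients do not retry in lockstep.
    """
    delays = []
    # max_attempts total tries means at most max_attempts - 1 retries.
    for retry in range(max_attempts - 1):
        ceiling = min(cap, base * (2 ** retry))
        delays.append(rng() * ceiling)
    return delays


# With the jitter pinned to its maximum, the ceilings are visible directly.
print(backoff_delays(max_attempts=7, rng=lambda: 1.0))
```

The attempt limit caps total work per action, the cap bounds the longest wait, and the jitter spreads retries out so many clients recovering from the same outage do not hit the dependency at once.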
## Signal Layer Labs philosophy
Signal Layer Labs focuses on operational AI: workflows, orchestration, automation, reliability, observability, and the systems thinking needed to run them in production.
We believe useful documentation should help teams make better decisions under real constraints. This library avoids hype and vague warnings. It aims to name the failures builders actually see, describe how they surface, and make mitigation easier to discuss.