aditihedaoo26/dataops-warroom

GitHub: aditihedaoo26/dataops-warroom

---
title: DataOps War Room
emoji: 🚨
colorFrom: red
colorTo: blue
sdk: docker
pinned: false
app_port: 7860
---

# DataOps War Room 🚨

**Meta PyTorch OpenEnv Hackathon Submission**

A real-world OpenEnv environment where an AI agent acts as a **Data Engineering Operations specialist** inside a healthcare analytics company. A production data pipeline has gone wrong, and the agent must work through a chain of realistic operational failures.

## Environment Description

DataOps War Room simulates the full lifecycle of a production data pipeline incident:

| Task | Difficulty | Description |
|------|-----------|-------------|
| `task1_triage` | Easy | Read logs & alerts, classify root cause, severity, and impacted services |
| `task2_sql` | Medium | Rewrite a slow or broken SQL query to be correct and performant |
| `task3_cleaning` | Medium-Hard | Clean a dirty clinical dataset and document all issues + actions |
| `task4_review` | Hard | Review pipeline code for bugs, security vulnerabilities, and suggest fixes |

**Why this matters:** Every data-driven company, especially in healthcare, runs on-call rotations, battles bad queries, cleans upstream data, and reviews pipeline code. This environment fills a genuine gap in the RL/agent ecosystem by making these daily human tasks testable and measurable.

## Action & Observation Spaces

### Observation

```python
from typing import Any, Dict, List

from pydantic import BaseModel

# TaskPhase and the other enums used below live in environment/models.py


class Observation(BaseModel):
    task_id: str
    phase: TaskPhase          # triage | sql_optimization | data_cleaning | code_review
    context: Dict[str, Any]   # task-specific payload (logs, SQL, data records, code)
    instructions: str
    step_count: int
    max_steps: int
```

### Actions (one per task)

```python
# Task 1
class TriageAction(BaseModel):
    root_cause: RootCause     # sql_error | schema_drift | data_quality | ...
    severity: Severity        # low | medium | high | critical
    affected_services: List[str]
    summary: str


# Task 2
class SQLAction(BaseModel):
    rewritten_query: str
    explanation: str


# Task 3
class CleaningAction(BaseModel):
    cleaned_records: List[CleanedRecord]
    issues_found: List[str]
    actions_taken: List[str]


# Task 4
class CodeReviewAction(BaseModel):
    issues: List[CodeIssue]   # {line, severity, issue_type, description, suggested_fix}
    summary: str
```

### Reward

All graders produce a `Reward` with `value ∈ [0.0, 1.0]` and a `breakdown` dict. Grading is fully deterministic and reproducible.

## Grading Rubrics

### Task 1: Triage

| Dimension | Weight |
|-----------|--------|
| Root cause (exact match) | 40% |
| Severity (exact; adjacent = 50%) | 30% |
| Affected services (Jaccard similarity) | 20% |
| Summary quality (length ≥ 30 chars) | 10% |

### Task 2: SQL Optimization

| Dimension | Weight |
|-----------|--------|
| Expected SQL constructs present | 45% |
| Anti-patterns removed | 25% |
| Explanation quality | 20% |
| Basic syntactic validity | 10% |

### Task 3: Data Cleaning

| Dimension | Weight |
|-----------|--------|
| Issue recall (keyword matching) | 40% |
| All patient records returned | 30% |
| Actions documented per-record | 20% |
| No data loss | 10% |

### Task 4: Code Review

| Dimension | Weight |
|-----------|--------|
| Issue recall (type + label match) | 40% |
| Security issues specifically covered | 25% |
| Fix quality (concrete, code-specific) | 25% |
| Summary addresses risk level | 10% |

## Baseline Scores

Measured with `gpt-4o` at temperature 0.0, seed 42:

| Task | Score |
|------|-------|
| task1_triage | 0.72 |
| task2_sql | 0.58 |
| task3_cleaning | 0.61 |
| task4_review | 0.44 |
| **Average** | **0.59** |

## Setup & Usage

### Local Development

```bash
# Clone and install
git clone <repo-url>
cd dataops-warroom
pip install -r requirements.txt

# Run the API server
python app.py

# Run baseline inference (requires API credentials)
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o"
export HF_TOKEN="your-api-key"
python inference.py
```

### Docker

```bash
docker build -t dataops-warroom .
docker run -p 7860:7860 \
  -e API_BASE_URL=https://api.openai.com/v1 \
  -e MODEL_NAME=gpt-4o \
  -e HF_TOKEN=your-key \
  dataops-warroom
```

### API Endpoints

| Method | Path | Description |
|--------|------|-------------|
| GET | `/` | Health check (returns 200) |
| GET | `/tasks` | List all tasks |
| POST | `/reset` | Reset env: `{"task_id": "task1_triage", "seed": 42}` |
| POST | `/step` | Submit action: `{"action": {...}}` |
| GET | `/state` | Get current env state: `?task_id=task1_triage` |
| POST | `/grade` | Grade action directly (reset+step) |

### Using the Python SDK

```python
from environment import DataOpsWarRoomEnv
from environment.models import Action, TaskPhase, TriageAction, RootCause, Severity

env = DataOpsWarRoomEnv(task_id="task1_triage", seed=42)
obs = env.reset()

print(obs.context["logs"])   # production logs
print(obs.instructions)      # task instructions

action = Action(
    task_id="task1_triage",
    phase=TaskPhase.TRIAGE,
    triage=TriageAction(
        root_cause=RootCause.SCHEMA_DRIFT,
        severity=Severity.HIGH,
        affected_services=["reporting_dashboard", "ml_feature_store"],
        summary="Pipeline failed due to a renamed column in patient_vitals table.",
    ),
)

obs, reward, done, info = env.step(action)
print(f"Score: {reward.value}")        # e.g. 0.90
print(f"Feedback: {reward.feedback}")
```

## Project Structure

```
dataops-warroom/
├── app.py                  # FastAPI server (HF Space entry point)
├── inference.py            # Baseline inference script (OpenAI client)
├── openenv.yaml            # OpenEnv metadata & task registry
├── requirements.txt
├── Dockerfile
├── README.md
├── environment/
│   ├── __init__.py
│   ├── env.py              # DataOpsWarRoomEnv (reset/step/state)
│   └── models.py           # Typed Pydantic models (Observation/Action/Reward)
├── tasks/
│   ├── task1_triage.py     # Scenario generation + observation builder
│   ├── task2_sql.py
│   ├── task3_cleaning.py
│   └── task4_review.py
└── graders/
    ├── grader1.py          # Deterministic grader for each task
    ├── grader2.py
    ├── grader3.py
    └── grader4.py
```

## Design Decisions

**Reward shaping:** Every grader rewards partial progress, not just binary pass/fail. For example, Triage gives partial credit for an adjacent severity level, and SQL grading rewards each correct construct independently.

**Scenario variety:** Each task has 3–4 distinct scenarios sampled at reset time, so the graders can't be gamed by memorization. Use `seed` for reproducibility.

**Hard task:** Task 4 (Code Review) genuinely challenges frontier models: it requires identifying 8–9 distinct issues across bug categories and security vulnerabilities, and providing concrete fixes. The GPT-4o baseline is ~0.44.

## License

MIT
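
## Appendix: Task 1 Scoring Sketch

The Task 1 weights listed under Grading Rubrics combine into a single score in a fairly mechanical way, and a small worked example may make the partial-credit behaviour easier to see. The sketch below is an illustration only and is not the code in `graders/grader1.py`: the function name `score_triage`, the plain-dict inputs, the breakdown keys, the severity ordering used to decide what counts as "adjacent", the handling of two empty service lists, and the `billing_export` service are all assumptions made for this example.

```python
from typing import Dict, List

# Assumed ordering; "adjacent" means one step apart in this list.
SEVERITY_ORDER: List[str] = ["low", "medium", "high", "critical"]


def jaccard(a: List[str], b: List[str]) -> float:
    """Jaccard similarity of two service lists, treated as sets."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0  # assumption: two empty lists count as a perfect match
    return len(sa & sb) / len(sa | sb)


def score_triage(pred: Dict, expected: Dict) -> Dict:
    """Hypothetical scorer applying the documented 40/30/20/10 weights."""
    # Root cause: exact match only.
    root = 0.40 if pred["root_cause"] == expected["root_cause"] else 0.0

    # Severity: full credit if exact, half credit if one level away.
    gap = abs(
        SEVERITY_ORDER.index(pred["severity"])
        - SEVERITY_ORDER.index(expected["severity"])
    )
    severity = 0.30 * {0: 1.0, 1: 0.5}.get(gap, 0.0)

    # Affected services: scaled by set overlap.
    services = 0.20 * jaccard(
        pred["affected_services"], expected["affected_services"]
    )

    # Summary: credit once the text reaches 30 characters.
    summary = 0.10 if len(pred["summary"]) >= 30 else 0.0

    breakdown = {
        "root_cause": root,
        "severity": severity,
        "affected_services": services,
        "summary": summary,
    }
    return {"value": sum(breakdown.values()), "breakdown": breakdown}


# Illustrative inputs (the expected answer here is made up for the example).
predicted = {
    "root_cause": "schema_drift",
    "severity": "high",
    "affected_services": ["reporting_dashboard", "ml_feature_store"],
    "summary": "Pipeline failed due to a renamed column in patient_vitals table.",
}
expected = {
    "root_cause": "schema_drift",
    "severity": "critical",
    "affected_services": [
        "reporting_dashboard",
        "ml_feature_store",
        "billing_export",
    ],
}

result = score_triage(predicted, expected)
print(round(result["value"], 3), result["breakdown"])
```

In this example the root cause is correct, the severity is one level off, and two of three services are identified, so every dimension contributes something and the total lands between 0 and 1 rather than collapsing to pass/fail.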