adrienclaire/threat-intel-rag

GitHub: adrienclaire/threat-intel-rag


# threat-intel-rag

Local-first threat intelligence retrieval assistant for security analysts.

This MVP answers analyst questions from a compact local security corpus and returns the passages it used as citations. It is intentionally simple and transparent: lexical retrieval first, with a clean path to upgrade later to embeddings or a full RAG pipeline.

## Why this repo exists

Security questions are often answerable from known references, but the context is scattered across ATT&CK techniques, advisories, analyst notes, and response guidance. This project demonstrates a small retrieval workflow that keeps cited source material attached to every answer.

It is designed as a portfolio-ready example of:

- cybersecurity knowledge retrieval
- explainable local-first AI architecture
- FastAPI service design
- testable Python code
- a realistic path from MVP to production RAG

## What it does

## Current corpus

The demo corpus includes examples inspired by common security reference types:

- MITRE ATT&CK-style techniques
- vendor advisory-style vulnerability notes
- CISA-style ransomware readiness guidance

The corpus is deliberately small so the retrieval behavior is easy to inspect and explain.

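To make the lexical-first idea concrete, here is a minimal sketch of term-overlap scoring that keeps the matched passage attached to each result. It is not the code in `app/retriever.py`. The function name `rank_documents` is borrowed from the roadmap, but its signature here, the `tokenize` helper, the raw term-count score, and the two sample documents are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical two-document corpus in the style of the demo data (not the real corpus.json).
CORPUS = [
    {
        "doc_id": "mitre-t1566",
        "title": "Phishing",
        "source": "MITRE ATT&CK",
        "text": "Phishing techniques are used to obtain credentials or deliver malicious content.",
    },
    {
        "doc_id": "demo-ransomware-readiness",
        "title": "Ransomware readiness",
        "source": "CISA-style guidance",
        "text": "Maintain offline backups and rehearse recovery before a ransomware incident.",
    },
]


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_documents(question: str, corpus: list[dict], top_k: int = 3) -> list[dict]:
    """Score each document by how often the question's terms occur in it."""
    query_terms = set(tokenize(question))
    results = []
    for doc in corpus:
        counts = Counter(tokenize(doc["title"] + " " + doc["text"]))
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            # Keep the source passage with the score so every hit is citable.
            results.append(
                {
                    "doc_id": doc["doc_id"],
                    "title": doc["title"],
                    "source": doc["source"],
                    "score": float(score),
                    "passage": doc["text"],
                }
            )
    results.sort(key=lambda hit: hit["score"], reverse=True)
    return results[:top_k]


citations = rank_documents(
    "How should we triage phishing that may have captured credentials?", CORPUS
)
```

Because the score is a plain count of matching terms, it is easy to see why a document ranked where it did. A weighting scheme such as TF-IDF or BM25 could replace the `score` line without changing the shape of the returned citations.
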
## Architecture

    Analyst question
          |
          v
    FastAPI /query endpoint
          |
          v
    retriever.py -> lexical scoring over loaded documents
          |
          v
    answer + citations

Key files:

- `app/main.py`: FastAPI app and request validation
- `app/indexer.py`: corpus loading and tokenization
- `app/retriever.py`: ranking and answer construction
- `data/corpus.json`: compact demo security corpus
- `tests/`: retriever, API, and corpus quality tests

## Stack

- Python 3.12+
- FastAPI
- Uvicorn
- Pytest

## API endpoints

| Method | Path | Purpose |
|---|---|---|
| `GET` | `/` | Small browser UI for analyst queries |
| `GET` | `/health` | Service health check |
| `POST` | `/query` | Ask a threat-intel question |

## Setup

Manual setup:

    python3 -m venv .venv
    source .venv/bin/activate
    python -m pip install --upgrade pip
    python -m pip install -r requirements-dev.txt

Or use the Makefile:

    make install

## Run tests

    source .venv/bin/activate
    python -m pytest -q

Or:

    make test

Expected result:

    12 passed

## Run locally

    source .venv/bin/activate
    uvicorn app.main:app --reload

Or:

    make run

Open the API docs:

    http://localhost:8000/docs

## Run with Docker

Build the image:

    make docker-build

Run the container:

    make docker-run

Or use Docker Compose:

    make docker-up

The API listens on:

    http://localhost:8000

Stop the Compose stack:

    make docker-down

## Sample request

    curl -X POST http://localhost:8000/query \
      -H "Content-Type: application/json" \
      -d '{"question":"How should we triage phishing that may have captured credentials?"}'

## Sample response shape

    {
      "question": "How should we triage phishing that may have captured credentials?",
      "answer": "Phishing: MITRE ATT&CK T1566 covers phishing techniques used to obtain credentials...",
      "citations": [
        {
          "doc_id": "mitre-t1566",
          "title": "Phishing",
          "source": "MITRE ATT&CK",
          "source_url": "https://attack.mitre.org/techniques/T1566/",
          "score": 1.23,
          "passage": "MITRE ATT&CK T1566 covers phishing techniques used to obtain credentials or deliver malicious content"
        }
      ]
    }

## Example analyst questions

- `How should we triage phishing that may have captured credentials?`
- `What should we validate after a remote code execution advisory?`
- `Which signs suggest valid account abuse?`
- `What should we do for ransomware readiness?`
- `How can suspicious PowerShell execution be detected?`

## Public release checklist

Before making this repository public, verify:

- [x] no real customer data, secrets, tokens, or internal notes are present
- [x] corpus uses demo/advisory-style content only
- [x] tests pass locally
- [x] README includes setup, test, and demo usage
- [x] request validation avoids obvious API errors
- [x] Docker image and Compose demo are available
- [x] GitHub topics are configured
- [x] repository visibility is changed to public intentionally

## Roadmap

Good next improvements:

1. [x] Add TF-IDF or BM25 scoring while keeping citation behavior unchanged.
2. [x] Add markdown document ingestion for analyst notes.
3. [ ] Add embedding-based retrieval behind the same `rank_documents` interface.
4. [x] Add source URLs and passage-level citation spans.
5. [x] Add a small web UI for analyst queries.

## License

MIT