adrienclaire/threat-intel-rag
# threat-intel-rag
Local-first threat intelligence retrieval assistant for security analysts.
This MVP answers analyst questions from a compact local security corpus and returns the passages it used as citations. It is intentionally simple and transparent: lexical retrieval first, with a clean path to upgrade later to embeddings or a full RAG pipeline.
## Why this repo exists
Security questions are often answerable from known references, but the context is scattered across ATT&CK techniques, advisories, analyst notes, and response guidance. This project demonstrates a small retrieval workflow that keeps cited source material attached to every answer.
It is designed as a portfolio-ready example of:
- cybersecurity knowledge retrieval
- explainable local-first AI architecture
- FastAPI service design
- testable Python code
- a realistic path from MVP to production RAG
## What it does
## Current corpus
The demo corpus includes examples inspired by common security reference types:
- MITRE ATT&CK-style techniques
- vendor advisory-style vulnerability notes
- CISA-style ransomware readiness guidance
The corpus is deliberately small so the retrieval behavior is easy to inspect and explain.
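A corpus this small can be loaded and tokenized in a few lines. The sketch below is illustrative only. The field names (`doc_id`, `title`, `source`, `source_url`, `text`) are assumptions modeled on the citation fields in the sample response, not the confirmed schema of `data/corpus.json`. `Document`, `load_corpus`, and `tokenize` are hypothetical names and may not match what `app/indexer.py` actually defines.

```python
import json
import re
from dataclasses import dataclass

# Assumed schema: these field names mirror the citation fields in the sample
# response and are not confirmed against data/corpus.json.
REQUIRED_FIELDS = ("doc_id", "title", "source", "source_url", "text")

TOKEN_RE = re.compile(r"[a-z0-9]+")


@dataclass(frozen=True)
class Document:
    doc_id: str
    title: str
    source: str
    source_url: str
    text: str


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def load_corpus(raw_json: str) -> list[Document]:
    """Parse a JSON array of records, rejecting incomplete or duplicate entries."""
    documents: list[Document] = []
    seen_ids: set[str] = set()
    for record in json.loads(raw_json):
        missing = [name for name in REQUIRED_FIELDS if not record.get(name)]
        if missing:
            raise ValueError(f"record is missing fields: {missing}")
        if record["doc_id"] in seen_ids:
            raise ValueError(f"duplicate doc_id: {record['doc_id']}")
        seen_ids.add(record["doc_id"])
        documents.append(Document(**{name: record[name] for name in REQUIRED_FIELDS}))
    return documents


# One made-up record, reusing the values shown in the sample response.
SAMPLE = json.dumps([
    {
        "doc_id": "mitre-t1566",
        "title": "Phishing",
        "source": "MITRE ATT&CK",
        "source_url": "https://attack.mitre.org/techniques/T1566/",
        "text": "MITRE ATT&CK T1566 covers phishing techniques used to obtain credentials or deliver malicious content",
    }
])

docs = load_corpus(SAMPLE)
print(docs[0].doc_id, tokenize(docs[0].text)[:4])
```

Note that this simple tokenizer splits `ATT&CK` into `att` and `ck`. That is usually harmless for lexical matching as long as queries and documents go through the same function.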
## Architecture
```text
Analyst question
      |
      v
FastAPI /query endpoint
      |
      v
retriever.py -> lexical scoring over loaded documents
      |
      v
answer + citations
```
Key files:
- `app/main.py` — FastAPI app and request validation
- `app/indexer.py` — corpus loading and tokenization
- `app/retriever.py` — ranking and answer construction
- `data/corpus.json` — compact demo security corpus
- `tests/` — retriever, API, and corpus quality tests
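The retrieval step in the diagram can be sketched as a term-overlap ranker. This is a minimal illustration, not the code in `app/retriever.py`. The scoring formula (matched term frequency divided by the square root of document length), the `answer_query` helper, the `top_k` parameter, and the demo documents are all assumptions made for the example.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def score(query_tokens: set[str], doc_tokens: list[str]) -> float:
    """Sum the frequencies of matched query terms, damped by document length."""
    if not doc_tokens:
        return 0.0
    counts = Counter(doc_tokens)
    overlap = sum(counts[token] for token in query_tokens)
    return overlap / math.sqrt(len(doc_tokens))


def answer_query(question: str, documents: list[dict], top_k: int = 3) -> dict:
    """Rank documents lexically and return an answer with its citations."""
    query_tokens = set(tokenize(question))
    scored = [(score(query_tokens, tokenize(doc["text"])), doc) for doc in documents]
    # Keep only documents that matched at least one query term.
    ranked = sorted(
        ((s, doc) for s, doc in scored if s > 0),
        key=lambda pair: pair[0],
        reverse=True,
    )[:top_k]
    citations = [
        {"doc_id": doc["doc_id"], "title": doc["title"], "score": round(s, 2), "passage": doc["text"]}
        for s, doc in ranked
    ]
    if citations:
        # The answer is built directly from the top-ranked passage.
        answer = f"{citations[0]['title']}: {citations[0]['passage']}"
    else:
        answer = "No matching passages were found in the local corpus."
    return {"question": question, "answer": answer, "citations": citations}


# Made-up demo documents, not the contents of data/corpus.json.
DOCS = [
    {"doc_id": "demo-phishing", "title": "Phishing", "text": "Phishing emails try to capture credentials from users"},
    {"doc_id": "demo-ransomware", "title": "Ransomware readiness", "text": "Keep offline backups and rehearse recovery for ransomware"},
]

result = answer_query("How should we triage phishing that captured credentials?", DOCS)
print(result["citations"][0]["doc_id"])
```

Because the answer is assembled from the cited passages themselves, every statement in the response can be traced back to a specific document.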
## Stack
- Python 3.12+
- FastAPI
- Uvicorn
- Pytest
## API endpoints
| Method | Path | Purpose |
|---|---|---|
| `GET` | `/` | Small browser UI for analyst queries |
| `GET` | `/health` | Service health check |
| `POST` | `/query` | Ask a threat-intel question |
## Setup
Manual setup:
```bash
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -r requirements-dev.txt
```
Or use the Makefile:
```bash
make install
```
## Run tests
```bash
source .venv/bin/activate
python -m pytest -q
```
Or:
```bash
make test
```
Expected result:
```text
12 passed
```
## Run locally
```bash
source .venv/bin/activate
uvicorn app.main:app --reload
```
Or:
```bash
make run
```
Open the API docs:
```text
http://localhost:8000/docs
```
## Run with Docker
Build the image:
```bash
make docker-build
```
Run the container:
```bash
make docker-run
```
Or use Docker Compose:
```bash
make docker-up
```
The API listens on:
```text
http://localhost:8000
```
Stop the Compose stack:
```bash
make docker-down
```
## Sample request
```bash
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question":"How should we triage phishing that may have captured credentials?"}'
```
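The same request can be sent from Python using only the standard library. The sketch below assumes the service is running at `http://localhost:8000` as described in the run instructions. `build_query_request` and `ask` are hypothetical helper names for this example and are not part of the repository.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # default local address from the run instructions


def build_query_request(question: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST /query request without sending it."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question: str, base_url: str = BASE_URL, timeout: float = 10.0) -> dict:
    """Send the question to a running service and decode the JSON reply."""
    request = build_query_request(question, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request needs no running server; calling ask() does.
request = build_query_request("How should we triage phishing that may have captured credentials?")
print(request.get_method(), request.full_url)
```

Separating request construction from sending keeps the payload easy to inspect and test without a live server.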
## Sample response shape
```json
{
  "question": "How should we triage phishing that may have captured credentials?",
  "answer": "Phishing: MITRE ATT&CK T1566 covers phishing techniques used to obtain credentials...",
  "citations": [
    {
      "doc_id": "mitre-t1566",
      "title": "Phishing",
      "source": "MITRE ATT&CK",
      "source_url": "https://attack.mitre.org/techniques/T1566/",
      "score": 1.23,
      "passage": "MITRE ATT&CK T1566 covers phishing techniques used to obtain credentials or deliver malicious content"
    }
  ]
}
```
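A client may want to turn this response into typed objects before rendering citations in a report. The following sketch assumes the response contains exactly the fields shown above. The `Citation` and `QueryResponse` classes and the `parse_response` helper are hypothetical and do not come from the repository.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    doc_id: str
    title: str
    source: str
    source_url: str
    score: float
    passage: str


@dataclass(frozen=True)
class QueryResponse:
    question: str
    answer: str
    citations: list[Citation]


def parse_response(payload: str) -> QueryResponse:
    """Decode a /query response body into typed objects.

    Raises TypeError if a citation has missing or unexpected fields.
    """
    data = json.loads(payload)
    citations = [Citation(**item) for item in data["citations"]]
    return QueryResponse(question=data["question"], answer=data["answer"], citations=citations)


# Payload copied from the sample response shape.
PAYLOAD = json.dumps({
    "question": "How should we triage phishing that may have captured credentials?",
    "answer": "Phishing: MITRE ATT&CK T1566 covers phishing techniques used to obtain credentials...",
    "citations": [
        {
            "doc_id": "mitre-t1566",
            "title": "Phishing",
            "source": "MITRE ATT&CK",
            "source_url": "https://attack.mitre.org/techniques/T1566/",
            "score": 1.23,
            "passage": "MITRE ATT&CK T1566 covers phishing techniques used to obtain credentials or deliver malicious content",
        }
    ],
})

response = parse_response(PAYLOAD)
for citation in response.citations:
    print(f"[{citation.doc_id}] {citation.title} ({citation.source}) score={citation.score:.2f}")
```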
## Example analyst questions
- `How should we triage phishing that may have captured credentials?`
- `What should we validate after a remote code execution advisory?`
- `Which signs suggest valid account abuse?`
- `What should we do for ransomware readiness?`
- `How can suspicious PowerShell execution be detected?`
## Public release checklist
Before making this repository public, verify:
- [x] no real customer data, secrets, tokens, or internal notes are present
- [x] corpus uses demo/advisory-style content only
- [x] tests pass locally
- [x] README includes setup, test, and demo usage
- [x] request validation avoids obvious API errors
- [x] Docker image and Compose demo are available
- [x] GitHub topics are configured
- [x] repository visibility is changed to public intentionally
## Roadmap
Improvements completed so far (checked) and still planned (unchecked):
1. [x] Add TF-IDF or BM25 scoring while keeping citation behavior unchanged.
2. [x] Add markdown document ingestion for analyst notes.
3. [ ] Add embedding-based retrieval behind the same `rank_documents` interface.
4. [x] Add source URLs and passage-level citation spans.
5. [x] Add a small web UI for analyst queries.
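To illustrate how a scorer can change while callers stay the same, here is a minimal Okapi BM25 sketch behind a function named `rank_documents`. The signature, the return shape of `(score, document)` pairs, the `k1` and `b` defaults, and the demo documents are assumptions for this example. The repository's actual `rank_documents` interface may differ.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def rank_documents(
    query: str, documents: list[dict], k1: float = 1.5, b: float = 0.75
) -> list[tuple[float, dict]]:
    """Return (score, document) pairs sorted by descending Okapi BM25 score."""
    tokenized = [tokenize(doc["text"]) for doc in documents]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = set(tokenize(query))

    results = []
    for doc, tokens in zip(documents, tokenized):
        counts = Counter(tokens)
        total = 0.0
        for term in query_terms:
            tf = counts[term]
            if tf == 0:
                continue
            # IDF variant with +1 inside the log, so it is never negative.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalization: longer-than-average documents are damped.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            total += idf * tf * (k1 + 1) / norm
        if total > 0:
            results.append((total, doc))
    return sorted(results, key=lambda pair: pair[0], reverse=True)


# Made-up demo documents, not the contents of data/corpus.json.
DOCS = [
    {"doc_id": "demo-powershell", "text": "Detect suspicious PowerShell execution with script block logging"},
    {"doc_id": "demo-accounts", "text": "Valid account abuse often shows logins from unusual locations"},
    {"doc_id": "demo-ransomware", "text": "Ransomware readiness depends on offline backups"},
]

ranked = rank_documents("How can suspicious PowerShell execution be detected?", DOCS)
print([doc["doc_id"] for _, doc in ranked])
```

An embedding-based ranker could return the same `(score, document)` pairs, which is what would let the citation-building code remain unchanged.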
## License
MIT