sahilkapur1993-sys/cyber-risk-rag-assistant
GitHub: sahilkapur1993-sys/cyber-risk-rag-assistant
Stars: 0 | Forks: 0
# Cyber Risk Intelligence Assistant
## Table of Contents
1. [Project Overview](#project-overview)
2. [Architecture](#architecture)
3. [Tech Stack](#tech-stack)
4. [Repository Structure](#repository-structure)
5. [Data Pipeline](#data-pipeline)
6. [RAG Pipeline](#rag-pipeline)
7. [Backend API](#backend-api)
8. [Frontend](#frontend)
9. [AWS Infrastructure](#aws-infrastructure)
10. [Local Development Setup](#local-development-setup)
11. [Deployment Guide](#deployment-guide)
12. [Cost Breakdown](#cost-breakdown)
## Project Overview
The Cyber Risk Intelligence Assistant ingests real-time CVE (Common Vulnerabilities and Exposures) data from the NVD (National Vulnerability Database), indexes it using vector embeddings, and exposes a natural language query interface. Users can ask questions like *"What are the most critical vulnerabilities affecting Microsoft products?"* and receive AI-generated answers grounded in real CVE data.
**Key differentiators:**
- Live data — not a generic PDF chatbot. Queries real, up-to-date CVE threat intelligence.
- Domain-specific — designed for cyber risk and security professionals.
- Production-grade AWS infrastructure — automated weekly pipeline, persistent storage, live deployment.
## Architecture
┌─────────────────────────────────────────────────────┐
│ DATA PIPELINE (Weekly) │
│ │
│ AWS Glue Workflow │
│ ┌─────────────┐ ┌────────────┐ ┌──────────┐ │
│ │ Job 1 │ -> │ Job 2 │ -> │ Job 3 │ │
│ │ Fetch CVEs │ │ Chunk Data │ │ Embed │ │
│ │ (NVD API) │ │ │ │ (OpenAI) │ │
│ └─────────────┘ └────────────┘ └──────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ S3: raw/cve_feed.json S3: processed/ S3: faiss_index/
│ cve_chunks.json embeddings.npy
└─────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│ APP SERVER (EC2) │
│ │
│ User → nginx (port 80) → FastAPI (port 8000) │
│ │ │
│ rag/pipeline.py │
│ │ │
│ ┌──────────┴──────────┐ │
│ │ │ │
│ OpenAI API S3 (load index) │
│ (embed question) (at startup) │
└─────────────────────────────────────────────────────┘
**Data flow:**
1. Weekly, AWS Glue fetches all CVEs from NVD (last 60 days), chunks and embeds them, and stores the FAISS index in S3.
2. The FastAPI app downloads the index from S3 at startup and builds it in memory.
3. A user submits a question via the HTML frontend.
4. The question is embedded via OpenAI, searched against the FAISS index, and the top-k matching CVEs are injected into a GPT-4o-mini prompt.
5. The answer (with CVE sources) is returned to the user.
## Tech Stack
| Layer | Technology | Purpose |
|---|---|---|
| Data ingestion | Python, Requests, AWS Glue | Fetch CVEs from NVD API |
| Data processing | Python, Boto3 | Chunk and clean CVE records |
| Embeddings | OpenAI `text-embedding-3-small` | Generate semantic vectors |
| Vector search | FAISS (`IndexFlatL2`) | Nearest-neighbour retrieval |
| LLM | OpenAI `gpt-4o-mini` | Generate natural language answers |
| Backend | FastAPI, Uvicorn | REST API serving the RAG pipeline |
| Frontend | HTML, CSS, Vanilla JS | Query interface |
| Web server | nginx | Reverse proxy on port 80 |
| Process manager | systemd | Keep FastAPI alive across reboots |
| Storage | AWS S3 | Raw data, chunks, embeddings |
| Orchestration | AWS Glue Workflow | Weekly automated pipeline |
| Runtime | Python 3.11 (EC2), Python 3.9 (Glue) | |
| Package manager | uv | Fast dependency management |
## Repository Structure
cyber-risk-rag-assistant/
│
├── .env.example # Template for required env vars
├── .gitignore
├── pyproject.toml # uv project config and dependencies
├── uv.lock
├── pipeline_run.py # Local runner — executes all 3 pipeline steps in sequence
│
├── ingestion/
│ ├── fetch_cve_feed.py # Fetches CVEs from NVD API → saves to S3
│ └── chunk_documents.py # Reads raw JSON from S3, cleans → saves chunks to S3
│
├── embeddings/
│ └── generate_embeddings.py # Reads chunks from S3, calls OpenAI → uploads embeddings.npy to S3
│
├── rag/
│ └── pipeline.py # Core RAG logic: load index, embed question, retrieve, generate answer
│
├── backend/
│ └── main.py # FastAPI app — exposes /query, /health, / endpoints
│
└── frontend/
└── index.html # Single-page HTML frontend
## Data Pipeline
### Step 1 — Fetch CVEs (`ingestion/fetch_cve_feed.py`)
- **Source:** `https://services.nvd.nist.gov/rest/json/cves/2.0`
- **Output:** `s3://cyber-risk-raw-{name}/raw/cve_feed.json`
- **Typical output size:** ~36 MB for 60 days of CVEs
Key parameters:
days_back = 60 # How far back to fetch
results_per_page = 2000 # NVD maximum
### Step 2 — Chunk & Clean (`ingestion/chunk_documents.py`)
**Fields extracted per CVE:**
- CVE ID
- Published date
- Severity and CVSS score (v3.1 → v3.0 → v2.0 fallback)
- Affected products (from CPE strings)
- English description
**Output format per chunk:**
CVE ID: CVE-2026-32211
Published: 2026-04-03
Severity: CRITICAL (Score: 9.1)
Affected Products: microsoft azure_mcp_server
Description: Missing authentication for a critical function in Azure MCP Server...
- **Input:** `s3://.../raw/cve_feed.json`
- **Output:** `s3://.../processed/cve_chunks.json`
### Step 3 — Generate Embeddings (`embeddings/generate_embeddings.py`)
- **Model:** `text-embedding-3-small` (1536 dimensions)
- **Batch size:** 100 (with 0.5s sleep between batches to respect rate limits)
- **Output:** `s3://.../faiss_index/embeddings.npy` and `chunks.pkl`
## RAG Pipeline
The core logic lives in `rag/pipeline.py` and runs entirely on the EC2 app server.
### `load_index()`
dimension = embeddings.shape[1] # 1536
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)
### `embed_question(question)`
Converts the user's question into a 1536-dimensional vector using the same embedding model used for the CVEs (`text-embedding-3-small`). This ensures the question and documents are in the same vector space.
### `retrieve(question, index, chunks, top_k=5)`
Searches the FAISS index for the `top_k` nearest vectors to the question vector. Returns the corresponding CVE chunks with their distances.
### `build_prompt(question, context_chunks)`
Constructs a system prompt that injects the retrieved CVE texts as context and instructs the model to answer using only that data, always citing CVE IDs.
### `generate_answer(prompt)`
Calls `gpt-4o-mini` with `temperature=0.2` (factual, low-creativity) and returns the answer string.
## Backend API
FastAPI app at `backend/main.py`. Served by Uvicorn on port 8000, with nginx as a reverse proxy on port 80.
### Endpoints
| Method | Path | Description |
|---|---|---|
| `GET` | `/` | Health check — returns status and total CVE count |
| `GET` | `/health` | Simple ping |
| `POST` | `/query` | Main endpoint — accepts a question, returns answer + sources |
| `GET` | `/docs` | Auto-generated Swagger UI |
### POST `/query` — Request
{
"question": "What are the most critical vulnerabilities affecting Microsoft products?",
"top_k": 5
}
### POST `/query` — Response
{
"question": "...",
"answer": "Based on the CVE data, the most critical Microsoft vulnerabilities include CVE-2026-32211 (CRITICAL, 9.1)...",
"sources": [
{ "cve_id": "CVE-2026-32211", "severity": "CRITICAL", "score": 9.1 },
{ "cve_id": "CVE-2026-24303", "severity": "CRITICAL", "score": 9.6 }
]
}
### CORS
All origins are allowed (`allow_origins=["*"]`). In production this should be restricted to your domain.
## Frontend
A single-file HTML/CSS/JS interface at `frontend/index.html`. Served as a static file by nginx.
**Features:**
- Dark-themed professional UI
- Textarea for natural language questions
- Sample question buttons that auto-fill the input
- Adjustable `top_k` selector (3, 5, or 10 CVEs)
- Loading animation while query runs
- Answer rendered as formatted text
- Source cards colour-coded by severity (CRITICAL = red, HIGH = orange, MEDIUM = yellow, LOW = green)
- Enter key submits the query
The frontend calls the API at the same host/IP. The `const API` constant at the top of the script should be updated if the server IP changes.
## AWS Infrastructure
### S3 Buckets
| Bucket | Contents |
|---|---|
| `cyber-risk-raw-{name}` | `raw/cve_feed.json`, `processed/cve_chunks.json` |
| `cyber-risk-processed-{name}` | `faiss_index/embeddings.npy`, `faiss_index/chunks.pkl` |
All buckets are private with public access blocked. Access is via IAM role (Glue) or AWS credentials (EC2).
### AWS Glue Workflow
**Workflow name:** `cyber-risk-daily-pipeline`
**Schedule:** Every Monday at 02:00 UTC
Three Python Shell jobs chained via event triggers:
[Schedule: Monday 2AM UTC]
↓
[cyber-risk-fetch-cves]
↓ (SUCCEEDED)
[cyber-risk-chunk-data]
↓ (SUCCEEDED)
[cyber-risk-generate-embeddings]
**Job settings:**
- Type: Python Shell
- Glue version: Python 3.9
- Max capacity: 0.0625 DPU (minimum)
- `--additional-python-modules`: `openai` (Job 3 only)
**Job parameters (set in Glue console — not hardcoded):**
| Job | Parameter | Value |
|---|---|---|
| All | `S3_BUCKET_RAW` | `cyber-risk-raw-{name}` |
| Job 3 | `S3_BUCKET_PROCESSED` | `cyber-risk-processed-{name}` |
| Job 3 | `OPENAI_API_KEY` | `sk-proj-...` |
### EC2 Instance
| Setting | Value |
|---|---|
| AMI | Ubuntu 26.04 LTS |
| Instance type | t2.micro (free tier eligible) |
| Storage | 20 GB gp3 |
| Region | ap-south-1 (Mumbai) |
| Open ports | 22 (SSH, My IP only), 80 (HTTP, anywhere), 8000 (API, anywhere) |
**Services running on EC2:**
- `uvicorn` — FastAPI app on port 8000 (managed by systemd)
- `nginx` — reverse proxy on port 80, serves static frontend
### systemd Service
The FastAPI app is registered as a systemd service (`cyber-risk-app.service`) so it:
- Starts automatically when EC2 boots
- Restarts automatically if the process crashes (`Restart=always`)
Useful commands:
sudo systemctl status cyber-risk-app # Check status
sudo systemctl restart cyber-risk-app # Restart after code update
sudo systemctl stop cyber-risk-app # Stop
journalctl -u cyber-risk-app -f # View live logs
## Local Development Setup
### Prerequisites
- Python 3.11
- [uv](https://github.com/astral-sh/uv) (`pip install uv`)
- Git
- AWS CLI configured (`aws configure`)
- OpenAI API key
### Steps
# 1. Clone
git clone https://github.com/your-username/cyber-risk-rag-assistant.git
cd cyber-risk-rag-assistant
# 2. Create virtual environment
uv venv
source .venv/bin/activate # Linux/Mac
.venv\Scripts\activate # Windows
# 3. Install dependencies
uv sync
# 4. Set up environment variables
cp .env.example .env
# Edit .env and add your OPENAI_API_KEY
# 5. Run the full pipeline (requires AWS credentials + NVD API access)
python pipeline_run.py
# 6. Start the API
uvicorn backend.main:app --host 0.0.0.0 --port 8000
# 7. Open http://localhost:8000/docs to test
## Deployment Guide
### First-time EC2 setup
# Connect
ssh -i ~/.ssh/cyber-risk-key.pem ubuntu@
# Update and install dependencies
sudo apt update && sudo apt upgrade -y
sudo apt install -y python3-pip python3-venv nginx git curl unzip
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
# Install AWS CLI
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip && sudo ./aws/install
# Configure AWS
aws configure # Enter access key, secret, region: ap-south-1, output: json
# Clone and set up project
git clone https://github.com/your-username/cyber-risk-rag-assistant.git
cd cyber-risk-rag-assistant
uv venv && source .venv/bin/activate && uv sync
# Set OpenAI key
echo "OPENAI_API_KEY=sk-proj-..." > .env
# Start FastAPI
uvicorn backend.main:app --host 0.0.0.0 --port 8000
### Set up systemd
sudo nano /etc/systemd/system/cyber-risk-app.service
Paste:
[Unit]
Description=Cyber Risk RAG Assistant
After=network.target
[Service]
User=ubuntu
WorkingDirectory=/home/ubuntu/cyber-risk-rag-assistant
Environment="PATH=/home/ubuntu/cyber-risk-rag-assistant/.venv/bin"
ExecStart=/home/ubuntu/cyber-risk-rag-assistant/.venv/bin/uvicorn backend.main:app --host 0.0.0.0 --port 8000
Restart=always
RestartSec=3
[Install]
WantedBy=multi-user.target
sudo systemctl daemon-reload
sudo systemctl enable cyber-risk-app
sudo systemctl start cyber-risk-app
### Set up nginx
sudo nano /etc/nginx/sites-available/cyber-risk-app
Paste:
server {
listen 80;
server_name ;
root /var/www/cyber-risk;
index index.html;
location / {
try_files $uri $uri/ /index.html;
}
location /query {
proxy_pass http://127.0.0.1:8000;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
}
location /health {
proxy_pass http://127.0.0.1:8000;
}
}
sudo mkdir -p /var/www/cyber-risk
sudo cp frontend/index.html /var/www/cyber-risk/index.html
sudo ln -s /etc/nginx/sites-available/cyber-risk-app /etc/nginx/sites-enabled/
sudo nginx -t
sudo systemctl restart nginx
### Updating the app after code changes
cd ~/cyber-risk-rag-assistant
git pull
uv sync
sudo systemctl restart cyber-risk-app
## Cost Breakdown
### Monthly estimates (weekly pipeline run)
| Service | Usage | Monthly cost |
|---|---|---|
| AWS Glue (3 jobs, weekly) | ~$0.004/run × 4 runs | ~$0.016 |
| OpenAI Embeddings (13k chunks) | ~$0.026/run × 4 runs | ~$0.10 |
| OpenAI Queries (GPT-4o-mini) | ~$0.001–0.002 per query | Varies |
| S3 Storage (~100 MB) | Negligible | ~$0.002 |
| EC2 t2.micro | Free tier (750 hrs/month) | $0 (year 1) |
| **Total (pipeline only)** | | **~$0.12/month** |
EC2 costs approximately $8–10/month after the free tier year expires.
### Cost optimisation opportunity
Currently the pipeline re-embeds all CVEs every run. An incremental approach — only embedding CVEs published since the last run — would reduce OpenAI costs by ~95% (~$0.005/month total).