# RAG Phishing Incident Response Guidance
A Retrieval-Augmented Generation (RAG) system that helps small and mid-sized organizations answer practical phishing incident response questions using source-grounded cybersecurity guidance.
The project builds a curated knowledge base, chunks and embeds authoritative incident response documents, retrieves relevant context with ChromaDB, and generates concise practitioner guidance with a Google GenAI/Gemma model.

## Project Snapshot
| Area | Details |
| --- | --- |
| Domain | Cybersecurity |
| Use case | Phishing incident response for small and mid-sized organizations |
| Knowledge base | 10 curated government and vendor guidance documents |
| Document volume | 20,007 total source words |
| Chunking | 277 paragraph-aware chunks |
| Embedding model | `sentence-transformers/all-MiniLM-L6-v2` |
| Vector store | ChromaDB |
| Retrieval method | Dense vector similarity search |
| Retrieved context | Top 3 chunks per query |
| Generator | `models/gemma-4-26b-a4b-it` via Google AI Studio / Gemini API |
| Evaluation set | 10 phishing response queries |
## What This System Does
The RAG pipeline answers questions such as:
- What should we do after an employee clicks a phishing link?
- How should a small company respond to credential theft after a phishing email?
- What should an administrator do when a Microsoft 365 mailbox is compromised?
- What should be included in a phishing incident response playbook?
For each query, the system:
1. Embeds the practitioner question.
2. Retrieves the most relevant source chunks from ChromaDB.
3. Builds a strict source-grounded prompt.
4. Generates operational guidance with citations to retrieved source titles.
5. Scores the answer using reference-free RAG evaluation proxies.
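
Step 3 can be illustrated with a small prompt builder. This is a minimal sketch, not the code in `src/generator.py`: the `RetrievedChunk` class, the `build_grounded_prompt` function, and the instruction wording are all hypothetical and only show the general shape of a source-grounded prompt.

```python
from dataclasses import dataclass


@dataclass
class RetrievedChunk:
    """One retrieved passage; the field names are illustrative assumptions."""
    source_title: str
    text: str


def build_grounded_prompt(question: str, chunks: list[RetrievedChunk]) -> str:
    """Assemble a prompt that restricts the model to the supplied context."""
    # Number each passage and label it with its source title so the model
    # has something concrete to cite.
    context_blocks = [
        f"[{i}] Source: {chunk.source_title}\n{chunk.text.strip()}"
        for i, chunk in enumerate(chunks, start=1)
    ]
    context = "\n\n".join(context_blocks)
    return (
        "You are assisting with phishing incident response.\n"
        "Answer using ONLY the context below. Cite the source title for each "
        "recommendation. If the context does not cover the question, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question.strip()}\n"
        "Answer:"
    )


if __name__ == "__main__":
    # Placeholder passages; real chunks would come from the retriever.
    demo_chunks = [
        RetrievedChunk("Recognize and Report Phishing", "Placeholder passage text."),
        RetrievedChunk("Cybersecurity for Small Business", "Another placeholder passage."),
    ]
    print(build_grounded_prompt(
        "What should we do after an employee clicks a phishing link?", demo_chunks
    ))
```

Keeping the source title next to each passage is what makes the citation check in step 5 possible, since the evaluator can look for those titles in the answer.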
## Architecture

    Curated source documents
              |
              v
    Document loading and cleaning
              |
              v
    Paragraph-aware chunking
              |
              v
    MiniLM embeddings
              |
              v
    ChromaDB vector store
              |
              v
    Top-k retrieval
              |
              v
    Source-grounded prompt
              |
              v
    Google GenAI / Gemma generation
              |
              v
    Heuristic RAG evaluation and reporting
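
The top-k retrieval stage can be sketched without any dependencies. The example below is a stand-in for what ChromaDB does over the MiniLM embeddings, assuming cosine similarity on normalized vectors; the function names and the toy two-dimensional vectors are illustrative, not taken from `src/retriever.py`.

```python
import math


def normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def top_k(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 3):
    """Return (chunk_id, similarity) pairs for the k most similar chunks, best first."""
    q = normalize(query_vec)
    scored = []
    for chunk_id, vec in chunk_vecs.items():
        v = normalize(vec)
        scored.append((chunk_id, sum(a * b for a, b in zip(q, v))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy vectors; real embeddings from all-MiniLM-L6-v2 have 384 dimensions.
    toy_index = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0], "d": [-1.0, 0.0]}
    for chunk_id, score in top_k([1.0, 0.0], toy_index, k=3):
        print(chunk_id, round(score, 3))
```

A vector store replaces the linear scan with an index, but the ranking idea is the same: the three highest-scoring chunks become the context for the prompt.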
## Repository Structure

    .
    |-- data/
    |   |-- raw/sources/    # Curated source text files
    |   |-- processed/      # Cleaned documents and generated chunks
    |   |-- evaluation/     # Evaluation query set
    |   `-- metadata/       # Source metadata and project config
    |-- notebooks/
    |   `-- 01_rag_phishing_incident_response.ipynb
    |-- outputs/
    |   |-- retrieval/      # Retrieved chunks per query
    |   |-- generation/     # Generated answers and full RAG results
    |   |-- evaluation/     # Heuristic evaluation scores
    |   |-- tables/         # Report-ready tables
    |   `-- figures/        # Report-ready plots
    |-- report/
    |   |-- figures/        # Figures copied for report writing
    |   |-- tables/         # Tables copied for report writing
    |   `-- assignment3_report.pdf
    |-- slides/             # Presentation deck
    |-- src/                # Reusable pipeline modules
    |-- video/              # Demo video and transcript
    |-- requirements.txt
    `-- README.md

`vectorstore/chroma/` is generated locally by the notebook and ignored by Git.
## Source Corpus
The knowledge base combines cybersecurity guidance from NIST, CISA, NSA, FBI, MS-ISAC, NCSC, Microsoft, Google, and the FTC.
| ID | Source | Organization |
| --- | --- | --- |
| `src001` | Incident Response Recommendations and Considerations for Cybersecurity Risk Management: A CSF 2.0 Community Profile | NIST |
| `src002` | Phishing Guidance: Stopping the Attack Cycle at Phase One | CISA, NSA, FBI, MS-ISAC |
| `src003` | Recognize and Report Phishing | CISA |
| `src004` | Small organisations guide to cyber security | NCSC |
| `src005` | Plan: Your cyber incident response processes | NCSC |
| `src006` | Respond to a compromised cloud email account | Microsoft |
| `src007` | Incident response overview | Microsoft |
| `src008` | Identify and secure compromised accounts | Google |
| `src009` | Protect your Google Cloud resources from compromised credentials | Google |
| `src010` | Cybersecurity for Small Business | FTC |

Full source URLs and metadata are stored in [`data/metadata/sources_metadata.csv`](data/metadata/sources_metadata.csv).
## Core Modules
| Module | Purpose |
| --- | --- |
| [`src/config.py`](src/config.py) | Central project constants for models, chunking, retrieval, and evaluation |
| [`src/document_loader.py`](src/document_loader.py) | Loads source metadata and raw source files |
| [`src/preprocessing.py`](src/preprocessing.py) | Cleans text, creates overlapping chunks, and assigns deterministic chunk IDs |
| [`src/embeddings.py`](src/embeddings.py) | Loads the sentence-transformer model and creates normalized embeddings |
| [`src/retriever.py`](src/retriever.py) | Retrieves top-k chunks from ChromaDB |
| [`src/generator.py`](src/generator.py) | Builds source-grounded prompts and calls Google GenAI |
| [`src/evaluator.py`](src/evaluator.py) | Computes context relevance, answer relevance, faithfulness, and citation coverage proxies |
| [`src/utils.py`](src/utils.py) | Shared filesystem and JSON helpers |
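
The chunking behavior described for `src/preprocessing.py` (paragraph-aware, overlapping, deterministic IDs) can be sketched as follows. This is an assumption-laden illustration, not the module's actual code: the function names, the `max_words` and `overlap_paragraphs` parameters, and the ID format are all hypothetical.

```python
import hashlib


def chunk_paragraphs(text: str, max_words: int = 120, overlap_paragraphs: int = 1) -> list[str]:
    """Pack whole paragraphs into chunks of roughly max_words words.

    Each new chunk starts with the last `overlap_paragraphs` paragraphs of
    the previous chunk. Paragraphs are never split, so a chunk can exceed
    max_words when a paragraph (plus the carried-over overlap) is long.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_words = 0
    for para in paragraphs:
        n = len(para.split())
        if current and current_words + n > max_words:
            # Close the current chunk and carry the tail forward as overlap.
            chunks.append("\n\n".join(current))
            current = current[-overlap_paragraphs:] if overlap_paragraphs else []
            current_words = sum(len(p.split()) for p in current)
        current.append(para)
        current_words += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def chunk_id(source_id: str, index: int, chunk_text: str) -> str:
    """Build a reproducible ID from the source, position, and a content hash."""
    digest = hashlib.sha1(chunk_text.encode("utf-8")).hexdigest()[:8]
    return f"{source_id}_{index:03d}_{digest}"


if __name__ == "__main__":
    sample = "one two three\n\nfour five six\n\nseven eight nine"
    for i, chunk in enumerate(chunk_paragraphs(sample, max_words=6)):
        print(chunk_id("src001", i, chunk), repr(chunk))
```

Hashing the chunk text means rerunning the pipeline on an unchanged corpus yields the same IDs, which keeps retrieval outputs comparable between runs.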
## Setup
### 1. Create a virtual environment

    python -m venv .venv
    source .venv/bin/activate

On Windows PowerShell:

    python -m venv .venv
    .\.venv\Scripts\Activate.ps1
### 2. Install dependencies

    pip install -r requirements.txt
### 3. Configure environment variables
Create `.env` from the example file:

    cp .env.example .env

Then add your Google AI Studio key:

    GOOGLE_API_KEY=your_google_ai_studio_key_here
    OPENAI_API_KEY=your_openai_key_here
`GOOGLE_API_KEY` is required for answer generation. `OPENAI_API_KEY` is included as an optional placeholder and is not required by the default pipeline.
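
A quick check before running generation can catch a missing or placeholder key early. The helper below is a hypothetical sketch, not part of the repository; it assumes the values from `.env` have already been loaded into the process environment (for example by a dotenv loader or a shell export).

```python
import os


def load_api_key(env=None, name: str = "GOOGLE_API_KEY") -> str:
    """Return the named key, or raise a clear error if it is missing or a placeholder."""
    # Accept an explicit mapping so the check can be exercised without
    # touching the real environment.
    env = os.environ if env is None else env
    value = (env.get(name) or "").strip()
    if not value or value.startswith("your_"):
        raise RuntimeError(
            f"{name} is not set; add it to .env or export it before generating answers."
        )
    return value


if __name__ == "__main__":
    try:
        load_api_key()
        print("GOOGLE_API_KEY found")
    except RuntimeError as err:
        print(err)
```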
## Running the Project
Open and run the notebook from top to bottom:

    notebooks/01_rag_phishing_incident_response.ipynb
The notebook performs the full workflow:
1. Environment and API setup
2. Project configuration export
3. Source loading and text cleaning
4. Chunk generation
5. Embedding and ChromaDB indexing
6. Evaluation query loading
7. Retrieval testing
8. RAG answer generation
9. Heuristic evaluation
10. Error analysis, plots, and report exports
## Main Outputs
| Output | Description |
| --- | --- |
| [`data/processed/documents_cleaned.csv`](data/processed/documents_cleaned.csv) | Cleaned document-level corpus |
| [`data/processed/chunks.csv`](data/processed/chunks.csv) | Chunk-level corpus used for retrieval |
| [`outputs/retrieval/retrieved_chunks.csv`](outputs/retrieval/retrieved_chunks.csv) | Retrieved chunks for evaluation queries |
| [`outputs/generation/generated_answers.csv`](outputs/generation/generated_answers.csv) | Generated answers only |
| [`outputs/generation/rag_results_full.csv`](outputs/generation/rag_results_full.csv) | Queries, retrieved context, prompts, answers, and metadata |
| [`outputs/evaluation/heuristic_scores.csv`](outputs/evaluation/heuristic_scores.csv) | Per-query evaluation scores |
| [`outputs/tables/results_summary.csv`](outputs/tables/results_summary.csv) | Compact results table |
| [`outputs/tables/error_analysis_with_scores.csv`](outputs/tables/error_analysis_with_scores.csv) | Error analysis joined with scores |
| [`outputs/figures/overall_score_by_query.png`](outputs/figures/overall_score_by_query.png) | Overall score visualization |
| [`outputs/figures/average_metric_scores.png`](outputs/figures/average_metric_scores.png) | Mean score visualization |
## Evaluation
Because the project does not include human-written reference answers, evaluation uses RAGAS-style reference-free proxy metrics:
- **Context relevance**: how well retrieved chunks match the query.
- **Answer relevance**: how semantically aligned the generated answer is with the query.
- **Faithfulness**: how well answer claims are grounded in retrieved context.
- **Citation coverage**: whether generated answers cite retrieved source titles.
- **Overall score**: aggregate quality score across the evaluation dimensions.
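
To make the idea of a reference-free proxy concrete, here is a minimal sketch of two such checks. It is not the logic in `src/evaluator.py`: the function names are hypothetical, the citation check is a simple title match, and the faithfulness proxy uses plain token overlap as a lexical stand-in, whereas the project's metrics may rely on embedding similarity.

```python
import re

# Deliberately tiny stopword list, just enough for the illustration.
STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "for", "is", "are", "on", "with"}


def content_tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens with the stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def citation_coverage(answer: str, retrieved_titles: list[str]) -> float:
    """1.0 if the answer names at least one retrieved source title, else 0.0."""
    lowered = answer.lower()
    return 1.0 if any(title.lower() in lowered for title in retrieved_titles) else 0.0


def token_overlap_faithfulness(answer: str, context: str) -> float:
    """Share of the answer's content tokens that also occur in the context."""
    answer_tokens = content_tokens(answer)
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & content_tokens(context)) / len(answer_tokens)


if __name__ == "__main__":
    answer = "Reset the password and report the email (Recognize and Report Phishing)."
    context = "Report the email and reset the password."
    print(citation_coverage(answer, ["Recognize and Report Phishing"]))
    print(round(token_overlap_faithfulness(answer, context), 3))
```

Proxies like these are cheap and need no reference answers, but they reward surface overlap, so the scores are best read as relative signals between queries rather than absolute quality measures.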
### Final Mean Scores
| Metric | Mean | Minimum | Maximum |
| --- | ---: | ---: | ---: |
| Context relevance | 0.617 | 0.541 | 0.689 |
| Answer relevance | 0.764 | 0.671 | 0.873 |
| Faithfulness | 0.627 | 0.568 | 0.740 |
| Citation coverage | 1.000 | 1.000 | 1.000 |
| Overall score | 0.696 | 0.656 | 0.762 |

- Best-performing query: `Q7`
- Weakest query: `Q6`

## Reproducibility Notes
- The source corpus is versioned in `data/raw/sources/`.
- The processed corpus and generated outputs are included for inspection.
- The local ChromaDB directory (`vectorstore/chroma/`) is rebuilt by the notebook and intentionally excluded from Git through `.gitignore`.
- Generation results can vary slightly because the hosted model and provider-side behavior may change.
- The notebook selects a generation-capable Gemma model when available.
## Limitations
## Submitted Artifacts
- Report: [`report/assignment3_report.pdf`](report/assignment3_report.pdf)
- Slides: [`slides/RAG System for Phishing Incident Response Guidance.pptx`](slides/RAG%20System%20for%20Phishing%20Incident%20Response%20Guidance.pptx)
- Demo video: [`video/screen-2026-05-26_22-52-06.mp4`](video/screen-2026-05-26_22-52-06.mp4)
- Video transcript: [`video/video_transcipt.pdf`](video/video_transcipt.pdf)
## License
This project is released under the [MIT License](LICENSE).
This repository contains an academic/research implementation and curated references to public cybersecurity guidance. Review the original source terms and your organization's policies before reusing the material in production.