oussama-allouch/snort-rag-rule-generator
GitHub: oussama-allouch/snort-rag-rule-generator
Stars: 1 | Forks: 0
# Snort RAG Rule Generator - NLP Mini Project + Devoir 3
Defensive NLP/RAG project for generating valid Snort IDS rules from natural-language network-attack descriptions.
## What this project contains
- An **official Person 1 retrieval dataset** stored in `data/processed/final_snort_dataset.csv`.
- A **trusted-source knowledge base** of real Snort rules stored in `data/knowledge_base/`.
- A legacy script to **expand trusted rules into experiment rows**: `python -m snort_rag.generate_dataset --multiplier 20`.
- Seven Devoir 3 architectures:
- baseline without RAG
- classic RAG
- RAG with re-ranking
- hybrid RAG (TF-IDF dense fallback + BM25 fusion)
- multi-hop RAG
- graph RAG
- agentic RAG
- Metrics and comparison table in `results/comparison_metrics.csv`.
- t-SNE embedding visualization in `results/embedding_tsne.png`.
- Gradio dashboard with PDF upload to extend the knowledge base.
- Technical report in `docs/`.
## Important academic constraint
The assignment forbids direct black-box LLM use and forbids OpenAI/Mistral/Ollama API usage for Devoir 3. Therefore this project uses a **local transparent generator** based on retrieved context + Snort templates. The code still implements the complete RAG pipeline: query encoding, retrieval, prompt construction, generation, explanation and evaluation.
## Installation
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
pip install -e .
## Build the real-rule knowledge base
python scripts/fetch_real_sources.py
Outputs:
- `data/knowledge_base/trusted_rule_kb.csv`
- `data/knowledge_base/trusted_rule_kb.jsonl`
- `data/knowledge_base/fetch_summary.json`
## Official Person 1 dataset
The official dataset used by the application, Devoir 3 runner, and retrieval layer is:
- `data/processed/final_snort_dataset.csv`
- `data/processed/final_snort_dataset.jsonl`
- `data/processed/dataset_summary.json`
- `data/processed/person1_rules.rules`
This dataset is personal, controlled, manually reviewable, and is the main dataset for Person 1 and the default retrieval corpus across the project.
Submission status:
- The official Person 1 dataset is `data/processed/final_snort_dataset.csv`.
- The final dataset is personal, controlled, balanced across 10 attack types, and contains 200 rows.
- The legacy trusted-rule expansion generator remains in the repository only as experimental code.
- The old 500k-row generated files are not included in the final submitted project.
## Legacy dataset generator
The legacy generator requires the trusted real-rule knowledge base and creates
multiple natural-language rows per real rule for older experiments only.
python -m snort_rag.generate_dataset --multiplier 10
# bigger dataset
python -m snort_rag.generate_dataset --multiplier 30
Legacy outputs if you run the experiment manually:
- `data/experiments/legacy_generated/snort_generated_dataset.csv`
- `data/experiments/legacy_generated/snort_generated_dataset.json`
- `data/experiments/legacy_generated/snort_generated_dataset.jsonl`
- `data/experiments/legacy_generated/snort_generated_dataset_summary.json`
These files are legacy experimental artifacts only. The legacy generator no longer writes into `data/processed/` and must not be used as the official Person 1 dataset workflow.
## Run Devoir 3 comparison
python -m snort_rag.run_devoir3
Outputs:
- `results/comparison_metrics.csv`
- `results/detailed_devoir3_results.csv`
- `results/embedding_tsne.png`
## Launch dashboard
python -m snort_rag.app_gradio
The dashboard lets you choose a RAG architecture, generate a Snort rule, inspect retrieved documents and add a PDF as a new knowledge source.
## Project structure
src/snort_rag/ package source code
data/knowledge_base/ persisted trusted-source real Snort rules
data/raw/ source manifest
data/processed/ official Person 1 dataset
data/experiments/legacy_generated/ legacy trusted-rule expansion outputs only
notebooks/ Devoir 3 notebook
results/ metrics and t-SNE plot
docs/ report files
scripts/fetch_real_sources.py trusted-source rule ingestion
## Example
from snort_rag.architectures import SnortRAGArchitectures
rag = SnortRAGArchitectures("data/processed/final_snort_dataset.csv")
result = rag.agentic_rag("Detect SQL injection with UNION SELECT in HTTP URI")
print(result["generated_rule"])
print(result["explanation"])
## Disclaimer
This is a defensive IDS rule generation project for education. Every generated rule must still be validated in a real Snort installation using Snort's configuration test mode before production use.