Wizen-Labs/AI-ML_Capstone
# CVE Risk Classification — Imperial Business School Capstone
**MSc Cybersecurity + Certificate AI/ML**
## Project Structure
    cve_capstone_final/
    │
    ├── 01_data_prep.ipynb                  ← Load, merge, clean NVD + KEV + EPSS data
    ├── 02_eda_and_baseline.ipynb           ← EDA + Naive Bayes (explored, rejected)
    ├── 03_hybrid_knn_decision_tree.ipynb   ← CORE MODEL (KNN clusters + Decision Tree)
    ├── 04_pytorch_neural_network.ipynb     ← Deep learning layer (text classification)
    ├── 05_contrast_layer_stub.ipynb        ← Mythos contrast layer (stub + queue)
    │
    ├── data/
    │   └── df_clean.csv                    ← Generated by notebook 01
    │
    ├── models/
    │   ├── knn_clusterer.pkl
    │   ├── scaler.pkl
    │   ├── hybrid_tree.pkl
    │   ├── label_encoder.pkl
    │   ├── feature_encoders.pkl
    │   └── pytorch_best.pt
    │
    ├── outputs/                            ← All charts and reports (auto-generated)
    └── queue/
        └── pending.jsonl                   ← Mythos validation queue
## Run Order
Always run notebooks in sequence — each depends on outputs from the previous:
01 → 02 → 03 → 04 → 05
## Key Methodological Decisions
### Data leakage fix (notebook 03)
In earlier exploratory notebooks, KNN was fitted on the full dataset before
the train/test split, which leaked test-set information into the cluster
assignments. Notebook 03 fixes this: the split happens first, KNN is fitted
on the training data only, and cluster IDs for the test set are generated by
calling `knn.predict()` on the held-out rows.
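
The ordering described above can be sketched as follows. This is a minimal illustration, not the notebook's code: the data is synthetic, and because this README does not show how the cluster IDs are derived, scikit-learn's `KMeans` is swapped in as a stand-in clusterer that exposes `fit()` and `predict()`. The number of clusters and the split ratio are also assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numeric CVE features (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = rng.integers(0, 3, size=200)

# 1. Split first, so the test rows never influence any fitted object.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# 2. Fit the scaler and the clusterer on the training rows only.
#    KMeans is an assumed stand-in for the project's clusterer.
scaler = StandardScaler().fit(X_train)
clusterer = KMeans(n_clusters=5, n_init=10, random_state=0)
clusterer.fit(scaler.transform(X_train))

# 3. Assign cluster IDs to both splits with predict(); nothing is refitted on test data.
train_cluster_id = clusterer.predict(scaler.transform(X_train))
test_cluster_id = clusterer.predict(scaler.transform(X_test))
```

The leaky variant would call `fit()` on `X` before step 1; keeping every `fit()` call after the split is what prevents the test rows from shaping the clusters.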
### Why Naive Bayes was rejected (notebook 02)
Naive Bayes (NB) assumes that features are conditionally independent given
the class, and CVE descriptions contain strongly co-dependent words that
violate this assumption. Class imbalance also pushed NB toward predicting
the majority class. Notebook 02 keeps the experiment as methodological
evidence.
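
A quick way to spot this kind of majority-class collapse is to look at per-class recall instead of overall accuracy. The sketch below uses only the standard library; the labels `LOW`, `MEDIUM` and `HIGH` and the class counts are hypothetical and are not taken from the project data.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Return {label: recall} for every label that occurs in y_true."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Hypothetical imbalanced label set: 8 LOW, 1 MEDIUM, 1 HIGH.
y_true = ["LOW"] * 8 + ["MEDIUM", "HIGH"]
# A collapsed model that always predicts the majority class.
y_pred = ["LOW"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = per_class_recall(y_true, y_pred)

print(accuracy)  # 0.8
print(recall)    # {'LOW': 1.0, 'MEDIUM': 0.0, 'HIGH': 0.0}
```

The accuracy looks acceptable while two of the three classes are never predicted, which is why accuracy alone is a weak signal on imbalanced data.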
### KNN as feature engineer, not classifier
KNN is not used as the final classifier. It groups similar vulnerabilities
into clusters and injects the resulting `cluster_id` as a feature for the
Decision Tree. The aim is to reduce majority-class bias by giving the tree
semantically richer splits.
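
As a rough illustration of the feature-injection step, the sketch below appends a cluster ID column to a feature matrix before fitting a decision tree. It runs on synthetic data, `KMeans` is again an assumed stand-in for the project's clusterer, and `with_cluster_id` is a hypothetical helper name.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical); the real features come from df_clean.csv.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(150, 3))
y_train = rng.integers(0, 3, size=150)
X_test = rng.normal(size=(50, 3))

# Clusterer fitted on training rows only (KMeans is an assumed stand-in).
clusterer = KMeans(n_clusters=4, n_init=10, random_state=1).fit(X_train)


def with_cluster_id(X):
    """Append the predicted cluster ID as one extra column."""
    cluster_id = clusterer.predict(X).reshape(-1, 1)
    return np.hstack([X, cluster_id])


# The tree sees the original features plus cluster_id.
tree = DecisionTreeClassifier(max_depth=5, random_state=1)
tree.fit(with_cluster_id(X_train), y_train)
predictions = tree.predict(with_cluster_id(X_test))
```

A decision tree treats the appended ID as an ordered number, so splits such as `cluster_id <= 1.5` group clusters by their arbitrary index; one-hot encoding the ID is an alternative if that ordering turns out to be unhelpful.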
### Blind hybrid & Zero-Day stress test (notebook 03)
The model is tested without CVSS scores to simulate the NVD enrichment gap
(NIST triage model, April 2026). If the accuracy delta is positive,
`cluster_id` can serve as a backup signal when official scoring is missing.
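
One possible reading of the accuracy delta is sketched below. It assumes the delta means accuracy with `cluster_id` minus accuracy without it, with the CVSS columns removed in both runs; the README does not define the delta, so that definition, the function name `blind_accuracy_delta`, and the column layout are all assumptions. The synthetic data is deliberately built so that the cluster column is informative, so the result says nothing about the real dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier


def blind_accuracy_delta(X_train, X_test, y_train, y_test, cluster_col, cvss_cols):
    """Accuracy with cluster_id minus accuracy without it; CVSS columns dropped from both."""
    all_cols = list(range(X_train.shape[1]))
    blind_cols = [c for c in all_cols if c not in cvss_cols]
    baseline_cols = [c for c in blind_cols if c != cluster_col]

    def fit_and_score(cols):
        tree = DecisionTreeClassifier(max_depth=5, random_state=0)
        tree.fit(X_train[:, cols], y_train)
        return accuracy_score(y_test, tree.predict(X_test[:, cols]))

    return fit_and_score(blind_cols) - fit_and_score(baseline_cols)


# Synthetic demo (hypothetical layout):
# column 0 = noise feature, column 1 = CVSS-like score, column 2 = cluster_id.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=400)
X = np.column_stack([
    rng.normal(size=400),
    y + rng.normal(scale=0.1, size=400),
    y,  # cluster_id made perfectly informative on purpose
])

delta = blind_accuracy_delta(
    X[:300], X[300:], y[:300], y[300:], cluster_col=2, cvss_cols=[1]
)
print(f"blind accuracy delta: {delta:+.3f}")
```

Comparing both runs on the same train/test split keeps the delta attributable to the `cluster_id` column and not to sampling differences.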
### Mythos contrast layer (notebook 05)
The full pipeline architecture for Claude Mythos validation is implemented,
but Mythos is currently restricted under Project Glasswing. Until access is
granted, the stub queues every classification in `queue/pending.jsonl` as a
retrospective validation set, ready to be compared against Mythos output.
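
A queue of this kind can be kept as a JSON Lines file, one classification per line. The sketch below is an assumption about how such a stub might look, not the notebook's code: the helper names `enqueue` and `read_queue`, the record fields, and the placeholder CVE ID are all hypothetical. It writes to a temporary directory so that running it does not touch the real `queue/pending.jsonl`.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def enqueue(record, queue_path):
    """Append one classification to the queue as a single JSON line."""
    queue_path = Path(queue_path)
    queue_path.parent.mkdir(parents=True, exist_ok=True)
    entry = {"queued_at": datetime.now(timezone.utc).isoformat(), **record}
    with queue_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")


def read_queue(queue_path):
    """Load every pending record back for later comparison."""
    with Path(queue_path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Demo in a temporary directory; field names and values are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "queue" / "pending.jsonl"
    enqueue({"cve_id": "CVE-0000-0000", "predicted_label": "HIGH"}, path)
    pending = read_queue(path)
```

Appending one line per record means the queue can grow across notebook runs without rewriting the file, and each line can be parsed on its own when validation becomes possible.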
## Data Source
Kaggle: `francescomanzoni/vulnerability-management-datasets`
- `cve_cisa_epss_enriched_dataset.csv`
- `cve_corpus.csv`
**Data cutoff:** [INSERT DOWNLOAD DATE]
**NVD enrichment note:** NIST triage model active from April 15, 2026.
CVEs after that date may lack CVSS/CWE enrichment — see capstone
methodology Section 3.
## Dependencies
numpy pandas scikit-learn matplotlib seaborn
torch joblib xgboost
Install: `pip install numpy pandas scikit-learn matplotlib seaborn torch joblib xgboost`