kryptbakar/AI-Integrated-Intrusion-Detection-System
GitHub: kryptbakar/AI-Integrated-Intrusion-Detection-System
# AI-Integrated Intrusion Detection System 🚨🧠
A production-oriented, local-first intrusion detection prototype combining a supervised gradient-boosting classifier (CatBoost) with an unsupervised anomaly detector (Local Outlier Factor). The repository contains a Streamlit inference app, a full local training pipeline, and supporting artifacts for evaluation and explainability.
## Table of Contents
- 🚀 Project Overview
- 🧾 Project Summary (what you should know)
- ✨ Key Technical Features
- 🛠️ Tech Stack & Dependencies
- 📁 Project Structure (detailed)
- 🧩 Installation & Setup
- ▶️ How to Train Locally (exact scripts & artifacts produced)
- ▶️ How to Run the Streamlit App (inference)
- 📦 Saved Artifacts & File Formats (what to expect after training)
- 🔍 Files used for Local Training — Thorough File-by-File Analysis
- ✅ Notes
## 🚀 Project Overview
This repository implements a hybrid intrusion detection workflow that combines:
- CatBoost (supervised binary classifier) to predict attacks vs. benign traffic, and
- Local Outlier Factor (LOF) as an anomaly detector trained only on benign traffic.
The hybrid voting strategy marks a sample as malicious if either detector raises a flag (OR logic), prioritizing detection (high recall) while preserving explainability through feature statistics and saved metrics.
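The OR voting rule can be sketched in a few lines. This is a minimal illustration rather than the repository's code: the function name `or_vote` and the example scores are hypothetical, and it assumes LOF predictions use scikit-learn's encoding (-1 for outlier, +1 for inlier). The 0.15 threshold is the default the training script is described as using.

```python
import numpy as np

def or_vote(catboost_proba, lof_pred, threshold=0.15):
    """Flag a sample as attack (1) if either detector raises a flag.

    catboost_proba: attack probabilities from the supervised model.
    lof_pred: LOF predictions in scikit-learn's encoding (-1 = outlier, +1 = inlier).
    """
    catboost_flag = np.asarray(catboost_proba) >= threshold
    lof_flag = np.asarray(lof_pred) == -1
    return (catboost_flag | lof_flag).astype(int)

# Hypothetical scores for four flows.
proba = [0.02, 0.40, 0.05, 0.90]
lof = [1, 1, -1, -1]
votes = or_vote(proba, lof)
print(votes)  # [0 1 1 1]
```

The third flow shows the point of the OR rule: the classifier scores it as benign, but the anomaly detector still flags it. The cost is that false positives from either detector also pass through.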
## 🧾 Project Summary (what you should know)
This project demonstrates end-to-end competencies valuable for production ML roles:
- Practical model selection for imbalanced classification: replaces XGBoost + OCSVM with CatBoost + LOF for speed and better handling of categorical/imbalanced data.
- Data-leakage-aware design: strict temporal split (no shuffling), scaler fit on training only, and automated leakage warnings for suspicious metric values.
- Imbalance handling & resampling expertise: fallback simple resampling and an aggressive pipeline using imbalanced-learn's RandomUnderSampler, SMOTE, and ADASYN.
- Model engineering & deployment readiness: training script serializes models, scaler, feature names, and produces JSON artifacts for metrics and sample predictions — ready for Streamlit deployment.
- Performance & scaling considerations: sampling benign data for LOF to drastically reduce training time, multi-threaded CatBoost configuration, and measured training-time reporting.
- Reproducibility & artifacts: saved Pickle and JSON outputs for reproducible inference and dashboarding.
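As a rough illustration of the automated leakage warning mentioned above, the sketch below derives metrics from confusion-matrix counts and raises a flag when precision and recall both exceed 0.98. The function name `summarize_confusion`, the dictionary keys, and the example counts are assumptions made for this example; the repository's own `evaluate_model()` may be structured differently.

```python
def summarize_confusion(tn, fp, fn, tp, leak_threshold=0.98):
    """Derive headline metrics from confusion-matrix counts (attack = positive class)."""
    def ratio(num, den):
        # Guard against empty classes instead of dividing by zero.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    metrics = {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "false_positive_rate": ratio(fp, fp + tn),
        "false_negative_rate": ratio(fn, fn + tp),
    }
    # Near-perfect precision AND recall is treated as a sign of possible leakage.
    metrics["leakage_warning"] = precision > leak_threshold and recall > leak_threshold
    return metrics

# Hypothetical counts from a held-out test window.
report = summarize_confusion(tn=900, fp=50, fn=10, tp=40)
print(report["recall"], report["false_positive_rate"], report["leakage_warning"])
```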
## ✨ Key Technical Features
- Hybrid ensemble: CatBoost classifier + LOF anomaly detector with OR voting to maximize recall.
- Leakage-aware pipeline: strict chronological train/test split and scaler fit on training only.
- Aggressive balancing strategy: undersampling + SMOTE + ADASYN (when imbalanced-learn available), with robust fallback to simple oversampling.
- LOF optimization: benign-only training with configurable sampling ratio to reduce LOF training time on large datasets.
- Comprehensive artifact generation: trained model pickle, metrics JSON, sample predictions JSON, and feature statistics JSON for Streamlit consumption.
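The simple oversampling fallback can be approximated with `sklearn.utils.resample`. The sketch below is an assumption about how such a fallback might look, not a copy of the repository's `_simple_balance()`: it assumes binary labels, and the function name `simple_balance` and the toy data are hypothetical.

```python
import numpy as np
from sklearn.utils import resample

def simple_balance(X, y, random_state=42):
    """Oversample the minority class with replacement until both classes match."""
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_majority = int(counts.max())

    X_min, X_maj = X[y == minority], X[y != minority]
    y_maj = y[y != minority]
    # Draw minority rows with replacement until they reach the majority count.
    X_min_up = resample(X_min, replace=True, n_samples=n_majority,
                        random_state=random_state)

    X_bal = np.vstack([X_maj, X_min_up])
    y_bal = np.concatenate([y_maj, np.full(n_majority, minority)])
    return X_bal, y_bal

# Hypothetical toy data: 8 benign rows (0) and 2 attack rows (1).
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)
X_bal, y_bal = simple_balance(X, y)
print(np.bincount(y_bal))  # [8 8]
```

Balancing like this should only be applied to the training portion, after the chronological split, so duplicated attack rows never reach the test set.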
## 🛠️ Tech Stack & Dependencies
Badges (primary):
[Python](https://www.python.org/)
[Streamlit](https://streamlit.io/)
[CatBoost](https://catboost.ai/)
[scikit-learn](https://scikit-learn.org/)
[imbalanced-learn](https://imbalanced-learn.org/)
Primary packages (from repository files):
- streamlit
- numpy
- pandas
- scikit-learn
- catboost
- imbalanced-learn (optional; used in training)
- matplotlib, seaborn (visualization)
- gdown (Google Drive download helper; suggests model files are fetched from Drive in other flows)
Three requirements files are provided:
- requirements_catboost_lof_training.txt — for training (includes imbalanced-learn)
- requirements_catboost_lof_inference.txt — inference-only (no imbalanced-learn)
- requirements.txt — Streamlit inference general dependencies
## 📁 Project Structure (detailed)
Root (main files and purpose)
- .devcontainer/ — optional devcontainer config (repro environment)
- .streamlit/ — Streamlit config (UI settings)
- app.py — Main Streamlit inference application (UI + model loading)
- pipeline_catboost_lof.py — Core training pipeline implementing CatBoost + LOF hybrid
- train_catboost_lof.py — Convenience training script that runs pipeline and writes artifacts
- requirements.txt — Inference requirements (streamlit + viz libs)
- requirements_catboost_lof_inference.txt — Inference-only reqs
- requirements_catboost_lof_training.txt — Training-only reqs (includes imbalanced-learn)
- runtime.txt — runtime hint (platform)
- feature_stats.json — example feature statistics for Streamlit dashboard
- sample_predictions.json — example predictions for Streamlit demo
- metrics.json — example metrics for Streamlit demo
- sample_predictions_catboost_lof.json — (produced by training script) example predictions for CatBoost+LOF
- trained_models_catboost_lof.pkl — (produced by training script) pickled models & metadata
- feature_stats_catboost_lof.json — (produced by training script) training feature stats
- metrics_catboost_lof.json — (produced by training script) metrics from training
- catboost_lof_results.csv — (produced by pipeline when run directly)
Note: Some of the files above (with _catboost_lof suffix) are created when you run the included local training script. If not present, run training to generate them.
## 🔍 Files used for Local Training — Thorough File-by-File Analysis
Below are the training-focused files, with detailed technical notes (useful as interview talking points).
1. pipeline_catboost_lof.py
   - Purpose: Encapsulates the complete CatBoost + LOF training, evaluation, and prediction pipeline.
   - Key design decisions:
     - Temporal split: implemented in `temporal_split()`; splits chronologically with no shuffling to avoid leakage across time.
     - Robust cleaning: replaces +/-inf with NaN in "Flow Bytes/s" and "Flow Packets/s" if present, and converts 'Label' to binary (BENIGN -> 0, all others -> 1).
     - Preprocessing: selects numeric-only features; imputes missing values with training medians, which are also applied to the test set.
     - Scaling: StandardScaler fit on training data only (`scale_features`).
   - Class imbalance handling:
     - When imbalanced-learn is available: `aggressive_balance()` runs three steps: an optional light RandomUnderSampler, then SMOTE to 1:1, then ADASYN to refine.
     - If imbalanced-learn is not installed: `_simple_balance()` falls back to `sklearn.utils.resample` and oversamples the minority class until parity.
   - CatBoost model:
     - CatBoostClassifier configured with iterations=500, depth=10, learning_rate=0.05, Logloss loss, and eval_metric=AUC.
     - class_weights set from the ratio n_neg / n_pos to offset class imbalance.
     - thread_count=-1 to use all CPU cores; task_type='CPU' by default (switchable to GPU).
   - LOF model:
     - Trained only on benign samples (novelty=True) with configurable contamination; samples the benign set when the dataset is large to reduce training time.
     - Predicts novelty at inference time and converts LOF outputs to a pseudo-probability via a logistic transform on `score_samples()`.
   - Voting strategy:
     - OR logic: a sample is classified as attack (1) if either CatBoost or LOF predicts attack, which prioritizes recall.
   - Evaluation:
     - `evaluate_model()` computes accuracy, precision, recall, f1, roc_auc (if probabilities are available), the confusion matrix, false positive rate, and false negative rate, and prints a human-friendly summary with leakage warnings when metrics are suspiciously high (recall and precision > 0.98).
   - Runtime & performance:
     - Reports training time for CatBoost and LOF separately and in aggregate.
     - LOF sample ratio defaults to 0.3 (30% of benign samples) to cut LOF training time on very large datasets.
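The benign-only LOF step described above can be sketched with scikit-learn's `LocalOutlierFactor` in novelty mode. The helper names `fit_benign_lof` and `lof_attack_proba`, the synthetic data, and the exact form of the logistic transform (centered on the fitted `offset_`) are assumptions for illustration; the source only states that a logistic transform is applied to `score_samples()`.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def fit_benign_lof(X_train, y_train, sample_ratio=0.3, n_neighbors=20,
                   contamination=0.1, random_state=42):
    """Fit LOF in novelty mode on a random subsample of benign rows only."""
    benign = X_train[y_train == 0]
    rng = np.random.default_rng(random_state)
    # Keep at least n_neighbors + 1 rows so the neighbor search stays valid.
    n_keep = max(n_neighbors + 1, int(len(benign) * sample_ratio))
    keep = rng.choice(len(benign), size=min(n_keep, len(benign)), replace=False)
    lof = LocalOutlierFactor(n_neighbors=n_neighbors,
                             contamination=contamination, novelty=True)
    return lof.fit(benign[keep])

def lof_attack_proba(lof, X):
    """Map LOF scores to (0, 1); lower score_samples means more anomalous."""
    margin = lof.score_samples(X) - lof.offset_  # negative for predicted outliers
    return 1.0 / (1.0 + np.exp(margin))

# Hypothetical data: a benign cluster near the origin plus a few labelled attacks.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 1, size=(500, 2)),
                     rng.normal(8, 1, size=(20, 2))])
y_train = np.array([0] * 500 + [1] * 20)

lof = fit_benign_lof(X_train, y_train)
X_new = np.array([[0.0, 0.0], [10.0, 10.0]])
print(lof.predict(X_new))            # -1 marks an outlier, +1 an inlier
print(lof_attack_proba(lof, X_new))  # higher value = more anomalous
```

Subsampling trades some density resolution for speed: with a 0.3 ratio the neighbor search runs over 30% of the benign rows, which is where the training-time saving comes from.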
2. train_catboost_lof.py
   - Purpose: A convenience script that runs the pipeline and writes model and artifact files for use by Streamlit.
   - Behavior:
     - Creates the pipeline with default hyperparameters (threshold 0.15, LOF contamination 0.1, neighbors 20, LOF sample ratio 0.3).
     - Calls `pipeline.run_pipeline()`, then packages:
       - trained models + scaler + feature names into trained_models_catboost_lof.pkl using pickle.
       - metrics into metrics_catboost_lof.json.
       - sample predictions into sample_predictions_catboost_lof.json (up to 1000 records).
       - feature statistics (means/stds/mins/maxs) into feature_stats_catboost_lof.json.
     - Output is intended for immediate consumption by the Streamlit app, with minimal additional glue code.
   - Safety checks:
     - Exits with an error if `merged_output.csv` is not found.
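One way the feature-statistics artifact could be built is shown below. This is a sketch under assumptions: the function name `feature_stats`, the nested per-feature layout, and the two feature names are hypothetical, and the real script may store the values as plain lists instead.

```python
import json
import numpy as np

def feature_stats(X_scaled, feature_names):
    """Per-feature summary statistics under the keys 'means', 'stds', 'mins', 'maxs'."""
    X_scaled = np.asarray(X_scaled, dtype=float)

    def per_feature(values):
        # Convert NumPy floats to plain Python floats so json can serialize them.
        return {name: float(v) for name, v in zip(feature_names, values)}

    return {
        "means": per_feature(X_scaled.mean(axis=0)),
        "stds": per_feature(X_scaled.std(axis=0)),
        "mins": per_feature(X_scaled.min(axis=0)),
        "maxs": per_feature(X_scaled.max(axis=0)),
    }

# Hypothetical scaled training matrix with two made-up feature names.
X_scaled = np.array([[-1.0, 0.5], [0.0, 1.5], [1.0, 2.5]])
stats = feature_stats(X_scaled, ["flow_duration", "total_fwd_packets"])
payload = json.dumps(stats, indent=2)
print(payload)
```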
3. requirements_catboost_lof_training.txt
   - Purpose: Training-specific dependency list.
   - Notable packages:
     - imbalanced-learn==0.11.0, used heavily in aggressive balancing (RandomUnderSampler, SMOTE, ADASYN).
4. requirements_catboost_lof_inference.txt
   - Purpose: Inference-only dependencies (Streamlit, CatBoost, scikit-learn, etc.), with a smaller surface area than training.

Practical implications & interview notes:
- The repo demonstrates realistic operational concerns: class imbalance, temporal leakage, training-time constraints (sampling), multi-model hybridization, artifact serialization.
- The code shows familiarity with both supervised and unsupervised detection techniques and with standard ML libraries.
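The temporal-leakage precautions noted for pipeline_catboost_lof.py (chronological split, training-only medians, training-only scaler) can be illustrated as follows. This is a hedged sketch, not the repository's implementation: the helper names `clean` and `preprocess`, the standalone `temporal_split` shown here, the `test_size` default, and the five-row dataset are all assumptions for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

RATE_COLS = ["Flow Bytes/s", "Flow Packets/s"]

def clean(df):
    """Replace +/-inf in the rate columns and binarize the label."""
    df = df.copy()
    for col in RATE_COLS:
        if col in df.columns:
            df[col] = df[col].replace([np.inf, -np.inf], np.nan)
    df["Label"] = (df["Label"] != "BENIGN").astype(int)
    return df

def temporal_split(df, test_size=0.2):
    """Chronological split: earliest rows train, latest rows test, no shuffling."""
    n_test = int(round(len(df) * test_size))
    cut = len(df) - n_test
    return df.iloc[:cut], df.iloc[cut:]

def preprocess(train, test, label_col="Label"):
    """Impute and scale with statistics learned from the training rows only."""
    cols = train.drop(columns=[label_col]).select_dtypes("number").columns
    medians = train[cols].median()              # training medians only
    X_train = train[cols].fillna(medians)
    X_test = test[cols].fillna(medians)         # test reuses training medians
    scaler = StandardScaler().fit(X_train)      # scaler never sees test rows
    return scaler.transform(X_train), scaler.transform(X_test), scaler

# Hypothetical flows, already ordered by capture time.
df = pd.DataFrame({
    "Flow Bytes/s": [10.0, np.inf, 30.0, 40.0, np.nan],
    "Flow Packets/s": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Label": ["BENIGN", "BENIGN", "DDoS", "BENIGN", "PortScan"],
})
train, test = temporal_split(clean(df), test_size=0.2)
X_train_s, X_test_s, scaler = preprocess(train, test)
print(X_train_s.shape, X_test_s.shape, test["Label"].tolist())
```

The missing value in the test row is filled with the training median (30.0), so nothing computed from the test window influences the fitted transforms.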
## 🧩 Installation & Setup
1. Clone repository

       git clone https://github.com/aliuzair1/AI-Integrated-Intrusion-Detection-System.git
       cd AI-Integrated-Intrusion-Detection-System

2. Create & activate a virtual environment

       python -m venv .venv
       # macOS / Linux
       source .venv/bin/activate
       # Windows (PowerShell)
       .venv\Scripts\Activate.ps1

3. Install dependencies
   - For local training (recommended):

         pip install -r requirements_catboost_lof_training.txt

   - For inference/Streamlit only:

         pip install -r requirements_catboost_lof_inference.txt
         # or use the general requirements.txt for the Streamlit app
         pip install -r requirements.txt

4. Place training dataset
- The training scripts expect a CSV named `merged_output.csv` in the project root. If your file has a different name or location, update the path in `train_catboost_lof.py`.
## ▶️ How to Train Locally (exact scripts & artifacts produced)
Train using the provided training script (this runs the entire pipeline and saves artifacts):

    python train_catboost_lof.py

What train_catboost_lof.py does (summary):
- Instantiates CatBoostLOFPipeline from pipeline_catboost_lof.py with tuned defaults.
- Runs entire pipeline (load -> temporal split -> preprocessing -> balancing -> train CatBoost -> train LOF -> evaluate).
- Serializes the following files (saved to project root):
- trained_models_catboost_lof.pkl (pickle containing CatBoost model, LOF model, scaler, feature names, thresholds, config)
- metrics_catboost_lof.json (training metrics + metadata)
- sample_predictions_catboost_lof.json (sample of predictions for dashboard)
- feature_stats_catboost_lof.json (means, stds, mins, maxs for features)
- Estimated training time: ~4–6 minutes on a typical multi-core CPU (the pipeline reports timing).
Important: the script exits with an error if `merged_output.csv` is not present. Place your dataset in the project root or update the path in the script.
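For the "train CatBoost" step, the settings listed in the file-by-file analysis can be collected into a parameter dictionary. The sketch below is illustrative only: the helper name `catboost_params` and the `use_gpu` flag are hypothetical, and the list form `[1.0, n_neg / n_pos]` is an assumed reading of "class_weights set from the ratio n_neg / n_pos".

```python
from collections import Counter

def catboost_params(y_train, use_gpu=False):
    """Keyword arguments mirroring the CatBoost settings described in this README."""
    counts = Counter(int(label) for label in y_train)
    n_neg, n_pos = counts[0], counts[1]
    return {
        "iterations": 500,
        "depth": 10,
        "learning_rate": 0.05,
        "loss_function": "Logloss",
        "eval_metric": "AUC",
        # Assumed form: benign weight 1, attack weight n_neg / n_pos.
        "class_weights": [1.0, n_neg / n_pos],
        "thread_count": -1,  # use all CPU cores
        "task_type": "GPU" if use_gpu else "CPU",
    }

# Hypothetical labels: 90 benign, 10 attack.
params = catboost_params([0] * 90 + [1] * 10)
print(params["class_weights"])  # [1.0, 9.0]

# With catboost installed, the dict plugs straight into the classifier:
#   from catboost import CatBoostClassifier
#   model = CatBoostClassifier(**params).fit(X_train, y_train)
```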
## ▶️ How to Run the Streamlit App (inference)
1. Ensure the inference requirements are installed:

       pip install -r requirements_catboost_lof_inference.txt

2. Place the model artifacts produced by training in the repo root (or download them there). The names below are the ones the training script writes; `app.py` may look for slightly different names, so update them if needed:
   - trained_models_catboost_lof.pkl (or the trained-model file expected by app.py)
   - sample_predictions_catboost_lof.json (for demo)
   - feature_stats_catboost_lof.json
   - metrics_catboost_lof.json
3. Run Streamlit

       streamlit run app.py

Streamlit will open at http://localhost:8501 by default.
If the app expects a model at a different path or expects to download from Google Drive, check `app.py` and update model paths or set environment placeholders like [MODEL_DRIVE_ID].
## 📦 Saved Artifacts & File Formats (what to expect after training)
- trained_models_catboost_lof.pkl — Pickle file with keys:
  - 'catboost_model' : CatBoostClassifier object
  - 'lof_model' : LocalOutlierFactor object
  - 'scaler' : StandardScaler object
  - 'feature_names' : list of feature columns
  - 'catboost_threshold' : float
  - 'config' : metadata dict (sampling, neighbors, test_size, training_date)
- metrics_catboost_lof.json — JSON with evaluation metrics and metadata (accuracy, precision, recall, f1, roc_auc, confusion matrix counts, training times)
- sample_predictions_catboost_lof.json — JSON with arrays: y_true, y_pred_catboost, y_pred_lof, y_pred_voting, proba arrays
- feature_stats_catboost_lof.json — JSON with 'means', 'stds', 'mins', 'maxs' per feature (computed from scaled X_train)
These artifacts are designed to be consumed directly by the Streamlit app for visualization and demo.
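A consumer of the pickle artifact might validate its keys before use. The sketch below assumes the key names listed above; the function `load_bundle`, the stand-in bundle (with `None` in place of fitted models), the made-up feature name, and the `config` entries are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

REQUIRED_KEYS = {"catboost_model", "lof_model", "scaler",
                 "feature_names", "catboost_threshold", "config"}

def load_bundle(path):
    """Load the pickled artifact and fail early if an expected key is missing."""
    with open(path, "rb") as fh:
        bundle = pickle.load(fh)  # only unpickle files you produced yourself
    missing = REQUIRED_KEYS - set(bundle)
    if missing:
        raise KeyError(f"artifact is missing keys: {sorted(missing)}")
    return bundle

# Round-trip a stand-in bundle; None replaces the real fitted objects.
demo = {
    "catboost_model": None, "lof_model": None, "scaler": None,
    "feature_names": ["flow_duration"],  # made-up feature name
    "catboost_threshold": 0.15,
    "config": {"lof_sample_ratio": 0.3, "n_neighbors": 20},
}
path = Path(tempfile.mkdtemp()) / "trained_models_catboost_lof.pkl"
path.write_bytes(pickle.dumps(demo))
bundle = load_bundle(path)
print(bundle["catboost_threshold"], bundle["feature_names"])
```

Because pickle executes code on load, artifacts should come from your own training run or another trusted source.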
## ✅ Notes
- The repository has the training and inference pieces required to reproduce results locally. To run training, place a CSV named `merged_output.csv` in the project root or change the path in `train_catboost_lof.py`.