Alok8018/Hybridshield-Malware-detection

# 🛡️ HybridShield V10: VS Code Implementation Guide

## Project Structure

    hybridshield/
    │
    ├── config.py                        ← Edit paths here before anything else
    ├── requirements.txt                 ← pip install -r requirements.txt
    │
    ├── train_cnn.py                     ← Step 2: train EfficientNetB0 CNN
    ├── train_ember.py                   ← Step 3: train DNN + LightGBM
    ├── app.py                           ← Step 4: launch Gradio web UI
    │
    ├── src/
    │   ├── data_utils.py                ← MalNet extraction, scanning, tf.data pipelines
    │   ├── ember_utils.py               ← EMBER download, patch, vectorize, load
    │   └── models.py                    ← CNN, DNN, WarmupCosineDecay definitions
    │
    ├── data/
    │   ├── malnet-images-tiny.tar.gz    ← PUT YOUR MALNET FILE HERE
    │   ├── malnet_data/                 ← auto-created during training
    │   └── ember_data/                  ← auto-created during training
    │
    ├── models/                          ← saved model files (auto-created)
    │   ├── cnn_best.keras
    │   ├── cnn_final.keras
    │   ├── cnn_results.json
    │   ├── dnn_best.keras
    │   ├── dnn_final.keras
    │   ├── lgb_model.txt
    │   ├── ember_meta.pkl
    │   └── ensemble_cfg.json
    │
    └── results/                         ← plots and dashboards (auto-created)
        ├── cnn_curves.png
        ├── cnn_confusion.png
        └── roc_curves.png

## Step 0: Prerequisites

### System requirements

- Python 3.10 or 3.11 (3.12 has issues with some ember deps)
- NVIDIA GPU with CUDA 11.8+ (for fast CNN training)
- 16 GB RAM minimum (EMBER loading needs ~6 GB)
- 30 GB free disk space (MalNet + EMBER + models)

### VS Code extensions to install

- Python (Microsoft)
- Pylance
- Jupyter (optional, for notebook-style debugging)

## Step 1: Setup

### 1a. Clone / copy project

    # Copy all files into a folder, then open in VS Code
    code hybridshield/

### 1b. Create virtual environment

    # In VS Code terminal (Ctrl+`)
    python -m venv .venv

    # Activate
    # Windows:
    .venv\Scripts\activate
    # Mac/Linux:
    source .venv/bin/activate

### 1c. Install dependencies

    pip install -r requirements.txt

### 1d. Put your MalNet file in place

Copy `malnet-images-tiny.tar.gz` into the `data/` folder:

    data/malnet-images-tiny.tar.gz   ← required

The file should be ~800 MB.
If it's in a different location, update `MALNET_SRC` in `config.py`.

## Step 2: Train CNN

    python train_cnn.py

**What it does:**

1. Extracts the MalNet tar.gz into `data/malnet_data/`
2. Scans the train/val/test splits at type level (21 classes)
3. Builds a balanced `tf.data` pipeline
4. Phase 1 (25 epochs): trains the head only, backbone frozen
5. Phase 2 (50 epochs): unfreezes the top-40 backbone layers
6. Saves `models/cnn_best.keras` and `models/cnn_final.keras`
7. Saves `results/cnn_curves.png` and `results/cnn_confusion.png`

**Expected training time:**

- With GPU: ~6–8 min/epoch × 40 actual epochs ≈ 4–5 hours
- Without GPU (CPU only): not recommended

**Expected accuracy:**

- Top-1: 65–80%
- Top-3: 85–93%

**VS Code tip:** Open the terminal and watch the epoch output. The training curves PNG is written only after training completes.

## Step 3: Train DNN + LightGBM

    python train_ember.py

**What it does:** Downloads and vectorizes EMBER (one-time only), then trains the DNN and the LightGBM model. It needs `models/cnn_results.json` from Step 2, so run `train_cnn.py` first.

**Expected time:**

- EMBER download + vectorize: ~45 min (one-time only)
- DNN training: ~20–30 min
- LightGBM training: ~15–25 min

**Expected accuracy:**

- DNN: 96–97%
- LightGBM: 97–98%
- Ensemble: 97–98%

## Step 4: Launch Web Interface

    python app.py

Open your browser at: **http://localhost:7860**

**What you can do:**

- Upload any `.exe`, `.dll`, `.apk`, `.bin`, `.sys` file
- See the byte visualization (inferno colormap)
- See per-model confidence bars (CNN / DNN / LGB / Ensemble)
- See the risk gauge (speedometer)
- Read the verdict card with file info and top-3 malware types

**To share publicly (temporary Gradio share link):** Change in `app.py`:

    demo.launch(share=True)  # generates a gradio.live URL

## GPU Setup (Windows)

If `tf.config.list_physical_devices('GPU')` returns `[]`:

1. Install CUDA 11.8: https://developer.nvidia.com/cuda-11-8-0-download-archive
2. Install cuDNN 8.6: https://developer.nvidia.com/cudnn
3. Install TensorFlow with GPU support: `pip install "tensorflow[and-cuda]"`

Note that TensorFlow 2.11 and later do not support GPUs on native Windows, and the `tensorflow[and-cuda]` extra targets Linux. On Windows, run these steps inside WSL2.

## Running without GPU (CPU-only)

CNN training will be very slow (~60 min/epoch).
Reduce these values in `config.py`:

    PHASE1_EPOCHS = 5
    PHASE2_EPOCHS = 10
    STEPS_PER_EPOCH = 100   # override in train_cnn.py

DNN and LightGBM run fine on CPU.

## File descriptions

| File | Purpose |
|---|---|
| `config.py` | All paths and hyperparameters in one place |
| `src/data_utils.py` | MalNet extraction, type-level scanning, tf.data |
| `src/ember_utils.py` | EMBER download, patch, vectorize, RAM-safe load |
| `src/models.py` | EfficientNetB0, DNN, WarmupCosineDecay definitions |
| `train_cnn.py` | Full CNN training pipeline; run this first |
| `train_ember.py` | DNN + LightGBM training; run this second |
| `app.py` | Gradio web UI; run after both trainings complete |

## Common errors

**`MALNET_SRC not found`**
→ Put `malnet-images-tiny.tar.gz` in the `data/` folder and check `MALNET_SRC` in `config.py`.

**`Cannot find train/val/test`**
→ The tar.gz extracted to a different depth. The `find_split_root()` function auto-searches, but if it fails, check what's inside `data/malnet_data/` and set `SPLIT_ROOT` manually in `train_cnn.py`.

**`Loss = 80 at pre-training check`**
→ This is expected and safe. EfficientNet's BatchNorm statistics come from ImageNet and don't match malware images, which causes a high initial loss. It normalizes by epoch 2.

**`ember module not found`**
→ Run: `pip install git+https://github.com/elastic/ember.git`

**`cnn_results.json not found` in train_ember.py**
→ Run `train_cnn.py` first so the CNN open-set threshold is saved.

**Out of memory during EMBER loading**
→ Reduce the `chunk` parameter of `load_ember()` in `ember_utils.py` from 50_000 to 20_000.

## Expected final results

| Model | Accuracy | AUC | F1 |
|---|---|---|---|
| CNN (Top-1) | 65–80% | — | — |
| CNN (Top-3) | 85–93% | — | — |
| DNN | 96–97% | 0.995+ | 0.96+ |
| LightGBM | 97–98% | 0.997+ | 0.97+ |
| **Ensemble** | **97–98%** | **0.997+** | **0.97+** |
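
## Optional: pre-flight check

Several of the errors listed under Common errors come down to a file that an earlier step should have produced. As a minimal sketch, a pre-flight check run from the project root could report what is missing before each step. The paths come from the project structure in this guide. The `REQUIRED_FILES` mapping and the `missing_files` helper are hypothetical names, not part of the repository, and the set of model files that `app.py` actually loads is an assumption based on the `models/` listing.

```python
from pathlib import Path

# Hypothetical mapping of each script to the files it is assumed to need.
# Paths follow the project structure above; the app.py entry is a guess
# based on the models/ listing, not taken from the actual source.
REQUIRED_FILES = {
    "train_cnn.py": ["data/malnet-images-tiny.tar.gz"],
    "train_ember.py": ["models/cnn_results.json"],
    "app.py": [
        "models/cnn_best.keras",
        "models/dnn_best.keras",
        "models/lgb_model.txt",
        "models/ember_meta.pkl",
        "models/ensemble_cfg.json",
    ],
}


def missing_files(step: str, root: Path = Path(".")) -> list[str]:
    """Return the files required by `step` that do not exist under `root`."""
    return [rel for rel in REQUIRED_FILES[step] if not (root / rel).is_file()]


if __name__ == "__main__":
    # Print one status line per step, in the order the steps should be run.
    for step in REQUIRED_FILES:
        missing = missing_files(step)
        status = "ready" if not missing else "missing: " + ", ".join(missing)
        print(f"{step:<15} {status}")
```

Saved under a name such as `preflight.py` in the project root, `python preflight.py` prints one line per script, so a missing `cnn_results.json` shows up before `train_ember.py` spends time on the EMBER download.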