Alok8018/Hybridshield-Malware-detection
# 🛡️ HybridShield V10 — VS Code Implementation Guide
## Project Structure
    hybridshield/
    │
    ├── config.py ← Edit paths here before anything else
    ├── requirements.txt ← pip install -r requirements.txt
    │
    ├── train_cnn.py ← Step 2: train EfficientNetB0 CNN
    ├── train_ember.py ← Step 3: train DNN + LightGBM
    ├── app.py ← Step 4: launch Gradio web UI
    │
    ├── src/
    │   ├── data_utils.py ← MalNet extraction, scanning, tf.data pipelines
    │   ├── ember_utils.py ← EMBER download, patch, vectorize, load
    │   └── models.py ← CNN, DNN, WarmupCosineDecay definitions
    │
    ├── data/
    │   ├── malnet-images-tiny.tar.gz ← PUT YOUR MALNET FILE HERE
    │   ├── malnet_data/ ← auto-created during training
    │   └── ember_data/ ← auto-created during training
    │
    ├── models/ ← saved model files (auto-created)
    │   ├── cnn_best.keras
    │   ├── cnn_final.keras
    │   ├── cnn_results.json
    │   ├── dnn_best.keras
    │   ├── dnn_final.keras
    │   ├── lgb_model.txt
    │   ├── ember_meta.pkl
    │   └── ensemble_cfg.json
    │
    └── results/ ← plots and dashboards (auto-created)
        ├── cnn_curves.png
        ├── cnn_confusion.png
        └── roc_curves.png
## Step 0 — Prerequisites
### System requirements
- Python 3.10 or 3.11 (3.12 has issues with some ember deps)
- NVIDIA GPU with CUDA 11.8+ (for fast CNN training)
- 16 GB RAM minimum (EMBER loading needs ~6 GB)
- 30 GB free disk space (MalNet + EMBER + models)
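
The Python version and disk space requirements above can be checked before installing anything. The following is a minimal sketch using only the standard library; the function names and the constants are illustrative and are not part of the project.

```python
import shutil
import sys

# Thresholds copied from the requirements list in this guide.
SUPPORTED_PYTHON = {(3, 10), (3, 11)}
MIN_FREE_DISK_GB = 30


def check_python(version=None):
    """Return True if the interpreter is Python 3.10 or 3.11."""
    major, minor = tuple(version or sys.version_info)[:2]
    return (major, minor) in SUPPORTED_PYTHON


def free_disk_gb(path="."):
    """Free space on the drive that holds `path`, in GB (10^9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def has_enough_disk(path=".", required_gb=MIN_FREE_DISK_GB):
    """Return True if at least `required_gb` GB are free at `path`."""
    return free_disk_gb(path) >= required_gb
```

Running `check_python()` and `has_enough_disk()` from the project folder gives a quick yes/no answer for each requirement; RAM and GPU are not covered by this sketch.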
### VS Code extensions to install
- Python (Microsoft)
- Pylance
- Jupyter (optional — for notebook-style debugging)
## Step 1 — Setup
### 1a. Clone / copy project
    # Copy all files into a folder, then open it in VS Code
    code hybridshield/
### 1b. Create virtual environment
    # In the VS Code terminal (Ctrl+`)
    python -m venv .venv
    # Activate
    # Windows:
    .venv\Scripts\activate
    # Mac/Linux:
    source .venv/bin/activate
### 1c. Install dependencies
    pip install -r requirements.txt
### 1d. Put your MalNet file in place
Copy `malnet-images-tiny.tar.gz` into the `data/` folder:

    data/malnet-images-tiny.tar.gz ← required

The file should be ~800 MB. If it's in a different location,
update `MALNET_SRC` in `config.py`.
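
As an illustration of the path setting described above, here is a hedged sketch of how `MALNET_SRC` might be defined and sanity-checked. Only the name `MALNET_SRC` comes from this guide; `PROJECT_ROOT`, `check_archive()` and the size threshold are assumptions made for the example, and the real `config.py` may look different.

```python
from pathlib import Path

# Hypothetical excerpt in the style of config.py.
# Only MALNET_SRC is named in this guide; PROJECT_ROOT is an assumption.
PROJECT_ROOT = Path.cwd()
MALNET_SRC = PROJECT_ROOT / "data" / "malnet-images-tiny.tar.gz"


def check_archive(path, min_mb=100.0):
    """Return (ok, message) describing whether the archive looks usable.

    min_mb is an arbitrary lower bound used to catch truncated downloads;
    the guide only says the file should be roughly 800 MB.
    """
    path = Path(path)
    if not path.is_file():
        return False, f"not found: {path}"
    size_mb = path.stat().st_size / 1e6
    if size_mb < min_mb:
        return False, f"suspiciously small ({size_mb:.1f} MB): {path}"
    return True, f"found {path.name} ({size_mb:.0f} MB)"
```

Calling `check_archive(MALNET_SRC)` before training gives an early, readable message instead of a failure partway through extraction.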
## Step 2 — Train CNN
    python train_cnn.py
**What it does:**
1. Extracts MalNet tar.gz into `data/malnet_data/`
2. Scans train/val/test splits at type level (21 classes)
3. Builds balanced tf.data pipeline
4. Phase 1 (25 epochs): trains head only, backbone frozen
5. Phase 2 (50 epochs): unfreezes top-40 backbone layers
6. Saves `models/cnn_best.keras` and `models/cnn_final.keras`
7. Saves `results/cnn_curves.png` and `results/cnn_confusion.png`
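
Steps 2 and 3 above (type-level scanning and class balancing) can be sketched without TensorFlow. This is not the code in `src/data_utils.py`: the directory layout `<split>/<type>/<family>/<image>.png` is an assumption about the extracted archive, and oversampling is just one common way to balance classes, chosen here for illustration.

```python
import random
from pathlib import Path


def scan_split(split_dir):
    """Map each type-level class name to the PNG paths found below it.

    Assumes split_dir/<type>/<family>/<image>.png; families are merged
    into their parent type, which is what "type level" means here.
    """
    by_type = {}
    for type_dir in sorted(Path(split_dir).iterdir()):
        if type_dir.is_dir():
            by_type[type_dir.name] = sorted(type_dir.rglob("*.png"))
    return by_type


def balance_by_oversampling(by_type, seed=0):
    """Repeat samples of smaller classes until each matches the largest."""
    rng = random.Random(seed)
    target = max(len(paths) for paths in by_type.values())
    balanced = []
    for label, paths in by_type.items():
        if not paths:
            continue  # nothing to sample from for an empty class
        extra = [rng.choice(paths) for _ in range(target - len(paths))]
        balanced.extend((path, label) for path in paths + extra)
    rng.shuffle(balanced)
    return balanced
```

The resulting list of `(path, label)` pairs could then be fed to a `tf.data` pipeline; how the project actually does this is not shown in this guide.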
**Expected training time:**
- With GPU: ~6–8 min/epoch × ~40 epochs actually run (of the 75 scheduled) ≈ 4–5 hours
- Without GPU (CPU only): not recommended
**Expected accuracy:**
- Top-1: 65–80%
- Top-3: 85–93%
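
For readers unfamiliar with the two metrics above: Top-1 counts a prediction as correct only if the highest-scoring class is the true one, while Top-3 accepts the true class anywhere among the three highest scores. A minimal pure-Python sketch (the function name is illustrative, not taken from the project):

```python
def top_k_accuracy(probabilities, labels, k=1):
    """Fraction of samples whose true label is among the k highest scores.

    probabilities: one list of per-class scores per sample
    labels: the true class index of each sample
    """
    if len(probabilities) != len(labels):
        raise ValueError("probabilities and labels must have the same length")
    hits = 0
    for scores, label in zip(probabilities, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)
```

With 21 classes, Top-3 is noticeably easier than Top-1, which is why the expected ranges differ.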
**VS Code tip:** Watch the per-epoch output in the integrated terminal.
The training-curves PNG is written only after training completes, so it does not update during the run.
## Step 3 — Train DNN + LightGBM
    python train_ember.py
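**What it does:**
1. Downloads, patches, and vectorizes EMBER into `data/ember_data/` (first run only)
2. Loads the vectorized EMBER features
3. Trains the DNN and saves `models/dnn_best.keras` and `models/dnn_final.keras`
4. Trains LightGBM and saves `models/lgb_model.txt`
5. Reads the CNN open-set threshold from `models/cnn_results.json` and writes `models/ember_meta.pkl` and `models/ensemble_cfg.json`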
**Expected time:**
- EMBER download + vectorize: ~45 min (one-time only)
- DNN training: ~20–30 min
- LightGBM training: ~15–25 min
**Expected accuracy:**
- DNN: 96–97%
- LightGBM: 97–98%
- Ensemble: 97–98%
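
This guide does not say how the ensemble combines the DNN and LightGBM outputs. One common approach is weighted soft voting, sketched below under that assumption; the weights, the 0.5 threshold and the function names are all hypothetical, and the real values presumably live in `models/ensemble_cfg.json`.

```python
def ensemble_score(scores, weights):
    """Weighted average of per-model malware probabilities (soft voting).

    scores:  e.g. {"dnn": 0.9, "lgb": 0.7}, each value in [0, 1]
    weights: e.g. {"dnn": 0.4, "lgb": 0.6}, need not sum to 1
    """
    total = sum(weights[name] for name in scores)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weights[name] for name in scores) / total


def verdict(score, threshold=0.5):
    """Turn an ensemble probability into a binary label."""
    return "malicious" if score >= threshold else "benign"
```

For example, `ensemble_score({"dnn": 0.9, "lgb": 0.7}, {"dnn": 0.4, "lgb": 0.6})` weights LightGBM slightly more heavily and yields a combined probability between the two inputs.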
## Step 4 — Launch Web Interface
    python app.py
Open your browser at: **http://localhost:7860**
**What you can do:**
- Upload any `.exe`, `.dll`, `.apk`, `.bin`, `.sys` file
- See the byte visualization (inferno colormap)
- See per-model confidence bars (CNN / DNN / LGB / Ensemble)
- See the risk gauge (speedometer)
- Read the verdict card with file info and top-3 malware types
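
The byte visualization mentioned above is typically built by laying a file's raw bytes out as rows of pixels. The sketch below shows that reshaping step in pure Python; the row width of 256 and the zero padding are assumptions, and `app.py` may use different values.

```python
import math


def bytes_to_grid(data, width=256):
    """Reshape raw bytes into rows of `width` pixel values (0-255).

    The last row is padded with zero bytes so every row has equal length.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    rows = max(1, math.ceil(len(data) / width))
    padded = data.ljust(rows * width, b"\x00")
    return [list(padded[r * width:(r + 1) * width]) for r in range(rows)]
```

Each value in the grid is an intensity from 0 to 255. A plotting library can then map those intensities to colors, for example with matplotlib's `inferno` colormap.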
**To share publicly (temporary Gradio share link):**
Change the launch call in `app.py`:

    demo.launch(share=True)  # generates a gradio.live URL
## GPU Setup (Windows)
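If `tf.config.list_physical_devices('GPU')` returns `[]`:
1. Install CUDA 11.8: https://developer.nvidia.com/cuda-11-8-0-download-archive
2. Install cuDNN 8.6: https://developer.nvidia.com/cudnn
3. Install TensorFlow with GPU support: `pip install "tensorflow[and-cuda]"`

Note: TensorFlow 2.10 was the last release with GPU support on native Windows. For newer versions, including those that provide the `and-cuda` extra, run the project inside WSL2, where that extra installs the CUDA libraries through pip.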
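
To verify the result of the setup, the GPU check quoted at the top of this section can be wrapped in a small helper. This is a sketch, not project code; `list_gpus()` is a hypothetical name, and the import guard is only there so the check fails gracefully when TensorFlow is missing.

```python
def list_gpus():
    """Return the GPUs TensorFlow can see, or [] if TensorFlow is not installed."""
    try:
        import tensorflow as tf
    except ImportError:
        return []
    return list(tf.config.list_physical_devices("GPU"))


def describe_gpus():
    """Human-readable summary of the GPU check."""
    gpus = list_gpus()
    if not gpus:
        return "No GPU visible to TensorFlow; training will run on CPU."
    return f"{len(gpus)} GPU(s) visible to TensorFlow."
```

Run `print(describe_gpus())` inside the activated virtual environment; an empty result after installation usually points to a driver, CUDA, or platform mismatch.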
## Running without GPU (CPU-only)
CNN training will be very slow (~60 min/epoch).
Reduce the epoch counts in `config.py`:

    PHASE1_EPOCHS = 5
    PHASE2_EPOCHS = 10

and override the steps per epoch in `train_cnn.py`:

    STEPS_PER_EPOCH = 100

DNN and LightGBM run fine on CPU.
## File descriptions
| File | Purpose |
|---|---|
| `config.py` | All paths and hyperparameters in one place |
| `src/data_utils.py` | MalNet extraction, type-level scanning, tf.data |
| `src/ember_utils.py` | EMBER download, patch, vectorize, RAM-safe load |
| `src/models.py` | EfficientNetB0, DNN, WarmupCosineDecay definitions |
| `train_cnn.py` | Full CNN training pipeline — run this first |
| `train_ember.py` | DNN + LightGBM training — run this second |
| `app.py` | Gradio web UI — run after both trainings complete |
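
The table mentions a `WarmupCosineDecay` schedule in `src/models.py` without describing it. The sketch below shows the usual shape of such a schedule (linear warmup to a peak, then cosine decay) as a plain function. The parameter names and the exact formula are assumptions and may differ from the project's class.

```python
import math


def warmup_cosine_lr(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Learning rate at `step` for linear warmup followed by cosine decay."""
    if step < warmup_steps:
        # Linear ramp: reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine
```

The warmup phase keeps early updates small while the new head is still poorly calibrated, and the cosine phase lowers the rate smoothly toward `min_lr` by the end of training.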
## Common errors
**`MALNET_SRC not found`**
→ Put `malnet-images-tiny.tar.gz` in `data/` folder and check `config.py`
**`Cannot find train/val/test`**
→ The tar.gz extracted to a different depth. The `find_split_root()` function
auto-searches, but if it fails, check what's inside `data/malnet_data/`
and set `SPLIT_ROOT` manually in `train_cnn.py`.
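
To find the right value for `SPLIT_ROOT` by hand, a breadth-first search for the first folder that contains all three split directories is usually enough. The sketch below shares its name with the project's `find_split_root()` but is an independent guess at the behavior, not the project's implementation; `max_depth` is a made-up parameter.

```python
from pathlib import Path

SPLITS = ("train", "val", "test")


def find_split_root(base, max_depth=4):
    """Return the shallowest directory under `base` containing train/, val/ and test/.

    Returns None if `base` is not a directory or nothing matches
    within `max_depth` levels.
    """
    base = Path(base)
    if not base.is_dir():
        return None
    frontier = [base]
    for _ in range(max_depth + 1):
        next_frontier = []
        for directory in frontier:
            if all((directory / split).is_dir() for split in SPLITS):
                return directory
            next_frontier.extend(
                sorted(child for child in directory.iterdir() if child.is_dir())
            )
        frontier = next_frontier
    return None
```

Calling `find_split_root("data/malnet_data")` prints nothing by itself; assign the returned path to `SPLIT_ROOT` if it is not `None`.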
**`Loss = 80 at pre-training check`**
→ This is expected and safe. EfficientNet BN stats from ImageNet don't match
malware images, causing high initial loss. It normalizes by epoch 2.
**`ember module not found`**
→ Run: `pip install git+https://github.com/elastic/ember.git`
**`cnn_results.json not found` in train_ember.py**
→ Run `train_cnn.py` first so the CNN open-set threshold is saved.
**Out of memory during EMBER loading**
→ Reduce the `chunk` parameter of `load_ember()` in `src/ember_utils.py` from 50_000 to 20_000.
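
Why a smaller `chunk` helps: when rows are processed in pieces, the temporary buffer scales with the chunk size rather than with the whole dataset. The sketch below illustrates the idea. It is not the body of `load_ember()`, and the 2381 features per row (the EMBER 2018 feature vector size) stored as 4-byte floats is an assumption about how the project stores its data.

```python
def chunk_ranges(n_rows, chunk=50_000):
    """Yield (start, stop) index pairs covering n_rows in pieces of at most `chunk`."""
    if chunk <= 0:
        raise ValueError("chunk must be positive")
    for start in range(0, n_rows, chunk):
        yield start, min(start + chunk, n_rows)


def chunk_megabytes(chunk, n_features=2381, bytes_per_value=4):
    """Approximate size in MB of one chunk held in memory.

    n_features and bytes_per_value are assumptions (see the text above).
    """
    return chunk * n_features * bytes_per_value / 1e6
```

Under these assumptions, a chunk of 50_000 rows is about 476 MB and a chunk of 20_000 rows about 190 MB, at the cost of more iterations over the data.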
## Expected final results
| Model | Accuracy | AUC | F1 |
|---|---|---|---|
| CNN (Top-1) | 65–80% | — | — |
| CNN (Top-3) | 85–93% | — | — |
| DNN | 96–97% | 0.995+ | 0.96+ |
| LightGBM | 97–98% | 0.997+ | 0.97+ |
| **Ensemble** | **97–98%** | **0.997+** | **0.97+** |