PedroSct/malware-detection-honeypot

# Honeypot-IDS: Malware Detection with Machine Learning

## Overview

This project integrates a **Honeypot** with a **Machine Learning classifier** to automatically detect and quarantine malware in a controlled environment. Files entering the honeypot trap are analyzed in real time by a trained Random Forest model, which routes them to either a safe folder or quarantine, with no human intervention required.

The research was approved by the academic evaluation board at FATEC Ourinhos and is part of the undergraduate program in Information Security Technology.

## Architecture

The system runs across **3 virtual machines** in an isolated network:

    ┌─────────────────┐     (1) sends files      ┌─────────────────────┐
    │   VM Windows    │ ───────────────────────► │      VM Debian      │
    │ Client/Attacker │                          │  Honeypot Observer  │
    └─────────────────┘                          │    (watcher.py)     │
                                                 └──────────┬──────────┘
                                                            │ (2) forwards for analysis
                                                            ▼
                                                 ┌─────────────────────┐
                                                 │      VM Ubuntu      │
                                                 │    Analyzer API     │
                                                 │   (Random Forest)   │
                                                 └──────────┬──────────┘
                                                            │ (3) returns verdict
                                                            ▼
                                            ┌──────────────────────────────┐
                                            │  VM Debian routes file to:   │
                                            │  ✅ /safe       (benign)     │
                                            │  🔴 /quarantine (malware)    │
                                            └──────────────────────────────┘
                                                            │ (5)
                                                            ▼
                                            📊 Performance Report (FP/FN/TP/TN)

**Flow:**

1. Windows VM sends files continuously to the honeypot trap folder
2. `watcher.py` detects new files and forwards them to the analyzer API
3. Ubuntu VM classifies each file using the trained Random Forest model
4. Debian VM routes files to `/safe` or `/quarantine` based on the verdict
5. A performance report is generated with full TP/TN/FP/FN metrics

## Dataset

After benchmarking three datasets, **CIC-MalMem-2022** was selected as the primary dataset due to its superior performance across all classifiers.

| Dataset | Best Algorithm | Accuracy |
|---|---|---|
| **CIC-MalMem-2022** | Random Forest | **99.99% (binary)** |
| DikeDataset | Random Forest | 96.00% |
| Malware Datasets (Adep) | Random Forest | ~95–97% |

The CIC-MalMem-2022 dataset covers memory analysis of malware families including **Ransomware, Spyware, and Trojans** in Windows environments.

## Algorithm Comparison

All classifiers were tested on the CIC-MalMem-2022 dataset. Unless marked as multi-class, the results are for binary classification:

| Algorithm | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| **Random Forest** | 99.99% | 100% | 100% | 100% |
| Decision Tree | 99.99% | 100% | 100% | 100% |
| SVM | 99.95% | ~99.9% | ~99.9% | ~99.9% |
| KNN | 99.95% | ~99.9% | ~99.9% | ~99.9% |
| Random Forest (multi-class) | ~88.70% | ~88% | ~88% | ~88% |

Random Forest was selected for the practical implementation due to its robustness and interpretability in multi-class scenarios.

## Final Results (Practical Simulation, 1 Week)

A total of **1,542 files** were processed during the simulation:

| Metric | Count | Description |
|---|---|---|
| Total files analyzed | 1,542 | Full sample volume |
| Real benign files | 1,215 | Legitimate files sent to the trap |
| Real malware files | 327 | Malware samples from multiple families |
| True Positives (TP) | 318 | Malware correctly quarantined |
| True Negatives (TN) | 1,198 | Benign files correctly passed |
| False Positives (FP) | 17 | Benign files incorrectly quarantined |
| False Negatives (FN) | 9 | Malware incorrectly passed as safe |

### Performance Metrics

| Metric | Result |
|---|---|
| **Overall Accuracy** | **98.31%** |
| Precision | 94.93% |
| **Recall (Detection Rate)** | **97.25%** |
| F1-Score | 96.08% |

## Failure Analysis

**False Positives (17 files):** Mostly software installers using packers similar to those found in malware, and system administration tools whose behavior pattern overlaps with spyware signatures. Misclassifications of this kind are expected from a static analysis model.

**False Negatives (9 files):** All 9 undetected threats were either **polymorphic malware** or **zero-day variants** specifically designed to evade static analysis by altering their structure. This highlights the known limitation of static-only approaches and reinforces the need for multi-layer defense strategies (e.g., dynamic/memory analysis as a second layer).

## Tech Stack

| Layer | Technology |
|---|---|
| Virtualization | VirtualBox, 3 VMs (Debian, Ubuntu, Windows) |
| Honeypot Observer | Python (`watcher.py`) |
| Analyzer API | Python + Flask (`analyzer_api.py`) |
| ML Model | Scikit-learn, Random Forest |
| Dataset | CIC-MalMem-2022 |
| Traffic Monitoring | Wireshark |
| ML Benchmarking | RapidMiner, Jupyter Notebook |

## Source Code

The paper (in Portuguese) is available in [`TG_Honeypot.pdf`](./TG_Honeypot.pdf).

## Authors

| Name | Contact |
|---|---|
| Pedro Augusto Scoton Alves | [linkedin.com/in/pedroscoton](https://linkedin.com/in/pedroscoton) |
| Pedro Lucas de Souza | pedro.souza92@fatec.sp.gov.br |
| Gian Luca Monticeli | gian.monticeli@fatec.sp.gov.br |

**Advisor:** Prof. Dr. Thiago José Lucas (thiago@fatecourinhos.edu.br)

**Institution:** FATEC Ourinhos (Faculdade de Tecnologia de Ourinhos)

**Program:** Tecnólogo em Segurança da Informação

**Year:** 2025

## Related Work

This research builds on 10 peer-reviewed papers (2022–2025) from IEEE Xplore, Wiley, and ACM, comparing approaches including ML-IDHIF, reinforcement learning honeypots (DQN), generative honeypots (GPT-3.5), and IoT-focused detection systems. The Random Forest algorithm appeared in the majority of surveyed works as the most consistent performer across different datasets and attack scenarios.
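
## Example: Routing and Report Sketch

Steps 4 and 5 of the flow (routing on the verdict, then deriving the report from the TP/TN/FP/FN counts) can be sketched roughly as below. This is an illustration only, not the project's `watcher.py`: the names `Counts`, `route_file`, `tally`, and `report` are hypothetical, the verdict is assumed to arrive as a boolean, and the ground-truth label passed to `tally` is assumed to be known for each simulated sample.

```python
from dataclasses import dataclass
from pathlib import Path
import shutil
import tempfile


@dataclass
class Counts:
    """Confusion counts, with malware as the positive class (hypothetical helper)."""
    tp: int = 0
    tn: int = 0
    fp: int = 0
    fn: int = 0


def route_file(path: Path, is_malware: bool, safe_dir: Path, quarantine_dir: Path) -> Path:
    """Move a file to the quarantine or safe folder based on the verdict (flow step 4)."""
    dest_dir = quarantine_dir if is_malware else safe_dir
    dest_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.move(str(path), str(dest_dir / path.name)))


def tally(counts: Counts, actual_malware: bool, predicted_malware: bool) -> None:
    """Update the confusion counts for one sample whose true label is known."""
    if predicted_malware:
        if actual_malware:
            counts.tp += 1
        else:
            counts.fp += 1
    elif actual_malware:
        counts.fn += 1
    else:
        counts.tn += 1


def _ratio(num: int, den: int) -> float:
    # Return 0.0 instead of raising when a denominator is empty.
    return num / den if den else 0.0


def report(c: Counts) -> dict[str, float]:
    """Derive the report metrics from the confusion counts (flow step 5)."""
    return {
        "accuracy": _ratio(c.tp + c.tn, c.tp + c.tn + c.fp + c.fn),
        "precision": _ratio(c.tp, c.tp + c.fp),
        "recall": _ratio(c.tp, c.tp + c.fn),
        # Equivalent to 2 * precision * recall / (precision + recall).
        "f1": _ratio(2 * c.tp, 2 * c.tp + c.fp + c.fn),
    }


if __name__ == "__main__":
    # Route one dummy file inside a throwaway directory (paths are assumptions).
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        sample = root / "trap" / "sample.bin"
        sample.parent.mkdir()
        sample.write_bytes(b"demo")
        dest = route_file(sample, True, root / "safe", root / "quarantine")
        print(dest.parent.name)

    # Counts taken from the Final Results table.
    for name, value in report(Counts(tp=318, tn=1198, fp=17, fn=9)).items():
        print(f"{name}: {value:.2%}")
```

With the counts from the Final Results table, this yields 98.31% accuracy, 94.93% precision, and 97.25% recall. F1 computed directly from the counts comes out at 96.07%; the 96.08% in the Performance Metrics table is the value obtained when F1 is calculated from the already-rounded precision and recall.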