Fnc-Jit/Malware-Threat-Scoring-Engine
# 🛡️ AI-Based Malware Detection System

An intelligent malware analysis platform powered by machine learning that detects threats in real time with explainable AI insights.

Trained on 1,000,000 synthetic samples • 100% Test Accuracy • Explains Every Decision
## 📋 Table of Contents

- [Overview](#-overview)
- [Architecture](#-architecture)
- [Features](#-features)
- [How It Works](#-how-it-works)
- [ML Model Performance](#-ml-model-performance)
- [Risk Score Calculation](#risk-score-calculation)
- [Feature Extraction](#feature-extraction)
- [Explainable AI (XAI)](#-explainable-ai-xai)
- [Installation](#-installation)
- [Usage](#-usage)
- [Project Structure](#-project-structure)
- [Tech Stack](#-tech-stack)
- [Future Enhancements](#-future-enhancements)

## 🔍 Overview

Traditional antivirus relies on signature databases, so it can only catch threats it has already seen. This system takes a different approach: it extracts structural and behavioral features from any file and applies a dual-model ML pipeline to detect both known malware patterns and statistically anomalous files that may represent zero-day threats.

The design goal was not just detection but **explainability**: every verdict is accompanied by a breakdown of which features drove the classification and what each one means, making it useful as a learning and triage tool, not just a black-box scanner.
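Byte-level Shannon entropy is one of the structural features the pipeline relies on. The snippet below is a minimal sketch of how such a value can be computed with the standard library; it is not taken from the project's `feature_extraction.py`, and the function name `shannon_entropy` is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of `data` in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0  # convention chosen for this sketch: empty input has zero entropy
    total = len(data)
    counts = Counter(data)  # frequency of each byte value
    # H = sum over observed byte values of -p * log2(p)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


# Every byte value equally likely: reaches the 8 bits/byte maximum.
uniform = bytes(range(256)) * 4
# A single repeated byte value: no uncertainty at all.
repetitive = b"A" * 1024

print(shannon_entropy(uniform))
print(shannon_entropy(repetitive))
```

Values near 8.0 are what packed or encrypted payloads tend to produce, while plain text and most uncompressed data land noticeably lower.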
**Core design decisions:**

- Dual-model architecture (Random Forest + Isolation Forest) catches both known patterns and novel outliers
- Shannon entropy as a primary feature: encrypted/packed payloads have characteristically high entropy
- Weighted risk score rather than binary output, which gives analysts actionable signal strength
- Synthetic training data used deliberately to control class balance and feature distribution

## 🏗️ Architecture

```mermaid
graph TB
    subgraph User Interface
        A[📤 File Upload Page] --> B[📊 Threat Report Page]
        B --> C[🧠 Explanation Modal]
    end
    subgraph Backend - Flask
        D[🔄 File Upload Handler] --> E[🔬 Feature Extraction]
        E --> F[🤖 ML Prediction Engine]
        F --> G[📝 Explanation Generator]
    end
    subgraph ML Models
        H[🌲 Random Forest Classifier]
        I[🔮 Isolation Forest - Anomaly]
    end
    subgraph Feature Analysis
        J[📏 File Size]
        K[🎲 Shannon Entropy]
        L[🔧 PE Section Analysis]
        M[⚙️ API Import Analysis]
    end
    A -->|Upload| D
    E --> J & K & L & M
    F --> H & I
    F -->|Risk Score| B
    G -->|9 Analysis Categories| C
    style A fill:#3b82f6,color:#fff
    style B fill:#8b5cf6,color:#fff
    style C fill:#10b981,color:#fff
    style H fill:#f59e0b,color:#000
    style I fill:#ef4444,color:#fff
```

## ✨ Features

### 🔬 Intelligent Analysis Engine

```mermaid
graph LR
    A[📄 Any File] --> B{PE File?}
    B -->|Yes| C[Full PE Analysis]
    B -->|No| D[Generic Analysis]
    C --> E[Section Entropy]
    C --> F[API Imports]
    C --> G[Suspicious APIs]
    C --> H[Number of Sections]
    D --> I[File Entropy]
    D --> J[File Size]
    E & F & G & H & I & J --> K[🤖 ML Classification]
    K --> L[📊 Risk Score 0-100]
    L --> M{Score}
    M -->|0-30| N[✅ Safe]
    M -->|31-70| O[⚠️ Suspicious]
    M -->|71-100| P[🚨 Malicious]
    style N fill:#10b981,color:#fff
    style O fill:#f59e0b,color:#000
    style P fill:#ef4444,color:#fff
```

### 🎯 Dual-Model Detection

| Model | Type | Purpose | How It Works |
|---|---|---|---|
| **Random Forest** | Supervised Classification | Primary malware detection | Ensemble of 100 decision trees voting on malware probability |
| **Isolation Forest** | Unsupervised Anomaly Detection | Zero-day threat detection | Identifies files with unusual feature profiles that deviate from normal patterns |

### Feature Extraction

The system extracts 9 features per file. For PE executables, the full structural analysis runs; for generic files, entropy and size are the primary signals.

| # | Feature | Source | Why it matters |
|---|---|---|---|
| 1 | `Size` | File metadata | Unusually small high-entropy files are strong malware indicators |
| 2 | `Entropy` | Shannon formula | Packed/encrypted payloads score 7.5–8.0; benign files typically 2.0–6.5 |
| 3 | `NumSections` | PE header | Malware often has abnormal section counts |
| 4 | `AvgSectionEntropy` | PE sections | High average entropy suggests code packing |
| 5 | `MaxSectionEntropy` | PE sections | A single high-entropy section can indicate injected shellcode |
| 6 | `NumImports` | PE import table | Very low import counts can indicate manual API resolution (an evasion technique) |
| 7 | `SuspiciousImportCount` | PE import table | Direct count of high-risk Windows API calls present |
| 8 | `NumExports` | PE export table | Unusual for standard executables; common in malicious DLLs |
| 9 | `IsPE` | File magic bytes | Determines which analysis branch runs |

### Suspicious API Watchlist

The PE parser flags imports matching known malware-associated Windows APIs:

| API | Threat Pattern |
|---|---|
| `VirtualAlloc` / `WriteProcessMemory` | Shellcode injection staging |
| `CreateRemoteThread` | Remote process code execution |
| `URLDownloadToFile` / `InternetOpen` | C2 communication and payload retrieval |
| `RegCreateKey` / `RegSetValue` | Persistence via registry modification |
| `GetProcAddress` / `LoadLibrary` | Dynamic import resolution to evade static analysis |
| `ShellExecute` | Child process spawning |

## ⚙️ How It Works

```mermaid
sequenceDiagram
    participant U as 👤 User
    participant F as 🌐 Flask App
    participant FE as 🔬 Feature Extractor
    participant RF as 🌲 Random Forest
    participant IF as 🔮 Isolation Forest
    participant EX as 💡 Explanation Engine
    U->>F: Upload File
    F->>FE: Extract Features
    FE-->>F: 9 Feature Vector
    par Dual Model Prediction
        F->>RF: Predict Malware Probability
        RF-->>F: Probability (0.0 - 1.0)
    and
        F->>IF: Check for Anomaly
        IF-->>F: Normal / Anomaly
    end
    F->>F: Calculate Risk Score
    F->>EX: Generate Explanation
    EX-->>F: 9-Category Analysis
    F-->>U: Render Report + Explanation
```

### Risk Score Calculation

```
Risk Score = (RF_Probability × 0.70) + (Anomaly_Score × 0.20) + (Entropy / 8.0 × 0.10)
```

Each component lies in [0, 1] and the weights sum to 1, so the formula as written yields a value between 0 and 1; the 0–100 score shown in reports presumably corresponds to that value multiplied by 100.

The 70/20/10 weighting reflects relative signal reliability: the Random Forest has the strongest predictive power on the training distribution, anomaly detection adds sensitivity for out-of-distribution files, and raw entropy provides a lightweight sanity check independent of both models.

| Component | Weight | Source | Description |
|---|---|---|---|
| **ML Probability** | 70% | Random Forest | Primary malware classification confidence |
| **Anomaly Score** | 20% | Isolation Forest | 1.0 if anomaly detected, 0.0 if normal |
| **Entropy Factor** | 10% | Shannon Entropy | Normalized entropy (entropy / 8.0) |

## 📊 ML Model Performance

### Training Configuration

| Parameter | Value |
|---|---|
| Training samples | 1,000,000 (synthetic) |
| Train / test split | 80% / 20% |
| Random Forest estimators | 100 trees |
| Isolation Forest contamination | 0.1 |
| PE sample ratio | ~50% |

### Classification Results

Test accuracy on the held-out 20% split: 100% (see the note below on what this figure does and does not show).

### Note on Synthetic Training Data

Training was performed on synthetically generated samples with controlled feature distributions rather than a real malware corpus. This was a deliberate choice for class balance and reproducibility, but it has an important implication: **the 100% test accuracy reflects performance on held-out synthetic data, not real-world generalization**. The primary roadmap item is retraining on real samples from MalwareBazaar and VirusTotal to close this gap. The current system should be treated as a research prototype and triage aid, not a production AV replacement.
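### Risk Score Sketch

The weighted formula and the verdict bands described in this README can be combined in a few lines of Python. This is a minimal sketch, not the code in `app.py`: the function names `risk_score` and `verdict` are hypothetical, and two details are assumptions rather than documented behavior, namely that the weighted sum is multiplied by 100 to reach the 0-100 scale and that the result is rounded to an integer before the bands are applied.

```python
# Weights and bands copied from the tables in this README.
W_RF, W_ANOMALY, W_ENTROPY = 0.70, 0.20, 0.10
MAX_ENTROPY = 8.0  # upper bound of byte-level Shannon entropy


def risk_score(rf_probability: float, is_anomaly: bool, entropy: float) -> int:
    """Combine the three signals into a 0-100 score.

    Assumption: the x100 scaling and integer rounding are illustrative choices.
    """
    anomaly_score = 1.0 if is_anomaly else 0.0
    # Clamp so an out-of-range entropy value cannot push the score past 100.
    entropy_factor = min(max(entropy / MAX_ENTROPY, 0.0), 1.0)
    raw = (W_RF * rf_probability
           + W_ANOMALY * anomaly_score
           + W_ENTROPY * entropy_factor)
    return round(raw * 100)


def verdict(score: int) -> str:
    """Map a 0-100 score onto the Safe / Suspicious / Malicious bands."""
    if score <= 30:
        return "Safe"
    if score <= 70:
        return "Suspicious"
    return "Malicious"


# High RF probability, flagged as an anomaly, near-maximal entropy.
score = risk_score(rf_probability=0.95, is_anomaly=True, entropy=7.8)
print(score, verdict(score))  # 96 Malicious
```

Because the Random Forest term carries 70% of the weight, an anomaly flag and maximal entropy alone contribute at most 30 points, which keeps such a file in the Safe band unless the classifier also assigns it some malware probability.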
### Training Data Distribution

```mermaid
pie title Training Data Composition
    "Benign PE" : 30
    "Malicious PE" : 20
    "Benign Non-PE (Low Entropy)" : 15
    "Benign Non-PE (High Entropy)" : 20
    "Malicious Non-PE" : 15
```

### Entropy Distribution by Class

| File Type | Benign Range | Malicious Range |
|---|---|---|
| PE executables | 2.0 – 6.5 | 6.0 – 7.99 |
| Non-PE (standard) | 1.0 – 6.5 | n/a |
| Non-PE (compressed/media) | 6.5 – 7.99 | n/a |
| Non-PE (malicious) | n/a | 7.80 – 8.0 |

## 💡 Explainable AI (XAI)

Every scan produces a detailed explanation accessible via the floating "Why?" modal. The explanation covers **9 analysis categories** plus a final recommendation. The goal is to make the model's reasoning auditable rather than opaque.

```mermaid
graph TB
    A[🧠 Why Button Clicked] --> B[Floating Modal Opens]
    B --> C[📋 Analysis Summary]
    B --> D[📁 File Type Analysis]
    B --> E[🎲 Entropy Analysis]
    B --> F[📏 File Size Analysis]
    B --> G[🔧 Structural Analysis]
    B --> H[⚙️ API Import Analysis]
    B --> I[🤖 ML Model Analysis]
    B --> J[🔬 Anomaly Detection]
    B --> K[📊 Risk Score Breakdown]
    B --> L[💡 Recommendation]
    style A fill:#3b82f6,color:#fff
    style B fill:#8b5cf6,color:#fff
    style C fill:#1e3a5f,color:#fff
    style L fill:#10b981,color:#fff
```

| Category | What it shows |
|---|---|
| Summary | High-level verdict with risk score context |
| File type | PE vs generic, associated risk profile |
| Entropy | Entropy value interpretation and what it signals |
| File size | Size-entropy correlation analysis |
| Structural | PE section count and per-section entropy |
| API imports | Which suspicious APIs were found and their threat patterns |
| ML analysis | RF confidence and top contributing features |
| Anomaly detection | Whether the Isolation Forest flagged the file as a statistical outlier |
| Risk breakdown | Component-level contribution bar (RF / anomaly / entropy) |

## 🚀 Installation

### Prerequisites

- Python 3.10 or higher
- pip (Python package manager)

### Setup

```bash
# 1. Clone the repository
git clone https://github.com/Fnc-Jit/MAlWARE_ANYLASIS.git
cd MAlWARE_ANYLASIS

# 2. Create a virtual environment
python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Train the ML models (generates 1M samples)
python3 train_model.py

# 5. Run the application
python3 app.py
```

### Access the Application

Open your browser and navigate to: **http://localhost:5000**

## 📖 Usage

1. **Upload a File**: Click the upload button on the home page and select any file
2. **View the Report**: See the risk score (0-100), verdict, and analysis breakdown
3. **Click "Why?"**: Open the floating explanation modal for a detailed 9-category analysis
4. **Review Recommendations**: Follow the suggested actions based on the verdict

### Verdict Classification

| Verdict | Risk Score | Meaning |
|---|---|---|
| ✅ **Safe** | 0 - 30 | No significant malicious indicators detected |
| ⚠️ **Suspicious** | 31 - 70 | Some concerning characteristics found; exercise caution |
| 🚨 **Malicious** | 71 - 100 | Strong malware indicators; delete immediately |

## 📁 Project Structure

```
MAlWARE_ANYLASIS/
├── app.py                  # Flask web application & risk scoring engine
├── feature_extraction.py   # PE & generic file feature extractor
├── train_model.py          # Synthetic data generation & model training
├── verify_system.py        # Automated verification test suite
├── requirements.txt        # Python dependencies
├── .gitignore              # Git ignore rules
├── README.md               # This file
├── models/
│   ├── rf_model.pkl        # Trained Random Forest model
│   └── iso_forest.pkl      # Trained Isolation Forest model
├── data/
│   └── dataset.csv         # Generated training data (gitignored)
├── templates/
│   ├── index.html          # File upload page
│   └── report.html         # Scan report with explanation modal
└── uploads/                # Temporary upload directory (gitignored)
```

## 🛠️ Tech Stack

```mermaid
graph LR
    subgraph Frontend
        A[HTML5] --> B[CSS3]
        B --> C[JavaScript]
        C --> D[Chart.js]
        A --> E[Bootstrap 5]
    end
    subgraph Backend
        F[Python 3] --> G[Flask]
        G --> H[Jinja2 Templates]
    end
    subgraph Machine Learning
        I[scikit-learn] --> J[Random Forest]
        I --> K[Isolation Forest]
        L[NumPy] --> I
        M[Pandas] --> I
    end
    subgraph Analysis
        N[pefile] --> O[PE Parsing]
        P[math] --> Q[Shannon Entropy]
    end
    style A fill:#e34f26,color:#fff
    style B fill:#1572b6,color:#fff
    style C fill:#f7df1e,color:#000
    style F fill:#3776ab,color:#fff
    style G fill:#000,color:#fff
    style I fill:#f7931e,color:#000
```

| Layer | Technology | Purpose |
|---|---|---|
| **Frontend** | HTML5, CSS3, JavaScript, Bootstrap 5, Chart.js | Dark-themed responsive UI with data visualizations |
| **Backend** | Python 3, Flask | Web server, routing, file handling, risk calculation |
| **ML Engine** | scikit-learn, NumPy, Pandas | Model training, prediction, and feature processing |
| **File Analysis** | pefile, math | PE header parsing, Shannon entropy calculation |
| **Serialization** | joblib | Model persistence (save/load trained models) |

## 🔮 Future Enhancements

- [ ] **Deep Learning Integration**: Add LSTM/CNN models for byte-sequence analysis
- [ ] **Real Malware Dataset**: Train on real-world malware samples (VirusTotal, MalwareBazaar)
- [ ] **Dynamic Analysis**: Sandbox execution for behavioral analysis
- [ ] **YARA Rule Integration**: Pattern matching with YARA signatures
- [ ] **Multi-file Batch Scanning**: Upload and scan multiple files simultaneously
- [ ] **API Endpoint**: REST API for programmatic malware scanning
- [ ] **Scan History Dashboard**: Track all previous scans with filtering
- [ ] **PDF Report Export**: Download detailed analysis reports as PDF
- [ ] **VirusTotal Integration**: Cross-reference with VirusTotal's database
- [ ] **Docker Deployment**: One-click deployment with Docker Compose

## Related Projects

- **[God's Eye](https://github.com/Fnc-Jit/Gods-Eye)**: Autonomous multi-agent SIEM platform. Malware risk scores from this tool feed into the God's Eye threat triage pipeline.
- **[DeepDecoy](https://github.com/Fnc-Jit)**: Dynamic honeypot system with attacker behavioral profiling.

## 📄 License

This project is licensed under the MIT License.

⭐ Star this repo if you found it useful!