Abishek2511/Dynamic-Malware-Analysis


# Dynamic Malware Analysis

An automated malware analysis pipeline that dynamically analyses suspicious samples, extracts behavioural indicators, and classifies them using machine learning, achieving 87% classification accuracy with reduced false positives.

## Overview

This project was developed as part of my MSc Cyber Security dissertation at the University of Birmingham. The goal was to build an end-to-end automated pipeline that could:

- Automatically submit malware samples to a sandboxed environment
- Extract dynamic behavioural indicators from execution
- Engineer features from raw behavioural data
- Classify samples as malicious or benign using machine learning

Rather than relying on static signature-based detection (which attackers can easily bypass), this pipeline focuses on dynamic behavioural analysis: observing what a sample actually does when executed, making it effective against obfuscated and polymorphic malware.

## Architecture

      Malware Sample
             │
             ▼
    ┌─────────────────┐
    │ Cuckoo Sandbox  │ ← Nested virtualised environment
    │ (Execution)     │
    └────────┬────────┘
             │
             ▼
    ┌─────────────────────────────┐
    │ Behavioural Data            │
    │ - API calls                 │
    │ - Network activity          │
    │ - File system changes       │
    │ - Registry modifications    │
    │ - Process execution         │
    └────────┬────────────────────┘
             │
             ▼
    ┌─────────────────┐
    │ Feature         │
    │ Engineering     │ ← Python extraction & processing
    └────────┬────────┘
             │
             ▼
    ┌─────────────────┐
    │ ML Classifier   │ ← Classification model
    │ 87% Accuracy    │
    └─────────────────┘

## Key Features

- Automated pipeline: end-to-end from sample submission to classification result
- Dynamic analysis: runtime behaviour extraction, not static signatures
- 50+ behavioural indicators extracted per sample, including:
  - Windows API call sequences
  - Network connection attempts (IPs, domains, ports)
  - File system read/write/delete operations
  - Registry key modifications
  - Process creation and injection attempts
- Feature engineering: raw behavioural data transformed into ML-ready feature vectors
- 87% classification accuracy with reduced false positives through refined feature selection
- Nested virtualisation: isolated sandbox environment preventing malware escape

## Tech Stack

| Component | Technology |
| --- | --- |
| Programming Language | Python 3.8+ |
| Sandbox Environment | Cuckoo Sandbox |
| Virtualisation | Nested VM (VirtualBox) |
| Data Processing | Pandas, NumPy |
| Machine Learning | Scikit-learn |
| Feature Engineering | Custom Python modules |

## Results

| Metric | Score |
| --- | --- |
| Classification Accuracy | 87% |
| False Positive Rate | Reduced via feature refinement |
| Behavioural Indicators Extracted | 50+ per sample |

## How It Works

### 1. Sample Submission

Malware samples are automatically submitted to a Cuckoo Sandbox instance running in a nested virtualised environment, ensuring complete isolation from the host system.

### 2. Dynamic Execution

The sandbox executes each sample and monitors all system interactions in real time, capturing:

- Every Windows API call made by the process
- All network connections attempted
- File system modifications (create, read, write, delete)
- Registry changes
- Child processes spawned

### 3. Behavioural Data Extraction

A Python extraction module processes the raw Cuckoo JSON reports, pulling out 50+ behavioural indicators per sample and structuring them into a consistent format for analysis.

### 4. Feature Engineering

Raw indicators are transformed into numerical feature vectors suitable for machine learning. Key engineering decisions included:

- API call frequency distributions
- Network behaviour aggregation
- File path pattern encoding
- Temporal sequence analysis

### 5. Classification

The feature vectors are fed into a trained machine learning classifier that outputs a malicious/benign verdict with a confidence score.

## Why Dynamic Analysis?

Traditional antivirus relies on static signatures: known patterns in malware code. Modern malware frequently uses:

- Obfuscation: encoding or encrypting the payload
- Polymorphism: changing its own code on each execution
- Packing: compressing the executable to hide its true content

Dynamic analysis bypasses all of these by observing what the malware does, not what it looks like. A malware sample must eventually unpack itself and call system APIs to function, and that's when we catch it.
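
As a rough illustration of steps 3 and 4 under How It Works, the sketch below turns a Cuckoo-style JSON report into a flat numerical feature dictionary. It is not the project's actual extraction module. The function name `extract_features`, the feature names, and the choice of tracked APIs are hypothetical, and the report keys (`behavior.apistats`, `behavior.summary`, `behavior.processes`, `network.hosts`, `network.domains`) are assumed from the Cuckoo 2.x report layout, so they should be checked against real reports before use.

```python
from collections import Counter


def extract_features(report, tracked_apis):
    """Build a flat feature dict from one Cuckoo-style report (assumed layout)."""
    behavior = report.get("behavior", {})
    network = report.get("network", {})
    summary = behavior.get("summary", {})

    # Sum per-process API call counts into one sample-wide tally.
    api_counts = Counter()
    for per_process in behavior.get("apistats", {}).values():
        api_counts.update(per_process)
    total_calls = sum(api_counts.values())

    features = {}
    # API call frequency distribution: share of all calls made to each tracked API.
    for api in tracked_apis:
        share = api_counts[api] / total_calls if total_calls else 0.0
        features[f"api_freq_{api}"] = share

    # Simple aggregates for network, file system, registry and process activity.
    features["n_hosts"] = len(network.get("hosts", []))
    features["n_domains"] = len(network.get("domains", []))
    features["n_files_written"] = len(summary.get("file_written", []))
    features["n_files_deleted"] = len(summary.get("file_deleted", []))
    features["n_regkeys_written"] = len(summary.get("regkey_written", []))
    features["n_processes"] = len(behavior.get("processes", []))
    return features


# Hand-written stand-in for a parsed report.json (illustrative values only).
sample_report = {
    "behavior": {
        "apistats": {
            "1234": {"NtCreateFile": 3, "RegSetValueExA": 1},
            "5678": {"NtCreateFile": 1, "connect": 3},
        },
        "summary": {
            "file_written": ["C:\\Users\\user\\AppData\\x.tmp"],
            "regkey_written": ["HKCU\\Software\\Example\\Run"],
        },
        "processes": [{"pid": 1234}, {"pid": 5678}],
    },
    "network": {"hosts": ["203.0.113.5"], "domains": [{"domain": "example.test"}]},
}

tracked = ["NtCreateFile", "connect", "WriteProcessMemory"]
features = extract_features(sample_report, tracked)
for name, value in features.items():
    print(f"{name}: {value}")
```

One dictionary per sample can then be stacked into a Pandas `DataFrame` and passed to a Scikit-learn classifier. Defaulting every missing key to an empty value keeps the feature set identical across samples, which matters because reports for samples that did little at runtime may omit whole sections.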