Abishek2511/Dynamic-Malware-Analysis
# Dynamic Malware Analysis
An automated malware analysis pipeline that dynamically analyses suspicious samples, extracts behavioural indicators, and classifies them using machine learning, achieving 87% classification accuracy with reduced false positives.
## Overview
This project was developed as part of my MSc Cyber Security dissertation at the University of Birmingham. The goal was to build an end-to-end automated pipeline that could:
- Automatically submit malware samples to a sandboxed environment
- Extract dynamic behavioural indicators from execution
- Engineer features from raw behavioural data
- Classify samples as malicious or benign using machine learning
Rather than relying on static signature-based detection, which attackers can often evade, this pipeline focuses on dynamic behavioural analysis: it observes what a sample actually does when executed, which makes it better suited to obfuscated and polymorphic malware.
## Architecture
          Malware Sample
                 │
                 ▼
        ┌─────────────────┐
        │ Cuckoo Sandbox  │  ← Nested virtualised environment
        │   (Execution)   │
        └────────┬────────┘
                 │
                 ▼
        ┌─────────────────────────────┐
        │ Behavioural Data            │
        │  - API calls                │
        │  - Network activity         │
        │  - File system changes      │
        │  - Registry modifications   │
        │  - Process execution        │
        └────────┬────────────────────┘
                 │
                 ▼
        ┌─────────────────┐
        │ Feature         │
        │ Engineering     │  ← Python extraction & processing
        └────────┬────────┘
                 │
                 ▼
        ┌─────────────────┐
        │ ML Classifier   │  ← Classification model
        │ 87% Accuracy    │
        └─────────────────┘
## Key Features
- Automated pipeline: end-to-end, from sample submission to classification result
- Dynamic analysis: runtime behaviour extraction, not static signatures
- 50+ behavioural indicators extracted per sample, including:
  - Windows API call sequences
  - Network connection attempts (IPs, domains, ports)
  - File system read/write/delete operations
  - Registry key modifications
  - Process creation and injection attempts
- Feature engineering: raw behavioural data transformed into ML-ready feature vectors
- 87% classification accuracy, with false positives reduced through refined feature selection
- Nested virtualisation: an isolated sandbox environment that contains samples during execution
## Tech Stack
| Component | Technology |
| --- | --- |
| Programming Language | Python 3.8+ |
| Sandbox Environment | Cuckoo Sandbox |
| Virtualisation | Nested VM (VirtualBox) |
| Data Processing | Pandas, NumPy |
| Machine Learning | Scikit-learn |
| Feature Engineering | Custom Python modules |
## Results
| Metric | Score |
| --- | --- |
| Classification Accuracy | 87% |
| False Positive Rate | Reduced via feature refinement |
| Behavioural Indicators Extracted | 50+ per sample |
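For reference, accuracy and false positive rate can both be derived from the counts of a binary confusion matrix. The sketch below shows the arithmetic only. The function name `accuracy_and_fpr` and the label lists are hypothetical and made up for illustration; they are not the project's evaluation data.

```python
def accuracy_and_fpr(y_true, y_pred):
    """Return (accuracy, false positive rate) for binary labels (1 = malicious, 0 = benign)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # FPR: share of benign samples that were wrongly flagged as malicious
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, fpr


# Invented labels, for illustration only
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
acc, fpr = accuracy_and_fpr(y_true, y_pred)
print(f"accuracy={acc:.2f}, fpr={fpr:.2f}")  # accuracy=0.75, fpr=0.25
```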
## How It Works
### 1. Sample Submission
Malware samples are automatically submitted to a Cuckoo Sandbox instance running in a nested virtualised environment, which keeps execution isolated from the host system.
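One way such a submission step could look is sketched below, assuming the Cuckoo REST API is enabled and reachable. The `/tasks/create/file` endpoint and the `task_id` response field come from the Cuckoo API; the function name `submit_sample`, the default `base_url`, and the `token` parameter are illustrative assumptions, and the project may equally submit samples another way (for example through the Cuckoo command line). The `session` argument is expected to behave like a `requests.Session`.

```python
from pathlib import Path


def submit_sample(sample_path, session, base_url="http://localhost:8090", token=None):
    """Upload one file to a Cuckoo REST API and return the new task ID.

    `session` is any object with a requests-style .post() method.
    """
    # Cuckoo 2.x expects a bearer token; older setups may not need one (assumption).
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    path = Path(sample_path)
    with path.open("rb") as handle:
        response = session.post(
            f"{base_url}/tasks/create/file",
            files={"file": (path.name, handle)},
            headers=headers,
            timeout=60,
        )
    response.raise_for_status()
    return response.json()["task_id"]


# Hypothetical usage (requires the third-party `requests` package):
#   import requests
#   task_id = submit_sample("samples/suspicious.exe", requests.Session(), token="...")
```

Passing the session in, instead of creating it inside the function, keeps the sketch easy to exercise against a stub without a live sandbox.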
### 2. Dynamic Execution
The sandbox executes each sample and monitors all system interactions in real time, capturing:
- Every Windows API call made by the process
- All network connections attempted
- File system modifications (create, read, write, delete)
- Registry changes
- Child processes spawned
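Once a sample has run, the captured behaviour has to be collected from the sandbox. A minimal sketch of that step follows, assuming the Cuckoo REST API is used: it polls `/tasks/view/<id>` until the task status is `reported`, then fetches the JSON report from `/tasks/report/<id>`. The function name `wait_for_report` and the polling parameters are hypothetical, and any authorisation header is assumed to be set on the session beforehand.

```python
import time


def wait_for_report(task_id, session, base_url="http://localhost:8090",
                    poll_seconds=10, max_polls=60):
    """Poll a Cuckoo task until it is reported, then return the JSON report as a dict.

    `session` is any object with a requests-style .get() method.
    """
    for _ in range(max_polls):
        view = session.get(f"{base_url}/tasks/view/{task_id}", timeout=30)
        view.raise_for_status()
        if view.json()["task"]["status"] == "reported":
            report = session.get(f"{base_url}/tasks/report/{task_id}", timeout=120)
            report.raise_for_status()
            return report.json()
        time.sleep(poll_seconds)  # analysis still pending or running
    raise TimeoutError(f"task {task_id} was not reported after {max_polls} polls")
```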
### 3. Behavioural Data Extraction
A Python extraction module processes the raw Cuckoo JSON reports, pulling out 50+ behavioural indicators per sample and structuring them into a consistent format for analysis.
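The extraction step might look roughly like the sketch below. It is not the project's module: the function name `extract_indicators` and the choice of output keys are hypothetical, and the report field names (`behavior.processes[].calls[].api`, `behavior.summary`, `network.hosts`, `network.dns`) follow the layout of Cuckoo 2.x JSON reports, which can differ between versions. Only a handful of indicators are shown.

```python
def extract_indicators(report):
    """Flatten a Cuckoo-style report dict into a flat dict of behavioural indicators."""
    behavior = report.get("behavior", {})
    summary = behavior.get("summary", {})
    network = report.get("network", {})
    processes = behavior.get("processes", [])

    # API names in call order, concatenated across all monitored processes
    api_calls = [
        call["api"]
        for process in processes
        for call in process.get("calls", [])
    ]
    return {
        "api_calls": api_calls,
        "process_count": len(processes),
        "files_written": summary.get("file_written", []),
        "files_deleted": summary.get("file_deleted", []),
        "regkeys_written": summary.get("regkey_written", []),
        "contacted_hosts": network.get("hosts", []),
        "dns_domains": [query.get("request") for query in network.get("dns", [])],
    }


# Tiny hand-written report fragment, for illustration only
sample_report = {
    "behavior": {
        "processes": [{"calls": [{"api": "NtCreateFile"}, {"api": "NtWriteFile"}]}],
        "summary": {"file_written": ["C:\\Users\\user\\AppData\\drop.tmp"]},
    },
    "network": {"hosts": ["203.0.113.10"], "dns": [{"request": "example.com"}]},
}
indicators = extract_indicators(sample_report)
```

Using `.get()` with defaults means a report that lacks a section (for example, no network activity) still yields the same keys, which keeps the output format consistent across samples.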
### 4. Feature Engineering
Raw indicators are transformed into numerical feature vectors suitable for machine learning. Key engineering decisions included:
- API call frequency distributions
- Network behaviour aggregation
- File path pattern encoding
- Temporal sequence analysis
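Two of these ideas can be illustrated with the standard library alone. The sketch below computes a relative API call frequency vector over a fixed vocabulary, and counts consecutive API pairs (bigrams) as one simple way of retaining ordering information. The function names, the vocabulary, and the example trace are assumptions made for illustration; the project's actual encoding is not described in this README.

```python
from collections import Counter


def api_frequency_features(api_calls, vocabulary):
    """Relative frequency of each API in a fixed vocabulary (fixed order = fixed vector layout)."""
    counts = Counter(api_calls)
    total = len(api_calls) or 1  # avoid division by zero for empty traces
    return [counts[api] / total for api in vocabulary]


def api_bigram_counts(api_calls):
    """Count consecutive API pairs, a simple way to keep some ordering information."""
    return Counter(zip(api_calls, api_calls[1:]))


# Invented trace and vocabulary, for illustration only
trace = ["NtCreateFile", "NtWriteFile", "NtWriteFile", "RegSetValueExA"]
vocab = ["NtCreateFile", "NtWriteFile", "RegSetValueExA", "connect"]

vector = api_frequency_features(trace, vocab)  # [0.25, 0.5, 0.25, 0.0]
bigrams = api_bigram_counts(trace)
```

Normalising by trace length keeps long-running and short-lived samples comparable, at the cost of discarding the absolute call volume, which could be added back as a separate feature.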
### 5. Classification
The feature vectors are fed into a trained machine learning classifier that outputs a malicious/benign verdict with a confidence score.
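As a rough illustration of this last step, the sketch below trains a scikit-learn model on a toy dataset and returns a verdict together with the predicted class probability as the confidence score. The README does not say which model the project uses, so `RandomForestClassifier` is an assumption, as are the `classify` helper and the invented training data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy training data, invented for illustration: rows are feature vectors,
# labels are 1 = malicious, 0 = benign.
X_train = np.array([
    [0.9, 0.8, 0.7], [0.8, 0.9, 0.9], [0.7, 0.9, 0.8], [0.9, 0.7, 0.9],
    [0.1, 0.2, 0.1], [0.2, 0.1, 0.0], [0.0, 0.1, 0.2], [0.1, 0.0, 0.1],
])
y_train = np.array([1, 1, 1, 1, 0, 0, 0, 0])

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)


def classify(feature_vector):
    """Return (verdict, confidence) for a single feature vector."""
    probabilities = model.predict_proba([feature_vector])[0]
    best = int(np.argmax(probabilities))
    verdict = "malicious" if model.classes_[best] == 1 else "benign"
    return verdict, float(probabilities[best])


verdict, confidence = classify([0.85, 0.9, 0.8])
print(verdict, round(confidence, 2))
```

Here the confidence is simply the probability of the predicted class; a real deployment would likely want to check how well calibrated those probabilities are before acting on them.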
## Why Dynamic Analysis?
Traditional antivirus relies largely on static signatures: known patterns in malware code. Modern malware frequently uses:
- Obfuscation: encoding or encrypting the payload
- Polymorphism: mutating its own code with each new copy, so that successive samples do not match the same signature
- Packing: compressing or encrypting the executable to hide its true content
Dynamic analysis sidesteps these techniques by observing what the malware does, not what it looks like. A sample must eventually unpack itself and call system APIs to function, and that is the point at which its behaviour can be observed.