kushpathakgit/Android-Malware-Analysis

# Android Malware Detection using Machine Learning

![Python](https://img.shields.io/badge/Python-3.x-blue) ![Machine Learning](https://img.shields.io/badge/ML-ScikitLearn-orange) ![Android Security](https://img.shields.io/badge/Security-Android-green) ![Status](https://img.shields.io/badge/Project-Completed-success)

# Overview

Android malware has evolved rapidly, making traditional signature-based detection systems increasingly ineffective against modern threats that use:

- Code obfuscation
- Dynamic payload downloading
- Runtime evasion
- Sandbox detection techniques

This project implements a dual-phase Android malware detection system using Machine Learning:

## Static Analysis

Analyzes APK files without execution by extracting:

- Permissions
- Intents
- Manifest metadata
- Hardware access requests

## Dynamic Analysis

Executes applications inside an isolated Android emulator to monitor:

- Runtime network traffic
- Behavioral communication patterns
- Suspicious outbound connections
- Real-time malware activity

The project compares the effectiveness of static and dynamic analysis pipelines using multiple supervised ML classifiers.
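As a rough illustration of the static side, the sketch below turns the permissions requested in a manifest into a 0/1 feature vector. It is not the project's actual extractor. It assumes the manifest has already been decoded from the APK's binary format into plain XML (the README does not say which tool the project uses for this), and the names `INDICATORS`, `requested_permissions`, `permission_vector`, and `SAMPLE_MANIFEST` are hypothetical. The indicator list reuses the example indicators given under Static Analysis Pipeline.

```python
import xml.etree.ElementTree as ET

# Namespace of android:* attributes in a decoded AndroidManifest.xml.
ANDROID_NS = "http://schemas.android.com/apk/res/android"

# Illustrative indicator set (assumption: the real feature set may be larger).
INDICATORS = [
    "READ_SMS",
    "SEND_SMS",
    "RECEIVE_BOOT_COMPLETED",
    "SYSTEM_ALERT_WINDOW",
    "ACCESS_FINE_LOCATION",
]


def requested_permissions(manifest_xml: str) -> set[str]:
    """Return the short names of all <uses-permission> entries."""
    root = ET.fromstring(manifest_xml)
    names = set()
    for node in root.iter("uses-permission"):
        full_name = node.get(f"{{{ANDROID_NS}}}name", "")
        if full_name:
            # "android.permission.READ_SMS" -> "READ_SMS"
            names.add(full_name.rsplit(".", 1)[-1])
    return names


def permission_vector(manifest_xml: str) -> list[int]:
    """Encode a manifest as a 0/1 vector over INDICATORS."""
    requested = requested_permissions(manifest_xml)
    return [1 if name in requested else 0 for name in INDICATORS]


# Hypothetical manifest, used only to exercise the functions above.
SAMPLE_MANIFEST = """
<manifest xmlns:android="http://schemas.android.com/apk/res/android"
          package="com.example.sample">
    <uses-permission android:name="android.permission.READ_SMS" />
    <uses-permission android:name="android.permission.ACCESS_FINE_LOCATION" />
    <uses-permission android:name="android.permission.INTERNET" />
</manifest>
"""

print(permission_vector(SAMPLE_MANIFEST))  # [1, 0, 0, 0, 1]
```

One such vector per APK, plus a malware/benign label, gives a table that can be written to CSV and passed to the classifiers.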
# Features

- Hybrid malware detection architecture
- Static permission & manifest analysis
- Dynamic behavioral traffic analysis
- Rootless network interception using PCAPdroid
- Automated PCAP-to-CSV transformation pipeline
- Machine learning classification engine
- Balanced malware/benign dataset from AndroZoo
- Comparative evaluation of multiple ML models

# System Architecture

                      ┌────────────────────┐
                      │   AndroZoo APKs    │
                      └─────────┬──────────┘
                                │
                ┌───────────────┴────────────────┐
                │                                │
                ▼                                ▼
       ┌─────────────────┐              ┌──────────────────┐
       │ Static Analysis │              │ Dynamic Analysis │
       └─────────────────┘              └──────────────────┘
                │                                │
                ▼                                ▼
       Permission & Intent            Runtime Network Traffic
       Feature Extraction                 Capture (PCAP)
                │                                │
                ▼                                ▼
           CSV Dataset                     PCAP Parsing
                                                 │
                                                 ▼
                                      Behavioral Feature CSV
                                                 │
                         ┌───────────────────────┴──────────────────────┐
                         ▼                                              ▼
              Machine Learning Models                        Comparative Evaluation

# My Contributions

As a core engineer on this project, my primary responsibilities focused on the Dynamic Analysis Pipeline and Behavioral Feature Engineering.
## Dynamic Emulation Infrastructure

- Configured and maintained an isolated Android Studio emulator environment
- Securely executed potentially malicious APK samples
- Prevented host system exposure during malware execution

## Runtime Behavioral Monitoring

- Integrated PCAPdroid for rootless packet interception
- Captured inbound and outbound traffic during execution
- Monitored suspicious communication attempts to remote servers

## Network Traffic Feature Engineering

Developed automated parsing pipelines for raw PCAP files using:

- Python
- Scapy
- Pandas

Extracted behavioral metrics such as:

- TCP/UDP flow statistics
- Packet sizes
- Connection frequencies
- Flow duration
- Protocol distributions

## Dataset Engineering

- Converted raw PCAP traffic into structured CSV datasets
- Built machine-learning-ready behavioral datasets for classification

# Tech Stack

## Languages

- Python
- Java

## Machine Learning

- Scikit-Learn
- k-Nearest Neighbors (k-NN)
- Random Forest
- Decision Tree
- Logistic Regression

## Data Engineering

- Pandas
- NumPy
- Scapy

## Infrastructure & Tooling

- Android Studio Emulator
- PCAPdroid
- AndroZoo API

# Dataset

A balanced dataset of 100 Android applications was curated using the AndroZoo academic repository.

| Category | Count |
|---|---|
| Benign APKs | 50 |
| Malware APKs | 50 |

# Static Analysis Pipeline

The static analysis phase evaluates applications without runtime execution.

## Extracted Features

- Dangerous permissions
- Manifest intents
- Hardware access requests
- Broadcast receivers
- Startup behaviors

### Example Indicators

- READ_SMS
- SEND_SMS
- RECEIVE_BOOT_COMPLETED
- SYSTEM_ALERT_WINDOW
- ACCESS_FINE_LOCATION

# Dynamic Analysis Pipeline

The dynamic pipeline was designed to detect malware capable of bypassing static scanners.

## Execution Workflow

1. Install APK inside Android emulator
2. Execute application for a fixed runtime window
3. Capture network traffic using PCAPdroid
4. Parse PCAP files into behavioral metrics
5. Train ML classifiers on runtime data

## Captured Runtime Features

- Flow duration
- Packet size distribution
- Connection frequency
- Protocol type analysis
- Outbound communication behavior

# Machine Learning Models Evaluated

| Model | Purpose |
|---|---|
| k-Nearest Neighbors | Similarity-based classification |
| Decision Tree | Rule-based malware detection |
| Random Forest | Ensemble learning |
| Logistic Regression | Statistical classification |

# Results

# Static Analysis Performance

| Classifier | Accuracy | Recall | F1 Score |
|---|---|---|---|
| k-NN | 95.00% | 100.00% | 95.24% |
| Random Forest | 90.00% | 90.00% | 90.00% |
| Decision Tree | 80.00% | 70.00% | 77.78% |
| Logistic Regression | 80.00% | 80.00% | 80.00% |

## Best Static Model

k-Nearest Neighbors (k-NN)

- 95% Accuracy
- 100% Recall
- Zero false negatives on the test dataset

# Dynamic Analysis Performance

| Classifier | Accuracy | Recall | F1 Score |
|---|---|---|---|
| Logistic Regression | 75.00% | 100.00% | 80.00% |
| k-NN | 50.00% | 33.33% | 40.00% |
| Decision Tree | 33.33% | 33.33% | 33.33% |
| Random Forest | 33.33% | 16.67% | 20.00% |

## Best Dynamic Model

Logistic Regression proved most resilient in constrained dynamic environments.

# Engineering Challenges

Dynamic malware analysis introduced several real-world cybersecurity challenges:

## Hardware Constraints

- Android emulation is computationally expensive
- Several malware samples crashed during execution
- Dynamic dataset generation became resource-intensive

## Anti-Emulation Techniques

Some malware samples attempted sandbox evasion by checking:

- Emulator hardware fingerprints
- CPU architecture
- Sensor availability
- User interaction patterns
- Network environment artifacts

These techniques significantly affected dynamic behavioral collection.
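For readers comparing the Results tables, the sketch below shows the standard confusion-matrix definitions of accuracy, recall, and F1, with malware as the positive class. The counts passed in are hypothetical: the README does not report the test-set size or any confusion matrix, so the values (a 20-sample split with 10 malware and 10 benign apps) were chosen only because they reproduce the k-NN row of the static table. The function name `classification_metrics` is likewise an assumption.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Compute accuracy, recall, precision and F1 from confusion-matrix counts.

    Malware is treated as the positive class, so `fn` counts missed malware.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}


# ASSUMPTION: hypothetical counts, not taken from the project's results.
# 10 malware samples all detected, 1 of 10 benign samples flagged by mistake.
knn_static = classification_metrics(tp=10, fp=1, fn=0, tn=9)

for name, value in knn_static.items():
    print(f"{name}: {value:.2%}")
# accuracy: 95.00%
# recall: 100.00%
# precision: 90.91%
# f1: 95.24%
```

With zero false negatives, recall is 100% however many benign apps are misclassified. This is why the tables report accuracy and F1 alongside recall.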
# Key Findings

## Static Analysis

- Lightweight and computationally efficient
- Excellent for first-line malware screening
- Strong performance with limited datasets

## Dynamic Analysis

- More resilient against obfuscation
- Better for detecting advanced threats
- Requires scalable infrastructure and large behavioral datasets

## Important Observation

Simpler statistical models like Logistic Regression outperformed complex ensemble methods under constrained dynamic datasets.

# Future Improvements

- Cloud-based scalable sandbox infrastructure
- Large-scale automated malware execution
- Deep learning integration
- Real-device behavioral analysis
- API-call sequence analysis
- Hybrid ensemble detection systems

# Repository Structure

    Android-Malware-Analysis/
    │
    ├── dataset/
    ├── static-analysis/
    ├── dynamic-analysis/
    ├── pcap-processing/
    ├── models/
    ├── results/
    ├── reportEPICS.pdf
    └── README.md

# Full Documentation

The complete project report containing:

- Methodology
- Literature review
- Experimental setup
- Architecture
- Comparative evaluation
- Hardware limitations
- Benchmark analysis

is included in this repository as:

📄 `reportEPICS.pdf`

# Project Repository

GitHub Repository: https://github.com/kushpathakgit/Android-Malware-Analysis

# Authors

Developed as part of the EPICS Project at VIT Bhopal University.

Core contribution areas:

- Dynamic malware emulation
- Runtime behavioral analysis
- Network traffic interception
- PCAP feature engineering
- Machine learning evaluation