# Android Malware Detection using Machine Learning




# Overview
Android malware has evolved rapidly, making traditional signature-based detection systems increasingly ineffective against modern threats that use:
- Code obfuscation
- Dynamic payload downloading
- Runtime evasion
- Sandbox detection techniques
This project implements a dual-phase Android malware detection system using Machine Learning:
## Static Analysis
Analyzes APK files without execution by extracting:
- Permissions
- Intents
- Manifest metadata
- Hardware access requests
## Dynamic Analysis
Executes applications inside an isolated Android emulator to monitor:
- Runtime network traffic
- Behavioral communication patterns
- Suspicious outbound connections
- Real-time malware activity
The project compares the effectiveness of static and dynamic analysis pipelines using multiple supervised ML classifiers.
# Features
- Hybrid malware detection architecture
- Static permission & manifest analysis
- Dynamic behavioral traffic analysis
- Rootless network interception using PCAPdroid
- Automated PCAP-to-CSV transformation pipeline
- Machine learning classification engine
- Balanced malware/benign dataset from AndroZoo
- Comparative evaluation of multiple ML models
# System Architecture
                              ┌────────────────────┐
                              │   AndroZoo APKs    │
                              └─────────┬──────────┘
                                        │
                        ┌───────────────┴────────────────┐
                        │                                │
                        ▼                                ▼
               ┌─────────────────┐              ┌──────────────────┐
               │ Static Analysis │              │ Dynamic Analysis │
               └─────────────────┘              └──────────────────┘
                        │                                │
                        ▼                                ▼
               Permission & Intent            Runtime Network Traffic
               Feature Extraction                 Capture (PCAP)
                        │                                │
                        ▼                                ▼
                   CSV Dataset                     PCAP Parsing
                                                         │
                                                         ▼
                                              Behavioral Feature CSV
                                        │
                ┌───────────────────────┴──────────────────────┐
                ▼                                              ▼
     Machine Learning Models                        Comparative Evaluation
# My Contributions
As a core engineer on this project, I focused primarily on the Dynamic Analysis Pipeline and Behavioral Feature Engineering.
## Dynamic Emulation Infrastructure
- Configured and maintained an isolated Android Studio emulator environment
- Securely executed potentially malicious APK samples
- Prevented host system exposure during malware execution
## Runtime Behavioral Monitoring
- Integrated PCAPdroid for rootless packet interception
- Captured inbound and outbound traffic during execution
- Monitored suspicious communication attempts to remote servers
## Network Traffic Feature Engineering
Developed automated parsing pipelines for raw PCAP files using:
- Python
- Scapy
- Pandas
Extracted behavioral metrics such as:
- TCP/UDP flow statistics
- Packet sizes
- Connection frequencies
- Flow duration
- Protocol distributions
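
The metrics above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual parser: the function names `read_packets` and `flow_features`, the choice of directional 5-tuple flows, and the output column names are all assumptions. Scapy is imported inside `read_packets` so the aggregation step can be exercised on plain tuples.

```python
from collections import defaultdict
from statistics import mean


def read_packets(pcap_path):
    """Yield (time, src, dst, sport, dport, proto, size) for IPv4 TCP/UDP packets."""
    # Imported lazily so flow_features() can be used without Scapy installed.
    from scapy.all import IP, TCP, UDP, rdpcap

    for pkt in rdpcap(pcap_path):
        if not pkt.haslayer(IP):
            continue
        for layer, name in ((TCP, "TCP"), (UDP, "UDP")):
            if pkt.haslayer(layer):
                yield (float(pkt.time), pkt[IP].src, pkt[IP].dst,
                       pkt[layer].sport, pkt[layer].dport, name, len(pkt))
                break


def flow_features(packets):
    """Group packets into directional 5-tuple flows and summarise each flow."""
    flows = defaultdict(list)
    for ts, src, dst, sport, dport, proto, size in packets:
        flows[(src, dst, sport, dport, proto)].append((ts, size))

    rows = []
    for (src, dst, sport, dport, proto), pkts in flows.items():
        times = [t for t, _ in pkts]
        sizes = [s for _, s in pkts]
        rows.append({
            "src": src, "dst": dst, "sport": sport, "dport": dport,
            "protocol": proto,
            "packet_count": len(pkts),
            "total_bytes": sum(sizes),
            "mean_packet_size": mean(sizes),
            "flow_duration": max(times) - min(times),  # seconds
        })
    return rows
```

With Pandas available, `pd.DataFrame(flow_features(read_packets("capture.pcap")))` would turn one capture into a flow table; the file name here is a placeholder.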
## Dataset Engineering
- Converted raw PCAP traffic into structured CSV datasets
- Built machine-learning-ready behavioral datasets for classification
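
As a rough sketch of that conversion step, the snippet below collapses one app's flow records into a single labelled row and writes rows as CSV. The column names, the `summarise_capture` and `write_dataset` helpers, and the 1 = malware / 0 = benign encoding are illustrative assumptions, not the schema used in this repository.

```python
import csv
from collections import Counter

FIELDNAMES = ["app_id", "flow_count", "distinct_destinations",
              "tcp_share", "udp_share", "label"]


def summarise_capture(app_id, label, flows):
    """Collapse one app's flow records into a single labelled feature row."""
    protocols = Counter(f["protocol"] for f in flows)
    total = len(flows)
    return {
        "app_id": app_id,
        "flow_count": total,
        "distinct_destinations": len({f["dst"] for f in flows}),
        "tcp_share": protocols["TCP"] / total if total else 0.0,
        "udp_share": protocols["UDP"] / total if total else 0.0,
        "label": label,  # assumed encoding: 1 = malware, 0 = benign
    }


def write_dataset(rows, handle):
    """Write feature rows to an open text handle as CSV with a header line."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
```

One row per application keeps the behavioral dataset in the same shape as the static one, so the same classifiers can be trained on either.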
# Tech Stack
## Languages
- Python
- Java
## Machine Learning
- Scikit-Learn
- k-Nearest Neighbors (k-NN)
- Random Forest
- Decision Tree
- Logistic Regression
## Data Engineering
- Pandas
- NumPy
- Scapy
## Infrastructure & Tooling
- Android Studio Emulator
- PCAPdroid
- AndroZoo API
# Dataset
A balanced dataset of 100 Android applications was curated using the AndroZoo academic repository.
| Category | Count |
|---|---|
| Benign APKs | 50 |
| Malware APKs | 50 |
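
One way a balanced sample like this could be drawn from the AndroZoo index is sketched below. The labelling rule is an assumption: this README does not state how benign and malware samples were selected, so the `vt_detection` thresholds (0 detections for benign, at least `malware_min_detections` for malware) are hypothetical. The `sha256` and `vt_detection` columns and the download endpoint follow AndroZoo's published index and API format; check the current AndroZoo documentation before relying on them.

```python
import csv
import random
from urllib.parse import urlencode

API = "https://androzoo.uni.lu/api/download"


def balanced_sample(index_csv, per_class=50, malware_min_detections=10, seed=0):
    """Pick per_class benign and per_class malware hashes from an index CSV handle."""
    benign, malware = [], []
    for row in csv.DictReader(index_csv):
        if row["vt_detection"] == "":
            continue  # no scan result, so the sample cannot be labelled
        detections = int(row["vt_detection"])
        if detections == 0:
            benign.append(row["sha256"])
        elif detections >= malware_min_detections:  # hypothetical threshold
            malware.append(row["sha256"])
    rng = random.Random(seed)
    return rng.sample(benign, per_class), rng.sample(malware, per_class)


def download_url(sha256, api_key):
    """Build the download URL for one APK."""
    return API + "?" + urlencode({"apikey": api_key, "sha256": sha256})
```

The full index is very large, so a real script would likely filter it in a streaming pass or work from a pre-filtered subset.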
# Static Analysis Pipeline
The static analysis phase evaluates applications without runtime execution.
## Extracted Features
- Dangerous permissions
- Manifest intents
- Hardware access requests
- Broadcast receivers
- Startup behaviors
### Example Indicators
- READ_SMS
- SEND_SMS
- RECEIVE_BOOT_COMPLETED
- SYSTEM_ALERT_WINDOW
- ACCESS_FINE_LOCATION
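
To illustrate how indicators like these can become model inputs, here is a minimal sketch that turns a manifest into a 0/1 feature per permission. It assumes the `AndroidManifest.xml` has already been decoded to plain XML (the binary form inside an APK needs a decoder first), and the `permission_vector` helper is hypothetical rather than the extractor used in this project.

```python
import xml.etree.ElementTree as ET

ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

# Indicator list taken from the examples above; a real feature set would be larger.
INDICATORS = [
    "READ_SMS", "SEND_SMS", "RECEIVE_BOOT_COMPLETED",
    "SYSTEM_ALERT_WINDOW", "ACCESS_FINE_LOCATION",
]


def permission_vector(manifest_xml):
    """Return a 0/1 feature per indicator permission requested in the manifest."""
    root = ET.fromstring(manifest_xml)
    requested = set()
    for node in root.iter("uses-permission"):
        name = node.get(ANDROID_NS + "name", "")
        # android.permission.READ_SMS -> READ_SMS
        requested.add(name.rsplit(".", 1)[-1])
    return {perm: int(perm in requested) for perm in INDICATORS}
```

Stacking one such vector per APK, together with a label column, gives the kind of CSV dataset the static classifiers are trained on.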
# Dynamic Analysis Pipeline
The dynamic pipeline was designed to detect malware capable of bypassing static scanners.
## Execution Workflow
1. Install APK inside Android emulator
2. Execute application for a fixed runtime window
3. Capture network traffic using PCAPdroid
4. Parse PCAP files into behavioral metrics
5. Train ML classifiers on runtime data
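
Steps 1 and 2 could be automated along these lines. This is a sketch under stated assumptions: the emulator serial `emulator-5554`, the 120-second window, and the helper names are placeholders, and the PCAPdroid capture is assumed to be started and stopped separately on the device.

```python
import subprocess
import time


def build_commands(apk_path, package, serial="emulator-5554"):
    """Return the adb invocations for one sample as argument lists."""
    adb = ["adb", "-s", serial]
    return {
        "install": adb + ["install", "-r", apk_path],
        # monkey with a single event launches the app's LAUNCHER activity
        "launch": adb + ["shell", "monkey", "-p", package,
                         "-c", "android.intent.category.LAUNCHER", "1"],
        "stop": adb + ["shell", "am", "force-stop", package],
        "uninstall": adb + ["uninstall", package],
    }


def run_sample(apk_path, package, window_seconds=120):
    """Install, launch, wait for the runtime window, then clean up."""
    cmds = build_commands(apk_path, package)
    subprocess.run(cmds["install"], check=True)
    subprocess.run(cmds["launch"], check=True)
    time.sleep(window_seconds)  # capture is assumed to be running meanwhile
    subprocess.run(cmds["stop"], check=True)
    subprocess.run(cmds["uninstall"], check=True)
```

Uninstalling after each run keeps one sample's traffic from leaking into the next capture; restoring an emulator snapshot between runs would be a stricter alternative.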
## Captured Runtime Features
- Flow duration
- Packet size distribution
- Connection frequency
- Protocol type analysis
- Outbound communication behavior
# Machine Learning Models Evaluated
| Model | Purpose |
|---|---|
| k-Nearest Neighbors | Similarity-based classification |
| Decision Tree | Rule-based malware detection |
| Random Forest | Ensemble learning |
| Logistic Regression | Statistical classification |
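
A comparison of these four models with Scikit-Learn might look like the sketch below. The hyperparameters, the 80/20 stratified split, and the `compare_models` helper are assumptions for illustration; the settings actually used in this project are not stated here.

```python
def compare_models(X, y, test_size=0.2, seed=0):
    """Train the four classifiers on one split and return their test metrics."""
    # Imported inside the function so the module loads even without Scikit-Learn.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, f1_score, recall_score
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    models = {
        "k-NN": KNeighborsClassifier(n_neighbors=5),
        "Decision Tree": DecisionTreeClassifier(random_state=seed),
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=seed),
        "Logistic Regression": LogisticRegression(max_iter=1000),
    }
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed)

    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        predicted = model.predict(X_test)
        results[name] = {
            "accuracy": accuracy_score(y_test, predicted),
            "recall": recall_score(y_test, predicted),  # malware is the positive class (label 1)
            "f1": f1_score(y_test, predicted),
        }
    return results
```

Using the same split for every model keeps the comparison fair, although with a dataset this small a single split gives noisy estimates and cross-validation would be more stable.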
# Results
## Static Analysis Performance
| Classifier | Accuracy | Recall | F1 Score |
|---|---|---|---|
| k-NN | 95.00% | 100.00% | 95.24% |
| Random Forest | 90.00% | 90.00% | 90.00% |
| Decision Tree | 80.00% | 70.00% | 77.78% |
| Logistic Regression | 80.00% | 80.00% | 80.00% |
## Best Static Model
k-Nearest Neighbors (k-NN)
- 95% Accuracy
- 100% Recall
- Zero false negatives on the test dataset
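
For readers less familiar with these metrics, the sketch below shows how accuracy, recall, and F1 follow from confusion-matrix counts. The counts are hypothetical: the test split size is not stated in this README, and a 20-app split with 10 malware and 10 benign samples is simply one set of counts that reproduces the k-NN row.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (accuracy, recall, f1) with malware as the positive class."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    if precision + recall == 0:
        return accuracy, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, recall, f1


# Hypothetical counts: every malware sample caught, one benign app misflagged.
acc, rec, f1 = classification_metrics(tp=10, fp=1, tn=9, fn=0)
print(f"{acc:.2%} {rec:.2%} {f1:.2%}")  # prints 95.00% 100.00% 95.24%
```

Zero false negatives gives 100% recall, while the single false positive is what pulls accuracy and F1 below 100%.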
## Dynamic Analysis Performance
| Classifier | Accuracy | Recall | F1 Score |
|---|---|---|---|
| Logistic Regression | 75.00% | 100.00% | 80.00% |
| k-NN | 50.00% | 33.33% | 40.00% |
| Decision Tree | 33.33% | 33.33% | 33.33% |
| Random Forest | 33.33% | 16.67% | 20.00% |
## Best Dynamic Model
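Logistic Regression was the most resilient model on the constrained dynamic dataset, reaching 75% accuracy, 100% recall, and an 80% F1 score, ahead of every other classifier tested.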
# Engineering Challenges
Dynamic malware analysis introduced several real-world cybersecurity challenges:
## Hardware Constraints
- Android emulation is computationally expensive
- Several malware samples crashed during execution
- Dynamic dataset generation became resource-intensive
## Anti-Emulation Techniques
Some malware samples attempted sandbox evasion by checking:
- Emulator hardware fingerprints
- CPU architecture
- Sensor availability
- User interaction patterns
- Network environment artifacts
Samples that detect an emulator can withhold their malicious behavior, so these checks significantly reduced the behavioral data the dynamic pipeline could collect.
# Key Findings
## Static Analysis
- Lightweight and computationally efficient
- Excellent for first-line malware screening
- Strong performance with limited datasets
## Dynamic Analysis
- More resilient against obfuscation
- Better for detecting advanced threats
- Requires scalable infrastructure and large behavioral datasets
## Important Observation
On the constrained dynamic dataset, a simpler statistical model (Logistic Regression, 75% accuracy) outperformed the more complex ensemble method (Random Forest, 33.33% accuracy).
# Future Improvements
- Cloud-based scalable sandbox infrastructure
- Large-scale automated malware execution
- Deep learning integration
- Real-device behavioral analysis
- API-call sequence analysis
- Hybrid ensemble detection systems
# Repository Structure
    Android-Malware-Analysis/
    │
    ├── dataset/
    ├── static-analysis/
    ├── dynamic-analysis/
    ├── pcap-processing/
    ├── models/
    ├── results/
    ├── reportEPICS.pdf
    └── README.md
# Full Documentation
The complete project report is included in this repository as 📄 `reportEPICS.pdf`. It covers:
- Methodology
- Literature review
- Experimental setup
- Architecture
- Comparative evaluation
- Hardware limitations
- Benchmark analysis
# Project Repository
GitHub Repository:
https://github.com/kushpathakgit/Android-Malware-Analysis
# Authors
Developed as part of the EPICS Project at VIT Bhopal University.
Core contribution areas:
- Dynamic malware emulation
- Runtime behavioral analysis
- Network traffic interception
- PCAP feature engineering
- Machine learning evaluation