ThisPlatypus/From-PCAP-to-IMAGE
# PCAP to Image Dataset Creator
## Repository Structure
.
├── 0_pcap/ # Original PCAP files (input)
│ ├── malware/ # Example: nested structure
│ │ ├── trojan/
│ │ └── ransomware/
│ └── benign/
├── 1_splitted_pcap/ # Split PCAP files (preserves structure)
│ ├── malware/
│ │ ├── trojan/
│ │ └── ransomware/
│ └── benign/
├── 2_Images/ # Generated images (preserves structure)
│ ├── malware/
│ │ ├── trojan/
│ │ └── ransomware/
│ └── benign/
├── split_pcap.py # Script to split PCAP files
├── pcap_to_image.py # Script to convert PCAP to images
├── requirements.txt # Python dependencies
└── README.md # This file
## Installation
1. **Clone or create the repository:**
mkdir pcap-dataset-creator
cd pcap-dataset-creator
2. **Create the directory structure:**
mkdir -p 0_pcap 1_splitted_pcap 2_Images
3. **Install dependencies:**
pip install -r requirements.txt
## Requirements
The `requirements.txt` file contains:
scapy>=2.5.0
numpy>=1.21.0
Pillow>=9.0.0
## Usage
### Step 1: Split PCAP Files
Split PCAP files by session or flow, with automatic trimming/padding to 800 bits (100 bytes):
**Process a single file:**
python split_pcap.py 0_pcap/capture.pcap --mode session --max-bits 800
**Process entire directory tree (preserves nested structure):**
python split_pcap.py 0_pcap/ --mode session --max-bits 800
**Custom bit size:**
python split_pcap.py 0_pcap/ --mode flow --max-bits 1600
### Step 2: Convert to Images
Convert split PCAP files to fixed-size grayscale images:
**Full packet mode with 28x28 images:**
python pcap_to_image.py --mode full --size 28x28
**Header-only mode with custom size:**
python pcap_to_image.py --mode header --size 32x32
**Custom input/output directories:**
python pcap_to_image.py --input 1_splitted_pcap --output 2_Images --mode full --size 64x64
### Parameters
#### split_pcap.py
- `input`: Input PCAP file or directory path (required)
- `--mode`: Split mode - `session` (bidirectional) or `flow` (unidirectional) [default: session]
- `--output`: Output base directory [default: 1_splitted_pcap]
- `--max-bits`: Maximum bits per split file, will trim or pad to this exact size [default: 800]
#### pcap_to_image.py
- `--input`: Input directory with PCAP files [default: 1_splitted_pcap]
- `--output`: Output directory for images [default: 2_Images]
- `--mode`: Conversion mode - `header` (headers only) or `full` (entire packet) [default: full]
- `--size`: Image dimensions as WIDTHxHEIGHT (e.g., 28x28, 32x32, 64x64) [default: 28x28]
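The `--size` value and `--max-bits` are related through the one-byte-per-pixel mapping described under Output Format. The following is a minimal sketch of that arithmetic. The function names `parse_size` and `matching_max_bits` are hypothetical helpers written for illustration and are not taken from the repository's scripts.

```python
from __future__ import annotations


def parse_size(size: str) -> tuple[int, int]:
    """Parse a WIDTHxHEIGHT string such as '28x28' into (width, height)."""
    width_text, sep, height_text = size.lower().partition("x")
    if not sep:
        raise ValueError(f"expected WIDTHxHEIGHT, got {size!r}")
    width, height = int(width_text), int(height_text)
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    return width, height


def matching_max_bits(size: str) -> int:
    """Bits needed to fill the image exactly, assuming one byte (8 bits) per pixel."""
    width, height = parse_size(size)
    return width * height * 8


if __name__ == "__main__":
    for size in ("28x28", "32x32", "64x64"):
        print(size, "->", matching_max_bits(size), "bits")
```

Passing the printed value as `--max-bits` should give split files that fill the image with neither padding nor trimming, provided the scripts use the same one-byte-per-pixel mapping.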
## Complete Workflow Example
### Simple (Single File)
# Place PCAP in 0_pcap/
cp capture.pcap 0_pcap/
# Split by sessions (800 bits each)
python split_pcap.py 0_pcap/capture.pcap --mode session --max-bits 800
# Convert to 28x28 images (full packet)
python pcap_to_image.py --mode full --size 28x28
### Advanced (Nested Directory Structure)
# Organize your PCAPs by category
0_pcap/
├── malware/
│ ├── trojan/
│ │ ├── sample1.pcap
│ │ └── sample2.pcap
│ └── ransomware/
│ └── sample3.pcap
└── benign/
└── normal_traffic.pcap
# Split all files, preserving structure
python split_pcap.py 0_pcap/ --mode session --max-bits 800
# Result in 1_splitted_pcap/:
1_splitted_pcap/
├── malware/
│ ├── trojan/
│ │ ├── sample1_20241216_143052_session_0001.pcap
│ │ ├── sample1_20241216_143052_session_0002.pcap
│ │ └── ...
│ └── ransomware/
│ └── ...
└── benign/
└── ...
# Convert to images, preserving structure
python pcap_to_image.py --mode full --size 32x32
# Result in 2_Images/:
2_Images/
├── malware/
│ ├── trojan/
│ │ ├── sample1_20241216_143052_session_0001.png
│ │ ├── sample1_20241216_143052_session_0002.png
│ │ └── ...
│ └── ransomware/
│ └── ...
└── benign/
└── ...
## Key Features
### 🎯 Fixed Size Processing
- **Bit-level control**: Specify exact bit size (default: 800 bits = 100 bytes)
- **Automatic trimming**: Files larger than max-bits are truncated
- **Automatic padding**: Files smaller than max-bits are zero-padded
- **Perfect uniformity**: All split PCAPs are exactly the same size
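The trimming and padding behavior can be sketched in a few lines. This is an illustration under stated assumptions, not the code in `split_pcap.py`: the helper name `fit_to_bits` is hypothetical, zero bytes are assumed to be appended at the end, and `max_bits` is assumed to be a multiple of 8.

```python
def fit_to_bits(data: bytes, max_bits: int = 800) -> bytes:
    """Return data truncated or zero-padded to exactly max_bits bits."""
    if max_bits <= 0 or max_bits % 8 != 0:
        # Assumption: only whole bytes are supported in this sketch.
        raise ValueError("max_bits must be a positive multiple of 8")
    target = max_bits // 8
    # Slicing drops any excess bytes; ljust appends zero bytes if too short.
    return data[:target].ljust(target, b"\x00")


if __name__ == "__main__":
    print(len(fit_to_bits(b"\x01" * 150)))  # longer input is trimmed
    print(len(fit_to_bits(b"\x01" * 40)))   # shorter input is padded
```

Both calls return 100 bytes, which is the uniform size the default of 800 bits implies.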
### 📊 Image Generation
- **Fixed dimensions**: Specify exact image size (e.g., 28x28, 32x32, 64x64)
- **Automatic trimming**: Data exceeding image size is cut off
- **Automatic padding**: Data smaller than image size is zero-padded
- **Dataset-ready**: All images have identical dimensions
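One way to picture the image step is as laying the byte stream out row by row. The sketch below assumes row-major order and trailing zero padding, neither of which is confirmed by the text above. `bytes_to_rows` and `save_png` are hypothetical helpers. `save_png` relies on Pillow's `Image.frombytes` and imports Pillow only when called.

```python
def bytes_to_rows(data: bytes, width: int, height: int) -> list:
    """Trim or zero-pad data to width * height values and split it into rows."""
    n = width * height
    pixels = data[:n].ljust(n, b"\x00")
    # Each byte (0-255) becomes one grayscale pixel value.
    return [list(pixels[r * width:(r + 1) * width]) for r in range(height)]


def save_png(rows: list, path: str) -> None:
    """Write the rows as an 8-bit grayscale PNG (requires Pillow)."""
    from PIL import Image  # imported lazily so bytes_to_rows works without Pillow

    height, width = len(rows), len(rows[0])
    flat = bytes(value for row in rows for value in row)
    Image.frombytes("L", (width, height), flat).save(path)


if __name__ == "__main__":
    for row in bytes_to_rows(bytes([1, 2, 3, 4, 5]), width=3, height=2):
        print(row)
```

With five input bytes and a 3x2 image, the last pixel is the zero padding.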
### 📁 Directory Structure Preservation
- **Nested folders**: Automatically preserves nested directory structure
- **Organized datasets**: Maintains your classification/labeling hierarchy
- **Batch processing**: Processes entire directory trees recursively
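Structure preservation amounts to re-rooting each file's relative path under the output directory. A minimal sketch with `pathlib` follows. The helper name `mirrored_output_path` is hypothetical and the scripts may implement this differently.

```python
from pathlib import Path


def mirrored_output_path(src: Path, input_root: Path, output_root: Path,
                         suffix: str) -> Path:
    """Map src under input_root to the same relative location under output_root."""
    relative = src.relative_to(input_root)
    # Only the final extension is replaced (e.g. .pcap -> .png).
    return (output_root / relative).with_suffix(suffix)


if __name__ == "__main__":
    src = Path("1_splitted_pcap/malware/trojan/sample1_0001.pcap")
    print(mirrored_output_path(src, Path("1_splitted_pcap"), Path("2_Images"), ".png"))
```

For a whole tree, the same function could be applied to every path yielded by `input_root.rglob("*.pcap")`, creating parent folders with `Path.mkdir(parents=True, exist_ok=True)` before writing.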
### 🔍 Flexible Extraction
- **Header mode**: Extract only packet headers (Ethernet + IP + TCP/UDP)
- **Full mode**: Extract complete packet data including payload
- **Privacy control**: Use header mode to exclude sensitive payload data
## Calculation Examples
### Bit Size to Image Size
800 bits = 100 bytes
→ 10x10 image (100 pixels)
→ 28x28 image (784 pixels) - requires padding
→ 8x12 image (96 pixels) - requires trimming
1600 bits = 200 bytes
→ 14x14 image (196 pixels) - requires trimming
→ 10x20 image (200 pixels) - perfect fit!
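The same comparison can be checked for any combination before running the scripts. This is a small sketch assuming one byte per pixel. `describe_fit` is a hypothetical helper, not part of the repository.

```python
def describe_fit(max_bits: int, width: int, height: int) -> str:
    """Report whether max_bits of data pads, trims, or exactly fills the image."""
    data_bytes = max_bits // 8
    pixels = width * height  # one byte per pixel
    if data_bytes < pixels:
        return f"pad {pixels - data_bytes} bytes"
    if data_bytes > pixels:
        return f"trim {data_bytes - pixels} bytes"
    return "perfect fit"


if __name__ == "__main__":
    for width, height in ((10, 10), (28, 28), (8, 12)):
        print(f"800 bits into {width}x{height}:", describe_fit(800, width, height))
```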
### Common Configurations
# Small images (MNIST-style)
--max-bits 6272 --size 28x28 # 784 pixels = 784 bytes = 6272 bits, perfect fit
# Medium images
--max-bits 8192 --size 32x32 # 1024 pixels = 1024 bytes = 8192 bits
# Larger images
--max-bits 32768 --size 64x64 # 4096 pixels = 4096 bytes = 32768 bits
## Session vs Flow
- **Session mode**: Groups packets bidirectionally (both directions of communication)
  - Better for: Protocol analysis, connection behavior
  - Fewer files, more context per file
  - Example: All packets between 192.168.1.10:443 ↔ 10.0.0.5:8080
- **Flow mode**: Keeps each direction separate
  - Better for: Asymmetric analysis, one-way traffic patterns
  - More files, more granular control
  - Example: 192.168.1.10:443 → 10.0.0.5:8080 (separate from reverse)
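The difference between the two modes comes down to how packets are keyed for grouping. The sketch below shows one common way to build such keys. It is an assumption about the general technique, not a description of what `split_pcap.py` does internally. `PacketInfo`, `flow_key`, and `session_key` are hypothetical names.

```python
from typing import NamedTuple


class PacketInfo(NamedTuple):
    """Hypothetical minimal view of a packet's addressing fields."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    proto: str


def flow_key(p: PacketInfo) -> tuple:
    """Unidirectional: source and destination keep their order."""
    return (p.proto, (p.src_ip, p.src_port), (p.dst_ip, p.dst_port))


def session_key(p: PacketInfo) -> tuple:
    """Bidirectional: endpoints are sorted, so both directions share one key."""
    a, b = sorted([(p.src_ip, p.src_port), (p.dst_ip, p.dst_port)])
    return (p.proto, a, b)


if __name__ == "__main__":
    forward = PacketInfo("192.168.1.10", 443, "10.0.0.5", 8080, "TCP")
    reverse = PacketInfo("10.0.0.5", 8080, "192.168.1.10", 443, "TCP")
    print(session_key(forward) == session_key(reverse))  # same session
    print(flow_key(forward) == flow_key(reverse))        # different flows
```

Grouping packets in a dictionary by `session_key` produces fewer, larger groups than grouping by `flow_key`, which matches the trade-off listed above.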
## Header vs Full Packet
- **Header mode**: Only packet headers
  - Ethernet (14 bytes) + IPv4 (20-60 bytes) + TCP (20-60 bytes) or UDP (8 bytes)
  - 42-134 bytes per packet (14 + 20 + 8 at minimum with UDP, 14 + 60 + 60 at maximum with TCP)
  - Privacy-preserving (no payload data)
  - Good for protocol-based classification
- **Full mode**: Complete packet data
  - Headers + payload
  - Variable size depending on packet
  - Better for deep packet inspection
  - Includes application-layer data
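The header sizes above follow from length fields carried in the packet itself. The sketch below computes the combined header length from a raw Ethernet frame using only the standard library. It is an illustration under assumptions: untagged Ethernet II frames, IPv4 only, and TCP or UDP only. The function names are hypothetical, and the repository's script may extract headers differently (for example through scapy layers).

```python
import struct

ETH_LEN = 14  # untagged Ethernet II header


def header_length(frame: bytes) -> int:
    """Return the length of Ethernet + IPv4 + TCP/UDP headers in a raw frame."""
    if len(frame) < ETH_LEN + 20:
        raise ValueError("frame too short for Ethernet + IPv4")
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype != 0x0800:
        raise ValueError("only IPv4 is handled in this sketch")
    ip_len = (frame[ETH_LEN] & 0x0F) * 4  # IHL field, in 32-bit words
    if ip_len < 20:
        raise ValueError("invalid IPv4 header length")
    proto = frame[ETH_LEN + 9]  # IPv4 protocol field
    l4 = ETH_LEN + ip_len
    if proto == 6:  # TCP: data offset is the high nibble of byte 12
        if len(frame) < l4 + 13:
            raise ValueError("frame too short for TCP header")
        l4_len = (frame[l4 + 12] >> 4) * 4
    elif proto == 17:  # UDP header is always 8 bytes
        l4_len = 8
    else:
        raise ValueError("only TCP and UDP are handled in this sketch")
    total = l4 + l4_len
    if len(frame) < total:
        raise ValueError("frame truncated inside the transport header")
    return total


def header_bytes(frame: bytes) -> bytes:
    """Keep the headers and drop the payload."""
    return frame[:header_length(frame)]
```

Everything after `header_length(frame)` is payload, which is the part header mode leaves out.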
## Output Format
### Split PCAP Files
- **Naming**: `{original}_{timestamp}_{mode}_{index}.pcap`
- **Example**: `capture_20241216_143052_session_0001.pcap`
- **Size**: Exactly `max-bits` bits (trimmed or padded)
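The naming pattern can be reproduced with a format string. The timestamp layout (`YYYYMMDD_HHMMSS`) and the four-digit zero-padded index are inferred from the example filename rather than documented, so treat them as assumptions. `split_name` is a hypothetical helper.

```python
from datetime import datetime


def split_name(original_stem: str, mode: str, index: int, when: datetime) -> str:
    """Build '{original}_{timestamp}_{mode}_{index}.pcap' as in the example above."""
    return f"{original_stem}_{when:%Y%m%d_%H%M%S}_{mode}_{index:04d}.pcap"


if __name__ == "__main__":
    print(split_name("capture", "session", 1, datetime(2024, 12, 16, 14, 30, 52)))
```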
### Images
- **Format**: PNG (grayscale, 8-bit)
- **Naming**: Same as PCAP but with `.png` extension
- **Dimensions**: Exactly as specified (e.g., 28x28)
- **Pixel values**: 0-255 (each byte becomes one pixel value)
## Use Cases
- **Network intrusion detection**: Train CNNs on packet patterns
- **Traffic classification**: Identify applications from packet headers
- **Malware detection**: Analyze C&C communication patterns
- **Protocol fingerprinting**: Visual signatures of network protocols
- **Anomaly detection**: Identify unusual network behavior
## Tips
1. **Choose appropriate max-bits**:
   - 800 bits works well for header-only analysis
   - 1600-3200 bits is better for full packet analysis
2. **Match max-bits to image size**:
   - 28x28 = 784 pixels = 784 bytes = 6,272 bits
   - Recommended: `--max-bits 6272 --size 28x28`
3. **Directory organization**:
   - Organize by traffic type (malware/benign)
   - Organize by protocol (HTTP/SSH/DNS)
   - Scripts preserve any structure you create
4. **Header mode benefits**:
   - Faster processing
   - Privacy-compliant (no payload)
   - Smaller files
## Troubleshooting
### Common Issues
**"No module named 'scapy'"**
pip install -r requirements.txt
**"Permission denied" when reading PCAP**
# On Linux/Mac, you may need sudo for some PCAPs
sudo python split_pcap.py 0_pcap/capture.pcap
**Images look random/noisy**
- This is expected: packet bytes rendered as pixels usually look like noise
- ML models can pick up byte-level patterns that are hard to see by eye
- Try different modes (header vs full) to see differences
**Empty output directories**
- Check that input PCAPs contain IP traffic
- Non-IP packets are filtered out
- Check file permissions
## License
MIT License - Feel free to use and modify for your projects.
## Citation
If you use this tool in your research, please cite:
@software{pcap_to_image_dataset,
title = {PCAP to Image Dataset Creator},
year = {2024},
url = {https://github.com/yourusername/pcap-dataset-creator}
}
## Linked to
This code is part of the following paper:
@inproceedings{camerota2024addressing,
title={Addressing data security in iot: Minimum sample size and denoising diffusion models for improved malware detection},
author={Camerota, Chiara and Pappone, Lorenzo and Pecorella, Tommaso and Esposito, Flavio},
booktitle={2024 20th International Conference on Network and Service Management (CNSM)},
pages={1--7},
year={2024},
organization={IEEE}
}