GitHub: Iditc/CIC-IDS2017-analysis
# CIC-IDS2017-analysis
This project applies machine learning techniques to network intrusion detection using the CIC-IDS2017 dataset. It covers data preprocessing, feature engineering, model training and evaluation — classifying network traffic as benign or malicious across multiple attack categories including DDoS, PortScan, and Brute Force.
## Files
### `preprocessing/`
| File | Description |
|------|-------------|
| `download_data.py` | Downloads the dataset and saves it to `data/raw/` |
| `load_data.py` | Loads all CSVs, normalizes column names, drops NaN/Inf rows, applies feature selection |
| `basic_feature_selection.py` | Three-pass feature selection: zero-variance → inter-feature correlation → label correlation |
| `data_cleaning.py` | Removes duplicate rows and NaN/Inf rows; saves cleaned dataset to `data/processed/clean.parquet` |
### `eda/`
| File | Description |
|------|-------------|
| `dataset_overview.py` | Shape, missing values, class distribution, label descriptions |
| `eda.py` | Label distribution plot, correlation heatmap, top discriminative features |
| `feature_descriptions.py` | Full table of all 79 features with category and description |
| `describe_features.py` | Mean, std, min, max, skewness, and outlier % per feature |
### `models/`
| File | Description |
|------|-------------|
| `baseline_random_forest.py` | Random Forest with no class balancing — serves as the reference point |
| `balanced_random_forest.py` | Random Forest with `class_weight='balanced'` |
| `utils.py` | Shared utilities: label remapping, confusion matrix plotting, comparison CSV |
## Running
    python main.py
## Features
| Feature | Description |
|---|---|
| Destination Port | Port number of the destination (e.g. 80=HTTP, 443=HTTPS, 22=SSH) |
| Flow Duration | Total duration of the flow in microseconds |
| Total Fwd Packets | Total number of packets sent in the forward direction (client → server) |
| Total Backward Packets | Total number of packets sent in the backward direction (server → client) |
| Total Length of Fwd Packets | Total bytes of payload in forward packets |
| Total Length of Bwd Packets | Total bytes of payload in backward packets |
| Fwd Packet Length Max | Largest forward packet size (bytes) |
| Fwd Packet Length Min | Smallest forward packet size (bytes) |
| Fwd Packet Length Mean | Average forward packet size (bytes) |
| Fwd Packet Length Std | Standard deviation of forward packet sizes |
| Bwd Packet Length Max | Largest backward packet size (bytes) |
| Bwd Packet Length Min | Smallest backward packet size (bytes) |
| Bwd Packet Length Mean | Average backward packet size (bytes) |
| Bwd Packet Length Std | Standard deviation of backward packet sizes |
| Flow Bytes/s | Total bytes per second across the entire flow |
| Flow Packets/s | Total packets per second across the entire flow |
| Flow IAT Mean | Mean time between any two consecutive packets in the flow (µs) |
| Flow IAT Std | Standard deviation of inter-arrival times |
| Flow IAT Max | Longest gap between two consecutive packets (µs) |
| Flow IAT Min | Shortest gap between two consecutive packets (µs) |
| Fwd IAT Total | Total time between forward packets (µs) |
| Fwd IAT Mean | Mean inter-arrival time of forward packets (µs) |
| Fwd IAT Std | Standard deviation of forward IAT |
| Fwd IAT Max | Longest gap between forward packets (µs) |
| Fwd IAT Min | Shortest gap between forward packets (µs) |
| Bwd IAT Total | Total time between backward packets (µs) |
| Bwd IAT Mean | Mean inter-arrival time of backward packets (µs) |
| Bwd IAT Std | Standard deviation of backward IAT |
| Bwd IAT Max | Longest gap between backward packets (µs) |
| Bwd IAT Min | Shortest gap between backward packets (µs) |
| Fwd PSH Flags | Number of forward packets with PSH flag (push data immediately to application) |
| Bwd PSH Flags | Number of backward packets with PSH flag |
| Fwd URG Flags | Number of forward packets with URG flag (urgent data) |
| Bwd URG Flags | Number of backward packets with URG flag |
| Fwd Header Length | Total bytes used by headers in forward packets |
| Bwd Header Length | Total bytes used by headers in backward packets |
| Fwd Packets/s | Forward packets per second |
| Bwd Packets/s | Backward packets per second |
| Min Packet Length | Smallest packet in the entire flow (bytes) |
| Max Packet Length | Largest packet in the entire flow (bytes) |
| Packet Length Mean | Average packet size across the entire flow (bytes) |
| Packet Length Std | Standard deviation of packet sizes |
| Packet Length Variance | Variance of packet sizes (Std²) |
| FIN Flag Count | Number of packets with FIN flag (connection teardown) |
| SYN Flag Count | Number of packets with SYN flag — high values may indicate SYN flood |
| RST Flag Count | Number of packets with RST flag (abrupt connection reset) |
| PSH Flag Count | Total PSH flags across all packets |
| ACK Flag Count | Number of packets with ACK flag (acknowledgement) |
| URG Flag Count | Total URG flags across all packets |
| CWE Flag Count | Number of packets with the CWR flag (Congestion Window Reduced); the dataset column name spells it `CWE` |
| ECE Flag Count | Number of packets with ECE flag (Explicit Congestion Notification Echo) |
| Down/Up Ratio | Ratio of download (backward) to upload (forward) traffic |
| Average Packet Size | Mean packet size across both directions (bytes) |
| Avg Fwd Segment Size | Mean forward TCP segment size |
| Avg Bwd Segment Size | Mean backward TCP segment size |
| Fwd Avg Bytes/Bulk | Average bytes per bulk transfer in the forward direction |
| Fwd Avg Packets/Bulk | Average packets per bulk transfer in the forward direction |
| Fwd Avg Bulk Rate | Average bulk transfer rate in the forward direction (bytes/s) |
| Bwd Avg Bytes/Bulk | Average bytes per bulk transfer in the backward direction |
| Bwd Avg Packets/Bulk | Average packets per bulk transfer in the backward direction |
| Bwd Avg Bulk Rate | Average bulk transfer rate in the backward direction (bytes/s) |
| Subflow Fwd Packets | Mean number of forward packets per subflow |
| Subflow Fwd Bytes | Mean bytes of forward packets per subflow |
| Subflow Bwd Packets | Mean number of backward packets per subflow |
| Subflow Bwd Bytes | Mean bytes of backward packets per subflow |
| Init_Win_bytes_forward | Initial TCP window size in the forward direction (bytes) |
| Init_Win_bytes_backward | Initial TCP window size in the backward direction (bytes) |
| act_data_pkt_fwd | Number of forward packets that carried actual payload |
| min_seg_size_forward | Minimum TCP segment size in the forward direction |
| Active Mean | Mean time the flow was active before going idle (µs) |
| Active Std | Standard deviation of active periods |
| Active Max | Longest active period (µs) |
| Active Min | Shortest active period (µs) |
| Idle Mean | Mean time the flow was idle between active periods (µs) |
| Idle Std | Standard deviation of idle periods |
| Idle Max | Longest idle period (µs) |
| Idle Min | Shortest idle period (µs) |
| Label | Traffic classification: BENIGN or attack type (DDoS, PortScan, Brute Force, etc.) |
## Missing Values
Two features contain missing values (0.1% of rows each):
- **Flow Bytes/s** — total bytes per second in the flow. Missing in 2,867 rows (0.1%). Correlation with label: TBD (pending analysis).
- **Flow Packets/s** — total packets per second in the flow. Missing in 2,867 rows (0.1%). Correlation with label: TBD (pending analysis).
Both features are missing in the same rows (the missing values are co-located). Given the very low missing rate (0.1%), these rows will be dropped during preprocessing. Correlation analysis will determine whether these features carry meaningful signal.
## Data Cleaning
The raw dataset contains 2,830,743 rows. Two cleaning steps were applied before training:
### Step 1 — Duplicate Removal
Exact duplicate rows (all feature values identical) were identified and removed, keeping only the first occurrence of each group.
- Rows removed: **309,956** (10.95% of the dataset)
- Most duplicates came from BENIGN traffic and common attack types (DoS Hulk, PortScan)
- After deduplication: **2,520,787 rows**
### Step 2 — NaN and Infinity Removal
Two features — `Flow Bytes/s` and `Flow Packets/s` — contain infinite values produced by CICFlowMeter when flow duration is zero (division by zero). Any row containing at least one NaN or Inf value was dropped.
- Rows removed: **~2,867** (0.1% of the dataset)
- Only these two features were affected; all other features are complete
### Result
The cleaned dataset contains **2,520,787 rows** and is saved to `data/processed/clean.parquet`.
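The two cleaning steps above can be sketched with pandas roughly as follows. This is a minimal illustration, not the code in `data_cleaning.py`: the function name `clean_flows` and the toy frame are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def clean_flows(df: pd.DataFrame) -> pd.DataFrame:
    """Drop exact duplicate rows, then rows containing NaN or +/-Inf."""
    # Step 1: keep the first occurrence of each group of identical rows.
    deduped = df.drop_duplicates(keep="first")
    # Step 2: treat +/-Inf as missing so a single dropna() covers both cases.
    finite = deduped.replace([np.inf, -np.inf], np.nan)
    return finite.dropna().reset_index(drop=True)


# Toy frame (hypothetical values): row 1 duplicates row 0, row 2 has an
# infinite rate (as happens when Flow Duration is 0), row 3 has a NaN.
toy = pd.DataFrame(
    {
        "Flow Duration": [10, 10, 0, 5, 7],
        "Flow Bytes/s": [1.5, 1.5, np.inf, np.nan, 2.0],
        "Label": ["BENIGN", "BENIGN", "DDoS", "BENIGN", "PortScan"],
    }
)
cleaned = clean_flows(toy)
print(len(toy), "->", len(cleaned))
```

Converting infinities to NaN first keeps the row filter to one `dropna()` call and makes it easy to count the affected rows before dropping them.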
## Baseline Model
A Random Forest classifier (100 trees) was trained on the cleaned dataset as a reference point before applying any class balancing or feature engineering.
### Setup
- **Train / Test split:** 80% / 20%, stratified by class
- **Web Attack merging:** The three Web Attack subclasses (Brute Force, XSS, SQL Injection) were merged into a single `Web Attack` label. Each subclass had too few samples (4–294 in the test set) for reliable individual evaluation, and the model consistently confused them with each other. Merging reduced the number of classes from 15 to 13.
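The Web Attack merging described above amounts to a label remapping. The sketch below is one way to express it; the helper name `merge_web_attacks` and the exact spelling of the raw subclass labels are assumptions, so the match is done on the shared `Web Attack` prefix instead of on full label strings.

```python
def merge_web_attacks(label: str) -> str:
    """Collapse every Web Attack subclass into one 'Web Attack' label."""
    # Assumption: all subclass labels start with the prefix "Web Attack".
    # Matching on the prefix avoids depending on the separator character.
    cleaned = label.strip()
    return "Web Attack" if cleaned.startswith("Web Attack") else cleaned


# Hypothetical raw labels for illustration.
raw_labels = [
    "BENIGN",
    "Web Attack - Brute Force",
    "Web Attack - XSS",
    "Web Attack - Sql Injection",
    "Bot",
]
merged = [merge_web_attacks(label) for label in raw_labels]
print(sorted(set(merged)))
```

Three subclasses collapse into one label, which is where the drop from 15 to 13 classes comes from (15 - 3 + 1 = 13).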
### Results
| Model | F1 Macro | Recall Bot | Recall Web Attack | Recall Infiltration |
|-------|----------|-----------|-------------------|---------------------|
| Baseline (no balancing) | 0.9511 | 0.771 | 0.972 | 0.857 |
| Balanced (`class_weight='balanced'`) | 0.9572 | 0.774 | 0.974 | 1.000 |
### Key Findings
- Both models perform well overall (F1 Macro > 0.95)
- **Bot** is the hardest class to detect: about 23% of bot traffic is missed by both models (Recall 0.771 and 0.774). Bot traffic deliberately mimics normal HTTP communication, making it difficult to distinguish using network flow features alone
- `class_weight='balanced'` provides minimal improvement over the baseline. With very rare classes (Heartbleed: 11 total samples, Infiltration: 36), even heavy weighting cannot compensate for insufficient training data
- **Heartbleed and Infiltration** results are statistically unreliable due to very small test set sizes (2 and 7 samples respectively)
- Results are saved to `output/models/` with a per-model confusion matrix and a central comparison table
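To see why `class_weight='balanced'` weights rare classes so heavily yet cannot add information, it helps to look at the weighting rule. scikit-learn documents the balanced weight of a class as `n_samples / (n_classes * class_count)`. The sketch below reimplements that formula in plain Python on a made-up label distribution (the counts are hypothetical, not the real dataset).

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weights computed as n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Hypothetical toy label distribution, not the real dataset counts.
toy_labels = ["BENIGN"] * 9_000 + ["DDoS"] * 990 + ["Heartbleed"] * 10
weights = balanced_class_weights(toy_labels)
for cls, w in sorted(weights.items()):
    print(f"{cls:<12} weight = {w:.2f}")
```

The rare class receives a weight several hundred times larger than the majority class, but the model still only sees the same handful of distinct examples.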
## Feature Engineering
19 new features were created from the cleaned dataset and saved to `data/processed/engineered.parquet`. The original `clean.parquet` was not modified.
| Group | Features | Rationale |
|-------|----------|-----------|
| **Port categories** (9) | `port_is_well_known`, `port_is_http`, `port_is_https`, `port_is_ssh`, `port_is_dns`, `port_is_ftp`, `port_is_rdp`, `port_is_registered`, `port_is_dynamic` | The raw port number has 53,805 unique values — binary flags are more interpretable |
| **Ratios** (2) | `fwd_bytes_per_packet`, `header_to_payload_ratio` | Ratios carry more signal than raw counts (e.g. bytes per packet vs total bytes) |
| **Packet size variability** (2) | `fwd_packet_size_range`, `bwd_packet_size_range` | Automated attacks send uniform-size packets; legitimate traffic varies |
| **TCP flag ratios** (3) | `rst_ratio`, `fin_ratio`, `psh_ratio` | A high RST ratio is typical of port scans, a high PSH ratio of data bursts, and a high FIN ratio of teardown-based attacks |
| **Duration bins** (3) | `is_very_short_flow`, `is_very_long_flow`, `duration_per_fwd_packet` | Port scans produce very short flows; bots produce very long ones |
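A few of the features in the table could be derived along the following lines. This is a sketch under assumptions: the exact definitions used in the project are not given here, so the formulas (IANA port ranges, PSH count divided by total packets, max minus min packet length) and the use of the raw column names from the Features table are illustrative choices.

```python
import pandas as pd


def add_example_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add a few illustrative engineered columns; returns a new frame."""
    out = df.copy()
    port = out["Destination Port"]
    # IANA ranges: 0-1023 well-known, 1024-49151 registered, 49152+ dynamic.
    out["port_is_well_known"] = (port < 1024).astype(int)
    out["port_is_registered"] = port.between(1024, 49151).astype(int)
    out["port_is_dynamic"] = (port >= 49152).astype(int)
    out["port_is_https"] = (port == 443).astype(int)

    # Mask empty flows so the ratio never divides by zero, then fill with 0.
    total_packets = out["Total Fwd Packets"] + out["Total Backward Packets"]
    ratio = out["PSH Flag Count"] / total_packets.where(total_packets > 0)
    out["psh_ratio"] = ratio.fillna(0.0)

    out["fwd_packet_size_range"] = (
        out["Fwd Packet Length Max"] - out["Fwd Packet Length Min"]
    )
    return out


# Hypothetical flows for illustration.
toy = pd.DataFrame(
    {
        "Destination Port": [443, 8080, 50000],
        "Total Fwd Packets": [3, 1, 0],
        "Total Backward Packets": [1, 0, 0],
        "PSH Flag Count": [1, 0, 0],
        "Fwd Packet Length Max": [500, 40, 0],
        "Fwd Packet Length Min": [20, 40, 0],
    }
)
features = add_example_features(toy)
print(features[["port_is_https", "psh_ratio", "fwd_packet_size_range"]])
```

Working on a copy mirrors the note above that `clean.parquet` is left unmodified.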
## Model Results
All models use an 80/20 stratified train/test split. The three Web Attack subclasses (Brute Force, XSS, SQL Injection) were merged into a single `Web Attack` label — each subclass had too few samples for reliable individual evaluation.
### Comparison Table
| Model | Test F1 Macro | Train F1 | Gap | Recall Bot | Recall Web Attack |
|-------|--------------|----------|-----|-----------|------------------|
| Baseline — clean data | 0.9511 | — | — | 0.771 | 0.972 |
| Balanced — clean data | 0.9572 | — | — | 0.774 | 0.974 |
| Baseline — engineered | 0.9564 | 0.9998 | 0.0433 | 0.763 | 0.977 |
| Balanced — engineered | 0.9571 | 0.9998 | 0.0427 | 0.769 | 0.977 |
| **Tuned (200 trees, depth=30)** | **0.9568** | 0.9997 | **0.0429** | 0.771 | 0.977 |
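F1 Macro is the unweighted mean of the per-class F1 scores, so a class with a handful of test samples moves the headline number as much as BENIGN does. The sketch below shows the computation on hypothetical counts (the `(tp, fp, fn)` values are invented for illustration and are not taken from the project's confusion matrices).

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 for one class from true positive, false positive, false negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(per_class_counts: dict) -> float:
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = [f1_score(*counts) for counts in per_class_counts.values()]
    return sum(scores) / len(scores)


# Hypothetical (tp, fp, fn) counts, chosen only to illustrate the effect.
counts = {
    "BENIGN": (9_990, 10, 10),
    "DDoS": (990, 5, 10),
    "Heartbleed": (1, 0, 1),  # one of two test samples missed
}
print(f"macro F1 = {macro_f1(counts):.4f}")
```

In this toy case a single missed Heartbleed flow pulls the macro average well below the two large classes, which both score above 0.99.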
### Key Findings
- **Feature engineering helped the baseline:** test F1 Macro rose from 0.9511 to 0.9564, while the balanced model was essentially unchanged (0.9572 vs. 0.9571). 5 of the top 10 most important features are engineered, and `psh_ratio` (fraction of packets with the PSH flag) is consistently the most important feature across all models
- **`class_weight='balanced'` provides minimal improvement:** the rare classes are simply too small for weighting alone to compensate
- **Bot remains the hardest class:** Recall ~0.77 across all models. Bot traffic deliberately mimics normal HTTP communication, making it hard to distinguish using network flow features alone
- **Overfitting gap is moderate** (~0.04): acceptable for Random Forest, but it indicates the model memorizes some training patterns. Limiting `max_depth` to 30 left both the gap (0.0429, compared with 0.0427 and 0.0433 for the untuned engineered models) and test performance essentially unchanged
- **Random Forest appears to have reached its ceiling:** differences between all engineered model variants are under 0.001 F1 Macro. Further gains likely require a different algorithm (XGBoost, LightGBM, or deep learning)
### Benchmark Context
Compared to published results on CIC-IDS2017:
- Random Forest in the literature typically achieves F1 Macro ~0.93
- Our best result (0.957) **exceeds the typical Random Forest benchmark**
- State-of-the-art models (Stacking, deep learning) reach F1 Macro ~0.98
## Bot Detection — Focused Binary Classifier
Bot was the hardest class in the multi-class model (Recall ~0.77). A dedicated binary classifier was trained to answer a single question: **"Is this flow a Bot?"**
### Setup
- **Model:** Random Forest (200 trees, max_depth=30, class_weight='balanced')
- **Labels:** Bot=1, everything else=0
- **Dataset:** engineered.parquet (2,520,787 rows, 61 features)
- **Bot samples:** 1,948 (0.077% of dataset)
### Threshold Analysis
| Threshold | Recall | Precision | F1 |
|-----------|--------|-----------|-----|
| 0.1 | 0.990 | 0.574 | 0.727 |
| 0.2 | 0.982 | 0.680 | 0.804 |
| **0.3** | **0.967** | **0.702** | **0.813** |
| 0.4 | 0.941 | 0.708 | 0.808 |
| 0.5 | 0.913 | 0.719 | 0.805 |
**Chosen threshold: 0.3** — best F1 (0.813) with Recall=0.967. In security, missing a bot is worse than a false alarm, so high Recall is preferred.
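A threshold table like the one above can be produced by sweeping cut-offs over the predicted Bot probabilities. The sketch below does this in plain Python; the function name `sweep_thresholds` and the toy scores are assumptions, and with scikit-learn the probabilities would typically come from the positive-class column of `predict_proba`.

```python
def sweep_thresholds(y_true, y_prob, thresholds):
    """Return (threshold, recall, precision, f1) for each cut-off."""
    rows = []
    for t in thresholds:
        pred = [p >= t for p in y_prob]
        tp = sum(1 for y, p in zip(y_true, pred) if y == 1 and p)
        fp = sum(1 for y, p in zip(y_true, pred) if y == 0 and p)
        fn = sum(1 for y, p in zip(y_true, pred) if y == 1 and not p)
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        rows.append((t, recall, precision, f1))
    return rows


# Hypothetical scores: 1 = Bot, 0 = everything else.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.45, 0.25, 0.35, 0.15, 0.05, 0.02, 0.6]
rows = sweep_thresholds(y_true, y_prob, [0.1, 0.3, 0.5])
best = max(rows, key=lambda r: r[3])  # pick the threshold with the best F1
for t, recall, precision, f1 in rows:
    print(f"t={t:.1f} recall={recall:.3f} precision={precision:.3f} f1={f1:.3f}")
```

Lowering the threshold can only keep or raise Recall, since every flow flagged at a higher cut-off is still flagged at a lower one; Precision is what pays for it.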
### Comparison to Multi-Class Model
| Model | Bot Recall |
|-------|-----------|
| Multi-class Random Forest | 0.774 |
| **Bot vs. All (threshold=0.3)** | **0.967** |
The focused binary model improved Bot Recall by **+19 percentage points**.
### Error Analysis (threshold=0.3)
- **Total errors:** 173
- **False Negatives** (Bots missed): **13** out of 390 test Bots
- **False Positives** (predicted Bot, actually other class): **160**
- 100% of false positives are **BENIGN** — no attack class is confused with Bot
- False positive rate on BENIGN: 0.038% (160 out of 419,010)
This clean error profile makes Bot vs. All ideal for a cascaded classifier: all misclassified samples are benign traffic, not attacks.
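The error counts above tie back to the threshold table. As a quick arithmetic check, the snippet below recomputes Recall, Precision, F1, and the BENIGN false-positive rate from the reported counts (13 false negatives, 160 false positives, 390 test Bots, 419,010 BENIGN test flows); the variable names are only for illustration.

```python
test_bots = 390
false_negatives = 13
false_positives = 160
benign_test_flows = 419_010

true_positives = test_bots - false_negatives  # 377 Bots correctly flagged
recall = true_positives / test_bots
precision = true_positives / (true_positives + false_positives)
f1 = 2 * precision * recall / (precision + recall)
benign_fp_rate = false_positives / benign_test_flows

print(f"recall={recall:.3f} precision={precision:.3f} f1={f1:.3f}")
print(f"BENIGN false-positive rate={benign_fp_rate:.3%}")
```

The results round to Recall 0.967, Precision 0.702, and F1 0.813, matching the threshold=0.3 row, and to a false-positive rate of 0.038%.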
## Cascade Classifier
A two-stage pipeline that handles Bot separately from the rest of the classes.
### Architecture
    Input flow
        ↓
    [Stage 1: Bot Detector]  threshold=0.3
        ↓              ↓
      BOT ✓         Not Bot
                       ↓
            [Stage 2: Multi-class RF]
                       ↓
            BENIGN / DDoS / PortScan / ...
- **Stage 1:** Binary RF (200 trees, max_depth=30, class_weight='balanced') trained on all traffic
- **Stage 2:** Multi-class RF (200 trees, max_depth=30) trained **only on non-Bot traffic** (12 classes)
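The routing logic of the two stages can be sketched independently of the underlying models. In the example below the two models are passed in as plain callables; `cascade_predict` and the toy stand-in functions are hypothetical names, and in practice Stage 1 would return the Bot probability of the trained binary classifier.

```python
from typing import Callable, Sequence


def cascade_predict(
    flow: Sequence[float],
    bot_probability: Callable[[Sequence[float]], float],
    multiclass_predict: Callable[[Sequence[float]], str],
    threshold: float = 0.3,
) -> str:
    """Stage 1 decides Bot vs. not Bot; Stage 2 labels everything else."""
    if bot_probability(flow) >= threshold:
        return "Bot"
    return multiclass_predict(flow)


# Hypothetical stand-ins for the two trained models.
def toy_bot_probability(flow):
    return 0.8 if flow[0] > 0.5 else 0.1


def toy_multiclass(flow):
    return "DDoS" if flow[1] > 100 else "BENIGN"


for flow in ([0.9, 10], [0.1, 500], [0.1, 1]):
    print(flow, "->", cascade_predict(flow, toy_bot_probability, toy_multiclass))
```

Because Stage 2 never sees a flow that Stage 1 labels as Bot, it can be trained without the Bot class at all, as described above.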
### Results
| Metric | Tuned RF (single model) | Cascade |
|--------|------------------------|---------|
| F1 Macro | 0.9568 | 0.9558 |
| **Bot Recall** | 0.771 | **0.961** |
| Web Attack Recall | 0.977 | 0.977 |
| BENIGN Recall | 0.999 | 0.999 |
### Key Findings
- **Bot Recall improved by +19 percentage points** (0.771 → 0.961) with no meaningful cost to overall F1 Macro (-0.001)
- All other classes are unaffected — Stage 2 performs identically to the standalone multi-class model
- **Bot-specific IAT features** (`iat_cv`, `fwd_iat_cv`, `small_packet_ratio`) were tested but removed — they added only +0.003 to Bot Recall while reducing F1 Macro by -0.006 and hurting Infiltration recall
### Why the Cascade Works
In the multi-class model, Bot is confused with BENIGN because both use HTTP. The binary Stage 1 focuses entirely on learning this subtle boundary, while Stage 2 is freed from modeling Bot behavior entirely.
## EDA Plots
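#### Label Distribution
[Figure: label distribution plot]
#### Correlation Heatmap
[Figure: feature correlation heatmap]
#### Top 6 Discriminative Features
[Figure: top 6 discriminative features]
#### Dropped Features Heatmap
[Figure: heatmap of dropped features]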
## Anomaly Detection
Two unsupervised models were trained on BENIGN traffic only and evaluated on all classes.
### Isolation Forest vs. Autoencoder
| Class (% of flows flagged as anomalous) | Isolation Forest | Autoencoder |
|-------|-----------------|-------------|
| BENIGN (FP rate) | **1.02%** | 5.03% |
| Heartbleed | 50% | **100%** |
| Infiltration | 29% | **100%** |
| DoS Hulk | 12% | **94%** |
| DoS Slowhttptest | 8% | **96%** |
| DDoS | 0.1% | **72%** |
| Bot | 1.8% | 5.1% |
| Web Attack | 0% | 4.2% |
| **Average Precision** | 0.50 | **0.83** |
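**Autoencoder** outperforms Isolation Forest on most attack classes (Average Precision 0.83 vs. 0.50), at the cost of a higher false-positive rate on BENIGN traffic (5.03% vs. 1.02%). It captures the non-linear feature structure of BENIGN traffic, making it sensitive to attacks that deviate from normal patterns.
**Bot and Web Attack are effectively not detected** by either model: the Autoencoder flags 5.1% of Bot and 4.2% of Web Attack flows, roughly the same as its 5.03% false-positive rate on BENIGN. These classes closely resemble normal HTTP traffic and are handled by the supervised cascade instead.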
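Both models produce an anomaly score per flow, and a cut-off on that score turns it into a decision. How the cut-off was chosen is not stated here, so the sketch below assumes one common approach: pick the threshold so that a target share of BENIGN flows is flagged. All names and scores are hypothetical.

```python
def percentile_threshold(benign_scores, target_fp_rate):
    """Score at or above which roughly `target_fp_rate` of benign flows fall."""
    ordered = sorted(benign_scores)
    # Number of benign flows we accept flagging (at least one).
    n_flag = max(1, round(len(ordered) * target_fp_rate))
    return ordered[-n_flag]


def detection_rate(scores, threshold):
    """Fraction of flows whose anomaly score reaches the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


# Hypothetical anomaly scores (e.g. reconstruction errors); higher = stranger.
benign = [i / 100 for i in range(100)]   # 0.00 .. 0.99
loud_attack = [1.5, 2.0, 2.5, 3.0]       # far from benign behaviour
mimic_attack = [0.10, 0.40, 0.70, 0.90]  # looks like benign traffic

thr = percentile_threshold(benign, target_fp_rate=0.05)
print("threshold:", thr)
print("benign flagged:", detection_rate(benign, thr))
print("loud attack flagged:", detection_rate(loud_attack, thr))
print("mimic attack flagged:", detection_rate(mimic_attack, thr))
```

An attack whose scores sit inside the benign range is flagged no more often than benign traffic itself, which is the pattern the table shows for Bot and Web Attack.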
## Full Model Comparison
All models use the merged Web Attack label (13 classes). Baseline RF and Balanced RF were trained on the cleaned dataset; all other models use the engineered dataset.
| Model | F1 Macro | Bot Recall | Web Attack Recall | Heartbleed Recall |
|-------|----------|-----------|------------------|------------------|
| Baseline RF | 0.9511 | 0.771 | 0.972 | 0.500 |
| Balanced RF (`class_weight='balanced'`) | 0.9572 | 0.774 | 0.974 | 0.500 |
| Baseline engineered | 0.9564 | 0.763 | 0.977 | 0.500 |
| Balanced engineered | 0.9571 | 0.769 | 0.977 | 0.500 |
| Tuned RF (200 trees, depth=30) | 0.9568 | 0.771 | 0.977 | 0.500 |
| Cascade RF (Bot + Multi-class RF) | 0.9558 | 0.961 | 0.977 | 0.500 |
| **Cascade LightGBM (Bot + Multi-class LGBM)** | **0.9827** | **0.964** | **0.995** | **1.000** |
## Final Model — Cascade LightGBM
### Architecture
    Stage 1: Bot Detector (Binary RF, threshold=0.3)
        Bot  → "Bot"
        else → Stage 2
    Stage 2: Multi-class LightGBM (500 trees, 63 leaves, lr=0.05)
        Classifies: BENIGN / DoS / DDoS / PortScan / Heartbleed /
                    Infiltration / Web Attack / FTP-Patator / SSH-Patator
### Why Cascade?
Single multi-class models confuse Bot with BENIGN because both use HTTP. The binary Stage 1 focuses exclusively on the subtle patterns that distinguish Bot traffic (PSH ratio, packet size uniformity, timing regularity). Stage 2 benefits from not having to model Bot at all.
### Why LightGBM over Random Forest?
LightGBM uses gradient boosting — each tree corrects the errors of the previous one. This is especially effective on rare classes (Heartbleed, Web Attack) where RF struggles. On 2.5M rows, LightGBM also trains in ~10 minutes vs ~20 minutes for RF.
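For reference, the Stage 2 settings quoted in the architecture (500 trees, 63 leaves, lr=0.05) map onto LightGBM's scikit-learn style parameter names as sketched below. Only these three values come from this README; any other setting the project uses is not shown, and the variable name `stage2_params` is an assumption.

```python
# Keyword arguments in the naming used by LightGBM's scikit-learn wrapper.
stage2_params = {
    "n_estimators": 500,    # "500 trees"
    "num_leaves": 63,       # "63 leaves"
    "learning_rate": 0.05,  # "lr=0.05"
}

# With LightGBM installed, these would typically be passed as:
#   from lightgbm import LGBMClassifier
#   stage2 = LGBMClassifier(**stage2_params)
print(stage2_params)
```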
### Final Results
| Metric (per-class rows are recall) | Baseline RF | Balanced RF | Cascade RF | **Cascade LightGBM** |
|-------|------------|------------|-----------|---------------------|
| F1 Macro | 0.9511 | 0.9572 | 0.9558 | **0.9827** |
| **Bot** | 0.771 | 0.774 | 0.961 | **0.964** |
| **Web Attack** | 0.972 | 0.974 | 0.977 | **0.995** |
| **Heartbleed** | 0.500 | 0.500 | 0.500 | **1.000** |
| Infiltration | 0.857 | 1.000 | 1.000 | **1.000** |
| BENIGN | 0.999 | 0.999 | 0.999 | 0.999 |
| DDoS | 1.000 | 1.000 | 1.000 | 1.000 |
| PortScan | 0.988 | 0.988 | 0.988 | **0.999** |
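Per-class recall on the rarest classes rests on very few test flows (the earlier Key Findings mention 2 Heartbleed and 7 Infiltration test samples). One way to quantify that uncertainty is a Wilson score interval for a proportion. The sketch below is an illustration added here, not part of the project's evaluation code.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Recall of 0.500 on 2 samples means 1 of 2 detected; 1.000 means 2 of 2.
for detected, total in ((1, 2), (2, 2), (7, 7)):
    low, high = wilson_interval(detected, total)
    print(f"{detected}/{total}: recall interval = [{low:.2f}, {high:.2f}]")
```

With only 2 samples, a recall of 0.500 is compatible with a true recall anywhere from roughly 0.09 to 0.91, so the jump from 0.500 to 1.000 on Heartbleed corresponds to a single additional detected flow.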
### Benchmark Context
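Random Forest models in the literature typically achieve F1 Macro ~0.93 on CIC-IDS2017, and state-of-the-art models (Stacking, deep learning) reach ~0.98. Our Cascade LightGBM reaches **0.9827**, well above the typical Random Forest benchmark and on par with the state-of-the-art figure.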
### Confusion Matrix
[Figure: confusion matrix for the final model]
### Known Limitations
- **SQL Injection** — Very few samples (~21 total), making subclass classification unreliable.
- **Web Attack subclasses** — When splitting Web Attack into Brute Force / XSS / SQL Injection: Brute Force reaches 0.83 Recall, XSS 0.31, SQL Injection 0.00. Sample scarcity is the bottleneck.