shawmoonazad/RL-driven-Hybrid-PQC-TLS-Negotiation
# Offline RL for Hybrid PQC-TLS Protocol Selection
Codebase for a research project applying **offline reinforcement learning** to adaptive cryptographic policy selection in hybrid post-quantum TLS handshakes. Five offline RL algorithms are trained on real handshake measurement data and evaluated against a rule-based baseline.
## What This Does
At TLS handshake time, an agent selects one of **12 cryptographic configurations** (four policy modes × three NIST security levels) based on observed network and cryptographic timing features. The goal is to minimize latency while respecting security constraints: classical-only operation is heavily penalized, and any configuration below the minimum acceptable NIST level (3 by default) incurs a violation penalty.
**Policy modes:** `REQUIRE_HYBRID` · `PQC_ONLY` · `ALLOW_FALLBACK` · `CLASSICAL_ONLY`
**NIST security levels:** 1 · 3 · 5
**State space:** 15 features (RTT, key generation/encapsulation/verification timings, wire overhead)
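The 12-action space can be enumerated as the product of the modes and levels listed above. This is a minimal sketch: the mode and level names come from this README, but the index ordering and the helper names `decode_action` and `encode_action` are assumptions, and the actual layout in `rl_config.py` may differ.

```python
from itertools import product

# Names taken from this README; the index ordering is an assumption.
POLICY_MODES = ["REQUIRE_HYBRID", "PQC_ONLY", "ALLOW_FALLBACK", "CLASSICAL_ONLY"]
NIST_LEVELS = [1, 3, 5]

# Each action is a (mode, level) pair; its list position is the discrete index.
ACTIONS = list(product(POLICY_MODES, NIST_LEVELS))


def decode_action(index: int) -> tuple:
    """Map a discrete action index back to (policy mode, NIST level)."""
    return ACTIONS[index]


def encode_action(mode: str, level: int) -> int:
    """Map a (policy mode, NIST level) pair to its discrete action index."""
    return ACTIONS.index((mode, level))


if __name__ == "__main__":
    for i, (mode, level) in enumerate(ACTIONS):
        print(i, mode, level)
```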
## Algorithms
| Algorithm | Description |
|-----------|-------------|
| **BC** | Behavioral Cloning — supervised imitation baseline |
| **CQL** | Conservative Q-Learning (Kumar et al., 2020) |
| **IQL** | Implicit Q-Learning (Kostrikov et al., 2021) |
| **BCQ** | Batch-Constrained Q-Learning (Fujimoto et al., 2019) |
| **AWAC** | Advantage Weighted Actor-Critic (Nair et al., 2020) |
All models use a shared MLP backbone (2 × 256 hidden units, ReLU) and operate over the discrete 12-action space.
## Project Structure
```
hybrid_pqc_tls/                    # Main package
├── rl_config.py                   # Action space, state space, reward config, hyperparameters
├── rl_models.py                   # PyTorch implementations of all 5 algorithms
├── rl_train.py                    # Training pipeline
├── rl_offline_dataset.py          # Base dataset builder
├── rl_dataset_improved.py         # Improved dataset (epsilon=0.3)
├── rl_dataset_improved_v2.py      # Diverse dataset (epsilon=0.6, min 2% per action)
├── rl_evaluate.py                 # Evaluation v1
├── rl_evaluate_v2.py              # Evaluation v2: RL vs Rule-Based comparison
├── rl_inference.py                # Single-model inference
├── rl_inference_multi.py          # Multi-model inference (deployment)
├── rl_env.py                      # Gymnasium-compatible environment wrapper
├── generate_paper_figures.py      # Publication figure generation
│
├── config.py                      # TLS/crypto configuration
├── policy.py                      # Cryptographic policy logic
├── primitives.py                  # Crypto primitives
├── protocol.py                    # TLS protocol simulation
└── session.py                     # Session management

run_rl_pipeline.py                 # Main pipeline entry point (place in project root)
run_pipeline_v2.py                 # Pipeline v2 (diverse dataset variant)
run_action_masking_eval.py         # Inference-time action masking experiment
```
## Setup
```bash
pip install torch numpy pandas matplotlib scikit-learn gymnasium cryptography stable-baselines3
```
Python 3.9+ recommended.
## Usage
### Run the full pipeline
```bash
# Default (100 epochs)
python run_rl_pipeline.py

# Custom options
python run_rl_pipeline.py --data-path path/to/handshake_raw.csv --epochs 200

# Skip steps if already done
python run_rl_pipeline.py --skip-dataset    # reuse existing dataset
python run_rl_pipeline.py --skip-training   # reuse existing models
```
### Run the diverse-dataset variant
```bash
python run_pipeline_v2.py
```
### Run action masking evaluation (no retraining)
```bash
python run_action_masking_eval.py
```
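Inference-time masking needs no retraining because it only filters a trained model's action scores before the argmax. The sketch below illustrates the idea under stated assumptions: the masking rule (exclude `CLASSICAL_ONLY` and any level below 3), the function name `masked_argmax`, and the action ordering are all hypothetical and may not match what `run_action_masking_eval.py` does.

```python
import math
from itertools import product

# Assumed action layout: 4 modes x 3 levels, modes as the outer loop.
POLICY_MODES = ["REQUIRE_HYBRID", "PQC_ONLY", "ALLOW_FALLBACK", "CLASSICAL_ONLY"]
NIST_LEVELS = [1, 3, 5]
ACTIONS = list(product(POLICY_MODES, NIST_LEVELS))

MIN_LEVEL = 3  # "Min acceptable security level" from the hyperparameter table


def masked_argmax(q_values, min_level=MIN_LEVEL, forbidden_modes=("CLASSICAL_ONLY",)):
    """Return the index of the best-scoring action that the mask allows."""
    best_index, best_q = None, -math.inf
    for index, (q, (mode, level)) in enumerate(zip(q_values, ACTIONS)):
        # Skip actions that violate the (assumed) security constraints.
        if level < min_level or mode in forbidden_modes:
            continue
        if q > best_q:
            best_index, best_q = index, q
    if best_index is None:
        raise ValueError("mask excludes every action")
    return best_index


if __name__ == "__main__":
    scores = [0.0] * len(ACTIONS)
    scores[9] = 10.0  # CLASSICAL_ONLY, level 1: highest raw score but masked
    scores[4] = 5.0   # PQC_ONLY, level 3: best allowed action
    print(ACTIONS[masked_argmax(scores)])
```

Masking at inference leaves the learned scores untouched, so masked and unmasked behavior can be compared on the same checkpoint.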
## Data
The pipeline expects raw TLS handshake measurements at `results/eval_grid/handshake_raw.csv`.
The dataset builder samples ~10,000 transitions with configurable exploration epsilon and minimum action coverage, then saves them to `results/rl/offline_rl_dataset_v2.npz`.
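One way to picture the exploration epsilon and the coverage requirement is an epsilon-greedy behavior policy followed by a per-action frequency check. The sketch below is an assumption about the general approach, not the code in `rl_dataset_improved_v2.py`: the function names are hypothetical, and `rule_action` is a stand-in callable for the rule-based choice (which in practice would depend on the observed state).

```python
import random
from collections import Counter

NUM_ACTIONS = 12


def sample_behavior_actions(rule_action, n=10_000, epsilon=0.6, seed=0):
    """With probability epsilon pick a uniform random action, else follow the rule."""
    rng = random.Random(seed)
    actions = []
    for _ in range(n):
        if rng.random() < epsilon:
            actions.append(rng.randrange(NUM_ACTIONS))
        else:
            actions.append(rule_action())
    return actions


def coverage(actions):
    """Fraction of samples assigned to each action index."""
    counts = Counter(actions)
    return {a: counts.get(a, 0) / len(actions) for a in range(NUM_ACTIONS)}


def meets_min_coverage(actions, min_fraction=0.02):
    """True if every action appears in at least min_fraction of the samples."""
    return all(frac >= min_fraction for frac in coverage(actions).values())


if __name__ == "__main__":
    sampled = sample_behavior_actions(lambda: 2)  # rule always picks action 2 here
    print(meets_min_coverage(sampled))
```

A higher epsilon spreads samples across more actions, which matters for offline RL because the learner cannot evaluate actions it has never seen in the data.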
## Outputs
After running, results are written to:
```
results/rl/
├── models/               # Trained model checkpoints (.pt)
├── evaluation/           # CSV comparison tables
│   ├── action_masking/   # Masked vs unmasked results
│   └── latex/            # LaTeX-ready tables
└── figures/              # PNG plots
```
## Key Hyperparameters
| Parameter | Default |
|-----------|---------|
| Hidden dims | 256 × 256 |
| Batch size | 256 |
| Learning rate | 3e-4 |
| Discount γ | 0.99 |
| CQL α | 1.0 |
| IQL τ (expectile) | 0.7 |
| BCQ threshold | 0.3 |
| AWAC λ | 1.0 |
| Min acceptable security level | 3 |
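To make the IQL τ entry concrete: in the IQL paper (Kostrikov et al., 2021), the expectile enters through an asymmetric squared loss on the difference between a Q-value and the value estimate. The sketch below shows that standard form for a single scalar; the function name is hypothetical, and whether `rl_models.py` implements the loss exactly this way is not confirmed here.

```python
def expectile_loss(diff: float, tau: float = 0.7) -> float:
    """Asymmetric squared loss |tau - 1(diff < 0)| * diff**2.

    diff is Q(s, a) - V(s). With tau > 0.5, positive differences are
    weighted more heavily than negative ones, pushing V toward an upper
    expectile of the Q-values seen in the dataset.
    """
    weight = tau if diff >= 0 else 1.0 - tau
    return weight * diff * diff


if __name__ == "__main__":
    print(expectile_loss(2.0), expectile_loss(-2.0))
```

With τ = 0.5 the loss reduces to a symmetric squared error scaled by one half.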
## Reward Function
The reward is RTT-dependent and encodes the security priority hierarchy:
```
R = base_reward
    − α(RTT) × latency           # RTT-scaled latency penalty
    − β × wire_overhead_KB       # Wire cost penalty
    + γ(RTT) × security_level    # RTT-scaled security bonus
    + mode_bonus                 # REQUIRE_HYBRID: +3.0, PQC_ONLY: +1.5,
                                 # ALLOW_FALLBACK: −1.0, CLASSICAL_ONLY: −15.0
    − violation_penalty          # 5.0 if level < 3, otherwise 0
```

Here α(RTT) and γ(RTT) are RTT-dependent reward weights; they are not the CQL α or the discount γ from the hyperparameter table.
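The formula above translates into a small function. This is a hedged sketch rather than the project's implementation: the mode bonuses, the level threshold, and the 5.0 penalty come from this README, while the function signature is hypothetical. Because the README does not give `base_reward`, β, or the functional form of α(RTT) and γ(RTT), the caller passes in the already-evaluated weights for the current RTT.

```python
MODE_BONUS = {
    "REQUIRE_HYBRID": 3.0,
    "PQC_ONLY": 1.5,
    "ALLOW_FALLBACK": -1.0,
    "CLASSICAL_ONLY": -15.0,
}
MIN_LEVEL = 3            # minimum acceptable NIST security level
VIOLATION_PENALTY = 5.0  # subtracted when the chosen level is below MIN_LEVEL


def reward(latency, wire_overhead_kb, security_level, mode,
           *, alpha, beta, gamma, base_reward=0.0):
    """Evaluate the reward for one handshake.

    alpha and gamma are the values of alpha(RTT) and gamma(RTT) for the
    current RTT; their schedules are not specified in the README, so they
    are supplied by the caller (an assumption of this sketch).
    """
    violation = VIOLATION_PENALTY if security_level < MIN_LEVEL else 0.0
    return (
        base_reward
        - alpha * latency
        - beta * wire_overhead_kb
        + gamma * security_level
        + MODE_BONUS[mode]
        - violation
    )


if __name__ == "__main__":
    # Illustrative weights only; not values from the project.
    print(reward(10.0, 2.0, 3, "REQUIRE_HYBRID", alpha=0.1, beta=0.5, gamma=1.0))
    print(reward(10.0, 2.0, 1, "CLASSICAL_ONLY", alpha=0.1, beta=0.5, gamma=1.0))
```

With identical latency and wire overhead, the large gap between the two mode bonuses plus the violation penalty dominates the result, which is how the reward encodes the security priority hierarchy.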