GitHub: White-Hat-007/DARKLAYR---Tor-Traffic-Forensic-Correlation-Engine
# 🕵️ DARKLAYR: Tor-Traffic Forensic Correlation Engine






# 🔍 Overview
The system does **not attempt deanonymization** and does not claim deterministic attribution. Instead, it reconstructs investigative context through:
- PCAP evidence acquisition
- Zeek protocol telemetry
- Tor relay intelligence
- ASN infrastructure attribution
- Behavioral and anomaly correlation
- Probabilistic confidence scoring
The engine is designed around a core investigative principle:
# 🧠 Investigative Philosophy
Traditional monitoring systems lose visibility once traffic enters encrypted onion-routed infrastructure.
DARKLAYR addresses this by focusing on:
- Observable behavior
- Infrastructure context
- Timing relationships
- Protocol transitions
- Statistical consistency
- Anomaly interpretation
The objective is not to identify users directly, but to produce:
- High-confidence infrastructure candidates
- Network attribution context
- Reproducible forensic evidence
- Explainable investigative leads
# ⚙️ Layered Architecture
The engine operates across five controlled investigative layers:
| Layer | Function |
|---|---|
| **Traffic Generation & Capture** | Controlled Tor traffic simulation and packet acquisition |
| **Network Evidence Extraction** | Zeek-based protocol analysis and structured telemetry |
| **Tor Intelligence & Topology** | Relay metadata collection and infrastructure mapping |
| **Forensic Correlation & Attribution** | Multi-factor probabilistic correlation |
| **Confidence Scoring & Reporting** | Ranked investigative output with explainable scoring |
# 📌 Table of Contents
- [Core Features](#-core-features)
- [Pipeline Architecture](#-pipeline-architecture)
- [Workflow](#-workflow)
- [Components](#-components)
  - [Traffic Generation](#1-traffic-generation)
  - [PCAP Capture](#2-pcap-capture)
  - [Zeek Analysis](#3-zeek-analysis)
  - [ASN Enrichment](#4-asn-enrichment)
  - [Tor Relay Intelligence](#5-tor-relay-fetcher)
  - [Correlation Engine](#6-correlation-engine)
- [Confidence Scoring Model](#-confidence-scoring-model)
- [Output](#-output)
- [Project Structure](#-project-structure)
- [Installation](#-installation)
- [Tech Stack](#-tech-stack)
- [Ethical Notice](#-ethical-notice)
- [Conclusion](#-conclusion)
# 🚀 Core Features
| Feature | Description |
|---|---|
| 🌐 **Controlled Tor Traffic Generation** | Simulates Tor-based traffic via SOCKS in isolated environments |
| 📡 **PCAP Evidence Acquisition** | Raw packet capture using tcpdump |
| 🔍 **Zeek Telemetry Extraction** | Multi-layer protocol analysis and forensic logging |
| 🌍 **ASN Intelligence Enrichment** | IP → ASN → Organization mapping using Team Cymru |
| 🧠 **Probabilistic Correlation Engine** | Multi-signal confidence scoring |
| 🕸️ **Tor Relay Intelligence Fetcher** | Live Tor relay metadata acquisition via Onionoo |
| 📊 **Forensic-Ready Reporting** | CSV outputs and investigative summaries |
# 🔄 Pipeline Architecture
```mermaid
graph TD
    A[Tor Traffic Generation<br/>torsocks + curl] --> B[PCAP Capture<br/>tcpdump / Raw Evidence]
    B --> C[Zeek Analysis<br/>Protocol Telemetry logs]
    C --> D[CSV Normalization<br/>zeek-cut tool]
    D --> E[ASN Enrichment<br/>Team Cymru WHOIS]
    D --> F[Tor Relay Intelligence<br/>Onionoo API details]
    E --> G[Probabilistic Correlation Engine<br/>Multi-Signal Attribution]
    F --> G
    G --> H[Confidence-Scored Output<br/>Investigative Leads]
    style A fill:#2e7d32,stroke:#1b5e20,stroke-width:2px,color:#fff
    style B fill:#1565c0,stroke:#0d47a1,stroke-width:2px,color:#fff
    style C fill:#00838f,stroke:#006064,stroke-width:2px,color:#fff
    style G fill:#e65100,stroke:#bf360c,stroke-width:2px,color:#fff
    style H fill:#c62828,stroke:#b71c1c,stroke-width:2px,color:#fff
```
Each stage is modular, reproducible, and independently auditable.
# 🔧 Workflow
Tor Traffic → PCAP → Zeek Logs → CSV → ASN Enrichment → Correlation → Confidence Output
# 📦 Components
# 1. Traffic Generation
Controlled Tor traffic is generated through SOCKS-based routing using `torsocks`. This produces realistic encrypted browsing behavior without interacting with live users.
```bash
#!/bin/bash
# Start a local Tor client in the background and give it time to bootstrap.
echo "[*] Starting Tor..."
tor &
sleep 10

# Send 200 HTTPS requests through Tor's SOCKS proxy via torsocks.
echo "[*] Generating traffic..."
for i in {1..200}; do
    torsocks curl -s https://duckduckgo.com > /dev/null
done
echo "[✓] Traffic generation complete"
```
# 2. PCAP Capture
Traffic is captured using tcpdump to preserve raw packet-level evidence for offline forensic analysis.
```bash
#!/bin/bash
# Capture on enp0s3 for 600 seconds; change the interface name to match your host.
# tcpdump normally needs root privileges to open the interface.
mkdir -p data/raw
echo "[*] Capturing traffic for 10 minutes..."
timeout 600 tcpdump -i enp0s3 -nn -w data/raw/tor_traffic.pcap
echo "[✓] Capture complete"
```
# 3. Zeek Analysis
Captured PCAPs are processed using Zeek to generate structured forensic telemetry.
Generated logs include:
* `conn.log`
* `ssl.log`
* `socks.log`
* `tunnel.log`
* `weird.log`
* `packet_filter.log`
## Run Zeek
```bash
mkdir -p data/zeek_logs
cd data/zeek_logs
# -r reads packets from the capture file; -C ignores invalid checksums.
zeek -C -r ../raw/tor_traffic.pcap
```
## Convert Logs to CSV
```bash
# Run from data/zeek_logs.
mkdir -p ../csv
for f in *.log; do
    # -m emits a minimal header row of field names, which the Python scripts
    # need in order to select columns such as id.resp_h by name.
    # Note: tr is a naive conversion and breaks on fields that contain commas.
    zeek-cut -m < "$f" | tr '\t' ',' > "../csv/${f%.log}.csv"
done
```
# 4. ASN Enrichment
Observed IPs are enriched through Team Cymru's WHOIS intelligence service with:
* Autonomous System Number (ASN)
* Organization ownership
* Infrastructure attribution
```python
import ipaddress

import pandas as pd
from cymruwhois import Client

conn = pd.read_csv("data/csv/conn.csv")
cymru = Client()


def lookup(ip):
    """Return (ASN, owner) for an IP, or placeholders if the lookup fails."""
    try:
        r = cymru.lookup(ip)
        return f"AS{r.asn}", r.owner
    except Exception:
        return "Unknown", "Unknown"


results = []
# Enrich responder (destination) IPs, the same column the correlation engine keys on.
for ip in conn["id.resp_h"].dropna().unique():
    # Skip loopback, private, and other non-routable addresses.
    if not ipaddress.ip_address(ip).is_global:
        continue
    asn, org = lookup(ip)
    results.append([ip, asn, org])

df = pd.DataFrame(results, columns=["ip", "asn", "organization"])
df.to_csv("data/outputs/asn_results.csv", index=False)
```
# 5. Tor Relay Fetcher
Fetches live relay metadata from the Onionoo API to build infrastructure context around observed traffic.
```python
import requests

url = "https://onionoo.torproject.org/details?running=true"
resp = requests.get(url, timeout=60)
resp.raise_for_status()
data = resp.json()

relays = []
for r in data["relays"]:
    # or_addresses entries look like "ip:port"; IPv6 entries are "[addr]:port".
    first_addr = (r.get("or_addresses") or [""])[0]
    relays.append({
        "ip": first_addr.rsplit(":", 1)[0].strip("[]"),
        "nickname": r.get("nickname"),
        "country": r.get("country"),
    })

print(relays[:5])
```
# 6. Correlation Engine
The forensic core of DARKLAYR.
The engine correlates:
* Flow volume
* Protocol diversity
* Temporal consistency
* ASN reputation
* Infrastructure ownership
* Behavioral anomalies

to produce confidence-ranked investigative candidates.
```python
import pandas as pd

conn = pd.read_csv("data/csv/conn.csv")
asn = pd.read_csv("data/outputs/asn_results.csv")

# Number of flows observed per responder (destination) IP.
flow_counts = conn["id.resp_h"].value_counts().to_dict()

results = []
for ip, count in flow_counts.items():
    score = 0.0

    # Signal 1: flow volume, capped at 0.3 (reached at 30 or more flows)
    score += min(count / 100, 0.3)

    # Signal 2: ASN reputation
    match = asn[asn["ip"] == ip]
    if not match.empty:
        # Compare case-insensitively: WHOIS owner strings are often upper-case,
        # and str() guards against missing (NaN) values.
        org = str(match["organization"].values[0])
        score += 0.3 if "hetzner" in org.lower() else 0.2

    # Signal 3: protocol diversity, capped at 0.1
    proto_count = conn[conn["id.resp_h"] == ip]["proto"].nunique()
    score += min(proto_count * 0.05, 0.1)

    # Penalty: loopback artifact
    if ip == "127.0.0.1":
        score -= 0.2

    results.append([ip, score])

df = pd.DataFrame(results, columns=["IP", "Confidence"])
df = df.sort_values("Confidence", ascending=False)
df.to_csv("data/outputs/correlation_results.csv", index=False)
print(df)
```
# 📊 Confidence Scoring Model
| Signal | Max Weight | Purpose |
|---|---|---|
| Flow Volume | +0.30 | Higher activity increases relevance |
| ASN Reputation | +0.30 | Tor-friendly hosting providers weighted higher |
| Protocol Diversity | +0.10 | Multi-protocol behavior correlation |
| Loopback Penalty | -0.20 | Reduces localhost false positives |

Higher scores indicate stronger investigative priority, not certainty.
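
To make the arithmetic behind the table concrete, the weights can be restated as a small pure function. This is only a sketch: the function name `confidence_score` and its parameters are hypothetical, and it assumes a case-insensitive match on the provider name.

```python
def confidence_score(flow_count, org=None, proto_count=0, ip=""):
    """Illustrative restatement of the scoring table (hypothetical helper)."""
    # Flow volume: 0.01 per flow, capped at 0.30.
    score = min(flow_count / 100, 0.30)
    # ASN reputation: applies only when an ASN record exists for the IP.
    # Assumption: provider names are matched case-insensitively.
    if org is not None:
        score += 0.30 if "hetzner" in org.lower() else 0.20
    # Protocol diversity: 0.05 per distinct protocol, capped at 0.10.
    score += min(proto_count * 0.05, 0.10)
    # Loopback penalty.
    if ip == "127.0.0.1":
        score -= 0.20
    return round(score, 2)


# All three signals at their caps: 0.30 + 0.30 + 0.10
print(confidence_score(200, "HETZNER-AS, DE", 2, "203.0.113.45"))  # 0.7
# A loopback artifact with no ASN record: 0.05 + 0.05 - 0.20
print(confidence_score(5, None, 1, "127.0.0.1"))  # -0.1
```

Under these weights the highest attainable score is 0.70, and the loopback penalty can push a score below zero, so the values are best read as a relative ranking rather than as probabilities.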
# 📄 Output
Generated outputs include:
| File | Purpose |
|---|---|
| `asn_results.csv` | ASN & organization attribution |
| `correlation_results.csv` | Ranked confidence-scored candidates |

Example:
```
IP              Confidence
203.0.113.45    0.70
198.51.100.12   0.55
192.0.2.88      0.40
```
# 📁 Project Structure
```
DARKLAYR/
├── data/
│   ├── raw/
│   ├── zeek_logs/
│   ├── csv/
│   └── outputs/
│
├── scripts/
│   ├── traffic/
│   ├── capture/
│   ├── zeek/
│   ├── enrichment/
│   └── correlation/
│
├── fetcher/
├── dashboard/
├── docs/
├── requirements.txt
├── setup.sh
└── README.md
```
# ⚡ Installation
## Install Dependencies
```bash
sudo apt update
sudo apt install -y tor tcpdump zeek tshark python3-pip
pip3 install -r requirements.txt
```
# 🛠️ Tech Stack
| Layer | Tools |
|---|---|
| Traffic Generation | `tor`, `torsocks`, `curl` |
| Packet Capture | `tcpdump` |
| Protocol Analysis | `zeek`, `zeek-cut`, `tshark` |
| Enrichment | `cymruwhois` |
| Processing | `python`, `pandas`, `numpy` |
| Intelligence | Onionoo API |
| Reporting | CSV / structured outputs |

# 🔬 How It Works: Zeek + ASN Feeds + Correlation
DARKLAYR correlates traffic patterns to identify probable Tor relays by combining three core layers of network intelligence:
1. **Zeek Telemetry Extraction**:
   - Zeek reads raw PCAP captures and parses them into protocol-specific, structured event logs (`conn.log`, `ssl.log`, `socks.log`).
   - The engine analyzes network connections to detect characteristic Tor TLS handshake fingerprints (e.g., specific cipher suites, randomized server names) and traffic volume/frequency anomalies.
2. **ASN Infrastructure Enrichment**:
   - The engine processes destination IPs through Team Cymru's WHOIS service to map IPs to their respective **Autonomous System Numbers (ASNs)** and Organization owners.
   - Traffic targeting ASNs known for hosting Tor relays (e.g., Hetzner, OVH, DigitalOcean) is prioritized and assigned higher attribution weights.
3. **Probabilistic Correlation**:
   - The **Correlation Engine** aggregates connection frequency, protocol diversity, and infrastructure reputation to compute a **Confidence Score**. With the weights listed in the Confidence Scoring Model, scores top out at `0.70`, and the loopback penalty can push a score below `0.0`.
   - Candidates are ranked by score, yielding investigative leads without decrypting any payloads.
# 📸 Screenshots of Output
Below are visualizations and reports generated by the DARKLAYR forensic correlation engine:
### 1. Network Evidence & Metadata
Captured packets showing connection details and metadata across multiple network flows.
### 2. Node Fingerprinting
Correlated IP list indicating identified node footprints and their respective organization properties.
### 3. Traffic Pattern Verification
Graph showing normal network flows compared to Tor-like traffic sequences.
### 4. Correlation Analysis
Analysis of connection timing, flow size distribution, and relay verification matching.
### 5. Flow Overview Architecture
High-level overview of packet flows and how the correlation model maps traffic.
# ⚠️ Limitations
DARKLAYR identifies infrastructure candidates under the following key limitations:
* **No Personal Anonymity Bypass**: The engine does not bypass Tor's cryptographic layers or reveal user identities; it only maps network infrastructure paths.
* **Hosting False Positives**: Shared ASNs (such as major cloud providers hosting both Tor relays and normal web servers) can skew correlation scores, so high-scoring candidates require manual verification.
* **Circuit Dynamics**: Tor rotates circuits over time (by default, new streams stop being attached to a circuit about 10 minutes after its first use). Correlation accuracy degrades over long timeframes if capture windows are not aligned.
* **Traffic Padding Defense**: Tor's built-in traffic padding and cell-drop defenses can obscure timing signals, lowering the effectiveness of flow-volume metrics.
# ⚖️ Ethical Notice
DARKLAYR is a forensic research framework. It:
* ❌ Does NOT deanonymize live users
* ❌ Does NOT attack the Tor network
* ❌ Does NOT bypass cryptographic protections
* ✅ Operates exclusively in controlled environments
* ✅ Focuses on behavioral correlation methodology
* ✅ Preserves ethical and legal investigative boundaries

All usage must comply with applicable laws, institutional policies, and ethical research standards.
# 🧾 Conclusion
DARKLAYR demonstrates that effective cybercrime investigation does not require breaking anonymity; it requires understanding layers, correlating evidence, and extracting meaning from observable behavior.
The system reconstructs investigative context through:
* timing relationships
* protocol transitions
* infrastructure attribution
* anomaly interpretation
* probabilistic confidence modeling

DARKLAYR does not produce identities. It produces leads. Because in layered infrastructure, every observable pattern leaves traceable behavior, and every peel leaves a lead.

*Built for forensic research. Designed with intent. Deployed with responsibility.*