White-Hat-007/DARKLAYR---Tor-Traffic-Forensic-Correlation-Engine


# 🕵️ DARKLAYR: Tor-Traffic Forensic Correlation Engine

![Python](https://img.shields.io/badge/python-3.8%2B-brightgreen) ![Platform](https://img.shields.io/badge/platform-Linux-FCC624?logo=linux&logoColor=black) ![Ubuntu](https://img.shields.io/badge/platform-Ubuntu-E95420?logo=ubuntu&logoColor=white) ![Status](https://img.shields.io/badge/status-research--only-orange) ![Type](https://img.shields.io/badge/type-forensic--research-blueviolet) ![Focus](https://img.shields.io/badge/focus-Tor%20Traffic%20Correlation-red)

# 🔍 Overview

The system does **not attempt deanonymization** and does not claim deterministic attribution. Instead, it reconstructs investigative context through:

- PCAP evidence acquisition
- Zeek protocol telemetry
- Tor relay intelligence
- ASN infrastructure attribution
- Behavioral and anomaly correlation
- Probabilistic confidence scoring

The engine is designed around a core investigative principle:

# 🧠 Investigative Philosophy

Traditional monitoring systems lose visibility once traffic enters encrypted onion-routed infrastructure.
DARKLAYR addresses this by focusing on:

- Observable behavior
- Infrastructure context
- Timing relationships
- Protocol transitions
- Statistical consistency
- Anomaly interpretation

The objective is not to identify users directly, but to produce:

- High-confidence infrastructure candidates
- Network attribution context
- Reproducible forensic evidence
- Explainable investigative leads

# ⚙️ Layered Architecture

The engine operates across five controlled investigative layers:

| Layer | Function |
|---|---|
| **Traffic Generation & Capture** | Controlled Tor traffic simulation and packet acquisition |
| **Network Evidence Extraction** | Zeek-based protocol analysis and structured telemetry |
| **Tor Intelligence & Topology** | Relay metadata collection and infrastructure mapping |
| **Forensic Correlation & Attribution** | Multi-factor probabilistic correlation |
| **Confidence Scoring & Reporting** | Ranked investigative output with explainable scoring |

# 📌 Table of Contents

- [Core Features](#-core-features)
- [Pipeline Architecture](#-pipeline-architecture)
- [Workflow](#-workflow)
- [Components](#-components)
  - [Traffic Generation](#1-traffic-generation)
  - [PCAP Capture](#2-pcap-capture)
  - [Zeek Analysis](#3-zeek-analysis)
  - [ASN Enrichment](#4-asn-enrichment)
  - [Tor Relay Intelligence](#5-tor-relay-fetcher)
  - [Correlation Engine](#6-correlation-engine)
- [Confidence Scoring Model](#-confidence-scoring-model)
- [Output](#-output)
- [Project Structure](#-project-structure)
- [Installation](#-installation)
- [Tech Stack](#-tech-stack)
- [Ethical Notice](#-ethical-notice)
- [Conclusion](#-conclusion)

# 🚀 Core Features

| Feature | Description |
|---|---|
| 🌐 **Controlled Tor Traffic Generation** | Simulates Tor-based traffic via SOCKS in isolated environments |
| 📡 **PCAP Evidence Acquisition** | Raw packet capture using tcpdump |
| 🔍 **Zeek Telemetry Extraction** | Multi-layer protocol analysis and forensic logging |
| 🌍 **ASN Intelligence Enrichment** | IP → ASN → Organization mapping using Team Cymru |
| 🧠 **Probabilistic Correlation Engine** | Multi-signal confidence scoring |
| 🕸️ **Tor Relay Intelligence Fetcher** | Live Tor relay metadata acquisition via Onionoo |
| 📊 **Forensic-Ready Reporting** | CSV outputs and investigative summaries |

# 🔄 Pipeline Architecture

```mermaid
graph TD
    A["Tor Traffic Generation<br/>torsocks + curl"] --> B["PCAP Capture<br/>tcpdump / Raw Evidence"]
    B --> C["Zeek Analysis<br/>Protocol Telemetry logs"]
    C --> D["CSV Normalization<br/>zeek-cut tool"]
    D --> E["ASN Enrichment<br/>Team Cymru WHOIS"]
    D --> F["Tor Relay Intelligence<br/>Onionoo API details"]
    E --> G["Probabilistic Correlation Engine<br/>Multi-Signal Attribution"]
    F --> G
    G --> H["Confidence-Scored Output<br/>Investigative Leads"]

    style A fill:#2e7d32,stroke:#1b5e20,stroke-width:2px,color:#fff
    style B fill:#1565c0,stroke:#0d47a1,stroke-width:2px,color:#fff
    style C fill:#00838f,stroke:#006064,stroke-width:2px,color:#fff
    style G fill:#e65100,stroke:#bf360c,stroke-width:2px,color:#fff
    style H fill:#c62828,stroke:#b71c1c,stroke-width:2px,color:#fff
```

Each stage is modular, reproducible, and independently auditable.

# 🔧 Workflow

Tor Traffic → PCAP → Zeek Logs → CSV → ASN Enrichment → Correlation → Confidence Output

# 📦 Components

# 1. Traffic Generation

Controlled Tor traffic is generated through SOCKS-based routing using `torsocks`. This produces realistic encrypted browsing behavior without interacting with live users.

```bash
#!/bin/bash
echo "[*] Starting Tor..."
tor &
# Give Tor time to bootstrap before sending requests
sleep 10

echo "[*] Generating traffic..."
# 200 sequential requests, each routed through Tor's SOCKS proxy
for i in {1..200}; do
  torsocks curl -s https://duckduckgo.com > /dev/null
done
echo "[✓] Traffic generation complete"
```

# 2. PCAP Capture

Traffic is captured using tcpdump to preserve raw packet-level evidence for offline forensic analysis.

```bash
#!/bin/bash
# Make sure the output directory exists
mkdir -p data/raw

echo "[*] Capturing traffic for 10 minutes..."
# enp0s3 is the capture interface used here; adjust it to match your host.
# -nn disables name and port resolution; timeout stops the capture after 600 s.
timeout 600 tcpdump -i enp0s3 -nn -w data/raw/tor_traffic.pcap
echo "[✓] Capture complete"
```

# 3. Zeek Analysis

Captured PCAPs are processed using Zeek to generate structured forensic telemetry. Generated logs include:

* `conn.log`
* `ssl.log`
* `socks.log`
* `tunnel.log`
* `weird.log`
* `packet_filter.log`

## Run Zeek

```bash
mkdir -p data/zeek_logs
cd data/zeek_logs
# -C ignores invalid checksums, -r reads packets from the capture file
zeek -C -r ../raw/tor_traffic.pcap
```

## Convert Logs to CSV

```bash
# Run from data/zeek_logs; the CSV directory must exist first
mkdir -p ../csv
for f in *.log; do
  # -m keeps a one-line header of column names so that pandas can address
  # columns such as id.resp_h (assumes a zeek-cut build that supports -m)
  zeek-cut -m < "$f" | tr '\t' ',' > "../csv/${f%.log}.csv"
done
```

# 4. ASN Enrichment

Observed IPs are enriched with:

* Autonomous System Number (ASN)
* Organization ownership
* Infrastructure attribution

via Team Cymru's WHOIS intelligence service.
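```python
import ipaddress

import pandas as pd
from cymruwhois import Client

conn = pd.read_csv("data/csv/conn.csv")
cymru = Client()


def lookup(ip):
    """Return (ASN, owner) for an IP, or placeholders if the lookup fails."""
    try:
        r = cymru.lookup(ip)
        return f"AS{r.asn}", r.owner
    except Exception:
        return "Unknown", "Unknown"


results = []
# Enrich responder (destination) IPs so the rows line up with the
# id.resp_h values that the correlation engine scores.
for ip in conn["id.resp_h"].dropna().unique():
    # Skip loopback, RFC 1918, and other non-routable addresses
    if ipaddress.ip_address(ip).is_private:
        continue
    asn, org = lookup(ip)
    results.append([ip, asn, org])

df = pd.DataFrame(results, columns=["ip", "asn", "organization"])
df.to_csv("data/outputs/asn_results.csv", index=False)
```

# 5. Tor Relay Fetcher

Fetches live relay metadata from the Onionoo API to build infrastructure context around observed traffic.

```python
import requests

# Onionoo "details" documents, restricted to relays that are currently running
url = "https://onionoo.torproject.org/details?running=true"
resp = requests.get(url, timeout=60)
resp.raise_for_status()
data = resp.json()

relays = []
for r in data["relays"]:
    # or_addresses entries are "host:port"; only the first one is used here.
    # rsplit drops just the port, so a bracketed IPv6 host stays intact.
    first_addr = (r.get("or_addresses") or [""])[0]
    relays.append({
        "ip": first_addr.rsplit(":", 1)[0],
        "nickname": r.get("nickname"),
        "country": r.get("country"),
    })

print(relays[:5])
```

# 6. Correlation Engine

The forensic core of DARKLAYR. The engine correlates:

* Flow volume
* Protocol diversity
* Temporal consistency
* ASN reputation
* Infrastructure ownership
* Behavioral anomalies

to produce confidence-ranked investigative candidates.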
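The pipeline diagram feeds relay intelligence into the correlation engine alongside the ASN data. One way the two could be connected is to tag each observed responder IP that also appears in the relay list. The following is a minimal sketch under stated assumptions, not DARKLAYR's own code: the helper names `relay_ip` and `tag_known_relays` are hypothetical, the input is assumed to follow Onionoo's `or_addresses` format (`host:port`, with IPv6 hosts in brackets), and the sample data uses documentation address ranges with made-up nicknames.

```python
from typing import Dict, Iterable, List


def relay_ip(or_address: str) -> str:
    """Strip the port from an entry such as "203.0.113.45:9001" or "[2001:db8::1]:443"."""
    host = or_address.rsplit(":", 1)[0]
    return host.strip("[]")


def tag_known_relays(observed_ips: Iterable[str],
                     relays: List[dict]) -> Dict[str, str]:
    """Map each observed IP that is a listed relay address to the relay nickname."""
    index = {}
    for relay in relays:
        for addr in relay.get("or_addresses", []):
            index[relay_ip(addr)] = relay.get("nickname", "")
    return {ip: index[ip] for ip in observed_ips if ip in index}


# Illustrative data only: documentation address ranges, invented nicknames.
sample_relays = [
    {"nickname": "relayA", "or_addresses": ["203.0.113.45:9001"]},
    {"nickname": "relayB", "or_addresses": ["198.51.100.12:443", "[2001:db8::1]:443"]},
]
matches = tag_known_relays(
    ["203.0.113.45", "192.0.2.88", "2001:db8::1"], sample_relays
)
print(matches)
```

Indexing every `or_addresses` entry, rather than only the first, lets IPv6 flows match as well. Whether a relay match should add weight to the confidence score is a separate design decision that this sketch leaves open.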
```python
import pandas as pd

conn = pd.read_csv("data/csv/conn.csv")
asn = pd.read_csv("data/outputs/asn_results.csv")

# Number of flows observed per responder (destination) IP
flow_counts = conn["id.resp_h"].value_counts().to_dict()

results = []
for ip, count in flow_counts.items():
    score = 0

    # Signal 1: flow volume, capped at 0.3 (reached at 30 flows)
    score += min(count / 100, 0.3)

    # Signal 2: ASN reputation, applied only if the IP was enriched
    match = asn[asn["ip"] == ip]
    if not match.empty:
        # str() guards against a missing organization (NaN) in the CSV
        if "Hetzner" in str(match["organization"].values[0]):
            score += 0.3
        else:
            score += 0.2

    # Signal 3: protocol diversity, 0.05 per distinct protocol, capped at 0.1
    proto_count = conn[conn["id.resp_h"] == ip]["proto"].nunique()
    score += min(proto_count * 0.05, 0.1)

    # Penalty: loopback artifact
    if ip == "127.0.0.1":
        score -= 0.2

    results.append([ip, score])

df = pd.DataFrame(results, columns=["IP", "Confidence"])
df = df.sort_values("Confidence", ascending=False)
df.to_csv("data/outputs/correlation_results.csv", index=False)
print(df)
```

# 📊 Confidence Scoring Model

| Signal             | Max Weight | Purpose                                        |
| ------------------ | ---------- | ---------------------------------------------- |
| Flow Volume        | +0.30      | Higher activity increases relevance            |
| ASN Reputation     | +0.30      | Tor-friendly hosting providers weighted higher |
| Protocol Diversity | +0.10      | Multi-protocol behavior correlation            |
| Loopback Penalty   | -0.20      | Reduces localhost false positives              |

Higher scores indicate stronger investigative priority, not certainty.
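To make the weights in the table easier to check by hand, the scoring rule can be written as a small pure function. This is a sketch that mirrors the table and the correlation script, not a separate part of DARKLAYR: the function name `confidence`, its parameters, and the constant names are assumptions introduced here for illustration, and the sample inputs are invented.

```python
from typing import Optional

# Weights taken from the scoring table; the names are illustrative.
FLOW_CAP = 0.30          # flow volume contributes count / 100, capped here
ASN_PREFERRED = 0.30     # organization string contains "Hetzner"
ASN_KNOWN = 0.20         # any other enriched organization
PROTO_STEP = 0.05        # per distinct protocol
PROTO_CAP = 0.10
LOOPBACK_PENALTY = 0.20


def confidence(flow_count: int, organization: Optional[str],
               proto_count: int, is_loopback: bool = False) -> float:
    """Combine the three signals and the loopback penalty into one score."""
    score = min(flow_count / 100, FLOW_CAP)
    if organization is not None:
        score += ASN_PREFERRED if "Hetzner" in organization else ASN_KNOWN
    score += min(proto_count * PROTO_STEP, PROTO_CAP)
    if is_loopback:
        score -= LOOPBACK_PENALTY
    # Rounding only tidies floating-point noise for display
    return round(score, 2)


# 45 flows (capped at 0.30) + Hetzner (0.30) + 2 protocols (0.10)
print(confidence(45, "Hetzner Online GmbH", 2))    # 0.7
# 30 flows (0.30) + other known ASN (0.20) + 1 protocol (0.05)
print(confidence(30, "Example ISP", 1))            # 0.55
# 5 flows (0.05) + no ASN match + 1 protocol (0.05) - loopback penalty (0.20)
print(confidence(5, None, 1, is_loopback=True))    # -0.1
```

The first call shows that the three positive signals sum to at most 0.70. The last call shows that the penalty can push a low-volume loopback entry below zero, since the rule applies no clamping.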
# 📄 Output

Generated outputs include:

| File                      | Purpose                             |
| ------------------------- | ----------------------------------- |
| `asn_results.csv`         | ASN & organization attribution      |
| `correlation_results.csv` | Ranked confidence-scored candidates |

Example:

```
IP               Confidence
203.0.113.45     0.70
198.51.100.12    0.55
192.0.2.88       0.40
```

# 📁 Project Structure

```
DARKLAYR/
├── data/
│   ├── raw/
│   ├── zeek_logs/
│   ├── csv/
│   └── outputs/
│
├── scripts/
│   ├── traffic/
│   ├── capture/
│   ├── zeek/
│   ├── enrichment/
│   └── correlation/
│
├── fetcher/
├── dashboard/
├── docs/
├── requirements.txt
├── setup.sh
└── README.md
```

# ⚡ Installation

## Install Dependencies

```bash
sudo apt update
# torsocks and curl are included because the traffic generation script uses them
sudo apt install -y tor torsocks curl tcpdump zeek tshark python3-pip
pip3 install -r requirements.txt
```

# 🛠️ Tech Stack

| Layer              | Tools                        |
| ------------------ | ---------------------------- |
| Traffic Generation | `tor`, `torsocks`, `curl`    |
| Packet Capture     | `tcpdump`                    |
| Protocol Analysis  | `zeek`, `zeek-cut`, `tshark` |
| Enrichment         | `cymruwhois`                 |
| Processing         | `python`, `pandas`, `numpy`  |
| Intelligence       | Onionoo API                  |
| Reporting          | CSV / structured outputs     |

# 🔬 How It Works: Zeek + ASN Feeds + Correlation

DARKLAYR correlates traffic patterns to identify probable Tor relays by combining three core layers of network intelligence:

1. **Zeek Telemetry Extraction**:
   - Zeek parses raw PCAP captures into protocol-specific, structured event logs (`conn.log`, `ssl.log`, `socks.log`).
   - The engine analyzes network connections to detect characteristic Tor TLS handshake fingerprints (e.g., specific cipher suites, randomized server names) and traffic volume/frequency anomalies.
2. **ASN Infrastructure Enrichment**:
   - The engine processes destination IPs through Team Cymru's WHOIS service to map IPs to their respective **Autonomous System Numbers (ASNs)** and organization owners.
   - Traffic targeting ASNs known for hosting Tor relays (e.g., Hetzner, OVH, DigitalOcean) is prioritized and assigned higher attribution weights.
3. **Probabilistic Correlation**:
   - The **Correlation Engine** aggregates connection frequency, protocol diversity, and infrastructure reputation to compute a **Confidence Score** (nominally from `0.0` to `1.0`; with the weights listed in the Confidence Scoring Model, the highest attainable score is `0.70`).
   - Candidates are ranked by score, yielding high-probability investigative leads without decrypting any payloads.

# 📸 Screenshots of Output

Below are visualizations and reports generated by the DARKLAYR forensic correlation engine:

### 1. Network Evidence & Metadata

Captured packets showing connection details and metadata across multiple network flows:

![Captured Network Metadata](https://static.pigsec.cn/wp-content/uploads/repos/2026/06/d574271f8b041438.png)

### 2. Node Fingerprinting

Correlated IP list indicating identified node footprints and their respective organization properties:

![Nodes' Fingerprints](https://static.pigsec.cn/wp-content/uploads/repos/2026/06/a90d6fe5fc041439.png)

### 3. Traffic Pattern Verification

Graph showing normal network flows compared to Tor-like traffic sequences:

![No Tor-like Flows](https://static.pigsec.cn/wp-content/uploads/repos/2026/06/3aa749d8bd041440.png)

### 4. Correlation Analysis

Analysis of connection timing, flow size distribution, and relay verification matching:

![Correlation Engine](https://static.pigsec.cn/wp-content/uploads/repos/2026/06/2142aaa849041441.jpg)

### 5. Flow Overview Architecture

High-level overview of packet flows and how the correlation model maps traffic:

![Flow Overview](https://static.pigsec.cn/wp-content/uploads/repos/2026/06/9d1dd94051041441.jpg)

# ⚠️ Limitations

DARKLAYR is designed to surface infrastructure candidates, and it operates under the following key limitations:

* **No Personal Anonymity Bypass**: The engine does not bypass Tor's cryptographic layers or reveal user identities; it only maps network infrastructure paths.
* **Hosting False Positives**: Shared ASNs (such as major cloud providers hosting both Tor relays and normal web servers) can skew correlation scores, so high-scoring candidates require manual verification.
* **Circuit Dynamics**: Tor rotates circuits regularly (by default, a circuit stops accepting new streams after about 10 minutes). Correlation accuracy degrades over long timeframes if capture windows are not aligned.
* **Traffic Padding Defense**: Tor's built-in traffic padding and cell-drop defenses can obscure timing signals, lowering the effectiveness of flow-volume metrics.

# ⚖️ Ethical Notice

DARKLAYR is a forensic research framework. It:

* ❌ Does NOT deanonymize live users
* ❌ Does NOT attack the Tor network
* ❌ Does NOT bypass cryptographic protections
* ✅ Operates exclusively in controlled environments
* ✅ Focuses on behavioral correlation methodology
* ✅ Preserves ethical and legal investigative boundaries

All usage must comply with applicable laws, institutional policies, and ethical research standards.

# 🧾 Conclusion

DARKLAYR demonstrates that effective cybercrime investigation does not require breaking anonymity. It requires understanding layers, correlating evidence, and extracting meaning from observable behavior. The system reconstructs investigative context through:

* timing relationships
* protocol transitions
* infrastructure attribution
* anomaly interpretation
* probabilistic confidence modeling

DARKLAYR does not produce identities. It produces leads.
Because in layered infrastructure, every observable pattern leaves traceable behavior, and every peel leaves a lead.

*Built for forensic research. Designed with intent. Deployed with responsibility.*