YashJagdale2122/leakdb-platform-engine
GitHub: YashJagdale2122/leakdb-platform-engine
Stars: 1 | Forks: 0
# LeakDB Platform Engine
[](https://fastapi.tiangolo.com)
[](https://dragonflydb.io)
[](https://docs.celeryq.dev)
[](https://www.elastic.co)
[](https://neo4j.com)
[](https://www.docker.com)
LeakDB is a production-grade, distributed cyber threat intelligence (CTI) pipeline architecture engineered to safely ingest, parse, analyze, and index massive-scale, unstructured breach data out-of-band.
The platform securely processes gigabytes of high-entropy raw data, extracts forensic indicators (text layers, EXIF metadata), cascades through structural optical character recognition engines, and maps actor relationships under heavy concurrent load without stalling front-facing APIs.
## Core System Architecture Flow
[ Client Ingestion Request ]
│
(HTTPS / Secure Payload)
│
▼
┌─────────────────────────┐
│ FastAPI Edge Gateway │ ───(Pool Session)───► [ PostgreSQL Ledger ]
└─────────────────────────┘ (Pipeline State Audits)
│
(Pushes Deferred Task)
│
▼
┌─────────────────────────┐
│ Dragonfly Memory Grid │
└─────────────────────────┘
│
(RESP Multi-Threaded Queue)
│
▼
┌─────────────────────────┐
│ Celery Worker Pool │ ◄───(Stream Downloads)───► [ MinIO S3 Object Store ]
└─────────────────────────┘
│ │
(Indexes Cleaned Text Layer) (Maps Complex Intelligence Graph)
│ │
▼ ▼
┌──────────────┐ ┌──────────────┐
│Elasticsearch │ │ Neo4j Graph │
│ (Search/PII) │ │ (Entity Dots)│
└──────────────┘ └──────────────┘
## Deep-Dive Architecture Highlights
### Asynchronous Edge Routing & Topology Isolation
### High-Throughput Cache Fabric (Dragonfly)
### Three-Tier High-Availability OCR Fallback Matrix
Unstructured graphic assets (scanned breach ledgers, screenshots, target identity papers) cascade sequentially through a fault-tolerant OCR processing chain inside the services layer:
* **Tier 1 (Florence-2 Vision API)**: Prioritizes deep semantic document mapping and spatial structures.
* **Tier 2 (Tesseract OCR + OpenCV CLAHE Preprocessing)**: Triggered if Tier 1 times out. Applies custom CLAHE contrast filters, gray-scaling transforms, and adaptive threshold masks before running native language character matching.
* **Tier 3 (EasyOCR Engine Fallback)**: High-entropy neural fallback pass to rescue remaining token targets.
### Connected Intelligence Mapping & Telemetry Harvesting
* **Graph Relations (Neo4j)**: Converts flat metadata maps into multi-dimensional graphs. Executes parameterized Cypher vectors to connect threat actors, affected countries, targeted networks, and files across distinct breaches.
* **Forensic EXIF Harvesting**: Strips down binary file footprints (JPEG/PNG layers) to harvest tracking telemetry (device signatures, GPS markers, software fingerprints) and logs it directly inside the primary search index.
* **OOM Prevention Framework**: Uses memory-safe byte-streaming buffers (32 KB chunk allocation cycles) to stream massive objects into local sandboxed file spaces, entirely preventing container OOM crash loops.
## Production Repository File Blueprint
leakdb-platform-engine/
├── app/
│ ├── __init__.py
│ ├── main.py # Gateway setup & middleware router wiring
│ ├── api/
│ │ ├── __init__.py
│ │ ├── deps.py # Gateway security access decorators
│ │ └── v1/
│ │ ├── router.py # Module routing aggregator
│ │ └── endpoints/
│ │ ├── ingestion.py # Asynchronous target submission handlers
│ │ └── search.py # Multi-match cluster interface queries
│ ├── core/
│ │ ├── __init__.py
│ │ ├── config.py # Type-validated Pydantic setting system
│ │ ├── database.py # High-performance async connection pools
│ │ ├── logging.py # Structured JSON log aggregation engine
│ │ └── celery_app.py # Task scheduler configurations
│ ├── models/
│ │ ├── __init__.py
│ │ └── base.py # PostgreSQL declarative system ledger maps
│ ├── schemas/
│ │ ├── __init__.py
│ │ ├── ingestion.py # Pydantic input/output structural rules
│ │ └── search.py # Query definition constraints
│ ├── services/
│ │ ├── __init__.py
│ │ └── analyzer.py # Independent processing services
│ └── workers/
│ ├── __init__.py
│ └── tasks.py # Worker execution context loops
├── scripts/
│ └── seed.py # One-click mock environment infrastructure seeder
├── .env.example # Explicitly defined environment skeleton configuration
├── .gitignore # Enforces security containment bounds
├── Dockerfile # Multi-stage optimized distribution base image
├── docker-compose.yml # Local stack orchestration setup blueprint
└── requirements.txt # Base package requirement dependencies
## Prerequisites and External Dependency Setup
Before initializing the core application containers, ensure that the necessary foundational infrastructure elements and local deep-learning inference models are pulled, configured, and running.
### 1. External Storage and Search Clusters
If you are connecting to existing instances rather than the local stack definitions, verify the target networks are exposed:
### 2. Large Language Model Service (vLLM / Ollama Backend)
The analysis engine depends on an open-weights foundational model (default: `granite-3.0-8b`) accessible via an OpenAI-compatible completion route.
To run this locally via Ollama, execute:
# Pull and instantiate the target inference context model
ollama pull granite-3.0-8b
ollama serve
### 3. Computer Vision Services (Florence-2 Docker Deployment)
The multi-tier OCR cascade utilizes Microsoft's Florence-2 vision model containerized via a dedicated gRPC/REST service to parse unstructured visual artifacts:
# Pull and execute the specialized document analysis vision container
docker pull [mcr.microsoft.com/oryx/python:3.11](https://mcr.microsoft.com/oryx/python:3.11)
# Ensure the service endpoint aligns with the FLORENCE_API property inside your environment configuration
## Quickstart Deployment Guide
### 1. Initialize System Workspace Environments
# Clone the infrastructure engineering workspace
git clone [https://github.com/YOUR_USERNAME/leakdb-platform-engine.git](https://github.com/YOUR_USERNAME/leakdb-platform-engine.git)
cd leakdb-platform-engine
# Copy the environment template skeleton file to local target tracking bounds
cp .env.example .env
### 2. Configure Your Local System Settings (.env)
Update your private, local `.env` file with your target development credentials. Note: The underlying system utilizes multi-stage docker orchestration mechanics to safely read parameters via dynamic environment reference injection `${VAR}` to completely prevent secret leaks.
### 3. Launch the Core Infrastructure Stack
# Build multi-stage execution layers and launch stack daemons (Postgres, Dragonfly, Gateway, Workers)
docker compose up --build -d
# Verify infrastructure container allocations are healthy and online
docker compose ps
### 4. Seed the Storage Infrastructure and Run Verification Tests
# Run the automated database & object storage seeder utility to create Elasticsearch indices and buckets
python -m scripts.seed
# Trigger a sample ingestion workload using curl to verify end-to-end task routing
curl -X POST "http://localhost:8000/api/v1/ingestion/trigger" \
-H "X-LeakDB-API-Key: vclabs_platform_gateway_fallback_token_string" \
-H "Content-Type: application/json" \
-d '{
"db_name": "intel_breach_test_2026",
"actor": ["ThreatGroup-7"],
"country": ["Global"],
"db_context": "Sample unstructured audit data payload for pipeline verification."
}'
# Monitor live worker pipelines via the structured JSON output formatter
docker compose logs -f worker
## Security Model & Infrastructure Hygiene
* **Zero Secret Persistence Policy**: Absolutely no credentials, encryption keys, internal cluster IPs, or database routes are hardcoded inside the code layout.
* **Strict Runtime Isolation**: Local settings are managed via `pydantic-settings` to enforce type matching on startup, failing fast if parameters are incorrect.
* **Deterministic Docker Layering**: Multi-stage Docker definitions separate target dependencies, preventing build tools or local environment noise from leaking into your production runtimes.