YashJagdale2122/leakdb-platform-engine

GitHub: YashJagdale2122/leakdb-platform-engine

# LeakDB Platform Engine

[![FastAPI](https://img.shields.io/badge/API-FastAPI-009688?style=flat-square&logo=fastapi&logoColor=white)](https://fastapi.tiangolo.com)
[![Dragonfly](https://img.shields.io/badge/Broker-DragonflyDB-E34C26?style=flat-square)](https://dragonflydb.io)
[![Celery](https://img.shields.io/badge/Workers-Celery-3771A1?style=flat-square&logo=celery&logoColor=white)](https://docs.celeryq.dev)
[![Elasticsearch](https://img.shields.io/badge/Search-Elasticsearch-005571?style=flat-square&logo=elasticsearch&logoColor=white)](https://www.elastic.co)
[![Neo4j](https://img.shields.io/badge/Graph-Neo4j-008CC1?style=flat-square&logo=neo4j&logoColor=white)](https://neo4j.com)
[![Docker](https://img.shields.io/badge/Containers-Docker%20Compose-2496ED?style=flat-square&logo=docker&logoColor=white)](https://www.docker.com)

LeakDB is a production-grade, distributed cyber threat intelligence (CTI) pipeline architecture engineered to safely ingest, parse, analyze, and index massive-scale, unstructured breach data out-of-band. The platform securely processes gigabytes of high-entropy raw data, extracts forensic indicators (text layers, EXIF metadata), cascades through structural optical character recognition engines, and maps actor relationships under heavy concurrent load without stalling front-facing APIs.
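
The out-of-band handoff described above can be illustrated with a minimal, dependency-free sketch. It assumes nothing about the actual LeakDB code: `queue.Queue` stands in for the Dragonfly-backed broker, a plain thread stands in for a Celery worker, and `submit_ingestion`, `process_job`, and `worker_loop` are hypothetical names. The only point is that the request path enqueues a job and returns immediately, while the slow work happens elsewhere.

```python
import queue
import threading
import uuid

# Stand-in for the broker queue (Dragonfly in the real stack); hypothetical.
job_queue: queue.Queue = queue.Queue()
results: dict = {}


def submit_ingestion(payload: dict) -> dict:
    """Request path: enqueue the job and return an acknowledgement at once."""
    job_id = uuid.uuid4().hex
    job_queue.put({"job_id": job_id, "payload": payload})
    return {"job_id": job_id, "status": "queued"}


def process_job(job: dict) -> str:
    """Placeholder for the slow parse/OCR/index work done by a worker."""
    return f"indexed:{job['payload']['db_name']}"


def worker_loop() -> None:
    """Worker path: drain the queue until a None sentinel arrives."""
    while True:
        job = job_queue.get()
        if job is None:
            break
        results[job["job_id"]] = process_job(job)


worker = threading.Thread(target=worker_loop, daemon=True)
worker.start()

ack = submit_ingestion({"db_name": "intel_breach_test_2026"})
job_queue.put(None)  # ask the worker to stop once the queue is drained
worker.join()
```

The caller receives `ack` before any processing has necessarily started; the result only appears in `results` after the worker has picked the job up.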
## Core System Architecture Flow

```text
[ Client Ingestion Request ]
             │ (HTTPS / Secure Payload)
             │
             ▼
┌─────────────────────────┐
│  FastAPI Edge Gateway   │ ───(Pool Session)───► [ PostgreSQL Ledger ]
└─────────────────────────┘                       (Pipeline State Audits)
             │ (Pushes Deferred Task)
             │
             ▼
┌─────────────────────────┐
│  Dragonfly Memory Grid  │
└─────────────────────────┘
             │ (RESP Multi-Threaded Queue)
             │
             ▼
┌─────────────────────────┐
│   Celery Worker Pool    │ ◄───(Stream Downloads)───► [ MinIO S3 Object Store ]
└─────────────────────────┘
       │                               │
(Indexes Cleaned Text Layer)   (Maps Complex Intelligence Graph)
       │                               │
       ▼                               ▼
┌──────────────┐               ┌──────────────┐
│Elasticsearch │               │ Neo4j Graph  │
│ (Search/PII) │               │ (Entity Dots)│
└──────────────┘               └──────────────┘
```

## Deep-Dive Architecture Highlights

### Asynchronous Edge Routing & Topology Isolation

### High-Throughput Cache Fabric (Dragonfly)

### Three-Tier High-Availability OCR Fallback Matrix

Unstructured graphic assets (scanned breach ledgers, screenshots, target identity papers) cascade sequentially through a fault-tolerant OCR processing chain inside the services layer:

* **Tier 1 (Florence-2 Vision API)**: Prioritizes deep semantic document mapping and spatial structures.
* **Tier 2 (Tesseract OCR + OpenCV CLAHE Preprocessing)**: Triggered if Tier 1 times out. Applies CLAHE contrast enhancement, grayscale conversion, and adaptive thresholding before running Tesseract character recognition.
* **Tier 3 (EasyOCR Engine Fallback)**: Neural fallback pass that attempts to recover any remaining text.

### Connected Intelligence Mapping & Telemetry Harvesting

* **Graph Relations (Neo4j)**: Converts flat metadata maps into connected graphs. Executes parameterized Cypher queries to connect threat actors, affected countries, targeted networks, and files across distinct breaches.
* **Forensic EXIF Harvesting**: Parses image files (JPEG/PNG) to extract embedded tracking telemetry (device signatures, GPS markers, software fingerprints) and logs it in the primary search index.
* **OOM Prevention Framework**: Streams large objects into local sandboxed file spaces through fixed 32 KB chunk buffers, so downloads are never held fully in memory and do not push containers into OOM crash loops.

## Production Repository File Blueprint

```text
leakdb-platform-engine/
├── app/
│   ├── __init__.py
│   ├── main.py                  # Gateway setup & middleware router wiring
│   ├── api/
│   │   ├── __init__.py
│   │   ├── deps.py              # Gateway security access decorators
│   │   └── v1/
│   │       ├── router.py        # Module routing aggregator
│   │       └── endpoints/
│   │           ├── ingestion.py # Asynchronous target submission handlers
│   │           └── search.py    # Multi-match cluster interface queries
│   ├── core/
│   │   ├── __init__.py
│   │   ├── config.py            # Type-validated Pydantic setting system
│   │   ├── database.py          # High-performance async connection pools
│   │   ├── logging.py           # Structured JSON log aggregation engine
│   │   └── celery_app.py        # Task scheduler configurations
│   ├── models/
│   │   ├── __init__.py
│   │   └── base.py              # PostgreSQL declarative system ledger maps
│   ├── schemas/
│   │   ├── __init__.py
│   │   ├── ingestion.py         # Pydantic input/output structural rules
│   │   └── search.py            # Query definition constraints
│   ├── services/
│   │   ├── __init__.py
│   │   └── analyzer.py          # Independent processing services
│   └── workers/
│       ├── __init__.py
│       └── tasks.py             # Worker execution context loops
├── scripts/
│   └── seed.py                  # One-click mock environment infrastructure seeder
├── .env.example                 # Explicitly defined environment skeleton configuration
├── .gitignore                   # Enforces security containment bounds
├── Dockerfile                   # Multi-stage optimized distribution base image
├── docker-compose.yml           # Local stack orchestration setup blueprint
└── requirements.txt             # Base package requirement dependencies
```

## Prerequisites and External Dependency Setup

Before
initializing the core application containers, ensure that the required infrastructure services and local deep-learning inference models are pulled, configured, and running.

### 1. External Storage and Search Clusters

If you are connecting to existing instances rather than the local stack definitions, verify the target networks are exposed:

### 2. Large Language Model Service (vLLM / Ollama Backend)

The analysis engine depends on an open-weights foundational model (default: `granite-3.0-8b`) accessible via an OpenAI-compatible completion route. To run this locally via Ollama, execute:

```bash
# Pull and instantiate the target inference context model
ollama pull granite-3.0-8b
ollama serve
```

### 3. Computer Vision Services (Florence-2 Docker Deployment)

The multi-tier OCR cascade uses Microsoft's Florence-2 vision model, containerized behind a dedicated gRPC/REST service, to parse unstructured visual artifacts:

```bash
# Pull and execute the specialized document analysis vision container
docker pull mcr.microsoft.com/oryx/python:3.11
# Ensure the service endpoint aligns with the FLORENCE_API property inside your environment configuration
```

## Quickstart Deployment Guide

### 1. Initialize System Workspace Environments

```bash
# Clone the infrastructure engineering workspace
git clone https://github.com/YOUR_USERNAME/leakdb-platform-engine.git
cd leakdb-platform-engine

# Copy the environment template to a local .env file
cp .env.example .env
```

### 2. Configure Your Local System Settings (.env)

Update your private, local `.env` file with your target development credentials.

Note: The Docker orchestration layer reads parameters through `${VAR}` environment references, which keeps secrets out of the committed configuration files.

### 3. Launch the Core Infrastructure Stack

```bash
# Build multi-stage execution layers and launch stack daemons (Postgres, Dragonfly, Gateway, Workers)
docker compose up --build -d

# Verify infrastructure container allocations are healthy and online
docker compose ps
```

### 4. Seed the Storage Infrastructure and Run Verification Tests

```bash
# Run the automated database & object storage seeder utility to create Elasticsearch indices and buckets
python -m scripts.seed

# Trigger a sample ingestion workload using curl to verify end-to-end task routing
curl -X POST "http://localhost:8000/api/v1/ingestion/trigger" \
  -H "X-LeakDB-API-Key: vclabs_platform_gateway_fallback_token_string" \
  -H "Content-Type: application/json" \
  -d '{
    "db_name": "intel_breach_test_2026",
    "actor": ["ThreatGroup-7"],
    "country": ["Global"],
    "db_context": "Sample unstructured audit data payload for pipeline verification."
  }'

# Monitor live worker pipelines via the structured JSON output formatter
docker compose logs -f worker
```

## Security Model & Infrastructure Hygiene

* **Zero Secret Persistence Policy**: No credentials, encryption keys, internal cluster IPs, or database routes are hardcoded in the codebase.
* **Strict Runtime Isolation**: Local settings are managed via `pydantic-settings`, which enforces type matching on startup and fails fast if parameters are incorrect.
* **Deterministic Docker Layering**: Multi-stage Docker definitions separate build-time dependencies from the runtime image, preventing build tools or local environment noise from leaking into your production runtimes.
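
The fail-fast behaviour described under Strict Runtime Isolation can be illustrated without any dependencies. The sketch below is not the project's `config.py` and does not use `pydantic-settings`; it is a standard-library stand-in for the same idea. The names `Settings`, `ConfigError`, `load_settings`, `POSTGRES_DSN`, and `WORKER_CONCURRENCY` are hypothetical, and `FLORENCE_API` is the only variable taken from this README.

```python
from dataclasses import dataclass
from typing import Mapping


class ConfigError(RuntimeError):
    """Raised at startup when required settings are missing or malformed."""


@dataclass(frozen=True)
class Settings:
    postgres_dsn: str        # hypothetical variable name
    florence_api: str        # FLORENCE_API is mentioned in this README
    worker_concurrency: int  # hypothetical variable name


def load_settings(env: Mapping[str, str]) -> Settings:
    """Validate every setting up front and report all problems at once."""
    errors = []

    def require(name: str) -> str:
        value = env.get(name, "").strip()
        if not value:
            errors.append(f"{name} is required")
        return value

    dsn = require("POSTGRES_DSN")
    florence = require("FLORENCE_API")

    # Optional setting with a default, but it must parse as a positive int.
    try:
        concurrency = int(env.get("WORKER_CONCURRENCY", "4"))
        if concurrency < 1:
            raise ValueError
    except ValueError:
        errors.append("WORKER_CONCURRENCY must be a positive integer")
        concurrency = 0

    if errors:
        # Fail fast: refuse to start with a partially valid configuration.
        raise ConfigError("; ".join(errors))
    return Settings(dsn, florence, concurrency)
```

In an entry point this would be called once with `os.environ` before any connection pool or worker is created, so a bad `.env` stops the process at startup rather than surfacing later as a runtime failure.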