LevaAverGit/pd-scanner-152fz-v2

GitHub: LevaAverGit/pd-scanner-152fz-v2

Stars: 0 | Forks: 0

# PD Scanner [![CI](https://github.com/LevaAverGit/pd-scanner-152fz/actions/workflows/ci.yml/badge.svg)](https://github.com/LevaAverGit/pd-scanner-152fz/actions/workflows/ci.yml) A local-first web application for privacy compliance pre-analysis of public websites. Given a URL, PD Scanner launches a headless browser, crawls the site, classifies personal data collection points, and produces structured evidence reports aligned with Russian Federal Law 152-FZ "On Personal Data" and general data protection principles. All analysis runs locally. No data leaves your machine. ## What PD Scanner Does - **Crawls** up to 20 pages of a public site (bounded same-site BFS) - **Detects** form fields collecting personal data and classifies them into 12 categories - **Observes** outbound network requests and classifies third-party hosts by vendor type - **Detects** privacy links, consent checkboxes, marketing consent, and bundled consent text - **Parses** linked policy pages — HTML, PDF, and DOCX — for 8 standard policy sections - **Infers** downstream data processors from form action URLs, scripts, and network patterns - **Builds** a structured 152-FZ evidence layer: risk level, policy gaps, manual validation targets - **Exports** full findings as JSON and Markdown reports - **Screenshots** the seed URL for visual record ## What PD Scanner Does NOT Do | Boundary | Why | |---|---| | Does not submit real personal data into forms | Safety by design — only synthetic placeholder values are used when synthetic mode is explicitly enabled | | Does not bypass CAPTCHA | Never attempts any CAPTCHA circumvention | | Does not bypass authentication | Login-gated pages are out of scope | | Does not follow external links | Same-host only; no cross-domain crawling | | Does not claim definitive legal compliance | All findings are heuristic public-signal observations; legal conclusions require expert analysis | | Rule-based heuristic classification | No external LLM or AI API dependency; all classification is deterministic and auditable | | Does not store data remotely | SQLite database is local-only | | Does not scan private IP ranges | SSRF guard blocks all private/loopback/link-local addresses | ## Why Observed / Inferred / Operator-Supplied Separation Matters Every finding in PD Scanner is clearly labelled by its epistemic status: - **`observed`** — directly seen by the scanner (e.g., a network request to `analytics.google.com`, a form action pointing to HubSpot) - **`inferred`** — derived from public signals with stated confidence (e.g., a hidden form field with `portalId` suggests HubSpot even if not confirmed) - **`operator_supplied`** — provided by the operator (e.g., `integration_evidence.crm_destination = "Bitrix24"`) This separation matters because: 1. It makes the provenance of every claim transparent and auditable 2. It prevents conflating speculation with observation in audit reports 3. It allows operator-supplied context to enrich findings without being confused with scanner observations ## Stack | Layer | Technology | |---|---| | Backend | Python 3.11, FastAPI, Playwright (async), aiosqlite | | Data models | Pydantic v2 | | PDF extraction | PyMuPDF (fitz) | | DOCX extraction | python-docx | | Frontend | React 18, TypeScript, Vite, Tailwind CSS | | Database | SQLite (local file, auto-migrated) | | Tests | pytest-asyncio, httpx ASGITransport | ## Engineering Highlights - **Async pipeline** — FastAPI + Playwright + aiosqlite; scan runs as a background task so the API returns immediately; no blocking I/O in the event loop - **Bounded BFS crawler** — same-host only, max 20 pages; SSRF guard resolves hostnames and blocks RFC1918, loopback, and link-local before any outbound request - **Layered analysis** — DOM classification → vendor detection → consent signals → policy parsing (HTML + PDF + DOCX) → 152-FZ evidence synthesis; each layer is independently testable - **Epistemic labelling** — every finding is tagged `observed`, `inferred`, or `operator_supplied`; provenance is preserved through the full data model and reports - **Pydantic v2 throughout** — all API inputs/outputs, database models, and inter-service data pass through Pydantic validation; no stringly-typed result handling - **303 tests, 0 warnings** — in-process HTTP via `httpx.ASGITransport`; per-test DB isolation via `tmp_path` + `patch`; no real network calls in tests ## Backend Architecture POST /api/scan ↓ URL Validation (SSRF guard, scheme check) ↓ Scan record persisted (status=pending) → HTTP 200 returned ↓ [background task] Playwright BFS crawler (up to 20 pages, same-host) │ ├── DOM Parser → form fields, links ├── PD Classifier → DataCategoryItem list ├── Consent Detector → checkbox / bundled text / absent ├── Vendor Classifier → VendorSummaryItem list └── Network Capture → third-party hosts ↓ Policy Parser (HTML/PDF/DOCX) → PolicyAnalysis ↓ Integration Audit → ProcessorMapItem list ↓ 152-FZ Assessment → FZ152Assessment (risk level, gaps, targets) ↓ Screenshot + Report export (JSON + Markdown) ↓ Scan record updated (status=complete) ## API and Data Flow | Endpoint | Method | Description | |---|---|---| | `/api/scan` | POST | Submit URL; returns `scan_id` immediately | | `/api/scan/{id}` | GET | Poll for results; returns full `ScanResult` when complete | | `/api/scan/diff` | POST | Compare two completed scans | | `/api/history` | GET | Paginated scan history | | `/api/history/{id}` | DELETE | Delete scan record and files | | `/api/health` | GET | Liveness check | See `docs/API_OVERVIEW.md` for request/response examples and error handling. See `docs/DATA_FLOW.md` for the full stage-by-stage pipeline breakdown. ## Testing Strategy make test # 303 backend tests make type-check # TypeScript strict check make verify # both + frontend build All backend tests use `httpx.ASGITransport` (no real server) and isolated SQLite databases via `tmp_path` + `unittest.mock.patch`. The 2 Playwright tests use a local fixture server. See `docs/QUALITY_ASSURANCE.md` for the full test strategy and patterns. ## Why This Project Matters for Developer Roles - **Full-stack implementation**: async Python API + React TypeScript frontend + SQLite - **Non-trivial backend pipeline**: multi-stage async processing with distinct service boundaries - **Production-oriented design decisions**: SSRF guard, local-only storage, epistemic labelling, safety gates — each documented with rationale in `docs/ARCHITECTURE_DECISIONS.md` - **Test discipline**: 303 tests, in-process HTTP testing, per-test DB isolation — testable architecture, not just coverage numbers - **Domain understanding**: 152-FZ requirements translated into detectable signals with explicit limitations — shows the ability to scope and build a tool that is honest about what it can and cannot do ## Prerequisites | Dependency | Version | Install | |---|---|---| | Python | 3.11+ | `brew install python@3.11` or [python.org](https://www.python.org/) | | Node.js | 18+ | `brew install node` or [nodejs.org](https://nodejs.org/) | | make | any | pre-installed on macOS / Linux | ## Setup # Create Python venv, install all backend deps, # install Playwright Chromium, install frontend deps make install ## Run **Terminal 1 — backend** (FastAPI on port 8000): make dev-backend **Terminal 2 — frontend** (Vite dev server on port 5173): make dev-frontend Open **http://localhost:5173** in your browser. ## Demo Flow 1. Start backend + frontend 2. Open http://localhost:5173 3. Paste a public registration page URL (e.g. a company's `/register` or `/signup` page) 4. Click **Scan** 5. Wait 10–30 seconds for the crawler to finish 6. View: detected data categories, vendor observations, policy analysis, 152-FZ risk assessment 7. Download JSON or Markdown report from the Export section ## Tests make test # backend tests (303 tests) make type-check # frontend TypeScript strict check make build # frontend production build make verify # all three in sequence — full clean check **Current status: 303 tests, 0 warnings.** Coverage: - **API**: health, scan create/get, history, delete, SSRF URL validation (loopback, RFC1918, link-local all blocked) - **Classifier**: 12 data categories, confidence scoring, false-positive guard - **Page classifier**: registration relevance, page-type detection - **Vendor classification**: analytics, ad-tech, CDN, payment, tracker patterns - **Policy parser**: section detection, operator name / contact extraction - **Synthetic submission**: safety gates (CAPTCHA block, max-submissions cap, sensitive field detection) - **Downstream processor inference**: operator evidence schema, processor map building - **152-FZ assessment**: consent mechanism typing, risk scoring, gap generation - **Document parsing**: PDF / DOCX type detection, text extraction, parse-status propagation ## Configuration Backend settings via environment variables (all optional, prefix `PD_`): | Variable | Default | Description | |---|---|---| | `PD_DB_PATH` | `pd_scanner.db` | SQLite database file path | | `PD_CORS_ORIGINS` | `["http://localhost:5173"]` | Allowed CORS origins | | `PD_LOG_LEVEL` | `INFO` | Logging level | | `PD_ALLOW_LOCAL_TEST_TARGETS` | `false` | Allow localhost / 127.x — **local fixture testing only, never in production** | ## API Reference | Method | Path | Description | |---|---|---| | `POST` | `/api/scan` | Submit URL for scanning | | `GET` | `/api/scan/{scan_id}` | Poll for results | | `GET` | `/api/history` | Paginated scan history (`limit`, `offset`) | | `DELETE` | `/api/history/{scan_id}` | Delete a scan record | | `GET` | `/api/health` | Liveness check → `{"status": "ok"}` | `POST /api/scan` request body: { "url": "https://example.com/register", "notes": "optional free-text notes", "enable_synthetic_submission": false, "integration_evidence": null, "operator_metadata": null } `integration_evidence` and `operator_metadata` accept operator-supplied context (CRM platform, webhook URLs, legal name, INN) that is clearly labelled `operator_supplied` in all outputs — never mixed with scanner observations. ## Project Structure pd-scanner-152fz/ ├── backend/ │ ├── app/ │ │ ├── main.py FastAPI app factory │ │ ├── api/routes_scan.py Scan endpoints │ │ ├── api/routes_history.py History endpoints │ │ ├── core/config.py pydantic-settings (PD_ prefix) │ │ ├── models/schemas.py All Pydantic v2 models │ │ ├── models/db.py aiosqlite + auto-migration │ │ └── services/ │ │ ├── scanner_service.py Full pipeline orchestrator │ │ ├── crawler_service.py Bounded BFS crawler │ │ ├── classifier_service.py PD field classifier │ │ ├── consent_detection_service.py Consent signals │ │ ├── vendor_classification_service.py Vendor types │ │ ├── policy_parser_service.py Policy page + doc routing │ │ ├── document_extraction_service.py PDF / DOCX extraction │ │ ├── synthetic_submission_service.py Controlled form submission │ │ ├── integration_audit_service.py Processor inference │ │ ├── fz152_assessment_service.py 152-FZ evidence builder │ │ └── report_service.py JSON + Markdown export │ ├── tests/ 303 tests, 0 warnings │ └── requirements.txt ├── frontend/ │ ├── src/pages/DashboardPage.tsx │ ├── src/pages/ScanDetailsPage.tsx │ ├── src/components/ PolicyAnalysisPanel, FZ152AssessmentPanel, … │ └── package.json ├── docs/ │ ├── ARCHITECTURE.md │ ├── THREAT_MODEL.md │ ├── PRD.md │ ├── 152FZ_CHECKLIST.md │ ├── EVIDENCE_MODEL.md │ ├── PRIVACY_AUDIT_MAPPING.md │ └── RISK_SCORING.md ├── sample_reports/ │ ├── example_report.md │ └── example_result.json ├── pytest.ini ├── Makefile └── README.md ## Documentation | Document | Description | |---|---| | [`docs/API_OVERVIEW.md`](docs/API_OVERVIEW.md) | Endpoints, request/response examples, async scan lifecycle, error handling | | [`docs/DATA_FLOW.md`](docs/DATA_FLOW.md) | Full stage-by-stage pipeline: URL → crawler → classification → export | | [`docs/EXTENDING_SCANNER.md`](docs/EXTENDING_SCANNER.md) | How to add PD categories, vendor signatures, policy sections, tests | | [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md) | System architecture, component breakdown, pipeline diagram | | [`docs/ARCHITECTURE_DECISIONS.md`](docs/ARCHITECTURE_DECISIONS.md) | Why the architecture was designed this way (async, Playwright, SQLite, rule-based) | | [`docs/THREAT_MODEL.md`](docs/THREAT_MODEL.md) | Trust boundaries, SSRF guard, synthetic submission safety | | [`docs/QUALITY_ASSURANCE.md`](docs/QUALITY_ASSURANCE.md) | Test strategy, patterns, DB isolation, manual validation checklist | | [`docs/152FZ_CHECKLIST.md`](docs/152FZ_CHECKLIST.md) | 152-FZ signal checklist with article mapping and limitations | | [`docs/EVIDENCE_MODEL.md`](docs/EVIDENCE_MODEL.md) | Evidence types, confidence model, and what each finding represents | | [`docs/PRIVACY_AUDIT_MAPPING.md`](docs/PRIVACY_AUDIT_MAPPING.md) | How scanner output maps to structured privacy audit phases | | [`docs/RISK_SCORING.md`](docs/RISK_SCORING.md) | Heuristic risk scoring: factors, weights, thresholds | | [`docs/INTERVIEW_NOTES.md`](docs/INTERVIEW_NOTES.md) | Interview pitch and Q&A with strict scope/limitations framing | | [`CONTRIBUTING.md`](CONTRIBUTING.md) | Setup, running tests, adding categories/vendors/sections | ## Use Case PD Scanner automates the public-site evidence collection phase of a 152-FZ or GDPR pre-audit. An analyst can scan a client's registration flow in under a minute and receive a structured report showing: what personal data is collected, via which forms, routed to which third parties, with what consent mechanism, against what published policy. ## How This Maps to Real Privacy Compliance Work Privacy compliance under 152-FZ requires evidence collection, gap analysis, and structured reporting. This tool automates the evidence collection phase against the publicly observable layer of a website. | This tool | Real compliance workflow | |---|---| | Bounded BFS crawler | Evidence gathering scope (in-scope URLs) | | PD category classifier | PD inventory — what data is collected and where | | Consent signal detection | Consent mechanism review | | Policy section flags | Privacy policy adequacy review | | Vendor / processor map | Third-party and processor register (Art. 6(4)) | | 152-FZ gap list | Preliminary gap analysis for legal review | | `manual_validation_targets` | Audit findings requiring specialist follow-up | | JSON / Markdown export | Audit evidence package | See [`docs/PRIVACY_AUDIT_MAPPING.md`](docs/PRIVACY_AUDIT_MAPPING.md) for a full phase-by-phase breakdown. ## What This Project Demonstrates for Security Roles - Practical 152-FZ knowledge: Articles 6, 9, 12, 14, 18.1, 21 translated into heuristic detection logic - Privacy-by-design principles: SSRF guard, no real data submission, local-only storage, epistemic labelling of findings - Evidence model design: observed vs. inferred vs. operator-supplied distinction maintained throughout the data model - Full-stack implementation: async Python pipeline + React TypeScript UI + SQLite - Test discipline: 303 tests with async fixtures, DB isolation via tmp_path + patch, comprehensive coverage of all detection layers - Structured reporting: compliance-oriented output for both technical and non-technical audiences ## Known Limitations - BFS crawler is bounded at 20 pages / depth 2 — deep sites are partially covered - JavaScript-heavy SPAs requiring interaction to reveal forms may not be fully captured - Classifier relies on field `name`, `id`, `label`, `placeholder`, `aria-label`; obfuscated attributes reduce accuracy - PDF / DOCX policy parsing requires a text layer; image-only scanned PDFs return `unreadable` status - Synthetic submission is off by default; when enabled, only clearly synthetic values are submitted with strict safety gates - No rate limiting, multi-user support, or remote deployment hardening - DNS rebinding is not defended against (post-resolution host validation is deferred) ## Safety Disclaimer PD Scanner is a **read-only analysis tool** by default: - Analyses only URLs you explicitly provide - Never submits real personal data - Never bypasses authentication or CAPTCHA - Never follows links to external domains - Stores all output locally on your machine - Intended for use on publicly accessible pages only When `enable_synthetic_submission: true` is set, only clearly synthetic placeholder values are used (e.g. `test@example.invalid`), submissions are blocked on CAPTCHA / payment / sensitive-field pages, and only request metadata (no bodies, no cookies) is captured. ## Legal / Compliance Disclaimer This tool performs **heuristic technical analysis only**. - It does **not** determine legal compliance with 152-FZ or any other regulation. - It does **not** replace a legal, DPO, or professional compliance audit. - All findings are potential risk indicators that require manual validation. - No output constitutes a legal opinion or a guarantee of regulatory conformance. - The tool is intended for educational, portfolio, and pre-audit assistance purposes.