nuclide-research/recongraph

# recongraph

Seed-polymorphic reconnaissance engine: one seed in, a typed provenance graph out.

recongraph accepts six seed types (IP, CIDR, Domain, ASN, CertFP, BannerString) and runs a fixed-point iteration of probes against them. Every finding becomes a typed node with a provenance chain back to the original seed. Passive sources run first; active non-intrusive probes fire only where passive signal left a node ambiguous. The engine halts when the queue drains, the budget cap is hit, or the iteration adds zero new nodes. After stabilization, every Service node receives an exposure classification from a rule set that records which rule fired.

The one fully-implemented real probe is a crt.sh certificate-transparency lookup. All other probe stubs are in `probes.py` and `probes_real/`; the orchestration is probe-agnostic.

## Install

```
git clone https://github.com/nuclide-research/recongraph
cd recongraph
```

Python 3.8+, standard library only, no dependencies. All network I/O lives in probe implementations; the engine, graph, budget, and classification logic are pure.

## Usage

```python
from recongraph import Engine, Seed, SeedType

engine = Engine()
graph = engine.run([Seed(SeedType.IP, "192.0.2.10")])
print(graph.to_json())
```

Run the smoke test (no network):

```
python smoke_test.py
```

Run the real pipeline (requires a clean network environment):

```
python upgraded_runs.py
```

## Seed types

| SeedType | Example value |
|----------|---------------|
| `IP` | `192.0.2.10` |
| `CIDR` | `192.0.2.0/24` |
| `DOMAIN` | `example.com` |
| `ASN` | `AS15169` |
| `CERT_FP` | sha256 of DER |
| `BANNER` | `Server: nginx/1.18` |

## Node and edge types

Nodes: `HOST`, `SERVICE`, `CERT`, `DOMAIN`, `NETBLOCK`, `ORG`, `ASN`.

Edges: `OBSERVED_ON`, `ISSUED_FOR`, `RESOLVES_TO`, `ANNOUNCED_BY`, `CO_HOSTED_WITH`, `SHARES_CERT_WITH`, `BELONGS_TO`, `DRIFT_FROM`.
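To make the relationship between typed nodes, provenance chains, and the halting conditions concrete, here is a minimal standalone sketch of a fixed-point expansion loop. It is not recongraph's engine and uses none of its API: `run_to_fixed_point`, `resolve`, and `announce` are hypothetical names, nodes are plain `(type, value)` tuples, and the budget is simply a count of probe calls. In this simplified form, "queue drained" and "iteration added zero new nodes" collapse into one condition.

```python
# Illustrative only: hypothetical stand-ins, not recongraph's real probes or engine.

def resolve(node):
    """Pretend DNS probe: maps a domain node to the host it resolves to."""
    table = {("domain", "example.com"): [("host", "192.0.2.10")]}
    return table.get(node, [])


def announce(node):
    """Pretend routing probe: maps a host node to its containing netblock."""
    table = {("host", "192.0.2.10"): [("netblock", "192.0.2.0/24")]}
    return table.get(node, [])


def run_to_fixed_point(seed, probes, budget=100):
    """Expand `seed` round by round until a round adds nothing or the budget runs out.

    Returns (provenance, reason), where provenance maps each node to the chain
    of nodes leading back to the seed.
    """
    provenance = {seed: [seed]}
    frontier = [seed]
    spent = 0
    while True:
        new_nodes = []
        for node in frontier:
            for probe in probes:
                if spent >= budget:
                    return provenance, "budget cap hit"
                spent += 1
                for found in probe(node):
                    if found not in provenance:  # only unseen nodes are expanded later
                        provenance[found] = provenance[node] + [found]
                        new_nodes.append(found)
        if not new_nodes:
            return provenance, "fixed point"
        frontier = new_nodes


if __name__ == "__main__":
    graph, reason = run_to_fixed_point(("domain", "example.com"), [resolve, announce])
    print(reason)
    for node, chain in graph.items():
        print(node, "<-", chain)
```

Keying nodes by `(type, value)` means a node rediscovered through a second path is not re-queued, which is what lets the loop terminate on cyclic data.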
## Exposure classes

Every Service node receives one of five labels after the graph stabilizes:

| Class | Meaning |
|-------|---------|
| `public_intended` | http, https, dns, smtp, submission, imaps, pop3s |
| `public_accidental` | staging/dev/test subdomains, `.git`, `.env`, `/backup`, `/phpinfo`, and similar |
| `mgmt_exposed` | ssh, rdp, vnc, ipmi, mysql, postgres, mongodb, redis, elasticsearch, kubelet, etcd, docker-api, ldap |
| `legacy_drift` | finger, telnet, gopher, tftp, rsh, rlogin, chargen, qi |
| `unknown` | no rule matched |

Rules are ordered; first match wins. `legacy_drift` fires before `mgmt_exposed`.

## Budget defaults

| Cap | Default |
|-----|---------|
| Wallclock | 300 s |
| Probe cost | 1000 units |
| Unique hosts | 500 |
| Requests per /24 | 30 |
| Requests per ASN | 100 |

## Graph output shape

```json
{
  "created_at": 1717430400.0,
  "nodes": [
    {
      "type": "host",
      "value": "192.0.2.10",
      "attrs": {},
      "provenance": [["seed-id"]],
      "first_seen": 1717430400.0,
      "last_seen": 1717430401.2,
      "exposure": null,
      "id": "a1b2c3d4e5f60001"
    }
  ],
  "edges": [
    {
      "src": "a1b2c3d4e5f60001",
      "dst": "b2c3d4e5f6a70002",
      "type": "resolves_to",
      "attrs": {},
      "first_seen": 1717430401.0
    }
  ]
}
```

## Drift detection

Two runs can be compared to produce `DRIFT_FROM` edges automatically.

## Additional modules

| Module | Purpose |
|--------|---------|
| `cloud_ranges.py` | Classifier over GCP/AWS/Cloudflare published range files, plus rDNS pattern matching for GCP/AWS/Azure/Huawei/Alibaba/OVH/DigitalOcean/Linode/Vultr/Hetzner. Weekly on-disk cache. |
| `l7_fingerprint.py` | Raw HTTP probe ladder, canonical error-page signature library with fail-closed matching, HTTP/2 cleartext support detection. |
| `neighbors.py` | /24 and /20 homogeneity sweep with verdict classification (highly homogeneous → shared edge pool; highly heterogeneous → single tenant per IP). |
| `tenant_model.py` | `TenantModel` taxonomy, `IdentificationConfidence` levels, `EnvironmentalConstraints` for recording what the environment prevented observing. |
| `sandbox_detect.py` | Startup check: probe unrelated reference IPs with identical payloads, compare response shapes. Identical across targets means the environment is intercepting. Downgrades L7-derived tenant conclusions to OPAQUE when detected. |

`upgraded_runs.py` is the reference pipeline that wires everything together.

## Adding a probe

```python
from recongraph import Seed, SeedType, Finding, ProbeMode, Probe, Node, NodeType

def my_probe(seed: Seed, budget) -> Finding:
    # Decline seed types this probe does not handle with a zero-confidence finding.
    if seed.type != SeedType.IP:
        return Finding(source="my-probe", mode=ProbeMode.PASSIVE, confidence=0)
    return Finding(
        source="my-probe",
        mode=ProbeMode.PASSIVE,
        confidence=0.8,
        nodes=[Node(type=NodeType.HOST, value=seed.value, attrs={})],
        edges=[],
    )

# `registry` is the probe registry (import not shown).
registry.register(Probe(
    name="my_probe",
    accepts=(SeedType.IP,),
    mode=ProbeMode.PASSIVE,
    fn=my_probe,
    cost=2,
))
```

## Adding an exposure rule

```python
def rule_my_thing(node, graph):
    # Skip non-Service nodes; None means this rule did not match.
    if node.type != NodeType.SERVICE:
        return None
    if some_condition(node):  # placeholder predicate
        return (ExposureClass.MGMT_EXPOSED, "my_reason")
    return None

# Prepend the custom rule so it is tried before the defaults (first match wins).
classify_graph(graph, rules=[rule_my_thing] + DEFAULT_RULES)
```

## What recongraph is not

recongraph is not a scanner. It does not sweep ranges, does not send exploit traffic, and does not emit findings it cannot explain. The one real probe (crt.sh) is passive and read-only. Every other probe in the default registry is a stub waiting for a real implementation. The orchestration runs regardless; what fires depends entirely on which probes have real implementations registered.

## License

MIT. Part of the NuClide toolchain.

Contact: [nuclide-research.com](https://nuclide-research.com)