# recongraph
Seed-polymorphic reconnaissance engine: one seed in, a typed provenance graph out.
recongraph accepts six seed types (IP, CIDR, Domain, ASN, CertFP, BannerString) and runs a fixed-point iteration of probes against them. Every finding becomes a typed node with a provenance chain back to the original seed. Passive sources run first; active non-intrusive probes fire only where passive signal left a node ambiguous. The engine halts when the queue drains, the budget cap is hit, or the iteration adds zero new nodes. After stabilization, every Service node receives an exposure classification from a rule set that records which rule fired.
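The halting behaviour described above can be pictured as a worklist loop. The following is a minimal sketch, not recongraph's engine: `run_fixed_point`, the `probe` callable, and `max_cost` are hypothetical names, nodes are reduced to plain strings, and every probe call is assumed to cost one unit. Because only previously unseen values are queued, "the queue drains" and "the iteration adds zero new nodes" coincide here.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical probe signature: takes a node value, returns derived values.
ProbeFn = Callable[[str], Iterable[str]]


def run_fixed_point(
    seeds: List[str], probe: ProbeFn, max_cost: int = 1000
) -> Dict[str, Tuple[str, ...]]:
    """Expand seeds until the queue drains or the cost cap is hit.

    Returns each discovered value mapped to its provenance chain,
    i.e. the path of values leading back to the original seed.
    """
    provenance: Dict[str, Tuple[str, ...]] = {s: (s,) for s in seeds}
    queue = deque(seeds)
    cost = 0
    while queue and cost < max_cost:
        current = queue.popleft()
        cost += 1  # assumption: every probe call costs one unit
        for found in probe(current):
            if found not in provenance:  # only new nodes are queued
                provenance[found] = provenance[current] + (found,)
                queue.append(found)
    return provenance


# Toy stand-in for real probes: a fixed lookup table.
links = {
    "example.com": ["192.0.2.10"],
    "192.0.2.10": ["example.com", "www.example.com"],
}
result = run_fixed_point(["example.com"], lambda v: links.get(v, []))
```

In this toy run, `www.example.com` ends up with a chain that passes through `192.0.2.10` back to the `example.com` seed.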
The only fully implemented real probe is a crt.sh certificate-transparency lookup. The remaining probes are stubs in `probes.py` and `probes_real/`; the orchestration is probe-agnostic.
## Install
```bash
git clone https://github.com/nuclide-research/recongraph
cd recongraph
```
Python 3.8+, standard library only, no dependencies. All network I/O lives in probe implementations; the engine, graph, budget, and classification logic are pure.
## Usage
```python
from recongraph import Engine, Seed, SeedType

engine = Engine()
graph = engine.run([Seed(SeedType.IP, "192.0.2.10")])
print(graph.to_json())
```
Run the smoke test (no network):
```bash
python smoke_test.py
```
Run the real pipeline (requires a clean network environment):
```bash
python upgraded_runs.py
```
## Seed types
| SeedType | Example value |
|----------|---------------|
| `IP` | `192.0.2.10` |
| `CIDR` | `192.0.2.0/24` |
| `DOMAIN` | `example.com` |
| `ASN` | `AS15169` |
| `CERT_FP` | SHA-256 of the DER-encoded certificate |
| `BANNER` | `Server: nginx/1.18` |
## Node and edge types
Nodes: `HOST`, `SERVICE`, `CERT`, `DOMAIN`, `NETBLOCK`, `ORG`, `ASN`.
Edges: `OBSERVED_ON`, `ISSUED_FOR`, `RESOLVES_TO`, `ANNOUNCED_BY`, `CO_HOSTED_WITH`, `SHARES_CERT_WITH`, `BELONGS_TO`, `DRIFT_FROM`.
## Exposure classes
Every Service node receives one of five labels after the graph stabilizes:
| Class | Matches |
|-------|---------|
| `public_intended` | http, https, dns, smtp, submission, imaps, pop3s |
| `public_accidental` | staging/dev/test subdomains, `.git`, `.env`, `/backup`, `/phpinfo`, and similar |
| `mgmt_exposed` | ssh, rdp, vnc, ipmi, mysql, postgres, mongodb, redis, elasticsearch, kubelet, etcd, docker-api, ldap |
| `legacy_drift` | finger, telnet, gopher, tftp, rsh, rlogin, chargen, qi |
| `unknown` | no rule matched |
Rules are ordered; first match wins. `legacy_drift` fires before `mgmt_exposed`.
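As an illustration of ordered, first-match-wins dispatch, here is a simplified sketch keyed on service name only. It is not recongraph's rule set: `classify`, `RULES`, and the `rule_*` functions are hypothetical, `public_accidental` is left out because it depends on hostnames and paths, and the position of the public rule is an assumption (only "legacy before management" is stated above).

```python
from typing import Callable, Optional, Sequence, Tuple

# Service-name sets copied from the table above.
LEGACY = {"finger", "telnet", "gopher", "tftp", "rsh", "rlogin", "chargen", "qi"}
MGMT = {
    "ssh", "rdp", "vnc", "ipmi", "mysql", "postgres", "mongodb", "redis",
    "elasticsearch", "kubelet", "etcd", "docker-api", "ldap",
}
PUBLIC = {"http", "https", "dns", "smtp", "submission", "imaps", "pop3s"}

Verdict = Tuple[str, str]  # (exposure class, name of the rule that fired)
Rule = Callable[[str], Optional[Verdict]]


def rule_legacy(service: str) -> Optional[Verdict]:
    return ("legacy_drift", "rule_legacy") if service in LEGACY else None


def rule_mgmt(service: str) -> Optional[Verdict]:
    return ("mgmt_exposed", "rule_mgmt") if service in MGMT else None


def rule_public(service: str) -> Optional[Verdict]:
    return ("public_intended", "rule_public") if service in PUBLIC else None


# Order matters: the first rule that returns a verdict wins.
RULES: Sequence[Rule] = (rule_legacy, rule_mgmt, rule_public)


def classify(service: str, rules: Sequence[Rule] = RULES) -> Verdict:
    for rule in rules:
        verdict = rule(service)
        if verdict is not None:
            return verdict
    return ("unknown", "no_rule_matched")
```

Returning the rule name alongside the class keeps every label explainable, which mirrors the "records which rule fired" behaviour described earlier.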
## Budget defaults
| Cap | Default |
|-----|---------|
| Wallclock | 300 s |
| Probe cost | 1000 units |
| Unique hosts | 500 |
| Requests per /24 | 30 |
| Requests per ASN | 100 |
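
To make the per-/24 cap concrete, the sketch below counts requests per IPv4 /24 block with the standard `ipaddress` module. `Slash24Budget` and `try_spend` are hypothetical names used for illustration; recongraph's own budget object may track its caps differently.

```python
import ipaddress
from collections import Counter


class Slash24Budget:
    """Allow at most `per_slash24` requests into any single IPv4 /24."""

    def __init__(self, per_slash24: int = 30) -> None:
        self.per_slash24 = per_slash24
        self.used: Counter = Counter()

    def try_spend(self, ip: str) -> bool:
        # strict=False masks the host bits, so 192.0.2.10 maps to 192.0.2.0/24.
        block = ipaddress.ip_network(f"{ip}/24", strict=False)
        if self.used[block] >= self.per_slash24:
            return False  # cap reached for this block; caller should skip the probe
        self.used[block] += 1
        return True
```

A probe would call `try_spend(target_ip)` before sending anything and skip the target when it returns `False`.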
## Graph output shape
```json
{
  "created_at": 1717430400.0,
  "nodes": [
    {
      "type": "host",
      "value": "192.0.2.10",
      "attrs": {},
      "provenance": [["seed-id"]],
      "first_seen": 1717430400.0,
      "last_seen": 1717430401.2,
      "exposure": null,
      "id": "a1b2c3d4e5f60001"
    }
  ],
  "edges": [
    {
      "src": "a1b2c3d4e5f60001",
      "dst": "b2c3d4e5f6a70002",
      "type": "resolves_to",
      "attrs": {},
      "first_seen": 1717430401.0
    }
  ]
}
```
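
Because edges refer to nodes by `id`, consumers usually index the nodes first. The helper below is a small sketch that relies only on the fields shown in the shape above; `outgoing_edges` is a hypothetical name, not part of recongraph.

```python
import json
from collections import defaultdict
from typing import Dict, List, Tuple


def outgoing_edges(graph_json: str) -> Dict[str, List[Tuple[str, str]]]:
    """Map each source node value to its (edge type, destination value) pairs."""
    graph = json.loads(graph_json)
    by_id = {node["id"]: node for node in graph["nodes"]}
    result: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for edge in graph["edges"]:
        src = by_id.get(edge["src"])
        dst = by_id.get(edge["dst"])
        if src is None or dst is None:
            continue  # skip edges whose endpoints are not in this document
        result[src["value"]].append((edge["type"], dst["value"]))
    return dict(result)
```

Passing the string returned by `graph.to_json()` gives a plain dictionary that is easy to print or filter by edge type.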
## Drift detection
Two runs can be compared to produce `DRIFT_FROM` edges automatically.
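
As a rough illustration of what such a comparison involves, the sketch below diffs two graph dictionaries in the output shape shown earlier. `diff_runs` is a hypothetical helper, not recongraph's API, and it assumes nodes can be matched across runs by their `(type, value)` pair. How recongraph turns these differences into `DRIFT_FROM` edges is not covered by this sketch.

```python
from typing import Any, Dict, List, Tuple

Key = Tuple[str, str]  # (node type, node value)


def _index(graph: Dict[str, Any]) -> Dict[Key, Dict[str, Any]]:
    return {(n["type"], n["value"]): n for n in graph["nodes"]}


def diff_runs(old: Dict[str, Any], new: Dict[str, Any]) -> Dict[str, List[Key]]:
    """Report nodes added, removed, or changed between two runs."""
    before, after = _index(old), _index(new)
    shared = before.keys() & after.keys()
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        # A node counts as changed if its attrs or exposure label differ.
        "changed": sorted(
            k for k in shared
            if before[k].get("attrs") != after[k].get("attrs")
            or before[k].get("exposure") != after[k].get("exposure")
        ),
    }
```

Both arguments are parsed graphs, for example `json.loads(graph.to_json())` from each run.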
## Additional modules
| Module | Purpose |
|--------|---------|
| `cloud_ranges.py` | Classifier over GCP/AWS/Cloudflare published range files, plus rDNS pattern matching for GCP/AWS/Azure/Huawei/Alibaba/OVH/DigitalOcean/Linode/Vultr/Hetzner. Weekly on-disk cache. |
| `l7_fingerprint.py` | Raw HTTP probe ladder, canonical error-page signature library with fail-closed matching, HTTP/2 cleartext support detection. |
| `neighbors.py` | /24 and /20 homogeneity sweep with verdict classification (highly homogeneous → shared edge pool; highly heterogeneous → single tenant per IP). |
| `tenant_model.py` | `TenantModel` taxonomy, `IdentificationConfidence` levels, `EnvironmentalConstraints` for recording what the environment prevented observing. |
| `sandbox_detect.py` | Startup check: probe unrelated reference IPs with identical payloads, compare response shapes. Identical across targets means the environment is intercepting. Downgrades L7-derived tenant conclusions to OPAQUE when detected. |
`upgraded_runs.py` is the reference pipeline that wires everything together.
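
The comparison idea behind `sandbox_detect.py` can be illustrated without any network I/O. The sketch below is an assumption about one way to reduce a response to a comparable shape; `response_shape` and `looks_intercepted` are hypothetical names, and the real module may compare different fields.

```python
import hashlib
from typing import Dict, Iterable, Tuple

Shape = Tuple[int, Tuple[str, ...], str]


def response_shape(status: int, headers: Dict[str, str], body: bytes) -> Shape:
    """Reduce a response to status, sorted lowercase header names, and body hash."""
    names = tuple(sorted(name.lower() for name in headers))
    return (status, names, hashlib.sha256(body).hexdigest())


def looks_intercepted(shapes: Iterable[Shape]) -> bool:
    """True when at least two unrelated targets produced the same shape."""
    collected = list(shapes)
    return len(collected) >= 2 and len(set(collected)) == 1
```

Unrelated reference hosts should rarely agree on status, header set, and body all at once, so identical shapes suggest that something between the client and the targets is answering on their behalf.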
## Adding a probe
```python
from recongraph import Seed, SeedType, Finding, ProbeMode, Probe, Node, NodeType


def my_probe(seed: Seed, budget) -> Finding:
    # Seeds this probe cannot handle yield an empty, zero-confidence finding.
    if seed.type != SeedType.IP:
        return Finding(source="my-probe", mode=ProbeMode.PASSIVE, confidence=0)
    return Finding(
        source="my-probe",
        mode=ProbeMode.PASSIVE,
        confidence=0.8,
        nodes=[Node(type=NodeType.HOST, value=seed.value, attrs={})],
        edges=[],
    )


# `registry` is assumed to be the probe registry already in scope.
registry.register(Probe(
    name="my_probe",
    accepts=(SeedType.IP,),
    mode=ProbeMode.PASSIVE,
    fn=my_probe,
    cost=2,
))
```
## Adding an exposure rule
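```python
# some_condition is a placeholder for your own predicate.
def rule_my_thing(node, graph):
    # Only Service nodes receive an exposure class.
    if node.type != NodeType.SERVICE:
        return None
    if some_condition(node):
        return (ExposureClass.MGMT_EXPOSED, "my_reason")
    return None


# First match wins, so prepend the custom rule to give it priority.
classify_graph(graph, rules=[rule_my_thing] + DEFAULT_RULES)
```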
## What recongraph is not
recongraph is not a scanner. It does not sweep ranges, does not send exploit traffic, and does not emit findings it cannot explain. The one real probe (crt.sh) is passive and read-only. Every other probe in the default registry is a stub waiting for a real implementation. The orchestration runs regardless; what fires depends entirely on which probes have real implementations registered.
## License
MIT. Part of the NuClide toolchain. Contact: [nuclide-research.com](https://nuclide-research.com)