elchacal801/threat-intel-reference
# Threat Intel Reference
A daily-updated, open-source reference database of known malware, PUPs, PUAs, adware, and riskware — aggregated from multiple free threat intelligence sources.
## What's in the data?
| File | Description |
|---|---|
| `data/normalized/malware_samples.csv` | All malware samples with hashes, family names, classifications |
| `data/normalized/pup_pua_samples.csv` | PUP/PUA/adware/riskware samples |
| `data/normalized/malware_families.csv` | Canonical family list with aliases and descriptions |
| `data/normalized/iocs.csv` | IOCs (IPs, domains, URLs, hashes) with confidence scores |
| `data/normalized/techniques.csv` | MITRE ATT&CK technique-to-family mappings |
| `data/normalized/behavioral_indicators.csv` | Hash-to-domain/IP associations from Hybrid Analysis sandbox data and URLhaus |
| `data/raw/` | Raw per-source CSVs before normalization |
All files are available as both CSV and JSON.
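As a minimal sketch of consuming one of these files, the snippet below tallies rows by a column using only the standard library. The column name `classification` and the sample headers are assumptions, since this README does not list the CSV headers; check the header row of the file you download. `count_by_column` is a hypothetical helper, not part of this repo.

```python
import csv
from collections import Counter
from typing import Iterable


def count_by_column(lines: Iterable[str], column: str) -> Counter:
    """Count rows per distinct value of `column` in CSV text lines."""
    reader = csv.DictReader(lines)
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise KeyError(f"column {column!r} not found in {reader.fieldnames}")
    return Counter(row[column] for row in reader)


# Inline sample standing in for a downloaded file; these headers are hypothetical.
sample = [
    "sha256,family,classification",
    "aaa,FamilyA,malware",
    "bbb,FamilyB,adware",
    "ccc,FamilyA,malware",
]
counts = count_by_column(sample, "classification")
print(counts.most_common())
```

For a real file, pass an open file handle instead of the list, for example `open(path, newline="", encoding="utf-8")`.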
## Data Sources
- **[MalwareBazaar](https://bazaar.abuse.ch)** — Malware sample hashes, family signatures, ClamAV detections
- **[ThreatFox](https://threatfox.abuse.ch)** — IOCs with confidence scores and family attribution
- **[YARAify](https://yaraify.abuse.ch)** — YARA rule-to-malware-family mappings
- **[MITRE ATT&CK](https://attack.mitre.org)** — Technique-to-malware relationships
- **[MISP Galaxy](https://github.com/MISP/misp-galaxy)** — Malware family taxonomy with aliases
- **[URLhaus](https://urlhaus.abuse.ch)** — Payload hashes linked to malicious URLs/domains
- **[AlienVault OTX](https://otx.alienvault.com)** — Community pulse IOCs with family associations
- **[Hybrid Analysis](https://www.hybrid-analysis.com)** — Sandbox behavioral data (contacted domains/IPs)
- **[VirusTotal](https://www.virustotal.com)** — Multi-AV verdicts and PUP/PUA classification
Family name normalization powered by [malware_name_mapping](https://github.com/certtools/malware_name_mapping).
## How it works
Two GitHub Actions pipelines run automatically:
**Daily pipeline** (6 AM UTC): Collectors pull bulk data from all sources into `data/raw/`; the normalizer then merges and classifies it into `data/normalized/`.
**Enrichment pipeline** (every 6 hours): Hybrid Analysis and VirusTotal enrich existing samples with behavioral data and multi-AV verdicts. Both lookups are rate-limited, so each run processes a batch of samples rather than the full set.
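The batching idea behind the enrichment pipeline can be sketched as follows. This is an illustration under stated assumptions, not the repository's code: `batched`, `enrich_in_batches`, the batch size, and the pause between batches are all hypothetical, and `lookup` stands in for whatever function queries Hybrid Analysis or VirusTotal.

```python
import time
from typing import Callable, Dict, Iterator, List, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def enrich_in_batches(
    hashes: Sequence[str],
    lookup: Callable[[str], dict],
    batch_size: int = 4,          # hypothetical value, not taken from the repo
    pause_seconds: float = 60.0,  # hypothetical value, not taken from the repo
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, dict]:
    """Call `lookup` per hash, pausing between batches to respect a rate limit."""
    results: Dict[str, dict] = {}
    for index, batch in enumerate(batched(hashes, batch_size)):
        if index > 0:
            sleep(pause_seconds)  # wait before every batch after the first
        for sample_hash in batch:
            results[sample_hash] = lookup(sample_hash)
    return results
```

Passing `sleep` as a parameter keeps the pause testable without actually waiting.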
## Quick start
### Use the data
Download any CSV directly:
```bash
curl -O https://raw.githubusercontent.com/elchacal801/threat-intel-reference/main/data/normalized/malware_samples.csv
```
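The same download can be done from Python. The sketch below builds the raw URL from the pattern in the `curl` command above and saves the file locally; `normalized_url` and `download` are hypothetical helper names, and the 30-second timeout is an arbitrary choice.

```python
import urllib.request
from pathlib import Path

# Base of the raw-file URL used in the curl command above.
RAW_BASE = "https://raw.githubusercontent.com/elchacal801/threat-intel-reference/main"


def normalized_url(filename: str) -> str:
    """Build the raw download URL for a file under data/normalized/."""
    return f"{RAW_BASE}/data/normalized/{filename}"


def download(filename: str, dest_dir: str = ".") -> Path:
    """Fetch one normalized file and save it under `dest_dir` (needs network access)."""
    target = Path(dest_dir) / filename
    with urllib.request.urlopen(normalized_url(filename), timeout=30) as response:
        target.write_bytes(response.read())
    return target
```

For example, `download("iocs.csv")` would save the IOC file into the current directory.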
### Run locally
```bash
git clone https://github.com/elchacal801/threat-intel-reference.git
cd threat-intel-reference
pip install -r requirements.txt
export MALWAREBAZAAR_API_KEY="your_key"
export THREATFOX_API_KEY="your_key"
export YARAIFY_API_KEY="your_key"
python run_pipeline.py
```
### Fork & use your own keys
1. Fork this repo
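2. Get free API keys for the abuse.ch services (MalwareBazaar, ThreatFox, YARAify, URLhaus) at [auth.abuse.ch](https://auth.abuse.ch); the AlienVault OTX, Hybrid Analysis, and VirusTotal keys come from accounts on those services
3. Add them as GitHub Secrets: `MALWAREBAZAAR_API_KEY`, `THREATFOX_API_KEY`, `YARAIFY_API_KEY`, `URLHAUS_API_KEY`, `OTX_API_KEY`, `HYBRID_ANALYSIS_API_KEY`, `VT_API_KEY`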
4. Enable Actions — the daily cron will start updating your fork
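Before running the pipeline locally or in a fork, a quick preflight check can show which keys are not set. This is a sketch only: `EXPECTED_KEYS` mirrors the secret names listed above, `missing_keys` is a hypothetical helper, and whether the pipeline needs all seven keys or skips sources without one is an assumption this README does not settle.

```python
import os
from typing import List, Mapping

# Secret names as listed in the fork instructions above.
EXPECTED_KEYS = (
    "MALWAREBAZAAR_API_KEY",
    "THREATFOX_API_KEY",
    "YARAIFY_API_KEY",
    "URLHAUS_API_KEY",
    "OTX_API_KEY",
    "HYBRID_ANALYSIS_API_KEY",
    "VT_API_KEY",
)


def missing_keys(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names from EXPECTED_KEYS that are unset or blank in `env`."""
    return [name for name in EXPECTED_KEYS if not env.get(name, "").strip()]


if missing_keys():
    print("Not set:", ", ".join(missing_keys()))
```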
## Classification
Samples are classified using these signals (first match wins):
| Signal | Classification |
|---|---|
| ClamAV signature starts with `PUA.` | pua |
| ClamAV signature starts with `Adware.` | adware |
| Tags contain `adware` | adware |
| Tags contain `pup` or `pua` | pup or pua (matching the tag) |
| Tags contain `riskware` | riskware |
| Tags contain `bundler` | pua |
| MISP Galaxy family type | as labeled |
| VirusTotal popular_threat_classification | as labeled |
| Default | malware |
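The first-match-wins ordering in the table can be sketched as a single function. This is not the repository's normalizer: the function name `classify`, its parameter names, exact (rather than substring) tag matching, case-insensitive tag comparison, and trying `pup` before `pua` are all assumptions made for illustration.

```python
from typing import Iterable, Optional


def classify(
    clamav_signature: Optional[str] = None,
    tags: Iterable[str] = (),
    misp_family_type: Optional[str] = None,
    vt_label: Optional[str] = None,
) -> str:
    """Return the first classification whose signal matches, else 'malware'."""
    signature = clamav_signature or ""
    tag_set = {tag.strip().lower() for tag in tags}  # assumption: case-insensitive

    # ClamAV signature prefixes are checked first.
    if signature.startswith("PUA."):
        return "pua"
    if signature.startswith("Adware."):
        return "adware"

    # Tag rules in table order; pup is tried before pua (an assumption).
    for tag, label in (
        ("adware", "adware"),
        ("pup", "pup"),
        ("pua", "pua"),
        ("riskware", "riskware"),
        ("bundler", "pua"),
    ):
        if tag in tag_set:
            return label

    # External labels are passed through unchanged ("as labeled").
    if misp_family_type:
        return misp_family_type
    if vt_label:
        return vt_label
    return "malware"
```

Because the rules are ordered, a sample with a `PUA.` ClamAV signature and an `adware` tag comes out as `pua`, not `adware`.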
## License
Data is sourced from publicly available threat intelligence feeds. Each source has its own terms of use. See the respective source websites for details.