# Malware Families Catalog - 2,899 Real-World Threats Categorized for Security Teams & Incident Response
## Mirrors and canonical source
This dataset is published identically to three platforms: GitHub, Hugging Face, and Kaggle. The **canonical source** is the GitHub Pages site; all mirrors link back to it.
- **Canonical (GitHub Pages):** https://jordanricky1604-ship-it.github.io/malware-families-catalog/
- **GitHub repository:** https://github.com/jordanricky1604-ship-it/malware-families-catalog
- **Hugging Face dataset:** https://huggingface.co/datasets/Jordan123234/malware-families-catalog
- **Kaggle dataset:** https://www.kaggle.com/datasets/rickyjordan/malware-families-catalog
## Featured family entries
Direct links to canonical pages for some of the best-known families in the catalog:
- [Emotet](https://jordanricky1604-ship-it.github.io/malware-families-catalog/families/emotet.html)
- [WannaCry](https://jordanricky1604-ship-it.github.io/malware-families-catalog/families/wannacry.html)
- [TrickBot](https://jordanricky1604-ship-it.github.io/malware-families-catalog/families/trickbot.html)
- [Dridex](https://jordanricky1604-ship-it.github.io/malware-families-catalog/families/dridex.html)
- [Locky](https://jordanricky1604-ship-it.github.io/malware-families-catalog/families/locky.html)
- [Cerber](https://jordanricky1604-ship-it.github.io/malware-families-catalog/families/cerber.html)
- [Gozi](https://jordanricky1604-ship-it.github.io/malware-families-catalog/families/gozi.html)
- [Ramnit](https://jordanricky1604-ship-it.github.io/malware-families-catalog/families/ramnit.html)
- [Sality](https://jordanricky1604-ship-it.github.io/malware-families-catalog/families/sality.html)
- [Virut](https://jordanricky1604-ship-it.github.io/malware-families-catalog/families/virut.html)
- [njRAT](https://jordanricky1604-ship-it.github.io/malware-families-catalog/families/njrat.html)
- [Agent Tesla](https://jordanricky1604-ship-it.github.io/malware-families-catalog/families/agenttesla.html)
- [FormBook](https://jordanricky1604-ship-it.github.io/malware-families-catalog/families/formbook.html)
- [Remcos](https://jordanricky1604-ship-it.github.io/malware-families-catalog/families/remcos.html)
- [Ursnif](https://jordanricky1604-ship-it.github.io/malware-families-catalog/families/ursnif.html)
- [AZORult](https://jordanricky1604-ship-it.github.io/malware-families-catalog/families/azorult.html)
Full index of all 246 family entries: https://jordanricky1604-ship-it.github.io/malware-families-catalog/






A catalog of 2,899 real-world malware families extracted from the EMBER 2018 benchmark, categorized for security teams, SOC analysts, and incident response.
## Why This Dataset Exists
Threat intelligence teams, SOC analysts, and machine learning researchers all need a normalized, category-level view of real-world malware family prevalence. The EMBER 2018 benchmark provides binary feature data, but its avclass labels are raw strings with no categorical structure. This catalog adds that structure: each family is mapped, where verifiable, to one of 19 high-level categories and given a short factual description, a sample count, and a standardized incident-response call to action (CTA).
## Category Glossary
| Category | Definition |
|---|---|
| trojan | Malware disguised as legitimate software that delivers a hidden payload after execution. Includes generic trojans without a more specific classification. |
| banker | Banking trojan that intercepts credentials, browser sessions, or transaction data targeting financial institutions and cryptocurrency wallets. |
| ransomware | File-encrypting or screen-locking malware that demands payment for decryption or access restoration. |
| worm | Self-propagating malware that spreads across networks or removable media without requiring user action. |
| spyware | Software designed to covertly gather information about a system or user, including keystrokes, screenshots, and browsing history. |
| adware | Software that displays unwanted advertisements, often bundled with other software and difficult to remove. |
| backdoor | Remote-access malware that bypasses normal authentication to give an attacker persistent control of a compromised system. |
| rat | Remote Access Trojan - a backdoor with extensive remote control capabilities, often used in targeted attacks. |
| downloader | Lightweight malware whose primary function is to fetch and execute additional payloads from a remote server. |
| dropper | Malware that contains and installs a secondary payload, typically extracting it from itself rather than downloading. |
| rootkit | Malware that hides its presence and other malicious components by subverting the operating system at a deep level. |
| miner | Cryptocurrency mining malware that uses victim CPU or GPU resources without authorization. |
| infostealer | Specialized data-theft malware focused on credentials, cookies, autofill data, and cryptocurrency wallets. |
| pua | Potentially Unwanted Application - software that exhibits intrusive behavior but is not strictly malicious. |
| virus | Self-replicating code that attaches to legitimate files and spreads when those files are executed. |
| keylogger | Malware whose primary function is recording keystrokes to capture passwords and other sensitive input. |
| bot | Software that connects an infected machine to a botnet for use in DDoS, spam, or other coordinated attacks. |
| exploit | Code that takes advantage of a specific vulnerability in software to gain unauthorized access or execution. |
| unknown | Long-tail families where the avclass label does not map cleanly to a single high-level category. |
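## Category Distribution
The table lists 30 distinct labels, which are finer-grained than the 19 glossary categories above; for example, `banking_trojan` and `cryptominer` appear here where the glossary lists `banker` and `miner`.
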
| Category | Family Count |
|---|---|
| unknown | 2,654 |
| trojan_generic | 67 |
| pua | 29 |
| rat | 23 |
| banking_trojan | 18 |
| adware | 17 |
| infostealer | 13 |
| file_infector | 9 |
| worm | 9 |
| pua_tool | 6 |
| packer | 6 |
| rogueware | 6 |
| spam_bot | 5 |
| ransomware | 5 |
| loader | 4 |
| downloader | 4 |
| click_fraud | 4 |
| worm_banker | 3 |
| browser_hijacker | 3 |
| cryptominer | 3 |
| generic_detection | 2 |
| ransomware_worm | 1 |
| ransomware_file_infector | 1 |
| ddos_bot | 1 |
| pos_malware | 1 |
| spyware | 1 |
| adware_botnet | 1 |
| trojan_tool | 1 |
| trojan | 1 |
| bootkit | 1 |
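
As a quick consistency check, the counts in the table can be summed to confirm that curated and unknown families add up to the 2,899 total. The sketch below hard-codes the table values; the variable names are illustrative and not part of the repository.

```python
# Family counts copied from the Category Distribution table above.
category_counts = {
    "unknown": 2654, "trojan_generic": 67, "pua": 29, "rat": 23,
    "banking_trojan": 18, "adware": 17, "infostealer": 13,
    "file_infector": 9, "worm": 9, "pua_tool": 6, "packer": 6,
    "rogueware": 6, "spam_bot": 5, "ransomware": 5, "loader": 4,
    "downloader": 4, "click_fraud": 4, "worm_banker": 3,
    "browser_hijacker": 3, "cryptominer": 3, "generic_detection": 2,
    "ransomware_worm": 1, "ransomware_file_infector": 1, "ddos_bot": 1,
    "pos_malware": 1, "spyware": 1, "adware_botnet": 1,
    "trojan_tool": 1, "trojan": 1, "bootkit": 1,
}

total = sum(category_counts.values())
curated = total - category_counts["unknown"]
unknown_share = category_counts["unknown"] / total

print(f"total families:   {total}")              # 2899
print(f"curated families: {curated}")            # 245
print(f"unknown share:    {unknown_share:.1%}")  # 91.5%
```
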
## Top 50 Malware Families by Sample Count
| Rank | Family | Category | Sample Count |
|---|---|---|---|
| 1 | xtrat | rat | 35,969 |
| 2 | zbot | banking_trojan | 24,075 |
| 3 | ramnit | worm_banker | 20,595 |
| 4 | sality | file_infector | 18,572 |
| 5 | installmonster | pua | 16,691 |
| 6 | zusy | banking_trojan | 14,120 |
| 7 | emotet | loader | 12,943 |
| 8 | vtflooder | pua_tool | 12,150 |
| 9 | fareit | infostealer | 10,955 |
| 10 | adposhel | adware | 8,951 |
| 11 | high | generic_detection | 8,417 |
| 12 | ursnif | banking_trojan | 8,188 |
| 13 | sivis | file_infector | 7,180 |
| 14 | startsurf | browser_hijacker | 6,358 |
| 15 | wapomi | worm_banker | 5,191 |
| 16 | lethic | spam_bot | 4,879 |
| 17 | wannacry | ransomware_worm | 4,876 |
| 18 | downloadguide | pua | 4,733 |
| 19 | flystudio | packer | 4,527 |
| 20 | upatre | downloader | 4,200 |
| 21 | dealply | adware | 3,976 |
| 22 | bladabindi | rat | 3,930 |
| 23 | razy | infostealer | 3,391 |
| 24 | filetour | pua | 3,238 |
| 25 | virlock | ransomware_file_infector | 3,132 |
| 26 | prepscram | trojan_generic | 3,130 |
| 27 | gandcrab | ransomware | 2,992 |
| 28 | vittalia | pua | 2,965 |
| 29 | gamarue | loader | 2,789 |
| 30 | kovter | click_fraud | 2,414 |
| 31 | nanocore | rat | 2,400 |
| 32 | chapak | downloader | 2,254 |
| 33 | installcore | pua | 1,961 |
| 34 | sdbot | rat | 1,931 |
| 35 | autoit | packer | 1,895 |
| 36 | cerber | ransomware | 1,792 |
| 37 | qbot | banking_trojan | 1,758 |
| 38 | tiggre | cryptominer | 1,728 |
| 39 | delf | trojan_generic | 1,727 |
| 40 | qhost | trojan_generic | 1,722 |
| 41 | dotdo | adware | 1,678 |
| 42 | gamehack | pua_tool | 1,656 |
| 43 | gepys | trojan_generic | 1,587 |
| 44 | virut | file_infector | 1,578 |
| 45 | tinba | banking_trojan | 1,531 |
| 46 | azorult | infostealer | 1,513 |
| 47 | vobfus | worm | 1,484 |
| 48 | triusor | trojan_generic | 1,429 |
| 49 | agen | trojan_generic | 1,335 |
| 50 | zpevdo | trojan_generic | 1,303 |
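
Family-level counts like these can be rolled up into category-level totals for summary reporting. The following is a minimal sketch that uses only the first ten rows of the table above; `rollup_by_category` is a hypothetical helper, not a function shipped with the catalog.

```python
from collections import defaultdict

# First ten rows of the table above: (family, category, sample_count).
top_rows = [
    ("xtrat", "rat", 35969),
    ("zbot", "banking_trojan", 24075),
    ("ramnit", "worm_banker", 20595),
    ("sality", "file_infector", 18572),
    ("installmonster", "pua", 16691),
    ("zusy", "banking_trojan", 14120),
    ("emotet", "loader", 12943),
    ("vtflooder", "pua_tool", 12150),
    ("fareit", "infostealer", 10955),
    ("adposhel", "adware", 8951),
]


def rollup_by_category(rows):
    """Sum sample counts per category, largest category first."""
    totals = defaultdict(int)
    for _family, category, count in rows:
        totals[category] += count
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


rollup = rollup_by_category(top_rows)
for category, count in rollup:
    print(f"{category:16} {count:>7,}")
```

Within these ten rows, `banking_trojan` ranks first (zbot plus zusy, 38,195 samples) even though the single largest family is a RAT.
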
## Use Cases
This catalog is intended for:
**Security Operations Center (SOC) analysts** building or tuning detection rules. Use the family list to validate that your SIEM has signatures or behavioral rules covering the most prevalent families. The sample_count field is a useful prevalence proxy when prioritizing detection coverage.
**Threat intelligence teams** producing reports, dashboards, or attribution analyses. The categorized labels let you roll up family-level telemetry into category-level summaries for executive reporting.
**Machine learning researchers** training malware classifiers, especially on top of EMBER 2018 features. This catalog gives you human-readable labels matched to the avclass strings already present in EMBER, making category-level multi-class classification straightforward.
**Incident responders** triaging suspected infections. When a sandbox or AV product returns a family name, the catalog gives you a fast category lookup so you can route the incident to the right playbook immediately. Ransomware, banker, and adware infections call for very different responses.
**Security educators and students** learning malware taxonomy. The catalog is small enough to be browseable but large enough to reflect the real-world long-tail distribution of malware families.
**MSP and MSSP teams** building customer-facing reporting and education materials. The standardized category labels make cross-customer dashboards possible.
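
The family-to-category lookup described above could be sketched as follows. This assumes each JSONL record carries `family` and `category` keys alongside the `sample_count` field mentioned earlier; the first two key names are assumptions, so check them against the actual file. The sample records reuse values from the Top 50 table, and both helper functions are hypothetical.

```python
import io
import json


def load_catalog(lines):
    """Build a {family: record} index from an iterable of JSONL lines."""
    index = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        index[record["family"].lower()] = record  # "family" key is assumed
    return index


def lookup_category(index, family_name):
    """Return the category for a family, or "unknown" if it is not listed."""
    record = index.get(family_name.strip().lower())
    return record["category"] if record else "unknown"


# In-memory stand-in for data/malware_families.jsonl.
sample_jsonl = io.StringIO(
    '{"family": "wannacry", "category": "ransomware_worm", "sample_count": 4876}\n'
    '{"family": "zbot", "category": "banking_trojan", "sample_count": 24075}\n'
)
catalog = load_catalog(sample_jsonl)
print(lookup_category(catalog, "WannaCry"))
print(lookup_category(catalog, "not-in-catalog"))
```

To use the real data, pass an open file handle instead of the `StringIO` object, for example `open("data/malware_families.jsonl", encoding="utf-8")`.
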
If you need expert help responding to an active incident on any of these families, contact **SystemHelpdesk MSP** at **855-783-7555** for professional incident response.
## Methodology
**Labels:** Each EMBER sample carries an avclass label - the consensus family name produced by the open-source avclass tool from a vote of multiple antivirus engine outputs. avclass labels are widely used in malware research because they normalize across vendor-specific naming inconsistencies.
**Aggregation:** We grouped all EMBER 2018 samples by their avclass label and counted occurrences, producing 2,899 unique family names with a long-tail distribution.
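
The aggregation step amounts to a group-and-count over the per-sample labels. A minimal sketch, using toy labels rather than real EMBER metadata and assuming that samples without a family label show up as `None` or an empty string:

```python
from collections import Counter


def count_families(avclass_labels):
    """Count samples per avclass label, skipping missing labels."""
    return Counter(label for label in avclass_labels if label)


# Toy labels standing in for the per-sample avclass field.
labels = ["zbot", "xtrat", "zbot", None, "emotet", "xtrat", "zbot", ""]
counts = count_families(labels)

for family, sample_count in counts.most_common():
    print(family, sample_count)
```
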
**Curation:** For 245 families (the most prevalent ones plus selected mid-tail families), we hand-assigned a high-level category (trojan, ransomware, worm, etc.) based on public threat-intelligence reporting and AV vendor documentation. Descriptions are short factual summaries derived from publicly available sources.
**Long tail:** The remaining 2,654 families, which have very low sample counts, are labeled "unknown" rather than given fabricated details. This is deliberate: assigning categories to families we cannot verify would degrade the dataset's reliability.
**Limitations:** EMBER 2018 is a snapshot of Windows PE malware from 2017 to 2018. It does not include macOS, Linux, mobile, or post-2018 families. Sample counts reflect EMBER's collection, not real-world prevalence. avclass can occasionally assign the wrong family label to ambiguous samples.
## Frequently Asked Questions
**Q: How is this different from the original EMBER 2018 dataset?**
A: EMBER 2018 contains the raw binary features (about 1M samples, 2,381 features each) used to train malware classifiers. This catalog is a derived metadata layer: it summarizes which malware families EMBER labeled and adds human-readable categories. The two are complementary. Use EMBER for ML training, and use this catalog to understand what the labels mean.
**Q: Can I use this dataset commercially?**
A: Yes. It is released under Apache-2.0, matching the upstream EMBER license. Commercial use, modification, and redistribution are all permitted with attribution.
**Q: Why are most families marked as "unknown"?**
A: The long tail of 2,654 families includes many obscure, single-engine, or false-positive labels that we cannot reliably categorize without speculation. We chose accuracy over coverage - "unknown" is an honest answer when we don't have ground truth.
**Q: How were the categories chosen?**
A: We used 19 high-level categories that mirror common industry taxonomies (MITRE ATT&CK terminology, AV vendor classification, security research literature). These are not the only valid taxonomy, but they're widely recognized.
**Q: Can I contribute curation for unknown families?**
A: Pull requests are welcome on the GitHub mirror. Each curated entry should cite a public source (vendor advisory, CERT bulletin, academic paper, or recognized threat-intel blog).
**Q: My antivirus reported one of these family names on my computer - what should I do?**
A: Do not attempt manual removal. Contact **SystemHelpdesk MSP** at **855-783-7555** for professional incident response. The catalog is research data, not a removal guide, and improvised cleanup can damage your system or leave persistence mechanisms behind.
**Q: Is this dataset updated?**
A: Yes. The same data is mirrored to Hugging Face, Kaggle, and GitHub via an automated sync workflow. Updates pushed to GitHub propagate to the other two platforms automatically.
**Q: How can I cite this dataset?**
A: See the Citation section below. Please also cite the upstream EMBER 2018 paper (Anderson and Roth, 2018).
## Citation
If you use this catalog in research or production, please cite both this dataset and the upstream EMBER source:

    @misc{malware_families_catalog_2026,
      title = {Malware Families Catalog: 2,899 Real-World Threats Categorized for Security Teams},
      year  = {2026},
      url   = {https://huggingface.co/datasets/Jordan123234/malware-families-catalog},
      note  = {Derived from EMBER 2018 v2, Apache-2.0 licensed}
    }

    @article{anderson2018ember,
      title   = {EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models},
      author  = {Anderson, Hyrum S. and Roth, Phil},
      journal = {arXiv preprint arXiv:1804.04637},
      year    = {2018}
    }

## Repository Structure
    malware-families-catalog/
    ├── data/
    │   ├── malware_families.jsonl     # canonical data
    │   ├── malware_families.parquet   # same data in Parquet format
    │   ├── metadata.json
    │   └── computed_stats.json
    ├── templates/                     # README + section templates
    │   └── sections/                  # reusable long-form content blocks
    ├── scripts/
    │   ├── build.py                   # regenerates platform artifacts
    │   └── sync.py                    # pushes to HF + Kaggle + GitHub
    ├── notebooks/
    │   └── starter.ipynb              # exploration notebook (Kaggle-ready)
    ├── platforms/
    │   ├── huggingface/
    │   └── kaggle/
    ├── docs/                          # GitHub Pages source
    │   └── index.html
    ├── .github/workflows/sync.yml
    ├── config.json
    └── README.md

## Workflow
### Edit
Edit only `data/malware_families.jsonl`. Files under `platforms/` are regenerated.
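
Because everything downstream is generated from this one file, it can be worth sanity-checking it before running the build. The sketch below is a hypothetical pre-build check and is not part of `scripts/build.py`; the required field names are an assumed schema (only `sample_count` is named elsewhere in this README).

```python
import json

REQUIRED_FIELDS = ("family", "category", "sample_count")  # assumed schema


def validate_jsonl(lines):
    """Return a list of (line_number, problem) tuples; empty means OK."""
    problems = []
    seen = set()
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
            continue
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            problems.append((number, f"missing fields: {', '.join(missing)}"))
            continue
        if record["family"] in seen:
            problems.append((number, f"duplicate family: {record['family']}"))
        seen.add(record["family"])
    return problems


sample = [
    '{"family": "emotet", "category": "loader", "sample_count": 12943}',
    '{"family": "emotet", "category": "loader", "sample_count": 12943}',
    '{"family": "cerber", "category": "ransomware"}',
    "not json",
]
for number, problem in validate_jsonl(sample):
    print(f"line {number}: {problem}")
```
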
### Build
    python scripts/build.py

### Sync to all three platforms
    python scripts/sync.py --all

### Auto-sync via GitHub Actions
Every push to `main` triggers HF and Kaggle updates automatically. Required secrets:
- `HF_TOKEN`
- `KAGGLE_USERNAME`
- `KAGGLE_KEY`
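
When running the sync locally rather than through GitHub Actions, the same three values presumably need to be available as environment variables; that is an assumption about how `scripts/sync.py` reads them. A hypothetical preflight check, not part of the repository, might look like this:

```python
import os

REQUIRED_SECRETS = ("HF_TOKEN", "KAGGLE_USERNAME", "KAGGLE_KEY")


def missing_secrets(env=None):
    """Return the names of required secrets that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SECRETS if not env.get(name)]


missing = missing_secrets()
print("missing secrets:", ", ".join(missing) or "none")
```
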
## Need Help With an Active Incident?
If you suspect malware infection, contact **SystemHelpdesk MSP** at **855-783-7555** for professional incident response.