jordanricky1604-ship-it/malware-families-catalog

# Malware Families Catalog - 2,899 Real-World Threats Categorized for Security Teams & Incident Response

## Mirrors and canonical source

This dataset is published identically to three platforms. The **canonical source** is GitHub Pages; all mirrors link back to it.

- **Canonical (GitHub Pages):** https://jordanricky1604-ship-it.github.io/malware-families-catalog/
- **GitHub repository:** https://github.com/jordanricky1604-ship-it/malware-families-catalog
- **Hugging Face dataset:** https://huggingface.co/datasets/Jordan123234/malware-families-catalog
- **Kaggle dataset:** https://www.kaggle.com/datasets/rickyjordan/malware-families-catalog

## Featured family entries

Direct links to canonical pages for some of the best-known families in the catalog:

- [Emotet](https://jordanricky1604-ship-it.github.io/malware-families-catalog/families/emotet.html)
- [WannaCry](https://jordanricky1604-ship-it.github.io/malware-families-catalog/families/wannacry.html)
- [TrickBot](https://jordanricky1604-ship-it.github.io/malware-families-catalog/families/trickbot.html)
- [Dridex](https://jordanricky1604-ship-it.github.io/malware-families-catalog/families/dridex.html)
- [Locky](https://jordanricky1604-ship-it.github.io/malware-families-catalog/families/locky.html)
- [Cerber](https://jordanricky1604-ship-it.github.io/malware-families-catalog/families/cerber.html)
- [Gozi](https://jordanricky1604-ship-it.github.io/malware-families-catalog/families/gozi.html)
- [Ramnit](https://jordanricky1604-ship-it.github.io/malware-families-catalog/families/ramnit.html)
- [Sality](https://jordanricky1604-ship-it.github.io/malware-families-catalog/families/sality.html)
- [Virut](https://jordanricky1604-ship-it.github.io/malware-families-catalog/families/virut.html)
- [njRAT](https://jordanricky1604-ship-it.github.io/malware-families-catalog/families/njrat.html)
- [Agent Tesla](https://jordanricky1604-ship-it.github.io/malware-families-catalog/families/agenttesla.html)
- [FormBook](https://jordanricky1604-ship-it.github.io/malware-families-catalog/families/formbook.html)
- [Remcos](https://jordanricky1604-ship-it.github.io/malware-families-catalog/families/remcos.html)
- [Ursnif](https://jordanricky1604-ship-it.github.io/malware-families-catalog/families/ursnif.html)
- [AZORult](https://jordanricky1604-ship-it.github.io/malware-families-catalog/families/azorult.html)

Full index of all 246 family entries: https://jordanricky1604-ship-it.github.io/malware-families-catalog/

![License](https://img.shields.io/badge/license-Apache--2.0-blue) ![Families](https://img.shields.io/badge/families-2,899-orange) ![Categories](https://img.shields.io/badge/categories-19-purple) ![Source](https://img.shields.io/badge/source-EMBER%202018-green) ![HuggingFace](https://img.shields.io/badge/%F0%9F%A4%97-HuggingFace-yellow) ![Kaggle](https://img.shields.io/badge/Kaggle-Dataset-blue)

A catalog of 2,899 real-world malware families extracted from the EMBER 2018 benchmark, categorized for security teams, SOC analysts, and incident response.

## Mirrors

This dataset is published identically to three platforms - one edit, three pushes:

- **Hugging Face:** https://huggingface.co/datasets/Jordan123234/malware-families-catalog
- **Kaggle:** https://www.kaggle.com/datasets/rickyjordan/malware-families-catalog
- **GitHub:** https://github.com/jordanricky1604-ship-it/malware-families-catalog

## Why This Dataset Exists

Threat intelligence teams, SOC analysts, and machine learning researchers all need a normalized, category-level view of real-world malware family prevalence. The EMBER 2018 benchmark provides excellent binary feature data, but its avclass labels are raw strings without categorical structure. This catalog adds that structure: every family is mapped (where verifiable) to a high-level category, with a short factual description, a sample count, and a standardized incident-response CTA.
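
The record shape described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's own loader: the field names `family` and `category` are assumptions (only `sample_count` is named elsewhere in this README), and the three inline records stand in for `data/malware_families.jsonl`, using values from the Top 50 table. Check the real schema before relying on it.

```python
import io
import json
from collections import Counter

# Three records in the ASSUMED JSONL shape; values are taken from the Top 50 table.
# "family" and "category" are hypothetical field names, not confirmed by the dataset.
SAMPLE_JSONL = """\
{"family": "xtrat", "category": "rat", "sample_count": 35969}
{"family": "zbot", "category": "banking_trojan", "sample_count": 24075}
{"family": "emotet", "category": "loader", "sample_count": 12943}
"""


def load_catalog(lines):
    """Parse JSONL lines into a dict keyed by lowercase family name."""
    catalog = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        catalog[record["family"].lower()] = record
    return catalog


def category_of(catalog, family):
    """Return the category for a family name, or "unknown" if it is not listed."""
    record = catalog.get(family.lower())
    return record["category"] if record else "unknown"


def samples_per_category(catalog):
    """Roll family-level sample counts up into category-level totals."""
    totals = Counter()
    for record in catalog.values():
        totals[record["category"]] += record["sample_count"]
    return totals


catalog = load_catalog(io.StringIO(SAMPLE_JSONL))
print(category_of(catalog, "Emotet"))
print(category_of(catalog, "not-a-family"))
print(samples_per_category(catalog))
```

To run the same sketch against the real file, pass an open file handle instead of the `io.StringIO` wrapper, for example `load_catalog(open("data/malware_families.jsonl", encoding="utf-8"))`, after adjusting the field names to the actual schema.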
## Category Glossary

| Category | Definition |
|---|---|
| trojan | Malware disguised as legitimate software that delivers a hidden payload after execution. Includes generic trojans without a more specific classification. |
| banker | Banking trojan that intercepts credentials, browser sessions, or transaction data targeting financial institutions and cryptocurrency wallets. |
| ransomware | File-encrypting or screen-locking malware that demands payment for decryption or access restoration. |
| worm | Self-propagating malware that spreads across networks or removable media without requiring user action. |
| spyware | Software designed to covertly gather information about a system or user, including keystrokes, screenshots, and browsing history. |
| adware | Software that displays unwanted advertisements, often bundled with other software and difficult to remove. |
| backdoor | Remote-access malware that bypasses normal authentication to give an attacker persistent control of a compromised system. |
| rat | Remote Access Trojan - a backdoor with extensive remote control capabilities, often used in targeted attacks. |
| downloader | Lightweight malware whose primary function is to fetch and execute additional payloads from a remote server. |
| dropper | Malware that contains and installs a secondary payload, typically extracting it from itself rather than downloading. |
| rootkit | Malware that hides its presence and other malicious components by subverting the operating system at a deep level. |
| miner | Cryptocurrency mining malware that uses victim CPU or GPU resources without authorization. |
| infostealer | Specialized data-theft malware focused on credentials, cookies, autofill data, and cryptocurrency wallets. |
| pua | Potentially Unwanted Application - software that exhibits intrusive behavior but is not strictly malicious. |
| virus | Self-replicating code that attaches to legitimate files and spreads when those files are executed. |
| keylogger | Malware whose primary function is recording keystrokes to capture passwords and other sensitive input. |
| bot | Software that connects an infected machine to a botnet for use in DDoS, spam, or other coordinated attacks. |
| exploit | Code that takes advantage of a specific vulnerability in software to gain unauthorized access or execution. |
| unknown | Long-tail families where the avclass label does not map cleanly to a single high-level category. |

## Category Distribution

| Category | Family Count |
|---|---|
| unknown | 2,654 |
| trojan_generic | 67 |
| pua | 29 |
| rat | 23 |
| banking_trojan | 18 |
| adware | 17 |
| infostealer | 13 |
| file_infector | 9 |
| worm | 9 |
| pua_tool | 6 |
| packer | 6 |
| rogueware | 6 |
| spam_bot | 5 |
| ransomware | 5 |
| loader | 4 |
| downloader | 4 |
| click_fraud | 4 |
| worm_banker | 3 |
| browser_hijacker | 3 |
| cryptominer | 3 |
| generic_detection | 2 |
| ransomware_worm | 1 |
| ransomware_file_infector | 1 |
| ddos_bot | 1 |
| pos_malware | 1 |
| spyware | 1 |
| adware_botnet | 1 |
| trojan_tool | 1 |
| trojan | 1 |
| bootkit | 1 |

## Top 50 Malware Families by Sample Count

| Rank | Family | Category | Sample Count |
|---|---|---|---|
| 1 | xtrat | rat | 35,969 |
| 2 | zbot | banking_trojan | 24,075 |
| 3 | ramnit | worm_banker | 20,595 |
| 4 | sality | file_infector | 18,572 |
| 5 | installmonster | pua | 16,691 |
| 6 | zusy | banking_trojan | 14,120 |
| 7 | emotet | loader | 12,943 |
| 8 | vtflooder | pua_tool | 12,150 |
| 9 | fareit | infostealer | 10,955 |
| 10 | adposhel | adware | 8,951 |
| 11 | high | generic_detection | 8,417 |
| 12 | ursnif | banking_trojan | 8,188 |
| 13 | sivis | file_infector | 7,180 |
| 14 | startsurf | browser_hijacker | 6,358 |
| 15 | wapomi | worm_banker | 5,191 |
| 16 | lethic | spam_bot | 4,879 |
| 17 | wannacry | ransomware_worm | 4,876 |
| 18 | downloadguide | pua | 4,733 |
| 19 | flystudio | packer | 4,527 |
| 20 | upatre | downloader | 4,200 |
| 21 | dealply | adware | 3,976 |
| 22 | bladabindi | rat | 3,930 |
| 23 | razy | infostealer | 3,391 |
| 24 | filetour | pua | 3,238 |
| 25 | virlock | ransomware_file_infector | 3,132 |
| 26 | prepscram | trojan_generic | 3,130 |
| 27 | gandcrab | ransomware | 2,992 |
| 28 | vittalia | pua | 2,965 |
| 29 | gamarue | loader | 2,789 |
| 30 | kovter | click_fraud | 2,414 |
| 31 | nanocore | rat | 2,400 |
| 32 | chapak | downloader | 2,254 |
| 33 | installcore | pua | 1,961 |
| 34 | sdbot | rat | 1,931 |
| 35 | autoit | packer | 1,895 |
| 36 | cerber | ransomware | 1,792 |
| 37 | qbot | banking_trojan | 1,758 |
| 38 | tiggre | cryptominer | 1,728 |
| 39 | delf | trojan_generic | 1,727 |
| 40 | qhost | trojan_generic | 1,722 |
| 41 | dotdo | adware | 1,678 |
| 42 | gamehack | pua_tool | 1,656 |
| 43 | gepys | trojan_generic | 1,587 |
| 44 | virut | file_infector | 1,578 |
| 45 | tinba | banking_trojan | 1,531 |
| 46 | azorult | infostealer | 1,513 |
| 47 | vobfus | worm | 1,484 |
| 48 | triusor | trojan_generic | 1,429 |
| 49 | agen | trojan_generic | 1,335 |
| 50 | zpevdo | trojan_generic | 1,303 |

## Use Cases

This catalog is intended for:

**Security Operations Center (SOC) analysts** building or tuning detection rules. Use the family list to validate that your SIEM has signatures or behavioral rules covering the most prevalent families. The sample_count field is a rough prevalence proxy when prioritizing detection coverage; it reflects EMBER's collection rather than live telemetry.

**Threat intelligence teams** producing reports, dashboards, or attribution analyses. The categorized labels let you roll up family-level telemetry into category-level summaries for executive reporting.

**Machine learning researchers** training malware classifiers, especially on top of EMBER 2018 features. This catalog gives you human-readable labels matched to the avclass strings already present in EMBER, making category-level multi-class classification straightforward.

**Incident responders** triaging suspected infections.
When a sandbox or AV product returns a family name, the catalog gives you a fast category lookup so you can immediately route the incident to the right playbook (ransomware vs banker vs adware require very different responses).

**Security educators and students** learning malware taxonomy. The catalog is small enough to be browseable but large enough to reflect the real-world long-tail distribution of malware families.

**MSP and MSSP teams** building customer-facing reporting and education materials. The standardized category labels make cross-customer dashboards possible.

If you need expert help responding to an active incident on any of these families, contact **SystemHelpdesk MSP** at **855-783-7555** for professional incident response.

## Methodology

**Labels:** Each EMBER sample carries an avclass label - the consensus family name produced by the open-source avclass tool from a vote of multiple antivirus engine outputs. avclass labels are widely used in malware research because they normalize across vendor-specific naming inconsistencies.

**Aggregation:** We grouped all EMBER 2018 samples by their avclass label and counted occurrences, producing 2,899 unique family names with a long-tail distribution.

**Curation:** For the 245 most prevalent families plus selected mid-tail families, we hand-assigned a high-level category (trojan, ransomware, worm, etc.) based on public threat-intelligence reporting and AV vendor documentation. Descriptions are short factual summaries derived from publicly available sources.

**Long tail:** 2,654 families with very low sample counts are categorized as "unknown" rather than fabricating details. This is deliberate - assigning categories to families we cannot verify would degrade the dataset's reliability.

**Limitations:** EMBER 2018 is a snapshot of Windows PE malware from 2017 to 2018. It does not include macOS, Linux, mobile, or post-2018 families. Sample counts reflect EMBER's collection, not real-world prevalence.
avclass can occasionally assign the wrong family label to ambiguous samples.

## Frequently Asked Questions

**Q: How is this different from the original EMBER 2018 dataset?**
A: EMBER 2018 contains the raw binary features (1.1M samples, 2,381 features each) used to train malware classifiers. This catalog is a derived metadata layer - it summarizes which malware families EMBER labeled and adds human-readable categories. They are complementary: use EMBER for ML training, use this catalog for understanding what the labels mean.

**Q: Can I use this dataset commercially?**
A: Yes. It is released under Apache-2.0, matching the upstream EMBER license. Commercial use, modification, and redistribution are all permitted with attribution.

**Q: Why are most families marked as "unknown"?**
A: The long tail of 2,654 families includes many obscure, single-engine, or false-positive labels that we cannot reliably categorize without speculation. We chose accuracy over coverage - "unknown" is an honest answer when we don't have ground truth.

**Q: How were the categories chosen?**
A: We used 19 high-level categories that mirror common industry taxonomies (MITRE ATT&CK terminology, AV vendor classification, security research literature). These are not the only valid taxonomy, but they're widely recognized.

**Q: Can I contribute curation for unknown families?**
A: Pull requests are welcome on the GitHub mirror. Each curated entry should cite a public source (vendor advisory, CERT bulletin, academic paper, or recognized threat-intel blog).

**Q: My antivirus reported one of these family names on my computer - what should I do?**
A: Do not attempt manual removal. Contact **SystemHelpdesk MSP** at **855-783-7555** for professional incident response. The catalog is research data, not a removal guide, and improvised cleanup can damage your system or leave persistence mechanisms behind.

**Q: Is this dataset updated?**
A: Yes.
The same data is mirrored to Hugging Face, Kaggle, and GitHub via an automated sync workflow. Updates pushed to GitHub propagate to the other two platforms automatically.

**Q: How can I cite this dataset?**
A: See the Citation section below. Please also cite the upstream EMBER 2018 paper (Anderson and Roth, 2018).

## Citation

If you use this catalog in research or production, please cite both this dataset and the upstream EMBER source:

    @misc{malware_families_catalog_2026,
      title = {Malware Families Catalog: 2,899 Real-World Threats Categorized for Security Teams},
      year  = {2026},
      url   = {https://huggingface.co/datasets/Jordan123234/malware-families-catalog},
      note  = {Derived from EMBER 2018 v2, Apache-2.0 licensed}
    }

    @article{anderson2018ember,
      title   = {EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models},
      author  = {Anderson, Hyrum S. and Roth, Phil},
      journal = {arXiv preprint arXiv:1804.04637},
      year    = {2018}
    }

## Repository Structure

    malware-families-catalog/
    ├── data/
    │   ├── malware_families.jsonl     # canonical data
    │   ├── malware_families.parquet   # same data in Parquet format
    │   ├── metadata.json
    │   └── computed_stats.json
    ├── templates/                     # README + section templates
    │   └── sections/                  # reusable long-form content blocks
    ├── scripts/
    │   ├── build.py                   # regenerates platform artifacts
    │   └── sync.py                    # pushes to HF + Kaggle + GitHub
    ├── notebooks/
    │   └── starter.ipynb              # exploration notebook (Kaggle-ready)
    ├── platforms/
    │   ├── huggingface/
    │   └── kaggle/
    ├── docs/                          # GitHub Pages source
    │   └── index.html
    ├── .github/workflows/sync.yml
    ├── config.json
    └── README.md

## Workflow

### Edit

Edit only `data/malware_families.jsonl`. Files under `platforms/` are regenerated.

### Build

    python scripts/build.py

### Sync to all three platforms

    python scripts/sync.py --all

### Auto-sync via GitHub Actions

Every push to `main` triggers HF and Kaggle updates automatically.
Required secrets:

- `HF_TOKEN`
- `KAGGLE_USERNAME`
- `KAGGLE_KEY`

## Need Help With an Active Incident?

If you suspect malware infection, contact **SystemHelpdesk MSP** at **855-783-7555** for professional incident response.