AnonymousDoubleBlind123/Malware-Analysis-with-LLM
## Malware Analysis with Large Language Models
This repository contains materials to reproduce and understand the experiments from:
- Paper: Malware Analysis with Large Language Models
- Authors: anonymized for double-blind review
- Conference/Journal: EASE 2026, Glasgow, UK
## Summary
We evaluate how large language models (LLMs) respond when given chunked representations of Portable Executable (PE) files. The same fixed prompt is used for every sample. We compare outputs from:
- ChatGPT (4o and 5.2 Thinking)
- Gemini (3 fast)
Goal: quantify and analyze the consistency, accuracy, and failure modes of LLM judgments/explanations given partial PE content.
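To make "chunked representations" concrete, here is a minimal sketch of one way to split a long text dump (for example, disassembler or decompiler output) into bounded pieces. This is an illustration only, not necessarily the chunking used in the paper: the function name `chunk_lines` and the default `max_chars=8000` are assumptions.

```python
from typing import Iterator, List


def chunk_lines(text: str, max_chars: int = 8000) -> Iterator[str]:
    """Yield pieces of `text`, each at most `max_chars` characters long.

    Chunks are cut on line boundaries where possible. A single line longer
    than `max_chars` is hard-split so that no chunk exceeds the limit.
    NOTE: `max_chars` is a hypothetical budget, not a value from the paper.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")

    current: List[str] = []
    size = 0
    for line in text.splitlines(keepends=True):
        # Break an oversized line into pieces of at most max_chars characters.
        pieces = [line[i:i + max_chars] for i in range(0, len(line), max_chars)]
        for piece in pieces:
            # Flush the current chunk if adding this piece would overflow it.
            if current and size + len(piece) > max_chars:
                yield "".join(current)
                current, size = [], 0
            current.append(piece)
            size += len(piece)
    if current:
        yield "".join(current)
```

Joining the yielded chunks in order reproduces the input text exactly, so no content is lost or duplicated between chunks.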
## Tools
- IDA: https://hex-rays.com/ida-free + https://hex-rays.com/classroom
- Ghidra: https://github.com/nationalsecurityagency/ghidra
- Radare2 + r2dec: https://github.com/radareorg/radare2
## PE Samples
Samples
- Benign PEs: https://github.com/iosifache/DikeDataset
- Malicious PEs: https://bazaar.abuse.ch/
- Labels: {benign, malicious}
What is stored in this repo:

For safety and policy compliance, this repo does not contain raw malware binaries. Instead, we provide:
- Cryptographic hashes of the samples
- The prompt (`Prompt.docx`)
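If you download samples yourself, you can check them against the published hashes. The sketch below assumes SHA-256, which is an assumption because this README does not name the hash algorithm. The helper name `sha256_of_file` is illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, block_size: int = 1 << 20) -> str:
    """Return the hex-encoded SHA-256 digest of the file at `path`.

    The file is read in blocks so large binaries are not loaded into
    memory at once. ASSUMPTION: SHA-256 is used here for illustration;
    use whichever algorithm the published hash list actually uses.
    """
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

Handle any malicious samples only inside an isolated analysis environment. Hashing reads the file but never executes it.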
## Prompts
We use a **constant prompt** for all samples to control prompt variability.
- Prompt file: `Prompt.docx`
- Temperature / decoding settings: default
- Output format constraints: binary label (0/1)
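Because responses are collected through the UI, a model may wrap the 0/1 answer in extra text. The following is a minimal sketch of how a raw response could be normalized to a label. The function `parse_binary_label` and its rejection rule are assumptions for illustration, not the procedure used in the paper.

```python
import re
from typing import Optional

# Match a standalone "0" or "1": not part of a longer word or number
# (e.g. "10", "v1") and not part of a decimal such as "0.5" or "1.0".
_LABEL = re.compile(r"(?<![\w.])([01])(?!\w|\.\d)")


def parse_binary_label(response: str) -> Optional[int]:
    """Return 0 or 1 if the response contains exactly one distinct label.

    Returns None when no label is found or when both 0 and 1 appear,
    so ambiguous responses can be reviewed manually instead of guessed.
    """
    found = set(_LABEL.findall(response))
    if len(found) == 1:
        return int(found.pop())
    return None
```

Returning `None` for ambiguous output keeps format violations visible as their own failure mode rather than silently counting them as one class.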
## Model Querying
ChatGPT
- Model: 4o
- Access method: UI
- Date range of runs: 12-2025 to 01-2026
ChatGPT
- Model: 5.2 Thinking
- Access method: UI
- Date range of runs: 02-2026
Gemini
- Model: 3 fast
- Access method: UI
- Date range of runs: 12-2025 to 02-2026
## Evaluation
We report:
- Classification metrics: accuracy, precision, recall, F1
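For reference, the metrics above can be computed from paired ground-truth and predicted labels as in the sketch below. It assumes that label 1 is the positive class (for example, malicious). That mapping is an assumption, since this README does not state which class 0 and 1 denote. The function name `classification_metrics` is illustrative.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 for binary labels.

    ASSUMPTION: label 1 is treated as the positive class. Metrics whose
    denominator is zero are reported as 0.0.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one sample is required")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Running the function once per model on the same sample set gives directly comparable numbers.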