AnonymousDoubleBlind123/Malware-Analysis-with-LLM


## Malware Analysis with Large Language Models

This repository contains materials to reproduce and understand the experiments from:

- Paper: Malware Analysis with Large Language Models
- Authors: Anonymous (double-blind review)
- Conference/Journal: EASE 2026, Glasgow, UK

## Summary

We evaluate how large language models (LLMs) respond when they are given chunked representations of Portable Executable (PE) files, using a constant (fixed) prompt across samples. We compare outputs from:

- ChatGPT (4o/5.2 thinking)
- Gemini (3 fast)

Goal: quantify and analyze the consistency, accuracy, and failure modes of LLM judgments/explanations given partial PE content.

## Tools

- IDA: https://hex-rays.com/ida-free + https://hex-rays.com/classroom
- Ghidra: https://github.com/nationalsecurityagency/ghidra
- Radare2 + r2dec: https://github.com/radareorg/radare2

## PE Samples

Samples

- Benign PEs: https://github.com/iosifache/DikeDataset
- Malicious PEs: https://bazaar.abuse.ch/
- Labels: {benign, malicious}

What is stored in this repo

For safety and policy compliance, this repo does not contain raw malware binaries. Instead, we provide:

- cryptographic hashes
- Prompt

## Prompts

We use a **constant prompt** for all samples to control prompt variability.

- Prompt file: `Prompt.docx`
- Temperature / decoding settings: default
- Output format constraints: 0/1

## Model Querying

ChatGPT

- Model: 4o
- Access method: UI
- Date range of runs: 12-2025 to 01-2026

ChatGPT

- Model: 5.2 Thinking
- Access method: UI
- Date range of runs: 02-2026

Gemini

- Model: 3 fast
- Access method: UI
- Date range of runs: 12-2025 to 02-2026

## Evaluation

We report:

- Classification metrics: accuracy, precision, recall, F1
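
The classification metrics listed above can be computed directly from the 0/1 model outputs and the ground-truth labels. The following is a minimal sketch, not the evaluation script used for the paper. It assumes that `1` encodes `malicious` (the positive class) and `0` encodes `benign`, which this README does not state; the function name `classification_metrics` is hypothetical.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 for binary 0/1 labels.

    Assumption (not stated in this README): 1 = malicious, 0 = benign.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Confusion-matrix counts, with 1 treated as the positive class.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    total = tp + tn + fp + fn
    # Each ratio falls back to 0.0 when its denominator is zero.
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy labels for illustration only; these are not results from the paper.
    truth = [1, 1, 0, 0]
    preds = [1, 0, 0, 0]
    print(classification_metrics(truth, preds))
```

Returning 0.0 for an undefined ratio (for example, precision when a model never predicts `1`) is one common convention; a different choice, such as reporting the value as undefined, may suit some analyses better.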