dahanam/CTF-Challenge-Solving-Assistant
GitHub: dahanam/CTF-Challenge-Solving-Assistant
Stars: 0 | Forks: 0
# CTF Challenge-Solving Assistant — LLM Agent
An AI-powered agent that autonomously solves beginner and intermediate
text-based Capture The Flag (CTF) cybersecurity challenges using GPT-4o,
chain-of-thought prompting, custom CTF tools, and Retrieval-Augmented
Generation (RAG).
**NOTE:** Need to add docker!
## Overview
This project implements and evaluates three progressively advanced LLM
agent strategies for solving CTF challenges without a Docker environment:
| Strategy | Description |
|----------|-------------|
| **Zero-Shot Baseline** | GPT-4o with chain-of-thought prompting only |
| **Few-Shot Prompt Tuning** | GPT-4o with in-context examples |
| **RAG Agent** | GPT-4o + retrieval from a holdout challenge corpus |
## Key Results
| Dataset | Zero-Shot | Few-Shot | RAG |
|---------|-----------|----------|-----|
| InterCode-CTF | 25% | — | **75%** (+50%) |
| NYU CTF Bench | 0% | — | 0% |
RAG was invoked on all 26 challenges and produced a **+50% task success
rate improvement** on InterCode-CTF by retrieving from a closely matched
holdout corpus. NYU CTF Bench results confirm that RAG is most effective
when the retrieval corpus matches the test distribution.
## Agent Tools
The agent has access to four custom CTF-solving tools:
- **Base64 Decode** — Decodes Base64-encoded strings
- **Hex Decode** — Converts hexadecimal to plaintext
- **Caesar Brute-Force** — Tries all 25 Caesar cipher rotations
- **Python Execution** — Runs arbitrary Python code for complex challenges
## Datasets
- **InterCode-CTF** (`ic_ctf.json`) — 100 text-based picoCTF challenges,
no Docker required. Covers the same challenges as the PicoCTF Writeups
dataset referenced in the original proposal.
- **NYU CTF Bench** — Development and test splits filtered to text-solvable
categories (`cry`, `misc`) using the `nyuctf` library
## Evaluation
Task 4 includes a full ablation study comparing all three strategies across
both datasets on:
- **Exact Match Accuracy**
- **F1 / Task Success Rate**
- **Average Partial Score**
Ablation conditions:
- Removing retrieval → RAG vs Zero-Shot (retrieval contribution)
- Removing tools → Zero-Shot vs No-Tools (tool contribution)
- Removing CoT prompt → Few-Shot partial score drop (prompt impact)
## Tech Stack
- **Python** — Core language
- **OpenAI GPT-4o** — Core reasoning model
- **LangChain / OpenAI API** — Agent framework and API calls
- **nyuctf** — NYU CTF Bench dataset loader
- **Google Colab** — Development environment
- **Datasets:** InterCode-CTF (picoCTF), NYU CTF Bench
## Setup
### Requirements
pip install openai nyuctf pandas numpy tqdm
### API Key
This notebook uses the OpenAI API. Add your key to Colab Secrets:
1. Click the 🔑 icon in the left sidebar
2. Add a secret named `OPENAI_API_KEY`
3. The notebook loads it automatically via `userdata.get("OPENAI_API_KEY")`
### Datasets
- **InterCode-CTF:** Place `ic_ctf.json` in your Google Drive and update
the path in the notebook
- **NYU CTF Bench:** Place the dataset folder in
`MyDrive/CTF_Dataset` and mount your Drive when prompted
## Project Structure
| Section | Description |
|---------|-------------|
| Task 1 | Dataset loading and preprocessing |
| Task 2 | Baseline LLM agent (Zero-Shot + Chain-of-Thought) |
| Task 3 | Few-Shot prompt tuning + RAG agent with tool use |
| Task 4 | Evaluation and ablation study |
## Authors
Dahana Moz Ruiz & Maria Santos — Kean University, Spring 2026