SagarBiswas-MultiHAT/Web_Vulnerability_Scanner-AI

GitHub: SagarBiswas-MultiHAT/Web_Vulnerability_Scanner-AI

Stars: 22 | Forks: 0

# Learning Grade AI Web Vulnerability Scanner
[![CI](https://img.shields.io/github/actions/workflow/status/SagarBiswas-MultiHAT/AI_Web_Vulnerability_Scanner/get-started-actions.yml?branch=main)](https://github.com/SagarBiswas-MultiHAT/AI_Web_Vulnerability_Scanner/actions)   [![Python](https://img.shields.io/badge/python-3.10%2B-blue)](https://www.python.org/)   [![Tests](https://img.shields.io/badge/tests-pytest-orange)](https://github.com/SagarBiswas-MultiHAT/Web_Vulnerability_Scanner-AI)   [![License](https://img.shields.io/github/license/SagarBiswas-MultiHAT/Web_Vulnerability_Scanner-AI)](https://github.com/SagarBiswas-MultiHAT/Web_Vulnerability_Scanner-AI/blob/main/LICENSE)   [![Last commit](https://img.shields.io/github/last-commit/SagarBiswas-MultiHAT/Web_Vulnerability_Scanner-AI)](https://github.com/SagarBiswas-MultiHAT/Web_Vulnerability_Scanner-AI/commits)   [![Issues](https://img.shields.io/github/issues/SagarBiswas-MultiHAT/Web_Vulnerability_Scanner-AI)](https://github.com/SagarBiswas-MultiHAT/Web_Vulnerability_Scanner-AI/issues)
A portfolio-ready, **learning-grade** web vulnerability scanner and lightweight **AI-assisted report viewer**. This project demonstrates a polite, non-destructive approach to crawling and finding common web security issues (missing security headers, insecure cookie flags, reflected XSS heuristics, and basic error-based SQL injection indicators). It ships with a small, optional Flask-based AI proxy that powers the in-report AI assistance.
Pictures
**After running python app.py**
![python app.py](https://imgur.com/Gy4JZKZ.png)
**After running mainScaner.py**
![python mainScaner.py https://sagarbiswas-multihat.github.io --confirm --ai-enabled --ai-server http://127.0.0.1:5000/api/ai-chat](https://imgur.com/zg26tAn.png)
**After running cd Reports, python -m http.server 8080**
![python -m http.server 8080](https://imgur.com/HZH6JZK.png)
**JSON Report Example**
![JSON Report Example](https://imgur.com/smYnJiW.png)
**HTML Report Example**
![HTML Report Example](https://imgur.com/gqfVDXM.png)
**AI Help Center** **1).** ![AI Help Center](https://imgur.com/0Ne17Fz.png) **2).** ![AI Help Center](https://imgur.com/YAsnDvb.png)
**Without "Anonymize before send"**
![Without Anonymize before send](https://imgur.com/Tf39Scd.png)
**With "Anonymize before send"**
![With Anonymize before send](https://imgur.com/CcSuAUs.png)
# Table of contents

- [Key features](#key-features)
- [What this scanner does (and doesn't)](#what-this-scanner-does-and-doesnt)
- [Requirements](#requirements)
- [Files](#files)
- [Installation](#installation)
- [Usage & Command-line arguments](#usage--command-line-arguments)
- [Examples](#examples)
- [Output files & report format](#output-files--report-format)
- [Internals & design decisions](#internals--design-decisions)
- [Extending the scanner](#extending-the-scanner)
- [Safety & legal notes](#safety--legal-notes)
- [Contributing](#contributing)
- [Contact / Acknowledgements](#contact--acknowledgements)

# Key features

- Queue-based polite crawler (no recursive thread spawning).
Quick explanation: Queue-based crawler, Polite crawler & No recursive thread spawning (beginner-friendly)

#### **1.** Queue-based crawler

A **queue-based crawler** uses a **single shared work queue** to manage all URLs that need to be visited.

**How it works (conceptually):**

1. Start with the base URL → put it into the queue
2. Worker threads repeatedly:
   - Take **one URL** from the queue
   - Fetch the page
   - Extract links
   - Add **new, allowed URLs** back into the same queue
3. Repeat until the queue is empty or the depth limit is reached

**What this means:**

- Each worker processes **one URL at a time**
- There is **central control** over what gets scanned
- Crawl order, depth, and limits remain predictable

#### **2.** Polite crawler

A **polite crawler** is designed **not to stress or harm servers**. In this scanner, politeness includes:

- ⏱ Per-host rate limiting (`--delay`)
- 🤖 Respecting `robots.txt`
- 🌱 GET-only requests (non-destructive)
- 🚫 No brute-force or payload floods
- 🧵 Controlled number of threads

Instead of hammering a site, the scanner behaves more like a **careful human using a browser**.

#### **3.** No recursive thread spawning (critical design choice)

This explains what the crawler **deliberately does NOT do**.

**✘ Bad design (recursive thread spawning):**
|                                |
| ------------------------------ |
| Thread A visits URL A          |
| └── spawns Thread B for link B |
| └── spawns Thread C for link C |
| └── spawns Thread D for link D |
A **thread** is a lightweight unit of execution that allows a program to do work in **parallel**, meaning multiple tasks run at the same time instead of one after another.

**Problems with this approach:**

- Unbounded thread growth
- Loss of concurrency control
- Servers get overwhelmed
- Scanner runs out of memory or sockets
- Hard to enforce delays and crawl depth

### ✓ What this scanner does instead

[ Queue ]

↓

Worker Thread Pool (fixed size)

↓

Fetch → Extract → Enqueue (back to Queue)

#### Why this design is considered best practice
| Aspect                 | Queue-based crawler    | Recursive spawning |
| ---------------------- | ---------------------- | ------------------ |
| Thread control         | ✅ Fixed & predictable | ❌ Unbounded       |
| Rate limiting          | ✅ Enforceable         | ❌ Difficult       |
| Server safety          | ✅ Polite              | ❌ Aggressive      |
| Memory safety          | ✅ Stable              | ❌ Risky           |
| Debugging              | ✅ Easier              | ❌ Chaotic         |
| Legal / ethical safety | ✅ Much safer          | ❌ Risky           |
This is why **professional tools and search engine crawlers** use queue-based designs. **Result:** safer scans, predictable behavior, ethical crawling, and easier extensibility.
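The queue-plus-fixed-pool flow described above can be sketched roughly as follows. This is a minimal illustration, not the scanner's actual code: `FAKE_SITE`, `fetch_links`, and `crawl` are hypothetical names, and the "fetch" step reads from an in-memory dictionary so the example runs without touching a network.

```python
import queue
import threading
from urllib.parse import urljoin, urlparse

# Hypothetical in-memory "site" (page URL -> hrefs found on it), so the
# sketch runs offline. A real crawler would issue HTTP GET requests here.
FAKE_SITE = {
    "https://example.test/": ["/a", "/b", "https://other.test/x"],
    "https://example.test/a": ["/b", "/c"],
    "https://example.test/b": ["/"],
    "https://example.test/c": [],
}


def fetch_links(url):
    """Stand-in for 'fetch the page and extract links'; returns absolute URLs."""
    return [urljoin(url, href) for href in FAKE_SITE.get(url, [])]


def crawl(base_url, max_depth=2, num_workers=3):
    """Crawl same-host pages with a fixed pool of workers and one shared queue."""
    base_host = urlparse(base_url).netloc
    work = queue.Queue()          # the single shared work queue
    seen = {base_url}             # URLs already queued, to avoid repeats
    seen_lock = threading.Lock()
    work.put((base_url, 0))

    def worker():
        while True:
            item = work.get()
            if item is None:      # sentinel: no more work, exit the thread
                work.task_done()
                return
            url, depth = item
            try:
                if depth < max_depth:
                    for link in fetch_links(url):
                        if urlparse(link).netloc != base_host:
                            continue          # stay on the target host
                        with seen_lock:
                            if link in seen:
                                continue
                            seen.add(link)
                        work.put((link, depth + 1))   # enqueue, never spawn
            finally:
                work.task_done()

    # The number of threads is fixed up front and never grows.
    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    work.join()                   # wait until every queued URL is processed
    for _ in threads:
        work.put(None)            # one sentinel per worker
    for t in threads:
        t.join()
    return sorted(seen)


if __name__ == "__main__":
    for page in crawl("https://example.test/"):
        print(page)
```

Workers only ever put new URLs back on the queue, so the thread count stays at `num_workers` no matter how many links a page contains.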
- Per-host rate limiting (configurable `--delay`) to avoid hammering a server.
- `robots.txt` awareness (the scanner checks and respects rules where available).
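One way per-host rate limiting like the `--delay` option could work is sketched below. The `HostRateLimiter` class and its `wait` method are hypothetical names for illustration; the scanner's real implementation may differ.

```python
import threading
import time
from urllib.parse import urlparse


class HostRateLimiter:
    """Keep at least `delay_seconds` between requests to the same host."""

    def __init__(self, delay_seconds):
        self.delay = delay_seconds
        self._next_allowed = {}        # host -> earliest time of next request
        self._lock = threading.Lock()  # safe to share between worker threads

    def wait(self, url):
        """Block until a request to this URL's host is allowed.

        Returns the number of seconds slept (0 if no wait was needed).
        """
        host = urlparse(url).netloc
        with self._lock:
            now = time.monotonic()
            start = max(now, self._next_allowed.get(host, now))
            # Reserve the slot so concurrent callers queue up behind it.
            self._next_allowed[host] = start + self.delay
        sleep_for = start - now
        if sleep_for > 0:
            time.sleep(sleep_for)
        return sleep_for


if __name__ == "__main__":
    limiter = HostRateLimiter(0.5)
    for url in ["https://example.test/a", "https://example.test/b"]:
        slept = limiter.wait(url)
        print(f"{url}: waited {slept:.2f}s")
```

Because the delay is tracked per host, a request to a different host is not held back by an earlier request elsewhere.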
### robots.txt

**robots.txt:** A website rule file that tells scanners which URLs they should avoid.
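As an illustration of how such rules can be checked, the sketch below uses Python's standard `urllib.robotparser` module. The sample rules, the helper names `build_parser` and `is_allowed`, and the `LearningScanner` user agent are assumptions made for this example, not taken from the scanner's code.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from a string so the example
# runs without fetching anything from a real site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /private
"""


def build_parser(robots_text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def is_allowed(parser, url, user_agent="LearningScanner"):
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    for url in ["https://example.test/about", "https://example.test/admin/login"]:
        verdict = "allowed" if is_allowed(rules, url) else "skipped"
        print(f"{url}: {verdict}")
```

A crawler would typically run a check like this before adding a URL to its queue and skip anything the rules disallow.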