taguianas/PhishGuard-AI
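An end-to-end security platform that combines multiple techniques to detect phishing threats in URLs and emails, with per-user data isolation.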
# 🛡️ PhishGuard
### AI-Powered Phishing Detection Platform
*Heuristics · Threat Intel · Machine Learning · LLM Analysis · Full Auth*
**Author: [Anas TAGUI](https://github.com/taguianas)**
### 🌐 [Live Demo](https://phishguard-frontend-7ir8.onrender.com)
A full-stack cybersecurity platform that analyzes URLs and emails for phishing threats,
combining heuristic rules, typosquatting detection, threat intelligence APIs, a
trained XGBoost classifier, and LLM-based email analysis, all behind a complete
user authentication system with per-user data isolation.
## Architecture
```
phish-guard/
├── frontend/ Next.js 16.1.6 (App Router) + TailwindCSS + NextAuth.js v5
├── backend/ Node.js + Express API (JWT-protected)
├── ml-service/ Python FastAPI + XGBoost (trained model included)
├── browser-extension/ Chrome Manifest V3 extension
└── tests/ End-to-end test suite (Python)
```
**Data storage:**
- `backend/data/phishguard.db`: SQLite scan history (`url_scans`, `email_scans`), filtered by user
- **Neon PostgreSQL**: user accounts (email/password via bcrypt, Google OAuth), persistent across deploys
## Current Status
| Service | Port | State |
|---------|------|-------|
| Frontend (Next.js 16) | 3000 | Ready (auth enabled) |
| Backend (Express) | 4000 | Ready (JWT-protected) |
| ML Service (FastAPI) | 8000 | Ready (model trained) |
## Quick Start
### 1. Backend
```
cd backend
cp .env.example .env # fill in API keys + NEXTAUTH_SECRET
npm install
npm run dev # http://localhost:4000
```
### 2. ML Service
```
cd ml-service
pip install -r requirements.txt
# Build the dataset (automatically downloads ~789k phishing URLs)
python build_dataset.py # creates data/urls.csv (100k rows)
# Train the model
python train_model.py # creates model.pkl
# Start the API
python -m uvicorn main:app --port 8000
```
### 3. Frontend
```
cd frontend
cp .env.local.example .env.local # fill in NEXTAUTH_SECRET (same as backend)
npm install
npm run dev # http://localhost:3000
```
## User Authentication
PhishGuard requires a user account to access any page or API endpoint.
- **Email + password** registration and login (bcrypt-hashed, stored in Neon PostgreSQL)
- **Google OAuth:** enable by setting `GOOGLE_CLIENT_ID` and `GOOGLE_CLIENT_SECRET` in `frontend/.env.local`
- **Session strategy:** JWT (NextAuth v5, `authjs.session-token` cookie)
- **Route protection:** Next.js middleware redirects unauthenticated requests to `/login`, preserving `?callbackUrl`
- **Backend protection:** Every Express route verifies the JWT from `Authorization: Bearer <token>` using the shared `NEXTAUTH_SECRET`
- **Data isolation:** Each user sees only their own scan history; all queries filter by `user_id`
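
The data-isolation rule above can be illustrated with a minimal sketch using Python's built-in `sqlite3`. The `url_scans` table name comes from the storage notes; the column names (`user_id`, `url`, `risk_score`) and the `history_for` helper are assumptions made for illustration, and the real backend is a Node.js service whose schema may differ.

```python
import sqlite3


def history_for(conn: sqlite3.Connection, user_id: str) -> list:
    """Return (url, risk_score) rows that belong to a single user."""
    # The user id is bound as a parameter, never interpolated into the SQL text.
    cursor = conn.execute(
        "SELECT url, risk_score FROM url_scans WHERE user_id = ? ORDER BY id",
        (user_id,),
    )
    return cursor.fetchall()


# In-memory database with a hypothetical schema, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE url_scans ("
    "id INTEGER PRIMARY KEY, "
    "user_id TEXT NOT NULL, "
    "url TEXT NOT NULL, "
    "risk_score INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO url_scans (user_id, url, risk_score) VALUES (?, ?, ?)",
    [
        ("alice", "http://paypa1.com/login", 65),
        ("bob", "https://github.com", 0),
    ],
)

alice_rows = history_for(conn, "alice")  # only Alice's scan is returned
```

Because every read goes through a query that filters on `user_id`, a user with no scans simply gets an empty list rather than someone else's rows.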
### Authentication Flow
```
Browser                            Next.js (3000)            Express (4000)
|-- POST /api/auth/register ------>|                         |
|<-- 201 {"ok":true} --------------|                         |
|-- POST /api/auth/callback ------>|                         |
|<-- authjs.session-token cookie --|                         |
|-- POST /api/analyze/url -------->|                         |
|                       getToken() |-- Bearer <token> ------>|
|                                  |<-- analysis JSON -------|
|<-- analysis JSON ----------------|                         |
```
The frontend proxy routes (`/api/analyze/*`) extract the session token server-side using `getToken()` and re-sign a backend-compatible JWT using `jose`. The raw token never reaches the browser.
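
To make the shared-secret idea concrete, here is a minimal sketch of HS256 signing and verification using only the Python standard library. It is not the project's code (the frontend uses `jose` and the backend is Express); the claim names `sub` and `exp` are standard JWT claims, but which claims PhishGuard actually sets, and how it encodes the secret, are assumptions here.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    # JWTs use URL-safe base64 without padding.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _signature(signing_input: str, secret: str) -> bytes:
    # Assumption: the secret is used as raw UTF-8 bytes.
    return hmac.new(secret.encode(), signing_input.encode("ascii"), hashlib.sha256).digest()


def sign_hs256(claims: dict, secret: str) -> str:
    """Build header.payload.signature with an HMAC-SHA256 signature."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode())
        for part in (header, claims)
    )
    return f"{signing_input}.{_b64url(_signature(signing_input, secret))}"


def verify_hs256(token: str, secret: str, now: Optional[float] = None) -> dict:
    """Return the claims if the signature and expiry are valid, else raise ValueError."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("malformed token")
    header_b64, payload_b64, sig_b64 = parts
    if json.loads(_b64url_decode(header_b64)).get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    expected = _signature(f"{header_b64}.{payload_b64}", secret)
    # Constant-time comparison avoids leaking how many bytes matched.
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(payload_b64))
    current = time.time() if now is None else now
    if "exp" in claims and current >= claims["exp"]:
        raise ValueError("token expired")
    return claims
```

A token signed with one secret fails verification under any other secret, which is why `NEXTAUTH_SECRET` must be identical in the frontend and the backend.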
## API Endpoints
### Frontend Proxy (port 3000): requires session cookie
| Method | Path | Description |
|--------|------|-------------|
| POST | `/api/analyze/url` | Analyze a URL (proxies to backend, adds auth header) |
| POST | `/api/analyze/email` | Analyze email content (proxies to backend) |
| GET | `/api/analyze/history` | Fetch user's scan history |
| GET | `/api/analyze/history?type=stats` | Fetch user's scan stats |
| POST | `/api/auth/register` | Register a new account |
| GET/POST | `/api/auth/[...nextauth]` | NextAuth.js handlers (login, session, signout, CSRF) |
### Backend (port 4000): requires `Authorization: Bearer <token>`
| Method | Path | Description |
|--------|------|-------------|
| POST | `/api/url/analyze` | Analyze a URL for phishing risk |
| POST | `/api/email/analyze` | Analyze email content |
| GET | `/api/history` | User's scan history |
| GET | `/api/history/stats` | User's aggregate stats |
| GET | `/health` | Health check (public) |
### ML Service (port 8000): public (internal use)
| Method | Path | Description |
|--------|------|-------------|
| POST | `/predict` | Classify a URL (returns prediction + probability + features) |
| GET | `/health` | Health check + model load status |
#### URL Analysis: example response
```
{
"url": "http://paypa1.com/login",
"risk_score": 65,
"classification": "Medium Risk",
"reasons": [
"Suspicious keyword(s): login",
"Not using HTTPS",
"Possible typosquatting of \"paypal\" (distance: 1)",
"Blacklisted by VirusTotal (9 engines)"
],
"threat_intel": { "malicious": 9, "suspicious": 1, "harmless": 58, "blacklisted": true },
"ml_prediction": { "prediction": "Phishing", "probability": 1.0 }
}
```
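
The `distance: 1` in the typosquatting reason above is an edit distance; the roadmap names Levenshtein as the metric. The following is a minimal sketch of that check, not the backend's implementation: the `BRANDS` list, the `MAX_DISTANCE` threshold, and the `closest_brand` helper are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep) the character
            ))
        prev = curr
    return prev[-1]


BRANDS = ("paypal", "google", "github")  # hypothetical brand list
MAX_DISTANCE = 2                         # hypothetical threshold


def closest_brand(label: str):
    """Return (brand, distance) for a near miss, or None if exact or too far."""
    brand, distance = min(
        ((b, levenshtein(label.lower(), b)) for b in BRANDS),
        key=lambda pair: pair[1],
    )
    # Distance 0 is the genuine brand name, so it is not flagged.
    return (brand, distance) if 0 < distance <= MAX_DISTANCE else None
```

With these assumptions, the domain label `paypa1` is one substitution away from `paypal`, while the exact label `paypal` is left alone.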
#### ML Prediction: example response
```
{
"url": "http://paypa1-security-update.com/login",
"prediction": "Phishing",
"probability": 1.0,
"features": { "is_https": 0, "has_suspicious_tld": 0, "suspicious_keyword_count": 2, "brand_impersonation": 1 }
}
```
## ML Service: Dataset and Model
### Dataset (`data/urls.csv`)
Built by `build_dataset.py` using two sources:
| Source | Count | Label |
|--------|-------|-------|
| [Phishing.Database](https://github.com/mitchellkrogza/Phishing.Database) (active phishing URLs) | 50,000 | 1 (Phishing) |
| Generated from 100 known-trusted domains (Google, GitHub, PayPal, etc.) | 50,000 | 0 (Legitimate) |
| **Total** | **100,000** | balanced |
### Model (`model.pkl`)
| Property | Value |
|----------|-------|
| Algorithm | XGBoost (200 estimators, depth 6) |
| Features | 20 URL structural features |
| Test accuracy | 100% (20,000 held-out samples) |
| Train/test split | 80/20, stratified |
### Extracted Features
`url_length`, `hostname_length`, `path_length`, `num_dots`, `num_hyphens`,
`num_underscores`, `num_slashes`, `num_question_marks`, `num_equals`, `num_at`,
`num_percent`, `num_ampersand`, `has_ip`, `is_https`, `has_www`,
`has_encoded_chars`, `suspicious_keyword_count`, `has_suspicious_tld`,
`subdomain_count`, `brand_impersonation`
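
As an illustration of what "URL structural features" can look like, here is a sketch that derives a subset of the features listed above from a URL string. The feature names match the list, but the exact definitions used by the ML service are not documented here, so each computation below is an assumption, as is the `SUSPICIOUS_KEYWORDS` list.

```python
import ipaddress
from urllib.parse import urlsplit

SUSPICIOUS_KEYWORDS = ("login", "verify", "update", "secure")  # hypothetical list


def _is_ip(host: str) -> bool:
    """True if the hostname is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def extract_features(url: str) -> dict:
    """Compute a subset of the structural features; the URL is parsed, never fetched."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    lowered = url.lower()
    return {
        "url_length": len(url),
        "hostname_length": len(host),
        "path_length": len(parts.path),
        "num_dots": url.count("."),
        "num_hyphens": url.count("-"),
        "num_at": url.count("@"),
        "has_ip": int(_is_ip(host)),
        "is_https": int(parts.scheme == "https"),
        "has_www": int(host.startswith("www.")),
        "suspicious_keyword_count": sum(kw in lowered for kw in SUSPICIOUS_KEYWORDS),
    }


features = extract_features("http://paypa1-security-update.com/login")
```

Every value is numeric, which suits a tree-based classifier such as XGBoost without further encoding.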
## Environment Variables
### Backend `.env`
| Variable | Description |
|----------|-------------|
| `PORT` | Backend port (default 4000) |
| `NEXTAUTH_SECRET` | **Required**: shared JWT secret (same value as frontend) |
| `VIRUSTOTAL_API_KEY` | VirusTotal v3 API key |
| `GOOGLE_SAFE_BROWSING_API_KEY` | Google Safe Browsing API key (free, 10k req/day) |
| `ML_SERVICE_URL` | ML microservice URL (default `http://localhost:8000`) |
| `ALLOWED_ORIGINS` | Comma-separated allowed CORS origins |
| `GROQ_API_KEY` | Groq API key for LLM email classification (free at console.groq.com) |
### Frontend `.env.local`
| Variable | Description |
|----------|-------------|
| `NEXTAUTH_SECRET` | **Required**: shared JWT secret (same value as backend) |
| `NEXTAUTH_URL` | Frontend URL (default `http://localhost:3000`) |
| `BACKEND_URL` | Backend URL for server-side proxy routes (default `http://localhost:4000`) |
| `DATABASE_URL` | **Required**: Neon PostgreSQL connection string for user accounts |
| `GOOGLE_CLIENT_ID` | Google OAuth client ID (leave blank to disable Google login) |
| `GOOGLE_CLIENT_SECRET` | Google OAuth client secret |
| `NEXT_PUBLIC_GOOGLE_ENABLED` | Set to `true` to show Google login button |
### Generating NEXTAUTH_SECRET
```
openssl rand -base64 32
```
Use the same value in both `backend/.env` and `frontend/.env.local`.
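
If `openssl` is not available, an equivalent secret can be produced with Python's standard library. This is a sketch of the same idea (32 random bytes, base64-encoded); the `generate_secret` helper name is an assumption, not something shipped with the project.

```python
import base64
import secrets


def generate_secret(num_bytes: int = 32) -> str:
    """Return num_bytes from the OS random source, base64-encoded."""
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")


secret = generate_secret()
print(secret)  # paste the same value into both env files
```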
### Getting a Google Safe Browsing API Key (free)
1. Go to [console.cloud.google.com](https://console.cloud.google.com)
2. Create a project (or select an existing one)
3. Search for **"Safe Browsing API"** and click **Enable**
4. Go to **Credentials → Create Credentials → API Key**
5. Copy the key into `backend/.env` as `GOOGLE_SAFE_BROWSING_API_KEY`
Free quota: **10,000 requests/day**, no billing required.
## Render Deployment
A `render.yaml` is included for one-click deployment of all three services (frontend, backend, ML service) to [Render](https://render.com).
After the first deploy, set these environment variables in the Render dashboard:
| Service | Variable | Value |
|---------|----------|-------|
| Frontend | `BACKEND_URL` | `https://phishguard-backend.onrender.com` (no trailing slash) |
| Frontend | `DATABASE_URL` | Your Neon PostgreSQL connection string |
| Frontend | `NEXTAUTH_SECRET` | Same secret as backend |
| Frontend | `NEXTAUTH_URL` | `https://phishguard-frontend.onrender.com` |
| Backend | `NEXTAUTH_SECRET` | Same secret as frontend |
| Backend | `ALLOWED_ORIGINS` | `https://phishguard-frontend.onrender.com` |
| Backend | `ML_SERVICE_URL` | `https://phishguard-ml.onrender.com` |
| Backend | `VIRUSTOTAL_API_KEY` | Your VirusTotal API key |
| Backend | `GOOGLE_SAFE_BROWSING_API_KEY` | Your Safe Browsing API key |
| Backend | `GROQ_API_KEY` | Your Groq API key |
## Risk Scoring Formula
### URL Scoring
| Signal | Points |
|--------|--------|
| IP address as hostname | +20 |
| URL length > 75 chars | +10 |
| Excessive subdomains | +10 |
| Suspicious keywords | +5–15 |
| Suspicious TLD | +15 |
| No HTTPS | +10 |
| Typosquatting detected | +25 |
| Encoded characters | +10 |
| VirusTotal blacklisted | +25 |
| Recently registered domain (<1 year) | +10 |
| Google Safe Browsing flagged | +20 |
Score range: 0–100. Classification: Low (<40), Medium (40–69), High (≥70).
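
The table and thresholds above can be sketched as a small additive scorer. The point values are copied from the table; everything else is an assumption made for illustration: the signal key names, the rule of +5 per suspicious keyword capped at +15, clamping the sum to 100, and the "Low Risk" and "High Risk" labels (only "Medium Risk" appears in the example response).

```python
# Point values copied from the URL scoring table; key names are hypothetical.
SIGNAL_POINTS = {
    "ip_hostname": 20,
    "long_url": 10,
    "excessive_subdomains": 10,
    "suspicious_tld": 15,
    "no_https": 10,
    "typosquatting": 25,
    "encoded_chars": 10,
    "virustotal_blacklisted": 25,
    "recent_domain": 10,
    "safe_browsing_flagged": 20,
}


def keyword_points(count: int) -> int:
    # Assumption: +5 per suspicious keyword, capped at the table's +15.
    return min(5 * count, 15) if count > 0 else 0


def score_url(signals: set, keyword_count: int = 0) -> int:
    """Sum the points for each triggered signal and clamp to the 0-100 range."""
    raw = sum(SIGNAL_POINTS[name] for name in signals) + keyword_points(keyword_count)
    return min(raw, 100)


def classify(score: int) -> str:
    """Map a score to the Low / Medium / High bands."""
    if score >= 70:
        return "High Risk"
    if score >= 40:
        return "Medium Risk"
    return "Low Risk"


# Signals from the example response: one keyword, no HTTPS, typosquatting, VirusTotal hit.
example = score_url({"no_https", "typosquatting", "virustotal_blacklisted"}, keyword_count=1)
```

Under these assumptions the example works out to 5 + 10 + 25 + 25 = 65, which matches the `risk_score` and "Medium Risk" classification shown in the URL analysis example response.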
### Email Scoring
Heuristics check for urgent language, suspicious URLs, grammar anomalies, spoofed sender domains, and common phishing keywords. The Groq LLM (Llama 3.1 70B) provides an independent verdict: if it classifies the email as Phishing with ≥70% confidence, 15 points are added to the score.
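
The LLM adjustment described above amounts to a conditional bonus. A minimal sketch follows; the function name, the representation of confidence as a fraction between 0 and 1, and the cap at 100 (mirroring the URL score range) are assumptions.

```python
LLM_BONUS = 15             # points added on a confident phishing verdict
LLM_MIN_CONFIDENCE = 0.70  # the 70% threshold, expressed as a fraction


def apply_llm_verdict(heuristic_score: int, verdict: str, confidence: float) -> int:
    """Add the LLM bonus only for a Phishing verdict at or above the threshold."""
    confident_phish = verdict == "Phishing" and confidence >= LLM_MIN_CONFIDENCE
    bonus = LLM_BONUS if confident_phish else 0
    # Assumption: email scores are capped at 100 like URL scores.
    return min(heuristic_score + bonus, 100)
```

Keeping the LLM as a bounded bonus rather than the sole decision maker means a low-confidence or wrong verdict cannot move the score by more than 15 points.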
## Testing
### End-to-End Test Suite
```
# All three services must be running first
python tests/e2e_test.py
```
Covers 57 test cases across 8 groups:
1. Service health checks (all 3 services)
2. ML URL predictions (phishing, legitimate, invalid input)
3. Backend 401 enforcement (all protected routes)
4. Frontend auth flow (register, login, session, sign-out)
5. Authenticated proxy routes (URL analyze, email analyze, history, stats)
6. Proxy routes: unauthenticated (redirects to login)
7. Route protection: page redirects (all protected pages)
8. Data isolation (two users cannot see each other's history)
See `tests/REPORT.md` for the full test report.
## Security Notes
- Input validation on all endpoints via `express-validator` and Pydantic
- Rate limiting: 60 req/min per IP (backend)
- Helmet.js security headers
- URLs are **never fetched**: only their structure is analyzed (SSRF-safe)
- Passwords hashed with bcrypt (12 rounds)
- JWTs signed with `HS256` and verified on every backend request
- API keys are stored in `.env` / `.env.local`; never commit them
- Frontend proxy routes add `Authorization` server-side, so the raw JWT never reaches the browser
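
To illustrate the per-IP rate limit listed above (60 requests per minute), here is a minimal sliding-window sketch. It shows the general technique only; the backend's actual middleware and windowing algorithm are not specified in this README, and the class name and injectable `clock` parameter are assumptions.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client within the last `window_seconds`."""

    def __init__(self, limit: int = 60, window_seconds: float = 60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip: str) -> bool:
        now = self.clock()
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # the caller would typically respond with HTTP 429
        hits.append(now)
        return True
```

Each IP has its own queue of timestamps, so one noisy client cannot exhaust the budget of another.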
## Roadmap
- [x] Backend heuristic URL analyzer
- [x] Typosquatting detection (Levenshtein)
- [x] VirusTotal threat intel integration
- [x] Email phishing analyzer
- [x] ML classifier (XGBoost, trained on 100k URLs)
- [x] FastAPI ML microservice
- [x] Next.js frontend (URL analyzer, email analyzer, dashboard)
- [x] Domain age lookup (WHOIS via whoiser)
- [x] Google Safe Browsing API integration
- [x] SQLite scan history (per-user, isolated)
- [x] Live dashboard with stats and recent scans table
- [x] LLM-based email classification (Groq, Llama 3.1, free tier)
- [x] Grammar anomaly detection in email analyzer
- [x] Chrome browser extension (Manifest V3)
- [x] User authentication (NextAuth.js v5: email/password + Google OAuth)
- [x] End-to-end test suite (57 tests, all passing)
- [x] Render deployment (`render.yaml`: frontend + backend + ML service)
- [x] Neon PostgreSQL for persistent user accounts across deploys