Sachin30102006/Hireshield
GitHub: Sachin30102006/Hireshield
# 🛡️ HireShield - AI Recruitment Threat Intelligence Platform
**Production-Grade AI Security Platform for Detecting Recruitment Scams**
## 📋 Table of Contents
- [Project Overview](#project-overview)
- [System Architecture](#system-architecture)
- [Features](#features)
- [Quick Start](#quick-start)
- [API Documentation](#api-documentation)
- [Project Structure](#project-structure)
- [Installation](#installation)
- [Configuration](#configuration)
- [Development](#development)
- [Deployment](#deployment)
- [Contributing](#contributing)
## 🚀 Project Overview
HireShield is a comprehensive AI-powered platform designed to detect and analyze recruitment-related threats including:
- **Fake Job Offers** - AI-detected impersonation and fraudulent positions
- **Phishing Emails** - Detection of credential harvesting attempts
- **Payment Scams** - Identification of advance payment and fee requests
- **Social Engineering** - Analysis of manipulation and urgency tactics
- **Identity Theft** - Recognition of identity verification requests
- **Recruiter Impersonation** - Verification of recruiter legitimacy
### Key Technologies
- **AI/ML**: scikit-learn, XGBoost, SHAP
- **NLP**: spaCy, NLTK, Transformers (DistilBERT)
- **Backend**: FastAPI, SQLAlchemy
- **Frontend**: Streamlit
- **Database**: SQLite (PostgreSQL-ready)
- **Deployment**: Docker, Docker Compose
## 🏗️ System Architecture
### Layered Architecture
┌─────────────────────────────────────────────────────────────┐
│ FRONTEND LAYER │
│ (Streamlit Web Application) │
└────────────────────┬────────────────────────────────────────┘
│ HTTP/REST API
┌────────────────────▼────────────────────────────────────────┐
│ API LAYER │
│ (FastAPI - RESTful Endpoints) │
├─────────┬──────────┬──────────┬──────────┬──────────────────┤
│ Analyze │ Explain │ Logs │ Threat │ Health/Status │
│ /analyze│ /explain │ /logs │ /stats │ /health │
└────────────────────┬────────────────────────────────────────┘
│ Service Injection
┌────────────────────▼────────────────────────────────────────┐
│ SERVICE LAYER │
├──────────────┬──────────┬──────────┬───────┬─────────────────┤
│ Inference │ Features │ Explain- │ Threat│ Logging │
│ (Models) │ (Eng.) │ ability │(Intel)│ (Persistence) │
└────────────────────┬────────────────────────────────────────┘
│ Dependency Injection
┌────────────────────▼────────────────────────────────────────┐
│ AI/NLP LAYER │
├──────────────┬──────────────┬───────────────────────────────┤
│ Preprocessing│ Feature Eng. │ ML Inference │
│ (Cleaning) │ (Extraction) │ (XGBoost, Logistic Reg.) │
└────────────────────┬────────────────────────────────────────┘
│
┌────────────────────▼────────────────────────────────────────┐
│ DATABASE LAYER │
│ (SQLAlchemy ORM + SQLite/PostgreSQL) │
├──────────────┬──────────────┬──────────────────────────────┤
│ Scan Logs │ Recruiter │ Threat Intelligence / Models │
│ │ Profiles │ Training Logs │
└──────────────┴──────────────┴──────────────────────────────┘
### Key Design Decisions
1. **Separation of Concerns**: Clear separation between frontend, API, services, and data layers
2. **Dependency Injection**: Services are loosely coupled and easy to test
3. **Asynchronous API**: FastAPI enables high-performance async endpoints
4. **Database Abstraction**: SQLAlchemy ORM allows easy migration to PostgreSQL
5. **Modular Services**: Each service handles a specific responsibility
6. **Type Safety**: Pydantic models ensure request/response validation
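As a rough illustration of decisions 2 and 5, the sketch below shows constructor-based dependency injection in plain Python. The names `Predictor`, `AnalysisService`, and `FixedPredictor` are hypothetical and do not come from the HireShield codebase; the real services may be wired differently (for example through FastAPI's `Depends`).

```python
from dataclasses import dataclass
from typing import Protocol


class Predictor(Protocol):
    """Anything that can turn text into a scam probability in [0, 1]."""

    def predict_proba(self, text: str) -> float: ...


@dataclass
class AnalysisService:
    """Hypothetical service that receives its model instead of creating it."""

    predictor: Predictor

    def analyze(self, text: str) -> dict:
        probability = self.predictor.predict_proba(text)
        # Report a percentage, matching the 0-100 scale used in the API examples.
        return {"scam_probability": round(probability * 100, 1)}


class FixedPredictor:
    """Test double that always returns the same probability."""

    def __init__(self, value: float) -> None:
        self.value = value

    def predict_proba(self, text: str) -> float:
        return self.value


# In a unit test, the real model is swapped for the test double.
service = AnalysisService(predictor=FixedPredictor(0.875))
result = service.analyze("Pay a registration fee to secure your offer")
```

Because the service only depends on the `Predictor` protocol, a test can exercise it without loading any trained model from disk.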
## ✨ Features
### Analysis Engine
- **Multi-Model Inference**: XGBoost (primary) + Logistic Regression (baseline)
- **Real-Time Processing**: Analysis typically completes in under 500 ms per request
- **Confidence Scoring**: Calibrated probability estimates
- **Feature-Rich Detection**: 23+ behavioral and linguistic features
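To make the idea of linguistic features concrete, here is a deliberately small sketch of two keyword-based signals. The feature names `urgency_score` and `payment_request` echo the example API response in this README, but the phrase lists, the scoring rule, and the function names are illustrative assumptions, not the project's actual detectors.

```python
import re

# Illustrative phrase lists (assumptions, not the project's real lexicons).
URGENCY_PHRASES = ("act now", "immediately", "within 24 hours", "urgent")
PAYMENT_PHRASES = ("registration fee", "processing fee", "pay", "deposit")


def count_phrases(text: str, phrases: tuple[str, ...]) -> int:
    """Count whole-word, case-insensitive occurrences of each phrase."""
    lowered = text.lower()
    return sum(
        len(re.findall(rf"\b{re.escape(phrase)}\b", lowered)) for phrase in phrases
    )


def extract_features(text: str) -> dict[str, float]:
    """Return a toy feature vector for one recruitment message."""
    words = max(len(text.split()), 1)
    return {
        # Urgency hits per word, so long messages are not penalised.
        "urgency_score": count_phrases(text, URGENCY_PHRASES) / words,
        # 1.0 if any payment phrase appears, else 0.0.
        "payment_request": float(count_phrases(text, PAYMENT_PHRASES) > 0),
    }


features = extract_features("URGENT: pay the registration fee within 24 hours")
```

A real pipeline would feed a vector like this (with many more columns) into the classifier; the point here is only the shape of the data.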
### Explainability
- **SHAP Integration**: Feature importance and local explanations
- **Human-Readable Reasoning**: Step-by-step decision explanations
- **Confidence Assessment**: Why the model is confident in its prediction
- **Top Contributing Features**: Identification of key fraud signals
### Threat Intelligence
- **Recruiter Profiling**: Trust scoring and historical tracking
- **Threat Categorization**: Classification into 8+ threat types
- **Pattern Recognition**: Identification of recurring fraud patterns
- **Analytics Dashboard**: Trend analysis and statistics
### Logging & Monitoring
- **Comprehensive Audit Trail**: All scans logged with metadata
- **Search & Filtering**: Advanced log queries with multiple filters
- **Analytics Snapshots**: Periodic threat intelligence summaries
- **Model Training History**: Tracking of model performance over time
### Security & Compliance
- **Data Persistence**: Secure storage in SQLite/PostgreSQL
- **Access Logging**: Complete audit trail of system activity
- **Error Handling**: Graceful degradation and error reporting
- **Health Monitoring**: Real-time system status checks
## 🚀 Quick Start
### Prerequisites
- Python 3.11+
- Docker & Docker Compose (for containerized deployment)
- 2GB+ RAM
### Installation
1. **Clone Repository**
git clone https://github.com/Sachin30102006/Hireshield.git
cd Hireshield
2. **Create Virtual Environment**
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
3. **Install Dependencies**
pip install -r requirements.txt
python -m spacy download en_core_web_sm
4. **Initialize Database**
python -c "from backend.database import init_db; init_db()"
5. **Train Models** (if needed)
python -m models.train_model
### Running the Application
#### Option 1: Local Development (Separate Services)
**Terminal 1 - Backend API:**
uvicorn backend.main:app --host 0.0.0.0 --port 8000 --reload
**Terminal 2 - Frontend:**
streamlit run frontend/app.py
Then visit:
- **Frontend**: http://localhost:8501
- **API Docs**: http://localhost:8000/api/docs
- **API ReDoc**: http://localhost:8000/api/redoc
#### Option 2: Docker Compose (Recommended)
docker-compose up --build
Then visit:
- **Frontend**: http://localhost:8501
- **Backend**: http://localhost:8000
- **API Docs**: http://localhost:8000/api/docs
## 📚 API Documentation
### Base URL
http://localhost:8000/api
### Authentication
No authentication is currently required. Add JWT or OAuth before exposing the API in production.
### Endpoints
#### Analysis Endpoints
**POST /analyze** - Basic scam analysis
Request:
{
"text": "Your recruitment message here...",
"recruiter_email": "recruiter@company.com",
"recruiter_name": "John Doe",
"job_title": "Senior Engineer",
"company_name": "TechCorp"
}
Response:
{
"scam_probability": 87.5,
"risk_level": "HIGH RISK",
"confidence": 0.94,
"detected_indicators": [...],
"highlighted_phrases": [...],
"feature_scores": {...},
"processing_time_ms": 125.5
}
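A minimal client sketch for this endpoint, using only the Python standard library. It assumes the backend is reachable at the base URL given above; the helper names `build_analyze_request` and `analyze` are hypothetical and not part of the project.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/api"  # base URL documented in this README


def build_analyze_request(text: str, **optional_fields: str) -> urllib.request.Request:
    """Build (but do not send) a POST /analyze request."""
    payload = {"text": text, **optional_fields}
    return urllib.request.Request(
        url=f"{BASE_URL}/analyze",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(text: str, **optional_fields: str) -> dict:
    """Send the request and decode the JSON response (needs a running backend)."""
    request = build_analyze_request(text, **optional_fields)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


# Building the request has no side effects, so it is safe to inspect offline.
request = build_analyze_request(
    "Your recruitment message here...",
    recruiter_email="recruiter@company.com",
)
```

With the backend running, `analyze("...")["risk_level"]` would return the risk label from the response shown above.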
**POST /deep-scan** - Advanced analysis with SHAP
Request:
{
"text": "...",
"recruiter_email": "...",
"include_shap": true,
"confidence_threshold": 0.7
}
Response: Same fields as /analyze, with more detailed feature importance
#### Explainability Endpoints
**POST /explain** - Get SHAP-based explanations
Request:
{
"text": "...",
"recruiter_email": "..."
}
Response:
{
"scam_probability": 87.5,
"risk_level": "HIGH RISK",
"shap_available": true,
"feature_importance": [...],
"top_contributing_features": ["urgency_score", "payment_request"],
"explanation_text": "The AI detected multiple fraud indicators...",
"reasoning": [...]
}
#### Log Endpoints
**GET /logs** - Retrieve scan history
Query Parameters:
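- `limit`: number of entries to return (default 50, max 1000)
- `offset`: number of entries to skip (default 0)
- `severity_level`: one of `CRITICAL`, `HIGH RISK`, `SUSPICIOUS`, `SAFE`
- `status`: one of `Blocked`, `Quarantined`, `Flagged`, `Verified`
- `search_query`: text to search for in email/category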
Response:
{
"logs": [...],
"total_count": 145,
"limit": 50,
"offset": 0
}
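The query string for `GET /logs` can be assembled as in the sketch below. The `build_logs_url` helper is hypothetical, and the lower bound of 1 on `limit` is an assumption; only the maximum of 1000 is documented here.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000/api"
MAX_LIMIT = 1000  # documented upper bound for `limit`


def build_logs_url(limit: int = 50, offset: int = 0, **filters: str) -> str:
    """Return the GET /logs URL for the given pagination and filters."""
    if not 1 <= limit <= MAX_LIMIT:  # the minimum of 1 is an assumption
        raise ValueError(f"limit must be between 1 and {MAX_LIMIT}")
    if offset < 0:
        raise ValueError("offset must be non-negative")
    params = {"limit": limit, "offset": offset}
    # Keep only filters that were actually supplied.
    params.update({key: value for key, value in filters.items() if value})
    return f"{BASE_URL}/logs?{urlencode(params)}"


url = build_logs_url(limit=100, severity_level="HIGH RISK", status="Blocked")
```

`urlencode` takes care of escaping values such as `HIGH RISK`, which contains a space.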
**GET /logs/{log_id}** - Get specific log details
**DELETE /logs/{log_id}** - Delete a log entry
#### Threat Intelligence Endpoints
**GET /threat-stats** - Aggregated threat statistics
Query Parameters:
- `days`: lookback period in days (default 30)
Response:
{
"total_scans": 1250,
"critical_count": 145,
"high_risk_count": 340,
"average_scam_probability": 45.3,
"top_threat_categories": [...]
}
**GET /threat-summary** - Quick threat summary
#### Health & Status Endpoints
**GET /health** - API health check
Response:
{
"status": "healthy",
"api_version": "1.0.0",
"models_loaded": true,
"database_connected": true,
"message": "All systems operational"
}
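A deployment script or container health probe might gate on this payload. The sketch below is one possible readiness rule, assuming the field names shown in the response above; the `is_ready` helper is hypothetical.

```python
def is_ready(health: dict) -> bool:
    """Return True only when the API, models, and database all report ready."""
    return (
        health.get("status") == "healthy"
        and health.get("models_loaded") is True
        and health.get("database_connected") is True
    )


# Sample payload copied from the documented /health response.
sample = {
    "status": "healthy",
    "api_version": "1.0.0",
    "models_loaded": True,
    "database_connected": True,
    "message": "All systems operational",
}
ready = is_ready(sample)
degraded = is_ready({**sample, "models_loaded": False})
```

Checking the individual flags rather than only `status` guards against a service that answers requests before its models have loaded.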
**GET /model-info** - Model information
**GET /status** - Comprehensive system status
### Error Responses
{
"error": "Error type",
"status_code": 400,
"timestamp": "2026-05-21T10:30:00Z",
"details": {...}
}
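Clients can turn this error body into a typed object. The sketch below assumes the four fields shown above and an ISO 8601 timestamp with a trailing `Z`; the `ApiError` class and `parse_error` function are illustrative names, not part of the project.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass(frozen=True)
class ApiError:
    """Typed view of the documented error body."""

    error: str
    status_code: int
    timestamp: datetime
    details: dict = field(default_factory=dict)


def parse_error(body: dict) -> ApiError:
    """Convert a decoded JSON error body into an ApiError."""
    # Replace the trailing "Z" so fromisoformat accepts it on older Pythons.
    raw_timestamp = body["timestamp"].replace("Z", "+00:00")
    return ApiError(
        error=body["error"],
        status_code=int(body["status_code"]),
        timestamp=datetime.fromisoformat(raw_timestamp),
        details=body.get("details") or {},
    )


err = parse_error(
    {"error": "Error type", "status_code": 400, "timestamp": "2026-05-21T10:30:00Z"}
)
```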
## 📁 Project Structure
HireShield/
│
├── frontend/
│ ├── app.py # Streamlit main application
│ ├── api_client.py # API client for REST communication
│ └── views/
│ ├── dashboard.py # Dashboard view
│ ├── analysis.py # Scam analysis view
│ ├── threat_intel.py # Threat intelligence view
│ ├── explainability.py # Explainability view
│ ├── logs.py # Detection logs view
│ └── settings.py # Settings view
│
├── backend/
│ ├── main.py # FastAPI application entry point
│ │
│ ├── routers/
│ │ ├── analyze.py # Analysis endpoints
│ │ ├── explain.py # Explainability endpoints
│ │ ├── logs.py # Log management endpoints
│ │ ├── threat.py # Threat intelligence endpoints
│ │ └── health.py # Health & status endpoints
│ │
│ ├── services/
│ │ ├── inference_service.py # ML model inference
│ │ ├── feature_service.py # Feature extraction
│ │ ├── explainability_service.py # SHAP explanations
│ │ ├── logging_service.py # Log persistence
│ │ └── threat_service.py # Threat intelligence
│ │
│ ├── models/
│ │ ├── request_models.py # Pydantic request schemas
│ │ └── response_models.py # Pydantic response schemas
│ │
│ ├── database/
│ │ ├── db.py # Database connection & config
│ │ ├── schemas.py # SQLAlchemy ORM models
│ │ └── operations.py # Database CRUD operations
│ │
│ └── utils/
│ └── config.py # Configuration utilities
│
├── ai/
│ ├── preprocessing/
│ │ ├── cleaner.py
│ │ ├── tokenizer.py
│ │ └── lemmatizer.py
│ │
│ ├── feature_engineering/
│ │ ├── payment_detector.py
│ │ ├── urgency_detector.py
│ │ └── ...feature modules...
│ │
│ ├── ml/
│ │ ├── train_xgboost.py
│ │ ├── train_logistic.py
│ │ └── inference.py
│ │
│ └── explainability/
│ └── shap_explainer.py
│
├── models/
│ ├── xgboost_model.pkl # Trained XGBoost model
│ ├── logistic_model.pkl # Logistic regression model
│ ├── vectorizer.pkl # Feature vectorizer
│ └── feature_order.pkl # Feature column order
│
├── data/
│ ├── raw/ # Raw data files
│ └── sample_scams.csv # Training data
│
├── tests/
│ ├── test_api.py
│ ├── test_preprocessing.py
│ └── test_models.py
│
├── docker/
│ └── Dockerfile
│
├── docker-compose.yml
├── requirements.txt
├── README.md
├── .gitignore
└── hireshield.db # SQLite database (auto-created)
## ⚙️ Installation
### Prerequisites
- **Python**: 3.11 or higher
- **pip**: Python package manager
- **Virtual Environment** (recommended): venv or conda
### Step 1: Clone Repository
git clone https://github.com/Sachin30102006/Hireshield.git
cd Hireshield
### Step 2: Create Virtual Environment
# Using venv
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Or using conda
conda create -n hireshield python=3.11
conda activate hireshield
### Step 3: Install Dependencies
pip install -r requirements.txt
### Step 4: Download NLP Models
python -m spacy download en_core_web_sm
python -m nltk.downloader punkt averaged_perceptron_tagger
### Step 5: Initialize Database
python -c "from backend.database import init_db; init_db()"
### Step 6: Train Models (Optional)
If the trained model files are missing from `models/`:
python -m models.train_model
## 🔧 Configuration
### Environment Variables
Create a `.env` file in the project root:
# Database
DATABASE_URL=sqlite:///./hireshield.db
# DATABASE_URL=postgresql://user:password@localhost/hireshield
# API
API_PORT=8000
API_HOST=0.0.0.0
# Frontend
STREAMLIT_PORT=8501
# Logging
LOG_LEVEL=INFO
# ML
MODEL_CONFIDENCE_THRESHOLD=0.5
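One way to read these variables in Python is sketched below. It assumes the values are already present in the process environment (for example exported by Docker Compose or loaded by a tool such as python-dotenv); the `Settings` class and `load_settings` function are hypothetical and may not match `backend/utils/config.py`. Defaults mirror the sample `.env` above.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    database_url: str
    api_host: str
    api_port: int
    streamlit_port: int
    log_level: str
    model_confidence_threshold: float


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, falling back to defaults."""
    threshold = float(env.get("MODEL_CONFIDENCE_THRESHOLD", "0.5"))
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("MODEL_CONFIDENCE_THRESHOLD must be between 0 and 1")
    return Settings(
        database_url=env.get("DATABASE_URL", "sqlite:///./hireshield.db"),
        api_host=env.get("API_HOST", "0.0.0.0"),
        api_port=int(env.get("API_PORT", "8000")),
        streamlit_port=int(env.get("STREAMLIT_PORT", "8501")),
        log_level=env.get("LOG_LEVEL", "INFO"),
        model_confidence_threshold=threshold,
    )


# Passing a plain dict keeps the example independent of the real environment.
settings = load_settings({"API_PORT": "9000", "LOG_LEVEL": "DEBUG"})
```

Accepting the environment as a parameter makes the loader easy to test without modifying `os.environ`.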
### Database Configuration
#### SQLite (Default)
DATABASE_URL = "sqlite:///./hireshield.db"
#### PostgreSQL (Production)
DATABASE_URL = "postgresql://user:password@localhost:5432/hireshield"
## 👨‍💻 Development
### Running Tests
# Run all tests
pytest
# With coverage
pytest --cov=backend --cov=frontend
# Specific test
pytest tests/test_api.py::test_analyze
### Code Quality
# Format code
black .
# Lint code
flake8 .
# Type checking
mypy backend/
### Adding New Features
1. Create feature branch: `git checkout -b feature/my-feature`
2. Write tests first (TDD)
3. Implement feature
4. Update documentation
5. Submit pull request
## 🚀 Deployment
### Docker Deployment
# Build and run
docker-compose up --build
# Run in detached mode
docker-compose up -d
# View logs
docker-compose logs -f
# Stop services
docker-compose down
### Cloud Deployment Options
#### Render.com
# Create web service pointing to backend/main.py
# Set start command: uvicorn backend.main:app --host 0.0.0.0 --port $PORT
#### Railway.app
# Deploy with Railway CLI
railway up
#### HuggingFace Spaces
# Push to HuggingFace Spaces repo
# Streamlit will auto-deploy frontend
#### AWS EC2
# Install dependencies, clone repo, run with Docker Compose
# Configure security groups for ports 8000 and 8501
### Production Checklist
- [ ] Use PostgreSQL instead of SQLite
- [ ] Restrict CORS to specific allowed origins
- [ ] Add authentication (JWT/OAuth)
- [ ] Set up HTTPS/SSL
- [ ] Configure logging and monitoring
- [ ] Set resource limits
- [ ] Regular database backups
- [ ] Load testing
- [ ] Security audit
## 📊 Performance Metrics
- **API Response Time**: <500ms per request
- **Model Accuracy**: 97.82% on test set
- **F1 Score**: 0.9801
- **Throughput**: 100+ requests/second (single instance)
- **Database Queries**: <50ms average
## 🙏 Acknowledgments
- Built with ❤️ for cybersecurity
- Inspired by enterprise SaaS platforms
- Community contributions welcome