Sachin30102006/Hireshield

GitHub: Sachin30102006/Hireshield

Stars: 0 | Forks: 0

# 🛡️ HireShield - AI Recruitment Threat Intelligence Platform **Production-Grade AI Security Platform for Detecting Recruitment Scams** ## 📋 Table of Contents - [Project Overview](#project-overview) - [System Architecture](#system-architecture) - [Features](#features) - [Quick Start](#quick-start) - [API Documentation](#api-documentation) - [Project Structure](#project-structure) - [Installation](#installation) - [Configuration](#configuration) - [Development](#development) - [Deployment](#deployment) - [Contributing](#contributing) ## 🚀 Project Overview HireShield is a comprehensive AI-powered platform designed to detect and analyze recruitment-related threats including: - **Fake Job Offers** - AI-detected impersonation and fraudulent positions - **Phishing Emails** - Detection of credential harvesting attempts - **Payment Scams** - Identification of advance payment and fee requests - **Social Engineering** - Analysis of manipulation and urgency tactics - **Identity Theft** - Recognition of identity verification requests - **Recruiter Impersonation** - Verification of recruiter legitimacy ### Key Technologies - **AI/ML**: scikit-learn, XGBoost, SHAP - **NLP**: spaCy, NLTK, Transformers (DistilBERT) - **Backend**: FastAPI, SQLAlchemy - **Frontend**: Streamlit - **Database**: SQLite (PostgreSQL-ready) - **Deployment**: Docker, Docker Compose ## 🏗️ System Architecture ### Layered Architecture ┌─────────────────────────────────────────────────────────────┐ │ FRONTEND LAYER │ │ (Streamlit Web Application) │ └────────────────────┬────────────────────────────────────────┘ │ HTTP/REST API ┌────────────────────▼────────────────────────────────────────┐ │ API LAYER │ │ (FastAPI - RESTful Endpoints) │ ├─────────┬──────────┬──────────┬──────────┬──────────────────┤ │ Analyze │ Explain │ Logs │ Threat │ Health/Status │ │ /analyze│ /explain │ /logs │ /stats │ /health │ └────────────────────┬────────────────────────────────────────┘ │ Service Injection ┌────────────────────▼────────────────────────────────────────┐ │ SERVICE LAYER │ ├──────────────┬──────────┬──────────┬───────┬─────────────────┤ │ Inference │ Features │ Explain- │ Threat│ Logging │ │ (Models) │ (Eng.) │ ability │(Intel)│ (Persistence) │ └────────────────────┬────────────────────────────────────────┘ │ Dependency Injection ┌────────────────────▼────────────────────────────────────────┐ │ AI/NLP LAYER │ ├──────────────┬──────────────┬───────────────────────────────┤ │ Preprocessing│ Feature Eng. │ ML Inference │ │ (Cleaning) │ (Extraction) │ (XGBoost, Logistic Reg.) │ └────────────────────┬────────────────────────────────────────┘ │ ┌────────────────────▼────────────────────────────────────────┐ │ DATABASE LAYER │ │ (SQLAlchemy ORM + SQLite/PostgreSQL) │ ├──────────────┬──────────────┬──────────────────────────────┤ │ Scan Logs │ Recruiter │ Threat Intelligence / Models │ │ │ Profiles │ Training Logs │ └──────────────┴──────────────┴──────────────────────────────┘ ### Key Design Decisions 1. **Separation of Concerns**: Clear separation between frontend, API, services, and data layers 2. **Dependency Injection**: Services are loosely coupled and easy to test 3. **Asynchronous API**: FastAPI enables high-performance async endpoints 4. **Database Abstraction**: SQLAlchemy ORM allows easy migration to PostgreSQL 5. **Modular Services**: Each service handles a specific responsibility 6. **Type Safety**: Pydantic models ensure request/response validation ## ✨ Features ### Analysis Engine - **Multi-Model Inference**: XGBoost (primary) + Logistic Regression (baseline) - **Real-Time Processing**: Analysis completes in <500ms - **Confidence Scoring**: Calibrated probability estimates - **Feature-Rich Detection**: 23+ behavioral and linguistic features ### Explainability - **SHAP Integration**: Feature importance and local explanations - **Human-Readable Reasoning**: Step-by-step decision explanations - **Confidence Assessment**: Why the model is confident in its prediction - **Top Contributing Features**: Identification of key fraud signals ### Threat Intelligence - **Recruiter Profiling**: Trust scoring and historical tracking - **Threat Categorization**: Classification into 8+ threat types - **Pattern Recognition**: Identification of recurring fraud patterns - **Analytics Dashboard**: Trend analysis and statistics ### Logging & Monitoring - **Comprehensive Audit Trail**: All scans logged with metadata - **Search & Filtering**: Advanced log queries with multiple filters - **Analytics Snapshots**: Periodic threat intelligence summaries - **Model Training History**: Tracking of model performance over time ### Security & Compliance - **Data Persistence**: Secure storage in SQLite/PostgreSQL - **Access Logging**: Complete audit trail of system activity - **Error Handling**: Graceful degradation and error reporting - **Health Monitoring**: Real-time system status checks ## 🚀 Quick Start ### Prerequisites - Python 3.11+ - Docker & Docker Compose (for containerized deployment) - 2GB+ RAM ### Installation 1. **Clone Repository** git clone https://github.com/yourusername/hireshield.git cd hireshield 2. **Create Virtual Environment** python -m venv .venv source .venv/bin/activate # On Windows: .venv\Scripts\activate 3. **Install Dependencies** pip install -r requirements.txt python -m spacy download en_core_web_sm 4. **Initialize Database** python -c "from backend.database import init_db; init_db()" 5. **Train Models** (if needed) python -m models.train_model ### Running the Application #### Option 1: Local Development (Separate Services) **Terminal 1 - Backend API:** uvicorn backend.main:app --host 0.0.0.0 --port 8000 --reload **Terminal 2 - Frontend:** streamlit run frontend/app.py Then visit: - **Frontend**: http://localhost:8501 - **API Docs**: http://localhost:8000/api/docs - **API ReDoc**: http://localhost:8000/api/redoc #### Option 2: Docker Compose (Recommended) docker-compose up --build Then visit: - **Frontend**: http://localhost:8501 - **Backend**: http://localhost:8000 - **API Docs**: http://localhost:8000/api/docs ## 📚 API Documentation ### Base URL http://localhost:8000/api ### Authentication Currently no authentication required. Add JWT/OAuth in production. ### Endpoints #### Analysis Endpoints **POST /analyze** - Basic scam analysis Request: { "text": "Your recruitment message here...", "recruiter_email": "recruiter@company.com", "recruiter_name": "John Doe", "job_title": "Senior Engineer", "company_name": "TechCorp" } Response: { "scam_probability": 87.5, "risk_level": "HIGH RISK", "confidence": 0.94, "detected_indicators": [...], "highlighted_phrases": [...], "feature_scores": {...}, "processing_time_ms": 125.5 } **POST /deep-scan** - Advanced analysis with SHAP Request: { "text": "...", "recruiter_email": "...", "include_shap": true, "confidence_threshold": 0.7 } Response: Same as /analyze with enhanced feature importance #### Explainability Endpoints **POST /explain** - Get SHAP-based explanations Request: { "text": "...", "recruiter_email": "..." } Response: { "scam_probability": 87.5, "risk_level": "HIGH RISK", "shap_available": true, "feature_importance": [...], "top_contributing_features": ["urgency_score", "payment_request"], "explanation_text": "The AI detected multiple fraud indicators...", "reasoning": [...] } #### Log Endpoints **GET /logs** - Retrieve scan history Query Parameters: - limit: 50 (max 1000) - offset: 0 - severity_level: CRITICAL|HIGH RISK|SUSPICIOUS|SAFE - status: Blocked|Quarantined|Flagged|Verified - search_query: Search in email/category Response: { "logs": [...], "total_count": 145, "limit": 50, "offset": 0 } **GET /logs/{log_id}** - Get specific log details **DELETE /logs/{log_id}** - Delete a log entry #### Threat Intelligence Endpoints **GET /threat-stats** - Aggregated threat statistics Query Parameters: - days: 30 (lookback period) Response: { "total_scans": 1250, "critical_count": 145, "high_risk_count": 340, "average_scam_probability": 45.3, "top_threat_categories": [...] } **GET /threat-summary** - Quick threat summary #### Health & Status Endpoints **GET /health** - API health check Response: { "status": "healthy", "api_version": "1.0.0", "models_loaded": true, "database_connected": true, "message": "All systems operational" } **GET /model-info** - Model information **GET /status** - Comprehensive system status ### Error Responses { "error": "Error type", "status_code": 400, "timestamp": "2026-05-21T10:30:00Z", "details": {...} } ## 📁 Project Structure HireShield/ │ ├── frontend/ │ ├── app.py # Streamlit main application │ ├── api_client.py # API client for REST communication │ └── views/ │ ├── dashboard.py # Dashboard view │ ├── analysis.py # Scam analysis view │ ├── threat_intel.py # Threat intelligence view │ ├── explainability.py # Explainability view │ ├── logs.py # Detection logs view │ └── settings.py # Settings view │ ├── backend/ │ ├── main.py # FastAPI application entry point │ │ │ ├── routers/ │ │ ├── analyze.py # Analysis endpoints │ │ ├── explain.py # Explainability endpoints │ │ ├── logs.py # Log management endpoints │ │ ├── threat.py # Threat intelligence endpoints │ │ └── health.py # Health & status endpoints │ │ │ ├── services/ │ │ ├── inference_service.py # ML model inference │ │ ├── feature_service.py # Feature extraction │ │ ├── explainability_service.py # SHAP explanations │ │ ├── logging_service.py # Log persistence │ │ └── threat_service.py # Threat intelligence │ │ │ ├── models/ │ │ ├── request_models.py # Pydantic request schemas │ │ └── response_models.py # Pydantic response schemas │ │ │ ├── database/ │ │ ├── db.py # Database connection & config │ │ ├── schemas.py # SQLAlchemy ORM models │ │ └── operations.py # Database CRUD operations │ │ │ └── utils/ │ └── config.py # Configuration utilities │ ├── ai/ │ ├── preprocessing/ │ │ ├── cleaner.py │ │ ├── tokenizer.py │ │ └── lemmatizer.py │ │ │ ├── feature_engineering/ │ │ ├── payment_detector.py │ │ ├── urgency_detector.py │ │ └── ...feature modules... │ │ │ ├── ml/ │ │ ├── train_xgboost.py │ │ ├── train_logistic.py │ │ └── inference.py │ │ │ └── explainability/ │ └── shap_explainer.py │ ├── models/ │ ├── xgboost_model.pkl # Trained XGBoost model │ ├── logistic_model.pkl # Logistic regression model │ ├── vectorizer.pkl # Feature vectorizer │ └── feature_order.pkl # Feature column order │ ├── data/ │ ├── raw/ # Raw data files │ └── sample_scams.csv # Training data │ ├── tests/ │ ├── test_api.py │ ├── test_preprocessing.py │ └── test_models.py │ ├── docker/ │ └── Dockerfile │ ├── docker-compose.yml ├── requirements.txt ├── README.md ├── .gitignore └── hireshield.db # SQLite database (auto-created) ## ⚙️ Installation ### Prerequisites - **Python**: 3.11 or higher - **pip**: Python package manager - **Virtual Environment** (recommended): venv or conda ### Step 1: Clone Repository git clone https://github.com/yourusername/hireshield.git cd hireshield ### Step 2: Create Virtual Environment # Using venv python -m venv .venv source .venv/bin/activate # On Windows: .venv\Scripts\activate # Or using conda conda create -n hireshield python=3.11 conda activate hireshield ### Step 3: Install Dependencies pip install -r requirements.txt ### Step 4: Download NLP Models python -m spacy download en_core_web_sm python -m nltk.downloader punkt averaged_perceptron_tagger ### Step 5: Initialize Database python -c "from backend.database import init_db; init_db()" ### Step 6: Train Models (Optional) If models don't exist in `/models/`: python -m models.train_model ## 🔧 Configuration ### Environment Variables Create a `.env` file in the project root: # Database DATABASE_URL=sqlite:///./hireshield.db # DATABASE_URL=postgresql://user:password@localhost/hireshield # API API_PORT=8000 API_HOST=0.0.0.0 # Frontend STREAMLIT_PORT=8501 # Logging LOG_LEVEL=INFO # ML MODEL_CONFIDENCE_THRESHOLD=0.5 ### Database Configuration #### SQLite (Default) DATABASE_URL = "sqlite:///./hireshield.db" #### PostgreSQL (Production) DATABASE_URL = "postgresql://user:password@localhost:5432/hireshield" ## 👨‍💻 Development ### Running Tests # Run all tests pytest # With coverage pytest --cov=backend --cov=frontend # Specific test pytest tests/test_api.py::test_analyze ### Code Quality # Format code black . # Lint code flake8 . # Type checking mypy backend/ ### Adding New Features 1. Create feature branch: `git checkout -b feature/my-feature` 2. Write tests first (TDD) 3. Implement feature 4. Update documentation 5. Submit pull request ## 🚀 Deployment ### Docker Deployment # Build and run docker-compose up --build # Run in detached mode docker-compose up -d # View logs docker-compose logs -f # Stop services docker-compose down ### Cloud Deployment Options #### Render.com # Create web service pointing to backend/main.py # Set start command: uvicorn backend.main:app --host 0.0.0.0 --port $PORT #### Railway.app # Deploy with Railway CLI railway up #### HuggingFace Spaces # Push to HuggingFace Spaces repo # Streamlit will auto-deploy frontend #### AWS EC2 # Install dependencies, clone repo, run with Docker Compose # Configure security groups for ports 8000 and 8501 ### Production Checklist - [ ] Use PostgreSQL instead of SQLite - [ ] Enable CORS with specific origins - [ ] Add authentication (JWT/OAuth) - [ ] Set up HTTPS/SSL - [ ] Configure logging and monitoring - [ ] Set resource limits - [ ] Regular database backups - [ ] Load testing - [ ] Security audit ## 📊 Performance Metrics - **API Response Time**: <500ms per request - **Model Accuracy**: 97.82% on test set - **F1 Score**: 0.9801 - **Throughput**: 100+ requests/second (single instance) - **Database Queries**: <50ms average ## 🙏 Acknowledgments - Built with ❤️ for cybersecurity - Inspired by enterprise SaaS platforms - Community contributions welcome