issaassii/SolanaScamDetection

GitHub: issaassii/SolanaScamDetection

Stars: 0 | Forks: 0

# Scam Token Detection on Solana Blockchain A machine learning system that detects fraudulent tokens on the Solana blockchain using on-chain financial data and Logistic Regression. ## Overview The Solana ecosystem sees a constant flood of new tokens, many of which are scams designed to rugpull investors within minutes. Traditional auditing is too slow for this environment. This project takes a data-driven approach — automatically fetching live token data, labeling it using domain-informed thresholds, and training a classifier to flag scam tokens before users lose money. ## Features - Custom dataset generation pipeline using GeckoTerminal and DexScreener APIs - Domain-informed scam labeling (liquidity traps, rugpull risk, FDV analysis) - Logistic Regression with balanced class weights to handle imbalanced data - Hyperparameter tuning (regularization strength C) to minimize overfitting - Achieves **0.84 F1-score** on scam token detection ## Dataset 100 recently active Solana tokens were fetched and labeled using the following features: | Feature | Description | |---|---| | Top 10 Holders % | High concentration = rugpull risk | | Total Liquidity (USD) | Low liquidity = tokens can't be sold | | FDV (Fully Diluted Valuation) | High FDV + low supply = liquidation risk | | GeckoTerminal Risk Score | Platform-generated safety rating (0–100) | | 24h / 5min Volume Change | Unusual volume spikes signal manipulation | | 24h / 5min Price Change | Rapid price movement is a scam indicator | | 24h Transaction Count | Activity level of the token | Tokens were labeled as **scam** or **not scam** based on threshold rules derived from DeFi domain knowledge. ## Model - **Algorithm:** Logistic Regression (scikit-learn) - **Preprocessing:** StandardScaler (all numerical features) - **Class Imbalance:** Handled via `class_weight='balanced'` - **Train/Test Split:** 70/30 - **Tuned Hyperparameter:** Regularization strength C ≈ 1 ## Results | Metric | Not Scam | Scam | |---|---|---| | Precision | 0.14 | 0.95 | | Recall | 0.50 | 0.75 | | F1-Score | 0.22 | **0.84** | | Accuracy | 73% | — | ## Challenges - Most free APIs (CoinGecko, SolScan, Helium) were unavailable or incomplete — required multiple pivots before settling on GeckoTerminal + DexScreener - Public datasets for Solana scam tokens are scarce; the data pipeline was built from scratch - Heavy class imbalance toward scam tokens required careful handling to avoid a biased model ## Future Work - Incorporate social media signals (Twitter/X activity, Telegram mentions) - Expand dataset with more tokens and additional features - Experiment with other models (Random Forest, XGBoost, Neural Networks) - Integrate paid APIs for richer data ## Tech Stack - Python - scikit-learn - pandas - GeckoTerminal API - DexScreener API ## Author Issa Assi linkedin.com/in/issaassi/