AureolePosey/mobile-money-transactions-fraud-detection-pyspark

A mobile money fraud-detection data pipeline built on PySpark and the Medallion Architecture; it processes millions of transactions and produces fraud-analysis reports.

Stars: 0 | Forks: 0

*Mobile Money Fraud Detection Pipeline (PySpark)*

This project implements an end-to-end data pipeline for detecting fraudulent activity in simulated Mobile Money transactions. It processes 1,000,000+ transactions using a Medallion Architecture to ensure data quality, reliability, and traceability.

*Tech Stack*

- Compute Engine: Apache Spark (PySpark)
- Storage Format: Apache Parquet (partitioned by date)
- Environment: WSL (Windows Subsystem for Linux) / Ubuntu
- Tools: Python, Logging, OS Path Management

*Project Architecture*

The project is organized into modules to keep it maintainable and extensible:

```
fraud-detection-pipeline/
├── data/                 # Data Lake layers (Raw, Clean, Curated, Analytics)
├── pipeline/             # Core ETL logic (Ingestion, Cleaning, Features, etc.)
├── utils/                # Utilities (Logger, SparkSession, Config)
├── logs/                 # Pipeline execution history & monitoring
├── run_pipeline.py       # Central orchestrator (main entry point)
├── generate_dataset.py   # Synthetic transaction generator
└── README.md
```

*Pipeline Workflow*

The pipeline runs as an ordered sequence of stages that turn raw data into actionable business insights:

1. Ingestion: reads raw CSV files and converts them to the optimized Parquet format.
2. Data Quality: automated audit of data types and missing values (null checks).
3. Cleaning: deduplication and filtering of inconsistent or negative transaction amounts.
4. Feature Engineering: generation of behavioral features (rolling averages, daily transaction frequency per user).
5. Fraud Detection: application of business rules (critical thresholds and relative anomalies).
6. Fraud Analytics: final reporting on fraud rates segmented by city and operator in Benin.

*Key Engineering Concepts*

- Parquet Partitioning: storage is partitioned by transaction_date to enable high-performance queries via predicate pushdown.
- Professional Logging: full execution traceability in pipeline.log, moving beyond basic print statements for production readiness.
- Idempotence: the pipeline can be re-run safely without duplicating data, using Spark's overwrite mode.
- Advanced Feature Engineering: window functions compute standard deviations and user-specific behavior shifts.

*Sample Final Report*

After a run, the pipeline writes key metrics directly to the log:

- Total transactions processed: 1,000,000+
- Detected fraud rate: ~1.5%
- High-risk locations: Cotonou, Parakou, Abomey-Calavi

*Installation & Usage*

Clone the repository:

```bash
git clone https://github.com/AureolePosey/mobile-money-transactions-fraud-detection-pyspark.git
```

Run the full pipeline:

```bash
python run_pipeline.py
```
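The ingestion, partitioning, and idempotence ideas above can be sketched in a few lines of PySpark. This is a minimal illustration, not the repo's actual code: the column names (`txn_id`, `user_id`, `amount`, `timestamp`) and in-memory rows are assumptions standing in for the raw CSV input.

```python
import tempfile

from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .master("local[1]")
         .appName("ingestion-sketch")
         .getOrCreate())

# Synthetic rows standing in for the raw CSV input.
rows = [
    ("t1", "user_a", 1500.0, "2024-01-01 10:00:00"),
    ("t2", "user_b", 250.0,  "2024-01-02 11:30:00"),
    ("t3", "user_a", 90.0,   "2024-01-02 12:45:00"),
]
df = spark.createDataFrame(rows, ["txn_id", "user_id", "amount", "timestamp"])

# Derive the partition column, then write Parquet partitioned by date.
# mode("overwrite") makes the step idempotent: a re-run replaces the
# output instead of appending duplicates.
out = tempfile.mkdtemp()
(df.withColumn("transaction_date", F.to_date("timestamp"))
   .write.mode("overwrite")
   .partitionBy("transaction_date")
   .parquet(out))

# A filter on the partition column benefits from predicate pushdown:
# only the matching partition directory needs to be scanned.
day2 = spark.read.parquet(out).filter(F.col("transaction_date") == "2024-01-02")
print(day2.count())
```

Because each `transaction_date` value becomes its own directory, the final filter reads one partition rather than the whole dataset.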
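The cleaning stage (deduplication plus filtering of inconsistent amounts) might look like the following sketch; the duplicate key choice (`txn_id`) and the "strictly positive amount" rule are illustrative assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .master("local[1]")
         .appName("cleaning-sketch")
         .getOrCreate())

rows = [
    ("t1", "user_a", 1500.0),
    ("t1", "user_a", 1500.0),   # exact duplicate
    ("t2", "user_b", -40.0),    # inconsistent negative amount
    ("t3", "user_b", 250.0),
]
df = spark.createDataFrame(rows, ["txn_id", "user_id", "amount"])

# Drop duplicate transaction IDs, then discard rows whose amount
# is null or not strictly positive.
clean = (df.dropDuplicates(["txn_id"])
           .filter(F.col("amount") > 0))

print(clean.count())
```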
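The window-function features mentioned above (per-user rolling averages, standard deviations, daily transaction counts) can be sketched like this. Column names and the exact window bounds are hypothetical; the key idea is an expanding window over each user's *previous* transactions.

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = (SparkSession.builder
         .master("local[1]")
         .appName("features-sketch")
         .getOrCreate())

rows = [
    ("user_a", "2024-01-01 09:00:00", 100.0),
    ("user_a", "2024-01-01 18:00:00", 120.0),
    ("user_a", "2024-01-02 09:00:00", 5000.0),
    ("user_b", "2024-01-01 10:00:00", 50.0),
]
df = spark.createDataFrame(rows, ["user_id", "ts", "amount"])

# Expanding window over each user's history, ending one row before
# the current transaction (so a row never sees its own amount).
history = (Window.partitionBy("user_id")
                 .orderBy("ts")
                 .rowsBetween(Window.unboundedPreceding, -1))

# Transactions per user per calendar day.
daily = Window.partitionBy("user_id", F.to_date("ts"))

features = (df
    .withColumn("user_avg_prev", F.avg("amount").over(history))
    .withColumn("user_std_prev", F.stddev("amount").over(history))
    .withColumn("daily_txn_count", F.count("*").over(daily)))

big = features.filter(F.col("amount") == 5000.0).first()
print(big["user_avg_prev"])    # average of the user's earlier amounts
print(big["daily_txn_count"])
```

Comparing `amount` against `user_avg_prev` and `user_std_prev` is what makes "relative anomaly" rules possible downstream.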
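The fraud-detection stage applies business rules such as critical thresholds and relative anomalies. A minimal sketch follows; the threshold values (2,000 absolute, 3x the user's historical average) and the precomputed `user_avg` column are invented for illustration, not taken from the repo.

```python
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .master("local[1]")
         .appName("rules-sketch")
         .getOrCreate())

# txn_id, city, amount, and the user's historical average amount
# (assumed to come from the feature-engineering stage).
rows = [
    ("t1", "Cotonou", 120.0,  110.0),
    ("t2", "Cotonou", 5000.0, 130.0),
    ("t3", "Parakou", 90.0,   100.0),
    ("t4", "Parakou", 400.0,  95.0),
]
df = spark.createDataFrame(rows, ["txn_id", "city", "amount", "user_avg"])

# Rule 1: absolute critical threshold.
# Rule 2: relative anomaly vs. the user's own history.
flagged = df.withColumn(
    "is_fraud",
    (F.col("amount") > 2000) | (F.col("amount") > 3 * F.col("user_avg")),
)

# Analytics step: fraud rate segmented by city.
report = (flagged.groupBy("city")
                 .agg(F.avg(F.col("is_fraud").cast("int")).alias("fraud_rate")))
report.show()
```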
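Finally, the logging concept (every stage writing to `pipeline.log` instead of using `print`) can be set up with Python's standard `logging` module. This sketch writes to a temporary directory; the real pipeline presumably targets the `logs/` folder.

```python
import logging
import os
import tempfile

# Temporary stand-in for logs/pipeline.log.
log_path = os.path.join(tempfile.mkdtemp(), "pipeline.log")

# One named logger with file + console handlers, so stage messages
# are both displayed and persisted for later auditing.
logger = logging.getLogger("pipeline")
logger.setLevel(logging.INFO)
fmt = logging.Formatter("%(asctime)s [%(levelname)s] %(name)s - %(message)s")
for handler in (logging.FileHandler(log_path), logging.StreamHandler()):
    handler.setFormatter(fmt)
    logger.addHandler(handler)

logger.info("Ingestion finished: rows written to Parquet")

with open(log_path) as f:
    contents = f.read()
print("Ingestion finished" in contents)
```

Timestamps and level names in the log make it possible to reconstruct exactly which stage ran when, which `print` statements cannot offer.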
Tags: Apache Spark, CSV-to-Parquet, ETL pipeline, Medallion Architecture, Parquet format, PySpark, Python, transaction monitoring, partitioned storage, anti-money laundering, big data processing, anomaly detection, batch processing, data analytics, data engineering, data cleaning, data lake, data science, data quality, fraud detection, feature engineering, mobile payments, financial risk control