AureolePosey/mobile-money-transactions-fraud-detection-pyspark
A mobile money fraud detection data pipeline built on PySpark and the Medallion Architecture, processing millions of transactions and producing fraud analytics reports.
*Mobile Money Fraud Detection Pipeline (PySpark)*
```
This project implements an end-to-end data pipeline for detecting fraudulent activities in simulated Mobile Money transactions. It processes 1,000,000+ transactions using a Medallion Architecture to ensure data quality, reliability, and traceability.
```
*Tech Stack*
```
- Compute Engine: Apache Spark (PySpark)
- Storage Format: Apache Parquet (partitioned by date)
- Environment: WSL (Windows Subsystem for Linux) / Ubuntu
- Tools: Python, Logging, OS Path Management
```
*Project Architecture*
The project is organized in a modular structure to ensure maintainability and scalability:
```text
fraud-detection-pipeline/
├── data/ # Data Lake layers (Raw, Clean, Curated, Analytics)
├── pipeline/ # Core ETL logic (Ingestion, Cleaning, Features, etc.)
├── utils/ # Utilities (Logger, SparkSession, Config)
├── logs/ # Pipeline execution history & monitoring
├── run_pipeline.py # Central Orchestrator (Main Entry Point)
├── generate_dataset.py # Synthetic transaction generator
└── README.md
```
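The orchestration in `run_pipeline.py` can be sketched as follows. This is a minimal illustration, not the project's actual code: the stage functions are placeholders standing in for the `pipeline/` modules, and the logger setup is an assumption (the real pipeline writes to `logs/pipeline.log`).

```python
import logging

logger = logging.getLogger("fraud_pipeline")
logging.basicConfig(level=logging.INFO)

# Placeholder stages standing in for the pipeline/ modules.
def ingestion():       logger.info("Ingestion: raw CSV -> Parquet")
def data_quality():    logger.info("Data Quality: dtype & null audit")
def cleaning():        logger.info("Cleaning: dedup, drop bad amounts")
def features():        logger.info("Feature Engineering: rolling stats")
def fraud_detection(): logger.info("Fraud Detection: business rules")
def fraud_analytics(): logger.info("Fraud Analytics: city/operator report")

STAGES = [ingestion, data_quality, cleaning, features,
          fraud_detection, fraud_analytics]

def run_pipeline():
    """Run stages in order; return the names of completed stages."""
    completed = []
    for stage in STAGES:
        logger.info("Starting stage: %s", stage.__name__)
        stage()
        completed.append(stage.__name__)
    return completed
```

Keeping the stage list in one place makes the execution order explicit and lets a failure in any stage halt the run before downstream layers are touched.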
*Pipeline Workflow*
The pipeline is orchestrated sequentially to transform raw data into actionable business insights:
```
1. Ingestion: reads raw CSV files and converts them to the optimized Parquet format.
2. Data Quality: automated audit of data types and missing values (null checks).
3. Cleaning: deduplication and filtering of inconsistent or negative transaction amounts.
4. Feature Engineering: generation of behavioral features (rolling averages, daily transaction frequency per user).
5. Fraud Detection: application of business rules (critical thresholds and relative anomalies).
6. Fraud Analytics: final reporting on fraud rates segmented by city and operator in Benin.
```
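The cleaning and rule-based detection steps can be illustrated in plain Python. Note this is a simplified sketch: the real pipeline operates on PySpark DataFrames, and the field names and the 100,000 critical threshold here are hypothetical.

```python
def clean(transactions):
    """Drop duplicate transaction ids and negative amounts."""
    seen, out = set(), []
    for tx in transactions:
        if tx["amount"] < 0 or tx["tx_id"] in seen:
            continue
        seen.add(tx["tx_id"])
        out.append(tx)
    return out

CRITICAL_THRESHOLD = 100_000  # hypothetical absolute threshold

def is_fraud(tx, user_avg):
    """Business rules: critical threshold, or relative anomaly
    versus the user's average transaction amount."""
    return tx["amount"] >= CRITICAL_THRESHOLD or tx["amount"] > 10 * user_avg

txs = [
    {"tx_id": 1, "amount": 5_000},
    {"tx_id": 1, "amount": 5_000},    # duplicate -> dropped
    {"tx_id": 2, "amount": -300},     # negative  -> dropped
    {"tx_id": 3, "amount": 250_000},  # above critical threshold
]
cleaned = clean(txs)
frauds = [t for t in cleaned if is_fraud(t, user_avg=6_000)]
```

In the actual pipeline these rules would be expressed as DataFrame filters so Spark can apply them across the full million-row dataset in parallel.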
*Key Engineering Concepts*
```
- Parquet Partitioning: storage optimization using transaction_date to enable high-performance queries via predicate pushdown.
- Professional Logging System: full execution traceability within pipeline.log, moving beyond basic print statements for production readiness.
- Idempotence: the pipeline is designed to be re-run safely without data duplication, using Spark's overwrite mode.
- Advanced Feature Engineering: Window Functions used to calculate standard deviations and user-specific behavior shifts.
```
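The per-user rolling statistics that Spark computes with a Window specification (ordered by timestamp, with a `rowsBetween` frame over the last few rows) can be mirrored in plain Python for intuition. The window size of 3 is an arbitrary choice for illustration:

```python
from statistics import mean, pstdev

def rolling_stats(amounts, window=3):
    """Rolling mean and population std over the last `window`
    transactions of one user, mirroring a Spark Window with
    rowsBetween(-(window - 1), 0)."""
    out = []
    for i in range(len(amounts)):
        frame = amounts[max(0, i - window + 1): i + 1]
        out.append((mean(frame), pstdev(frame)))
    return out

# One user's transaction amounts in time order.
stats = rolling_stats([10, 20, 30, 100], window=3)
```

A behavior-shift rule can then flag a transaction whose amount exceeds, say, the rolling mean plus a few standard deviations, which is what the "relative anomalies" rule above captures.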
*Sample Final Report*
After execution, the pipeline emits key metrics directly to the logs:
```
- Total Transactions Processed: 1,000,000+
- Detected Fraud Rate: ~1.5%
- High-Risk Locations: Cotonou, Parakou, Abomey-Calavi
```
*Installation & Usage*
Clone the repository:
```bash
git clone https://github.com/AureolePosey/fraud-detection-pipeline.git
cd fraud-detection-pipeline
```

Run the full pipeline:

```bash
python run_pipeline.py
```
Tags: Apache Spark, CSV to Parquet, ETL pipeline, Medallion Architecture, Parquet format, PySpark, Python, transaction monitoring, partitioned storage, anti-money laundering, big data processing, anomaly detection, batch processing, data analysis, data engineering, data cleaning, data lake, data science, data quality, fraud detection, feature engineering, mobile money, financial risk control