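# Healthcare Fraud Detection Data Framework
This is a reference implementation of the health insurance fraud detection approach described in the article cited at the end of this README.
**Transparency note:** The framework was first conceived around February 20, 2025, and was later formalized as a GitHub reference implementation based on the published article.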
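## Overview
Health insurance fraud costs payers billions of dollars every year. This framework provides a modular six-stage data pipeline that combines rule-based fraud detection, statistical anomaly identification, and transparent risk scoring, putting the multi-layered detection approach described in the article into practice.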
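### Six-Stage Pipeline
| Stage | Module | Purpose |
|---|---|---|
| 1. Data ingestion | `ingestion` | Loads CSV claims, computes a checksum, assigns a UUID dataset ID |
| 2. Schema validation | `schema_validator` | Enforces required fields, numeric types, and claim ID uniqueness |
| 3. Claims profiling | `claims_profiler` | Distribution statistics, provider/member behavior, duplicate detection |
| 4. Fraud rule engine | `fraud_rules` | YAML-driven: duplicate claims, high claim amounts, frequent billing, repeated procedures, missing identifiers |
| 5. Anomaly detection | `anomaly_detector` | IQR/Z-score outlier detection on claim amounts |
| 6. Risk scoring | `risk_score` | Weighted 0-100 fraud risk score with five components |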
Supporting services (**audit logging**, **data lineage tracking**, and a **SQLite metadata store**) run transparently at every stage.
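To make stage 1 concrete, here is a minimal sketch of CSV ingestion that computes a checksum and assigns a UUID dataset ID. It is not the code in `ingestion.py`: the function name `ingest_claims`, the metadata keys, and the choice of SHA-256 are assumptions, since this README only says "checksum".

```python
import csv
import hashlib
import io
import uuid
from pathlib import Path


def ingest_claims(path):
    """Load a claims CSV and return (rows, metadata). Hypothetical helper."""
    raw = Path(path).read_bytes()
    # Assumption: SHA-256 over the raw file bytes serves as the checksum.
    checksum = hashlib.sha256(raw).hexdigest()
    # Parse the same bytes that were hashed, so rows and checksum agree.
    reader = csv.DictReader(io.StringIO(raw.decode("utf-8"), newline=""))
    rows = list(reader)
    metadata = {
        "dataset_id": str(uuid.uuid4()),  # random UUID identifying this load
        "source": Path(path).name,
        "checksum": checksum,
        "row_count": len(rows),
    }
    return rows, metadata
```

Hashing the raw bytes before parsing means the checksum identifies the exact file that was loaded, independent of how the CSV is later interpreted.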
## Architecture
```
CSV Claims ──► Ingestion ──► Schema Validation ──► Claims Profiling
                                                          │
                                        ┌─────────────────┘
                                        ▼
                                Fraud Rule Engine ──► Anomaly Detector ──► Risk Score
                                        │
                          ┌─────────────┴───────────────────────────────────────┐
                          ▼                                                     ▼
                  Audit Log (JSONL)                                    Lineage Log (JSONL)
                          │
                          ▼
                SQLite Metadata Store
```
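The audit log in the diagram is a JSONL file, meaning one JSON object per line. A minimal sketch of an append-only writer follows. The names `append_event` and `read_events` and the record fields are assumptions for illustration, not the interface of `audit_logger.py`.

```python
import json
from datetime import datetime, timezone


def append_event(log_path, stage, event, **details):
    """Append one audit event as a single JSON line. Hypothetical helper."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "stage": stage,
        "event": event,
        "details": details,
    }
    # Open in append mode so earlier lines are never rewritten.
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def read_events(log_path):
    """Read every event back, skipping blank lines."""
    with open(log_path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Append-only writes are what make such a log behave as immutable in practice: each stage adds a line and nothing rewrites history.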
## Project Structure
```
healthcare-fraud-detection-data-framework/
├── src/hfddf/
│ ├── __init__.py
│ ├── exceptions.py # Domain exception hierarchy
│ ├── config.py # YAML loader
│ ├── ingestion.py # CSV ingestion + metadata
│ ├── schema_validator.py # Required fields, type checks
│ ├── claims_profiler.py # Statistical profiling
│ ├── fraud_rules.py # Rule engine (YAML-driven)
│ ├── anomaly_detector.py # IQR / Z-score detection
│ ├── risk_score.py # 0–100 weighted risk score
│ ├── audit_logger.py # Immutable JSONL audit events
│ ├── lineage_tracker.py # JSONL lineage records
│ └── metadata_store.py # SQLite persistence
├── tests/ # 94 pytest tests
├── examples/
│ ├── sample_pipeline.py # End-to-end demonstration
│ ├── sample_claims.csv # 30-claim dataset with fraud patterns
│ └── fraud_rules.yaml # Declarative fraud rule configuration
├── docs/
│ ├── architecture.md
│ ├── article_mapping.md
│ ├── fraud_detection_model.md
│ ├── data_quality_model.md
│ ├── verification_checklist.md
│ └── test_results.md
├── .github/workflows/ci.yml
├── Dockerfile
└── pyproject.toml
```
## Installation
```
git clone https://github.com/Sunil-Kumar-Mudusu/healthcare-fraud-detection-data-framework
cd healthcare-fraud-detection-data-framework
pip install -e ".[dev]"
```
## Quick Start
```
# Run all tests
pytest -q
# Run the end-to-end pipeline demo
python examples/sample_pipeline.py
```
## Fraud Rule Configuration
Fraud rules are declared in `examples/fraud_rules.yaml`, so detection logic can be added or changed without modifying code:
```
rules:
  - id: FR001
    type: duplicate_claim
    description: "Claim ID appears more than once in the dataset"
  - id: FR002
    type: high_claim_amount
    threshold: 10000
    description: "Claim amount exceeds $10,000"
  - id: FR003
    type: frequent_provider_billing
    threshold: 5
    description: "Provider submitted more than 5 claims in this batch"
  - id: FR004
    type: repeated_member_procedure
    threshold: 2
    description: "Same member/procedure pair billed more than twice"
  - id: FR005
    type: missing_identifier
    columns: [provider_id, member_id]
    description: "Required claim identifier is blank"
```
### Supported Rule Types
| Type | Key parameters |
|---|---|
| `duplicate_claim` | none |
| `high_claim_amount` | `threshold` (float) |
| `frequent_provider_billing` | `threshold` (integer) |
| `repeated_member_procedure` | `threshold` (integer) |
| `missing_identifier` | `columns` (list) |
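The five rule types can be evaluated with a small dispatcher over already-parsed rule dictionaries. The sketch below shows one way to do that and should not be read as the logic in `fraud_rules.py`. The function names are hypothetical, and the column names `claim_id`, `claim_amount`, and `procedure_code` are assumptions; only `provider_id` and `member_id` appear in the configuration above.

```python
from collections import Counter


def _matches(rule, claim, counts):
    """Return True if a single claim violates a single rule."""
    kind = rule["type"]
    if kind == "duplicate_claim":
        return counts["claim"][claim["claim_id"]] > 1
    if kind == "high_claim_amount":
        return float(claim["claim_amount"]) > rule["threshold"]
    if kind == "frequent_provider_billing":
        return counts["provider"][claim["provider_id"]] > rule["threshold"]
    if kind == "repeated_member_procedure":
        pair = (claim["member_id"], claim["procedure_code"])
        return counts["pair"][pair] > rule["threshold"]
    if kind == "missing_identifier":
        return any(not (claim.get(col) or "").strip() for col in rule["columns"])
    raise ValueError(f"unsupported rule type: {kind}")


def evaluate_rules(rules, claims):
    """Return {rule_id: [flagged claim_ids]} for rules with at least one hit."""
    counts = {
        "claim": Counter(c["claim_id"] for c in claims),
        # Blank provider IDs are excluded so they are not counted as a provider.
        "provider": Counter(c["provider_id"] for c in claims if c["provider_id"]),
        "pair": Counter((c["member_id"], c["procedure_code"]) for c in claims),
    }
    violations = {}
    for rule in rules:
        hits = [c["claim_id"] for c in claims if _matches(rule, c, counts)]
        if hits:
            violations[rule["id"]] = hits
    return violations
```

Because the batch-level counts are computed once up front, adding a rule of an existing type only requires a new entry in the YAML file, which matches the no-code-change goal stated above.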
## Sample Output
```
============================================================
Healthcare Fraud Detection Data Framework — Pipeline Run
============================================================
Pipeline : healthcare_fraud_detection
Source : sample_claims.csv
[4] Fraud Rule Engine
rules evaluated: 5
violations : 5
flagged claims : 7
• [FR001] Claim ID appears more than once (n=1)
• [FR002] Claim amount exceeds $10,000 (n=3)
• [FR003] Provider submitted more than 5 claims (n=1)
• [FR004] Same member received same procedure >2× (n=1)
• [FR005] Required identifier is blank (n=1)
[6] Risk Score
score : 49.42/100
level : MEDIUM
============================================================
VERIFICATION SUMMARY
============================================================
Schema : WARNINGS (75.0/100)
Fraud Violations: 5 / 5 rules triggered
Anomalies : 5 claim(s) outside normal range
Risk Score : 49.42/100 (MEDIUM)
Lineage Steps : 6
Audit Events : 7
============================================================
```
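The anomaly count in the summary comes from stage 5, which applies IQR/Z-score outlier detection to claim amounts. A minimal sketch of both checks is shown below. The fence multiplier `k=1.5`, the Z-score threshold of 3, and the function names are assumptions; this README does not state the values `anomaly_detector.py` uses.

```python
import statistics


def iqr_outliers(amounts, k=1.5):
    """Indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(amounts, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, a in enumerate(amounts) if a < low or a > high]


def zscore_outliers(amounts, threshold=3.0):
    """Indices of values whose absolute Z-score exceeds the threshold."""
    mean = statistics.fmean(amounts)
    stdev = statistics.pstdev(amounts)  # population standard deviation
    if stdev == 0:
        return []  # all values identical, so nothing can be an outlier
    return [i for i, a in enumerate(amounts) if abs(a - mean) / stdev > threshold]
```

On small batches the IQR check is usually the more useful of the two, because a single extreme value inflates the standard deviation and can keep its own Z-score below the threshold.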
## Risk Score Formula
```
risk_score = (rule_violations / total_rules) × 40
+ (anomalies / total_claims) × 25
+ (duplicates / total_claims) × 20
+ (max_provider / total_claims) × 10
+ ((100 - schema_score) / 100) × 5
```
Risk levels: **Low** (<20) · **Medium** (<50) · **High** (<75) · **Critical** (≥75)
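The formula and level thresholds translate directly into code. The sketch below is an illustration, not the contents of `risk_score.py`: the function names are hypothetical, and `max_provider` is read here as the largest number of claims submitted by a single provider, which is an assumption.

```python
def risk_score(rule_violations, total_rules, anomalies, duplicates,
               max_provider, total_claims, schema_score):
    """Weighted 0-100 risk score built from the five components above."""
    score = (
        rule_violations / total_rules * 40
        + anomalies / total_claims * 25
        + duplicates / total_claims * 20
        + max_provider / total_claims * 10
        + (100 - schema_score) / 100 * 5
    )
    return round(score, 2)


def risk_level(score):
    """Map a score to the level bands listed above."""
    if score < 20:
        return "LOW"
    if score < 50:
        return "MEDIUM"
    if score < 75:
        return "HIGH"
    return "CRITICAL"


# Sample run: 5 of 5 rules triggered, 5 anomalies, 30 claims, schema score 75.0.
# duplicates=2 and max_provider=8 are NOT reported in this README; they are
# assumed values that happen to be consistent with the reported 49.42.
score = risk_score(5, 5, 5, duplicates=2, max_provider=8,
                   total_claims=30, schema_score=75.0)
print(score, risk_level(score))  # 49.42 MEDIUM
```

With all rules triggered, the rule component alone contributes its full 40 points, so the remaining four components decide whether a batch lands in the medium or high band.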
## Testing
```
pytest -q
# 94 tests passed in 0.91 s
```
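## Verification Checklist
- [x] 94 tests pass across 9 test modules
- [x] End-to-end pipeline runs on sample_claims.csv
- [x] All 5 fraud rules trigger on the sample dataset
- [x] 5 IQR anomalies detected
- [x] Risk score of 49.42/100 (MEDIUM)
- [x] SQLite metadata store populated with 6 tables
- [x] JSONL audit log and data lineage files written
- [x] GitHub Actions CI configured for Python 3.11 and 3.12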
## License
MIT. See [LICENSE](LICENSE) for details.
## Citation
```
@article{mudusu2025fraud,
author = {Mudusu, Sunil Kumar},
title = {Health Insurance Fraud Detection: The Role of Advanced IT Systems
in Preventing and Identifying Fraud},
journal = {International Journal of Computer Engineering and Technology (IJCET)},
volume = {16},
number = {1},
year = {2025},
url = {https://iaeme.com/MasterAdmin/Journal_uploads/IJCET/VOLUME_16_ISSUE_1/IJCET_16_01_259.pdf}
}
```