Stdelarosa/reference-data-quality-engine
GitHub: Stdelarosa/reference-data-quality-engine
一个基于 PostgreSQL 和 Python 的金融证券参考数据质量验证系统,解决数据错误导致交易与监管风险问题。
Stars: 0 | Forks: 0
# Reference Data Quality Engine
## Overview
Financial firms depend on reference data, the master records describing
every security they trade. ISIN identifiers, currency codes, asset classes,
expiry dates. When this data is wrong, trades fail, risk calculations break,
and regulatory reports become inaccurate.
This project is a full reference data quality system built to simulate
the validation infrastructure used by data operations teams at hedge funds.
It ingests a securities dataset, runs automated validation across five data
quality dimensions, enforces SLA rules against defined thresholds, scores
vendor data quality, and surfaces everything in a live Streamlit dashboard.
## Architecture
data/securities.csv
|
v
db/schema.py Creates PostgreSQL database and tables
db/seed.py Loads securities data into PostgreSQL
|
v
validation/sql_checks.py 10 SQL validation queries
validation/schema_validation.py Pandera schema enforcement
validation/sla_engine.py SLA rules engine and vendor scorecard
|
v
dashboard/app.py Streamlit dashboard: metrics, failures, SLA, drill-down
## Validation Checks
| Check | Dimension | Severity | Result |
|---|---|---|---|
| Null ISIN values | Completeness | High | FAIL |
| Null CUSIP values | Completeness | High | FAIL |
| Duplicate ISIN values | Uniqueness | High | FAIL |
| Invalid ISIN format | Validity | High | FAIL |
| Negative price values | Validity | High | FAIL |
| Expired securities with active status | Consistency | Medium | FAIL |
| Currency country mismatch | Consistency | Medium | FAIL |
| Invalid status values | Validity | Medium | PASS |
| Null sector values | Completeness | Low | PASS |
| Zero price on active securities | Validity | Medium | PASS |
| Pandera: price_usd schema enforcement | Validity | High | FAIL |
## SLA Rules
| Rule | Threshold | Actual | Status |
|---|---|---|---|
| ISIN Completeness | 95% | 94.4% | FAIL |
| Zero Duplicate ISINs | 0 | 1 | FAIL |
| Zero Negative Prices | 0 | 1 | FAIL |
| All Active Securities Have Price | 0 missing | 0 missing | PASS |
| No Expired Securities Marked Active | 0 | 1 | FAIL |
| Vendor A ISIN Completeness | 95% | 100.0% | PASS |
| Vendor B ISIN Completeness | 95% | 88.9% | FAIL |
## Key Findings
| Finding | Detail |
|---|---|
| Total securities | 18 records across equity and fixed income |
| Validation failures | 8 out of 11 checks failed |
| SLA breaches | 5 out of 7 rules breached |
| Critical issues | Null ISIN, duplicate ISIN, invalid ISIN format, negative price |
| Consistency violations | 1 expired security still marked Active, 1 currency/country mismatch |
| Vendor performance | Vendor A clean at 100% ISIN completeness, Vendor B failing at 88.9% |
## Tools
Python | PostgreSQL | Pandas | Pandera | Streamlit | SQLAlchemy
## How to Run
1. Install dependencies:
`pip install pandas pandera streamlit sqlalchemy psycopg2-binary`
2. Set database password:
`export DB_PASSWORD=your_password_here`
3. Create database and tables:
`python db/schema.py`
4. Load securities data:
`python db/seed.py`
5. Run SQL validation checks:
`python validation/sql_checks.py`
6. Run schema validation:
`python validation/schema_validation.py`
7. Run SLA engine:
`python validation/sla_engine.py`
8. Launch dashboard:
`streamlit run dashboard/app.py`
标签:Kubernetes, Pandera, PostgreSQL, Python数据工程, SLA监控, SQL校验, Streamlit, Zenmap, 云计算, 供应商评分卡, 参考数据管理, 告警检测, 唯一性校验, 数据一致性, 数据仪表板, 数据加载, 数据合规, 数据完整性, 数据库建模, 数据有效性, 数据校验, 数据治理, 数据质量, 数据验证, 测试用例, 规则引擎, 访问控制, 证券参考数据, 逆向工具, 金融数据, 阈值检测