Stdelarosa/reference-data-quality-engine

GitHub: Stdelarosa/reference-data-quality-engine

一个基于 PostgreSQL 和 Python 的金融证券参考数据质量验证系统,解决数据错误导致交易与监管风险问题。

Stars: 0 | Forks: 0

# Reference Data Quality Engine ## Overview Financial firms depend on reference data, the master records describing every security they trade. ISIN identifiers, currency codes, asset classes, expiry dates. When this data is wrong, trades fail, risk calculations break, and regulatory reports become inaccurate. This project is a full reference data quality system built to simulate the validation infrastructure used by data operations teams at hedge funds. It ingests a securities dataset, runs automated validation across five data quality dimensions, enforces SLA rules against defined thresholds, scores vendor data quality, and surfaces everything in a live Streamlit dashboard. ## Architecture data/securities.csv | v db/schema.py Creates PostgreSQL database and tables db/seed.py Loads securities data into PostgreSQL | v validation/sql_checks.py 10 SQL validation queries validation/schema_validation.py Pandera schema enforcement validation/sla_engine.py SLA rules engine and vendor scorecard | v dashboard/app.py Streamlit dashboard: metrics, failures, SLA, drill-down ## Validation Checks | Check | Dimension | Severity | Result | |---|---|---|---| | Null ISIN values | Completeness | High | FAIL | | Null CUSIP values | Completeness | High | FAIL | | Duplicate ISIN values | Uniqueness | High | FAIL | | Invalid ISIN format | Validity | High | FAIL | | Negative price values | Validity | High | FAIL | | Expired securities with active status | Consistency | Medium | FAIL | | Currency country mismatch | Consistency | Medium | FAIL | | Invalid status values | Validity | Medium | PASS | | Null sector values | Completeness | Low | PASS | | Zero price on active securities | Validity | Medium | PASS | | Pandera: price_usd schema enforcement | Validity | High | FAIL | ## SLA Rules | Rule | Threshold | Actual | Status | |---|---|---|---| | ISIN Completeness | 95% | 94.4% | FAIL | | Zero Duplicate ISINs | 0 | 1 | FAIL | | Zero Negative Prices | 0 | 1 | FAIL | | All Active Securities Have Price | 0 missing | 0 missing | PASS | | No Expired Securities Marked Active | 0 | 1 | FAIL | | Vendor A ISIN Completeness | 95% | 100.0% | PASS | | Vendor B ISIN Completeness | 95% | 88.9% | FAIL | ## Key Findings | Finding | Detail | |---|---| | Total securities | 18 records across equity and fixed income | | Validation failures | 8 out of 11 checks failed | | SLA breaches | 5 out of 7 rules breached | | Critical issues | Null ISIN, duplicate ISIN, invalid ISIN format, negative price | | Consistency violations | 1 expired security still marked Active, 1 currency/country mismatch | | Vendor performance | Vendor A clean at 100% ISIN completeness, Vendor B failing at 88.9% | ## Tools Python | PostgreSQL | Pandas | Pandera | Streamlit | SQLAlchemy ## How to Run 1. Install dependencies: `pip install pandas pandera streamlit sqlalchemy psycopg2-binary` 2. Set database password: `export DB_PASSWORD=your_password_here` 3. Create database and tables: `python db/schema.py` 4. Load securities data: `python db/seed.py` 5. Run SQL validation checks: `python validation/sql_checks.py` 6. Run schema validation: `python validation/schema_validation.py` 7. Run SLA engine: `python validation/sla_engine.py` 8. Launch dashboard: `streamlit run dashboard/app.py`
标签:Kubernetes, Pandera, PostgreSQL, Python数据工程, SLA监控, SQL校验, Streamlit, Zenmap, 云计算, 供应商评分卡, 参考数据管理, 告警检测, 唯一性校验, 数据一致性, 数据仪表板, 数据加载, 数据合规, 数据完整性, 数据库建模, 数据有效性, 数据校验, 数据治理, 数据质量, 数据验证, 测试用例, 规则引擎, 访问控制, 证券参考数据, 逆向工具, 金融数据, 阈值检测