fl-sarada-sudarshan/Bank-Statement-Tagger
# Bank Statement Auto-Tag & Metrics Agent
FlexiLoans Hackathon — Topic 3 (Underwriting)
## What We Built — Problem Statement Coverage
### Core features (required)
- [x] **Transaction classifier** — every transaction tagged as `salary`, `business_inflow`, `emi_payment`, `cheque_bounce`, `gambling`, `regular_expense`, or `other`
- [x] **Rules-first engine with LLM fallback** — 41 bank-agnostic seed rules cover 90–97% of transactions; unrecognised clusters go to a local Ollama LLM, keeping cost near zero
- [x] **ABB (Average Bank Balance)** — computed from daily closing balance, forward-filled across days with no transactions
- [x] **BTO (Bank Turnover)** — monthly inflow excluding reversals and internal transfers; median and per-month breakdown
- [x] **Bounce ratio** — `count(cheque_bounce) / (count(emi_payment) + count(cheque_bounce))`
- [x] **Obligation-to-Income (OTI)** — `sum(emi_payment) / sum(salary + business_inflow)` per month
- [x] **Anomaly detection** — circular transfer detection: finds money going A→B→C→A within 14 days at similar amounts
- [x] **Confidence scoring per metric** — each metric carries a confidence level (high/medium/low) based on what share of contributing transactions were tagged by seed rules vs LLM
- [x] **Structured output** — `credit_input.json` with metrics, anomalies, tag breakdown, and confidence scores
- [x] **Human-readable summary** — `summary.md` narrating income stability, repayment behaviour, risk signals, and anomalies in plain language
- [x] **Synthetic bank statement** — 6 months, ~240 transactions, with planted salary, EMI, cheque bounces, and a circular transfer pattern
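The circular transfer check listed above (A→B→C→A within 14 days at similar amounts) can be illustrated with a small sketch. This is not the code in `src/anomaly.py`; the `Transfer` record, the `find_circular` name and the 10% amount tolerance are all assumptions made for illustration, and it presumes each transfer is already available as a dated edge between two named parties.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Transfer:
    """Hypothetical edge representation: money moved from src to dst."""
    day: date
    src: str
    dst: str
    amount: float


def similar(a, b, tolerance=0.10):
    """True when two amounts differ by at most `tolerance` of the larger one."""
    return abs(a - b) <= tolerance * max(a, b)


def find_circular(transfers, window_days=14, tolerance=0.10):
    """Return (t1, t2, t3) triples forming A→B→C→A in date order.

    All three legs must fall within `window_days` of the first leg and
    carry amounts similar to it. The 10% tolerance is an assumed default.
    """
    ordered = sorted(transfers, key=lambda t: t.day)
    window = timedelta(days=window_days)
    cycles = []
    for i, t1 in enumerate(ordered):
        for j in range(i + 1, len(ordered)):
            t2 = ordered[j]
            if t2.day - t1.day > window:
                break  # sorted by day, so later legs are also out of window
            # Second leg must leave B and must not go straight back to A.
            if t2.src != t1.dst or t2.dst == t1.src:
                continue
            if not similar(t1.amount, t2.amount, tolerance):
                continue
            for t3 in ordered[j + 1:]:
                if t3.day - t1.day > window:
                    break
                closes_loop = t3.src == t2.dst and t3.dst == t1.src
                if closes_loop and similar(t1.amount, t3.amount, tolerance):
                    cycles.append((t1, t2, t3))
    return cycles
```

The nested scan is cubic in the worst case, which is acceptable for a statement of a few hundred rows. Legs that share a date are only matched in input order, so a production version would likely need a tie-breaker such as a timestamp or row number.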
### Beyond the MVP
- [x] **PDF ingestion** — parses real multi-page bank PDFs (tested on a 22-page ICICI business current account statement); handles NEFT, RTGS, IMPS, UPI, ACH, NACH, POS, and ATM descriptions across banks
- [x] **Learning loop** — LLM suggestions are shown to the user for approve/deny; approved tags are promoted into `ruleset.json` so future statements need fewer LLM calls
- [x] **AI Analysis tab** — streams a full credit analyst report (5 sections: income profile, stability, repayment, risk signals, recommendation) via Ollama SSE
- [x] **Cost transparency** — shows Claude-equivalent cost per run vs a pure-LLM baseline, with projected savings at 50K statements/day
- [x] **Web UI** — single-page app: upload → rule tagging → LLM review → metrics → AI report; no CLI required
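As a rough illustration of the learning loop, the sketch below promotes one approved suggestion into a ruleset file. The real schema of `data/ruleset.json` is not shown in this README, so the `pattern` / `tag` / `source` fields and the `promote_rule` helper are hypothetical; only the seven tag names come from the classifier description above.

```python
import json
import re
from pathlib import Path

# Tag names taken from the transaction classifier list.
VALID_TAGS = {
    "salary", "business_inflow", "emi_payment", "cheque_bounce",
    "gambling", "regular_expense", "other",
}


def promote_rule(ruleset_path, suggestion):
    """Append an approved LLM suggestion to the ruleset; return True if added.

    `suggestion` is assumed to look like {"pattern": <regex>, "tag": <tag>}.
    """
    if suggestion["tag"] not in VALID_TAGS:
        raise ValueError(f"unknown tag: {suggestion['tag']}")
    re.compile(suggestion["pattern"])  # raises re.error on an invalid regex

    path = Path(ruleset_path)
    rules = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []

    # Skip exact duplicates so repeated approvals do not bloat the file.
    for rule in rules:
        if rule["pattern"] == suggestion["pattern"] and rule["tag"] == suggestion["tag"]:
            return False

    rules.append({
        "pattern": suggestion["pattern"],
        "tag": suggestion["tag"],
        "source": "llm_approved",  # hypothetical provenance marker
    })

    # Write to a temp file first, then swap it in, to avoid a half-written ruleset.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(rules, indent=2), encoding="utf-8")
    tmp.replace(path)
    return True
```

Validating the regex before saving matters here because the pattern comes from an LLM; a malformed rule persisted to disk would otherwise break every later run.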
### Discussion angles addressed
- [x] **Cost at scale** — cost counter shows ₹0 (Ollama, local) vs the Claude Sonnet 4.6 equivalent; the pure-LLM baseline and the savings at 50K statements/day are computed live
- [x] **Hybrid classifier** — rules handle the bulk (90%+), LLM only sees the ambiguous tail; ruleset grows with use so LLM share shrinks over time
- [x] **Latency** — rule engine runs in milliseconds; LLM call is batched (all untagged clusters in one prompt) and streams token-by-token to the UI
- [ ] **GST cross-check** — not implemented (optional per problem statement)
- [ ] **Precision/recall on held-out test set** — not measured; results were hand-verified on the synthetic statement and a real ICICI statement
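The batching idea from the latency point (all untagged clusters in one prompt) might look roughly like this. The normalisation in `cluster_key` and the prompt wording are assumptions for illustration; the project's own clustering lives in `src/entity_extractor.py` and `src/llm_ollama.py` and may differ.

```python
import re
from collections import defaultdict


def cluster_key(description):
    """Collapse digits and punctuation so variants of one counterparty share a key."""
    text = re.sub(r"\d+", "#", description.upper())
    text = re.sub(r"[^A-Z#]+", " ", text)
    return " ".join(text.split())


def build_batch_prompt(untagged, tags):
    """Group untagged descriptions and render a single prompt covering every cluster.

    Returns (prompt, clusters) where clusters maps key -> original descriptions.
    """
    clusters = defaultdict(list)
    for description in untagged:
        clusters[cluster_key(description)].append(description)

    lines = ["Assign one tag from: " + ", ".join(tags) + ".", "Clusters:"]
    for n, (key, members) in enumerate(sorted(clusters.items()), start=1):
        # One line per cluster: the key, its size, and one raw example.
        lines.append(f"{n}. {key} (x{len(members)}) e.g. {members[0]}")
    return "\n".join(lines), dict(clusters)
```

For example, `UPI/12345/ACME TRADERS` and `UPI/99871/ACME TRADERS` collapse to the same key, so the model is asked about that counterparty once rather than once per transaction.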
## Running
    pip install -r requirements.txt
    python3 -m uvicorn app:app --host 0.0.0.0 --port 8000 --reload
Open `http://localhost:8000`
## Stack
- **Backend**: FastAPI + Python 3.11
- **PDF parsing**: pdfplumber (multi-page, ICICI / HDFC / SBI / Axis formats)
- **Rule engine**: regex + amount/direction filters, priority-ordered, persisted in `data/ruleset.json`
- **LLM fallback**: Ollama (local, zero API cost) — streamed via SSE
- **Frontend**: Vanilla JS + CSS, served as static files
## Project layout
    bank-statement-agent/
    ├── app.py                      # FastAPI routes + session state
    ├── data/
    │   ├── ruleset.json            # 41 general seed rules (bank-agnostic)
    │   └── synthetic_statement.csv
    ├── src/
    │   ├── rule_engine.py          # Rule loader, matcher, persister
    │   ├── pdf_parser.py           # Multi-format PDF → DataFrame
    │   ├── entity_extractor.py     # Counterparty name normalisation
    │   ├── llm_ollama.py           # Ollama SSE streaming + cluster parsing
    │   ├── metrics.py              # ABB, BTO, bounce ratio, OTI
    │   ├── anomaly.py              # Circular transfer detection
    │   └── output.py               # credit_input.json + summary.md
    └── static/
        ├── index.html
        ├── app.js
        └── style.css
## Workflow
1. **Upload** a bank statement (CSV or PDF)
2. **Rule engine** tags ~90–97% of transactions instantly with no LLM cost
3. **LLM clustering** groups unrecognised transactions and suggests tags + regexes
4. **Approve / deny** suggestions — approved rules are saved to `ruleset.json` permanently
5. **Metrics** tab shows ABB, BTO, bounce ratio, OTI with per-metric confidence
6. **AI Analysis** tab streams a credit analyst report via Ollama
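The metrics in step 5 follow the definitions given under Core features. A minimal sketch of three of them, plus the confidence label, is shown here using only the standard library. Function names, input shapes and the 0.9 / 0.7 confidence thresholds are assumptions for illustration; `src/metrics.py` may be organised differently.

```python
from collections import defaultdict
from datetime import date, timedelta


def average_bank_balance(closing_by_day, start, end):
    """Mean daily closing balance; days with no transactions reuse the last balance."""
    total, days, last = 0.0, 0, None
    day = start
    while day <= end:
        last = closing_by_day.get(day, last)  # forward-fill
        if last is not None:  # ignore days before the first known balance
            total += last
            days += 1
        day += timedelta(days=1)
    return total / days if days else 0.0


def bounce_ratio(tags):
    """count(cheque_bounce) / (count(emi_payment) + count(cheque_bounce))."""
    bounces = tags.count("cheque_bounce")
    attempts = bounces + tags.count("emi_payment")
    return bounces / attempts if attempts else 0.0


def oti_by_month(txns):
    """txns: iterable of (date, tag, amount). Returns {(year, month): ratio or None}."""
    emi, income = defaultdict(float), defaultdict(float)
    for day, tag, amount in txns:
        key = (day.year, day.month)
        if tag == "emi_payment":
            emi[key] += abs(amount)
        elif tag in ("salary", "business_inflow"):
            income[key] += abs(amount)
    months = sorted(set(emi) | set(income))
    # None marks a month with obligations but no tagged income.
    return {m: (emi[m] / income[m] if income[m] else None) for m in months}


def confidence(n_rule_tagged, n_llm_tagged, high=0.9, medium=0.7):
    """Label a metric by the share of its transactions tagged by seed rules.

    The 0.9 and 0.7 cut-offs are assumed values, not the project's.
    """
    total = n_rule_tagged + n_llm_tagged
    share = n_rule_tagged / total if total else 0.0
    if share >= high:
        return "high"
    return "medium" if share >= medium else "low"
```

With closing balances of 100 on 1 January and 400 on 3 January, the ABB over 1 to 4 January averages the forward-filled series 100, 100, 400, 400.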
## Ruleset
`data/ruleset.json` ships with 41 general-purpose rules covering:
| Category | Tag | Examples |
|---|---|---|
| Bounce / return | `cheque_bounce` | ECS RTN, NACH RETURN, MANDATE FAIL |
| Salary | `salary` | NEFT+SAL, PAYROLL, PFMS |
| Gambling | `gambling` | Dream11, Rummy Circle, Bet365 |
| EMI / loan repayment | `emi_payment` | EMI, ACH+LOAN, lender names, BIL/BANK |
| Business inflow | `business_inflow` | RTGS, NEFT CR, IMPS CR, UPI CR, CASH DEP, SETTLEMENT |
| Tax / GST | `regular_expense` | GST CHALLAN, ADVANCE TAX, GIB/, TDS |
| POS / ATM | `regular_expense` | POS DEBIT, ATM NWD, CASH WDL |
| UPI / IMPS debit | `regular_expense` | UPI DR, IMPS DR |
| Utility bills | `regular_expense` | Airtel, JIO, electricity, water |
| Internal transfer | `other` | INFT, OWN ACCOUNT, SWEEP |
Rules are bank-agnostic — patterns match universal keywords (NEFT, RTGS, IMPS, UPI, ACH, ECS, NACH, CLG) rather than any single bank's internal format codes.
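To make the matching concrete, here is a hedged sketch of a priority-ordered matcher with direction and amount filters. The `Rule` fields, the priorities and the regexes are illustrative assumptions modelled on the examples in the table; they are not the contents of `data/ruleset.json`.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    pattern: str
    tag: str
    priority: int                     # lower numbers are tried first
    direction: Optional[str] = None   # "credit", "debit", or None for either
    min_amount: float = 0.0


# Illustrative rules only; specific categories sit ahead of the generic inflow rule.
RULES = [
    Rule(r"\b(ECS|NACH)\s*(RTN|RETURN)\b|MANDATE\s*FAIL", "cheque_bounce", 10),
    Rule(r"\bDREAM11\b|RUMMY\s*CIRCLE|\bBET365\b", "gambling", 20, "debit"),
    Rule(r"\bPAYROLL\b|\bNEFT\b.*\bSAL", "salary", 30, "credit"),
    Rule(r"\bEMI\b|\bACH\b.*\bLOAN\b", "emi_payment", 40, "debit"),
    Rule(r"\b(RTGS|NEFT|IMPS|UPI)\b", "business_inflow", 90, "credit"),
]


def tag_transaction(description, direction, amount, rules=RULES):
    """Return the tag of the first matching rule, or None to defer to the LLM."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.direction and rule.direction != direction:
            continue
        if amount < rule.min_amount:
            continue
        if re.search(rule.pattern, description, re.IGNORECASE):
            return rule.tag
    return None
```

Ordering matters: a description such as `NEFT-ACME PVT LTD-SALARY MAR` contains both a salary keyword and a generic NEFT keyword, so the salary rule needs the lower priority number to win. A `None` result is what would send the transaction on to LLM clustering.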