Jakub-Syrek/GenericRagGenerator
GitHub: Jakub-Syrek/GenericRagGenerator
Stars: 0 | Forks: 0
# GenericRagGenerator
Local Retrieval-Augmented Generation (RAG) service exposed as a **fully
RESTful HTTP API** (22 endpoints, OpenAPI / Swagger docs at `/docs`).
Upload documents, a whole repository ZIP, or a multi-source project,
then **query / chat / search** the index over the same surface from
curl, Postman, an IDE plugin or the bundled browser UI. The exact same
API is served whether you launch it as a foreground process
(`run.ps1`), a Windows service (NSSM), or the Docker compose stack.
There is no "service" vs "API" distinction: every install is a
network-ready HTTP server.
Answers are grounded strictly in the indexed content and cite the
exact source file (with line range, for code).
## What it can ingest
- **Prose docs**: `.pdf`, `.txt`, `.md` / `.markdown`, `.rst`, `.html` /
`.htm`, `.docx`.
- **Source code** (≈ 30 languages): Python, TypeScript / JavaScript / TSX /
JSX, Java, Kotlin, Scala, Go, Rust, C / C++ / headers, C#, Ruby, PHP,
Swift, shells (bash / zsh / sh), PowerShell, SQL, plus common config
formats (YAML, TOML, JSON, XML, INI, CSS, SCSS).
- **Whole repositories**: drop a `.zip` archive; the safe extractor rejects
path traversal, symlinks, oversize members, and skips clutter like
`.git/`, `node_modules/`, `__pycache__/`, `dist/`, `build/`, `target/`,
`vendor/`, IDE caches.
Each chunk carries `kind` (`code`/`doc`), `language`, optional
`line_start`/`line_end` and `repository_id` metadata. Source chips in the
chat UI render `repo/path/to/file.py:42-101` for code and the plain path
for docs.
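The chip format described above can be sketched as a small formatting function. This is only an illustration: `ChunkMeta` and `source_label` are hypothetical names, not the project's actual classes, and the sketch assumes the repository prefix is applied to code chunks only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChunkMeta:
    """Hypothetical container mirroring the chunk metadata fields listed above."""

    path: str
    kind: str  # "code" or "doc"
    language: Optional[str] = None
    line_start: Optional[int] = None
    line_end: Optional[int] = None
    repository_name: Optional[str] = None


def source_label(meta: ChunkMeta) -> str:
    """Return `repo/path:start-end` for code chunks and the plain path for docs."""
    if meta.kind != "code":
        return meta.path
    prefix = f"{meta.repository_name}/" if meta.repository_name else ""
    label = f"{prefix}{meta.path}"
    # Only append the range when both ends are known.
    if meta.line_start is not None and meta.line_end is not None:
        label += f":{meta.line_start}-{meta.line_end}"
    return label
```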
## Stack
| Layer | Choice |
|-------------------|-------------------------------------------------------------------|
| Backend | Python 3.11 + FastAPI |
| LLM runtime | [Ollama](https://ollama.com) (local, OSS) |
| Chat model | `llama3.1:8b` (configurable via `CHAT_MODEL`) |
| Embedding model | `nomic-embed-text` with `search_query:` / `search_document:` prefixes |
| RAG orchestration | [LlamaIndex](https://docs.llamaindex.ai) |
| Vector store | [ChromaDB](https://www.trychroma.com) (embedded, on-disk) |
| Chunking | `MarkdownNodeParser` (md) / `CodeChunker` (code) / `SentenceSplitter` (default) |
| Frontend | Plain HTML + CSS + ES6 (CSP-compliant, no inline scripts) |
| Security | bandit + detect-secrets + pip-audit, security headers, slowapi, optional API key |
| CI | GitHub Actions (ruff + mypy + bandit + pytest + pip-audit) |
## Prerequisites
1. Python 3.11+
2. Ollama running locally: `ollama serve`
3. Required models pulled once:

       ollama pull llama3.1:8b
       ollama pull nomic-embed-text
## Run locally
    python -m venv .venv
    .\.venv\Scripts\Activate.ps1
    pip install -r requirements.txt
    Copy-Item .env.example .env
    .\run.ps1

Then open the bundled browser UI (the examples in this README assume
`http://127.0.0.1:8000`). Upload documents *or* a project ZIP from
the sidebar, then ask questions in the chat pane. The assistant streams the
answer token-by-token and shows citation chips for every chunk it pulled
from the vector store.
## Optional: install as a Windows service
NSSM ([nssm.cc](https://nssm.cc)) wraps the `.venv` uvicorn process as a
proper Windows service with auto-start, log rotation and restart-on-failure.
Two PowerShell helpers under `scripts/` automate it:

    # 1. Install NSSM once (admin PowerShell):
    winget install NSSM.NSSM
    # or: choco install nssm
    # or: drop nssm.exe somewhere on PATH manually
    # 2. From the project root, in an *elevated* PowerShell:
    .\scripts\install-windows-service.ps1

The script registers the service under the name `GenericRagGenerator`,
binds to `127.0.0.1:8000` (default `local` mode), captures stdout/stderr
into rotated logs under `.\logs\service-*.log` (10 MB rotation) and
restarts the process after 5 s on failure.
Two install profiles via `-Mode`:
| Mode | Bind | When to use |
|-----------|--------------|--------------------------------------------------------|
| `local` | `127.0.0.1` | Single-user / personal box. API reachable only locally. |
| `public` | `0.0.0.0` | Shared host / corp deployment. Refuses to install unless `API_KEY` (or `AUTH_PASSWORD` + `JWT_SECRET`) is set in `.env`. |

    # Local-only (default): API on http://127.0.0.1:8000
    .\scripts\install-windows-service.ps1
    # Public bind on a custom port (must set auth in .env first)
    .\scripts\install-windows-service.ps1 -Mode public -Port 9000
    # Public + hide Swagger / Redoc / OpenAPI
    .\scripts\install-windows-service.ps1 -Mode public -DisableDocs

Operational commands:

    Get-Service GenericRagGenerator      # status
    Restart-Service GenericRagGenerator  # reload after .env / code changes
    Stop-Service GenericRagGenerator     # graceful stop (15 s window)
    .\scripts\uninstall-windows-service.ps1

Notes for production:
- Ollama itself ships as a service via its installer; both can run
side-by-side on the same box.
- Run the service under a least-privileged local account
(`nssm set GenericRagGenerator ObjectName .\rag-user` followed by that
account's password) rather than `LocalSystem` when the host is shared.
- Set `API_KEY` in `.env` before installing the service so the public
surface is gated from the first start.
## Run in Docker (corp-friendly)
    docker compose up -d
    docker exec ggrag-ollama ollama pull llama3.1:8b
    docker exec ggrag-ollama ollama pull nomic-embed-text

The compose stack runs both the app and Ollama with:
- non-root user (`UID 10001`) and no shell in the app image,
- `read_only: true` root filesystem with tmpfs only for `/tmp` and
`/home/app/.cache`,
- `cap_drop: ALL` and `no-new-privileges: true` on both services,
- a healthcheck on `/api/health`,
- separate named volumes for Chroma data and the Ollama model cache.
See [`SECURITY.md`](SECURITY.md) for the full threat model and corp
deployment guidance.
## REST API
Whatever run mode you pick (`run.ps1`, the Windows service, the Docker
compose stack) the same surface is exposed on the same port — there is
no "service" vs "API" distinction. Interactive OpenAPI docs live at
`/docs` (Swagger UI) and `/redoc`; set `DOCS_ENABLED=false` to hide both
plus `/openapi.json` in production.
| Method | Path | Purpose |
|--------|-------------------------------------|--------------------------------------------------------|
| GET | `/api/health` | Service + Ollama reachability (always unauthenticated) |
| POST | `/api/auth/login` | Verify credentials, issue a JWT bearer |
| GET | `/api/auth/whoami` | Echo the authenticated principal + scopes |
| POST | `/api/admin/reset` | Wipe every chunk in the index (admin scope only) |
| GET | `/api/documents` | List indexed documents |
| POST | `/api/documents` | Upload one document (multipart) |
| GET | `/api/documents/{id}` | Document detail (kind, language, preview) |
| GET | `/api/documents/{id}/chunks` | List every chunk produced from a document |
| DELETE | `/api/documents/{id}` | Remove a document and its chunks |
| GET | `/api/repositories` | List indexed repositories |
| POST | `/api/repositories` | Upload a project ZIP (multipart) |
| GET | `/api/repositories/{id}` | Repository detail with per-file ingest list |
| GET | `/api/repositories/{id}/files` | List every file ingested from a repository |
| DELETE | `/api/repositories/{id}` | Remove a repository and all its chunks |
| GET | `/api/projects` | List indexed multi-source projects |
| POST | `/api/projects` | Upload many raw files as one project (multipart) |
| GET | `/api/projects/{id}` | Project detail with per-file ingest list |
| GET | `/api/projects/{id}/files` | List every file ingested into a project |
| DELETE | `/api/projects/{id}` | Remove a project and all its chunks |
| POST | `/api/search` | Retrieval-only similarity search (no LLM call) |
| POST | `/api/query` | Synchronous RAG answer (one JSON, non-streaming) |
| POST | `/api/chat` | Streaming RAG answer (NDJSON) |
### Authentication
Two flows, accepted on every protected route:
- **Static `X-API-Key`**: set `API_KEY` in the environment, then send
  the key in the `X-API-Key` header on every request. Compared in constant
  time (`hmac.compare_digest`); good for service-to-service.
- **Interactive JWT bearer**: set `AUTH_PASSWORD` *and* `JWT_SECRET` in
  the environment, then:

      TOKEN=$(curl -s -X POST http://127.0.0.1:8000/api/auth/login \
        -H "Content-Type: application/json" \
        -d '{"username":"admin","password":"…"}' | jq -r .access_token)
      curl http://127.0.0.1:8000/api/auth/whoami -H "Authorization: Bearer $TOKEN"

Bearer tokens are HS256-signed, carry a `sub` + `scopes` claim, and
expire after `JWT_EXPIRES_MINUTES` (default 60). The token from
`/api/auth/login` carries the `admin` scope, required by
`/api/admin/reset`.
When neither `API_KEY` nor `JWT_SECRET` is configured, authentication
is disabled and every endpoint runs as an `anonymous` principal — fine
for local dev, never use in production.
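To make the two flows concrete, here is a standard-library-only sketch of an HS256 token carrying `sub`, `scopes` and an expiry, plus a constant-time API key comparison. It is an illustration under assumptions, not the project's implementation: the function names are hypothetical, and the README does not say which JWT library the service actually uses.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWT segments require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    """Restore the stripped padding before decoding."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(secret: str, signing_input: str) -> bytes:
    return hmac.new(secret.encode(), signing_input.encode("ascii"), hashlib.sha256).digest()


def issue_token(secret: str, sub: str, scopes: list[str], expires_minutes: int = 60) -> str:
    """Build header.payload.signature with an `exp` claim in epoch seconds."""
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {"sub": sub, "scopes": scopes, "exp": int(time.time()) + expires_minutes * 60}
    segments = [_b64url(json.dumps(part, separators=(",", ":")).encode()) for part in (header, payload)]
    signing_input = ".".join(segments)
    return f"{signing_input}.{_b64url(_sign(secret, signing_input))}"


def verify_token(secret: str, token: str) -> dict:
    """Check signature, algorithm and expiry; return the claims or raise ValueError."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("malformed token")
    header_b64, payload_b64, signature_b64 = parts
    expected = _sign(secret, f"{header_b64}.{payload_b64}")
    # Constant-time comparison, as with the static API key.
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("bad signature")
    if json.loads(_b64url_decode(header_b64)).get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    claims = json.loads(_b64url_decode(payload_b64))
    if claims["exp"] < time.time():
        raise ValueError("token expired")
    return claims


def api_key_matches(configured: str, presented: str) -> bool:
    """Compare API keys without leaking timing information."""
    return hmac.compare_digest(configured.encode(), presented.encode())
```

A route guarding `/api/admin/reset` would then, under these assumptions, check that `"admin"` is in `verify_token(...)["scopes"]` before proceeding.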
## Architecture
A few extractions earn their keep — the rest is plain FastAPI / Pydantic
/ `logging`:
| Where | Why it's not just inline code |
|---------------------------------|----------------------------------------------------------------------------------------------------------|
| `services/chunking.py` | `Chunker` protocol + `ChunkerRegistry`. The alternative was an `if kind == "code" elif language == ...` chain that grew on every new format. |
| `services/index_catalog.py` | `IndexCatalog` is the only class that imports `chromadb`. Swapping ChromaDB for Qdrant/pgvector touches one file. |
| `services/rag_service.py` | Composition root. Handlers depend on the facade, never on the underlying LlamaIndex/Chroma/Ollama trio. |
| `security/_PrefixedOllamaEmbedding` | nomic-embed-text needs `search_query:` / `search_document:` prefixes; this is the only class that knows that. |
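The `Chunker` protocol plus registry idea from the first row can be sketched as follows. Only the names `Chunker` and `ChunkerRegistry` come from this README; the method names, the lookup keys and the two toy chunkers are assumptions made for illustration.

```python
from typing import Optional, Protocol


class Chunker(Protocol):
    """Anything that can split text into chunks (method name is an assumption)."""

    def chunk(self, text: str) -> list[str]: ...


class ParagraphChunker:
    """Toy default: split on blank lines."""

    def chunk(self, text: str) -> list[str]:
        return [part.strip() for part in text.split("\n\n") if part.strip()]


class LineWindowChunker:
    """Toy code chunker: fixed windows of N lines."""

    def __init__(self, window: int = 40) -> None:
        self._window = window

    def chunk(self, text: str) -> list[str]:
        lines = text.splitlines()
        return ["\n".join(lines[i : i + self._window]) for i in range(0, len(lines), self._window)]


class ChunkerRegistry:
    """Resolve a chunker by kind/language instead of growing an if/elif chain."""

    def __init__(self, default: Chunker) -> None:
        self._default = default
        self._by_key: dict[str, Chunker] = {}

    def register(self, key: str, chunker: Chunker) -> None:
        self._by_key[key] = chunker

    def resolve(self, kind: str, language: Optional[str] = None) -> Chunker:
        # Most specific key first ("code:python"), then the kind, then the default.
        candidates = [f"{kind}:{language}"] if language else []
        candidates.append(kind)
        for key in candidates:
            if key in self._by_key:
                return self._by_key[key]
        return self._default
```

Supporting a new format then means one `register` call rather than another branch in the ingest path.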
## Optional retrieval features
All off by default except the Ollama retry policy; set the env variable
to opt in. With a flag left at its default, the dense-only,
single-process path is unchanged.
| Flag | Default | What it adds |
|----------------------------------|---------|--------------------------------------------------------------------------------------------------------------------------|
| `RETRIEVAL_MODE=hybrid` | `vector`| BM25 lexical pass fused with the dense hits via Reciprocal Rank Fusion. Helps with rare-token queries (function names, error codes). |
| `CACHE_ENABLED=true` | `false` | LRU + TTL response cache on `/api/search` and `/api/query`. Auto-invalidated on every ingest / delete / wipe. |
| `RERANKER_ENABLED=true` | `false` | FlashRank ONNX cross-encoder (~80 MB, one-time download) reorders the shortlist after retrieval. |
| `PARSER_SANDBOX_ENABLED=true` | `false` | PDF / DOCX / HTML parsers run in a subprocess with a wall-clock timeout. A malformed payload crashes only the worker. |
| `OLLAMA_RETRY_*` | on | Tenacity-driven exponential backoff on transient Ollama errors (`ConnectError`, `ReadTimeout`, `RemoteProtocolError`). |
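For readers unfamiliar with Reciprocal Rank Fusion, the hybrid mode's fusion step can be sketched in a few lines. This is a generic RRF illustration, not the project's code: the function name is hypothetical, and `k = 60` is the constant commonly used in the RRF literature, which may differ from what the service uses.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk ids into one list.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by id for determinism.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


# Hypothetical hits: dense retrieval first, BM25 second.
dense_hits = ["a", "b", "c"]
bm25_hits = ["c", "a", "d"]
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
```

Chunks that appear in both lists (`a` and `c` here) rise above chunks found by only one retriever, which is why the lexical pass helps with rare tokens without displacing strong dense hits.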
Other shipped capabilities (always on, no flag needed):
- **Content-hash dedup on ingest.** Every upload is keyed by SHA-256
of the raw payload; re-uploading identical bytes short-circuits
before the Ollama embedding call and returns the prior record with
`deduplicated=true` in the response.
- **Per-principal ACL.** When `API_KEY` / `JWT_SECRET` is configured,
every chunk is stamped with the uploader's principal name. Read,
list and delete paths are owner-scoped so JWT users only see their
own documents / repositories / projects. The static API key and
any JWT carrying the `admin` scope bypass the filter. Anonymous
mode (no credentials configured) stays single-tenant.
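The content-hash dedup described above can be sketched like this. `DedupIndex`, `IngestRecord` and the injected `embed` callable are hypothetical stand-ins; only the SHA-256 keying and the `deduplicated` flag come from this README.

```python
import hashlib
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class IngestRecord:
    """Hypothetical ingest result; only `deduplicated` is named in the README."""

    document_id: str
    filename: str
    deduplicated: bool = False


class DedupIndex:
    """In-memory map from SHA-256 of the raw payload to the prior record."""

    def __init__(self) -> None:
        self._by_hash: dict[str, IngestRecord] = {}

    def ingest(self, filename: str, payload: bytes, embed: Callable[[bytes], None]) -> IngestRecord:
        digest = hashlib.sha256(payload).hexdigest()
        prior = self._by_hash.get(digest)
        if prior is not None:
            # Identical bytes: skip the expensive embedding call entirely.
            return replace(prior, deduplicated=True)
        embed(payload)
        record = IngestRecord(document_id=digest[:12], filename=filename)
        self._by_hash[digest] = record
        return record
```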
The chat stream emits one JSON event per line:
- `{"type": "sources", "sources": [...]}` once at the top, where each
source carries `document_id`, `filename`, `kind`, `language`,
`repository_name` (when ingested via a repo), and `line_start` /
`line_end` for code chunks.
- `{"type": "delta", "content": "..."}` per generated token batch.
- `{"type": "done"}` on success, or `{"type": "error", "message": "..."}`
if Ollama or Chroma fail mid-stream.
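A client can fold that NDJSON stream into an answer and a source list with a small loop. The sketch below works on any iterable of lines, so it is independent of the HTTP library; the function name is hypothetical and the event shapes are taken from the list above.

```python
import json
from typing import Iterable


def consume_chat_stream(lines: Iterable[str]) -> tuple[str, list[dict]]:
    """Return (answer, sources) from NDJSON chat events; raise on an error event."""
    sources: list[dict] = []
    parts: list[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate keep-alive blank lines
        event = json.loads(line)
        kind = event.get("type")
        if kind == "sources":
            sources = event["sources"]
        elif kind == "delta":
            parts.append(event["content"])
        elif kind == "done":
            break
        elif kind == "error":
            raise RuntimeError(event["message"])
    return "".join(parts), sources
```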
## Project layout
    backend/app/
      api/              HTTP routes (auth, admin, documents, repository,
                        projects, search, query, chat, health)
      services/         RagService (facade), chunking (Strategy/Registry),
                        index_catalog (Repository), document_loader
      models/           Pydantic schemas
      security/         Headers middleware, API-key + JWT auth, rate limiter
      config.py         Settings (env-driven)
      dependencies.py   FastAPI DI providers (cached singletons via lru_cache)
      main.py           FastAPI entry, middleware wiring, static frontend mount
    frontend/           Static UI (documents + repository forms, source chips)
    eval/               RAG quality eval (corpus + runner + sample report)
    sample_repo/        Synthetic mini_parser fixture (code + HTML + Markdown)
    tests/              Pytest suite (unit + API integration with TestClient)
    data/               Runtime (uploads + Chroma persistence) - gitignored
    logs/               Windows-service stdout / stderr (NSSM-managed) - gitignored
    scripts/            PowerShell helpers (install / uninstall Windows service)
    Dockerfile          Multi-stage, non-root, healthcheck
    docker-compose.yml  App + Ollama + hardening
    SECURITY.md         Threat model and deployment guidance

## Quality eval
    .\.venv\Scripts\python.exe -m eval.run_eval

- **`retrieval_top1_precision`** — top-1 source matches the expected file.
- **`answer_substring_match`** — answer contains any of the expected
substrings (case-insensitive).
- **`kind_precision`** — for code/doc-tagged questions, top-1 source is of
the expected kind.
- **`ooc_refusal_rate`** — out-of-corpus probes correctly refuse.
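As a rough illustration of how the first two metrics could be computed, consider the sketch below. It re-implements the definitions above on a hypothetical `EvalTurn` record and is not the code in `eval/run_eval`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalTurn:
    """Hypothetical per-question record; field names are assumptions."""

    expected_file: str
    top1_file: str
    answer: str
    expected_substrings: tuple[str, ...]


def retrieval_top1_precision(turns: list[EvalTurn]) -> float:
    """Share of turns whose top-1 source is the expected file."""
    return sum(t.top1_file == t.expected_file for t in turns) / len(turns)


def answer_substring_match(turns: list[EvalTurn]) -> float:
    """Share of turns whose answer contains any expected substring, ignoring case."""
    hits = sum(
        any(s.lower() in t.answer.lower() for s in t.expected_substrings)
        for t in turns
    )
    return hits / len(turns)
```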
Results land under `eval/results/` (timestamped JSON + Markdown). A
checked-in baseline lives at [`eval/sample-result.md`](eval/sample-result.md).
Latest local run on `llama3.1:8b` + `nomic-embed-text`:
- **24/24** composite pass
- retrieval_top1 1.0, answer_match 1.0, kind 1.0, ooc_refusal 1.0
- average latency ~0.6 s per turn warm
## Development workflow
    # One-time setup
    pip install -r requirements.txt
    pre-commit install
    # Per change
    pre-commit run --all-files
    pytest

`pre-commit` runs:
- whitespace / EOL / large-file hygiene,
- `ruff` (lint + format),
- `mypy` on `backend/`,
- `bandit` (Python security smells),
- `detect-secrets` against `.secrets.baseline`.
CI on push and PR runs the same pre-commit suite plus `pip-audit`
(advisories logged but non-blocking; see `SECURITY.md`).
Integration-style end-to-end checks against a live Ollama are kept out of
CI and live in `smoke_test.py` plus the `eval/` package.
## Configuration knobs (env)
| Variable | Default | Purpose |
|-----------------------------|-------------------------------|-------------------------------------------------------------|
| `OLLAMA_HOST` | `http://localhost:11434` | Where to reach Ollama. |
| `CHAT_MODEL` | `llama3.1:8b` | Chat completion model. |
| `EMBEDDING_MODEL` | `nomic-embed-text` | Embedding model. |
| `EMBEDDING_QUERY_PREFIX` | `"search_query: "` | Prefix prepended to query embeddings (nomic asymmetric). |
| `EMBEDDING_DOCUMENT_PREFIX` | `"search_document: "` | Prefix prepended to document embeddings. |
| `CHUNK_SIZE` / `CHUNK_OVERLAP` | `800` / `120` | Sentence-splitter window in characters. |
| `TOP_K` | `8` | Retriever top-k. Bump higher only on small / synthetic corpora; on a real index it just stuffs the prompt. |
| `API_KEY` | *(unset)* | When set, gates `/api/documents`, `/api/repositories`, `/api/chat` behind `X-API-Key`. |
| `CORS_ORIGINS` | `["http://localhost:8000", ...]` | Strict allowlist for browsers. |
| `RATE_LIMIT_CHAT` | `30/minute` | Per-IP slowapi limit on `/api/chat`. |
| `RATE_LIMIT_UPLOADS` | `10/minute` | Documented for reverse-proxy use (slowapi can't wrap `UploadFile`). |
| `AUTH_USERNAME` | `admin` | Username accepted by `POST /api/auth/login`. |
| `AUTH_PASSWORD` | *(unset)* | Password accepted by login. Required to enable the bearer flow. |
| `JWT_SECRET` | *(unset)* | HS256 signing secret for issued JWT bearers. Required to enable login. |
| `JWT_EXPIRES_MINUTES` | `60` | Lifetime of issued bearer tokens. |
| `DOCS_ENABLED` | `true` | Set to `false` to hide `/docs`, `/redoc` and `/openapi.json` in prod. |
| `RETRIEVAL_MODE` | `vector` | `vector` (dense only) or `hybrid` (dense + BM25 fused via RRF). |
| `CACHE_ENABLED` | `false` | LRU+TTL response cache for `/api/search` and `/api/query`. |
| `CACHE_MAX_ENTRIES` | `256` | Max cached responses before LRU eviction. |
| `CACHE_TTL_SECONDS` | `300` | TTL on cached responses (seconds). |
| `RERANKER_ENABLED` | `false` | Run a FlashRank cross-encoder on the retrieved shortlist. |
| `RERANKER_MODEL` | `ms-marco-MiniLM-L-12-v2` | FlashRank ONNX model name. |
| `RERANKER_TOP_K` | `5` | Number of hits kept after reranking. |
| `PARSER_SANDBOX_ENABLED` | `false` | Run PDF / DOCX / HTML parsers in a subprocess sandbox. |
| `PARSER_SANDBOX_TIMEOUT_SECONDS` | `10.0` | Wall-clock limit per sandboxed parse before it's killed. |
| `OLLAMA_RETRY_ATTEMPTS` | `3` | Total attempts (including the first) on transient Ollama errors. |
| `OLLAMA_RETRY_BACKOFF_MIN_SECONDS` | `1.0` | Initial exponential backoff window. |
| `OLLAMA_RETRY_BACKOFF_MAX_SECONDS` | `8.0` | Cap on exponential backoff window. |
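To show how `CACHE_MAX_ENTRIES` and `CACHE_TTL_SECONDS` interact, here is a minimal LRU + TTL cache sketch. The class and its methods are hypothetical, and the service's real cache may be structured differently; the injectable clock exists only to make expiry easy to exercise.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Hashable, Optional


class LruTtlCache:
    """Evicts the least recently used entry past `max_entries`; expires after `ttl_seconds`."""

    def __init__(
        self,
        max_entries: int = 256,
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_entries = max_entries
        self._ttl_seconds = ttl_seconds
        self._clock = clock
        self._items: OrderedDict[Hashable, tuple[float, Any]] = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        entry = self._items.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self._ttl_seconds:
            del self._items[key]  # expired: drop and report a miss
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key: Hashable, value: Any) -> None:
        self._items[key] = (self._clock(), value)
        self._items.move_to_end(key)
        while len(self._items) > self._max_entries:
            self._items.popitem(last=False)  # evict least recently used

    def clear(self) -> None:
        """Invalidate everything, e.g. after an ingest, delete or wipe."""
        self._items.clear()
```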
## Security
A full threat model, in-scope / out-of-scope guarantees, deployment
guidance and the open dependency advisories live in
[`SECURITY.md`](SECURITY.md). In short:
- Local-first / corp-network target; not for direct public-internet
exposure without a reverse proxy.
- Bandit + detect-secrets + pip-audit wired into pre-commit / CI.
- Hardened response headers (CSP, XFO DENY, HSTS, Referrer-Policy,
Permissions-Policy), strict CORS, optional API key, per-IP rate
limiting.
- ZIP ingest enforces path-traversal / symlink / size caps with named
domain errors mapped to specific HTTP statuses.
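The ZIP checks can be approximated with the standard `zipfile` module, as in the sketch below. It is not exhaustive and not the project's extractor: the size cap and the function name are assumptions, and only `UnsafeArchiveError` and the skipped directory names come from this README.

```python
import stat
import zipfile
from pathlib import PurePosixPath


class UnsafeArchiveError(Exception):
    """Raised when an archive member must not be extracted."""


SKIPPED_DIRS = {".git", "node_modules", "__pycache__", "dist", "build", "target", "vendor"}
MAX_MEMBER_BYTES = 5 * 1024 * 1024  # assumed cap, not the project's real limit


def safe_members(
    archive: zipfile.ZipFile, max_member_bytes: int = MAX_MEMBER_BYTES
) -> list[zipfile.ZipInfo]:
    """Return the members worth ingesting; raise on anything dangerous."""
    accepted: list[zipfile.ZipInfo] = []
    for info in archive.infolist():
        if info.is_dir():
            continue
        path = PurePosixPath(info.filename.replace("\\", "/"))
        if path.is_absolute() or ".." in path.parts:
            raise UnsafeArchiveError(f"path traversal: {info.filename}")
        # Unix mode bits live in the upper 16 bits of external_attr.
        if stat.S_ISLNK(info.external_attr >> 16):
            raise UnsafeArchiveError(f"symlink: {info.filename}")
        if info.file_size > max_member_bytes:
            raise UnsafeArchiveError(f"oversize member: {info.filename}")
        if SKIPPED_DIRS.intersection(path.parts[:-1]):
            continue  # clutter is skipped silently, not rejected
        accepted.append(info)
    return accepted
```

Validating every member before extracting any of them keeps a rejected archive from leaving partial files on disk.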
## Commit and style conventions
- English only across code, comments, commits, identifiers.
- Public functions document `@param` / `@returns` (JSDoc-style docstrings).
- SOLID + DI: dependencies are injected via FastAPI `Depends` and
constructor arguments, never module globals.
- Functions stay around the 30-line ceiling; pylint statement / branch
limits are enforced by ruff.
- Errors are translated at every external boundary into explicit domain
exceptions (`EmbeddingError`, `ChatGenerationError`, `VectorStoreError`,
`StorageError`, `UnsafeArchiveError`, `RepositoryError`,
`EmptyDocumentError`, `UnsupportedFormatError`).
- Commits are atomic and pushed immediately; CI must stay green on `main`.
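The boundary-translation convention above might look like the following in practice. `EmbeddingError` is named in this README, but the helper, its parameters and the injected `client` callable are hypothetical and shown only to illustrate the pattern.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class EmbeddingError(Exception):
    """Domain error for failures while talking to the embedding backend."""


def translate_errors(
    call: Callable[[], T],
    *,
    catch: tuple[type[BaseException], ...],
    raise_as: type[Exception],
    message: str,
) -> T:
    """Run `call`, re-raising low-level failures as one explicit domain exception."""
    try:
        return call()
    except catch as exc:
        # `from exc` keeps the original traceback attached for debugging.
        raise raise_as(f"{message}: {exc}") from exc


def embed_text(text: str, client: Callable[[str], list[float]]) -> list[float]:
    """Embed `text` through an injected client, hiding transport errors from callers."""
    return translate_errors(
        lambda: client(text),
        catch=(ConnectionError, TimeoutError),
        raise_as=EmbeddingError,
        message="embedding backend unavailable",
    )
```

Handlers then only need to map a handful of domain exceptions to HTTP statuses, rather than knowing every transport-level failure mode.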