nickusevich/ai-engineering-assignment

# Football News Pipeline

It works in three steps:

1. Hybrid search finds similar articles using meaning (vector embeddings) and keywords (Postgres full text search), then merges both rankings with Reciprocal Rank Fusion (RRF).
2. An LLM filters those candidates down to the ones actually about the same event.
3. A second LLM call checks whether the new article brings fresh facts, status updates, or financial details.

If the model is not sure, the article goes to a human editor instead of being classified silently.

## Architecture

```
            ┌──────────────┐
news.csv ──▶│  Ingestion   │──▶ embeddings ──▶ Postgres (pgvector + tsvector)
            └──────────────┘                       │
                                                   ▼
                        ┌────────────────────────────────────────┐
incoming ─── Task 1 ───▶│            Hybrid retrieval            │
article                 │  semantic (pgvector) + keyword (tsv)   │
                        │        ────── RRF fusion ──────        │
                        └────────────────────────────────────────┘
                                          │
                        ┌─────────────────▼──────────────────────┐
              Task 2 ──▶│     LLM rerank (parallel, bounded)     │
                        │  → relevant matches above threshold    │
                        └─────────────────┬──────────────────────┘
                                          ▼
                        ┌────────────────────────────────────────┐
                        │         LLM novelty assessment         │
                        │       → PUBLISH / SKIP / REVIEW        │
                        │     → persist to `decisions` table     │
                        └─────────────────┬──────────────────────┘
                                          ▼
              Task 3 ──▶ Observability: REVIEW queue + audit + drift stats
```

## Quick start

**Requirements:** Docker, Docker Compose, and an OpenRouter API key.

Copy the example env file, paste in your API key, and start the stack. Docker Compose boots Postgres with pgvector, loads `data/news.csv`, runs all three tasks, and writes results into `outputs/`.

```bash
cp .env.example .env      # then edit and set OPENROUTER_API_KEY
docker compose up
```

## Tasks

### Task 1: similar article search

Given a query, returns the top K most similar articles. Two searches run in parallel: pgvector cosine similarity for meaning, and Postgres tsvector for keyword overlap. The two ranked lists are merged with Reciprocal Rank Fusion (RRF).
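As a rough illustration of the fusion step, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists of article IDs. The function name `rrf_fuse`, the article IDs, and the default `k=60` are assumptions for illustration; the repository's own implementation and its `RRF_K` setting may differ.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60, top_k: int = 5) -> list[tuple[str, float]]:
    """Merge ranked ID lists with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) per document, with rank starting at 1.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; ties are broken by ID for determinism.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:top_k]


semantic = ["a3", "a1", "a7"]  # hypothetical pgvector ranking
keyword = ["a1", "a9", "a3"]   # hypothetical tsvector ranking
fused = rrf_fuse([semantic, keyword])
print([doc_id for doc_id, _ in fused])
```

Only the rank positions enter the formula, so the raw cosine and full text scores never have to be put on a common scale.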
RRF does not need score normalization, which is useful because the two scoring systems produce very different scales.

```bash
docker compose run --rm app uv run python main.py --task 1 \
  --query "Tottenham Bergvall injury" --top-k 5
```

### Task 2: publish, skip, or review

For every article in `data/incoming_news.json`, the pipeline finds the most similar existing articles, then asks an LLM to keep only the ones actually about the same event (these calls run in parallel, capped by a semaphore). A second LLM call then judges whether the new article adds fresh facts, status changes, or financial details that the existing coverage does not already have.

If confidence falls below `NOVELTY_CONFIDENCE_THRESHOLD`, the decision is flagged REVIEW so a human editor can take a look. Every decision is saved to the `decisions` table, keyed by a hash of the normalized text, so running the pipeline again updates rows in place instead of inserting duplicates.

```bash
docker compose run --rm app uv run python main.py --task 2
```

Output goes to `outputs/task2_decisions.json`. A sample run is committed at `outputs/sample_task2_decisions.json`.

### Task 3: observability for editorial decisions

A pipeline making editorial decisions on its own should not run without monitoring. Task 3 shows what the system has been doing, so editors and ML engineers can spot problems early: an audit trail for each decision (incoming text, decision, confidence, reasoning, matched article, timestamps), aggregate counts for PUBLISH, SKIP, and REVIEW, and the current REVIEW queue with full context.

In production this becomes the editor worklist and a way to catch problems early. A sudden drop in average confidence usually means something is wrong: a broken embedding model, a bad prompt change, or a shift in the kind of articles coming in.

```bash
docker compose run --rm app uv run python main.py --task 3
```

Output: `outputs/task3_analysis.json`.
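The aggregate counts and the confidence check described for Task 3 can be sketched in a few lines. This is a hedged illustration, not the repository's code: the `Decision` record, the `summarize` helper, the baseline value, and the 0.15 drop tolerance are all hypothetical, and real data would come from the `decisions` table.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean


@dataclass
class Decision:
    # Hypothetical shape; the real `decisions` table stores more fields.
    label: str         # "PUBLISH", "SKIP", or "REVIEW"
    confidence: float  # model confidence in [0, 1]


def summarize(decisions: list[Decision], baseline_confidence: float, max_drop: float = 0.15) -> dict:
    """Count decisions per label and flag a drop in average confidence."""
    counts = Counter(d.label for d in decisions)
    avg = mean(d.confidence for d in decisions) if decisions else 0.0
    return {
        "counts": {label: counts.get(label, 0) for label in ("PUBLISH", "SKIP", "REVIEW")},
        "avg_confidence": avg,
        "review_queue_size": counts.get("REVIEW", 0),
        # Assumed drift rule: alert when the average falls well below a baseline.
        "confidence_drift": bool(decisions) and (baseline_confidence - avg) > max_drop,
    }


recent = [Decision("PUBLISH", 0.9), Decision("SKIP", 0.8), Decision("REVIEW", 0.4)]
report = summarize(recent, baseline_confidence=0.9)
print(report)
```

A fixed baseline keeps the example short; a rolling window over earlier runs would be a natural alternative if the article mix changes over time.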
## Evaluation

A small ground truth file at `data/ground_truth.json` (currently two labeled articles) is used by `scripts/evaluate.py` to measure decision accuracy against the latest Task 2 run.

```bash
docker compose run --rm app uv run python scripts/evaluate.py
```

## Configuration

All settings live in `.env`. See `.env.example` for the full list. The important ones:

* `LLM_MODEL`: any OpenRouter chat model.
* `EMBEDDING_MODEL` and `EMBEDDING_DIMENSION`: must match the model output size.
* `TOP_K` and `RRF_K`: retrieval settings.
* `RERANKER_RELEVANCE_THRESHOLD` and `RERANKER_MAX_CONCURRENT`: rerank stage.
* `NOVELTY_CONFIDENCE_THRESHOLD`: the REVIEW gate.

## Development

For local work without rebuilding the app container, run Postgres in Docker and the app on the host:

```bash
cp .env.example .env             # add OPENROUTER_API_KEY
docker compose up -d db          # Postgres with pgvector only
uv venv && uv sync               # Python 3.12+ via uv
uv run python main.py --task 1   # default query
```

## Engineering notes

Articles and decisions are deduplicated by a SHA256 hash of the normalized text: Unicode NFC, footnotes and markdown stripped, whitespace collapsed. Small formatting differences do not change the hash, so the same article will not be inserted twice.

Each article also stores the embedding model that produced its vector. If `EMBEDDING_MODEL` changes between runs, the app refuses to start. Otherwise vectors from two different embedding spaces would quietly break similarity rankings.

The reranker runs LLM calls in parallel with `asyncio.gather`, capped by `asyncio.Semaphore(RERANKER_MAX_CONCURRENT)`. If one call fails for a candidate, the fallback uses the RRF score and checks the threshold again.
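The bounded parallel rerank with a per-candidate fallback might look roughly like the sketch below. The `Candidate` class, the `rerank` function, and the `score_with_llm` stand-in (which simulates one failing call instead of contacting a model) are hypothetical names chosen for illustration; only `asyncio.gather` and `asyncio.Semaphore` come from the description above.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Candidate:
    article_id: str
    rrf_score: float


async def score_with_llm(candidate: Candidate) -> float:
    """Stand-in for the real LLM call; fails for one ID to exercise the fallback."""
    await asyncio.sleep(0)
    if candidate.article_id == "a2":
        raise RuntimeError("simulated LLM failure")
    return 0.9 if candidate.article_id == "a1" else 0.1


async def rerank(candidates: list[Candidate], threshold: float, max_concurrent: int) -> list[str]:
    semaphore = asyncio.Semaphore(max_concurrent)

    async def score(candidate: Candidate) -> float:
        async with semaphore:  # at most max_concurrent calls in flight
            try:
                return await score_with_llm(candidate)
            except Exception:
                # Fallback: reuse the retrieval score for this candidate only.
                return candidate.rrf_score

    scores = await asyncio.gather(*(score(c) for c in candidates))
    # The threshold check applies to LLM scores and fallback scores alike.
    return [c.article_id for c, s in zip(candidates, scores) if s >= threshold]


candidates = [Candidate("a1", 0.03), Candidate("a2", 0.03), Candidate("a3", 0.02)]
kept = asyncio.run(rerank(candidates, threshold=0.5, max_concurrent=2))
print(kept)
```

Catching the exception inside the semaphore-guarded coroutine means one failed call never cancels the rest of the batch. The sketch compares the raw RRF value against the threshold; how the repository reconciles the RRF scale with the relevance scale is not specified here.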
## Tests

```bash
make test     # 44 isolated unit tests, runs in under 3 seconds
make check    # black + ruff + mypy + tests
```

Tests cover RRF fusion, the async reranker (parallel dispatch, threshold, fallback, concurrency cap), the novelty detector branches, the LLM client (JSON parsing and schema validation), the embedding service (dimension check across the whole batch, batching), and the hash and normalization helpers. Everything is mocked, so the suite needs no database and no network. CI runs the same suite on every push and PR (`.github/workflows/ci.yml`).

## License

MIT