nickusevich/ai-engineering-assignment
# Football News Pipeline
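The pipeline decides whether an incoming football news article should be
published, skipped, or sent to a human editor. It works in three steps:
1. Hybrid search finds similar articles by meaning (vector embeddings)
and by keywords (Postgres full-text search), then merges both rankings
with Reciprocal Rank Fusion (RRF).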
2. An LLM filters those candidates down to the ones actually about the
same event.
3. A second LLM call checks whether the new article brings fresh facts,
status updates, or financial details. If the model is not sure, the
article goes to a human editor instead of being classified silently.
## Architecture
```
            ┌──────────────┐
news.csv ──▶│  Ingestion   │──▶ embeddings ──▶ Postgres (pgvector + tsvector)
            └──────────────┘                       │
                                                   ▼
                        ┌────────────────────────────────────────┐
incoming ─── Task 1 ───▶│            Hybrid retrieval            │
article                 │  semantic (pgvector) + keyword (tsv)   │
                        │        ────── RRF fusion ──────        │
                        └────────────────────────────────────────┘
                                          │
                        ┌─────────────────▼──────────────────────┐
              Task 2 ──▶│  LLM rerank (parallel, bounded)        │
                        │  → relevant matches above threshold    │
                        └─────────────────┬──────────────────────┘
                                          ▼
                        ┌────────────────────────────────────────┐
                        │  LLM novelty assessment                │
                        │  → PUBLISH / SKIP / REVIEW             │
                        │  → persist to `decisions` table        │
                        └─────────────────┬──────────────────────┘
                                          ▼
              Task 3 ──▶ Observability: REVIEW queue + audit + drift stats
```
## Quick start
**Requirements:** Docker, Docker Compose, and an OpenRouter API key.
Copy the example env file, paste in your API key, and start the stack.
Docker Compose boots Postgres with pgvector, loads `data/news.csv`, runs
all three tasks, and writes results into `outputs/`.
```bash
cp .env.example .env   # then edit and set OPENROUTER_API_KEY
docker compose up
```
## Tasks
### Task 1: similar article search
Given a query, Task 1 returns the top K most similar articles. Two
searches run in parallel: pgvector cosine similarity for meaning, and
Postgres tsvector for keyword overlap. The two ranked lists are merged
with Reciprocal Rank Fusion (RRF). RRF uses only each article's rank in
each list, so it needs no score normalization. That matters here because
the two scoring systems produce values on very different scales.
```bash
docker compose run --rm app uv run python main.py --task 1 \
  --query "Tottenham Bergvall injury" --top-k 5
```
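To make the fusion step concrete, here is a minimal sketch of RRF over two ranked lists of article IDs. It is not the repository's code: the function name, the IDs, and the default `k=60` (a commonly used constant, not necessarily the value of `RRF_K` here) are assumptions for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_k=5):
    """Merge ranked lists of IDs (best first) into one fused ranking.

    Each list contributes 1 / (k + rank) per item, so only ranks matter
    and the raw similarity scores never need to be normalized.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is stable.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:top_k]


# Hypothetical results from the semantic and keyword searches.
semantic = ["a3", "a1", "a7"]
keyword = ["a1", "a9", "a3"]
fused = reciprocal_rank_fusion([semantic, keyword])
print([doc_id for doc_id, _ in fused])
```

Articles that appear in both lists (`a1`, `a3`) end up ahead of those found by only one search, which is the behavior hybrid retrieval relies on.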
### Task 2: publish, skip, or review
For every article in `data/incoming_news.json`, the pipeline finds the
most similar existing articles, then asks an LLM to keep only the ones
actually about the same event (these calls run in parallel, capped by a
semaphore). A second LLM call then judges whether the new article adds
fresh facts, status changes, or financial details that the existing
coverage does not already have.
If confidence falls below `NOVELTY_CONFIDENCE_THRESHOLD`, the decision
is flagged REVIEW so a human editor can take a look. Every decision is
saved to the `decisions` table, keyed by a hash of the normalized text,
so running the pipeline again updates rows in place instead of inserting
duplicates.
```bash
docker compose run --rm app uv run python main.py --task 2
```
Output goes to `outputs/task2_decisions.json`. A sample run is committed
at `outputs/sample_task2_decisions.json`.
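The REVIEW gate described above can be sketched as a small pure function. The `NoveltyVerdict` type and its field names are hypothetical, and the sketch assumes the model reports confidence on a 0 to 1 scale; the real pipeline may structure this differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NoveltyVerdict:
    """Hypothetical shape of the second LLM call's answer."""
    is_novel: bool      # does the article add something new?
    confidence: float   # assumed to be in the range 0.0 to 1.0


def final_decision(verdict: NoveltyVerdict, threshold: float) -> str:
    """Map a verdict to PUBLISH, SKIP, or REVIEW."""
    # Low confidence always goes to a human, whatever the model said.
    if verdict.confidence < threshold:
        return "REVIEW"
    return "PUBLISH" if verdict.is_novel else "SKIP"


print(final_decision(NoveltyVerdict(is_novel=True, confidence=0.55), 0.7))
```

Checking confidence before looking at the verdict keeps uncertain SKIPs out of the silent path as well as uncertain PUBLISHes.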
### Task 3: observability for editorial decisions
A pipeline making editorial decisions on its own should not run without
monitoring. Task 3 shows what the system has been doing, so editors and
ML engineers can spot problems early: an audit trail for each decision
(incoming text, decision, confidence, reasoning, matched article,
timestamps), aggregate counts for PUBLISH, SKIP, and REVIEW, and the
current REVIEW queue with full context.
In production this becomes the editor worklist and an early-warning
signal. A sudden drop in average confidence usually means something is
wrong: a broken embedding model, a bad prompt change, or a shift in the
kind of articles coming in.
```bash
docker compose run --rm app uv run python main.py --task 3
```
Output: `outputs/task3_analysis.json`.
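As a rough illustration of the aggregates and the drift check, the sketch below summarizes a list of decision records. The record keys (`decision`, `confidence`), the function names, and the `max_drop` value are assumptions, not the schema of the `decisions` table or of `task3_analysis.json`.

```python
from collections import Counter
from statistics import mean


def summarize(decisions):
    """Return per-label counts, average confidence, and the REVIEW queue."""
    counts = Counter(d["decision"] for d in decisions)
    avg = mean(d["confidence"] for d in decisions) if decisions else None
    queue = [d for d in decisions if d["decision"] == "REVIEW"]
    return {"counts": dict(counts), "avg_confidence": avg, "review_queue": queue}


def confidence_dropped(baseline_avg, current_avg, max_drop=0.15):
    """Flag a run whose average confidence fell more than max_drop."""
    return (baseline_avg - current_avg) > max_drop


# Hypothetical decision records.
decisions = [
    {"id": "n1", "decision": "PUBLISH", "confidence": 0.9},
    {"id": "n2", "decision": "SKIP", "confidence": 0.8},
    {"id": "n3", "decision": "REVIEW", "confidence": 0.4},
]
summary = summarize(decisions)
print(summary["counts"], len(summary["review_queue"]))
```

Comparing each run's average against a stored baseline is one simple way to turn the drift stat into an alert; a real deployment would likely use a rolling window instead of a single number.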
## Evaluation
A small ground truth file at `data/ground_truth.json` (currently two
labeled articles) is used by `scripts/evaluate.py` to measure decision
accuracy against the latest Task 2 run.
```bash
docker compose run --rm app uv run python scripts/evaluate.py
```
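Decision accuracy here means the share of labeled articles whose predicted decision matches the label. The sketch below shows that calculation on plain dicts; the ID-to-label layout is an assumption and may not match the actual format of `data/ground_truth.json` or what `scripts/evaluate.py` does internally.

```python
def decision_accuracy(ground_truth, predictions):
    """Fraction of labeled articles whose predicted decision is correct.

    Both arguments map an article ID to a decision label. Labeled
    articles with no prediction are ignored rather than counted wrong.
    """
    scored = [aid for aid in ground_truth if aid in predictions]
    if not scored:
        return None
    correct = sum(ground_truth[aid] == predictions[aid] for aid in scored)
    return correct / len(scored)


# Hypothetical labels and pipeline output.
truth = {"n1": "PUBLISH", "n2": "SKIP"}
predicted = {"n1": "PUBLISH", "n2": "REVIEW", "n3": "SKIP"}
print(decision_accuracy(truth, predicted))
```

With only two labeled articles, a single disagreement moves accuracy by 50 points, so the number is a smoke test more than a benchmark.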
## Configuration
All settings live in `.env`. See `.env.example` for the full list.
The important ones:
* `LLM_MODEL`: any OpenRouter chat model.
* `EMBEDDING_MODEL` and `EMBEDDING_DIMENSION`: must match the model output size.
* `TOP_K` and `RRF_K`: retrieval settings.
* `RERANKER_RELEVANCE_THRESHOLD` and `RERANKER_MAX_CONCURRENT`: rerank stage.
* `NOVELTY_CONFIDENCE_THRESHOLD`: the REVIEW gate.
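For orientation, this is one way the variables above could be parsed into typed settings. The `Settings` class, `load_settings`, and the validation rules are assumptions for illustration (including the 0 to 1 range for the confidence threshold); the application's own loader may differ.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    llm_model: str
    embedding_model: str
    embedding_dimension: int
    top_k: int
    rrf_k: int
    reranker_relevance_threshold: float
    reranker_max_concurrent: int
    novelty_confidence_threshold: float


def load_settings(env: Mapping[str, str]) -> Settings:
    """Parse the documented variables; a missing key raises KeyError."""
    settings = Settings(
        llm_model=env["LLM_MODEL"],
        embedding_model=env["EMBEDDING_MODEL"],
        embedding_dimension=int(env["EMBEDDING_DIMENSION"]),
        top_k=int(env["TOP_K"]),
        rrf_k=int(env["RRF_K"]),
        reranker_relevance_threshold=float(env["RERANKER_RELEVANCE_THRESHOLD"]),
        reranker_max_concurrent=int(env["RERANKER_MAX_CONCURRENT"]),
        novelty_confidence_threshold=float(env["NOVELTY_CONFIDENCE_THRESHOLD"]),
    )
    # Assumed sanity checks; the real app may validate differently.
    if settings.embedding_dimension <= 0:
        raise ValueError("EMBEDDING_DIMENSION must be positive")
    if settings.reranker_max_concurrent < 1:
        raise ValueError("RERANKER_MAX_CONCURRENT must be at least 1")
    if not 0.0 <= settings.novelty_confidence_threshold <= 1.0:
        raise ValueError("NOVELTY_CONFIDENCE_THRESHOLD must be in [0, 1]")
    return settings
```

Passing `os.environ` to `load_settings` at startup would surface a missing or malformed value immediately instead of midway through a run.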
## Development
For local work without rebuilding the app container, run Postgres in
Docker and the app on the host:
```bash
cp .env.example .env             # add OPENROUTER_API_KEY
docker compose up -d db          # Postgres with pgvector only
uv venv && uv sync               # Python 3.12+ via uv
uv run python main.py --task 1   # default query
```
## Engineering notes
Articles and decisions are deduplicated by a SHA256 hash of the
normalized text: Unicode NFC, footnotes and markdown stripped,
whitespace collapsed. Small formatting differences do not change the
hash, so the same article will not be inserted twice.
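The normalization and hashing can be sketched as follows. The NFC step, whitespace collapsing, and SHA256 come from the description above; the two regexes for footnotes and markdown are crude stand-ins chosen for this example, since the project's actual stripping rules are not shown here.

```python
import hashlib
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Reduce text to a canonical form before hashing."""
    text = unicodedata.normalize("NFC", text)
    # Assumption: footnotes look like [1], [23], ...
    text = re.sub(r"\[\d+\]", "", text)
    # Assumption: drop common markdown punctuation. Deliberately crude;
    # it would also strip these characters from ordinary text.
    text = re.sub(r"[*_`#>]+", "", text)
    # Collapse every run of whitespace to one space.
    return re.sub(r"\s+", " ", text).strip()


def content_hash(text: str) -> str:
    """SHA256 hex digest of the normalized text."""
    return hashlib.sha256(normalize_text(text).encode("utf-8")).hexdigest()


a = content_hash("**Bergvall**  injured[1]\n")
b = content_hash("Bergvall injured")
print(a == b)
```

Because the digest is computed after normalization, it can serve directly as the unique key that lets a rerun update an existing row instead of inserting a new one.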
Each article also stores the embedding model that produced its vector.
If `EMBEDDING_MODEL` changes between runs, the app refuses to start.
Otherwise vectors from two different embedding spaces would quietly
break similarity rankings.
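A startup guard of this kind might look like the sketch below. The exception class, function name, and model names are hypothetical; it assumes the set of stored model names has already been read from the database (for example with a `SELECT DISTINCT` over whatever column holds it).

```python
class EmbeddingModelMismatch(RuntimeError):
    """Raised when stored vectors come from a different embedding model."""


def check_embedding_model(stored_models, configured_model):
    """Refuse to continue if any stored vector used another model.

    stored_models: iterable of model names found on existing articles.
    An empty iterable (fresh database) passes the check.
    """
    mismatched = sorted(set(stored_models) - {configured_model})
    if mismatched:
        raise EmbeddingModelMismatch(
            f"configured model {configured_model!r} does not match "
            f"stored vectors from {mismatched}; re-embed or revert the setting"
        )


# Placeholder model names, not real identifiers.
check_embedding_model(["model-a", "model-a"], "model-a")  # passes silently
```

Failing loudly at startup is cheaper than debugging rankings that degrade without any error, which is what mixed embedding spaces would cause.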
The reranker runs LLM calls in parallel with `asyncio.gather`, capped by
`asyncio.Semaphore(RERANKER_MAX_CONCURRENT)`. If the LLM call for a
candidate fails, the reranker falls back to that candidate's RRF score
and applies the threshold to it instead.
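The bounded-parallel pattern with a per-candidate fallback can be sketched like this. `rerank`, `score_fn`, and the fake scorer are hypothetical names for the example, and the scores are made up so the fallback path is visible; the real reranker's signature and score scales may differ.

```python
import asyncio


async def rerank(candidates, score_fn, threshold, max_concurrent):
    """Score candidates concurrently and keep those at or above threshold.

    candidates: list of (article_id, rrf_score) pairs.
    score_fn: async callable returning an LLM relevance score for an ID.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def score_one(article_id, rrf_score):
        # At most max_concurrent calls run inside this block at once.
        async with semaphore:
            try:
                return article_id, await score_fn(article_id)
            except Exception:
                # Fallback: reuse the retrieval score for this candidate.
                return article_id, rrf_score

    # gather returns results in the same order as the input candidates.
    scored = await asyncio.gather(*(score_one(a, s) for a, s in candidates))
    return [(a, s) for a, s in scored if s >= threshold]


async def fake_llm_score(article_id):
    """Stand-in for the LLM call; fails for one candidate on purpose."""
    await asyncio.sleep(0)
    if article_id == "a2":
        raise RuntimeError("simulated LLM failure")
    return {"a1": 0.9, "a3": 0.2}[article_id]


result = asyncio.run(
    rerank([("a1", 0.03), ("a2", 0.8), ("a3", 0.03)], fake_llm_score,
           threshold=0.5, max_concurrent=2)
)
print(result)
```

Catching the exception inside `score_one` means one failed call cannot cancel the whole batch, which is what a bare `asyncio.gather` would do by default.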
## Tests
```bash
make test    # 44 isolated unit tests, runs in under 3 seconds
make check   # black + ruff + mypy + tests
```
Tests cover RRF fusion, the async reranker (parallel dispatch,
threshold, fallback, concurrency cap), the novelty detector branches,
the LLM client (JSON parsing and schema validation), the embedding
service (dimension check across the whole batch, batching), and the
hash and normalization helpers. Everything is mocked, so the suite
needs no database and no network. CI runs the same suite on every push
and PR (`.github/workflows/ci.yml`).
## License
MIT