PRODUCTION RAG SYSTEM

UK Finance Domain
Intelligence System

A production-grade Retrieval-Augmented Generation system for answering questions over real UK financial documents — with evidence, traceability, and deployment realism.


Machine Learning Systems Are About
Trust, Not Text

Most RAG demos stop at "the model gave the right answer."

This system asks a harder question: Why should you trust the answer at all?

Financial documents are long, dense, and ambiguous. Large language models are powerful — but unreliable when left alone.

This project treats retrieval-augmented generation as a systems problem, not a prompting problem.

Rather than optimizing for fluency, the system is designed around:

01  Evidence Grounding
02  Traceability
03  Deployment Realism
04  Observability
05  Failure Transparency

Every answer is:

  • backed by retrieved source passages
  • attributable to a specific document and page range
  • generated by a stateless, production-ready API
  • visible through a simple but honest UI

This is not a chatbot. It is a document intelligence pipeline.

01

Document Registry

What happens here

The system begins with a declarative registry of financial documents.

Each document is defined by:

  • company
  • fiscal year
  • report type
  • source URL

No document is processed unless it is explicitly declared.

Why this matters

This creates reproducibility, auditability, and clear provenance. Nothing "just appears" in the system.

ENGINEERING CHOICE

Declarative registry

What we chose: YAML-based document registry listing company, fiscal year, report type, and source URL.

Alternatives: Hardcoding URLs, dynamic crawling, manual uploads

Why: Explicit declaration equals reproducibility. Prevents accidental ingestion and mimics production data contracts.
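As a concrete illustration, a registry entry might look like the fragment below. The field names mirror the ones listed above; the companies and URLs are placeholders, not the system's actual registry.

```yaml
# documents.yaml (illustrative sketch; entries and URLs are hypothetical)
documents:
  - company: Acme Bank
    fiscal_year: 2023
    report_type: annual_report
    source_url: https://example.com/acme-annual-report-2023.pdf
  - company: Globex Holdings
    fiscal_year: 2023
    report_type: interim_report
    source_url: https://example.com/globex-interim-2023.pdf
```

Nothing outside this file is ever fetched, so the registry doubles as an audit trail of everything the system has seen.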

02

PDF Ingestion

What happens here

Registered documents are downloaded as immutable artifacts.

Files are stored by company, year, and report type. Duplicate downloads are skipped.

Why this matters

Raw documents are treated as source-of-truth artifacts, not transient inputs.
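A minimal sketch of the ingestion contract, assuming a deterministic company/year/report-type path layout (the exact layout and `fetch` mechanism are assumptions, not the project's actual code):

```python
from pathlib import Path

def artifact_path(root: Path, company: str, year: int, report_type: str) -> Path:
    """Deterministic storage location: one immutable file per registry entry."""
    return root / company.lower() / str(year) / f"{report_type}.pdf"

def ingest(root: Path, company: str, year: int, report_type: str, fetch) -> Path:
    """Download once; if the artifact already exists, skip it (immutability)."""
    dest = artifact_path(root, company, year, report_type)
    if dest.exists():
        return dest  # duplicate download skipped
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(fetch())  # fetch() returns the raw PDF bytes
    return dest
```

Injecting `fetch` keeps the skip-if-exists logic testable without touching the network.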

03

Text Extraction

What happens here

PDFs are converted to raw text using page-aware extraction. Each page is explicitly marked. Failures are surfaced early.

Why this matters

Page boundaries are preserved so that citations remain meaningful later.

ENGINEERING CHOICE

Preserve page boundaries

What we chose: Extract text page-by-page with explicit markers, preserving ordering and structure.

Alternatives: Full-text blob dumping, aggressive OCR, layout stripping

Why: Citations require page awareness. Financial reports are page-referenced; losing structure kills explainability.
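The page-marking step can be sketched as follows. The `[[PAGE n]]` marker format and the `doc_id` parameter are illustrative assumptions; the actual per-page extraction would come from a PDF library upstream of this function.

```python
def mark_pages(pages: list[str], doc_id: str) -> str:
    """Join per-page text with explicit page markers so downstream
    chunks can always be traced back to a page range."""
    parts = []
    for n, text in enumerate(pages, start=1):
        if not text.strip():
            # surface extraction failures early instead of silently
            # producing uncitable text
            raise ValueError(f"{doc_id}: page {n} extracted no text")
        parts.append(f"[[PAGE {n}]]\n{text.strip()}")
    return "\n\n".join(parts)
```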

04

Text Normalization

Conservative cleanup

The extracted text undergoes conservative cleanup:

  • Remove obvious headers/footers
  • Normalize page markers
  • Preserve original wording

NO SEMANTIC REWRITING

No aggressive cleaning.

Why this matters

RAG systems fail when preprocessing distorts meaning. This phase optimizes for fidelity, not prettiness.

ENGINEERING CHOICE

Conservative cleanup

What we chose: Remove obvious headers/footers, normalize whitespace, preserve original wording. No semantic rewriting.

Alternatives: Heavy regex, sentence rewriting, LLM-based cleanup

Why: RAG systems break when preprocessing distorts semantics. Financial language is precise, so don't rewrite it. Fidelity over aesthetics.
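A sketch of what "conservative" means in code. The boilerplate set (known repeated header/footer lines) and the bare-page-number rule are assumptions about what counts as obvious; wording is never touched.

```python
import re

def clean_page(text: str, boilerplate: set[str]) -> str:
    """Conservative cleanup: drop known header/footer lines and bare page
    numbers, collapse runs of spaces. Never rewrites wording."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped in boilerplate:              # e.g. a repeated report title
            continue
        if re.fullmatch(r"\d{1,4}", stripped):   # bare page numbers
            continue
        kept.append(re.sub(r"[ \t]+", " ", stripped))
    return "\n".join(l for l in kept if l)
```

In practice the boilerplate set might be built by finding lines that repeat on most pages of a document.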

05

Chunking & Metadata

What happens here

Documents are split into overlapping chunks that are character-bounded, page-traceable, and metadata-rich.

Each chunk inherits company, year, document ID, and page range.

Why this matters

Chunks become the atomic unit of retrieval. Traceability is built in, not bolted on.

ENGINEERING CHOICE

Fixed-size overlapping chunks

What we chose: Character-bounded chunks (~1200 chars) with overlap (~150 chars) and chunk-level metadata.

Alternatives: Sentence-based, section-based, no overlap

Why: Simple and predictable. Overlap prevents boundary loss. Metadata enables filtering. Optimized for retrieval stability, not theoretical optimality.
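The chunking described above can be sketched in a few lines. The `Chunk` shape and the `char_start` field are illustrative; the sizes match the stated defaults (~1200 chars, ~150 overlap).

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    meta: dict  # company, year, doc_id, page range, etc.

def chunk_text(text: str, meta: dict, size: int = 1200, overlap: int = 150) -> list["Chunk"]:
    """Fixed-size overlapping windows; every chunk inherits document metadata."""
    assert 0 <= overlap < size
    chunks, start = [], 0
    while start < len(text):
        end = min(start + size, len(text))
        chunks.append(Chunk(text[start:end], {**meta, "char_start": start}))
        if end == len(text):
            break
        start = end - overlap  # step back so boundary context appears in both chunks
    return chunks
```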

06

Embedding Generation

What happens here

Each chunk is embedded using a sentence-transformer model. Embeddings are deterministic, normalized, and stored once.

Why this matters

This enables efficient semantic similarity without model inference at query time.

ENGINEERING CHOICE

Sentence-Transformers MiniLM

What we chose: sentence-transformers/all-MiniLM-L6-v2

Alternatives: OpenAI embeddings, larger SBERT, Instructor models, domain-specific

Why: Fast, small, deterministic, excellent performance per parameter, no API dependency. Signals "I know when not to over-engineer."
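all-MiniLM-L6-v2 produces 384-dimensional vectors, and the library can return them already unit-normalized (e.g. `encode(..., normalize_embeddings=True)`). The sketch below shows why that normalization matters: on unit vectors, a plain inner product equals cosine similarity, which is what the flat index later relies on.

```python
import math

def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length so inner product == cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))
```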

07

Vector Store (FAISS)

What happens here

Embeddings are stored in a FAISS index loaded at API startup.

The index is read-only in production, fast, and memory-resident.

Why this matters

FAISS provides predictable latency without external dependencies.

ENGINEERING CHOICE

FAISS (in-memory)

What we chose: FAISS index loaded at startup, read-only during serving.

Alternatives: Pinecone, Weaviate, Milvus, ElasticSearch, Chroma

Why: No external dependencies, zero per-query cost, predictable latency, industry-standard. Production realism: choose simplicity until scale demands otherwise.
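A flat FAISS index (e.g. `faiss.IndexFlatIP`) is exhaustive inner-product search over the stored matrix. The pure-Python stand-in below shows the same semantics on normalized vectors; FAISS does exactly this, vectorized and at scale.

```python
def top_k(query: list[float], index: list[list[float]], k: int) -> list[tuple[int, float]]:
    """Exhaustive inner-product search, best-first: the semantics of a
    flat FAISS index over unit-normalized embeddings."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    scores = [(i, dot(query, v)) for i, v in enumerate(index)]
    return sorted(scores, key=lambda s: -s[1])[:k]
```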

08

Query Embedding

What happens here

Incoming user queries are embedded using the same model as document chunks.

No shortcuts. No mixing models.

Why this matters

Embedding symmetry prevents silent retrieval skew.

09

Semantic Retrieval

What happens here

Top-K similar chunks are retrieved using vector similarity.

The system retrieves more than needed, then narrows.

Why this matters

Over-retrieval followed by filtering is safer than under-retrieval.

10

Metadata Filtering

What happens here

Retrieved chunks can optionally be filtered by company, fiscal year, and document type.

Filters constrain retrieval, not generation.

Why this matters

This prevents cross-company hallucination and leakage.

ENGINEERING CHOICE

Over-retrieve, then filter

What we chose: Retrieve top-K×2, apply metadata filters after retrieval.

Alternatives: Pre-filtering, exact K retrieval, no filtering

Why: Vector similarity isn't perfect. Over-retrieval reduces false negatives; filtering enforces correctness constraints. Standard pattern in serious RAG.
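The over-retrieve-then-filter pattern from sections 09–10, sketched with an injected `search_fn` (the vector search) and dict-shaped chunks; both shapes are assumptions for illustration.

```python
def retrieve_filtered(search_fn, chunks: list[dict], k: int, filters: dict) -> list[int]:
    """Over-retrieve 2K candidates, enforce metadata constraints, keep K.
    Filters constrain retrieval only; generation never sees the discards."""
    candidates = search_fn(2 * k)  # chunk ids, best-first
    def ok(cid: int) -> bool:
        meta = chunks[cid]["meta"]
        return all(meta.get(f) == v for f, v in filters.items())
    return [cid for cid in candidates if ok(cid)][:k]
```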

11

Evidence Assembly

What happens here

Filtered chunks are assembled into an evidence context with clearly separated passages and explicit source attribution.

This context is the only information passed to the LLM.

Why this matters

The LLM is not allowed to "know" anything else.

ENGINEERING CHOICE

Explicit context construction

What we chose: Assemble chunks into structured context with clear separators. No free-form prompt stuffing.

Alternatives: Raw chunk dumping, letting LLM decide relevance

Why: Constrains the LLM's job, enables auditing, encourages extractive behavior. The model reasons over supplied facts, not latent knowledge.
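Evidence assembly might look like this sketch. The passage-header format and separator are assumptions; the point is that every passage carries its source attribution into the prompt.

```python
def build_context(chunks: list[dict]) -> str:
    """Assemble filtered chunks into one evidence block. Each passage is
    clearly separated and carries document and page attribution."""
    passages = []
    for i, c in enumerate(chunks, start=1):
        m = c["meta"]
        header = f"[Passage {i} | {m['company']} {m['year']} {m['doc_id']} | pages {m['pages']}]"
        passages.append(f"{header}\n{c['text']}")
    return "\n\n---\n\n".join(passages)
```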

12

LLM Generation

What happens here

The model generates an answer using the user query, the assembled evidence, and strict instructions to avoid speculation.

Latency is measured. Failures are surfaced.

Why this matters

Generation is the last step, not the core system.

ENGINEERING CHOICE

GPT-4.1-mini via Responses API

What we chose: Smaller, cheaper, reliable reasoning model with explicit latency measurement.

Alternatives: GPT-4 full, GPT-3.5, open-source models, fine-tuned models

Why: Reasoning > verbosity, lower cost for repeated queries, faster responses. Right-sized model for task.
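The generation step, sketched with the model call injected as a callable so latency measurement and failure surfacing are visible (and testable) independently of the provider. In the real service `generate_fn` would wrap the OpenAI Responses API, roughly `client.responses.create(model="gpt-4.1-mini", input=prompt)`; the prompt wording here is illustrative.

```python
import time

def generate_answer(generate_fn, query: str, context: str) -> dict:
    """Wrap the model call with latency measurement and explicit
    failure surfacing. `generate_fn` is the actual LLM call."""
    prompt = (
        "Answer ONLY from the evidence below. If the evidence is "
        "insufficient, say so. Do not speculate.\n\n"
        f"EVIDENCE:\n{context}\n\nQUESTION: {query}"
    )
    start = time.perf_counter()
    try:
        answer = generate_fn(prompt)
        status = "ok"
    except Exception as exc:
        answer, status = None, f"error: {exc}"
    return {
        "answer": answer,
        "status": status,
        "generation_ms": round((time.perf_counter() - start) * 1000, 1),
    }
```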

13

API Response

What happens here

The API returns the answer, cited evidence, and structured metadata.

No UI assumptions are baked in.

Why this matters

The API is reusable, testable, and production-oriented.

ENGINEERING CHOICE

FastAPI

What we chose: FastAPI for API layer.

Alternatives: Flask, Django, Node.js, LangChain servers

Why: Strong typing, automatic OpenAPI docs, async-ready, production-grade, clean contracts. Signals "this is an API, not a script."
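The response contract, sketched with stdlib dataclasses. In the real service these would be Pydantic models behind a FastAPI route, and the field names here are assumptions; the shape (answer, citations, metadata) is what the text describes.

```python
from dataclasses import dataclass, asdict

@dataclass
class Citation:
    doc_id: str
    pages: str
    snippet: str

@dataclass
class QueryResponse:
    answer: str
    citations: list      # list[Citation]
    model_version: str
    retrieval_ms: float
    generation_ms: float

# asdict(resp) yields the JSON-serializable payload the endpoint returns;
# no UI assumptions, just a clean contract.
```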

14

Observability & Logging

What happens here

Each request logs retrieval latency, generation latency, model version, and request outcome.

Logs are structured, not free-form text.

Why this matters

Observability is how systems earn trust over time.

ENGINEERING CHOICE

Explicit logging

What we chose: Log retrieval start/end, LLM generation latency, model version, request completion.

Alternatives: No logging, print statements, full tracing stacks

Why: Lightweight, Cloud Run compatible, sufficient for debugging. Production awareness without monitoring theater.
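Structured logging needs nothing beyond the standard library: one JSON object per line, which Cloud Run's log collector parses natively. Field names below are illustrative.

```python
import json
import logging
import sys

logger = logging.getLogger("rag")
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler(sys.stdout))

def log_event(event: str, **fields) -> str:
    """Emit one JSON object per line: machine-parseable, grep-friendly."""
    line = json.dumps({"event": event, **fields}, sort_keys=True)
    logger.info(line)
    return line
```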

15

Streamlit UI

What happens here

A simple UI enables natural-language queries, optional filters, and transparent citation display.

The UI calls the hosted API directly.

Why this matters

The UI proves the system works end-to-end — not just in code.

ENGINEERING CHOICE

Streamlit (thin client)

What we chose: Streamlit UI calling hosted API directly, displaying citations transparently.

Alternatives: React, Next.js, Gradio, no UI

Why: Fastest honest UI, minimal ceremony, easy demo, clean separation from backend. Communication over frontend mastery.

ENGINEERING CHOICE

Docker + Cloud Run

What we chose: Dockerized services, serverless Cloud Run, CI/CD via GitHub Actions.

Alternatives: VMs, Kubernetes, local-only, platform-specific runtimes

Why: Scales to zero, simple mental model, industry-relevant, minimal ops burden. Proves deployment understanding beyond notebooks.
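A Dockerfile for this setup might look like the sketch below. Module path, requirements file, and base image are placeholders; the one Cloud Run-specific detail is binding to the injected $PORT.

```dockerfile
# Sketch only: module and file names are placeholders
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# Cloud Run injects $PORT at runtime; the server must bind to it
CMD exec uvicorn app.main:app --host 0.0.0.0 --port $PORT
```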

Document Pipeline

Registry → Ingestion → Extraction → Normalize → Chunking → Embed → FAISS → Query → Retrieve → Filter → Evidence → LLM → API → Logs → UI


Reliable AI Is Built, Not Prompted

This project demonstrates how to build an AI system that answers questions responsibly, exposes its evidence, survives deployment, and fails honestly.


Not by chasing novelty, but by respecting systems thinking.