Private RAG with citations for company documents

This guide explains how to build an assistant that answers questions using your internal documents, locally, with verifiable sources.

The idea is simple: instead of letting the model “guess”, you first make it retrieve the most relevant passages from your document base, then write the answer from those excerpts. This approach is called RAG (Retrieval‑Augmented Generation). The key takeaway: retrieval + writing.

Citations make this usable in a professional context: every answer shows where the information comes from (file, section, excerpt). You can re‑read, audit, fix the docs, and keep trust.

To deploy this workflow quickly on‑prem (FAISS index, ingestion, FastAPI, Docker), see Knowledge Base QA Agent.

When a “private” RAG is the right solution

A local RAG (on your servers) quickly becomes the best option when:

  • the information is internal (procedures, customer docs, tickets, specs, wiki)
  • you want to control hosting (data, logs, security, costs)
  • you need verifiable answers (citations required)
  • you need to iterate fast: adding docs, fixing content, new projects

Typical examples:

  • internal support (SOP, documentation, FAQ)
  • assistant for a product knowledge base
  • search across specs / contracts / quality documents
  • onboarding (tech, process, compliance)

The recipe for a reliable RAG

A RAG that “works” in a demo is not necessarily a RAG that is “reliable” in production.

A robust setup relies on 6 building blocks:

  1. Clean documents (structure, noise removal)
  2. Consistent chunking
  3. Embeddings adapted to the language and domain
  4. Versioned index (FAISS / vectors)
  5. Properly tuned retrieval (Top‑K, filters)
  6. Grounded generation + citations + “I don’t know” behavior

Failures almost always come from 3 causes:

  • the ingested content contains too much noise (PDF headers/footers, email signatures, quoted replies “On … wrote:”, legal disclaimers, Word comments / track changes, page numbers, watermarks, duplicated tables of contents)
  • chunks (text snippets produced by document splitting) are too long / too short, or they break the text structure
  • the model answers even when the sources do not contain the information

The rest of this article is a practical guide to avoid these pitfalls.

1) Prepare your documents: the foundation that saves time

Recommended formats

The simplest and most robust way to start is to ingest plain‑text files (e.g., Markdown or clean .txt exports).

If you start from PDF, HTML, Notion, Confluence, etc.:

  • export to text when possible
  • keep titles and sections (they act as anchors)
  • avoid “broken” line breaks that fragment sentences

Minimum cleaning (useful from day one)

Before ingestion, apply these rules:

  • remove navigation menus, repetitive footers, duplicated tables of contents
  • normalize headings (e.g., #, ##, ###)
  • keep lists and tables (they often carry key info)
  • make sure every document has an explicit name
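The rules above can be sketched as a small pre‑ingestion filter. This is a minimal sketch, not a production cleaner: the `boilerplate` prefixes are placeholders you would replace with the actual footers and disclaimers found in your own exports.

```python
import re

def clean_text(text, boilerplate=("Confidential", "Page ")):
    """Drop page numbers and known boilerplate lines, then collapse blank runs.

    `boilerplate` is a hypothetical list of line prefixes; adapt it to the
    repeated headers/footers present in your own documents.
    """
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.isdigit():  # bare page numbers
            continue
        if any(stripped.startswith(p) for p in boilerplate):
            continue
        kept.append(line.rstrip())
    # collapse runs of 3+ newlines into a single blank line
    return re.sub(r"\n{3,}", "\n\n", "\n".join(kept)).strip()
```

Run it once per file before chunking; anything it misses will show up later as noisy retrieved chunks, which is your signal to extend the filter.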

Add context to the file

Two simple strategies:

  • Descriptive filenames: product_pricing_2026.txt, api_authentication.txt
  • A short header at the top of the file (2–5 lines): product, team, date, scope

This improves retrieval with almost no effort.
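For example, a short header like this at the top of api_authentication.txt gives retrieval extra anchors (all values here are placeholders):

```text
Product: Example API
Team: Platform
Last updated: 2026-01
Scope: authentication flows (API keys, OAuth)
```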

2) Chunking: how to split without breaking meaning

Chunking creates “snippets” small enough to retrieve, but complete enough to answer.

Rules that work well in practice

  • split by sections (headings) whenever possible
  • aim for “human‑readable” chunks (one paragraph to a few paragraphs)
  • add a small overlap to avoid losing a useful sentence between two chunks

Common mistakes

  • chunks that are too short: they lose context (e.g., a single isolated sentence)
  • chunks that are too long: they dilute information and consume model context
  • “random” splitting (by characters) that cuts a procedure in the middle

A good test: “if I read this chunk alone, does it still make complete sense?”
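A heading‑first splitter with a character‑window fallback implements these rules. The sketch below assumes Markdown‑style `#` headings and uses illustrative defaults (`max_chars`, `overlap`) that you should tune on your own corpus.

```python
def chunk_by_sections(text, max_chars=1200, overlap=150):
    """Split on headings first; window any section that is still too long.

    Keeps sections intact when possible (preserving structure), and adds a
    small overlap between windows so no sentence is lost at a boundary.
    """
    sections, current = [], []
    for line in text.splitlines():
        if line.startswith("#") and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())

    chunks = []
    for section in sections:
        if len(section) <= max_chars:
            if section:
                chunks.append(section)
            continue
        start = 0
        while start < len(section):
            chunks.append(section[start:start + max_chars])
            start += max_chars - overlap
    return chunks
```

Applying the "read it alone" test to a sample of the output is the fastest way to validate your `max_chars` choice.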

3) Embeddings: pick the right model (and stay pragmatic)

Embeddings turn your chunks into vectors for the index.

What to aim for

  • strong behavior in French (and multilingual if your base is mixed)
  • stable on “business” text (procedures, acronyms, product terms)
  • fast enough for ingestion

Why start simple

Starting with a general‑purpose embeddings model is often very effective.

Then, if you observe systematic failures (domain synonyms, jargon, specific formats), you can:

  • test another embeddings model
  • enrich documents (titles, glossary)
  • add a reranker (if you need it)

In Knowledge Base QA Agent, the embeddings model is configurable via an environment variable.
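The interface is simple either way: text in, normalized vector out, with the model name read from the environment so it stays swappable. The `embed` function below is a deliberately toy word‑hashing stand‑in, only there to show the contract; in a real setup you would call an actual embedding model (e.g., via the sentence-transformers library) behind the same signature.

```python
import hashlib
import math
import os

# Assumption: the model name comes from an environment variable, as in the
# Knowledge Base QA Agent; "all-MiniLM-L6-v2" is an illustrative default.
EMBEDDING_MODEL = os.environ.get("EMBEDDING_MODEL", "all-MiniLM-L6-v2")

def embed(text, dim=64):
    """Toy stand-in for a real embedding model: hashes words into a
    fixed-size bag-of-words vector and L2-normalizes it."""
    vec = [0.0] * dim
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]
```

Because vectors are normalized, the dot product of two embeddings is their cosine similarity, which is what the index will rank by.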

4) Retrieval: Top‑K, filters, and noise control

Retrieval is the most “underestimated” part: if the retrieved passages are bad, the best generation in the world will not save the answer.

A simple starting point

  • Top‑K = 4 is a good default
  • increase if docs are long or information is dispersed
  • decrease if the model mixes too many sources and becomes confusing

Filter and deduplicate

Even without a complex stack, add two protections:

  • deduplicate near‑identical chunks
  • filter low‑value content (e.g., under X characters, empty pages, repetitions)

5) Grounded generation: the rule “no source, no claim”

If you do not enforce this behavior, many models will tend to “fill the gaps” when information is missing.

The answer contract

In your system prompt / generation instructions, you must add rules to guide the LLM:

  • answer only from the excerpts
  • if the information is not present, explicitly say it was not found
  • ask a clarification question if needed
  • produce citations systematically
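The contract above can be expressed directly as a system prompt. This wording is only an example to adapt to your model and citation format:

```text
You answer questions using ONLY the excerpts provided below.
Rules:
- If the excerpts do not contain the answer, reply exactly:
  "I could not find this in the provided documents."
- If the question is ambiguous, ask one clarification question instead.
- After every factual claim, cite its source as [source: <file>#<segment>].
- Never use knowledge that is not in the excerpts.
```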

Recommended citation format

A practical product‑side format:

  • citations at the end of the answer (list of files + segments)
  • or inline citations (e.g., [source: api_authentication.txt#3])

The choice depends on your UI and your audit needs.

6) Evaluate a RAG: fast, but serious

The goal is not a perfect score: it is to avoid costly mistakes and know when the system is ready.

Mini test set (recommended):

  • 30 to 100 real questions
  • for each question:
    • expected answer (or “not found”)
    • expected source (at least one document)

What you must check:

  • retrieval: does the right passage appear in Top‑K?
  • grounding: does the model invent when the source is missing?
  • citations: do the cited sources actually contain the information?
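The first check (retrieval) is the easiest to automate. A minimal sketch, assuming a test set of `{"question", "expected_source"}` dicts and a `retrieve_fn` that returns source names for a question (both names are conventions invented for this example):

```python
def retrieval_hit_rate(test_set, retrieve_fn, top_k=4):
    """Fraction of questions whose expected source appears in Top-K."""
    if not test_set:
        return 0.0
    hits = 0
    for case in test_set:
        retrieved = retrieve_fn(case["question"], top_k)
        if case["expected_source"] in retrieved:
            hits += 1
    return hits / len(test_set)
```

Grounding and citation quality are harder to score automatically; for a 30–100 question set, a manual pass over the failures is usually faster and more trustworthy than a proxy metric.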

Production hardening: versioning, updates, and trust

Once the RAG is validated in real conditions, the goal is to avoid surprises: be able to roll back, integrate new documents cleanly, and understand what was answered (and why).

Version the index

Treat the index as a build artifact: it must be reproducible and allow rollback when something regresses.

Even for simple usage, this is critical:

  • keep one index folder per version
  • keep a manifest: date, embeddings model, parameters, list of ingested files
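Writing the manifest can be one small step at the end of every index build. The field names below are illustrative; the point is that the file makes each index version reproducible and auditable.

```python
import json
import time
from pathlib import Path

def write_manifest(index_dir, embedding_model, params, files):
    """Store a manifest next to the index so every build is traceable
    and a regressed version can be rolled back."""
    manifest = {
        "built_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "embedding_model": embedding_model,
        "params": params,          # e.g., chunk size, overlap, top_k
        "files": sorted(files),    # ingested files, stable order
    }
    path = Path(index_dir) / "manifest.json"
    path.write_text(json.dumps(manifest, indent=2))
    return path
```

With one folder per version and this file inside it, "roll back" means pointing the service at the previous folder.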

Update strategy

Pick an update strategy that fits your organization: how often docs change, time constraints, and freshness requirements.

  • “batch” ingestion (e.g., nightly) if you have many documents
  • “on‑demand” ingestion if you add documents occasionally

Useful logs (without leakage)

Logs help you diagnose and improve the system without retaining more sensitive data than necessary: track the pipeline (retrieval → answer → sources), not personal details about your users.

  • question + timestamp
  • retrieved documents (names / IDs)
  • score / Top‑K
  • answer + sources

In a private RAG, these logs are also a tool for continuous improvement.
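One structured record per question covers the four items above. The field names are illustrative, to adapt to your logging stack:

```python
import json
import time

def log_interaction(question, retrieved, scores, answer, sources, logfile=None):
    """Build one structured log record per question: what was asked,
    what was retrieved (names/IDs and scores), and what was answered."""
    record = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "question": question,
        "retrieved_docs": retrieved,   # document names / IDs only
        "scores": scores,              # retrieval scores for the Top-K
        "answer": answer,
        "sources": sources,            # sources actually cited
    }
    if logfile is not None:
        logfile.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record
```

Grepping these JSON lines for answers whose `sources` list is empty, or whose top score is low, is a cheap way to find the documents worth improving next.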

Speed up with the Knowledge Base QA Agent

This guide explains how to think about a reliable RAG. In practice, what takes time is not the idea “retrieval + writing”, but industrialization: clean ingestion, versioned index, stable API, useful logs, and strict source‑based behavior.

Knowledge Base QA Agent is designed to prevent the “RAG project that drags on” by providing a ready‑to‑use base:

  • Ingestion: a script to turn your files into chunks, generate embeddings, and build a local FAISS index
  • Query: a CLI mode to test quickly with real questions
  • Integration: a FastAPI service with a single route (question → answer + sources)
  • Deployment: a Docker start to keep an environment reproducible
  • Configuration: simple settings via environment variables (Top‑K, embeddings model, generation model, temperature, context window)

What the agent saves you

Instead of starting from a blank slate, you immediately get:

  • a coherent project structure (ingestion / index / service)
  • answers with citations usable in a product
  • a foundation ready to harden (index versioning, update strategy, logs)

If your goal is to integrate a RAG into an existing app (intranet, support, on‑prem SaaS product), the agent helps you move faster from “demo” to “component”: a stable endpoint, readable sources, and a base that is easy to iterate.

In summary

An effective private RAG is clean documents, consistent splitting, well‑tuned retrieval, and generation that follows one simple rule: no source, no claim.

If you want to move quickly from an idea to an integrable component (ingestion + FAISS index + API + citations + Docker base), the simplest path is to start with Knowledge Base QA Agent and iterate on your documents and parameters.