Feature Specification: RAG Retrieval Quality Improvements

Feature Branch: 004-rag-retrieval-quality
Created: 2026-04-06
Status: Draft
Input: User description: "I want to enhance the current RAG system, to avoid common pitfalls like: Vocabulary mismatch in retrieval, where the user's language doesn't overlap with the documentation's terminology and Citation errors in generation, where the model cites a chunk that either wasn't retrieved or doesn't support the specific claim being made"

Overview

The AI Teacher's question-answering system retrieves relevant content from neurosurgery documentation and generates answers. Two reliability problems reduce trust in the system:

Vocabulary mismatch: A student may ask "what happens after cutting the skull?" in everyday language while the documentation uses clinical terms like "craniotomy", so relevant passages are missed entirely.
Citation hallucination: The model sometimes references a section or page in its answer that was not actually retrieved or that does not support the specific claim made, misleading students.

This feature improves both the accuracy of what gets retrieved and the integrity of what the model claims as its sources.

User Scenarios & Testing (mandatory)

User Story 1 - Accurate Retrieval Despite Different Terminology (Priority: P1)

A medical student asks a question using lay or imprecise language. The system bridges the gap between their vocabulary and the technical terminology used in the textbook, returning contextually relevant passages even when there is no word overlap between the question and the document text.

Why this priority: Vocabulary mismatch is the most frequent silent failure: the system returns an answer, but one based on wrong or empty context, so students receive incorrect information without realising it.

Independent Test: Ask the system a question using a common synonym or lay term for a concept that appears in the documentation only under a clinical name. Verify that the retrieved passages contain relevant content about that concept.

Acceptance Scenarios:

Given a question uses a lay term (e.g., "brain swelling"), When the system retrieves content, Then it returns passages discussing the clinical equivalent used in the documentation ("cerebral edema") even though that phrase was not in the question.
Given a question uses an acronym the documentation spells out fully, When retrieval runs, Then relevant passages are still found.
Given a highly technical question that perfectly matches documentation language, When retrieval runs, Then quality does not regress compared to current behavior.

User Story 2 - Grounded Citation in Generated Answers (Priority: P1)

When the model produces an answer and references a source (section, page, or figure), that source must have been part of the retrieved context and must genuinely support the specific claim being cited.

Why this priority: A hallucinated citation is worse than no citation — it gives students false confidence that a claim is documented when it is not.

Independent Test: Submit a query, capture the retrieved context passed to the model, and verify that every source reference in the generated answer maps to an identifier present in that context.

Acceptance Scenarios:

Given retrieved context contains sections A, B, and C, When the model generates an answer citing "Section D", Then that citation is either removed or flagged before the answer is shown to the user.
Given a retrieved passage is used as context, When the model cites it for a claim, Then the cited passage actually contains the information that supports that claim.
Given no relevant context was retrieved for part of a query, When the model responds, Then it acknowledges uncertainty rather than fabricating a source.
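
The first acceptance scenario above (a citation to a source that was never retrieved) can be sketched as a post-processing check. This is a minimal illustration, not the project's implementation: the bracketed marker format ("[sec-3.2]", "[fig-12]"), the function name validate_citations, and the "[unverified source]" flag text are all assumptions made for the example. It only checks that a cited identifier was retrieved; checking that the passage supports the claim (the second scenario) would need a separate step.

```python
import re

# Hypothetical citation marker format: "[sec-3.2]" or "[fig-12]".
CITATION_PATTERN = re.compile(r"\[((?:sec|fig)-[\w.]+)\]")


def validate_citations(answer: str, retrieved_ids: set[str]) -> tuple[str, list[str]]:
    """Flag citations whose identifier is not in the retrieved context.

    Returns the answer with each invalid citation replaced by a visible
    flag, plus the list of invalid identifiers in order of appearance.
    """
    invalid: list[str] = []

    def _check(match: re.Match) -> str:
        source_id = match.group(1)
        if source_id in retrieved_ids:
            return match.group(0)  # keep valid citations untouched
        invalid.append(source_id)
        return "[unverified source]"

    cleaned = CITATION_PATTERN.sub(_check, answer)
    return cleaned, invalid


answer = "First claim [sec-3.2]. Second claim [sec-9.9]."
cleaned, invalid = validate_citations(answer, {"sec-3.2", "sec-4.1"})
print(cleaned)
print(invalid)
```

Flagging rather than silently deleting keeps the change visible to the student; removing the marker instead would be a one-line change in _check.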

User Story 3 - User Visibility into Retrieval Confidence (Priority: P2)

A student can see which parts of the answer are well-supported by the retrieved material and which parts carry lower confidence, allowing them to judge how much to rely on each claim.

Why this priority: Transparency builds appropriate trust — students should know when the system is uncertain rather than presenting all answers with equal authority.

Independent Test: Ask a question where only partial relevant content exists in the corpus. Verify the answer visually differentiates well-supported claims from lower-confidence statements.

Acceptance Scenarios:

Given an answer contains a claim backed by a directly retrieved passage, When displayed, Then the source is shown alongside the claim.
Given an answer contains a claim not directly covered by retrieved content, When displayed, Then the system signals lower confidence or absence of a source for that claim.

Edge Cases

What happens when query enrichment produces terms that retrieve completely unrelated passages?
How does the system behave when the retrieved context is very short or empty?
What if two retrieved sections contradict each other — how does citation work in that case?
What if citation verification removes all citations from an answer (model cited nothing valid)?
How does the system handle questions in languages other than the documentation language?

Requirements (mandatory)

Functional Requirements

FR-001: The system MUST enrich user queries to bridge vocabulary gaps before retrieval, using the domain context of the book being queried.
FR-002: The system MUST retrieve passages based on the enriched query, not solely on the literal user input.
FR-003: The system MUST pass only verified, retrieved passage identifiers to the generation step as eligible citation targets.
FR-004: The system MUST validate that every source reference in a generated answer corresponds to a passage present in the retrieved context.
FR-005: The system MUST suppress or flag any citation in the generated answer that refers to a passage not present in the retrieved context.
FR-006: The system MUST surface to the user which retrieved passages were used to support each claim (or indicate absence of supporting source).
FR-007: When no relevant passages are retrieved, the system MUST communicate this clearly rather than generating an unsupported answer.
FR-008: Query enrichment MUST be scoped to the active book to avoid introducing terminology from unrelated domains.
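
As one possible reading of FR-001, FR-002, and FR-008, the sketch below enriches a query from a glossary that belongs to the active book. The specification does not prescribe how enrichment is done, so the glossary approach, the BOOK_GLOSSARIES structure, the book identifier "neurosurgery-handbook", and the function name enrich_query are all assumptions for illustration. The two glossary entries reuse the examples given earlier in this document.

```python
# Hypothetical per-book glossaries mapping lay phrases to the book's own terms.
BOOK_GLOSSARIES: dict[str, dict[str, list[str]]] = {
    "neurosurgery-handbook": {
        "brain swelling": ["cerebral edema"],
        "cutting the skull": ["craniotomy"],
    },
}


def enrich_query(query: str, book_id: str) -> str:
    """Append domain terms from the active book's glossary to the query.

    Only the glossary of the active book is consulted (FR-008). Queries
    with no matching lay phrase are returned unchanged, so queries that
    already use the documentation's vocabulary are not altered.
    """
    glossary = BOOK_GLOSSARIES.get(book_id, {})
    lowered = query.lower()
    extra_terms: list[str] = []
    for lay_phrase, clinical_terms in glossary.items():
        if lay_phrase in lowered:
            for term in clinical_terms:
                if term not in lowered and term not in extra_terms:
                    extra_terms.append(term)
    if not extra_terms:
        return query
    return f"{query} ({'; '.join(extra_terms)})"


print(enrich_query("What causes brain swelling?", "neurosurgery-handbook"))
```

Retrieval would then run on the returned string rather than on the literal user input.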

Key Entities

Enriched Query: The augmented version of the user's original question, including synonyms, alternate phrasings, or domain-aligned terms used for retrieval.
Retrieved Context: The set of passages (sections and figures) returned by retrieval and passed to generation, each identified by a unique source reference.
Citation: A reference in the generated answer to a specific source; must be traceable to a member of the Retrieved Context.
Citation Validation Result: A per-citation judgment of whether the cited source was retrieved and supports the claim.
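
The entities above could be modelled roughly as follows. This is a sketch only: every class and field name here (Passage, source_id, added_terms, was_retrieved, supports_claim, and so on) is hypothetical and chosen for the example, not taken from the existing codebase.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Passage:
    source_id: str  # unique source reference, usable as a citation anchor
    kind: str       # "section" or "figure"
    text: str


@dataclass
class EnrichedQuery:
    original: str
    added_terms: list[str] = field(default_factory=list)


@dataclass
class RetrievedContext:
    passages: list[Passage]

    def eligible_ids(self) -> set[str]:
        """Identifiers the generation step is allowed to cite."""
        return {p.source_id for p in self.passages}


@dataclass
class Citation:
    source_id: str
    claim: str  # the sentence or span the citation is attached to


@dataclass
class CitationValidationResult:
    source_id: str
    was_retrieved: bool
    supports_claim: Optional[bool] = None  # None means not yet assessed


ctx = RetrievedContext([Passage("sec-1", "section", "..."), Passage("fig-2", "figure", "...")])
print(sorted(ctx.eligible_ids()))
```

Keeping was_retrieved and supports_claim as separate fields reflects the two distinct checks in the Citation Validation Result definition.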

Success Criteria (mandatory)

Measurable Outcomes

SC-001: Questions using common lay synonyms for clinical terms retrieve relevant passages at least 80% of the time (verified on a manually curated test set of synonym-query pairs).
SC-002: Zero citations appear in generated answers that reference a source not present in the retrieved context passed to the model.
SC-003: For every generated answer, 100% of displayed citations can be traced to a retrieved passage identifier.
SC-004: Retrieval quality for queries that already match documentation vocabulary does not degrade (baseline score maintained or improved).
SC-005: Users can identify, for each factual claim in an answer, whether a supporting source was found — without needing to ask a follow-up question.
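
SC-001 could be measured with a small harness like the one below. It is a sketch under stated assumptions: a query counts as a hit when at least one manually labelled relevant passage appears in the top k results, and both that definition and the default k=5 are choices made for this example rather than values fixed by the specification. The retrieve callable stands in for the real retrieval pipeline.

```python
from typing import Callable, Iterable, Sequence


def synonym_hit_rate(
    test_set: Iterable[tuple[str, set[str]]],
    retrieve: Callable[[str], Sequence[str]],
    k: int = 5,
) -> float:
    """Fraction of queries whose top-k results include a relevant passage.

    test_set pairs each lay-language query with the identifiers of the
    passages a reviewer marked as relevant. retrieve returns passage
    identifiers ranked best-first.
    """
    hits = 0
    total = 0
    for query, relevant_ids in test_set:
        total += 1
        top_k = retrieve(query)[:k]
        if relevant_ids.intersection(top_k):
            hits += 1
    if total == 0:
        raise ValueError("test set is empty")
    return hits / total


# Toy stand-in for the retrieval pipeline, for demonstration only.
fake_index = {"brain swelling": ["sec-2", "sec-7"], "cutting the skull": ["sec-9"]}
test_set = [("brain swelling", {"sec-7"}), ("cutting the skull", {"sec-4"})]
print(synonym_hit_rate(test_set, lambda q: fake_index.get(q, [])))
```

SC-001 would be met when this value is at least 0.8 on the curated test set; running the same harness on queries that already use documentation vocabulary gives a baseline for SC-004.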

Assumptions

The existing retrieval pipeline returns section and figure identifiers that can be used as citation anchors.
Query enrichment operates at query time (not ingestion time), so no changes to stored embeddings are needed.
The generation model can be instructed via prompt to restrict citations to a provided list of identifiers.
Citation validation is performed after generation but before the answer is shown to the user (post-processing step).
Mobile or offline support is out of scope for this feature.
Multi-language support (non-English questions against English documentation) is a future concern and not addressed here.
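
The assumption that the generation model can be restricted by prompt to a provided list of identifiers might look like this in practice. The prompt wording, the function name build_generation_prompt, and the choice to raise an error on empty context are illustrative assumptions, not the project's actual prompt. Prompt instructions alone do not guarantee compliance, which is why the post-generation validation step is still assumed.

```python
def build_generation_prompt(question: str, passages: dict[str, str]) -> str:
    """Assemble a prompt that lists the only identifiers the model may cite.

    passages maps each retrieved source identifier to its text. An empty
    mapping is rejected so the caller can report that nothing relevant
    was found instead of generating an unsupported answer (FR-007).
    """
    if not passages:
        raise ValueError("no retrieved passages; report that no source was found")
    context = "\n".join(f"[{sid}] {text}" for sid, text in passages.items())
    allowed = ", ".join(passages)
    return (
        "Answer using only the context below.\n"
        f"Cite sources only with these identifiers: {allowed}.\n"
        "If the context does not cover part of the question, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


print(build_generation_prompt("What is a craniotomy?", {"sec-1": "...", "fig-2": "..."}))
```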
