Research: RAG Retrieval Quality Improvements

Branch: 004-rag-retrieval-quality | Date: 2026-04-06

Decision 1: Query Expansion Strategy

Decision: Single LLM rewrite — ask the model to restate the user's question using clinical/technical terminology before retrieval. Use the rewritten query for vector search instead of (or in addition to) the original.

Rationale: The simplest approach that directly addresses the vocabulary mismatch between how users phrase questions and the clinical terminology used in the documentation. A single extra LLM call rewrites the query into that terminology, so the embedding similarity search is more likely to match the relevant passages. No new dependencies, no index changes.

Alternatives considered:

HyDE (Hypothetical Document Embeddings): Ask the model to write a hypothetical answer, then embed that. More powerful but adds latency and the answer may itself hallucinate clinical content — rejected for POC.
Multi-query retrieval: Generate 3–5 alternative queries and merge results. Effective but multiplies retrieval calls and deduplication complexity — rejected (KISS).
Synonym dictionary: Pre-built medical thesaurus mapping. No new dependencies but requires maintenance and won't generalise — rejected.
Re-ranking: Run a cross-encoder after retrieval. Addresses ordering, not vocabulary gap — deferred.

Implementation note: The rewrite prompt should be a short, focused instruction: "Rewrite the following question using precise medical/surgical terminology as it would appear in a neurosurgery textbook index. Output only the rewritten question." Use the same ChatClient bean; no new API client needed.
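A minimal sketch of the rewrite step, in Java to match the Spring backend. QueryRewriter, buildPrompt and mergeChunkIds are hypothetical names. The llm function stands in for a call through the existing ChatClient bean (assuming it is Spring AI's ChatClient, roughly chatClient.prompt().user(prompt).call().content()); a plain Function keeps the sketch runnable without Spring. It falls back to the original question if the rewrite fails or comes back empty, and shows the "in addition to" option as a de-duplicating merge of retrieved chunk ids.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of single-rewrite query expansion (Decision 1).
// `llm` stands in for a call through the existing ChatClient bean.
class QueryRewriter {
    static final String INSTRUCTION =
            "Rewrite the following question using precise medical/surgical terminology "
            + "as it would appear in a neurosurgery textbook index. "
            + "Output only the rewritten question.";

    private final Function<String, String> llm;

    QueryRewriter(Function<String, String> llm) {
        this.llm = llm;
    }

    // Prompt = fixed instruction followed by the user's question.
    static String buildPrompt(String question) {
        return INSTRUCTION + "\n\nQuestion: " + question;
    }

    // Returns the rewritten query, or the original question if the call fails
    // or returns nothing usable, so retrieval never receives an empty query.
    String rewrite(String question) {
        String out;
        try {
            out = llm.apply(buildPrompt(question));
        } catch (RuntimeException e) {
            return question;
        }
        return (out == null || out.isBlank()) ? question : out.strip();
    }

    // "In addition to" variant: union of chunk ids retrieved for both queries,
    // duplicates removed, rewritten-query hits first (an arbitrary choice).
    static List<String> mergeChunkIds(List<String> fromRewritten, List<String> fromOriginal) {
        LinkedHashSet<String> merged = new LinkedHashSet<>(fromRewritten);
        merged.addAll(fromOriginal);
        return new ArrayList<>(merged);
    }
}
```

The fallback keeps the extra LLM call from becoming a new failure mode for retrieval: at worst the system behaves as it does today.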

Decision 2: Citation Grounding Strategy

Decision: Tag each retrieved context section with a short ref-label ([S1], [S2], …, [Fn] for figures) in the prompt. Instruct the model to cite only using those labels. Post-process the generated answer to detect and strip any citation that does not correspond to a known label.

Rationale: The simplest way to make citations verifiable — all valid citation targets are enumerated in the prompt itself. Post-processing is a pure string operation; no second LLM call needed for validation.
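A sketch of that string-only pass, assuming labels always take the form [S<n>] or [F<n>]. CitationValidator and stripUnknown are hypothetical names (the implementation note below calls the real component CitationValidatorService). It also drops a single space before a removed label, so "sealed [S9]." becomes "sealed.". Comma-separated forms such as [S1, S2] are not matched by this pattern.

```java
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the citation post-processing pass (Decision 2).
class CitationValidator {
    // An optional leading space, then [S<digits>] or [F<digits>]; group 1 is the bare label.
    private static final Pattern LABEL = Pattern.compile(" ?\\[([SF]\\d+)\\]");

    // Keeps citations whose label is in knownLabels and deletes the rest.
    static String stripUnknown(String answer, Set<String> knownLabels) {
        Matcher m = LABEL.matcher(answer);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String keep = knownLabels.contains(m.group(1)) ? m.group() : "";
            m.appendReplacement(out, Matcher.quoteReplacement(keep));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```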

Alternatives considered:

Ask the model to self-check citations: Second LLM call to verify each claim. More accurate but doubles cost and latency — rejected for POC.
Structured output (JSON): Return answer as JSON with claim/source pairs. Most precise but requires significant prompt engineering and frontend changes — deferred.
No citation enforcement: Let the model cite freely and show retrieved sources separately. Already the current state — rejected because it doesn't solve citation hallucination.

Implementation note:

Context sections labelled [S1] Section Title, p.N through [Sk] in buildContextPrompt().
Figures labelled [F1], [F2] etc.
System prompt updated: "Cite claims using ONLY the reference labels [S1]…[Sk] and [F1]…[Fj] provided in the context. Do not invent page numbers or section titles."
CitationValidatorService scans generated text for [Sx] / [Fx] patterns, checks each against the known label set, and removes unknown ones.
Message.sources is already a List<Map<String,Object>>. No schema change needed — but we store the ref-label alongside each source so the frontend can correlate.
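A sketch of the labelling side, assuming hypothetical RetrievedSection and RetrievedFigure records in place of whatever buildContextPrompt() currently receives. It builds the "[S1] Title, p.N" and "[F1] Caption, p.N" headers, collects the known label set for the validator, and produces source maps in the List<Map<String,Object>> shape of Message.sources with an extra "refLabel" key (the key name is an assumption).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical inputs; the real retrieval types will differ.
record RetrievedSection(String title, int page, String text) {}
record RetrievedFigure(String caption, int page) {}

// Hypothetical result of labelling the retrieved context (Decision 2).
class LabelledContext {
    final String prompt;                        // context block inserted into the prompt
    final Set<String> knownLabels;              // handed to the citation validator
    final List<Map<String, Object>> sources;    // Message.sources shape plus "refLabel"

    private LabelledContext(String prompt, Set<String> knownLabels, List<Map<String, Object>> sources) {
        this.prompt = prompt;
        this.knownLabels = knownLabels;
        this.sources = sources;
    }

    static LabelledContext build(List<RetrievedSection> sections, List<RetrievedFigure> figures) {
        StringBuilder sb = new StringBuilder();
        Set<String> labels = new LinkedHashSet<>();
        List<Map<String, Object>> sources = new ArrayList<>();

        // Sections become [S1]..[Sk], numbered in retrieval order.
        for (int i = 0; i < sections.size(); i++) {
            RetrievedSection s = sections.get(i);
            String label = "S" + (i + 1);
            sb.append('[').append(label).append("] ").append(s.title())
              .append(", p.").append(s.page()).append('\n')
              .append(s.text()).append("\n\n");
            labels.add(label);
            sources.add(source(label, s.title(), s.page()));
        }
        // Figures become [F1]..[Fj].
        for (int j = 0; j < figures.size(); j++) {
            RetrievedFigure f = figures.get(j);
            String label = "F" + (j + 1);
            sb.append('[').append(label).append("] ").append(f.caption())
              .append(", p.").append(f.page()).append("\n\n");
            labels.add(label);
            sources.add(source(label, f.caption(), f.page()));
        }
        return new LabelledContext(sb.toString(), labels, sources);
    }

    private static Map<String, Object> source(String label, String title, int page) {
        Map<String, Object> m = new LinkedHashMap<>();
        m.put("refLabel", label);
        m.put("title", title);
        m.put("page", page);
        return m;
    }
}
```

Because the prompt text, the label set and the stored sources come out of one pass, the labels the model sees, the labels the validator accepts and the labels the frontend can correlate cannot drift apart.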

Decision 3: Frontend Source Display

Decision: No frontend change needed for MVP. The backend already filters out hallucinated citations via CitationValidatorService before saving the message. The sources list attached to the message is already the retrieved set. The existing ChatMessage.vue source panel continues to show all retrieved sources (which is correct — they were all used as context).

Rationale: KISS. The critical correctness fix (no hallucinated citations in answer text) is fully backend-side. The sources panel shows context that was genuinely available to the model — that is accurate and useful. Linking specific claims to specific sources (inline highlighting) is a UX enhancement that can be a follow-on feature.

Alternatives considered:

Inline citation links: Highlight [S1] in answer text and link to the source card. Better UX but requires markdown parsing and component changes — deferred.
Show only cited sources: Filter sources to only those actually cited in the answer. Marginally more accurate but the uncited context is still legitimately retrieved — deferred.

Decision 4: Persisting Topic Summaries

Decision: Add a new topic_summary table that stores each generated summary. Summaries are numbered sequentially per topic (summary #1, #2, …). The POST endpoint continues to generate and now also persists. A new GET endpoint lists saved summaries for a topic.

Rationale: The simplest approach: one new table, one new repository, minimal changes to the existing service. No caching layer, no event system. The sequential number is counted in the repository at insert time (COUNT(*) + 1 for the topic) and stored with the row, as described in the implementation note.

Alternatives considered:

Store in frontend state only (session memory): Already the current state — summaries are lost on reload. Rejected because the user explicitly wants persistence.
Persisted summary_number vs. query-time derivation (ROW_NUMBER): a persisted column avoids a query-time count but risks gaps/duplicates on concurrent writes. For a POC with single-user use, computing SELECT COUNT(*) + 1 at insert time and persisting it is sufficient, so that risk is accepted.
Versioning / soft-delete: Overkill for a POC where the user just wants to re-read old summaries. No delete endpoint needed initially.

Implementation note:

New Flyway migration V6__topic_summary.sql.
TopicSummaryEntity (JPA @Entity → table topic_summary): id (UUID), topicId (VARCHAR FK), summaryNumber (INT), summary (TEXT), sourcesJson (TEXT — JSON array), generatedAt (TIMESTAMP).
summaryNumber set at insert time: (SELECT COUNT(*) FROM topic_summary WHERE topic_id = ?) + 1.
TopicSummaryRepository extends JpaRepository<TopicSummaryEntity, UUID> with a findByTopicIdOrderBySummaryNumberAsc query.
TopicSummaryService.generateSummary() saves the result before returning it.
TopicController gains GET /api/v1/topics/{id}/summaries returning List<SavedSummaryResponse> (id, summaryNumber, generatedAt — no full text, for list efficiency) and GET /api/v1/topics/{id}/summaries/{summaryId} for the full detail.
Frontend: when a topic card is clicked, fetch the summary list. Show "Summary #1", "Summary #2" chips + "Generate New" button. Clicking a chip loads the full summary.
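A plain-Java sketch of the numbering and list projection, with an in-memory stand-in for TopicSummaryRepository so it runs without JPA. Field names follow TopicSummaryEntity above; InMemoryTopicSummaryRepository, TopicSummaryPersistence, saveGenerated and list are hypothetical. In the real repository, a Spring Data derived countByTopicId method could supply the count.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.UUID;

// Mirrors the TopicSummaryEntity fields; JPA annotations omitted so this runs standalone.
record TopicSummary(UUID id, String topicId, int summaryNumber,
                    String summary, String sourcesJson, Instant generatedAt) {}

// List-view DTO without the full text, as in SavedSummaryResponse.
record SavedSummaryResponse(UUID id, int summaryNumber, Instant generatedAt) {}

// Hypothetical in-memory stand-in for TopicSummaryRepository.
class InMemoryTopicSummaryRepository {
    private final List<TopicSummary> rows = new ArrayList<>();

    long countByTopicId(String topicId) {
        return rows.stream().filter(r -> r.topicId().equals(topicId)).count();
    }

    TopicSummary save(TopicSummary s) {
        rows.add(s);
        return s;
    }

    List<TopicSummary> findByTopicIdOrderBySummaryNumberAsc(String topicId) {
        return rows.stream()
                .filter(r -> r.topicId().equals(topicId))
                .sorted(Comparator.comparingInt(TopicSummary::summaryNumber))
                .toList();
    }
}

// Hypothetical persistence step called by generateSummary() before it returns.
class TopicSummaryPersistence {
    private final InMemoryTopicSummaryRepository repo;

    TopicSummaryPersistence(InMemoryTopicSummaryRepository repo) {
        this.repo = repo;
    }

    // summaryNumber = COUNT(*) for the topic + 1. Two concurrent saves for the same
    // topic can read the same count and get the same number (accepted for the POC).
    TopicSummary saveGenerated(String topicId, String summary, String sourcesJson) {
        int next = (int) repo.countByTopicId(topicId) + 1;
        return repo.save(new TopicSummary(UUID.randomUUID(), topicId, next,
                summary, sourcesJson, Instant.now()));
    }

    // Backs GET /api/v1/topics/{id}/summaries: numbers and timestamps only.
    List<SavedSummaryResponse> list(String topicId) {
        return repo.findByTopicIdOrderBySummaryNumberAsc(topicId).stream()
                .map(s -> new SavedSummaryResponse(s.id(), s.summaryNumber(), s.generatedAt()))
                .toList();
    }
}
```

Numbering is per topic, so summaries of different topics each start at #1.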
