Concept Retrieval via Indexing-Time Chunk Enrichment
Context
Vector similarity alone can't answer "tell me everything about aneurysms." It surfaces the chunks most linguistically similar to the query, not the set of all chunks that concern the concept — and it has no notion of whether each chunk is a definition, a case, a technique, or a complication.
The unlock is to move intelligence from query time to indexing time: for every text chunk, use an LLM to extract structured metadata (entities, facet, summary). At retrieval time, concept lookup becomes an SQL filter (entities @> '["aneurysm"]'::jsonb) bucketed by facet: deterministic, exhaustive, and organized by default. Vector search remains as a fallback for typos and synonyms, and for ranking within a facet.
This plan covers: (1) defining the metadata schema, (2) enriching chunks during new book ingestion, (3) back-filling the already-embedded corpus via an admin endpoint, (4) a new concept retrieval path, and (5) a Topics-page UI to surface the result.
Approach
1. Data model — new chunk_metadata table
Flyway migration backend/src/main/resources/db/migration/V7__chunk_metadata.sql:
CREATE TABLE chunk_metadata (
    chunk_id VARCHAR(64) PRIMARY KEY, -- same UUID that TextChunkingService issues and stores in vectorstore
    book_id UUID NOT NULL,
    section_id VARCHAR(255) NOT NULL,
    facet VARCHAR(32) NOT NULL, -- enum (see ConceptFacet)
    entities JSONB NOT NULL, -- canonical lowercase string[]
    summary TEXT NOT NULL,
    model_version VARCHAR(32) NOT NULL, -- records which LLM/prompt version tagged this chunk
    enriched_at TIMESTAMPTZ NOT NULL
);
CREATE INDEX idx_chunk_metadata_book ON chunk_metadata(book_id);
CREATE INDEX idx_chunk_metadata_book_facet ON chunk_metadata(book_id, facet);
CREATE INDEX idx_chunk_metadata_entities_gin ON chunk_metadata USING GIN (entities jsonb_path_ops);
Why chunk_id is the natural key: TextChunkingService already generates a UUID per chunk, uses it as the pgvector Document id, stores it in metadata, and it's the key in ChunkFigureRefEntity — so the table joins cleanly to everything already in place.
2. Enrichment service & facet taxonomy
New package com.aiteacher.enrichment:
- ConceptFacet enum, 13 values tailored to neurosurgery textbooks: DEFINITION, ANATOMY, PATHOPHYSIOLOGY, EPIDEMIOLOGY, CLINICAL_PRESENTATION, IMAGING, CLASSIFICATION, INDICATIONS, SURGICAL_TECHNIQUE, NONSURGICAL_MANAGEMENT, COMPLICATIONS, OUTCOMES_FOLLOWUP, OTHER. OTHER is mandatory so the LLM always has an out (no hallucinated bucketing). The prompt carries explicit disambiguation rules (named grading scales → CLASSIFICATION; imaging of a complication → COMPLICATIONS; tools inside an operation → SURGICAL_TECHNIQUE).
- ChunkEnrichmentResult: record(List<String> entities, ConceptFacet facet, String summary).
- ChunkEnrichmentService: single method enrich(String chunkText, SectionEntity section, String bookTitle) → ChunkEnrichmentResult. Uses Spring AI ChatClient.prompt().call().entity(Class) for structured output. The prompt gives: book title, section title, chunk text, the fixed facet enum list, and instructs the model to return JSON with entities normalised to lowercase singular canonical form (e.g. "aneurysms" → "aneurysm"; "SAH" → "subarachnoid hemorrhage"). Caps entities at ~8 per chunk.
- ChunkMetadataEntity + ChunkMetadataRepository: JPA entity/repo mirroring the table.
Model version string (e.g. "v1") lives on the service and is stamped into each row so a future prompt rev can be rolled out by filtering model_version <> 'v2' in the backfill job.
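The facet fallback and the entity cap described in this section can be sketched as a small post-processing step applied to whatever the model returns. This is only an illustration: EnrichmentNormalizer and its method names are hypothetical, the exact cap of 8 is an assumption (the plan says "~8"), and singularisation or abbreviation expansion is left to the prompt rather than done here.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; not part of the plan's stated API.
final class EnrichmentNormalizer {

    // The 13 facets listed in this section.
    enum ConceptFacet {
        DEFINITION, ANATOMY, PATHOPHYSIOLOGY, EPIDEMIOLOGY, CLINICAL_PRESENTATION,
        IMAGING, CLASSIFICATION, INDICATIONS, SURGICAL_TECHNIQUE,
        NONSURGICAL_MANAGEMENT, COMPLICATIONS, OUTCOMES_FOLLOWUP, OTHER
    }

    // Assumed hard cap; the plan only says "~8 per chunk".
    static final int MAX_ENTITIES = 8;

    // Maps a raw model label to a facet, falling back to OTHER for anything unrecognised.
    static ConceptFacet parseFacet(String raw) {
        if (raw == null) {
            return ConceptFacet.OTHER;
        }
        String key = raw.trim().toUpperCase(Locale.ROOT).replace(' ', '_').replace('-', '_');
        try {
            return ConceptFacet.valueOf(key);
        } catch (IllegalArgumentException e) {
            return ConceptFacet.OTHER;
        }
    }

    // Lowercases and trims each entity, drops blanks and duplicates
    // (keeping first-seen order), then truncates to MAX_ENTITIES.
    static List<String> normalizeEntities(List<String> raw) {
        Set<String> seen = new LinkedHashSet<>();
        for (String entity : raw) {
            if (entity == null) {
                continue;
            }
            String cleaned = entity.trim().toLowerCase(Locale.ROOT);
            if (!cleaned.isEmpty()) {
                seen.add(cleaned);
            }
        }
        List<String> out = new ArrayList<>(seen);
        return out.size() > MAX_ENTITIES
                ? List.copyOf(out.subList(0, MAX_ENTITIES))
                : List.copyOf(out);
    }
}
```

Running the model output through a step like this before persisting would keep the entities column consistent with what the retrieval filter expects, even when the model returns mixed case or repeats an entity.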
3. Hook into new book ingestion
Modify BookEmbeddingService.embedBook:
// Step 3: Chunk and embed text
List<Document> allChunks = new ArrayList<>();
for (SectionEntity section : sections) {
    allChunks.addAll(textChunkingService.chunk(section, bookTitle));
}
if (skipEmbedding) { ... } else {
    embedInBatches(allChunks, bookId);
    chunkEnrichmentPipeline.enrichAndPersist(allChunks, sectionsById, bookTitle); // NEW
}
- ChunkEnrichmentPipeline: new orchestrator that iterates chunks, calls ChunkEnrichmentService.enrich(...) per chunk, and saves ChunkMetadataEntity rows in batches, with the same throttle pattern as embedInBatches.
- Runs after embedding, not in place of it, so a failure in enrichment doesn't corrupt the vector store. On failure, log and continue; the backfill endpoint is the universal recovery path.
- Extend deleteBookChunks to also delete chunk_metadata rows so deletion stays consistent.
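The "log and continue" behaviour with batched saves might look roughly like the following. It is a dependency-free sketch, not the actual ChunkEnrichmentPipeline: the Chunk and Row records, the functional parameters standing in for the enrichment service and the repository, and the batchSize argument are all assumptions, and the throttle between batches is omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical stand-in for ChunkEnrichmentPipeline, with plain functions instead of Spring beans.
final class EnrichmentPipelineSketch {

    record Chunk(String id, String text) {}
    record Row(String chunkId, String summary) {}

    // Enriches each chunk, flushes rows to the saver in batches,
    // and returns the ids of chunks whose enrichment failed.
    static List<String> enrichAndPersist(List<Chunk> chunks,
                                         Function<Chunk, Row> enricher,
                                         Consumer<List<Row>> saver,
                                         int batchSize) {
        List<String> failed = new ArrayList<>();
        List<Row> batch = new ArrayList<>();
        for (Chunk chunk : chunks) {
            try {
                batch.add(enricher.apply(chunk));
            } catch (RuntimeException e) {
                // Log and continue: a later backfill run can pick this chunk up.
                System.err.println("enrichment failed for chunk " + chunk.id() + ": " + e.getMessage());
                failed.add(chunk.id());
            }
            if (batch.size() >= batchSize) {
                saver.accept(List.copyOf(batch));
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            saver.accept(List.copyOf(batch)); // flush the final partial batch
        }
        return failed;
    }
}
```

Because a failed chunk simply has no chunk_metadata row, nothing extra needs to be recorded for recovery; the returned list is only useful for logging or metrics.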
4. Backfill endpoint for already-embedded books
New EnrichmentController in com.aiteacher.enrichment:
- POST /api/v1/admin/books/{id}/enrich → kicks off async backfill, returns 202 with {status, chunksTotal, chunksEnriched}
- GET /api/v1/admin/books/{id}/enrich → returns progress
Backfill flow (EnrichmentBackfillService.backfillBook(UUID bookId)):
- Query the pgvector storage table directly via JdbcTemplate for all chunks of the book: SELECT id, content, metadata FROM vector_store WHERE metadata->>'book_id' = ? AND metadata->>'type' = 'TEXT'
- Left-anti-join against chunk_metadata to skip already-enriched chunks → idempotent, resumable.
- For each missing chunk: look up its SectionEntity via section_id in metadata, call ChunkEnrichmentService.enrich, write a ChunkMetadataEntity row.
- Progress tracked in an in-memory ConcurrentHashMap<UUID, BackfillProgress> (POC scope; no cross-restart resumability needed because the left-anti-join makes re-runs free).
- @Async on the backfill method using the same executor as embedBook.
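To illustrate why the anti-join makes the backfill idempotent, here is a small in-memory sketch. The SQL constant is one possible way to express the anti-join and assumes that vector_store.id is a uuid column (hence the cast to text); the BackfillProgress shape, and the choice to count chunksTotal as all TEXT chunks of the book, are likewise assumptions rather than the plan's definition.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

final class BackfillSketch {

    // Assumed SQL for the anti-join; the id::text cast presumes vector_store.id is a uuid.
    static final String MISSING_CHUNKS_SQL = """
        SELECT v.id, v.content, v.metadata
        FROM vector_store v
        WHERE v.metadata->>'book_id' = ?
          AND v.metadata->>'type' = 'TEXT'
          AND NOT EXISTS (SELECT 1 FROM chunk_metadata m WHERE m.chunk_id = v.id::text)
        """;

    // Hypothetical progress holder: total chunks in the book and how many have metadata so far.
    static final class BackfillProgress {
        final int chunksTotal;
        final AtomicInteger chunksEnriched;

        BackfillProgress(int chunksTotal, int alreadyEnriched) {
            this.chunksTotal = chunksTotal;
            this.chunksEnriched = new AtomicInteger(alreadyEnriched);
        }
    }

    static final Map<UUID, BackfillProgress> PROGRESS = new ConcurrentHashMap<>();

    // In-memory equivalent of the anti-join: chunk ids that have no metadata row yet.
    static List<String> missing(List<String> allChunkIds, Set<String> enrichedIds) {
        Set<String> todo = new LinkedHashSet<>(allChunkIds);
        todo.removeAll(enrichedIds);
        return List.copyOf(todo);
    }

    // Processes only the missing chunks; enrichedIds must be mutable and stands in for the table.
    static BackfillProgress backfill(UUID bookId, List<String> allChunkIds, Set<String> enrichedIds) {
        List<String> todo = missing(allChunkIds, enrichedIds);
        BackfillProgress progress =
                new BackfillProgress(allChunkIds.size(), allChunkIds.size() - todo.size());
        PROGRESS.put(bookId, progress);
        for (String chunkId : todo) {
            enrichedIds.add(chunkId); // stand-in for "call enrich, write a row"
            progress.chunksEnriched.incrementAndGet();
        }
        return progress;
    }
}
```

A second call with the same inputs finds nothing missing and returns immediately with chunksEnriched equal to chunksTotal, which is the "re-run is a no-op" property the flow relies on.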
5. Concept retrieval path
New com.aiteacher.concept.ConceptRetriever:
public ConceptRetrievalResult retrieveByConcept(String conceptKeyword, UUID bookId) {
    String canonical = canonicalise(conceptKeyword); // lowercase, trim, simple plural strip
    // 5a. Primary: SQL entity match, grouped by facet
    List<ChunkMetadataEntity> hits = chunkMetadataRepository
        .findByBookIdAndEntityContains(bookId, canonical); // WHERE entities @> to_jsonb(?::text)
    if (hits.isEmpty()) {
        // 5b. Fallback: vector search, then enrich-join + facet-group
        List<Document> vectorHits = vectorStore.similaritySearch(/* TEXT filter, book_id filter, topK=30 */);
        List<String> chunkIds = vectorHits.stream().map(Document::getId).toList();
        hits = chunkMetadataRepository.findByChunkIdIn(chunkIds);
    }
    Map<ConceptFacet, List<ChunkMetadataEntity>> byFacet = hits.stream()
        .collect(groupingBy(ChunkMetadataEntity::getFacet, LinkedHashMap::new, toList()));
    // Hydrate: load SectionEntity for each chunk's section_id; load linked figures
    // via ChunkFigureRefRepository.findByChunkIdIn(chunkIds) — reuses existing linkage.
    return assemble(byFacet, ...);
}
ConceptRetrievalResult = Map<ConceptFacet, FacetBundle> where each FacetBundle holds the parent sections, linked figures, and the per-chunk summary strings.
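The canonicalise step ("lowercase, trim, simple plural strip") and the facet grouping could be sketched as below. The plural rules here are a deliberately naive assumption, not the plan's specified behaviour: they handle "-ies" and a trailing "s" while leaving "-ss", "-us", and "-is" endings alone, and they will mishandle irregular forms. The Hit record with a String facet is a simplified stand-in for ChunkMetadataEntity.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

final class ConceptLookupSketch {

    // Simplified stand-in for ChunkMetadataEntity.
    record Hit(String chunkId, String facet, String summary) {}

    // Lowercase, trim, and strip a simple English plural suffix.
    static String canonicalise(String keyword) {
        String s = keyword.trim().toLowerCase(Locale.ROOT);
        if (s.length() > 4 && s.endsWith("ies")) {
            return s.substring(0, s.length() - 3) + "y"; // biopsies -> biopsy
        }
        // Skip endings that are usually singular in medical terms (abscess, hydrocephalus, meningitis).
        boolean looksPlural = s.length() > 3 && s.endsWith("s")
                && !s.endsWith("ss") && !s.endsWith("us") && !s.endsWith("is");
        return looksPlural ? s.substring(0, s.length() - 1) : s;
    }

    // Groups hits by facet, keeping facets in the order they first appear.
    static Map<String, List<Hit>> groupByFacet(List<Hit> hits) {
        return hits.stream().collect(
                Collectors.groupingBy(Hit::facet, LinkedHashMap::new, Collectors.toList()));
    }
}
```

Whatever rules are chosen, the query-side canonicalisation should produce the same form the enrichment prompt asks the model to emit, otherwise the entity match misses and the request falls through to the vector fallback.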
Cross-book aggregation: caller loops over READY books and merges bundles by facet.
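The cross-book merge can be as simple as concatenating each book's per-facet lists. The sketch below assumes facets are keyed by name and that items from earlier books should come first; both the class name and the generic item type are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CrossBookMergeSketch {

    // Merges per-book facet maps into one map, appending items in book order.
    static <T> Map<String, List<T>> mergeByFacet(List<Map<String, List<T>>> perBook) {
        Map<String, List<T>> merged = new LinkedHashMap<>();
        for (Map<String, List<T>> book : perBook) {
            book.forEach((facet, items) ->
                    merged.computeIfAbsent(facet, k -> new ArrayList<>()).addAll(items));
        }
        return merged;
    }
}
```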
6. Concept Report service & controller
New ConceptReportService in com.aiteacher.concept — mirrors the shape of TopicSummaryService, but:
- Calls ConceptRetriever.retrieveByConcept(topic.getName(), bookId) per book.
- For each facet that has hits, sends one LLM synthesis call with the chunks/figures of that facet, producing a structured, facet-labelled report.
- Persists in a new concept_report table:
CREATE TABLE concept_report (
    id UUID PRIMARY KEY,
    topic_id VARCHAR(255) NOT NULL REFERENCES topic(id),
    report_number INT NOT NULL,
    facets_json JSONB NOT NULL, -- [{facetKey,title,markdown,refLabels[]}, ...]
    sources_json JSONB NOT NULL, -- deduplicated SourceReference[]
    generated_at TIMESTAMPTZ NOT NULL,
    UNIQUE (topic_id, report_number)
);
Controller ConceptReportController exposes three endpoints under /api/v1/topics/{id}/concept-reports (POST generate, GET list, GET /{reportId}).
Reuses TopicSummaryResponse.SourceReference verbatim.
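A rough sketch of the "one synthesis call per facet with hits" loop follows. The LLM is represented by a plain function so the example stays self-contained; FacetSection here carries only three of the fields from facets_json, and the title formatting and the "max + 1, starting at 1" report numbering are assumptions suggested by the UNIQUE (topic_id, report_number) constraint, not something the plan spells out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.BiFunction;

final class ConceptReportSketch {

    // Reduced version of one facets_json element (refLabels omitted).
    record FacetSection(String facetKey, String title, String markdown) {}

    // Makes one synthesis call per facet that has chunks; empty facets are skipped.
    static List<FacetSection> synthesize(Map<String, List<String>> chunkTextsByFacet,
                                         BiFunction<String, List<String>, String> llm) {
        List<FacetSection> sections = new ArrayList<>();
        chunkTextsByFacet.forEach((facet, texts) -> {
            if (texts.isEmpty()) {
                return; // no hits for this facet, so no LLM call
            }
            String markdown = llm.apply(facet, texts);
            sections.add(new FacetSection(facet, toTitle(facet), markdown));
        });
        return sections;
    }

    // Assumed display format: SURGICAL_TECHNIQUE -> "Surgical technique".
    static String toTitle(String facetKey) {
        String lower = facetKey.toLowerCase(Locale.ROOT).replace('_', ' ');
        return Character.toUpperCase(lower.charAt(0)) + lower.substring(1);
    }

    // Assumed numbering scheme: next number after the highest existing one for the topic.
    static int nextReportNumber(List<Integer> existingNumbers) {
        return existingNumbers.stream().mapToInt(Integer::intValue).max().orElse(0) + 1;
    }
}
```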
7. Frontend
- frontend/src/stores/topicStore.ts: add parallel state conceptReportList, activeConceptReport, conceptReportLoading, and actions mirroring the existing summary ones.
- frontend/src/views/TopicsView.vue: add a Summary / Concept Report tab toggle at the top of the topic panel. Concept Report reuses the history-chips + Generate button UI. Report body renders each FacetSection as <h3>{title}</h3> + markdown.
- Loading hint: update the "up to 30 seconds" copy to "up to 60 seconds".
8. README update
Add an Indexing Pipeline diagram showing: PDF → parse → chunk → embed → enrich (new) → chunk_metadata. Plus a Concept Retrieval sequence diagram: query → entity-match SQL → facet-grouped bundle → synthesis → report.
Decisions & trade-offs
- Storage as separate Postgres table, not vectorstore JSON: the vectorstore has no metadata-only update API, so backfill would require delete+reinsert (re-embedding cost). A dedicated table joins cleanly on chunk_id and is GIN-indexed.
- Entity-match primary, vector fallback: deterministic for the main use case, robust against typos/synonyms. Vector search stays the default for normal chat retrieval; this feature is additive.
- Enrichment runs after embedding, not before: keeps the two failure modes independent. The backfill endpoint is the universal recovery lever.
- Fixed 13-value facet enum (incl. OTHER): constrains LLM outputs; OTHER prevents forced mis-bucketing.
- Direct JdbcTemplate read against vector_store for backfill: Spring AI exposes no listing API. Acceptable for a POC, isolated behind one method.
- Synchronous (sequential) LLM calls: simplest; parallelism is a later optimisation if needed.
- model_version column: cheap insurance. If the prompt or facet taxonomy changes, backfill can re-enrich only stale rows.
Verification
- Migration applies V7 and V8. Tables and indexes created.
- New book ingestion: upload PDF → chunk_metadata populated with plausible entities/facets/summaries.
- Backfill: POST /api/v1/admin/books/{id}/enrich → idempotent, completes, re-run is a no-op.
- Concept retrieval primary path: POST /api/v1/topics/aneurysm/concept-reports → 200 with facets populated.
- Fallback path: misspelled topic still returns results via vector fallback.
- Frontend: Concept Report tab renders facet-labelled markdown + sources + inline figures; persists across reloads.
- Deletion: removing a book also removes its chunk_metadata rows.
- Regression: existing chat and summary flows still work.
- Lint & tests pass.