# Concept Retrieval via Indexing-Time Chunk Enrichment

## Context

Vector similarity alone can't answer "tell me everything about aneurysms." It surfaces the chunks most *linguistically* similar to the query, not the set of all chunks that *concern* the concept, and it has no notion of whether each chunk is a definition, a case, a technique, or a complication.

The key move is to shift intelligence from query time to indexing time: for every text chunk, use an LLM to extract **structured metadata** (entities, facet, summary). At retrieval time, concept lookup becomes an SQL filter (`entities @> '["aneurysm"]'`) bucketed by facet: deterministic, exhaustive, and organized by default. Vector search remains as a fallback for typos / synonyms and for ranking within a facet.

This plan covers: (1) defining the metadata schema, (2) enriching chunks during new book ingestion, (3) back-filling the already-embedded corpus via an admin endpoint, (4) a new concept retrieval path, and (5) a Topics-page UI to surface the result.

## Approach

### 1. Data model: new `chunk_metadata` table

Flyway migration `backend/src/main/resources/db/migration/V7__chunk_metadata.sql`:

```sql
CREATE TABLE chunk_metadata (
    chunk_id      VARCHAR(64)  PRIMARY KEY,  -- same UUID that TextChunkingService issues and stores in vectorstore
    book_id       UUID         NOT NULL,
    section_id    VARCHAR(255) NOT NULL,
    facet         VARCHAR(32)  NOT NULL,     -- enum (see ConceptFacet)
    entities      JSONB        NOT NULL,     -- canonical lowercase string[]
    summary       TEXT         NOT NULL,
    model_version VARCHAR(32)  NOT NULL,     -- records which LLM/prompt version tagged this chunk
    enriched_at   TIMESTAMPTZ  NOT NULL
);

CREATE INDEX idx_chunk_metadata_book ON chunk_metadata(book_id);
CREATE INDEX idx_chunk_metadata_book_facet ON chunk_metadata(book_id, facet);
CREATE INDEX idx_chunk_metadata_entities_gin ON chunk_metadata USING GIN (entities jsonb_path_ops);
```

Why `chunk_id` is the natural key: `TextChunkingService` already generates a UUID per chunk, uses it as the pgvector Document id, and stores it in metadata. It is also the key in `ChunkFigureRefEntity`, so the table joins cleanly to everything already in place.

### 2. Enrichment service & facet taxonomy

New package `com.aiteacher.enrichment`:

- `ConceptFacet` enum: 13 values tailored to neurosurgery textbooks: `DEFINITION, ANATOMY, PATHOPHYSIOLOGY, EPIDEMIOLOGY, CLINICAL_PRESENTATION, IMAGING, CLASSIFICATION, INDICATIONS, SURGICAL_TECHNIQUE, NONSURGICAL_MANAGEMENT, COMPLICATIONS, OUTCOMES_FOLLOWUP, OTHER`. `OTHER` is mandatory so the LLM always has an out (no hallucinated bucketing). The prompt carries explicit disambiguation rules (named grading scales → `CLASSIFICATION`; imaging of a complication → `COMPLICATIONS`; tools inside an operation → `SURGICAL_TECHNIQUE`).
- `ChunkEnrichmentResult`: record `(List<String> entities, ConceptFacet facet, String summary)`.
- `ChunkEnrichmentService`: single method `enrich(String chunkText, SectionEntity section, String bookTitle) → ChunkEnrichmentResult`.
  Uses Spring AI `ChatClient.prompt().call().entity(Class)` for structured output. The prompt gives the book title, section title, chunk text, and the fixed facet enum list, and instructs the model to return JSON with entities normalised to lowercase singular canonical form (e.g. "aneurysms" → "aneurysm"; "SAH" → "subarachnoid hemorrhage"). Caps entities at ~8 per chunk.
- `ChunkMetadataEntity` + `ChunkMetadataRepository`: JPA entity/repo mirroring the table.

The model version string (e.g. `"v1"`) lives on the service and is stamped into each row, so a future prompt revision can be rolled out by filtering `model_version <> 'v2'` in the backfill job.

### 3. Hook into new book ingestion

Modify `BookEmbeddingService.embedBook`:

```java
// Step 3: Chunk and embed text
List<Document> allChunks = new ArrayList<>();
for (SectionEntity section : sections) {
    allChunks.addAll(textChunkingService.chunk(section, bookTitle));
}

if (skipEmbedding) {
    ...
} else {
    embedInBatches(allChunks, bookId);
    // NEW: enrich only after the vectors are safely stored
    chunkEnrichmentPipeline.enrichAndPersist(allChunks, sectionsById, bookTitle);
}
```

- `ChunkEnrichmentPipeline`: new orchestrator that iterates chunks, calls `ChunkEnrichmentService.enrich(...)` per chunk, and saves `ChunkMetadataEntity` rows in batches, with the same throttle pattern as `embedInBatches`.
- Runs *after* embedding, not in place of it, so a failure in enrichment doesn't corrupt the vector store. On failure, log and continue; the backfill endpoint is the universal recovery path.
- Extend `deleteBookChunks` to also delete `chunk_metadata` rows so deletion stays consistent.

### 4. Backfill endpoint for already-embedded books

New `EnrichmentController` in `com.aiteacher.enrichment`:

- `POST /api/v1/admin/books/{id}/enrich` → kicks off async backfill, returns 202 with `{status, chunksTotal, chunksEnriched}`
- `GET /api/v1/admin/books/{id}/enrich` → returns progress

Backfill flow (`EnrichmentBackfillService.backfillBook(UUID bookId)`):

1. Query the pgvector storage table directly via `JdbcTemplate` for all chunks of the book:

   ```sql
   SELECT id, content, metadata
   FROM vector_store
   WHERE metadata->>'book_id' = ?
     AND metadata->>'type' = 'TEXT'
   ```

2. Left anti-join against `chunk_metadata` to skip already-enriched chunks → idempotent, resumable.
3. For each missing chunk: look up its `SectionEntity` via `section_id` in metadata, call `ChunkEnrichmentService.enrich`, and write a `ChunkMetadataEntity` row.
4. Progress is tracked in an in-memory `ConcurrentHashMap` (POC scope: no cross-restart resumability is needed, because the anti-join makes a re-run skip every chunk that is already enriched).
5. `@Async` on the backfill method, using the same executor as `embedBook`.

### 5. Concept retrieval path

New `com.aiteacher.concept.ConceptRetriever`:

```java
public ConceptRetrievalResult retrieveByConcept(String conceptKeyword, UUID bookId) {
    String canonical = canonicalise(conceptKeyword); // lowercase, trim, simple plural strip

    // 5a. Primary: SQL entity match, grouped by facet
    List<ChunkMetadataEntity> hits = chunkMetadataRepository
            .findByBookIdAndEntityContains(bookId, canonical); // WHERE entities @> to_jsonb(?::text)

    if (hits.isEmpty()) {
        // 5b. Fallback: vector search, then enrich-join + facet-group
        List<Document> vectorHits = vectorStore.similaritySearch(/* TEXT filter, book_id filter, topK=30 */);
        List<String> chunkIds = vectorHits.stream().map(Document::getId).toList();
        hits = chunkMetadataRepository.findByChunkIdIn(chunkIds);
    }

    // LinkedHashMap keeps facets in first-seen order
    Map<ConceptFacet, List<ChunkMetadataEntity>> byFacet = hits.stream()
            .collect(groupingBy(ChunkMetadataEntity::getFacet, LinkedHashMap::new, toList()));

    // Hydrate: load SectionEntity for each chunk's section_id; load linked figures
    // via ChunkFigureRefRepository.findByChunkIdIn(chunkIds), which reuses existing linkage.
    return assemble(byFacet, ...);
}
```

`ConceptRetrievalResult` = `Map<ConceptFacet, FacetBundle>`, where each `FacetBundle` holds the parent sections, linked figures, and the per-chunk `summary` strings.

Cross-book aggregation: the caller loops over READY books and merges bundles by facet.

### 6. Concept Report service & controller

New `ConceptReportService` in `com.aiteacher.concept`. It mirrors the shape of `TopicSummaryService`, but:

- Calls `ConceptRetriever.retrieveByConcept(topic.getName(), bookId)` per book.
- For each facet that has hits, sends **one** LLM synthesis call with the chunks/figures of that facet, producing a structured, facet-labelled report.
- Persists in a new `concept_report` table:

  ```sql
  CREATE TABLE concept_report (
      id            UUID         PRIMARY KEY,
      topic_id      VARCHAR(255) NOT NULL REFERENCES topic(id),
      report_number INT          NOT NULL,
      facets_json   JSONB        NOT NULL,  -- [{facetKey,title,markdown,refLabels[]}, ...]
      sources_json  JSONB        NOT NULL,  -- deduplicated SourceReference[]
      generated_at  TIMESTAMPTZ  NOT NULL,
      UNIQUE (topic_id, report_number)
  );
  ```

Controller `ConceptReportController` exposes three endpoints under `/api/v1/topics/{id}/concept-reports` (POST generate, GET list, GET `/{reportId}`). Reuses `TopicSummaryResponse.SourceReference` verbatim.

### 7. Frontend

- `frontend/src/stores/topicStore.ts`: add parallel state `conceptReportList`, `activeConceptReport`, `conceptReportLoading`, and actions mirroring the existing summary ones.
- `frontend/src/views/TopicsView.vue`: add a **Summary / Concept Report** tab toggle at the top of the topic panel. Concept Report reuses the history-chips + Generate button UI. The report body renders each `FacetSection` as a heading (`{title}`) followed by its markdown.
- Loading hint: update the "up to 30 seconds" copy to "up to 60 seconds".

### 8. README update

Add an **Indexing Pipeline** diagram showing: PDF → parse → chunk → embed → **enrich (new)** → chunk_metadata. Plus a **Concept Retrieval** sequence diagram: query → entity-match SQL → facet-grouped bundle → synthesis → report.

## Decisions & trade-offs

- **Storage as separate Postgres table, not vectorstore JSON**: the vectorstore has no metadata-only update API, so backfill would require delete+reinsert (re-embedding cost). A dedicated table joins cleanly on `chunk_id` and is GIN-indexed.
- **Entity-match primary, vector fallback**: deterministic for the main use case, robust against typos/synonyms. Vector search stays the default for normal chat retrieval; this feature is additive.
- **Enrichment runs *after* embedding, not before**: keeps the two failure modes independent. The backfill endpoint is the universal recovery lever.
- **Fixed 13-value facet enum** (incl. `OTHER`): constrains LLM outputs; `OTHER` prevents forced mis-bucketing.
- **Direct `JdbcTemplate` read against `vector_store` for backfill**: Spring AI exposes no listing API. Acceptable for a POC, isolated behind one method.
- **Synchronous (sequential) LLM calls**: simplest; parallelism is a later optimisation if needed.
- **`model_version` column**: cheap insurance. If the prompt or facet taxonomy changes, backfill can re-enrich only stale rows.

## Verification

1. Migration: Flyway applies V7 and V8; tables and indexes are created.
2. New book ingestion: upload PDF → `chunk_metadata` populated with plausible entities/facets/summaries.
3. Backfill: POST `/api/v1/admin/books/{id}/enrich` → idempotent, completes, re-run is a no-op.
4. Concept retrieval primary path: POST `/api/v1/topics/aneurysm/concept-reports` → 200 with facets populated.
5. Fallback path: a misspelled topic still returns results via vector fallback.
6. Frontend: Concept Report tab renders facet-labelled markdown + sources + inline figures; persists across reloads.
7. Deletion: removing a book also removes its `chunk_metadata` rows (via the extended `deleteBookChunks`).
8. Regression: existing chat and summary flows still work.
9. Lint & tests pass.
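
As a companion to section 5, the `canonicalise` step and the facet grouping can be sketched in plain Java, without Spring or Postgres. This is a minimal illustration under stated assumptions, not the project's implementation: `ConceptKeySketch`, `ChunkMeta`, and `matchAndGroup` are hypothetical names, the plural rule is deliberately naive, and the in-memory `contains` check only stands in for the `entities @> ...` SQL filter.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

class ConceptKeySketch {

    // Hypothetical stand-in for ChunkMetadataEntity, reduced to the fields used here.
    record ChunkMeta(String chunkId, String facet, List<String> entities) {}

    // Lowercase, trim, and strip a simple trailing plural "s".
    // Assumed rule: short words and -ss / -is / -us endings are left untouched.
    static String canonicalise(String keyword) {
        String k = keyword.trim().toLowerCase(Locale.ROOT);
        boolean looksPlural = k.length() > 3
                && k.endsWith("s")
                && !k.endsWith("ss")   // e.g. "abscess"
                && !k.endsWith("is")   // e.g. "stenosis"
                && !k.endsWith("us");  // e.g. "hydrocephalus"
        return looksPlural ? k.substring(0, k.length() - 1) : k;
    }

    // In-memory analogue of the primary path: exact entity match, then group by facet.
    // LinkedHashMap keeps facets in the order they are first encountered.
    static Map<String, List<ChunkMeta>> matchAndGroup(List<ChunkMeta> rows, String keyword) {
        String canonical = canonicalise(keyword);
        return rows.stream()
                .filter(row -> row.entities().contains(canonical))
                .collect(Collectors.groupingBy(
                        ChunkMeta::facet, LinkedHashMap::new, Collectors.toList()));
    }

    public static void main(String[] args) {
        List<ChunkMeta> rows = List.of(
                new ChunkMeta("c1", "DEFINITION", List.of("aneurysm")),
                new ChunkMeta("c2", "COMPLICATIONS", List.of("aneurysm", "vasospasm")),
                new ChunkMeta("c3", "ANATOMY", List.of("circle of willis")),
                new ChunkMeta("c4", "DEFINITION", List.of("aneurysm", "subarachnoid hemorrhage")));

        matchAndGroup(rows, "  Aneurysms ").forEach((facet, hits) ->
                System.out.println(facet + " -> " + hits.stream().map(ChunkMeta::chunkId).toList()));
    }
}
```

Because matching is exact on the canonical string, a query normalises to a hit only if the enrichment prompt produced the same canonical form at indexing time. That is why the plan keeps vector search as the fallback for typos and synonyms rather than making the plural rule smarter.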