add new concept report

2026-04-18 17:54:54 +02:00
parent 5f03e1f41b
commit c7a77af2f4
29 changed files with 1892 additions and 41 deletions
@@ -0,0 +1,172 @@
+# Concept Retrieval via Indexing-Time Chunk Enrichment
+
+## Context
+
+Vector similarity alone can't answer "tell me everything about aneurysms." It surfaces the chunks most *linguistically* similar to the query, not the set of all chunks that *concern* the concept — and it has no notion of whether each chunk is a definition, a case, a technique, or a complication.
+
+The key move is to shift intelligence from query time to indexing time: for every text chunk, use an LLM to extract **structured metadata** (entities, facet, summary). At retrieval time, concept lookup becomes an SQL filter (`entities @> '["aneurysm"]'`) bucketed by facet: deterministic, organized by default, and exhaustive over every chunk that enrichment tagged with the entity. Vector search remains as a fallback for typos and synonyms, and for ranking within a facet.
+
+This plan covers: (1) defining the metadata schema, (2) enriching chunks during new book ingestion, (3) back-filling the already-embedded corpus via an admin endpoint, (4) a new concept retrieval path, and (5) a Topics-page UI to surface the result.
+
+## Approach
+
+### 1. Data model — new `chunk_metadata` table
+
+Flyway migration `backend/src/main/resources/db/migration/V7__chunk_metadata.sql`:
+
+```sql
+CREATE TABLE chunk_metadata (
+    chunk_id        VARCHAR(64) PRIMARY KEY,       -- same UUID that TextChunkingService issues and stores in vectorstore
+    book_id         UUID NOT NULL,
+    section_id      VARCHAR(255) NOT NULL,
+    facet           VARCHAR(32) NOT NULL,           -- enum (see ConceptFacet)
+    entities        JSONB NOT NULL,                 -- canonical lowercase string[]
+    summary         TEXT NOT NULL,
+    model_version   VARCHAR(32) NOT NULL,           -- records which LLM/prompt version tagged this chunk
+    enriched_at     TIMESTAMPTZ NOT NULL
+);
+CREATE INDEX idx_chunk_metadata_book         ON chunk_metadata(book_id);
+CREATE INDEX idx_chunk_metadata_book_facet   ON chunk_metadata(book_id, facet);
+CREATE INDEX idx_chunk_metadata_entities_gin ON chunk_metadata USING GIN (entities jsonb_path_ops);
+```
+
+`chunk_id` is the natural key because `TextChunkingService` already generates a UUID per chunk, uses it as the pgvector `Document` id, and stores it in the chunk metadata; it is also the key in `ChunkFigureRefEntity`. The new table therefore joins cleanly to everything already in place.
+
+### 2. Enrichment service & facet taxonomy
+
+New package `com.aiteacher.enrichment`:
+
+- `ConceptFacet` enum — 13 values tailored to neurosurgery textbooks: `DEFINITION, ANATOMY, PATHOPHYSIOLOGY, EPIDEMIOLOGY, CLINICAL_PRESENTATION, IMAGING, CLASSIFICATION, INDICATIONS, SURGICAL_TECHNIQUE, NONSURGICAL_MANAGEMENT, COMPLICATIONS, OUTCOMES_FOLLOWUP, OTHER`. `OTHER` is mandatory so the LLM always has an out (no hallucinated bucketing). The prompt carries explicit disambiguation rules (named grading scales → `CLASSIFICATION`; imaging of a complication → `COMPLICATIONS`; tools inside an operation → `SURGICAL_TECHNIQUE`).
+- `ChunkEnrichmentResult` — record `(List<String> entities, ConceptFacet facet, String summary)`
+- `ChunkEnrichmentService` — single method `enrich(String chunkText, SectionEntity section, String bookTitle) → ChunkEnrichmentResult`. Uses Spring AI `ChatClient.prompt().call().entity(Class)` for structured output. The prompt gives: book title, section title, chunk text, the fixed facet enum list, and instructs the model to return JSON with entities normalised to lowercase singular canonical form (e.g. "aneurysms" → "aneurysm"; "SAH" → "subarachnoid hemorrhage"). Caps entities at ~8 per chunk.
+- `ChunkMetadataEntity` + `ChunkMetadataRepository` — JPA entity/repo mirroring the table.
+
+The model version string (e.g. `"v1"`) lives on the service and is stamped into each row, so a future prompt revision (say `v2`) can be rolled out by having the backfill job re-enrich only the rows where `model_version <> 'v2'`.
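+
+Because the lowercase form and the ~8-entity cap are only requested in the prompt, a small defensive pass over the model's output may be worth adding before persisting. The following is a minimal sketch under that assumption; `EntitySanitizer`, `sanitise` and `MAX_ENTITIES` are hypothetical names, not part of the plan above, and the pass does not attempt singularisation or abbreviation expansion.
+
+```java
+import java.util.List;
+import java.util.Locale;
+import java.util.Objects;
+
+// Hypothetical helper: tidies the entity list returned by the LLM before it is stored.
+final class EntitySanitizer {
+    static final int MAX_ENTITIES = 8; // assumed hard cap, mirroring the "~8 per chunk" guidance
+
+    private EntitySanitizer() {}
+
+    public static List<String> sanitise(List<String> raw) {
+        return raw.stream()
+            .filter(Objects::nonNull)
+            .map(s -> s.trim().toLowerCase(Locale.ROOT)) // canonical lowercase
+            .filter(s -> !s.isEmpty())                   // drop blank entries
+            .distinct()                                  // keep first occurrence only
+            .limit(MAX_ENTITIES)                         // enforce the cap
+            .toList();
+    }
+}
+```
+
+With this sketch, `sanitise(List.of(" Aneurysm", "aneurysm", "", "Clipping"))` yields `[aneurysm, clipping]`.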
+
+### 3. Hook into new book ingestion
+
+Modify `BookEmbeddingService.embedBook`:
+
+```java
+// Step 3: Chunk and embed text
+List<Document> allChunks = new ArrayList<>();
+for (SectionEntity section : sections) {
+    allChunks.addAll(textChunkingService.chunk(section, bookTitle));
+}
+if (skipEmbedding) { ... } else {
+    embedInBatches(allChunks, bookId);
+    chunkEnrichmentPipeline.enrichAndPersist(allChunks, sectionsById, bookTitle);  // NEW
+}
+```
+
+- `ChunkEnrichmentPipeline` — new orchestrator that iterates chunks, calls `ChunkEnrichmentService.enrich(...)` per chunk, saves `ChunkMetadataEntity` rows in batches, with the same throttle pattern as `embedInBatches`.
+- Runs *after* embedding, not in place of it, so a failure in enrichment doesn't corrupt the vector store. On failure, log and continue — the backfill endpoint is the universal recovery path.
+- Extend `deleteBookChunks` to also delete `chunk_metadata` rows so deletion stays consistent.
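+
+The "log and continue" behaviour with batched saves could look roughly like the sketch below. It is deliberately framework-free: the enrichment call and the repository save are passed in as plain functions, and `EnrichmentLoopSketch` and `enrichInBatches` are hypothetical names. Throttling between batches is omitted.
+
+```java
+import java.util.ArrayList;
+import java.util.List;
+import java.util.function.Consumer;
+import java.util.function.Function;
+
+// Hypothetical sketch of the per-chunk loop: enrich, collect, flush in batches,
+// and never let a single failed chunk abort the run.
+final class EnrichmentLoopSketch {
+    private EnrichmentLoopSketch() {}
+
+    /** Returns the number of chunks whose enrichment failed and was skipped. */
+    public static <C, R> int enrichInBatches(List<C> chunks,
+                                             Function<C, R> enrich,
+                                             Consumer<List<R>> saveBatch,
+                                             int batchSize) {
+        if (batchSize <= 0) {
+            throw new IllegalArgumentException("batchSize must be positive");
+        }
+        int failed = 0;
+        List<R> pending = new ArrayList<>(batchSize);
+        for (C chunk : chunks) {
+            try {
+                pending.add(enrich.apply(chunk));
+            } catch (RuntimeException e) {
+                // Log and continue; a later backfill run can pick this chunk up again.
+                failed++;
+                System.err.println("Enrichment failed, skipping chunk: " + e.getMessage());
+            }
+            if (pending.size() == batchSize) {
+                saveBatch.accept(List.copyOf(pending));
+                pending.clear();
+            }
+        }
+        if (!pending.isEmpty()) {
+            saveBatch.accept(List.copyOf(pending)); // flush the final partial batch
+        }
+        return failed;
+    }
+}
+```
+
+Returning the failure count keeps the caller free to decide whether to surface a warning; skipped chunks simply have no `chunk_metadata` row, which is exactly what the backfill in the next section looks for.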
+
+### 4. Backfill endpoint for already-embedded books
+
+New `EnrichmentController` in `com.aiteacher.enrichment`:
+
+- `POST /api/v1/admin/books/{id}/enrich` → kicks off async backfill, returns 202 with `{status, chunksTotal, chunksEnriched}`
+- `GET  /api/v1/admin/books/{id}/enrich` → returns progress
+
+Backfill flow (`EnrichmentBackfillService.backfillBook(UUID bookId)`):
+
+1. Query the pgvector storage table directly via `JdbcTemplate` for all chunks of the book:
+   ```sql
+   SELECT id, content, metadata
+   FROM vector_store
+   WHERE metadata->>'book_id' = ? AND metadata->>'type' = 'TEXT'
+   ```
+2. Left-anti-join against `chunk_metadata` to skip already-enriched chunks → idempotent, resumable.
+3. For each missing chunk: look up its `SectionEntity` via `section_id` in metadata, call `ChunkEnrichmentService.enrich`, write a `ChunkMetadataEntity` row.
+4. Progress tracked in an in-memory `ConcurrentHashMap<UUID, BackfillProgress>` (POC scope: progress is lost on restart, but the left-anti-join means a re-run only processes chunks that are still missing).
+5. `@Async` on the backfill method using the same executor as `embedBook`.
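+
+The in-memory progress map from step 4 can be sketched as follows. The field names mirror the `{status, chunksTotal, chunksEnriched}` response shape; the `BackfillProgressTracker` class and the `RUNNING` / `DONE` status strings are assumptions made for illustration.
+
+```java
+import java.util.Map;
+import java.util.Optional;
+import java.util.UUID;
+import java.util.concurrent.ConcurrentHashMap;
+
+// Hypothetical tracker backing the GET progress endpoint.
+final class BackfillProgressTracker {
+
+    public record BackfillProgress(String status, int chunksTotal, int chunksEnriched) {}
+
+    private final Map<UUID, BackfillProgress> progressByBook = new ConcurrentHashMap<>();
+
+    /** Registers a new run; a book with nothing left to enrich is DONE immediately. */
+    public void start(UUID bookId, int chunksTotal) {
+        String status = chunksTotal == 0 ? "DONE" : "RUNNING";
+        progressByBook.put(bookId, new BackfillProgress(status, chunksTotal, 0));
+    }
+
+    /** Atomically bumps the counter after one chunk has been persisted. */
+    public void chunkEnriched(UUID bookId) {
+        progressByBook.computeIfPresent(bookId, (id, p) -> {
+            int done = p.chunksEnriched() + 1;
+            String status = done >= p.chunksTotal() ? "DONE" : "RUNNING";
+            return new BackfillProgress(status, p.chunksTotal(), done);
+        });
+    }
+
+    /** Empty if no backfill was started for this book since the last restart. */
+    public Optional<BackfillProgress> get(UUID bookId) {
+        return Optional.ofNullable(progressByBook.get(bookId));
+    }
+}
+```
+
+Storing an immutable record and replacing it via `computeIfPresent` keeps each update atomic without explicit locking, which matters once the `@Async` worker and the GET handler touch the map from different threads.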
+
+### 5. Concept retrieval path
+
+New `com.aiteacher.concept.ConceptRetriever`:
+
+```java
+public ConceptRetrievalResult retrieveByConcept(String conceptKeyword, UUID bookId) {
+    String canonical = canonicalise(conceptKeyword);   // lowercase, trim, simple plural strip
+
+    // 5a. Primary: SQL entity match, grouped by facet
+    List<ChunkMetadataEntity> hits = chunkMetadataRepository
+        .findByBookIdAndEntityContains(bookId, canonical);   // WHERE entities @> to_jsonb(?::text)
+
+    if (hits.isEmpty()) {
+        // 5b. Fallback: vector search, then enrich-join + facet-group
+        List<Document> vectorHits = vectorStore.similaritySearch(/* TEXT filter, book_id filter, topK=30 */);
+        List<String> chunkIds = vectorHits.stream().map(Document::getId).toList();
+        hits = chunkMetadataRepository.findByChunkIdIn(chunkIds);
+    }
+
+    Map<ConceptFacet, List<ChunkMetadataEntity>> byFacet = hits.stream()
+        .collect(groupingBy(ChunkMetadataEntity::getFacet, LinkedHashMap::new, toList()));
+
+    // Hydrate: load SectionEntity for each chunk's section_id; load linked figures
+    // via ChunkFigureRefRepository.findByChunkIdIn(chunkIds) — reuses existing linkage.
+    return assemble(byFacet, ...);
+}
+```
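+
+The `canonicalise` helper is only described by its comment (lowercase, trim, simple plural strip). One possible reading of that comment is sketched below; the class name `ConceptKeywords` and the exact suffix rules are assumptions, and the heuristic is intentionally naive: it does not expand abbreviations and will mishandle irregular plurals.
+
+```java
+import java.util.Locale;
+
+// Hypothetical sketch of the query-side normaliser used before the entity match.
+final class ConceptKeywords {
+    private ConceptKeywords() {}
+
+    public static String canonicalise(String raw) {
+        String s = raw.trim().toLowerCase(Locale.ROOT);
+        if (s.length() > 4 && s.endsWith("ies")) {
+            return s.substring(0, s.length() - 3) + "y";   // "arteries" -> "artery"
+        }
+        if (s.endsWith("ss") || s.endsWith("us") || s.endsWith("is")) {
+            return s;                                       // "abscess", "hydrocephalus", "stenosis"
+        }
+        if (s.length() > 3 && s.endsWith("s")) {
+            return s.substring(0, s.length() - 1);          // "aneurysms" -> "aneurysm"
+        }
+        return s;
+    }
+}
+```
+
+Whatever rules are chosen here should agree with the canonical form the enrichment prompt asks for; any mismatch sends the query down the vector fallback path rather than returning wrong results.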
+
+`ConceptRetrievalResult` = `Map<ConceptFacet, FacetBundle>` where each `FacetBundle` holds the parent sections, linked figures, and the per-chunk `summary` strings.
+
+Cross-book aggregation: caller loops over READY books and merges bundles by facet.
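+
+The cross-book merge can be sketched as a fold over per-book maps. To stay self-contained, the example uses a cut-down stand-in enum (`Facet`) and a generic item type in place of `ConceptFacet` and `FacetBundle`; `FacetMergeSketch` and `mergeByFacet` are hypothetical names.
+
+```java
+import java.util.ArrayList;
+import java.util.EnumMap;
+import java.util.List;
+import java.util.Map;
+
+// Hypothetical sketch: merge per-book, facet-grouped results into one map.
+final class FacetMergeSketch {
+    // Stand-in for ConceptFacet, shortened for the example.
+    public enum Facet { DEFINITION, IMAGING, COMPLICATIONS, OTHER }
+
+    private FacetMergeSketch() {}
+
+    public static <T> Map<Facet, List<T>> mergeByFacet(List<Map<Facet, List<T>>> perBook) {
+        // EnumMap iterates in enum declaration order, so facets come out in a stable order.
+        Map<Facet, List<T>> merged = new EnumMap<>(Facet.class);
+        for (Map<Facet, List<T>> book : perBook) {
+            book.forEach((facet, items) ->
+                merged.computeIfAbsent(facet, f -> new ArrayList<>()).addAll(items));
+        }
+        return merged;
+    }
+}
+```
+
+Items from earlier books appear first within each facet, so the order in which the caller iterates over READY books determines the order of material in the merged bundle.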
+
+### 6. Concept Report service & controller
+
+New `ConceptReportService` in `com.aiteacher.concept` — mirrors the shape of `TopicSummaryService`, but:
+
+- Calls `ConceptRetriever.retrieveByConcept(topic.getName(), bookId)` per book.
+- For each facet that has hits, sends **one** LLM synthesis call with the chunks/figures of that facet — producing a structured, facet-labelled report.
+- Persists in a new `concept_report` table:
+
+```sql
+CREATE TABLE concept_report (
+    id            UUID PRIMARY KEY,
+    topic_id      VARCHAR(255) NOT NULL REFERENCES topic(id),
+    report_number INT NOT NULL,
+    facets_json   JSONB NOT NULL,        -- [{facetKey,title,markdown,refLabels[]}, ...]
+    sources_json  JSONB NOT NULL,        -- deduplicated SourceReference[]
+    generated_at  TIMESTAMPTZ NOT NULL,
+    UNIQUE (topic_id, report_number)
+);
+```
+
+Controller `ConceptReportController` exposes three endpoints under `/api/v1/topics/{id}/concept-reports` (POST generate, GET list, GET `/{reportId}`).
+
+Reuses `TopicSummaryResponse.SourceReference` verbatim.
+
+### 7. Frontend
+
+- `frontend/src/stores/topicStore.ts`: add parallel state `conceptReportList`, `activeConceptReport`, `conceptReportLoading`, and actions mirroring the existing summary ones.
+- `frontend/src/views/TopicsView.vue`: add a **Summary / Concept Report** tab toggle at the top of the topic panel. Concept Report reuses the history-chips + Generate button UI. Report body renders each `FacetSection` as `<h3>{title}</h3>` + markdown.
+- Loading hint: update the "up to 30 seconds" copy to "up to 60 seconds".
+
+### 8. README update
+
+Add an **Indexing Pipeline** diagram showing: PDF → parse → chunk → embed → **enrich (new)** → chunk_metadata. Plus a **Concept Retrieval** sequence diagram: query → entity-match SQL → facet-grouped bundle → synthesis → report.
+
+## Decisions & trade-offs
+
+- **Storage as separate Postgres table, not vectorstore JSON**: vectorstore has no metadata-only update API, so backfill would require delete+reinsert (re-embedding cost). A dedicated table joins cleanly on `chunk_id` and is GIN-indexed.
+- **Entity-match primary, vector fallback**: deterministic for the main use case, robust against typos/synonyms. Vector search stays the default for normal chat retrieval — this feature is additive.
+- **Enrichment runs *after* embedding, not before**: keeps the two failure modes independent. The backfill endpoint is the universal recovery lever.
+- **Fixed 13-value facet enum** (incl. `OTHER`): constrains LLM outputs; `OTHER` prevents forced mis-bucketing.
+- **Direct `JdbcTemplate` read against `vector_store` for backfill**: Spring AI exposes no listing API. Acceptable for a POC, isolated behind one method.
+- **Synchronous (sequential) LLM calls**: simplest; parallelism is a later optimisation if needed.
+- **`model_version` column**: cheap insurance. If the prompt or facet taxonomy changes, backfill can re-enrich only stale rows.
+
+## Verification
+
+1. Migrations V7 and V8 apply cleanly; tables and indexes are created.
+2. New book ingestion: upload PDF → `chunk_metadata` populated with plausible entities/facets/summaries.
+3. Backfill: POST `/api/v1/admin/books/{id}/enrich` → idempotent, completes, re-run is a no-op.
+4. Concept retrieval primary path: POST `/api/v1/topics/aneurysm/concept-reports` → 200 with facets populated.
+5. Fallback path: misspelled topic still returns results via vector fallback.
+6. Frontend: Concept Report tab renders facet-labelled markdown + sources + inline figures; persists across reloads.
+7. Deletion: removing a book cascades to `chunk_metadata` rows.
+8. Regression: existing chat and summary flows still work.
+9. Lint & tests pass.