add new concept report

Adrien
2026-04-18 17:54:54 +02:00
parent 5f03e1f41b
commit c7a77af2f4
29 changed files with 1892 additions and 41 deletions
@@ -0,0 +1,172 @@
# Concept Retrieval via Indexing-Time Chunk Enrichment
## Context
Vector similarity alone can't answer "tell me everything about aneurysms." It surfaces the chunks most *linguistically* similar to the query, not the set of all chunks that *concern* the concept — and it has no notion of whether each chunk is a definition, a case, a technique, or a complication.
The unlock is to move intelligence from query time to indexing time: for every text chunk, use an LLM to extract **structured metadata** (entities, facet, summary). At retrieval time, concept lookup becomes an SQL containment filter (`entities @> '["aneurysm"]'::jsonb`) bucketed by facet: deterministic, exhaustive, and organized by default. Vector search remains as a fallback for typos and synonyms, and for ranking within a facet.
This plan covers: (1) defining the metadata schema, (2) enriching chunks during new book ingestion, (3) back-filling the already-embedded corpus via an admin endpoint, (4) a new concept retrieval path, and (5) a Topics-page UI to surface the result.
## Approach
### 1. Data model — new `chunk_metadata` table
Flyway migration `backend/src/main/resources/db/migration/V7__chunk_metadata.sql`:
```sql
CREATE TABLE chunk_metadata (
    chunk_id      VARCHAR(64)  PRIMARY KEY,  -- same UUID that TextChunkingService issues and stores in vectorstore
    book_id       UUID         NOT NULL,
    section_id    VARCHAR(255) NOT NULL,
    facet         VARCHAR(32)  NOT NULL,     -- enum (see ConceptFacet)
    entities      JSONB        NOT NULL,     -- canonical lowercase string[]
    summary       TEXT         NOT NULL,
    model_version VARCHAR(32)  NOT NULL,     -- records which LLM/prompt version tagged this chunk
    enriched_at   TIMESTAMPTZ  NOT NULL
);
CREATE INDEX idx_chunk_metadata_book ON chunk_metadata(book_id);
CREATE INDEX idx_chunk_metadata_book_facet ON chunk_metadata(book_id, facet);
CREATE INDEX idx_chunk_metadata_entities_gin ON chunk_metadata USING GIN (entities jsonb_path_ops);
```
Why `chunk_id` is the natural key: `TextChunkingService` already generates a UUID per chunk, uses it as the pgvector Document id, and stores it in the chunk metadata. The same id is the key in `ChunkFigureRefEntity`, so the table joins cleanly to everything already in place.
### 2. Enrichment service & facet taxonomy
New package `com.aiteacher.enrichment`:
- `ConceptFacet` enum — 13 values tailored to neurosurgery textbooks: `DEFINITION, ANATOMY, PATHOPHYSIOLOGY, EPIDEMIOLOGY, CLINICAL_PRESENTATION, IMAGING, CLASSIFICATION, INDICATIONS, SURGICAL_TECHNIQUE, NONSURGICAL_MANAGEMENT, COMPLICATIONS, OUTCOMES_FOLLOWUP, OTHER`. `OTHER` is mandatory so the LLM always has an out (no hallucinated bucketing). The prompt carries explicit disambiguation rules (named grading scales → `CLASSIFICATION`; imaging of a complication → `COMPLICATIONS`; tools inside an operation → `SURGICAL_TECHNIQUE`).
- `ChunkEnrichmentResult` — record `(List<String> entities, ConceptFacet facet, String summary)`
- `ChunkEnrichmentService` — single method `enrich(String chunkText, SectionEntity section, String bookTitle) → ChunkEnrichmentResult`. Uses Spring AI `ChatClient.prompt().call().entity(Class)` for structured output. The prompt gives: book title, section title, chunk text, the fixed facet enum list, and instructs the model to return JSON with entities normalised to lowercase singular canonical form (e.g. "aneurysms" → "aneurysm"; "SAH" → "subarachnoid hemorrhage"). Caps entities at ~8 per chunk.
- `ChunkMetadataEntity` + `ChunkMetadataRepository` — JPA entity/repo mirroring the table.
Model version string (e.g. `"v1"`) lives on the service and is stamped into each row so a future prompt rev can be rolled out by filtering `model_version <> 'v2'` in the backfill job.
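The taxonomy and result record above can be sketched in plain Java as follows. The enum values and the record shape come from this plan; `EnrichmentSanitizer`, its method names, and the hard cap of 8 are hypothetical additions, shown only as a defensive post-processing step in case the model returns an unknown facet or untidy entities. Singularisation and abbreviation expansion stay with the prompt, as described above; this guard only handles case, whitespace, duplicates, and the cap.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Facet taxonomy as listed in the plan (13 values, OTHER included).
enum ConceptFacet {
    DEFINITION, ANATOMY, PATHOPHYSIOLOGY, EPIDEMIOLOGY, CLINICAL_PRESENTATION,
    IMAGING, CLASSIFICATION, INDICATIONS, SURGICAL_TECHNIQUE,
    NONSURGICAL_MANAGEMENT, COMPLICATIONS, OUTCOMES_FOLLOWUP, OTHER
}

// Record shape taken from the plan.
record ChunkEnrichmentResult(List<String> entities, ConceptFacet facet, String summary) {}

// Hypothetical helper (not part of the plan): tidies raw model output.
final class EnrichmentSanitizer {
    static final int MAX_ENTITIES = 8; // assumption: the plan only says "~8 per chunk"

    private EnrichmentSanitizer() {}

    // Maps an unknown or missing facet label to OTHER instead of failing.
    static ConceptFacet parseFacet(String raw) {
        if (raw == null) {
            return ConceptFacet.OTHER;
        }
        try {
            return ConceptFacet.valueOf(raw.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            return ConceptFacet.OTHER;
        }
    }

    // Lowercases, trims, drops blanks and duplicates, and keeps at most MAX_ENTITIES.
    static List<String> normaliseEntities(List<String> raw) {
        Set<String> seen = new LinkedHashSet<>();
        for (String entity : raw) {
            if (entity == null) {
                continue;
            }
            String cleaned = entity.trim().toLowerCase(Locale.ROOT);
            if (!cleaned.isEmpty()) {
                seen.add(cleaned);
            }
            if (seen.size() == MAX_ENTITIES) {
                break;
            }
        }
        return List.copyOf(seen);
    }
}
```

Falling back to `OTHER` on an unparseable label matches the intent stated above: the model always has an out, and a bad label never blocks a row from being written.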
### 3. Hook into new book ingestion
Modify `BookEmbeddingService.embedBook`:
```java
// Step 3: Chunk and embed text
List<Document> allChunks = new ArrayList<>();
for (SectionEntity section : sections) {
    allChunks.addAll(textChunkingService.chunk(section, bookTitle));
}
if (skipEmbedding) { ... } else {
    embedInBatches(allChunks, bookId);
    chunkEnrichmentPipeline.enrichAndPersist(allChunks, sectionsById, bookTitle); // NEW
}
```
- `ChunkEnrichmentPipeline` — new orchestrator that iterates chunks, calls `ChunkEnrichmentService.enrich(...)` per chunk, saves `ChunkMetadataEntity` rows in batches, with the same throttle pattern as `embedInBatches`.
- Runs *after* embedding, not in place of it, so a failure in enrichment doesn't corrupt the vector store. On failure, log and continue — the backfill endpoint is the universal recovery path.
- Extend `deleteBookChunks` to also delete `chunk_metadata` rows so deletion stays consistent.
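The "log and continue, save in batches" behaviour described above could look roughly like the sketch below. It is a stdlib-only illustration, not the real `ChunkEnrichmentPipeline`: `BatchEnricher`, its generic signature, and the `batchSize` parameter are hypothetical, the enricher and the repository are passed in as plain functions, and throttling between batches is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical stand-in for the pipeline's inner loop.
final class BatchEnricher {
    private BatchEnricher() {}

    // Enriches each chunk, flushes results in batches, and returns how many chunks failed.
    static <C, R> int enrichAndPersist(List<C> chunks,
                                       Function<C, R> enrich,
                                       Consumer<List<R>> saveBatch,
                                       int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<R> buffer = new ArrayList<>(batchSize);
        int failures = 0;
        for (C chunk : chunks) {
            try {
                buffer.add(enrich.apply(chunk));
            } catch (RuntimeException e) {
                // One bad chunk must not abort the book; backfill can pick it up later.
                failures++;
                System.err.println("enrichment failed, skipping chunk: " + e.getMessage());
            }
            if (buffer.size() == batchSize) {
                saveBatch.accept(new ArrayList<>(buffer));
                buffer.clear();
            }
        }
        if (!buffer.isEmpty()) {
            saveBatch.accept(new ArrayList<>(buffer)); // flush the final partial batch
        }
        return failures;
    }
}
```

Because failures are counted and skipped here, the vector store is never touched by this step, and any chunk that failed simply has no `chunk_metadata` row until the backfill endpoint is run.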
### 4. Backfill endpoint for already-embedded books
New `EnrichmentController` in `com.aiteacher.enrichment`:
- `POST /api/v1/admin/books/{id}/enrich` → kicks off async backfill, returns 202 with `{status, chunksTotal, chunksEnriched}`
- `GET /api/v1/admin/books/{id}/enrich` → returns progress
Backfill flow (`EnrichmentBackfillService.backfillBook(UUID bookId)`):
1. Query the pgvector storage table directly via `JdbcTemplate` for all chunks of the book:
```sql
SELECT id, content, metadata
FROM vector_store
WHERE metadata->>'book_id' = ? AND metadata->>'type' = 'TEXT'
```
2. Left-anti-join against `chunk_metadata` to skip already-enriched chunks → idempotent, resumable.
3. For each missing chunk: look up its `SectionEntity` via `section_id` in metadata, call `ChunkEnrichmentService.enrich`, write a `ChunkMetadataEntity` row.
4. Progress tracked in an in-memory `ConcurrentHashMap<UUID, BackfillProgress>` (POC scope: progress is lost on restart, which is acceptable because the left-anti-join lets a re-run skip every chunk that is already enriched).
5. `@Async` on the backfill method using the same executor as `embedBook`.
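The flow above is idempotent because of the anti-join in step 2. A minimal in-memory sketch of that property follows. The `BackfillProgress` fields mirror the `{status, chunksTotal, chunksEnriched}` payload; everything else is an assumption made for illustration: plain maps stand in for `vector_store` and `chunk_metadata`, the status strings are invented, and `@Async` wiring is omitted.

```java
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

// Field names follow the payload in the plan; the status values are assumptions.
record BackfillProgress(String status, int chunksTotal, int chunksEnriched) {}

final class BackfillSketch {
    // In-memory progress registry keyed by book id (lost on restart).
    final Map<UUID, BackfillProgress> progress = new ConcurrentHashMap<>();
    // Stand-in for the chunk_metadata table: chunk_id -> summary.
    final Map<String, String> chunkMetadata = new ConcurrentHashMap<>();

    // vectorStoreChunks stands in for the SELECT against vector_store: chunk_id -> content.
    void backfillBook(UUID bookId, Map<String, String> vectorStoreChunks, UnaryOperator<String> enrich) {
        int total = vectorStoreChunks.size();
        // Anti-join: keep only chunk ids that have no chunk_metadata row yet.
        List<String> missing = vectorStoreChunks.keySet().stream()
                .filter(id -> !chunkMetadata.containsKey(id))
                .collect(Collectors.toList());
        int enriched = total - missing.size();
        progress.put(bookId, new BackfillProgress("RUNNING", total, enriched));
        for (String chunkId : missing) {
            try {
                chunkMetadata.put(chunkId, enrich.apply(vectorStoreChunks.get(chunkId)));
                enriched++;
            } catch (RuntimeException e) {
                System.err.println("enrichment failed for chunk " + chunkId + ": " + e.getMessage());
            }
            progress.put(bookId, new BackfillProgress("RUNNING", total, enriched));
        }
        String finalStatus = enriched == total ? "DONE" : "PARTIAL";
        progress.put(bookId, new BackfillProgress(finalStatus, total, enriched));
    }
}
```

Calling `backfillBook` a second time with the same inputs makes no enrichment calls, since every chunk id is already present in the stand-in table.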
### 5. Concept retrieval path
New `com.aiteacher.concept.ConceptRetriever`:
```java
public ConceptRetrievalResult retrieveByConcept(String conceptKeyword, UUID bookId) {
    String canonical = canonicalise(conceptKeyword); // lowercase, trim, simple plural strip

    // 5a. Primary: SQL entity match, grouped by facet
    List<ChunkMetadataEntity> hits = chunkMetadataRepository
        .findByBookIdAndEntityContains(bookId, canonical); // WHERE entities @> to_jsonb(?::text)

    if (hits.isEmpty()) {
        // 5b. Fallback: vector search, then enrich-join + facet-group
        List<Document> vectorHits = vectorStore.similaritySearch(/* TEXT filter, book_id filter, topK=30 */);
        List<String> chunkIds = vectorHits.stream().map(Document::getId).toList();
        hits = chunkMetadataRepository.findByChunkIdIn(chunkIds);
    }

    Map<ConceptFacet, List<ChunkMetadataEntity>> byFacet = hits.stream()
        .collect(groupingBy(ChunkMetadataEntity::getFacet, LinkedHashMap::new, toList()));

    // Hydrate: load SectionEntity for each chunk's section_id; load linked figures
    // via ChunkFigureRefRepository.findByChunkIdIn(chunkIds) — reuses existing linkage.
    return assemble(byFacet, ...);
}
```
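The `canonicalise` step is only described in a comment ("lowercase, trim, simple plural strip"). One possible reading is sketched below; the class name `ConceptKeywords` and the specific suffix rules are assumptions, and irregular plurals are deliberately not handled.

```java
import java.util.Locale;
import java.util.Objects;

// Hypothetical helper: one possible interpretation of "simple plural strip".
final class ConceptKeywords {
    private ConceptKeywords() {}

    static String canonicalise(String keyword) {
        Objects.requireNonNull(keyword, "keyword");
        // Lowercase, trim, and collapse inner whitespace.
        String s = keyword.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
        if (s.length() <= 3) {
            return s; // leave short tokens and abbreviations alone
        }
        if (s.endsWith("ies")) {
            return s.substring(0, s.length() - 3) + "y"; // arteries -> artery
        }
        // Singular medical terms often end in -ss, -us or -is; do not strip those.
        if (s.endsWith("ss") || s.endsWith("us") || s.endsWith("is")) {
            return s;
        }
        if (s.endsWith("s")) {
            return s.substring(0, s.length() - 1); // aneurysms -> aneurysm
        }
        return s;
    }
}
```

Note that this helper does not expand abbreviations: a query such as "SAH" stays "sah", would miss an entity stored as "subarachnoid hemorrhage", and would therefore land on the vector fallback in 5b.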
`ConceptRetrievalResult` = `Map<ConceptFacet, FacetBundle>` where each `FacetBundle` holds the parent sections, linked figures, and the per-chunk `summary` strings.
Cross-book aggregation: caller loops over READY books and merges bundles by facet.
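The per-book loop and merge can be sketched as below. This is an illustration under assumptions: `FacetMerger` is a hypothetical name, `Facet` is an abbreviated stand-in for `ConceptFacet`, and each bundle is reduced to a list of values rather than a full `FacetBundle`.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Abbreviated stand-in for ConceptFacet, kept short for the example.
enum Facet { DEFINITION, IMAGING, COMPLICATIONS, OTHER }

// Hypothetical helper: merges per-book results into one facet-keyed map.
final class FacetMerger {
    private FacetMerger() {}

    static <V> Map<Facet, List<V>> merge(List<Map<Facet, List<V>>> perBook) {
        // EnumMap iterates in enum declaration order, so facets come out in a stable order
        // regardless of which book contributed them first.
        Map<Facet, List<V>> merged = new EnumMap<>(Facet.class);
        for (Map<Facet, List<V>> book : perBook) {
            book.forEach((facet, items) ->
                merged.computeIfAbsent(facet, f -> new ArrayList<>()).addAll(items));
        }
        return merged;
    }
}
```

Using an `EnumMap` here is a design choice rather than something the plan prescribes: it gives the report a fixed facet order, whereas the `LinkedHashMap` grouping inside `retrieveByConcept` follows encounter order.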
### 6. Concept Report service & controller
New `ConceptReportService` in `com.aiteacher.concept` — mirrors the shape of `TopicSummaryService`, but:
- Calls `ConceptRetriever.retrieveByConcept(topic.getName(), bookId)` per book.
- For each facet that has hits, sends **one** LLM synthesis call with the chunks/figures of that facet — producing a structured, facet-labelled report.
- Persists in a new `concept_report` table:
```sql
CREATE TABLE concept_report (
    id            UUID         PRIMARY KEY,
    topic_id      VARCHAR(255) NOT NULL REFERENCES topic(id),
    report_number INT          NOT NULL,
    facets_json   JSONB        NOT NULL,  -- [{facetKey,title,markdown,refLabels[]}, ...]
    sources_json  JSONB        NOT NULL,  -- deduplicated SourceReference[]
    generated_at  TIMESTAMPTZ  NOT NULL,
    UNIQUE (topic_id, report_number)
);
```
Controller `ConceptReportController` exposes three endpoints under `/api/v1/topics/{id}/concept-reports` (POST generate, GET list, GET `/{reportId}`).
Reuses `TopicSummaryResponse.SourceReference` verbatim.
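The "one synthesis call per facet that has hits" rule can be illustrated with the sketch below. The `FacetSection` fields follow the `facets_json` shape above; the rest is assumed for the example: `FacetSynthesis` is a hypothetical name, the LLM is passed in as a plain function, facets are keyed by string, and `refLabels` is left empty.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Mirrors one element of facets_json: {facetKey, title, markdown, refLabels[]}.
record FacetSection(String facetKey, String title, String markdown, List<String> refLabels) {}

// Hypothetical helper: turns facet-grouped chunks into report sections.
final class FacetSynthesis {
    private FacetSynthesis() {}

    static List<FacetSection> synthesise(Map<String, List<String>> chunksByFacet,
                                         Function<List<String>, String> synthesiser) {
        List<FacetSection> sections = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : chunksByFacet.entrySet()) {
            List<String> chunks = entry.getValue();
            if (chunks == null || chunks.isEmpty()) {
                continue; // facets without hits get no section and cost no LLM call
            }
            String markdown = synthesiser.apply(chunks); // exactly one call per facet
            sections.add(new FacetSection(entry.getKey(), toTitle(entry.getKey()), markdown, List.of()));
        }
        return sections;
    }

    // "SURGICAL_TECHNIQUE" -> "Surgical technique"
    static String toTitle(String facetKey) {
        String spaced = facetKey.replace('_', ' ').toLowerCase(Locale.ROOT);
        return spaced.isEmpty() ? spaced : Character.toUpperCase(spaced.charAt(0)) + spaced.substring(1);
    }
}
```

With at most 13 facets, a report costs at most 13 sequential synthesis calls per generation under this scheme.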
### 7. Frontend
- `frontend/src/stores/topicStore.ts`: add parallel state `conceptReportList`, `activeConceptReport`, `conceptReportLoading`, and actions mirroring the existing summary ones.
- `frontend/src/views/TopicsView.vue`: add a **Summary / Concept Report** tab toggle at the top of the topic panel. Concept Report reuses the history-chips + Generate button UI. Report body renders each `FacetSection` as `<h3>{title}</h3>` + markdown.
- Loading hint: update the "up to 30 seconds" copy to "up to 60 seconds".
### 8. README update
Add an **Indexing Pipeline** diagram showing: PDF → parse → chunk → embed → **enrich (new)** → chunk_metadata. Plus a **Concept Retrieval** sequence diagram: query → entity-match SQL → facet-grouped bundle → synthesis → report.
## Decisions & trade-offs
- **Storage as separate Postgres table, not vectorstore JSON**: the vector store has no metadata-only update API, so backfill would require delete+reinsert and pay the re-embedding cost again. A dedicated table joins cleanly on `chunk_id` and is GIN-indexed.
- **Entity-match primary, vector fallback**: deterministic for the main use case, robust against typos/synonyms. Vector search stays the default for normal chat retrieval — this feature is additive.
- **Enrichment runs *after* embedding, not before**: keeps the two failure modes independent. The backfill endpoint is the universal recovery lever.
- **Fixed 13-value facet enum** (incl. `OTHER`): constrains LLM outputs; `OTHER` prevents forced mis-bucketing.
- **Direct `JdbcTemplate` read against `vector_store` for backfill**: Spring AI exposes no listing API. Acceptable for a POC, isolated behind one method.
- **Synchronous (sequential) LLM calls**: simplest; parallelism is a later optimisation if needed.
- **`model_version` column**: cheap insurance. If the prompt or facet taxonomy changes, backfill can re-enrich only stale rows.
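To make the `model_version` point concrete, the sketch below selects the chunks a version-aware backfill would process: those with no row at all, plus those tagged by an older version. `EnrichedRow` and `StaleRows` are hypothetical names, and lists stand in for the two tables.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical projection of a chunk_metadata row: only the columns needed here.
record EnrichedRow(String chunkId, String modelVersion) {}

final class StaleRows {
    private StaleRows() {}

    // Returns chunk ids that are missing from chunk_metadata or tagged with another version.
    static List<String> needsEnrichment(List<String> allChunkIds,
                                        List<EnrichedRow> rows,
                                        String currentVersion) {
        Map<String, String> versionByChunk = new HashMap<>();
        for (EnrichedRow row : rows) {
            versionByChunk.put(row.chunkId(), row.modelVersion());
        }
        List<String> pending = new ArrayList<>();
        for (String chunkId : allChunkIds) {
            String version = versionByChunk.get(chunkId);
            if (version == null || !version.equals(currentVersion)) {
                pending.add(chunkId);
            }
        }
        return pending;
    }
}
```

In SQL terms this corresponds to widening the anti-join with the `model_version <> 'v2'` filter mentioned in section 2.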
## Verification
1. Migration applies V7 and V8. Tables and indexes created.
2. New book ingestion: upload PDF → `chunk_metadata` populated with plausible entities/facets/summaries.
3. Backfill: POST `/api/v1/admin/books/{id}/enrich` → idempotent, completes, re-run is a no-op.
4. Concept retrieval primary path: POST `/api/v1/topics/aneurysm/concept-reports` → 200 with facets populated.
5. Fallback path: misspelled topic still returns results via vector fallback.
6. Frontend: Concept Report tab renders facet-labelled markdown + sources + inline figures; persists across reloads.
7. Deletion: removing a book also deletes its `chunk_metadata` rows (via the extended `deleteBookChunks`).
8. Regression: existing chat and summary flows still work.
9. Lint & tests pass.