# Concept Retrieval via Indexing-Time Chunk Enrichment

## Context

Vector similarity alone can't answer "tell me everything about aneurysms." It surfaces the chunks most *linguistically* similar to the query, not the set of all chunks that *concern* the concept — and it has no notion of whether each chunk is a definition, a case, a technique, or a complication.

The key move is to shift intelligence from query time to indexing time: for every text chunk, an LLM extracts **structured metadata** (entities, facet, summary). At retrieval time, concept lookup becomes an SQL filter (`entities @> '["aneurysm"]'`) bucketed by facet: deterministic, organized by default, and exhaustive to the extent that enrichment tagged every relevant chunk. Vector search remains as a fallback for typos and synonyms, and for ranking within a facet.

This plan covers: (1) defining the metadata schema, (2) enriching chunks during new book ingestion, (3) back-filling the already-embedded corpus via an admin endpoint, (4) a new concept retrieval path, and (5) a Topics-page UI to surface the result.

## Approach

### 1. Data model — new `chunk_metadata` table

Flyway migration `backend/src/main/resources/db/migration/V7__chunk_metadata.sql`:

```sql
CREATE TABLE chunk_metadata (
    chunk_id        VARCHAR(64) PRIMARY KEY,       -- same UUID that TextChunkingService issues and stores in vectorstore
    book_id         UUID NOT NULL,
    section_id      VARCHAR(255) NOT NULL,
    facet           VARCHAR(32) NOT NULL,           -- enum (see ConceptFacet)
    entities        JSONB NOT NULL,                 -- canonical lowercase string[]
    summary         TEXT NOT NULL,
    model_version   VARCHAR(32) NOT NULL,           -- records which LLM/prompt version tagged this chunk
    enriched_at     TIMESTAMPTZ NOT NULL
);
CREATE INDEX idx_chunk_metadata_book         ON chunk_metadata(book_id);
CREATE INDEX idx_chunk_metadata_book_facet   ON chunk_metadata(book_id, facet);
CREATE INDEX idx_chunk_metadata_entities_gin ON chunk_metadata USING GIN (entities jsonb_path_ops);
```

Why `chunk_id` is the natural key: `TextChunkingService` already generates a UUID per chunk, uses it as the pgvector Document id, and stores it in the chunk metadata; the same id is also the key in `ChunkFigureRefEntity`. The new table therefore joins cleanly to everything already in place.

### 2. Enrichment service & facet taxonomy

New package `com.aiteacher.enrichment`:

- `ConceptFacet` enum — 13 values tailored to neurosurgery textbooks: `DEFINITION, ANATOMY, PATHOPHYSIOLOGY, EPIDEMIOLOGY, CLINICAL_PRESENTATION, IMAGING, CLASSIFICATION, INDICATIONS, SURGICAL_TECHNIQUE, NONSURGICAL_MANAGEMENT, COMPLICATIONS, OUTCOMES_FOLLOWUP, OTHER`. `OTHER` is mandatory so the LLM always has an out (no hallucinated bucketing). The prompt carries explicit disambiguation rules (named grading scales → `CLASSIFICATION`; imaging of a complication → `COMPLICATIONS`; tools inside an operation → `SURGICAL_TECHNIQUE`).
- `ChunkEnrichmentResult` — record `(List<String> entities, ConceptFacet facet, String summary)`
- `ChunkEnrichmentService` — single method `enrich(String chunkText, SectionEntity section, String bookTitle) → ChunkEnrichmentResult`. Uses Spring AI `ChatClient.prompt().call().entity(Class)` for structured output. The prompt gives: book title, section title, chunk text, the fixed facet enum list, and instructs the model to return JSON with entities normalised to lowercase singular canonical form (e.g. "aneurysms" → "aneurysm"; "SAH" → "subarachnoid hemorrhage"). Caps entities at ~8 per chunk.
- `ChunkMetadataEntity` + `ChunkMetadataRepository` — JPA entity/repo mirroring the table.
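
The facet list and the entity rules above are enforced by the prompt, but the service could also guard them on the Java side before persisting. The following is a minimal sketch under that assumption: `parseOrOther` and `EntityNormaliser` are hypothetical helpers, not existing code, and singularisation is deliberately left to the LLM (only trimming, lowercasing, de-duplication and the cap of 8 are applied here).

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Facet values copied from the plan; parseOrOther is a hypothetical helper.
enum ConceptFacet {
    DEFINITION, ANATOMY, PATHOPHYSIOLOGY, EPIDEMIOLOGY, CLINICAL_PRESENTATION, IMAGING,
    CLASSIFICATION, INDICATIONS, SURGICAL_TECHNIQUE, NONSURGICAL_MANAGEMENT,
    COMPLICATIONS, OUTCOMES_FOLLOWUP, OTHER;

    // Map a missing or unrecognised label to OTHER instead of throwing.
    static ConceptFacet parseOrOther(String raw) {
        if (raw == null) {
            return OTHER;
        }
        try {
            return valueOf(raw.trim().toUpperCase(Locale.ROOT).replace(' ', '_'));
        } catch (IllegalArgumentException e) {
            return OTHER;
        }
    }
}

// Hypothetical post-processing of the entity list returned by the LLM.
final class EntityNormaliser {
    static final int MAX_ENTITIES = 8; // the plan caps entities at about 8 per chunk

    private EntityNormaliser() {}

    // Trim, lowercase, drop blanks and duplicates, keep first-seen order, cap the size.
    static List<String> normalise(List<String> rawEntities) {
        Set<String> seen = new LinkedHashSet<>();
        for (String raw : rawEntities) {
            if (raw == null) {
                continue;
            }
            String entity = raw.trim().toLowerCase(Locale.ROOT);
            if (entity.isEmpty()) {
                continue;
            }
            seen.add(entity);
            if (seen.size() == MAX_ENTITIES) {
                break;
            }
        }
        return new ArrayList<>(seen);
    }
}
```

Falling back to `OTHER` on an unparseable label keeps a single bad LLM response from failing the whole chunk, which matches the stated reason for making `OTHER` mandatory.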

Model version string (e.g. `"v1"`) lives on the service and is stamped into each row, so a future prompt revision (say `"v2"`) can be rolled out by having the backfill job re-enrich only the rows where `model_version <> 'v2'`.
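
As an illustration of that rollout rule, the selection of chunks to (re-)enrich can be expressed as a single predicate: a chunk qualifies if it has no row yet or if its stored version differs from the current one. This is a sketch only; `StaleEnrichmentSelector` and its parameters are hypothetical names, and in practice the same filter would more likely live in the backfill SQL.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper: decides which chunks the backfill should (re-)enrich.
final class StaleEnrichmentSelector {

    private StaleEnrichmentSelector() {}

    /**
     * @param allChunkIds      every TEXT chunk id of the book
     * @param versionByChunkId stored model_version per chunk id; no entry means "never enriched"
     * @param currentVersion   the version string currently configured on the service
     * @return ids that are missing or were tagged by a different version, in input order
     */
    static List<String> chunksToEnrich(List<String> allChunkIds,
                                       Map<String, String> versionByChunkId,
                                       String currentVersion) {
        return allChunkIds.stream()
                // Map.get returns null for unenriched chunks, which never equals currentVersion.
                .filter(id -> !currentVersion.equals(versionByChunkId.get(id)))
                .collect(Collectors.toList());
    }
}
```

Note that a plain "skip chunks that already have a row" check would never pick up stale rows, so a version-aware filter like this one is what makes the `model_version` column useful.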

### 3. Hook into new book ingestion

Modify `BookEmbeddingService.embedBook`:

```java
// Step 3: Chunk and embed text
List<Document> allChunks = new ArrayList<>();
for (SectionEntity section : sections) {
    allChunks.addAll(textChunkingService.chunk(section, bookTitle));
}
if (skipEmbedding) { ... } else {
    embedInBatches(allChunks, bookId);
    chunkEnrichmentPipeline.enrichAndPersist(allChunks, sectionsById, bookTitle);  // NEW
}
```

- `ChunkEnrichmentPipeline` — new orchestrator that iterates chunks, calls `ChunkEnrichmentService.enrich(...)` per chunk, saves `ChunkMetadataEntity` rows in batches, with the same throttle pattern as `embedInBatches`.
- Runs *after* embedding, not in place of it, so a failure in enrichment doesn't corrupt the vector store. On failure, log and continue — the backfill endpoint is the universal recovery path.
- Extend `deleteBookChunks` to also delete `chunk_metadata` rows so deletion stays consistent.
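
The orchestration described above (per-chunk enrichment, batched saves, log-and-continue on failure) can be sketched independently of Spring. Everything here is an assumption made for illustration: the class name, the generic `Function`/`Consumer` stand-ins for `ChunkEnrichmentService.enrich` and the repository's batch save, and the use of `System.err` in place of a logger. The throttle between batches is omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Sketch of the enrich-then-persist loop; C is the chunk type, R the row type.
final class EnrichmentPipelineSketch<C, R> {
    private final Function<C, R> enricher;      // stands in for ChunkEnrichmentService.enrich
    private final Consumer<List<R>> batchSaver; // stands in for a repository batch save
    private final int batchSize;

    EnrichmentPipelineSketch(Function<C, R> enricher, Consumer<List<R>> batchSaver, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.enricher = enricher;
        this.batchSaver = batchSaver;
        this.batchSize = batchSize;
    }

    /** Enriches every chunk, saving rows in batches; returns the number of failed chunks. */
    int enrichAndPersist(List<C> chunks) {
        List<R> buffer = new ArrayList<>(batchSize);
        int failures = 0;
        for (C chunk : chunks) {
            try {
                buffer.add(enricher.apply(chunk));
            } catch (RuntimeException e) {
                // Log and continue: the chunk stays unenriched and the backfill can pick it up.
                failures++;
                System.err.println("Enrichment failed, leaving chunk for backfill: " + e.getMessage());
            }
            if (buffer.size() == batchSize) {
                batchSaver.accept(new ArrayList<>(buffer));
                buffer.clear();
            }
        }
        if (!buffer.isEmpty()) {
            batchSaver.accept(new ArrayList<>(buffer)); // flush the final partial batch
        }
        return failures;
    }
}
```

Catching the failure per chunk rather than per batch means one bad LLM response costs one row, not a whole batch.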

### 4. Backfill endpoint for already-embedded books

New `EnrichmentController` in `com.aiteacher.enrichment`:

- `POST /api/v1/admin/books/{id}/enrich` → kicks off async backfill, returns 202 with `{status, chunksTotal, chunksEnriched}`
- `GET  /api/v1/admin/books/{id}/enrich` → returns progress

Backfill flow (`EnrichmentBackfillService.backfillBook(UUID bookId)`):

1. Query the pgvector storage table directly via `JdbcTemplate` for all chunks of the book:
   ```sql
   SELECT id, content, metadata
   FROM vector_store
   WHERE metadata->>'book_id' = ? AND metadata->>'type' = 'TEXT'
   ```
2. Left-anti-join against `chunk_metadata` to skip already-enriched chunks → idempotent, resumable.
3. For each missing chunk: look up its `SectionEntity` via `section_id` in metadata, call `ChunkEnrichmentService.enrich`, write a `ChunkMetadataEntity` row.
4. Progress tracked in an in-memory `ConcurrentHashMap<UUID, BackfillProgress>` (POC scope: progress does not need to survive a restart, because the left-anti-join lets a re-run skip everything already enriched).
5. `@Async` on the backfill method using the same executor as `embedBook`.
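
The in-memory progress map from step 4 could look roughly like the sketch below. The class name, the `RUNNING`/`DONE`/`NOT_STARTED` status strings and the plain-string rendering are assumptions for illustration; the real endpoint would serialise the `{status, chunksTotal, chunksEnriched}` shape as JSON.

```java
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical tracker backing the GET progress endpoint; state is lost on restart by design.
final class BackfillProgressTracker {

    private static final class Counters {
        final int chunksTotal;
        final AtomicInteger chunksEnriched = new AtomicInteger();

        Counters(int chunksTotal) {
            this.chunksTotal = chunksTotal;
        }
    }

    private final ConcurrentHashMap<UUID, Counters> byBook = new ConcurrentHashMap<>();

    /** Called once the list of chunks still to enrich is known. */
    void start(UUID bookId, int chunksTotal) {
        byBook.put(bookId, new Counters(chunksTotal));
    }

    /** Called by the async worker after each chunk_metadata row is written. */
    void recordEnriched(UUID bookId) {
        Counters counters = byBook.get(bookId);
        if (counters != null) {
            counters.chunksEnriched.incrementAndGet();
        }
    }

    /** Renders "<status> <chunksEnriched>/<chunksTotal>" for the given book. */
    String describe(UUID bookId) {
        Counters counters = byBook.get(bookId);
        if (counters == null) {
            return "NOT_STARTED 0/0";
        }
        int done = counters.chunksEnriched.get();
        String status = done >= counters.chunksTotal ? "DONE" : "RUNNING";
        return status + " " + done + "/" + counters.chunksTotal;
    }
}
```

Because the worker thread writes and request threads read, the counters use `AtomicInteger` inside a `ConcurrentHashMap` rather than plain fields.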

### 5. Concept retrieval path

New `com.aiteacher.concept.ConceptRetriever`:

```java
public ConceptRetrievalResult retrieveByConcept(String conceptKeyword, UUID bookId) {
    String canonical = canonicalise(conceptKeyword);   // lowercase, trim, simple plural strip

    // 5a. Primary: SQL entity match, grouped by facet
    List<ChunkMetadataEntity> hits = chunkMetadataRepository
        .findByBookIdAndEntityContains(bookId, canonical);   // WHERE entities @> to_jsonb(?::text)

    if (hits.isEmpty()) {
        // 5b. Fallback: vector search, then enrich-join + facet-group
        List<Document> vectorHits = vectorStore.similaritySearch(/* TEXT filter, book_id filter, topK=30 */);
        List<String> chunkIds = vectorHits.stream().map(Document::getId).toList();
        hits = chunkMetadataRepository.findByChunkIdIn(chunkIds);
    }

    Map<ConceptFacet, List<ChunkMetadataEntity>> byFacet = hits.stream()
        .collect(groupingBy(ChunkMetadataEntity::getFacet, LinkedHashMap::new, toList()));

    // Hydrate: load SectionEntity for each chunk's section_id, and load linked figures
    // via ChunkFigureRefRepository.findByChunkIdIn(...) using the chunk ids of `hits`
    // (the `chunkIds` local above exists only in the fallback branch). Reuses existing linkage.
    return assemble(byFacet, ...);
}
```
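
`canonicalise` is described above only by its comment (lowercase, trim, simple plural strip). One naive reading of that comment is sketched here; the class name `ConceptKeywords` and the specific suffix rules are assumptions, and the plural handling is a heuristic that will mis-handle irregular words.

```java
import java.util.Locale;

// Hypothetical query-side canonicaliser; must stay aligned with the enrichment prompt's rules.
final class ConceptKeywords {

    private ConceptKeywords() {}

    static String canonicalise(String keyword) {
        // Trim, lowercase, and collapse internal whitespace runs to a single space.
        String s = keyword.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");

        // "arteries" -> "artery"
        if (s.length() > 4 && s.endsWith("ies")) {
            return s.substring(0, s.length() - 3) + "y";
        }

        // Leave endings that are usually singular: "abscess", "hydrocephalus", "stenosis".
        boolean keepsFinalS = s.endsWith("ss") || s.endsWith("us") || s.endsWith("is");

        // "aneurysms" -> "aneurysm"
        if (s.length() > 3 && s.endsWith("s") && !keepsFinalS) {
            return s.substring(0, s.length() - 1);
        }
        return s;
    }
}
```

A string-level canonicaliser cannot expand abbreviations, so a query such as "SAH" becomes "sah" and will miss rows tagged "subarachnoid hemorrhage" at indexing time; under this sketch such queries would reach results only through the vector fallback.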

`ConceptRetrievalResult` = `Map<ConceptFacet, FacetBundle>` where each `FacetBundle` holds the parent sections, linked figures, and the per-chunk `summary` strings.

Cross-book aggregation: caller loops over READY books and merges bundles by facet.
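
The merge step of that loop might be sketched as follows. The helper is generic on purpose, since the internals of `FacetBundle` are not specified here: `K` stands in for `ConceptFacet` and `T` for whatever a bundle holds per facet. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: concatenates per-book results facet by facet.
final class FacetMergeSketch {

    private FacetMergeSketch() {}

    /**
     * @param perBook one facet-to-items map per READY book, in the order books were queried
     * @return a new map where each facet holds the items of all books, in book order
     */
    static <K, T> Map<K, List<T>> mergeByFacet(List<Map<K, List<T>>> perBook) {
        Map<K, List<T>> merged = new LinkedHashMap<>();
        for (Map<K, List<T>> bundle : perBook) {
            for (Map.Entry<K, List<T>> entry : bundle.entrySet()) {
                // Copy into a fresh list so the inputs are never mutated.
                merged.computeIfAbsent(entry.getKey(), k -> new ArrayList<>())
                      .addAll(entry.getValue());
            }
        }
        return merged;
    }
}
```

Facets appear in first-seen order; if reports should always list facets in enum order, an `EnumMap` keyed by `ConceptFacet` would be the natural substitute.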

### 6. Concept Report service & controller

New `ConceptReportService` in `com.aiteacher.concept` — mirrors the shape of `TopicSummaryService`, but:

- Calls `ConceptRetriever.retrieveByConcept(topic.getName(), bookId)` per book.
- For each facet that has hits, sends **one** LLM synthesis call with the chunks/figures of that facet — producing a structured, facet-labelled report.
- Persists in a new `concept_report` table:

```sql
CREATE TABLE concept_report (
    id            UUID PRIMARY KEY,
    topic_id      VARCHAR(255) NOT NULL REFERENCES topic(id),
    report_number INT NOT NULL,
    facets_json   JSONB NOT NULL,        -- [{facetKey,title,markdown,refLabels[]}, ...]
    sources_json  JSONB NOT NULL,        -- deduplicated SourceReference[]
    generated_at  TIMESTAMPTZ NOT NULL,
    UNIQUE (topic_id, report_number)
);
```

Controller `ConceptReportController` exposes three endpoints under `/api/v1/topics/{id}/concept-reports` (POST generate, GET list, GET `/{reportId}`).

Reuses `TopicSummaryResponse.SourceReference` verbatim.

### 7. Frontend

- `frontend/src/stores/topicStore.ts`: add parallel state `conceptReportList`, `activeConceptReport`, `conceptReportLoading`, and actions mirroring the existing summary ones.
- `frontend/src/views/TopicsView.vue`: add a **Summary / Concept Report** tab toggle at the top of the topic panel. Concept Report reuses the history-chips + Generate button UI. Report body renders each `FacetSection` as `<h3>{title}</h3>` + markdown.
- Loading hint: update the "up to 30 seconds" copy to "up to 60 seconds".

### 8. README update

Add an **Indexing Pipeline** diagram showing: PDF → parse → chunk → embed → **enrich (new)** → chunk_metadata. Plus a **Concept Retrieval** sequence diagram: query → entity-match SQL → facet-grouped bundle → synthesis → report.

## Decisions & trade-offs

- **Storage as separate Postgres table, not vectorstore JSON**: the vectorstore has no metadata-only update API, so a backfill there would require delete+reinsert (and the re-embedding cost that comes with it). A dedicated table joins cleanly on `chunk_id` and is GIN-indexed.
- **Entity-match primary, vector fallback**: deterministic for the main use case, robust against typos/synonyms. Vector search stays the default for normal chat retrieval — this feature is additive.
- **Enrichment runs *after* embedding, not before**: keeps the two failure modes independent. The backfill endpoint is the universal recovery lever.
- **Fixed 13-value facet enum** (incl. `OTHER`): constrains LLM outputs; `OTHER` prevents forced mis-bucketing.
- **Direct `JdbcTemplate` read against `vector_store` for backfill**: Spring AI exposes no listing API. Acceptable for a POC, isolated behind one method.
- **Synchronous (sequential) LLM calls**: simplest; parallelism is a later optimisation if needed.
- **`model_version` column**: cheap insurance. If the prompt or facet taxonomy changes, backfill can re-enrich only stale rows.

## Verification

1. Flyway applies V7 and V8 cleanly; the `chunk_metadata` and `concept_report` tables and their indexes exist.
2. New book ingestion: upload PDF → `chunk_metadata` populated with plausible entities/facets/summaries.
3. Backfill: POST `/api/v1/admin/books/{id}/enrich` → idempotent, completes, re-run is a no-op.
4. Concept retrieval primary path: POST `/api/v1/topics/aneurysm/concept-reports` → 200 with facets populated.
5. Fallback path: misspelled topic still returns results via vector fallback.
6. Frontend: Concept Report tab renders facet-labelled markdown + sources + inline figures; persists across reloads.
7. Deletion: removing a book also deletes its `chunk_metadata` rows (via the extended `deleteBookChunks`).
8. Regression: existing chat and summary flows still work.
9. Lint & tests pass.