ai-teacher/chunk-enrichment.md
2026-04-18 17:54:54 +02:00

Concept Retrieval via Indexing-Time Chunk Enrichment

Context

Vector similarity alone can't answer "tell me everything about aneurysms." It surfaces the chunks most linguistically similar to the query, not the set of all chunks that concern the concept — and it has no notion of whether each chunk is a definition, a case, a technique, or a complication.

The unlock is to move intelligence from query time to indexing time: for every text chunk, use an LLM to extract structured metadata (entities, facet, summary). At retrieval time, concept lookup becomes an SQL containment filter (entities @> '["aneurysm"]') with results bucketed by facet: deterministic, exhaustive over every chunk tagged with the entity, and organized by default. Vector search remains as a fallback for typos and synonyms, and for ranking within a facet.

This plan covers: (1) defining the metadata schema, (2) enriching chunks during new book ingestion, (3) back-filling the already-embedded corpus via an admin endpoint, (4) a new concept retrieval path, and (5) a Topics-page UI to surface the result.

Approach

1. Data model — new chunk_metadata table

Flyway migration backend/src/main/resources/db/migration/V7__chunk_metadata.sql:

CREATE TABLE chunk_metadata (
    chunk_id        VARCHAR(64) PRIMARY KEY,       -- same UUID that TextChunkingService issues and stores in vectorstore
    book_id         UUID NOT NULL,
    section_id      VARCHAR(255) NOT NULL,
    facet           VARCHAR(32) NOT NULL,           -- enum (see ConceptFacet)
    entities        JSONB NOT NULL,                 -- canonical lowercase string[]
    summary         TEXT NOT NULL,
    model_version   VARCHAR(32) NOT NULL,           -- records which LLM/prompt version tagged this chunk
    enriched_at     TIMESTAMPTZ NOT NULL
);
CREATE INDEX idx_chunk_metadata_book         ON chunk_metadata(book_id);
CREATE INDEX idx_chunk_metadata_book_facet   ON chunk_metadata(book_id, facet);
CREATE INDEX idx_chunk_metadata_entities_gin ON chunk_metadata USING GIN (entities jsonb_path_ops);

Why chunk_id is the natural key: TextChunkingService already generates a UUID per chunk, uses it as the pgvector Document id, and stores it in the chunk metadata. ChunkFigureRefEntity is keyed on the same id, so the table joins cleanly to everything already in place.

2. Enrichment service & facet taxonomy

New package com.aiteacher.enrichment:

  • ConceptFacet enum — 13 values tailored to neurosurgery textbooks: DEFINITION, ANATOMY, PATHOPHYSIOLOGY, EPIDEMIOLOGY, CLINICAL_PRESENTATION, IMAGING, CLASSIFICATION, INDICATIONS, SURGICAL_TECHNIQUE, NONSURGICAL_MANAGEMENT, COMPLICATIONS, OUTCOMES_FOLLOWUP, OTHER. OTHER is mandatory so the LLM always has an out (no hallucinated bucketing). The prompt carries explicit disambiguation rules (named grading scales → CLASSIFICATION; imaging of a complication → COMPLICATIONS; tools inside an operation → SURGICAL_TECHNIQUE).
  • ChunkEnrichmentResult — record (List<String> entities, ConceptFacet facet, String summary)
  • ChunkEnrichmentService — single method enrich(String chunkText, SectionEntity section, String bookTitle) → ChunkEnrichmentResult. Uses Spring AI ChatClient.prompt().call().entity(Class) for structured output. The prompt gives: book title, section title, chunk text, the fixed facet enum list, and instructs the model to return JSON with entities normalised to lowercase singular canonical form (e.g. "aneurysms" → "aneurysm"; "SAH" → "subarachnoid hemorrhage"). Caps entities at ~8 per chunk.
  • ChunkMetadataEntity + ChunkMetadataRepository — JPA entity/repo mirroring the table.

Model version string (e.g. "v1") lives on the service and is stamped into each row, so a future prompt revision (say "v2") can be rolled out by having the backfill job re-enrich only the rows where model_version <> 'v2'.
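As a minimal sketch of the two value types above, assuming plain Java 17 with no Spring dependencies: the enum carries the 13 facets, and the record normalises whatever the LLM returns. The parseOrOther helper, the MAX_ENTITIES constant, and the exact clean-up rules (trim, lowercase, de-duplicate, hard cap at 8) are illustrative assumptions, not part of the plan; singularisation and abbreviation expansion are left to the prompt.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// The 13 facets listed above; OTHER is the safe default.
enum ConceptFacet {
    DEFINITION, ANATOMY, PATHOPHYSIOLOGY, EPIDEMIOLOGY, CLINICAL_PRESENTATION,
    IMAGING, CLASSIFICATION, INDICATIONS, SURGICAL_TECHNIQUE,
    NONSURGICAL_MANAGEMENT, COMPLICATIONS, OUTCOMES_FOLLOWUP, OTHER;

    // Hypothetical helper: map an unknown or missing label to OTHER instead of failing.
    static ConceptFacet parseOrOther(String raw) {
        if (raw == null) {
            return OTHER;
        }
        try {
            return valueOf(raw.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            return OTHER;
        }
    }
}

record ChunkEnrichmentResult(List<String> entities, ConceptFacet facet, String summary) {
    static final int MAX_ENTITIES = 8; // assumed hard cap; the plan only says "~8"

    // Compact constructor: trim, lowercase, drop blanks, de-duplicate, then cap.
    ChunkEnrichmentResult {
        Set<String> cleaned = new LinkedHashSet<>();
        for (String entity : entities == null ? List.<String>of() : entities) {
            if (entity == null) {
                continue;
            }
            String normalised = entity.trim().toLowerCase(Locale.ROOT);
            if (!normalised.isEmpty()) {
                cleaned.add(normalised);
            }
        }
        entities = cleaned.stream().limit(MAX_ENTITIES).toList();
        facet = facet == null ? ConceptFacet.OTHER : facet;
        summary = summary == null ? "" : summary.trim();
    }
}
```

Normalising in the record constructor means every row written to chunk_metadata passes through the same clean-up, whether it came from ingestion or from the backfill.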

3. Hook into new book ingestion

Modify BookEmbeddingService.embedBook:

// Step 3: Chunk and embed text
List<Document> allChunks = new ArrayList<>();
for (SectionEntity section : sections) {
    allChunks.addAll(textChunkingService.chunk(section, bookTitle));
}
if (skipEmbedding) { ... } else {
    embedInBatches(allChunks, bookId);
    chunkEnrichmentPipeline.enrichAndPersist(allChunks, sectionsById, bookTitle);  // NEW
}
  • ChunkEnrichmentPipeline — new orchestrator that iterates chunks, calls ChunkEnrichmentService.enrich(...) per chunk, saves ChunkMetadataEntity rows in batches, with the same throttle pattern as embedInBatches.
  • Runs after embedding, not in place of it, so a failure in enrichment doesn't corrupt the vector store. On failure, log and continue — the backfill endpoint is the universal recovery path.
  • Extend deleteBookChunks to also delete chunk_metadata rows so deletion stays consistent.
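The batching and log-and-continue behaviour described above could look roughly like this. It is a sketch under assumptions: Chunk and MetadataRow are hypothetical stand-ins for Document and ChunkMetadataEntity, the enricher and batchSaver functions stand in for ChunkEnrichmentService.enrich and the repository's batch save, and the throttle between batches is omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical stand-ins for the real chunk and row types.
record Chunk(String id, String text) {}
record MetadataRow(String chunkId, String summary) {}

final class EnrichmentPipelineSketch {
    private final Function<Chunk, MetadataRow> enricher;   // wraps the per-chunk LLM call
    private final Consumer<List<MetadataRow>> batchSaver;  // wraps the batch insert
    private final int batchSize;

    EnrichmentPipelineSketch(Function<Chunk, MetadataRow> enricher,
                             Consumer<List<MetadataRow>> batchSaver,
                             int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.enricher = enricher;
        this.batchSaver = batchSaver;
        this.batchSize = batchSize;
    }

    /** Enriches every chunk and returns the ids of the chunks that failed. */
    List<String> enrichAndPersist(List<Chunk> chunks) {
        List<String> failed = new ArrayList<>();
        List<MetadataRow> batch = new ArrayList<>(batchSize);
        for (Chunk chunk : chunks) {
            try {
                batch.add(enricher.apply(chunk));
            } catch (RuntimeException e) {
                // Log and continue: a later backfill run picks this chunk up again.
                System.err.println("enrichment failed for chunk " + chunk.id() + ": " + e.getMessage());
                failed.add(chunk.id());
            }
            if (batch.size() == batchSize) {
                batchSaver.accept(List.copyOf(batch));
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            batchSaver.accept(List.copyOf(batch)); // flush the final partial batch
        }
        return failed;
    }
}
```

Because a failed chunk simply has no row in chunk_metadata, no separate retry queue is needed here; the backfill finds the gap on its next run.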

4. Backfill endpoint for already-embedded books

New EnrichmentController in com.aiteacher.enrichment:

  • POST /api/v1/admin/books/{id}/enrich → kicks off async backfill, returns 202 with {status, chunksTotal, chunksEnriched}
  • GET /api/v1/admin/books/{id}/enrich → returns progress

Backfill flow (EnrichmentBackfillService.backfillBook(UUID bookId)):

  1. Query the pgvector storage table directly via JdbcTemplate for all chunks of the book:
    SELECT id, content, metadata
    FROM vector_store
    WHERE metadata->>'book_id' = ? AND metadata->>'type' = 'TEXT'
    
  2. Left-anti-join against chunk_metadata to skip already-enriched chunks → idempotent, resumable.
  3. For each missing chunk: look up its SectionEntity via section_id in metadata, call ChunkEnrichmentService.enrich, write a ChunkMetadataEntity row.
  4. Progress tracked in an in-memory ConcurrentHashMap<UUID, BackfillProgress> (POC scope — no cross-restart resumability needed because the left-anti-join makes re-runs free).
  5. @Async on the backfill method using the same executor as embedBook.
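Steps 2 and 4 can be sketched in memory as follows. This assumes the anti-join is expressed as a set difference over chunk ids (in the real service it would more likely be a SQL LEFT JOIN ... IS NULL), and the BackfillProgress fields and the RUNNING/COMPLETED status strings are illustrative assumptions rather than names taken from the plan.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical progress holder; status strings are illustrative.
final class BackfillProgress {
    final int chunksTotal;
    final AtomicInteger chunksEnriched;

    BackfillProgress(int chunksTotal, int alreadyEnriched) {
        this.chunksTotal = chunksTotal;
        this.chunksEnriched = new AtomicInteger(alreadyEnriched);
    }

    String status() {
        return chunksEnriched.get() >= chunksTotal ? "COMPLETED" : "RUNNING";
    }
}

final class BackfillTrackerSketch {
    // One entry per book; lost on restart, which is acceptable because re-runs skip enriched chunks.
    private final Map<UUID, BackfillProgress> progressByBook = new ConcurrentHashMap<>();

    // In-memory equivalent of the left anti-join: chunks in the vector store without a metadata row.
    static Set<String> missingChunkIds(Set<String> vectorStoreIds, Set<String> enrichedIds) {
        Set<String> missing = new LinkedHashSet<>(vectorStoreIds);
        missing.removeAll(enrichedIds);
        return missing;
    }

    BackfillProgress start(UUID bookId, int chunksTotal, int alreadyEnriched) {
        BackfillProgress progress = new BackfillProgress(chunksTotal, alreadyEnriched);
        progressByBook.put(bookId, progress);
        return progress;
    }

    void recordEnriched(UUID bookId) {
        BackfillProgress progress = progressByBook.get(bookId);
        if (progress != null) {
            progress.chunksEnriched.incrementAndGet();
        }
    }

    BackfillProgress progressFor(UUID bookId) {
        return progressByBook.get(bookId);
    }
}
```

Running missingChunkIds again after a completed backfill yields an empty set, which is what makes a second POST a no-op.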

5. Concept retrieval path

New com.aiteacher.concept.ConceptRetriever:

public ConceptRetrievalResult retrieveByConcept(String conceptKeyword, UUID bookId) {
    String canonical = canonicalise(conceptKeyword);   // lowercase, trim, simple plural strip

    // 5a. Primary: SQL entity match, grouped by facet
    List<ChunkMetadataEntity> hits = chunkMetadataRepository
        .findByBookIdAndEntityContains(bookId, canonical);   // WHERE entities @> to_jsonb(?::text)

    if (hits.isEmpty()) {
        // 5b. Fallback: vector search, then enrich-join + facet-group
        List<Document> vectorHits = vectorStore.similaritySearch(/* TEXT filter, book_id filter, topK=30 */);
        List<String> chunkIds = vectorHits.stream().map(Document::getId).toList();
        hits = chunkMetadataRepository.findByChunkIdIn(chunkIds);
    }

    Map<ConceptFacet, List<ChunkMetadataEntity>> byFacet = hits.stream()
        .collect(groupingBy(ChunkMetadataEntity::getFacet, LinkedHashMap::new, toList()));

    // Hydrate: load SectionEntity for each chunk's section_id; load linked figures
    // via ChunkFigureRefRepository.findByChunkIdIn(chunkIds) — reuses existing linkage.
    return assemble(byFacet, ...);
}
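The canonicalise step is only described by its comment ("lowercase, trim, simple plural strip"). One possible reading, offered as an assumption rather than the intended implementation, is the helper below; the class name ConceptKeywords and the specific suffix rules are hypothetical.

```java
import java.util.Locale;

final class ConceptKeywords {
    private ConceptKeywords() {}

    // Lowercase, trim, collapse inner whitespace, then strip a simple English plural.
    static String canonicalise(String keyword) {
        if (keyword == null) {
            return "";
        }
        String s = keyword.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
        if (s.length() > 3 && s.endsWith("ies")) {
            return s.substring(0, s.length() - 3) + "y";   // "arteries" -> "artery"
        }
        boolean pluralS = s.length() > 3 && s.endsWith("s")
                && !s.endsWith("ss")   // "abscess" stays as is
                && !s.endsWith("us")   // "hydrocephalus" stays as is
                && !s.endsWith("is");  // "stenosis" stays as is
        return pluralS ? s.substring(0, s.length() - 1) : s;
    }
}
```

A rule set this simple will mishandle irregular plurals such as "metastases", and it does not expand abbreviations such as "SAH" the way the indexing prompt does; queries like these would miss the entity match and rely on the vector fallback.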

ConceptRetrievalResult = Map<ConceptFacet, FacetBundle> where each FacetBundle holds the parent sections, linked figures, and the per-chunk summary strings.

Cross-book aggregation: caller loops over READY books and merges bundles by facet.
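A minimal sketch of that merge, assuming each book's result can be viewed as a map from facet to a list of items. The abbreviated Facet enum and the generic item type are placeholders for ConceptFacet and the contents of a FacetBundle.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Abbreviated placeholder for ConceptFacet.
enum Facet { DEFINITION, IMAGING, COMPLICATIONS, OTHER }

final class FacetMergeSketch {
    private FacetMergeSketch() {}

    // Concatenates per-book results facet by facet, keeping book order within each facet.
    static <T> Map<Facet, List<T>> mergeByFacet(List<Map<Facet, List<T>>> perBook) {
        // EnumMap iterates in enum declaration order, which gives a stable facet order for display.
        Map<Facet, List<T>> merged = new EnumMap<>(Facet.class);
        for (Map<Facet, List<T>> book : perBook) {
            book.forEach((facet, items) ->
                    merged.computeIfAbsent(facet, f -> new ArrayList<>()).addAll(items));
        }
        return merged;
    }
}
```

Facets with no hits in any book never appear in the merged map, which matches the report step only synthesising facets that have hits.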

6. Concept Report service & controller

New ConceptReportService in com.aiteacher.concept — mirrors the shape of TopicSummaryService, but:

  • Calls ConceptRetriever.retrieveByConcept(topic.getName(), bookId) per book.
  • For each facet that has hits, sends one LLM synthesis call with the chunks/figures of that facet — producing a structured, facet-labelled report.
  • Persists in a new concept_report table:
CREATE TABLE concept_report (
    id            UUID PRIMARY KEY,
    topic_id      VARCHAR(255) NOT NULL REFERENCES topic(id),
    report_number INT NOT NULL,
    facets_json   JSONB NOT NULL,        -- [{facetKey,title,markdown,refLabels[]}, ...]
    sources_json  JSONB NOT NULL,        -- deduplicated SourceReference[]
    generated_at  TIMESTAMPTZ NOT NULL,
    UNIQUE (topic_id, report_number)
);

Controller ConceptReportController exposes three endpoints under /api/v1/topics/{id}/concept-reports (POST generate, GET list, GET /{reportId}).

Reuses TopicSummaryResponse.SourceReference verbatim.

7. Frontend

  • frontend/src/stores/topicStore.ts: add parallel state conceptReportList, activeConceptReport, conceptReportLoading, and actions mirroring the existing summary ones.
  • frontend/src/views/TopicsView.vue: add a Summary / Concept Report tab toggle at the top of the topic panel. Concept Report reuses the history-chips + Generate button UI. Report body renders each FacetSection as <h3>{title}</h3> + markdown.
  • Loading hint: update the "up to 30 seconds" copy to "up to 60 seconds".

8. README update

Add an Indexing Pipeline diagram showing: PDF → parse → chunk → embed → enrich (new) → chunk_metadata. Plus a Concept Retrieval sequence diagram: query → entity-match SQL → facet-grouped bundle → synthesis → report.

Decisions & trade-offs

  • Storage as separate Postgres table, not vectorstore JSON: the vectorstore has no metadata-only update API, so a backfill would require delete + reinsert and would pay the embedding cost again. A dedicated table joins cleanly on chunk_id and is GIN-indexed.
  • Entity-match primary, vector fallback: the entity match is deterministic for the main use case, and the fallback keeps retrieval robust against typos and synonyms. Vector search stays the default for normal chat retrieval; this feature is additive.
  • Enrichment runs after embedding, not before: keeps the two failure modes independent. The backfill endpoint is the universal recovery lever.
  • Fixed 13-value facet enum (incl. OTHER): constrains LLM outputs; OTHER prevents forced mis-bucketing.
  • Direct JdbcTemplate read against vector_store for backfill: Spring AI exposes no listing API. Acceptable for a POC, isolated behind one method.
  • Synchronous (sequential) LLM calls: simplest; parallelism is a later optimisation if needed.
  • model_version column: cheap insurance. If the prompt or facet taxonomy changes, backfill can re-enrich only stale rows.

Verification

  1. Flyway applies migrations V7 and V8; the new tables and indexes are created.
  2. New book ingestion: upload PDF → chunk_metadata populated with plausible entities/facets/summaries.
  3. Backfill: POST /api/v1/admin/books/{id}/enrich → idempotent, completes, re-run is a no-op.
  4. Concept retrieval primary path: POST /api/v1/topics/aneurysm/concept-reports → 200 with facets populated.
  5. Fallback path: misspelled topic still returns results via vector fallback.
  6. Frontend: Concept Report tab renders facet-labelled markdown + sources + inline figures; persists across reloads.
  7. Deletion: removing a book also removes its chunk_metadata rows (via the extended deleteBookChunks).
  8. Regression: existing chat and summary flows still work.
  9. Lint & tests pass.