From c7a77af2f45ac91ef935dff8dd9d01b1b8047a0a Mon Sep 17 00:00:00 2001 From: Adrien Date: Sat, 18 Apr 2026 17:54:54 +0200 Subject: [PATCH] add new concept report --- README.md | 46 +++ .../aiteacher/book/BookEmbeddingService.java | 20 +- .../concept/ConceptReportController.java | 50 +++ .../concept/ConceptReportEntity.java | 48 +++ .../concept/ConceptReportRepository.java | 13 + .../concept/ConceptReportResponse.java | 24 ++ .../concept/ConceptReportService.java | 287 ++++++++++++++++++ .../concept/ConceptRetrievalResult.java | 10 + .../aiteacher/concept/ConceptRetriever.java | 163 ++++++++++ .../com/aiteacher/concept/FacetBundle.java | 12 + .../concept/SavedConceptReportItem.java | 10 + .../enrichment/ChunkEnrichmentPipeline.java | 75 +++++ .../enrichment/ChunkEnrichmentResult.java | 9 + .../enrichment/ChunkEnrichmentService.java | 135 ++++++++ .../enrichment/ChunkMetadataEntity.java | 71 +++++ .../enrichment/ChunkMetadataRepository.java | 36 +++ .../aiteacher/enrichment/ConceptFacet.java | 27 ++ .../enrichment/EnrichmentBackfillService.java | 138 +++++++++ .../enrichment/EnrichmentController.java | 50 +++ .../aiteacher/topic/TopicSummaryService.java | 13 +- backend/src/main/resources/application.yaml | 5 +- .../db/migration/V7__chunk_metadata.sql | 14 + .../db/migration/V8__concept_report.sql | 11 + .../V9__chunk_metadata_facet_check.sql | 19 ++ chunk-enrichment.md | 172 +++++++++++ frontend/src/components/BookCard.vue | 70 ++++- frontend/src/stores/bookStore.ts | 39 ++- frontend/src/stores/topicStore.ts | 83 ++++- frontend/src/views/TopicsView.vue | 283 +++++++++++++++-- 29 files changed, 1892 insertions(+), 41 deletions(-) create mode 100644 backend/src/main/java/com/aiteacher/concept/ConceptReportController.java create mode 100644 backend/src/main/java/com/aiteacher/concept/ConceptReportEntity.java create mode 100644 backend/src/main/java/com/aiteacher/concept/ConceptReportRepository.java create mode 100644 
backend/src/main/java/com/aiteacher/concept/ConceptReportResponse.java create mode 100644 backend/src/main/java/com/aiteacher/concept/ConceptReportService.java create mode 100644 backend/src/main/java/com/aiteacher/concept/ConceptRetrievalResult.java create mode 100644 backend/src/main/java/com/aiteacher/concept/ConceptRetriever.java create mode 100644 backend/src/main/java/com/aiteacher/concept/FacetBundle.java create mode 100644 backend/src/main/java/com/aiteacher/concept/SavedConceptReportItem.java create mode 100644 backend/src/main/java/com/aiteacher/enrichment/ChunkEnrichmentPipeline.java create mode 100644 backend/src/main/java/com/aiteacher/enrichment/ChunkEnrichmentResult.java create mode 100644 backend/src/main/java/com/aiteacher/enrichment/ChunkEnrichmentService.java create mode 100644 backend/src/main/java/com/aiteacher/enrichment/ChunkMetadataEntity.java create mode 100644 backend/src/main/java/com/aiteacher/enrichment/ChunkMetadataRepository.java create mode 100644 backend/src/main/java/com/aiteacher/enrichment/ConceptFacet.java create mode 100644 backend/src/main/java/com/aiteacher/enrichment/EnrichmentBackfillService.java create mode 100644 backend/src/main/java/com/aiteacher/enrichment/EnrichmentController.java create mode 100644 backend/src/main/resources/db/migration/V7__chunk_metadata.sql create mode 100644 backend/src/main/resources/db/migration/V8__concept_report.sql create mode 100644 backend/src/main/resources/db/migration/V9__chunk_metadata_facet_check.sql create mode 100644 chunk-enrichment.md diff --git a/README.md b/README.md index 42fffa2..5312ddd 100644 --- a/README.md +++ b/README.md @@ -35,11 +35,13 @@ graph TD EP3["Vision describe → embed caption"] EP4["Chunk text → embed chunks"] EP5["Link chunks ↔ figures"] + EP6["LLM enrich chunk\n(entities, facet, summary)\n→ chunk_metadata"] EP1 --> EP2 EP1 --> EP4 EP2 --> EP3 EP4 --> EP5 EP3 --> EP5 + EP4 --> EP6 end subgraph "Retrieval Pipeline (per chat query)" @@ -65,6 +67,50 @@ graph TD 
end ``` +### Concept Retrieval Pipeline (per concept report) + +Concept retrieval is an alternative to the semantic-similarity flow above. It uses the +LLM-tagged `chunk_metadata` rows written at indexing time to exhaustively gather every +chunk that *concerns* a concept (e.g. "aneurysm"), bucketed by facet. One synthesis call +per facet yields a structured, multi-section report. + +```mermaid +sequenceDiagram + participant User + participant FE as Frontend + participant BE as Backend (ConceptReportService) + participant Retr as ConceptRetriever + participant DB as chunk_metadata (GIN) + participant Vec as vector_store + participant LLM + + User->>FE: Click "Generate Concept Report" on topic + FE->>BE: POST /api/v1/topics/{id}/concept-reports + loop per READY book + BE->>Retr: retrieveByConcept(topicName, bookId) + Retr->>DB: WHERE entities @> [canonical] + alt SQL hits found + DB-->>Retr: chunks grouped by facet + else no match (typo / synonym) + Retr->>Vec: similaritySearch topK=30 + Vec-->>Retr: chunk ids + Retr->>DB: findByChunkIdIn → group by facet + end + end + BE->>BE: merge facets across books, assign global [S#]/[F#] + loop per non-empty facet + BE->>LLM: synthesize facet section (focused prompt) + LLM-->>BE: facet markdown + end + BE->>BE: persist concept_report + BE-->>FE: { facets[], sources[] } + FE->>User: render facet-labelled report + inline figures +``` + +Backfill path for already-embedded books: +`POST /api/v1/admin/books/{id}/enrich` scans `vector_store` for TEXT chunks missing +`chunk_metadata` rows and enriches them in place. Idempotent — re-running is a no-op. + ## Marker API Response Structure The PDF parsing pipeline calls a local [Marker](https://github.com/VikParuchuri/marker) server (`POST /marker/upload`). 
diff --git a/backend/src/main/java/com/aiteacher/book/BookEmbeddingService.java b/backend/src/main/java/com/aiteacher/book/BookEmbeddingService.java index 51ed43c..12c83e5 100644 --- a/backend/src/main/java/com/aiteacher/book/BookEmbeddingService.java +++ b/backend/src/main/java/com/aiteacher/book/BookEmbeddingService.java @@ -1,6 +1,8 @@ package com.aiteacher.book; import com.aiteacher.document.*; +import com.aiteacher.enrichment.ChunkEnrichmentPipeline; +import com.aiteacher.enrichment.ChunkMetadataRepository; import com.aiteacher.figure.FigureStorageService; import org.slf4j.Logger; @@ -35,6 +37,8 @@ public class BookEmbeddingService { private final ChunkFigureRefRepository chunkFigureRefRepository; private final FigureStorageService figureStorageService; private final MarkdownStorageService markdownStorageService; + private final ChunkEnrichmentPipeline chunkEnrichmentPipeline; + private final ChunkMetadataRepository chunkMetadataRepository; @Value("${app.embedding.batch-size:50}") private int embeddingBatchSize; @@ -58,7 +62,9 @@ public class BookEmbeddingService { FigureRepository figureRepository, ChunkFigureRefRepository chunkFigureRefRepository, FigureStorageService figureStorageService, - MarkdownStorageService markdownStorageService) { + MarkdownStorageService markdownStorageService, + ChunkEnrichmentPipeline chunkEnrichmentPipeline, + ChunkMetadataRepository chunkMetadataRepository) { this.vectorStore = vectorStore; this.bookRepository = bookRepository; this.markerPageParser = markerPageParser; @@ -72,6 +78,8 @@ public class BookEmbeddingService { this.chunkFigureRefRepository = chunkFigureRefRepository; this.figureStorageService = figureStorageService; this.markdownStorageService = markdownStorageService; + this.chunkEnrichmentPipeline = chunkEnrichmentPipeline; + this.chunkMetadataRepository = chunkMetadataRepository; } @Async @@ -110,6 +118,14 @@ public class BookEmbeddingService { } else { embedInBatches(allChunks, bookId); log.info("Embedded {} 
text chunks for book {}", allChunks.size(), bookId); + Map<String, SectionEntity> sectionsById = new HashMap<>(); + for (SectionEntity s : sections) sectionsById.put(s.getId(), s); + try { + chunkEnrichmentPipeline.enrichAndPersist(allChunks, sectionsById, bookTitle); + } catch (Exception ex) { + log.warn("Chunk enrichment failed for book {} — backfill endpoint can recover: {}", + bookId, ex.getMessage()); + } } // Step 4: Decode pre-cropped figures from Marker output @@ -200,6 +216,8 @@ public class BookEmbeddingService { sectionRepository.deleteAllByBookId(bookId); chapterRepository.deleteAllByBookId(bookId); + chunkMetadataRepository.deleteByBookId(bookId); + FilterExpressionBuilder b = new FilterExpressionBuilder(); vectorStore.delete(b.eq("book_id", bookId.toString()).build()); } catch (Exception ex) { diff --git a/backend/src/main/java/com/aiteacher/concept/ConceptReportController.java b/backend/src/main/java/com/aiteacher/concept/ConceptReportController.java new file mode 100644 index 0000000..83aa0cb --- /dev/null +++ b/backend/src/main/java/com/aiteacher/concept/ConceptReportController.java @@ -0,0 +1,50 @@ +package com.aiteacher.concept; + +import com.aiteacher.topic.Topic; +import com.aiteacher.topic.TopicRepository; +import org.springframework.http.ResponseEntity; +import org.springframework.web.bind.annotation.*; + +import java.util.List; +import java.util.Map; +import java.util.NoSuchElementException; +import java.util.UUID; +import java.util.stream.Collectors; + +@RestController +@RequestMapping("/api/v1/topics/{id}/concept-reports") +public class ConceptReportController { + + private final TopicRepository topicRepository; + private final ConceptReportService conceptReportService; + + public ConceptReportController(TopicRepository topicRepository, + ConceptReportService conceptReportService) { + this.topicRepository = topicRepository; + this.conceptReportService = conceptReportService; + } + + @PostMapping + public ResponseEntity<ConceptReportResponse> generate(@PathVariable String id) { + Topic
topic = topicRepository.findById(id) + .orElseThrow(() -> new NoSuchElementException("Topic not found.")); + return ResponseEntity.ok(conceptReportService.generateReport(topic)); + } + + @GetMapping + public ResponseEntity<List<SavedConceptReportItem>> list(@PathVariable String id) { + topicRepository.findById(id) + .orElseThrow(() -> new NoSuchElementException("Topic not found.")); + return ResponseEntity.ok(conceptReportService.listReports(id)); + } + + @GetMapping("/{reportId}") + public ResponseEntity<ConceptReportResponse> get(@PathVariable String id, + @PathVariable UUID reportId) { + topicRepository.findById(id) + .orElseThrow(() -> new NoSuchElementException("Topic not found.")); + Map<String, String> topicNames = topicRepository.findAll().stream() + .collect(Collectors.toMap(Topic::getId, Topic::getName, (a, b) -> a)); + return ResponseEntity.ok(conceptReportService.getReport(reportId, topicNames)); + } +} diff --git a/backend/src/main/java/com/aiteacher/concept/ConceptReportEntity.java b/backend/src/main/java/com/aiteacher/concept/ConceptReportEntity.java new file mode 100644 index 0000000..6e65a68 --- /dev/null +++ b/backend/src/main/java/com/aiteacher/concept/ConceptReportEntity.java @@ -0,0 +1,48 @@ +package com.aiteacher.concept; + +import jakarta.persistence.*; + +import java.time.Instant; +import java.util.UUID; + +@Entity +@Table(name = "concept_report") +public class ConceptReportEntity { + + @Id + @GeneratedValue(strategy = GenerationType.UUID) + private UUID id; + + @Column(name = "topic_id", nullable = false, length = 100) + private String topicId; + + @Column(name = "report_number", nullable = false) + private int reportNumber; + + @Column(name = "facets_json", nullable = false, columnDefinition = "TEXT") + private String facetsJson; + + @Column(name = "sources_json", nullable = false, columnDefinition = "TEXT") + private String sourcesJson; + + @Column(name = "generated_at", nullable = false) + private Instant generatedAt; + + protected ConceptReportEntity() {} + + public ConceptReportEntity(String topicId,
int reportNumber, String facetsJson, + String sourcesJson, Instant generatedAt) { + this.topicId = topicId; + this.reportNumber = reportNumber; + this.facetsJson = facetsJson; + this.sourcesJson = sourcesJson; + this.generatedAt = generatedAt; + } + + public UUID getId() { return id; } + public String getTopicId() { return topicId; } + public int getReportNumber() { return reportNumber; } + public String getFacetsJson() { return facetsJson; } + public String getSourcesJson() { return sourcesJson; } + public Instant getGeneratedAt() { return generatedAt; } +} diff --git a/backend/src/main/java/com/aiteacher/concept/ConceptReportRepository.java b/backend/src/main/java/com/aiteacher/concept/ConceptReportRepository.java new file mode 100644 index 0000000..2bf2f5c --- /dev/null +++ b/backend/src/main/java/com/aiteacher/concept/ConceptReportRepository.java @@ -0,0 +1,13 @@ +package com.aiteacher.concept; + +import org.springframework.data.jpa.repository.JpaRepository; +import org.springframework.stereotype.Repository; + +import java.util.List; +import java.util.UUID; + +@Repository +public interface ConceptReportRepository extends JpaRepository<ConceptReportEntity, UUID> { + long countByTopicId(String topicId); + List<ConceptReportEntity> findByTopicIdOrderByReportNumberAsc(String topicId); +} diff --git a/backend/src/main/java/com/aiteacher/concept/ConceptReportResponse.java b/backend/src/main/java/com/aiteacher/concept/ConceptReportResponse.java new file mode 100644 index 0000000..7882915 --- /dev/null +++ b/backend/src/main/java/com/aiteacher/concept/ConceptReportResponse.java @@ -0,0 +1,24 @@ +package com.aiteacher.concept; + +import com.aiteacher.topic.TopicSummaryResponse.SourceReference; + +import java.time.Instant; +import java.util.List; +import java.util.UUID; + +public record ConceptReportResponse( + UUID id, + int reportNumber, + String topicId, + String topicName, + List<FacetSection> facets, + List<SourceReference> sources, + Instant generatedAt +) { + public record FacetSection( + String facetKey, + String title, + String markdown, +
List<String> refLabels + ) {} +} diff --git a/backend/src/main/java/com/aiteacher/concept/ConceptReportService.java b/backend/src/main/java/com/aiteacher/concept/ConceptReportService.java new file mode 100644 index 0000000..16ba3f6 --- /dev/null +++ b/backend/src/main/java/com/aiteacher/concept/ConceptReportService.java @@ -0,0 +1,287 @@ +package com.aiteacher.concept; + +import com.aiteacher.book.Book; +import com.aiteacher.book.BookRepository; +import com.aiteacher.book.BookStatus; +import com.aiteacher.book.NoKnowledgeSourceException; +import com.aiteacher.document.FigureEntity; +import com.aiteacher.document.SectionEntity; +import com.aiteacher.enrichment.ConceptFacet; +import com.aiteacher.topic.Topic; +import com.aiteacher.topic.TopicSummaryResponse.SourceReference; +import com.fasterxml.jackson.core.JsonProcessingException; +import com.fasterxml.jackson.databind.ObjectMapper; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; +import org.springframework.ai.chat.client.ChatClient; +import org.springframework.stereotype.Service; + +import java.time.Instant; +import java.util.*; + +@Service +public class ConceptReportService { + + private static final Logger log = LoggerFactory.getLogger(ConceptReportService.class); + + private static final String SYSTEM_PROMPT = """ + You are an expert neurosurgery educator. You write focused, facet-specific sections of + a structured concept report for highly experienced neurosurgeons. The audience wants + concise, clinically relevant teaching. + + When writing a facet section: + - Stick strictly to the facet you are asked about (e.g. definition, complications). + - Cite claims using ONLY the reference labels provided in the context. + Do not invent page numbers, section titles, or labels not present in CONTEXT. + - Citation format: each citation must be a SINGLE label per bracket — write `[S1], [S2]` or + `[S3] [F2]`. NEVER combine labels inside one bracket (no `[S1 S2]`, `[S1, S2]`, `[S1 2]`).
- Figures ([F#]) are actual images that will be rendered inline — reference them when they + visually support your explanation. + - If CONTEXT is insufficient for the requested facet, write exactly: + "The uploaded books do not contain sufficient information on this aspect." + - Never hallucinate clinical information outside the provided context. + """; + + private final ChatClient chatClient; + private final BookRepository bookRepository; + private final ConceptRetriever conceptRetriever; + private final ConceptReportRepository reportRepository; + private final ObjectMapper objectMapper; + + public ConceptReportService(ChatClient chatClient, + BookRepository bookRepository, + ConceptRetriever conceptRetriever, + ConceptReportRepository reportRepository, + ObjectMapper objectMapper) { + this.chatClient = chatClient; + this.bookRepository = bookRepository; + this.conceptRetriever = conceptRetriever; + this.reportRepository = reportRepository; + this.objectMapper = objectMapper; + } + + public ConceptReportResponse generateReport(Topic topic) { + List<Book> readyBooks = bookRepository.findAll().stream() + .filter(b -> b.getStatus() == BookStatus.READY) + .toList(); + + if (readyBooks.isEmpty()) { + throw new NoKnowledgeSourceException( + "No books are available as knowledge sources.
Please upload and process at least one book."); + } + + Map<ConceptFacet, MergedFacet> merged = new EnumMap<>(ConceptFacet.class); + for (Book book : readyBooks) { + ConceptRetrievalResult result = conceptRetriever.retrieveByConcept(topic.getName(), book.getId()); + result.byFacet().forEach((facet, bundle) -> merged + .computeIfAbsent(facet, k -> new MergedFacet()) + .add(bundle)); + } + + // Global, deduplicated sources across all facets + List<SectionEntity> globalSections = new ArrayList<>(); + Set<String> seenSections = new LinkedHashSet<>(); + List<FigureEntity> globalFigures = new ArrayList<>(); + Set<String> seenFigures = new LinkedHashSet<>(); + + for (MergedFacet mf : merged.values()) { + for (SectionEntity s : mf.sections) if (seenSections.add(s.getId())) globalSections.add(s); + for (FigureEntity f : mf.figures) if (seenFigures.add(f.getId())) globalFigures.add(f); + } + + // Global label maps: section id -> "S#", figure id -> "F#" + Map<String, String> sectionLabel = new HashMap<>(); + for (int i = 0; i < globalSections.size(); i++) { + sectionLabel.put(globalSections.get(i).getId(), "S" + (i + 1)); + } + Map<String, String> figureLabel = new HashMap<>(); + for (int i = 0; i < globalFigures.size(); i++) { + figureLabel.put(globalFigures.get(i).getId(), "F" + (i + 1)); + } + + List<ConceptReportResponse.FacetSection> facetSections = new ArrayList<>(); + // Preserve enum declaration order for consistent UI rendering + for (ConceptFacet facet : ConceptFacet.values()) { + MergedFacet mf = merged.get(facet); + if (mf == null || mf.isEmpty()) continue; + if (facet == ConceptFacet.OTHER) continue; // skip OTHER bucket in the rendered report + + String prompt = buildFacetPrompt(topic, facet, mf, sectionLabel, figureLabel); + String markdown = chatClient.prompt() + .system(SYSTEM_PROMPT) + .user(prompt) + .call() + .content(); + + List<String> refs = collectRefs(mf, sectionLabel, figureLabel); + facetSections.add(new ConceptReportResponse.FacetSection( + facet.name(), facet.displayTitle(), markdown != null ?
markdown : "", refs)); + } + + List<SourceReference> sources = buildSources(globalSections, globalFigures, readyBooks); + Instant generatedAt = Instant.now(); + + int reportNumber = (int) reportRepository.countByTopicId(topic.getId()) + 1; + ConceptReportEntity entity = new ConceptReportEntity( + topic.getId(), reportNumber, + serialize(facetSections), serialize(sources), generatedAt); + entity = reportRepository.save(entity); + + return new ConceptReportResponse( + entity.getId(), reportNumber, topic.getId(), topic.getName(), + facetSections, sources, generatedAt); + } + + public List<SavedConceptReportItem> listReports(String topicId) { + return reportRepository.findByTopicIdOrderByReportNumberAsc(topicId).stream() + .map(e -> new SavedConceptReportItem(e.getId(), e.getReportNumber(), e.getGeneratedAt())) + .toList(); + } + + public ConceptReportResponse getReport(UUID reportId, Map<String, String> topicNamesById) { + ConceptReportEntity entity = reportRepository.findById(reportId) + .orElseThrow(() -> new NoSuchElementException("Concept report not found.")); + List<ConceptReportResponse.FacetSection> facets = deserializeFacets(entity.getFacetsJson()); + List<SourceReference> sources = deserializeSources(entity.getSourcesJson()); + String topicName = topicNamesById.getOrDefault(entity.getTopicId(), entity.getTopicId()); + return new ConceptReportResponse( + entity.getId(), entity.getReportNumber(), entity.getTopicId(), topicName, + facets, sources, entity.getGeneratedAt()); + } + + private String buildFacetPrompt(Topic topic, ConceptFacet facet, MergedFacet mf, + Map<String, String> sectionLabel, + Map<String, String> figureLabel) { + StringBuilder sb = new StringBuilder(); + sb.append("CONCEPT: ").append(topic.getName()).append("\n"); + sb.append("FACET: ").append(facet.displayTitle()).append("\n\n"); + + sb.append("CONTEXT:\n\n"); + for (SectionEntity s : mf.sections) { + String label = sectionLabel.get(s.getId()); + sb.append("[").append(label).append("] ") + .append(s.getTitle() != null ?
s.getTitle() : "") + .append(", p.").append(s.getPageStart()).append("\n"); + sb.append(s.getFullText()).append("\n\n"); + } + + if (!mf.figures.isEmpty()) { + sb.append("AVAILABLE FIGURES:\n"); + for (FigureEntity f : mf.figures) { + String label = figureLabel.get(f.getId()); + sb.append("[").append(label).append("] ") + .append(f.getLabel() != null ? f.getLabel() : "Figure") + .append(" (p.").append(f.getPage()).append("): ") + .append(f.getCaption() != null ? f.getCaption() : "") + .append("\n"); + } + sb.append("\n"); + } + + sb.append("Write the ").append(facet.displayTitle()).append(" section of a concept report on \"") + .append(topic.getName()) + .append("\". Stay strictly within this facet. Use the [S#]/[F#] labels above for citations."); + return sb.toString(); + } + + private List<String> collectRefs(MergedFacet mf, + Map<String, String> sectionLabel, + Map<String, String> figureLabel) { + List<String> refs = new ArrayList<>(); + for (SectionEntity s : mf.sections) { + String l = sectionLabel.get(s.getId()); + if (l != null) refs.add(l); + } + for (FigureEntity f : mf.figures) { + String l = figureLabel.get(f.getId()); + if (l != null) refs.add(l); + } + return refs; + } + + private List<SourceReference> buildSources(List<SectionEntity> sections, + List<FigureEntity> figures, + List<Book> readyBooks) { + List<SourceReference> sources = new ArrayList<>(); + for (int i = 0; i < sections.size(); i++) { + SectionEntity s = sections.get(i); + Book book = findBook(readyBooks, s.getBookId()); + String title = book != null ? book.getTitle() : "Book"; + String bookId = book != null ? book.getId().toString() : null; + sources.add(new SourceReference( + "TEXT", "S" + (i + 1), bookId, title, s.getPageStart(), + truncate(s.getFullText(), 500), null, null, null, null, null)); + } + for (int i = 0; i < figures.size(); i++) { + FigureEntity f = figures.get(i); + Book book = findBook(readyBooks, f.getBookId()); + String title = book != null ? book.getTitle() : "Book"; + String bookId = book != null ?
book.getId().toString() : null; + String filename = f.getImagePath().substring(f.getImagePath().lastIndexOf('/') + 1); + String imageUrl = "/api/v1/figures/" + f.getBookId() + "/" + filename; + sources.add(new SourceReference( + "FIGURE", "F" + (i + 1), bookId, title, f.getPage(), + null, f.getId(), f.getLabel(), f.getCaption(), + f.getFigureType().name(), imageUrl)); + } + return sources; + } + + private Book findBook(List<Book> books, UUID bookId) { + return books.stream().filter(b -> b.getId().equals(bookId)).findFirst().orElse(null); + } + + private String serialize(Object value) { + try { + return objectMapper.writeValueAsString(value); + } catch (JsonProcessingException e) { + log.warn("Failed to serialize concept report field", e); + return "[]"; + } + } + + private List<ConceptReportResponse.FacetSection> deserializeFacets(String json) { + try { + return objectMapper.readValue(json, + objectMapper.getTypeFactory().constructCollectionType( + List.class, ConceptReportResponse.FacetSection.class)); + } catch (JsonProcessingException e) { + log.warn("Failed to deserialize facets", e); + return List.of(); + } + } + + private List<SourceReference> deserializeSources(String json) { + try { + return objectMapper.readValue(json, + objectMapper.getTypeFactory().constructCollectionType( + List.class, SourceReference.class)); + } catch (JsonProcessingException e) { + log.warn("Failed to deserialize sources", e); + return List.of(); + } + } + + private String truncate(String text, int maxChars) { + if (text == null) return ""; + return text.length() <= maxChars ?
text : text.substring(0, maxChars) + "…"; + } + + private static class MergedFacet { + final List<SectionEntity> sections = new ArrayList<>(); + final List<FigureEntity> figures = new ArrayList<>(); + final Set<String> sectionIds = new HashSet<>(); + final Set<String> figureIds = new HashSet<>(); + + void add(FacetBundle bundle) { + for (SectionEntity s : bundle.sections()) { + if (sectionIds.add(s.getId())) sections.add(s); + } + for (FigureEntity f : bundle.figures()) { + if (figureIds.add(f.getId())) figures.add(f); + } + } + + boolean isEmpty() { return sections.isEmpty() && figures.isEmpty(); } + } +} diff --git a/backend/src/main/java/com/aiteacher/concept/ConceptRetrievalResult.java b/backend/src/main/java/com/aiteacher/concept/ConceptRetrievalResult.java new file mode 100644 index 0000000..9741bd0 --- /dev/null +++ b/backend/src/main/java/com/aiteacher/concept/ConceptRetrievalResult.java @@ -0,0 +1,10 @@ +package com.aiteacher.concept; + +import com.aiteacher.enrichment.ConceptFacet; + +import java.util.Map; + +public record ConceptRetrievalResult( + Map<ConceptFacet, FacetBundle> byFacet, + boolean usedFallback +) {} diff --git a/backend/src/main/java/com/aiteacher/concept/ConceptRetriever.java b/backend/src/main/java/com/aiteacher/concept/ConceptRetriever.java new file mode 100644 index 0000000..2418eed --- /dev/null +++ b/backend/src/main/java/com/aiteacher/concept/ConceptRetriever.java @@ -0,0 +1,163 @@ +package com.aiteacher.concept; + +import com.aiteacher.document.*; +import com.aiteacher.enrichment.ChunkMetadataEntity; +import com.aiteacher.enrichment.ChunkMetadataRepository; +import com.aiteacher.enrichment.ConceptFacet; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; +import org.springframework.ai.document.Document; +import org.springframework.ai.vectorstore.SearchRequest; +import org.springframework.ai.vectorstore.VectorStore; +import org.springframework.ai.vectorstore.filter.FilterExpressionBuilder; +import org.springframework.stereotype.Service; + +import java.util.*; +import java.util.stream.Collectors;
+ +@Service +public class ConceptRetriever { + + private static final Logger log = LoggerFactory.getLogger(ConceptRetriever.class); + + private static final int FALLBACK_TOP_K = 30; + private static final int FIGURE_TOP_K = 6; + + private final ChunkMetadataRepository metadataRepository; + private final VectorStore vectorStore; + private final SectionRepository sectionRepository; + private final FigureRepository figureRepository; + private final ChunkFigureRefRepository chunkFigureRefRepository; + + public ConceptRetriever(ChunkMetadataRepository metadataRepository, + VectorStore vectorStore, + SectionRepository sectionRepository, + FigureRepository figureRepository, + ChunkFigureRefRepository chunkFigureRefRepository) { + this.metadataRepository = metadataRepository; + this.vectorStore = vectorStore; + this.sectionRepository = sectionRepository; + this.figureRepository = figureRepository; + this.chunkFigureRefRepository = chunkFigureRefRepository; + } + + public ConceptRetrievalResult retrieveByConcept(String conceptKeyword, UUID bookId) { + String canonical = canonicalise(conceptKeyword); + + List<ChunkMetadataEntity> hits = metadataRepository + .findByBookIdAndEntityContains(bookId, canonical); + boolean fallback = false; + + if (hits.isEmpty()) { + log.debug("Entity match miss for '{}' in book {} — falling back to vector search", canonical, bookId); + fallback = true; + hits = vectorFallback(conceptKeyword, bookId); + } + + if (hits.isEmpty()) { + return new ConceptRetrievalResult(Map.of(), fallback); + } + + List<FigureEntity> semanticFigures = semanticFigureSearch(conceptKeyword, bookId); + + Map<ConceptFacet, List<ChunkMetadataEntity>> grouped = hits.stream() + .collect(Collectors.groupingBy( + ChunkMetadataEntity::getFacet, + LinkedHashMap::new, + Collectors.toList())); + + Map<ConceptFacet, FacetBundle> result = new LinkedHashMap<>(); + for (Map.Entry<ConceptFacet, List<ChunkMetadataEntity>> entry : grouped.entrySet()) { + result.put(entry.getKey(), hydrate(entry.getValue(), semanticFigures)); + } + return new ConceptRetrievalResult(result, fallback); + } + + private List<ChunkMetadataEntity> vectorFallback(String
query, UUID bookId) { + FilterExpressionBuilder b = new FilterExpressionBuilder(); + List<Document> textHits = vectorStore.similaritySearch( + SearchRequest.builder() + .query(query) + .topK(FALLBACK_TOP_K) + .filterExpression(b.and( + b.eq("type", "TEXT"), + b.eq("book_id", bookId.toString()) + ).build()) + .build() + ); + List<UUID> chunkIds = textHits.stream() + .map(d -> { + try { return UUID.fromString(d.getId()); } + catch (Exception e) { return null; } + }) + .filter(Objects::nonNull) + .toList(); + if (chunkIds.isEmpty()) return List.of(); + return metadataRepository.findByChunkIdIn(chunkIds); + } + + private FacetBundle hydrate(List<ChunkMetadataEntity> chunks, List<FigureEntity> semanticFigures) { + List<String> sectionIds = chunks.stream() + .map(ChunkMetadataEntity::getSectionId) + .distinct() + .toList(); + List<SectionEntity> sections = sectionIds.isEmpty() + ? List.of() + : sectionRepository.findAllById(sectionIds); + + List<UUID> chunkIds = chunks.stream().map(ChunkMetadataEntity::getChunkId).toList(); + List<String> linkedFigureIds = chunkFigureRefRepository.findByChunkIdIn(chunkIds) + .stream() + .map(ChunkFigureRefEntity::getFigureId) + .distinct() + .toList(); + List<FigureEntity> linkedFigures = linkedFigureIds.isEmpty() + ?
List.of() + : figureRepository.findAllById(linkedFigureIds); + + // Merge caption-semantic-search figures with chunk-linked figures (dedupe by id, linked first) + Map<String, FigureEntity> merged = new LinkedHashMap<>(); + linkedFigures.forEach(f -> merged.put(f.getId(), f)); + semanticFigures.forEach(f -> merged.putIfAbsent(f.getId(), f)); + + List<String> summaries = chunks.stream() + .map(ChunkMetadataEntity::getSummary) + .filter(s -> s != null && !s.isBlank()) + .distinct() + .toList(); + + return new FacetBundle(sections, new ArrayList<>(merged.values()), summaries); + } + + private List<FigureEntity> semanticFigureSearch(String query, UUID bookId) { + FilterExpressionBuilder b = new FilterExpressionBuilder(); + List<Document> figureHits = vectorStore.similaritySearch( + SearchRequest.builder() + .query(query) + .topK(FIGURE_TOP_K) + .filterExpression(b.and( + b.eq("type", "FIGURE"), + b.eq("book_id", bookId.toString()) + ).build()) + .build() + ); + List<String> figureIds = figureHits.stream() + .map(d -> (String) d.getMetadata().get("figure_id")) + .filter(Objects::nonNull) + .toList(); + return figureIds.isEmpty() ?
List.of() : figureRepository.findAllById(figureIds); + } + + static String canonicalise(String raw) { + if (raw == null) return ""; + String s = raw.trim().toLowerCase(Locale.ROOT); + if (s.endsWith("ies") && s.length() > 3) { + s = s.substring(0, s.length() - 3) + "y"; + } else if (s.endsWith("es") && s.length() > 2) { + s = s.substring(0, s.length() - 2); + } else if (s.endsWith("s") && s.length() > 1 && !s.endsWith("ss")) { + s = s.substring(0, s.length() - 1); + } + return s; + } +} diff --git a/backend/src/main/java/com/aiteacher/concept/FacetBundle.java b/backend/src/main/java/com/aiteacher/concept/FacetBundle.java new file mode 100644 index 0000000..77df371 --- /dev/null +++ b/backend/src/main/java/com/aiteacher/concept/FacetBundle.java @@ -0,0 +1,12 @@ +package com.aiteacher.concept; + +import com.aiteacher.document.FigureEntity; +import com.aiteacher.document.SectionEntity; + +import java.util.List; + +public record FacetBundle( + List<SectionEntity> sections, + List<FigureEntity> figures, + List<String> chunkSummaries +) {} diff --git a/backend/src/main/java/com/aiteacher/concept/SavedConceptReportItem.java b/backend/src/main/java/com/aiteacher/concept/SavedConceptReportItem.java new file mode 100644 index 0000000..b875835 --- /dev/null +++ b/backend/src/main/java/com/aiteacher/concept/SavedConceptReportItem.java @@ -0,0 +1,10 @@ +package com.aiteacher.concept; + +import java.time.Instant; +import java.util.UUID; + +public record SavedConceptReportItem( + UUID id, + int reportNumber, + Instant generatedAt +) {} diff --git a/backend/src/main/java/com/aiteacher/enrichment/ChunkEnrichmentPipeline.java b/backend/src/main/java/com/aiteacher/enrichment/ChunkEnrichmentPipeline.java new file mode 100644 index 0000000..cc5ba7e --- /dev/null +++ b/backend/src/main/java/com/aiteacher/enrichment/ChunkEnrichmentPipeline.java @@ -0,0 +1,75 @@ +package com.aiteacher.enrichment; + +import com.aiteacher.document.SectionEntity; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; +import
org.springframework.ai.document.Document; +import org.springframework.stereotype.Service; + +import java.time.Instant; +import java.util.List; +import java.util.Map; +import java.util.UUID; + +@Service +public class ChunkEnrichmentPipeline { + + private static final Logger log = LoggerFactory.getLogger(ChunkEnrichmentPipeline.class); + + private final ChunkEnrichmentService enrichmentService; + private final ChunkMetadataRepository metadataRepository; + + public ChunkEnrichmentPipeline(ChunkEnrichmentService enrichmentService, + ChunkMetadataRepository metadataRepository) { + this.enrichmentService = enrichmentService; + this.metadataRepository = metadataRepository; + } + + public void enrichAndPersist(List<Document> chunks, + Map<String, SectionEntity> sectionsById, + String bookTitle) { + int total = chunks.size(); + int done = 0; + for (Document chunk : chunks) { + String sectionId = (String) chunk.getMetadata().get("section_id"); + SectionEntity section = sectionId != null ? sectionsById.get(sectionId) : null; + UUID chunkId; + try { + chunkId = UUID.fromString(chunk.getId()); + } catch (IllegalArgumentException ex) { + log.warn("Skipping chunk with non-UUID id '{}'", chunk.getId()); + continue; + } + UUID bookId = extractBookId(chunk); + if (bookId == null || sectionId == null) { + log.warn("Skipping chunk {} missing book_id or section_id metadata", chunkId); + continue; + } + try { + ChunkEnrichmentResult result = enrichmentService.enrich(chunk.getText(), section, bookTitle); + ChunkMetadataEntity entity = new ChunkMetadataEntity( + chunkId, bookId, sectionId, + result.facet(), result.entities(), result.summary(), + ChunkEnrichmentService.MODEL_VERSION, Instant.now()); + metadataRepository.save(entity); + } catch (Exception ex) { + log.warn("Enrichment failed for chunk {}: {}", chunkId, ex.getMessage()); + } + done++; + if (done % 25 == 0) { + log.info("Enrichment progress: {}/{} chunks", done, total); + } + } + log.info("Enrichment complete: {}/{} chunks processed", done, total); + } + + 
private UUID extractBookId(Document chunk) { + Object raw = chunk.getMetadata().get("book_id"); + if (raw == null) return null; + try { + return UUID.fromString(raw.toString()); + } catch (IllegalArgumentException ex) { + return null; + } + } +} diff --git a/backend/src/main/java/com/aiteacher/enrichment/ChunkEnrichmentResult.java b/backend/src/main/java/com/aiteacher/enrichment/ChunkEnrichmentResult.java new file mode 100644 index 0000000..7235551 --- /dev/null +++ b/backend/src/main/java/com/aiteacher/enrichment/ChunkEnrichmentResult.java @@ -0,0 +1,9 @@ +package com.aiteacher.enrichment; + +import java.util.List; + +public record ChunkEnrichmentResult( + List<String> entities, + ConceptFacet facet, + String summary +) {} diff --git a/backend/src/main/java/com/aiteacher/enrichment/ChunkEnrichmentService.java b/backend/src/main/java/com/aiteacher/enrichment/ChunkEnrichmentService.java new file mode 100644 index 0000000..7e751ef --- /dev/null +++ b/backend/src/main/java/com/aiteacher/enrichment/ChunkEnrichmentService.java @@ -0,0 +1,135 @@ +package com.aiteacher.enrichment; + +import com.aiteacher.document.SectionEntity; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; +import org.springframework.ai.chat.client.ChatClient; +import org.springframework.stereotype.Service; + +import java.util.ArrayList; +import java.util.List; +import java.util.Locale; + +@Service +public class ChunkEnrichmentService { + + public static final String MODEL_VERSION = "v1"; + private static final int MAX_ENTITIES = 8; + + private static final Logger log = LoggerFactory.getLogger(ChunkEnrichmentService.class); + + private static final String SYSTEM_PROMPT = """ + You are a medical indexing assistant that classifies neurosurgery textbook excerpts. + For each excerpt you receive, extract three fields: + - entities: the medical concepts, conditions, procedures, tools, or anatomical + structures the excerpt is ABOUT. Normalise each to lowercase, singular canonical + English form. 
Expand abbreviations (e.g. "SAH" -> "subarachnoid hemorrhage"). + Avoid generic words ("patient", "technique"). Cap at %d entities. + + - facet: exactly one of the following. Pick the SINGLE best fit based on the + excerpt's PRIMARY teaching purpose. Use OTHER only when nothing else applies. + + DEFINITION — defines the entity / syndrome / concept ("what is X"). + ANATOMY — neuroanatomy, vascular/tract relationships, operative + landmarks, anatomical variants. + PATHOPHYSIOLOGY — mechanism of disease, etiology, natural history, + molecular/cellular basis. + EPIDEMIOLOGY — incidence, prevalence, demographics, risk factors. + CLINICAL_PRESENTATION — symptoms, signs, neurological exam findings, syndromes + as they present in patients. + IMAGING — CT / MRI / angiography / DSA / ultrasound features and + interpretation. If the excerpt describes HOW something + looks on imaging, use IMAGING. + CLASSIFICATION — named grading scales, staging systems, subtype + taxonomies (Hunt-Hess, WFNS, Fisher, Spetzler-Martin, + GCS, Karnofsky, mRS, Simpson, etc.). If the excerpt + defines or applies a named scale, use CLASSIFICATION + even if it is grounded in imaging or clinical exam. + INDICATIONS — when to operate / treat / observe; patient selection + criteria; contraindications. + SURGICAL_TECHNIQUE — operative approach, positioning, steps, landmarks, + instruments, implants, intraoperative monitoring. + NONSURGICAL_MANAGEMENT — medical therapy, endovascular treatment, stereotactic + radiosurgery, conservative / observational management. + COMPLICATIONS — intra- or postoperative complications, adverse events. + OUTCOMES_FOLLOWUP — prognosis, morbidity/mortality rates, recurrence, + surveillance schedules, follow-up care. + OTHER — history, philosophy, ethics, or anything not covered. + + Disambiguation rules: + * A named grading scale => CLASSIFICATION (even when grounded in imaging/exam). 
+ * Tools and implants described as part of an operation => SURGICAL_TECHNIQUE, + not a standalone facet. + * Illustrative case reports => CLINICAL_PRESENTATION. + * Imaging findings of complications => COMPLICATIONS, not IMAGING. + + - summary: one or two sentences describing what the excerpt teaches. + + Respond with the structured JSON requested. Do not fabricate content not present in + the excerpt. + """.formatted(MAX_ENTITIES); + + private final ChatClient chatClient; + + public ChunkEnrichmentService(ChatClient chatClient) { + this.chatClient = chatClient; + } + + public ChunkEnrichmentResult enrich(String chunkText, SectionEntity section, String bookTitle) { + String userPrompt = buildUserPrompt(chunkText, section, bookTitle); + + LlmOutput raw = chatClient.prompt() + .system(SYSTEM_PROMPT) + .user(userPrompt) + .call() + .entity(LlmOutput.class); + + if (raw == null) { + log.warn("LLM returned null enrichment; defaulting to OTHER"); + return new ChunkEnrichmentResult(List.of(), ConceptFacet.OTHER, ""); + } + + List<String> entities = normaliseEntities(raw.entities()); + ConceptFacet facet = parseFacet(raw.facet()); + String summary = raw.summary() != null ? raw.summary().strip() : ""; + return new ChunkEnrichmentResult(entities, facet, summary); + } + + private String buildUserPrompt(String chunkText, SectionEntity section, String bookTitle) { + String sectionTitle = section != null && section.getTitle() != null ? 
section.getTitle() : ""; + return """ + BOOK: %s + SECTION: %s + EXCERPT: + --- + %s + --- + """.formatted(bookTitle, sectionTitle, chunkText); + } + + private List<String> normaliseEntities(List<String> raw) { + if (raw == null) return List.of(); + List<String> out = new ArrayList<>(); + for (String e : raw) { + if (e == null) continue; + String canonical = e.trim().toLowerCase(Locale.ROOT); + if (canonical.isEmpty()) continue; + if (!out.contains(canonical)) out.add(canonical); + if (out.size() >= MAX_ENTITIES) break; + } + return out; + } + + private ConceptFacet parseFacet(String raw) { + if (raw == null) return ConceptFacet.OTHER; + try { + return ConceptFacet.valueOf(raw.trim().toUpperCase(Locale.ROOT)); + } catch (IllegalArgumentException ex) { + log.warn("LLM returned unknown facet '{}', defaulting to OTHER", raw); + return ConceptFacet.OTHER; + } + } + + // DTO for Spring AI structured output; facet is read as String so we can defend against bad values + public record LlmOutput(List<String> entities, String facet, String summary) {} +} diff --git a/backend/src/main/java/com/aiteacher/enrichment/ChunkMetadataEntity.java b/backend/src/main/java/com/aiteacher/enrichment/ChunkMetadataEntity.java new file mode 100644 index 0000000..ab39b8b --- /dev/null +++ b/backend/src/main/java/com/aiteacher/enrichment/ChunkMetadataEntity.java @@ -0,0 +1,71 @@ +package com.aiteacher.enrichment; + +import jakarta.persistence.*; +import org.hibernate.annotations.JdbcTypeCode; +import org.hibernate.type.SqlTypes; + +import java.time.Instant; +import java.util.List; +import java.util.UUID; + +@Entity +@Table(name = "chunk_metadata") +@org.hibernate.annotations.Check( + name = "chunk_metadata_facet_check", + constraints = "facet IN ('DEFINITION','ANATOMY','PATHOPHYSIOLOGY','EPIDEMIOLOGY'," + + "'CLINICAL_PRESENTATION','IMAGING','CLASSIFICATION','INDICATIONS'," + + "'SURGICAL_TECHNIQUE','NONSURGICAL_MANAGEMENT','COMPLICATIONS'," + + "'OUTCOMES_FOLLOWUP','OTHER')") +public class ChunkMetadataEntity { + + @Id + 
@Column(name = "chunk_id", nullable = false) + private UUID chunkId; + + @Column(name = "book_id", nullable = false) + private UUID bookId; + + @Column(name = "section_id", nullable = false, length = 200) + private String sectionId; + + @Enumerated(EnumType.STRING) + @Column(name = "facet", nullable = false, length = 32) + private ConceptFacet facet; + + @JdbcTypeCode(SqlTypes.JSON) + @Column(name = "entities", nullable = false, columnDefinition = "jsonb") + private List<String> entities; + + @Column(name = "summary", nullable = false, columnDefinition = "TEXT") + private String summary; + + @Column(name = "model_version", nullable = false, length = 32) + private String modelVersion; + + @Column(name = "enriched_at", nullable = false) + private Instant enrichedAt; + + protected ChunkMetadataEntity() {} + + public ChunkMetadataEntity(UUID chunkId, UUID bookId, String sectionId, + ConceptFacet facet, List<String> entities, String summary, + String modelVersion, Instant enrichedAt) { + this.chunkId = chunkId; + this.bookId = bookId; + this.sectionId = sectionId; + this.facet = facet; + this.entities = entities; + this.summary = summary; + this.modelVersion = modelVersion; + this.enrichedAt = enrichedAt; + } + + public UUID getChunkId() { return chunkId; } + public UUID getBookId() { return bookId; } + public String getSectionId() { return sectionId; } + public ConceptFacet getFacet() { return facet; } + public List<String> getEntities() { return entities; } + public String getSummary() { return summary; } + public String getModelVersion() { return modelVersion; } + public Instant getEnrichedAt() { return enrichedAt; } +} diff --git a/backend/src/main/java/com/aiteacher/enrichment/ChunkMetadataRepository.java b/backend/src/main/java/com/aiteacher/enrichment/ChunkMetadataRepository.java new file mode 100644 index 0000000..f38339c --- /dev/null +++ b/backend/src/main/java/com/aiteacher/enrichment/ChunkMetadataRepository.java @@ -0,0 +1,36 @@ +package com.aiteacher.enrichment; + +import 
org.springframework.data.jpa.repository.JpaRepository; +import org.springframework.data.jpa.repository.Query; +import org.springframework.data.repository.query.Param; +import org.springframework.stereotype.Repository; +import org.springframework.transaction.annotation.Transactional; + +import java.util.Collection; +import java.util.List; +import java.util.UUID; + +@Repository +public interface ChunkMetadataRepository extends JpaRepository<ChunkMetadataEntity, UUID> { + + long countByBookId(UUID bookId); + + @Query(value = """ + SELECT * FROM chunk_metadata + WHERE book_id = :bookId + AND entities @> to_jsonb(CAST(:entity AS text)) + """, nativeQuery = true) + List<ChunkMetadataEntity> findByBookIdAndEntityContains(@Param("bookId") UUID bookId, + @Param("entity") String entity); + + @Query(value = """ + SELECT * FROM chunk_metadata + WHERE entities @> to_jsonb(CAST(:entity AS text)) + """, nativeQuery = true) + List<ChunkMetadataEntity> findByEntityContains(@Param("entity") String entity); + + List<ChunkMetadataEntity> findByChunkIdIn(Collection<UUID> chunkIds); + + @Transactional + void deleteByBookId(UUID bookId); +} diff --git a/backend/src/main/java/com/aiteacher/enrichment/ConceptFacet.java b/backend/src/main/java/com/aiteacher/enrichment/ConceptFacet.java new file mode 100644 index 0000000..7962d2a --- /dev/null +++ b/backend/src/main/java/com/aiteacher/enrichment/ConceptFacet.java @@ -0,0 +1,27 @@ +package com.aiteacher.enrichment; + +public enum ConceptFacet { + DEFINITION("Definition & Overview"), + ANATOMY("Anatomy"), + PATHOPHYSIOLOGY("Pathophysiology"), + EPIDEMIOLOGY("Epidemiology"), + CLINICAL_PRESENTATION("Clinical Presentation"), + IMAGING("Imaging"), + CLASSIFICATION("Classification & Grading"), + INDICATIONS("Indications & Patient Selection"), + SURGICAL_TECHNIQUE("Surgical Technique"), + NONSURGICAL_MANAGEMENT("Non-surgical Management"), + COMPLICATIONS("Complications"), + OUTCOMES_FOLLOWUP("Outcomes & Follow-up"), + OTHER("Other"); + + private final String displayTitle; + + ConceptFacet(String displayTitle) { + this.displayTitle = 
displayTitle; + } + + public String displayTitle() { + return displayTitle; + } +} diff --git a/backend/src/main/java/com/aiteacher/enrichment/EnrichmentBackfillService.java b/backend/src/main/java/com/aiteacher/enrichment/EnrichmentBackfillService.java new file mode 100644 index 0000000..4552a51 --- /dev/null +++ b/backend/src/main/java/com/aiteacher/enrichment/EnrichmentBackfillService.java @@ -0,0 +1,138 @@ +package com.aiteacher.enrichment; + +import com.aiteacher.document.SectionEntity; +import com.aiteacher.document.SectionRepository; +import com.fasterxml.jackson.core.JsonProcessingException; +import com.fasterxml.jackson.databind.JsonNode; +import com.fasterxml.jackson.databind.ObjectMapper; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; +import org.springframework.ai.document.Document; +import org.springframework.jdbc.core.JdbcTemplate; +import org.springframework.scheduling.annotation.Async; +import org.springframework.stereotype.Service; + +import java.time.Instant; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import java.util.Optional; +import java.util.UUID; +import java.util.concurrent.ConcurrentHashMap; + +@Service +public class EnrichmentBackfillService { + + private static final Logger log = LoggerFactory.getLogger(EnrichmentBackfillService.class); + + private final JdbcTemplate jdbcTemplate; + private final ChunkEnrichmentService enrichmentService; + private final ChunkMetadataRepository metadataRepository; + private final SectionRepository sectionRepository; + private final ObjectMapper objectMapper; + private final Map<UUID, BackfillProgress> progressByBook = new ConcurrentHashMap<>(); + + public EnrichmentBackfillService(JdbcTemplate jdbcTemplate, + ChunkEnrichmentService enrichmentService, + ChunkMetadataRepository metadataRepository, + SectionRepository sectionRepository, + ObjectMapper objectMapper) { + this.jdbcTemplate = jdbcTemplate; + this.enrichmentService = enrichmentService; + this.metadataRepository = 
metadataRepository; + this.sectionRepository = sectionRepository; + this.objectMapper = objectMapper; + } + + public BackfillProgress getProgress(UUID bookId) { + return progressByBook.getOrDefault(bookId, BackfillProgress.idle()); + } + + @Async + public void backfillBook(UUID bookId, String bookTitle) { + List<Document> pending = listUnenrichedChunks(bookId); + int total = pending.size(); + progressByBook.put(bookId, new BackfillProgress("RUNNING", total, 0, null)); + log.info("Backfill starting for book {} — {} chunks pending", bookId, total); + + int done = 0; + Map<String, SectionEntity> sectionCache = new HashMap<>(); + for (Document chunk : pending) { + try { + String sectionId = (String) chunk.getMetadata().get("section_id"); + SectionEntity section = sectionId != null + ? sectionCache.computeIfAbsent(sectionId, + id -> sectionRepository.findById(id).orElse(null)) + : null; + ChunkEnrichmentResult result = enrichmentService.enrich(chunk.getText(), section, bookTitle); + UUID chunkId = UUID.fromString(chunk.getId()); + metadataRepository.save(new ChunkMetadataEntity( + chunkId, bookId, sectionId != null ? sectionId : "", + result.facet(), result.entities(), result.summary(), + ChunkEnrichmentService.MODEL_VERSION, Instant.now())); + } catch (Exception ex) { + log.warn("Backfill failed for chunk {} of book {}: {}", chunk.getId(), bookId, ex.getMessage()); + } + done++; + progressByBook.put(bookId, new BackfillProgress("RUNNING", total, done, null)); + } + progressByBook.put(bookId, new BackfillProgress("COMPLETED", total, done, null)); + log.info("Backfill finished for book {} — {}/{} enriched", bookId, done, total); + } + + private List<Document> listUnenrichedChunks(UUID bookId) { + // Left anti-join against chunk_metadata so re-runs are cheap. + String sql = """ + SELECT vs.id, vs.content, vs.metadata::text AS metadata_text + FROM vector_store vs + LEFT JOIN chunk_metadata cm ON cm.chunk_id = vs.id + WHERE vs.metadata->>'book_id' = ? 
+ AND vs.metadata->>'type' = 'TEXT' + AND cm.chunk_id IS NULL + """; + return jdbcTemplate.query(sql, (rs, rowNum) -> { + String id = rs.getString("id"); + String content = rs.getString("content"); + String metaJson = rs.getString("metadata_text"); + Map<String, Object> meta = parseMetadata(metaJson); + return new Document(id, content != null ? content : "", meta); + }, bookId.toString()); + } + + private Map<String, Object> parseMetadata(String json) { + if (json == null || json.isBlank()) return Map.of(); + try { + JsonNode node = objectMapper.readTree(json); + Map<String, Object> out = new HashMap<>(); + node.properties().forEach(e -> { + JsonNode v = e.getValue(); + if (v.isTextual()) out.put(e.getKey(), v.asText()); + else if (v.isInt()) out.put(e.getKey(), v.asInt()); + else if (v.isLong()) out.put(e.getKey(), v.asLong()); + else if (v.isBoolean()) out.put(e.getKey(), v.asBoolean()); + else out.put(e.getKey(), v.toString()); + }); + return out; + } catch (JsonProcessingException ex) { + log.warn("Failed to parse vector_store metadata JSON: {}", ex.getMessage()); + return Map.of(); + } + } + + public Optional<Integer> countEnrichedChunks(UUID bookId) { + return Optional.of((int) metadataRepository.countByBookId(bookId)); + } + + public int countTotalTextChunks(UUID bookId) { + Integer n = jdbcTemplate.queryForObject( + "SELECT COUNT(*) FROM vector_store WHERE metadata->>'book_id' = ? AND metadata->>'type' = 'TEXT'", + Integer.class, bookId.toString()); + return n != null ? 
n : 0; + } + + public record BackfillProgress(String status, int chunksTotal, int chunksEnriched, String errorMessage) { + public static BackfillProgress idle() { + return new BackfillProgress("IDLE", 0, 0, null); + } + } +} diff --git a/backend/src/main/java/com/aiteacher/enrichment/EnrichmentController.java b/backend/src/main/java/com/aiteacher/enrichment/EnrichmentController.java new file mode 100644 index 0000000..a74e15d --- /dev/null +++ b/backend/src/main/java/com/aiteacher/enrichment/EnrichmentController.java @@ -0,0 +1,50 @@ +package com.aiteacher.enrichment; + +import com.aiteacher.book.Book; +import com.aiteacher.book.BookRepository; +import org.springframework.http.HttpStatus; +import org.springframework.http.ResponseEntity; +import org.springframework.web.bind.annotation.*; + +import java.util.NoSuchElementException; +import java.util.UUID; + +@RestController +@RequestMapping("/api/v1/admin/books/{id}/enrich") +public class EnrichmentController { + + private final BookRepository bookRepository; + private final EnrichmentBackfillService backfillService; + + public EnrichmentController(BookRepository bookRepository, + EnrichmentBackfillService backfillService) { + this.bookRepository = bookRepository; + this.backfillService = backfillService; + } + + @PostMapping + public ResponseEntity<EnrichmentBackfillService.BackfillProgress> start(@PathVariable UUID id) { + Book book = bookRepository.findById(id) + .orElseThrow(() -> new NoSuchElementException("Book not found.")); + backfillService.backfillBook(id, book.getTitle()); + int total = backfillService.countTotalTextChunks(id); + int enriched = backfillService.countEnrichedChunks(id).orElse(0); + return ResponseEntity.status(HttpStatus.ACCEPTED) + .body(new EnrichmentBackfillService.BackfillProgress("RUNNING", total, enriched, null)); + } + + @GetMapping + public ResponseEntity<EnrichmentBackfillService.BackfillProgress> status(@PathVariable UUID id) { + bookRepository.findById(id) + .orElseThrow(() -> new NoSuchElementException("Book not found.")); + 
EnrichmentBackfillService.BackfillProgress progress = backfillService.getProgress(id); + if ("IDLE".equals(progress.status())) { + int total = backfillService.countTotalTextChunks(id); + int enriched = backfillService.countEnrichedChunks(id).orElse(0); + progress = new EnrichmentBackfillService.BackfillProgress( + enriched >= total && total > 0 ? "COMPLETED" : "IDLE", + total, enriched, null); + } + return ResponseEntity.ok(progress); + } +} diff --git a/backend/src/main/java/com/aiteacher/topic/TopicSummaryService.java b/backend/src/main/java/com/aiteacher/topic/TopicSummaryService.java index e6ddfc8..3b3ca37 100644 --- a/backend/src/main/java/com/aiteacher/topic/TopicSummaryService.java +++ b/backend/src/main/java/com/aiteacher/topic/TopicSummaryService.java @@ -27,9 +27,9 @@ public class TopicSummaryService { private static final Logger log = LoggerFactory.getLogger(TopicSummaryService.class); private static final String SYSTEM_PROMPT = """ - You are an expert neurosurgery educator. Your role is to provide accurate, - clinically relevant summaries based ONLY on the content retrieved from the - uploaded medical textbooks. Do not use any knowledge outside the provided context. + You are an expert neurosurgery educator. Your role is to provide accurate, detailed but synthetically concise educational reports on neurosurgery topics, based on the content retrieved from the uploaded medical textbooks. Your audience is highly experienced neurosurgeons, who are looking for a comprehensive yet digestible overview of a specific topic. + When generating reports, your primary goal is to distill the most important and clinically relevant information about the topic. This includes key concepts, anatomical details, surgical techniques, clinical considerations, and any other information that would be essential for a neurosurgeon to understand the topic thoroughly. + Base your reports on uploaded medical textbooks. Do not use any knowledge outside the provided context. 
When answering: - Structure your response clearly with key points @@ -79,7 +79,7 @@ public class TopicSummaryService { allFigures.addAll(result.figures()); } - log.debug("Topic summary for '{}': {} sections, {} figures retrieved", + log.debug("Topic reports for '{}': {} sections, {} figures retrieved", topic.getName(), allSections.size(), allFigures.size()); String contextPrompt = buildContextPrompt(question, allSections, allFigures); @@ -134,9 +134,8 @@ public class TopicSummaryService { private String buildQuestion(Topic topic) { return String.format( - "Provide a comprehensive educational summary of the following neurosurgery topic: " + - "%s. Topic description: %s. " + - "Include key concepts, diagrams, illustations and clinical considerations, and important details that a neurosurgeon should know.", + "Provide a comprehensive educational report of the following neurosurgery topic: " + + "%s. Topic description: %s. ", topic.getName(), topic.getDescription() ); } diff --git a/backend/src/main/resources/application.yaml b/backend/src/main/resources/application.yaml index e2820b9..5723e60 100644 --- a/backend/src/main/resources/application.yaml +++ b/backend/src/main/resources/application.yaml @@ -7,7 +7,7 @@ spring: jpa: hibernate: - ddl-auto: update + ddl-auto: none show-sql: false properties: hibernate: @@ -30,7 +30,8 @@ spring: api-key: ${OPENAI_API_KEY:} chat: options: - model: gpt-4o-mini + model: o4-mini + reasoning-effort: high embedding: options: model: "text-embedding-3-small" diff --git a/backend/src/main/resources/db/migration/V7__chunk_metadata.sql b/backend/src/main/resources/db/migration/V7__chunk_metadata.sql new file mode 100644 index 0000000..26223c9 --- /dev/null +++ b/backend/src/main/resources/db/migration/V7__chunk_metadata.sql @@ -0,0 +1,14 @@ +CREATE TABLE chunk_metadata ( + chunk_id UUID PRIMARY KEY, + book_id UUID NOT NULL, + section_id VARCHAR(200) NOT NULL, + facet VARCHAR(32) NOT NULL, + entities JSONB NOT NULL, + summary TEXT NOT 
NULL, + model_version VARCHAR(32) NOT NULL, + enriched_at TIMESTAMPTZ NOT NULL +); + +CREATE INDEX idx_chunk_metadata_book ON chunk_metadata(book_id); +CREATE INDEX idx_chunk_metadata_book_facet ON chunk_metadata(book_id, facet); +CREATE INDEX idx_chunk_metadata_entities_gin ON chunk_metadata USING GIN (entities jsonb_path_ops); diff --git a/backend/src/main/resources/db/migration/V8__concept_report.sql b/backend/src/main/resources/db/migration/V8__concept_report.sql new file mode 100644 index 0000000..adbde85 --- /dev/null +++ b/backend/src/main/resources/db/migration/V8__concept_report.sql @@ -0,0 +1,11 @@ +CREATE TABLE concept_report ( + id UUID PRIMARY KEY DEFAULT gen_random_uuid(), + topic_id VARCHAR(100) NOT NULL, + report_number INT NOT NULL, + facets_json TEXT NOT NULL, + sources_json TEXT NOT NULL, + generated_at TIMESTAMPTZ NOT NULL, + UNIQUE (topic_id, report_number) +); + +CREATE INDEX idx_concept_report_topic ON concept_report(topic_id, report_number); diff --git a/backend/src/main/resources/db/migration/V9__chunk_metadata_facet_check.sql b/backend/src/main/resources/db/migration/V9__chunk_metadata_facet_check.sql new file mode 100644 index 0000000..8ac74dd --- /dev/null +++ b/backend/src/main/resources/db/migration/V9__chunk_metadata_facet_check.sql @@ -0,0 +1,19 @@ +ALTER TABLE chunk_metadata DROP CONSTRAINT IF EXISTS chunk_metadata_facet_check; + +ALTER TABLE chunk_metadata + ADD CONSTRAINT chunk_metadata_facet_check + CHECK (facet IN ( + 'DEFINITION', + 'ANATOMY', + 'PATHOPHYSIOLOGY', + 'EPIDEMIOLOGY', + 'CLINICAL_PRESENTATION', + 'IMAGING', + 'CLASSIFICATION', + 'INDICATIONS', + 'SURGICAL_TECHNIQUE', + 'NONSURGICAL_MANAGEMENT', + 'COMPLICATIONS', + 'OUTCOMES_FOLLOWUP', + 'OTHER' + )); diff --git a/chunk-enrichment.md b/chunk-enrichment.md new file mode 100644 index 0000000..61bf11b --- /dev/null +++ b/chunk-enrichment.md @@ -0,0 +1,172 @@ +# Concept Retrieval via Indexing-Time Chunk Enrichment + +## Context + +Vector similarity alone can't answer 
"tell me everything about aneurysms." It surfaces the chunks most *linguistically* similar to the query, not the set of all chunks that *concern* the concept — and it has no notion of whether each chunk is a definition, a case, a technique, or a complication. + +The unlock is to move intelligence from query time to indexing time: for every text chunk, use an LLM to extract **structured metadata** (entities, facet, summary). At retrieval time, concept lookup becomes an SQL filter (`entities @> ['aneurysm']`) bucketed by facet — deterministic, exhaustive, and organized by default. Vector search remains as a fallback for typos / synonyms and for ranking within a facet. + +This plan covers: (1) defining the metadata schema, (2) enriching chunks during new book ingestion, (3) back-filling the already-embedded corpus via an admin endpoint, (4) a new concept retrieval path, and (5) a Topics-page UI to surface the result. + +## Approach + +### 1. Data model — new `chunk_metadata` table + +Flyway migration `backend/src/main/resources/db/migration/V7__chunk_metadata.sql`: + +```sql +CREATE TABLE chunk_metadata ( + chunk_id VARCHAR(64) PRIMARY KEY, -- same UUID that TextChunkingService issues and stores in vectorstore + book_id UUID NOT NULL, + section_id VARCHAR(255) NOT NULL, + facet VARCHAR(32) NOT NULL, -- enum (see ConceptFacet) + entities JSONB NOT NULL, -- canonical lowercase string[] + summary TEXT NOT NULL, + model_version VARCHAR(32) NOT NULL, -- records which LLM/prompt version tagged this chunk + enriched_at TIMESTAMPTZ NOT NULL +); +CREATE INDEX idx_chunk_metadata_book ON chunk_metadata(book_id); +CREATE INDEX idx_chunk_metadata_book_facet ON chunk_metadata(book_id, facet); +CREATE INDEX idx_chunk_metadata_entities_gin ON chunk_metadata USING GIN (entities jsonb_path_ops); +``` + +Why `chunk_id` is the natural key: `TextChunkingService` already generates a UUID per chunk, uses it as the pgvector Document id, stores it in metadata, and it's the key in 
`ChunkFigureRefEntity` — so the table joins cleanly to everything already in place. + +### 2. Enrichment service & facet taxonomy + +New package `com.aiteacher.enrichment`: + +- `ConceptFacet` enum — 13 values tailored to neurosurgery textbooks: `DEFINITION, ANATOMY, PATHOPHYSIOLOGY, EPIDEMIOLOGY, CLINICAL_PRESENTATION, IMAGING, CLASSIFICATION, INDICATIONS, SURGICAL_TECHNIQUE, NONSURGICAL_MANAGEMENT, COMPLICATIONS, OUTCOMES_FOLLOWUP, OTHER`. `OTHER` is mandatory so the LLM always has an out (no hallucinated bucketing). The prompt carries explicit disambiguation rules (named grading scales → `CLASSIFICATION`; imaging of a complication → `COMPLICATIONS`; tools inside an operation → `SURGICAL_TECHNIQUE`). +- `ChunkEnrichmentResult` — record `(List entities, ConceptFacet facet, String summary)` +- `ChunkEnrichmentService` — single method `enrich(String chunkText, SectionEntity section, String bookTitle) → ChunkEnrichmentResult`. Uses Spring AI `ChatClient.prompt().call().entity(Class)` for structured output. The prompt gives: book title, section title, chunk text, the fixed facet enum list, and instructs the model to return JSON with entities normalised to lowercase singular canonical form (e.g. "aneurysms" → "aneurysm"; "SAH" → "subarachnoid hemorrhage"). Caps entities at ~8 per chunk. +- `ChunkMetadataEntity` + `ChunkMetadataRepository` — JPA entity/repo mirroring the table. + +Model version string (e.g. `"v1"`) lives on the service and is stamped into each row so a future prompt rev can be rolled out by filtering `model_version <> 'v2'` in the backfill job. + +### 3. Hook into new book ingestion + +Modify `BookEmbeddingService.embedBook`: + +```java +// Step 3: Chunk and embed text +List allChunks = new ArrayList<>(); +for (SectionEntity section : sections) { + allChunks.addAll(textChunkingService.chunk(section, bookTitle)); +} +if (skipEmbedding) { ... 
} else { + embedInBatches(allChunks, bookId); + chunkEnrichmentPipeline.enrichAndPersist(allChunks, sectionsById, bookTitle); // NEW +} +``` + +- `ChunkEnrichmentPipeline` — new orchestrator that iterates chunks, calls `ChunkEnrichmentService.enrich(...)` per chunk, saves `ChunkMetadataEntity` rows in batches, with the same throttle pattern as `embedInBatches`. +- Runs *after* embedding, not in place of it, so a failure in enrichment doesn't corrupt the vector store. On failure, log and continue — the backfill endpoint is the universal recovery path. +- Extend `deleteBookChunks` to also delete `chunk_metadata` rows so deletion stays consistent. + +### 4. Backfill endpoint for already-embedded books + +New `EnrichmentController` in `com.aiteacher.enrichment`: + +- `POST /api/v1/admin/books/{id}/enrich` → kicks off async backfill, returns 202 with `{status, chunksTotal, chunksEnriched}` +- `GET /api/v1/admin/books/{id}/enrich` → returns progress + +Backfill flow (`EnrichmentBackfillService.backfillBook(UUID bookId)`): + +1. Query the pgvector storage table directly via `JdbcTemplate` for all chunks of the book: + ```sql + SELECT id, content, metadata + FROM vector_store + WHERE metadata->>'book_id' = ? AND metadata->>'type' = 'TEXT' + ``` +2. Left-anti-join against `chunk_metadata` to skip already-enriched chunks → idempotent, resumable. +3. For each missing chunk: look up its `SectionEntity` via `section_id` in metadata, call `ChunkEnrichmentService.enrich`, write a `ChunkMetadataEntity` row. +4. Progress tracked in an in-memory `ConcurrentHashMap` (POC scope — no cross-restart resumability needed because the left-anti-join makes re-runs free). +5. `@Async` on the backfill method using the same executor as `embedBook`. + +### 5. 
Concept retrieval path + +New `com.aiteacher.concept.ConceptRetriever`: + +```java +public ConceptRetrievalResult retrieveByConcept(String conceptKeyword, UUID bookId) { + String canonical = canonicalise(conceptKeyword); // lowercase, trim, simple plural strip + + // 5a. Primary: SQL entity match, grouped by facet + List<ChunkMetadataEntity> hits = chunkMetadataRepository + .findByBookIdAndEntityContains(bookId, canonical); // WHERE entities @> to_jsonb(?::text) + + if (hits.isEmpty()) { + // 5b. Fallback: vector search, then enrich-join + facet-group + List<Document> vectorHits = vectorStore.similaritySearch(/* TEXT filter, book_id filter, topK=30 */); + List<String> chunkIds = vectorHits.stream().map(Document::getId).toList(); + hits = chunkMetadataRepository.findByChunkIdIn(chunkIds); + } + + Map<ConceptFacet, List<ChunkMetadataEntity>> byFacet = hits.stream() + .collect(groupingBy(ChunkMetadataEntity::getFacet, LinkedHashMap::new, toList())); + + // Hydrate: load SectionEntity for each chunk's section_id; load linked figures + // via ChunkFigureRefRepository.findByChunkIdIn(chunkIds), reusing the existing linkage. + return assemble(byFacet, ...); +} +``` + +`ConceptRetrievalResult` = `Map<ConceptFacet, FacetBundle>` where each `FacetBundle` holds the parent sections, linked figures, and the per-chunk `summary` strings. + +Cross-book aggregation: caller loops over READY books and merges bundles by facet. + +### 6. Concept Report service & controller + +New `ConceptReportService` in `com.aiteacher.concept`, mirroring the shape of `TopicSummaryService`, but: + +- Calls `ConceptRetriever.retrieveByConcept(topic.getName(), bookId)` per book. +- For each facet that has hits, sends **one** LLM synthesis call with the chunks/figures of that facet, producing a structured, facet-labelled report. +- Persists in a new `concept_report` table: + +```sql +CREATE TABLE concept_report ( + id UUID PRIMARY KEY, + topic_id VARCHAR(255) NOT NULL REFERENCES topic(id), + report_number INT NOT NULL, + facets_json JSONB NOT NULL, -- [{facetKey,title,markdown,refLabels[]}, ...]
+ sources_json JSONB NOT NULL, -- deduplicated SourceReference[] + generated_at TIMESTAMPTZ NOT NULL, + UNIQUE (topic_id, report_number) +); +``` + +Controller `ConceptReportController` exposes three endpoints under `/api/v1/topics/{id}/concept-reports` (POST generate, GET list, GET `/{reportId}`). + +Reuses `TopicSummaryResponse.SourceReference` verbatim. + +### 7. Frontend + +- `frontend/src/stores/topicStore.ts`: add parallel state `conceptReportList`, `activeConceptReport`, `conceptReportLoading`, and actions mirroring the existing summary ones. +- `frontend/src/views/TopicsView.vue`: add a **Summary / Concept Report** tab toggle at the top of the topic panel. Concept Report reuses the history-chips + Generate button UI. Report body renders each `FacetSection` as `

{title}

` + markdown. +- Loading hint: update the "up to 30 seconds" copy to "up to 60 seconds". + +### 8. README update + +Add an **Indexing Pipeline** diagram showing: PDF → parse → chunk → embed → **enrich (new)** → chunk_metadata. Plus a **Concept Retrieval** sequence diagram: query → entity-match SQL → facet-grouped bundle → synthesis → report. + +## Decisions & trade-offs + +- **Storage as separate Postgres table, not vectorstore JSON**: vectorstore has no metadata-only update API, so backfill would require delete+reinsert (re-embedding cost). A dedicated table joins cleanly on `chunk_id` and is GIN-indexed. +- **Entity-match primary, vector fallback**: the entity match is deterministic for the main use case; the vector fallback keeps retrieval robust against typos/synonyms. Vector search stays the default for normal chat retrieval; this feature is additive. +- **Enrichment runs *after* embedding, not before**: keeps the two failure modes independent. The backfill endpoint is the universal recovery lever. +- **Fixed 13-value facet enum** (incl. `OTHER`): constrains LLM outputs; `OTHER` prevents forced mis-bucketing. +- **Direct `JdbcTemplate` read against `vector_store` for backfill**: Spring AI exposes no listing API. Acceptable for a POC, isolated behind one method. +- **Synchronous (sequential) LLM calls**: simplest; parallelism is a later optimisation if needed. +- **`model_version` column**: cheap insurance. If the prompt or facet taxonomy changes, backfill can re-enrich only stale rows. + +## Verification + +1. Migration applies V7 and V8. Tables and indexes created. +2. New book ingestion: upload PDF → `chunk_metadata` populated with plausible entities/facets/summaries. +3. Backfill: POST `/api/v1/admin/books/{id}/enrich` → idempotent, completes, re-run is a no-op. +4. Concept retrieval primary path: POST `/api/v1/topics/aneurysm/concept-reports` → 200 with facets populated. +5. Fallback path: misspelled topic still returns results via vector fallback. +6. 
Frontend: Concept Report tab renders facet-labelled markdown + sources + inline figures; persists across reloads. +7. Deletion: removing a book cascades to `chunk_metadata` rows. +8. Regression: existing chat and summary flows still work. +9. Lint & tests pass. diff --git a/frontend/src/components/BookCard.vue b/frontend/src/components/BookCard.vue index 9794579..879aefa 100644 --- a/frontend/src/components/BookCard.vue +++ b/frontend/src/components/BookCard.vue @@ -32,6 +32,13 @@ {{ book.status === 'PENDING' ? 'Queued for processing...' : 'Embedding in progress...' }} +
+
+ Enriching chunks {{ enrichProgress.chunksEnriched }} / {{ enrichProgress.chunksTotal }} +
+ +
{{ enrichFeedback }}
+
Read +