add new concept report

2026-04-18 17:54:54 +02:00
parent 5f03e1f41b
commit c7a77af2f4
29 changed files with 1892 additions and 41 deletions
@@ -35,11 +35,13 @@ graph TD
        EP3["Vision describe → embed caption"]
        EP4["Chunk text → embed chunks"]
        EP5["Link chunks ↔ figures"]
+        EP6["LLM enrich chunk\n(entities, facet, summary)\n→ chunk_metadata"]
        EP1 --> EP2
        EP1 --> EP4
        EP2 --> EP3
        EP4 --> EP5
        EP3 --> EP5
+        EP4 --> EP6
    end

    subgraph "Retrieval Pipeline (per chat query)"
@@ -65,6 +67,50 @@ graph TD
    end
 ```

+### Concept Retrieval Pipeline (per concept report)
+
+Concept retrieval is an alternative to the semantic-similarity flow above. It uses the
+LLM-tagged `chunk_metadata` rows written at indexing time to exhaustively gather every
+chunk that *concerns* a concept (e.g. "aneurysm"), bucketed by facet. One synthesis call
+per facet yields a structured, multi-section report.
+
+```mermaid
+sequenceDiagram
+    participant User
+    participant FE as Frontend
+    participant BE as Backend (ConceptReportService)
+    participant Retr as ConceptRetriever
+    participant DB as chunk_metadata (GIN)
+    participant Vec as vector_store
+    participant LLM
+
+    User->>FE: Click "Generate Concept Report" on topic
+    FE->>BE: POST /api/v1/topics/{id}/concept-reports
+    loop per READY book
+        BE->>Retr: retrieveByConcept(topicName, bookId)
+        Retr->>DB: WHERE entities @> [canonical]
+        alt SQL hits found
+            DB-->>Retr: chunks grouped by facet
+        else no match (typo / synonym)
+            Retr->>Vec: similaritySearch topK=30
+            Vec-->>Retr: chunk ids
+            Retr->>DB: findByChunkIdIn → group by facet
+        end
+    end
+    BE->>BE: merge facets across books, assign global [S#]/[F#]
+    loop per non-empty facet
+        BE->>LLM: synthesize facet section (focused prompt)
+        LLM-->>BE: facet markdown
+    end
+    BE->>BE: persist concept_report
+    BE-->>FE: { facets[], sources[] }
+    FE->>User: render facet-labelled report + inline figures
+```
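
The retrieval and merge steps in the diagram can be sketched roughly as follows. This is an illustration in Python, not the backend's actual code: `ChunkMeta`, `retrieve_by_concept`, `merge_facets`, the lower-casing used as canonicalisation, and the in-memory stand-ins for `chunk_metadata` and `vector_store` are all assumptions. Only `[S#]` labels are assigned here; `[F#]` labels are left out.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkMeta:
    """Hypothetical stand-in for one chunk_metadata row."""
    chunk_id: str
    book_id: str
    facet: str
    entities: frozenset


def retrieve_by_concept(concept, book_id, rows, similarity_search, top_k=30):
    """Return {facet: [ChunkMeta]} for one book."""
    canonical = concept.strip().lower()  # assumed canonicalisation rule
    book_rows = [r for r in rows if r.book_id == book_id]
    # Exact path: mirrors `WHERE entities @> [canonical]`.
    hits = [r for r in book_rows if canonical in r.entities]
    if not hits:
        # Fallback path: nearest chunk ids, then map them back to metadata rows.
        ids = set(similarity_search(concept, book_id, top_k))
        hits = [r for r in book_rows if r.chunk_id in ids]
    by_facet = defaultdict(list)
    for r in hits:
        by_facet[r.facet].append(r)
    return dict(by_facet)


def merge_facets(per_book):
    """Merge per-book facet buckets; give each chunk one global [S#] label."""
    merged = defaultdict(list)
    labels = {}
    for buckets in per_book:
        for facet, chunks in buckets.items():
            for chunk in chunks:
                if chunk.chunk_id not in labels:
                    labels[chunk.chunk_id] = f"[S{len(labels) + 1}]"
                merged[facet].append((labels[chunk.chunk_id], chunk))
    return dict(merged), labels


# Made-up demo data: bookB has no exact entity match, so it takes the fallback.
ROWS = [
    ChunkMeta("c1", "bookA", "definition", frozenset({"aneurysm"})),
    ChunkMeta("c2", "bookA", "treatment", frozenset({"aneurysm", "stent"})),
    ChunkMeta("c3", "bookB", "treatment", frozenset({"aortic dilation"})),
]


def fake_similarity_search(query, book_id, top_k):
    # Pretend the vector store returns chunk c3 for any query against bookB.
    return ["c3"][:top_k] if book_id == "bookB" else []


per_book = [
    retrieve_by_concept("Aneurysm", book, ROWS, fake_similarity_search)
    for book in ("bookA", "bookB")
]
merged, labels = merge_facets(per_book)
```

Each non-empty key of `merged` would then feed one facet-level synthesis call, with the `[S#]` labels available for inline citations.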
+
+Backfill path for already-embedded books:
+`POST /api/v1/admin/books/{id}/enrich` scans `vector_store` for TEXT chunks missing
+`chunk_metadata` rows and enriches them in place. The operation is idempotent: chunks that
+already have a row are skipped, so re-running it is a no-op.
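
The idempotence of the backfill path follows from only touching chunks that lack a metadata row. A minimal sketch of that idea, assuming dictionaries in place of `vector_store` and `chunk_metadata` and a hypothetical `fake_enrich` in place of the LLM call (none of these names come from the codebase):

```python
def enrich_missing(text_chunks, chunk_metadata, enrich):
    """Enrich only chunks without a metadata row; return how many were enriched.

    text_chunks: {chunk_id: text}, standing in for TEXT rows in vector_store.
    chunk_metadata: {chunk_id: metadata dict}, mutated in place.
    enrich: callable standing in for the LLM enrichment call.
    """
    missing = [cid for cid in text_chunks if cid not in chunk_metadata]
    for cid in missing:
        chunk_metadata[cid] = enrich(text_chunks[cid])
    return len(missing)


def fake_enrich(text):
    # Placeholder for the LLM call; fields mirror (entities, facet, summary).
    return {"entities": [], "facet": "unknown", "summary": text[:40]}


chunks = {
    "c1": "An aneurysm is a localized dilation.",
    "c2": "Stents are one option.",
}
metadata = {
    "c1": {"entities": ["aneurysm"], "facet": "definition", "summary": "..."},
}

first_run = enrich_missing(chunks, metadata, fake_enrich)   # enriches c2 only
second_run = enrich_missing(chunks, metadata, fake_enrich)  # nothing left to do
```

Existing rows are never overwritten, so a second call finds no missing chunks and changes nothing.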
+
 ## Marker API Response Structure

 The PDF parsing pipeline calls a local [Marker](https://github.com/VikParuchuri/marker) server (`POST /marker/upload`).