add new concept report

Adrien
2026-04-18 17:54:54 +02:00
parent 5f03e1f41b
commit c7a77af2f4
29 changed files with 1892 additions and 41 deletions
@@ -35,11 +35,13 @@ graph TD
EP3["Vision describe → embed caption"]
EP4["Chunk text → embed chunks"]
EP5["Link chunks ↔ figures"]
EP6["LLM enrich chunk\n(entities, facet, summary)\n→ chunk_metadata"]
EP1 --> EP2
EP1 --> EP4
EP2 --> EP3
EP4 --> EP5
EP3 --> EP5
EP4 --> EP6
end
subgraph "Retrieval Pipeline (per chat query)"
@@ -65,6 +67,50 @@ graph TD
end
```
### Concept Retrieval Pipeline (per concept report)
Concept retrieval is an alternative to the semantic-similarity flow above. It uses the
LLM-tagged `chunk_metadata` rows written at indexing time to exhaustively gather every
chunk that *concerns* a concept (e.g. "aneurysm") and groups those chunks by facet. One
synthesis call per non-empty facet then yields a structured, multi-section report.
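The bucketing step can be illustrated with a minimal sketch. It is written in Python for brevity and runs against an in-memory list rather than the real `chunk_metadata` table. The row fields (`chunk_id`, `entities`, `facet`), the facet names, and the helper `gather_by_concept` are all assumptions made for illustration, not the project's schema or API.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for chunk_metadata rows (field names assumed).
ROWS = [
    {"chunk_id": "c1", "entities": ["aneurysm", "stent"], "facet": "treatment"},
    {"chunk_id": "c2", "entities": ["aneurysm"], "facet": "diagnosis"},
    {"chunk_id": "c3", "entities": ["stroke"], "facet": "diagnosis"},
    {"chunk_id": "c4", "entities": ["aneurysm"], "facet": "treatment"},
]


def gather_by_concept(rows, canonical):
    """Return {facet: [chunk_id, ...]} for every row tagged with the concept."""
    buckets = defaultdict(list)
    for row in rows:
        # Mirrors a containment test on the entities column.
        if canonical in row["entities"]:
            buckets[row["facet"]].append(row["chunk_id"])
    return dict(buckets)


buckets = gather_by_concept(ROWS, "aneurysm")
for facet, chunk_ids in buckets.items():
    # In the real pipeline each non-empty facet would get one synthesis call.
    print(facet, chunk_ids)
```

Because every tagged row is scanned, no chunk mentioning the concept is dropped by a top-K cutoff, which is the point of the exhaustive approach.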
```mermaid
sequenceDiagram
participant User
participant FE as Frontend
participant BE as Backend (ConceptReportService)
participant Retr as ConceptRetriever
participant DB as chunk_metadata (GIN)
participant Vec as vector_store
participant LLM
User->>FE: Click "Generate Concept Report" on topic
FE->>BE: POST /api/v1/topics/{id}/concept-reports
loop per READY book
BE->>Retr: retrieveByConcept(topicName, bookId)
Retr->>DB: WHERE entities @> [canonical]
alt SQL hits found
DB-->>Retr: chunks grouped by facet
else no match (typo / synonym)
Retr->>Vec: similaritySearch topK=30
Vec-->>Retr: chunk ids
Retr->>DB: findByChunkIdIn → group by facet
end
end
BE->>BE: merge facets across books, assign global [S#]/[F#]
loop per non-empty facet
BE->>LLM: synthesize facet section (focused prompt)
LLM-->>BE: facet markdown
end
BE->>BE: persist concept_report
BE-->>FE: { facets[], sources[] }
FE->>User: render facet-labelled report + inline figures
```
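The two decisions in the diagram (exact lookup first with a vector fallback, then cross-book merging with global source labels) can be sketched as follows. This is an illustrative Python sketch, not the service's implementation: the callables `tagged_lookup` and `similarity_search`, the `facet_of` mapping, and the numbering order of the `[S#]` labels are assumptions. Figure labels (`[F#]`) are left out.

```python
def retrieve_by_concept(canonical, tagged_lookup, similarity_search, facet_of, top_k=30):
    """Exact tag lookup first; fall back to similarity search when it is empty."""
    hits = tagged_lookup(canonical)  # expected shape: {facet: [chunk_id, ...]}
    if hits:
        return hits
    buckets = {}
    for chunk_id in similarity_search(canonical, top_k):
        facet = facet_of.get(chunk_id)  # chunks without metadata are skipped
        if facet is not None:
            buckets.setdefault(facet, []).append(chunk_id)
    return buckets


def merge_and_label(per_book):
    """Merge facet buckets across books and assign global [S#] labels.

    per_book is a list of (book_id, {facet: [chunk_id, ...]}) pairs.
    """
    merged, labels = {}, {}
    for book_id, buckets in per_book:
        for facet, chunk_ids in buckets.items():
            for chunk_id in chunk_ids:
                key = (book_id, chunk_id)
                if key not in labels:
                    labels[key] = f"[S{len(labels) + 1}]"
                merged.setdefault(facet, []).append(labels[key])
    return merged, labels


# Usage with stubbed dependencies (all data here is made up).
tagged = {"aneurysm": {"treatment": ["c1"]}}
lookup = lambda concept: tagged.get(concept, {})
search = lambda query, k: ["c2", "c9"][:k]
facets = {"c2": "diagnosis"}

exact = retrieve_by_concept("aneurysm", lookup, search, facets)
fallback = retrieve_by_concept("aneurism", lookup, search, facets)
```

Numbering sources globally after the merge, rather than per book, keeps each label unique across the whole report, so a facet section can cite chunks from several books without collisions.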
Backfill path for already-embedded books:
`POST /api/v1/admin/books/{id}/enrich` scans `vector_store` for TEXT chunks that have no
`chunk_metadata` row and enriches them in place. The operation is idempotent: re-running it is a no-op.
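The idempotency property follows from enriching only chunks that lack a metadata row. A minimal Python sketch, assuming a dict keyed by chunk id in place of the `chunk_metadata` table and a hypothetical `enrich` callable in place of the LLM call:

```python
def backfill(text_chunk_ids, chunk_metadata, enrich):
    """Enrich only chunks without a metadata row; return how many were enriched."""
    enriched = 0
    for chunk_id in text_chunk_ids:
        if chunk_id in chunk_metadata:
            continue  # already enriched, so a re-run does nothing for this chunk
        chunk_metadata[chunk_id] = enrich(chunk_id)
        enriched += 1
    return enriched


# Usage with a stub enricher (data is made up for illustration).
calls = []


def fake_enrich(chunk_id):
    calls.append(chunk_id)
    return {"facet": "unknown", "entities": []}


meta = {"c1": {"facet": "treatment", "entities": ["aneurysm"]}}
first = backfill(["c1", "c2", "c3"], meta, fake_enrich)
second = backfill(["c1", "c2", "c3"], meta, fake_enrich)
```

Here the first run enriches only the two missing chunks and the second run enriches none, so no LLM calls are repeated.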
## Marker API Response Structure
The PDF parsing pipeline calls a local [Marker](https://github.com/VikParuchuri/marker) server (`POST /marker/upload`).