first implementation - image/drawing integration

2026-04-04 12:56:56 +02:00
parent fc5b22fba1
commit 5acfdd33c1
42 changed files with 2854 additions and 151 deletions
@@ -0,0 +1,305 @@
+# Data Model: Enhanced Embedding with Image Parsing and Metadata
+
+**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-03
+
+---
+
+## Overview
+
+Three storage tiers work in concert:
+
+```
+┌──────────────────────────────────────────────────────────────────┐
+│  PDF Upload                                                       │
+│     │                                                             │
+│     ▼                                                             │
+│  Parsing Pipeline                                                 │
+│     │                          │                                  │
+│     ▼                          ▼                                  │
+│  Postgres (source of truth)   pgvector (search index)            │
+│  - book                       - vector_store (text chunks)        │
+│  - chapter                    - vector_store (figure captions)    │
+│  - section (+ fullText)       File Store (images)                 │
+│  - figure (metadata)          - /uploads/figures/{bookId}/*.png  │
+│  - chunk_figure_refs                                              │
+└──────────────────────────────────────────────────────────────────┘
+```
+
+---
+
+## Postgres Schema
+
+### Existing tables (unchanged)
+
+- `book` — status, metadata, page count (V1)
+- `chat_session`, `message` — conversation (V1)
+- `vector_store` — managed by Spring AI pgvector starter (V2)
+- `topic` — predefined topics (V3)
+
+### New tables (Flyway V4)
+
+```sql
+-- V4: Document hierarchy
+
+CREATE TABLE chapter (
+    id           VARCHAR(200) PRIMARY KEY,  -- "{bookId}-ch{N}"
+    book_id      UUID NOT NULL REFERENCES book(id) ON DELETE CASCADE,
+    number       INT NOT NULL,
+    title        VARCHAR(500),
+    page_start   INT,
+    created_at   TIMESTAMPTZ NOT NULL DEFAULT now()
+);
+
+CREATE TABLE section (
+    id           VARCHAR(200) PRIMARY KEY,  -- "{bookId}-ch{N}-s{X}-{Y}"
+    chapter_id   VARCHAR(200) NOT NULL REFERENCES chapter(id) ON DELETE CASCADE,
+    book_id      UUID NOT NULL REFERENCES book(id) ON DELETE CASCADE,
+    number       VARCHAR(50),               -- "2.3" or "12.2.3"
+    title        VARCHAR(500),
+    page_start   INT NOT NULL,
+    page_end     INT NOT NULL,
+    full_text    TEXT NOT NULL,             -- NOT in vector store
+    created_at   TIMESTAMPTZ NOT NULL DEFAULT now()
+);
+
+CREATE INDEX idx_section_book    ON section(book_id);
+CREATE INDEX idx_section_chapter ON section(chapter_id);
+```
+
+### New tables (Flyway V5)
+
+```sql
+-- V5: Figures and chunk→figure links
+
+CREATE TABLE figure (
+    id                    VARCHAR(200) PRIMARY KEY, -- "{bookId}-fig-{label}"
+    book_id               UUID NOT NULL REFERENCES book(id) ON DELETE CASCADE,
+    section_id            VARCHAR(200) REFERENCES section(id) ON DELETE SET NULL,
+    chapter_id            VARCHAR(200) REFERENCES chapter(id) ON DELETE SET NULL,
+    label                 VARCHAR(100),             -- "Fig. 12-4"
+    caption               TEXT,
+    figure_type           VARCHAR(50) NOT NULL,     -- FigureType enum name
+    page                  INT NOT NULL,
+    image_path            VARCHAR(1000) NOT NULL,   -- relative path on disk
+    caption_embedding_id  UUID,                     -- ID in vector_store
+    created_at            TIMESTAMPTZ NOT NULL DEFAULT now()
+);
+
+CREATE TABLE chunk_figure_ref (
+    chunk_id      UUID NOT NULL,         -- vector_store document ID
+    figure_id     VARCHAR(200) NOT NULL REFERENCES figure(id) ON DELETE CASCADE,
+    mention_page  INT,
+    PRIMARY KEY (chunk_id, figure_id)
+);
+
+CREATE INDEX idx_figure_book    ON figure(book_id);
+CREATE INDEX idx_cfr_chunk      ON chunk_figure_ref(chunk_id);
+```
+
+---
+
+## Java Domain Records
+
+### Document hierarchy (new package `com.aiteacher.document`)
+
+```java
+// Root — in-memory only, not a JPA entity
+public record BookNode(
+    String bookId,
+    String title,
+    String isbn,
+    String edition,
+    List<String> authors,
+    List<ChapterNode> chapters
+) {}
+
+// Chapter — maps to `chapter` table
+public record ChapterNode(
+    String chapterId,
+    String bookId,
+    int number,
+    String title,
+    int pageStart,
+    List<SectionNode> sections
+) {}
+
+// Section — maps to `section` table; fullText stays in Postgres
+public record SectionNode(
+    String sectionId,
+    String chapterId,
+    String bookId,
+    String number,
+    String title,
+    int pageStart,
+    int pageEnd,
+    String fullText,
+    List<TextChunkNode> chunks,
+    List<FigureNode> figures
+) {}
+
+// Text chunk — embedded into vector_store; references its parent section
+public record TextChunkNode(
+    String chunkId,          // UUID → becomes vector_store document ID
+    String sectionId,
+    String chapterId,
+    String bookId,
+    String text,
+    int chunkIndex,
+    int totalChunksInSection,
+    int pageStart,
+    int pageEnd,
+    Map<String, Object> metadata   // flattened for Spring AI filtering
+) {
+    public Map<String, Object> toMetadata() {
+        return Map.of(
+            "type",          "TEXT",
+            "book_id",       bookId,
+            "chapter_id",    chapterId,
+            "section_id",    sectionId,
+            "section_title", /* from parent SectionNode */,
+            "page_start",    pageStart,
+            "page_end",      pageEnd,
+            "chunk_index",   chunkIndex,
+            "total_chunks",  totalChunksInSection
+        );
+    }
+}
+
+// Figure — maps to `figure` table; caption embedded into vector_store
+public record FigureNode(
+    String figureId,
+    String sectionId,
+    String chapterId,
+    String bookId,
+    String label,            // "Fig. 12-4"
+    String caption,
+    FigureType type,
+    int page,
+    String imagePath,        // relative: "figures/{bookId}/{figureId}.png"
+    UUID captionEmbeddingId  // ID in vector_store
+) {}
+```
+
+### Figure type enum
+
+```java
+public enum FigureType {
+    ANATOMICAL_DIAGRAM,
+    SURGICAL_PHOTOGRAPH,
+    MRI_CT_SCAN,
+    TABLE,
+    CHART,
+    INTRAOPERATIVE_IMAGE
+}
+```
+
+Classification heuristic (applied to caption + surrounding text):
+
+| Keyword(s) | FigureType |
+|-----------|-----------|
+| `MRI`, `CT`, `magnetic`, `resonance`, `tomography` | `MRI_CT_SCAN` |
+| `intraoperative`, `intra-op` | `INTRAOPERATIVE_IMAGE` |
+| `table`, `Table` (at line start) | `TABLE` |
+| `chart`, `graph`, `histogram` | `CHART` |
+| `photograph`, `photo` | `SURGICAL_PHOTOGRAPH` |
+| (default) | `ANATOMICAL_DIAGRAM` |
+
+### Chunk–figure join record
+
+```java
+// Maps to `chunk_figure_ref` table
+public record ChunkFigureRef(
+    UUID chunkId,
+    String figureId,
+    int mentionPage
+) {}
+```
+
+---
+
+## Vector Store Documents
+
+All documents in `vector_store` carry a `metadata` JSON column with a `type` field for filtering.
+
+### Text chunk document
+
+| Field | Value |
+|-------|-------|
+| `content` | chunk text (400–600 tokens) |
+| `metadata.type` | `"TEXT"` |
+| `metadata.book_id` | book UUID |
+| `metadata.book_title` | book title string |
+| `metadata.chapter_id` | chapter ID string |
+| `metadata.section_id` | section ID string |
+| `metadata.section_title` | section title string |
+| `metadata.page_start` | int |
+| `metadata.page_end` | int |
+| `metadata.chunk_index` | int (0-based) |
+| `metadata.total_chunks` | int |
+
+### Figure caption document
+
+| Field | Value |
+|-------|-------|
+| `content` | vision-generated description + caption text |
+| `metadata.type` | `"FIGURE"` |
+| `metadata.book_id` | book UUID |
+| `metadata.book_title` | book title string |
+| `metadata.chapter_id` | chapter ID string |
+| `metadata.section_id` | section ID string |
+| `metadata.figure_id` | figure ID string |
+| `metadata.figure_type` | enum name string |
+| `metadata.image_path` | relative file path |
+| `metadata.label` | caption label e.g. `"Fig. 12-4"` |
+| `metadata.page` | int |
+
+---
+
+## File Store Layout
+
+```
+uploads/
+└── figures/
+    └── {bookId}/
+        ├── {figureId}.png
+        └── ...
+```
+
+- Base path configurable via `app.figure-storage.base-path` (default: `./uploads`)
+- Files are served via `GET /api/v1/figures/{bookId}/{filename}` (static resource mapping)
+- Gitignored; not version-controlled
+
+---
+
+## State Transitions
+
+Book processing extends the existing `BookStatus` state machine:
+
+```
+PENDING → PROCESSING → READY
+                    ↘ FAILED
+```
+
+During `PROCESSING`:
+1. Parse PDF structure → extract chapters/sections → persist to Postgres
+2. Split sections into text chunks → embed → write to vector_store
+3. Extract images per page → filter by min size → save PNG → generate vision description → embed caption → write figure to Postgres + vector_store
+4. Write chunk_figure_refs for all detected figure references in text
+
+Failure at step 3 (individual page) → log + skip that page's images; continue.  
+Failure at any other step → set `BookStatus.FAILED`.
+
+---
+
+## Retrieval Result Structure
+
+```java
+public record RetrievalResult(
+    List<SectionNode> parentSections,    // expanded full-text context
+    List<Document> figureVectorHits,     // semantic figure matches
+    List<FigureNode> linkedFigures       // figures explicitly referenced in text chunks
+) {}
+```
+
+The `NeurosurgeryRetriever` service deduplicates figures across both lists before passing
+the result to the LLM prompt builder.