# Data Model: Enhanced Embedding with Image Parsing and Metadata **Branch**: `002-image-aware-embedding` | **Date**: 2026-04-03 --- ## Overview Three storage tiers work in concert: ``` ┌──────────────────────────────────────────────────────────────────┐ │ PDF Upload │ │ │ │ │ ▼ │ │ Parsing Pipeline │ │ │ │ │ │ ▼ ▼ │ │ Postgres (source of truth) pgvector (search index) │ │ - book - vector_store (text chunks) │ │ - chapter - vector_store (figure captions) │ │ - section (+ fullText) File Store (images) │ │ - figure (metadata) - /uploads/figures/{bookId}/*.png │ │ - chunk_figure_refs │ └──────────────────────────────────────────────────────────────────┘ ``` --- ## Postgres Schema ### Existing tables (unchanged) - `book` — status, metadata, page count (V1) - `chat_session`, `message` — conversation (V1) - `vector_store` — managed by Spring AI pgvector starter (V2) - `topic` — predefined topics (V3) ### New tables (Flyway V4) ```sql -- V4: Document hierarchy CREATE TABLE chapter ( id VARCHAR(200) PRIMARY KEY, -- "{bookId}-ch{N}" book_id UUID NOT NULL REFERENCES book(id) ON DELETE CASCADE, number INT NOT NULL, title VARCHAR(500), page_start INT, created_at TIMESTAMPTZ NOT NULL DEFAULT now() ); CREATE TABLE section ( id VARCHAR(200) PRIMARY KEY, -- "{bookId}-ch{N}-s{X}-{Y}" chapter_id VARCHAR(200) NOT NULL REFERENCES chapter(id) ON DELETE CASCADE, book_id UUID NOT NULL REFERENCES book(id) ON DELETE CASCADE, number VARCHAR(50), -- "2.3" or "12.2.3" title VARCHAR(500), page_start INT NOT NULL, page_end INT NOT NULL, full_text TEXT NOT NULL, -- NOT in vector store created_at TIMESTAMPTZ NOT NULL DEFAULT now() ); CREATE INDEX idx_section_book ON section(book_id); CREATE INDEX idx_section_chapter ON section(chapter_id); ``` ### New tables (Flyway V5) ```sql -- V5: Figures and chunk→figure links CREATE TABLE figure ( id VARCHAR(200) PRIMARY KEY, -- "{bookId}-fig-{label}" book_id UUID NOT NULL REFERENCES book(id) ON DELETE CASCADE, section_id VARCHAR(200) REFERENCES section(id) ON DELETE SET NULL, chapter_id VARCHAR(200) REFERENCES chapter(id) ON DELETE SET NULL, label VARCHAR(100), -- "Fig. 12-4" caption TEXT, figure_type VARCHAR(50) NOT NULL, -- FigureType enum name page INT NOT NULL, image_path VARCHAR(1000) NOT NULL, -- relative path on disk caption_embedding_id UUID, -- ID in vector_store created_at TIMESTAMPTZ NOT NULL DEFAULT now() ); CREATE TABLE chunk_figure_ref ( chunk_id UUID NOT NULL, -- vector_store document ID figure_id VARCHAR(200) NOT NULL REFERENCES figure(id) ON DELETE CASCADE, mention_page INT, PRIMARY KEY (chunk_id, figure_id) ); CREATE INDEX idx_figure_book ON figure(book_id); CREATE INDEX idx_cfr_chunk ON chunk_figure_ref(chunk_id); ``` --- ## Java Domain Records ### Document hierarchy (new package `com.aiteacher.document`) ```java // Root — in-memory only, not a JPA entity public record BookNode( String bookId, String title, String isbn, String edition, List authors, List chapters ) {} // Chapter — maps to `chapter` table public record ChapterNode( String chapterId, String bookId, int number, String title, int pageStart, List sections ) {} // Section — maps to `section` table; fullText stays in Postgres public record SectionNode( String sectionId, String chapterId, String bookId, String number, String title, int pageStart, int pageEnd, String fullText, List chunks, List figures ) {} // Text chunk — embedded into vector_store; references its parent section public record TextChunkNode( String chunkId, // UUID → becomes vector_store document ID String sectionId, String chapterId, String bookId, String text, int chunkIndex, int totalChunksInSection, int pageStart, int pageEnd, Map metadata // flattened for Spring AI filtering ) { public Map toMetadata() { return Map.of( "type", "TEXT", "book_id", bookId, "chapter_id", chapterId, "section_id", sectionId, "section_title", /* from parent SectionNode */, "page_start", pageStart, "page_end", pageEnd, "chunk_index", chunkIndex, "total_chunks", totalChunksInSection ); } } // Figure — maps to `figure` table; caption embedded into vector_store public record FigureNode( String figureId, String sectionId, String chapterId, String bookId, String label, // "Fig. 12-4" String caption, FigureType type, int page, String imagePath, // relative: "figures/{bookId}/{figureId}.png" UUID captionEmbeddingId // ID in vector_store ) {} ``` ### Figure type enum ```java public enum FigureType { ANATOMICAL_DIAGRAM, SURGICAL_PHOTOGRAPH, MRI_CT_SCAN, TABLE, CHART, INTRAOPERATIVE_IMAGE } ``` Classification heuristic (applied to caption + surrounding text): | Keyword(s) | FigureType | |-----------|-----------| | `MRI`, `CT`, `magnetic`, `resonance`, `tomography` | `MRI_CT_SCAN` | | `intraoperative`, `intra-op` | `INTRAOPERATIVE_IMAGE` | | `table`, `Table` (at line start) | `TABLE` | | `chart`, `graph`, `histogram` | `CHART` | | `photograph`, `photo` | `SURGICAL_PHOTOGRAPH` | | (default) | `ANATOMICAL_DIAGRAM` | ### Chunk–figure join record ```java // Maps to `chunk_figure_ref` table public record ChunkFigureRef( UUID chunkId, String figureId, int mentionPage ) {} ``` --- ## Vector Store Documents All documents in `vector_store` carry a `metadata` JSON column with a `type` field for filtering. ### Text chunk document | Field | Value | |-------|-------| | `content` | chunk text (400–600 tokens) | | `metadata.type` | `"TEXT"` | | `metadata.book_id` | book UUID | | `metadata.book_title` | book title string | | `metadata.chapter_id` | chapter ID string | | `metadata.section_id` | section ID string | | `metadata.section_title` | section title string | | `metadata.page_start` | int | | `metadata.page_end` | int | | `metadata.chunk_index` | int (0-based) | | `metadata.total_chunks` | int | ### Figure caption document | Field | Value | |-------|-------| | `content` | vision-generated description + caption text | | `metadata.type` | `"FIGURE"` | | `metadata.book_id` | book UUID | | `metadata.book_title` | book title string | | `metadata.chapter_id` | chapter ID string | | `metadata.section_id` | section ID string | | `metadata.figure_id` | figure ID string | | `metadata.figure_type` | enum name string | | `metadata.image_path` | relative file path | | `metadata.label` | caption label e.g. `"Fig. 12-4"` | | `metadata.page` | int | --- ## File Store Layout ``` uploads/ └── figures/ └── {bookId}/ ├── {figureId}.png └── ... ``` - Base path configurable via `app.figure-storage.base-path` (default: `./uploads`) - Files are served via `GET /api/v1/figures/{bookId}/{filename}` (static resource mapping) - Gitignored; not version-controlled --- ## State Transitions Book processing extends the existing `BookStatus` state machine: ``` PENDING → PROCESSING → READY ↘ FAILED ``` During `PROCESSING`: 1. Parse PDF structure → extract chapters/sections → persist to Postgres 2. Split sections into text chunks → embed → write to vector_store 3. Extract images per page → filter by min size → save PNG → generate vision description → embed caption → write figure to Postgres + vector_store 4. Write chunk_figure_refs for all detected figure references in text Failure at step 3 (individual page) → log + skip that page's images; continue. Failure at any other step → set `BookStatus.FAILED`. --- ## Retrieval Result Structure ```java public record RetrievalResult( List parentSections, // expanded full-text context List figureVectorHits, // semantic figure matches List linkedFigures // figures explicitly referenced in text chunks ) {} ``` The `NeurosurgeryRetriever` service deduplicates figures across both lists before passing the result to the LLM prompt builder.