ai-teacher/specs/002-image-aware-embedding/data-model.md

# Data Model: Enhanced Embedding with Image Parsing and Metadata

**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-03

---

## Overview

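Three storage tiers work together: Postgres is the source of truth, pgvector is the search index, and a file store holds the extracted images: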

```
┌──────────────────────────────────────────────────────────────────┐
│  PDF Upload                                                       │
│     │                                                             │
│     ▼                                                             │
│  Parsing Pipeline                                                 │
│     │                          │                                  │
│     ▼                          ▼                                  │
│  Postgres (source of truth)   pgvector (search index)            │
│  - book                       - vector_store (text chunks)        │
│  - chapter                    - vector_store (figure captions)    │
│  - section (+ fullText)       File Store (images)                 │
│  - figure (metadata)          - /uploads/figures/{bookId}/*.png  │
│  - chunk_figure_ref                                               │
└──────────────────────────────────────────────────────────────────┘
```

---

## Postgres Schema

### Existing tables (unchanged)

- `book` — status, metadata, page count (V1)
- `chat_session`, `message` — conversation (V1)
- `vector_store` — managed by Spring AI pgvector starter (V2)
- `topic` — predefined topics (V3)

### New tables (Flyway V4)

```sql
-- V4: Document hierarchy

CREATE TABLE chapter (
    id           VARCHAR(200) PRIMARY KEY,  -- "{bookId}-ch{N}"
    book_id      UUID NOT NULL REFERENCES book(id) ON DELETE CASCADE,
    number       INT NOT NULL,
    title        VARCHAR(500),
    page_start   INT,
    created_at   TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE section (
    id           VARCHAR(200) PRIMARY KEY,  -- "{bookId}-ch{N}-s{X}-{Y}"
    chapter_id   VARCHAR(200) NOT NULL REFERENCES chapter(id) ON DELETE CASCADE,
    book_id      UUID NOT NULL REFERENCES book(id) ON DELETE CASCADE,
    number       VARCHAR(50),               -- "2.3" or "12.2.3"
    title        VARCHAR(500),
    page_start   INT NOT NULL,
    page_end     INT NOT NULL,
    full_text    TEXT NOT NULL,             -- NOT in vector store
    created_at   TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_section_book    ON section(book_id);
CREATE INDEX idx_section_chapter ON section(chapter_id);
```
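
The ID formats in the comments above can be built by a small helper. This is a minimal sketch, not the project's actual code: `DocumentIds` is a hypothetical name, and mapping a dotted section number such as `"2.3"` onto the `{X}-{Y}` part by replacing dots with hyphens is an assumption.

```java
import java.util.UUID;

// Hypothetical helper that builds the string primary keys used by the V4 tables.
final class DocumentIds {
    private DocumentIds() {}

    // Chapter ID in the form "{bookId}-ch{N}".
    static String chapterId(UUID bookId, int chapterNumber) {
        return bookId + "-ch" + chapterNumber;
    }

    // Section ID in the form "{bookId}-ch{N}-s{X}-{Y}".
    // Assumption: the dotted section number ("2.3", "12.2.3") becomes "2-3", "12-2-3".
    static String sectionId(UUID bookId, int chapterNumber, String sectionNumber) {
        return chapterId(bookId, chapterNumber) + "-s" + sectionNumber.replace('.', '-');
    }
}
```

A UUID is 36 characters long, so IDs built this way stay well inside the `VARCHAR(200)` limit.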

### New tables (Flyway V5)

```sql
-- V5: Figures and chunk→figure links

CREATE TABLE figure (
    id                    VARCHAR(200) PRIMARY KEY, -- "{bookId}-fig-{label}"
    book_id               UUID NOT NULL REFERENCES book(id) ON DELETE CASCADE,
    section_id            VARCHAR(200) REFERENCES section(id) ON DELETE SET NULL,
    chapter_id            VARCHAR(200) REFERENCES chapter(id) ON DELETE SET NULL,
    label                 VARCHAR(100),             -- "Fig. 12-4"
    caption               TEXT,
    figure_type           VARCHAR(50) NOT NULL,     -- FigureType enum name
    page                  INT NOT NULL,
    image_path            VARCHAR(1000) NOT NULL,   -- relative path on disk
    caption_embedding_id  UUID,                     -- ID in vector_store
    created_at            TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE chunk_figure_ref (
    chunk_id      UUID NOT NULL,         -- vector_store document ID
    figure_id     VARCHAR(200) NOT NULL REFERENCES figure(id) ON DELETE CASCADE,
    mention_page  INT,
    PRIMARY KEY (chunk_id, figure_id)
);

CREATE INDEX idx_figure_book    ON figure(book_id);
CREATE INDEX idx_cfr_chunk      ON chunk_figure_ref(chunk_id);
```

---

## Java Domain Records

### Document hierarchy (new package `com.aiteacher.document`)

```java
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Root — in-memory only, not a JPA entity
public record BookNode(
    String bookId,
    String title,
    String isbn,
    String edition,
    List<String> authors,
    List<ChapterNode> chapters
) {}

// Chapter — maps to `chapter` table
public record ChapterNode(
    String chapterId,
    String bookId,
    int number,
    String title,
    Integer pageStart,       // nullable: chapter.page_start has no NOT NULL constraint
    List<SectionNode> sections
) {}

// Section — maps to `section` table; fullText stays in Postgres
public record SectionNode(
    String sectionId,
    String chapterId,
    String bookId,
    String number,
    String title,
    int pageStart,
    int pageEnd,
    String fullText,
    List<TextChunkNode> chunks,
    List<FigureNode> figures
) {}

// Text chunk — embedded into vector_store; references its parent section
public record TextChunkNode(
    String chunkId,          // UUID → becomes vector_store document ID
    String sectionId,
    String chapterId,
    String bookId,
    String text,
    int chunkIndex,
    int totalChunksInSection,
    int pageStart,
    int pageEnd,
    Map<String, Object> metadata   // flattened for Spring AI filtering
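) {
    // Flattens this chunk into the metadata map stored alongside its embedding.
    // The titles live on the parent BookNode and SectionNode, so the caller passes them in.
    // Map.of rejects null values, so a missing title is stored as an empty string.
    public Map<String, Object> toMetadata(String bookTitle, String sectionTitle) {
        return Map.of(
            "type",          "TEXT",
            "book_id",       bookId,
            "book_title",    bookTitle == null ? "" : bookTitle,
            "chapter_id",    chapterId,
            "section_id",    sectionId,
            "section_title", sectionTitle == null ? "" : sectionTitle,
            "page_start",    pageStart,
            "page_end",      pageEnd,
            "chunk_index",   chunkIndex,
            "total_chunks",  totalChunksInSection
        );
    }
}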

// Figure — maps to `figure` table; caption embedded into vector_store
public record FigureNode(
    String figureId,
    String sectionId,
    String chapterId,
    String bookId,
    String label,            // "Fig. 12-4"
    String caption,
    FigureType type,
    int page,
    String imagePath,        // relative: "figures/{bookId}/{figureId}.png"
    UUID captionEmbeddingId  // ID in vector_store
) {}
```

### Figure type enum

```java
public enum FigureType {
    ANATOMICAL_DIAGRAM,
    SURGICAL_PHOTOGRAPH,
    MRI_CT_SCAN,
    TABLE,
    CHART,
    INTRAOPERATIVE_IMAGE
}
```

Classification heuristic (applied to caption + surrounding text):

| Keyword(s) | FigureType |
|-----------|-----------|
| `MRI`, `CT`, `magnetic`, `resonance`, `tomography` | `MRI_CT_SCAN` |
| `intraoperative`, `intra-op` | `INTRAOPERATIVE_IMAGE` |
| `table`, `Table` (at line start) | `TABLE` |
| `chart`, `graph`, `histogram` | `CHART` |
| `photograph`, `photo` | `SURGICAL_PHOTOGRAPH` |
| (default) | `ANATOMICAL_DIAGRAM` |
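
The keyword table above could be implemented roughly as follows. This is a sketch under stated assumptions: the class name `FigureTypeClassifier` is hypothetical, rules are checked top to bottom with the first match winning, and keywords are matched case-insensitively as whole words. The source specifies none of these details.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical keyword classifier for figure captions and their surrounding text.
final class FigureTypeClassifier {
    // Mirrors the FigureType enum so the sketch compiles on its own.
    enum FigureType { ANATOMICAL_DIAGRAM, SURGICAL_PHOTOGRAPH, MRI_CT_SCAN, TABLE, CHART, INTRAOPERATIVE_IMAGE }

    private record Rule(Pattern pattern, FigureType type) {}

    private static final int CI = Pattern.CASE_INSENSITIVE;

    // Assumption: rules are evaluated in table order and the first match wins.
    private static final List<Rule> RULES = List.of(
        new Rule(Pattern.compile("\\b(MRI|CT|magnetic|resonance|tomography)\\b", CI), FigureType.MRI_CT_SCAN),
        new Rule(Pattern.compile("\\bintra-?op(erative)?\\b", CI), FigureType.INTRAOPERATIVE_IMAGE),
        // "table" counts only at the start of a line.
        new Rule(Pattern.compile("^\\s*table\\b", CI | Pattern.MULTILINE), FigureType.TABLE),
        // Whole-word matching keeps "photograph" from triggering the "graph" keyword.
        new Rule(Pattern.compile("\\b(chart|graph|histogram)\\b", CI), FigureType.CHART),
        new Rule(Pattern.compile("\\bphoto(graph)?s?\\b", CI), FigureType.SURGICAL_PHOTOGRAPH)
    );

    static FigureType classify(String caption, String surroundingText) {
        String text = (caption == null ? "" : caption) + "\n"
                    + (surroundingText == null ? "" : surroundingText);
        for (Rule rule : RULES) {
            if (rule.pattern().matcher(text).find()) {
                return rule.type();
            }
        }
        return FigureType.ANATOMICAL_DIAGRAM; // default row of the table
    }
}
```

Under these assumptions a caption such as "Intraoperative photograph of the exposure" resolves to `INTRAOPERATIVE_IMAGE`, because that rule is checked before the photograph rule.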

### Chunk–figure join record

```java
// Maps to `chunk_figure_ref` table
public record ChunkFigureRef(
    UUID chunkId,
    String figureId,
    Integer mentionPage      // nullable: mention_page has no NOT NULL constraint
) {}
```
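
To populate this join record, the pipeline has to find figure labels mentioned in a chunk's text. The sketch below shows one possible approach with a regular expression. `FigureMentionFinder` is a hypothetical name, and the label shape (`Fig.` or `Figure` followed by two numbers separated by a hyphen or dot) is an assumption extrapolated from the single `"Fig. 12-4"` example.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that extracts figure labels mentioned in a chunk of text.
final class FigureMentionFinder {
    // Assumed label shape: "Fig. 12-4", "Figure 12-4" or "Fig 12.4".
    private static final Pattern LABEL =
        Pattern.compile("\\bFig(?:ure|\\.)?\\s*(\\d+)[-.](\\d+)");

    // Returns normalized labels ("Fig. 12-4") in order of first mention, without duplicates.
    static Set<String> findLabels(String chunkText) {
        Set<String> labels = new LinkedHashSet<>();
        Matcher matcher = LABEL.matcher(chunkText);
        while (matcher.find()) {
            labels.add("Fig. " + matcher.group(1) + "-" + matcher.group(2));
        }
        return labels;
    }
}
```

Each returned label would then be looked up against `figure.label` for the same book, and one `ChunkFigureRef` written per match.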

---

## Vector Store Documents

Every row in `vector_store` has a `metadata` JSON column whose `type` field (`"TEXT"` or `"FIGURE"`) is used for filtering.

### Text chunk document

| Field | Value |
|-------|-------|
| `content` | chunk text (400–600 tokens) |
| `metadata.type` | `"TEXT"` |
| `metadata.book_id` | book UUID |
| `metadata.book_title` | book title string |
| `metadata.chapter_id` | chapter ID string |
| `metadata.section_id` | section ID string |
| `metadata.section_title` | section title string |
| `metadata.page_start` | int |
| `metadata.page_end` | int |
| `metadata.chunk_index` | int (0-based) |
| `metadata.total_chunks` | int |

### Figure caption document

| Field | Value |
|-------|-------|
| `content` | vision-generated description + caption text |
| `metadata.type` | `"FIGURE"` |
| `metadata.book_id` | book UUID |
| `metadata.book_title` | book title string |
| `metadata.chapter_id` | chapter ID string |
| `metadata.section_id` | section ID string |
| `metadata.figure_id` | figure ID string |
| `metadata.figure_type` | enum name string |
| `metadata.image_path` | relative file path |
| `metadata.label` | figure label, e.g. `"Fig. 12-4"` |
| `metadata.page` | int |
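
The figure metadata in the table above might be assembled like this. It is a sketch only: `FigureMetadata` and `putIfPresent` are hypothetical names. A mutable map is used instead of `Map.of` because `section_id`, `chapter_id` and `label` are nullable in the `figure` table and `Map.of` throws on null values. Omitting absent keys rather than storing placeholders is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical builder for the metadata map of a figure caption document.
final class FigureMetadata {
    private FigureMetadata() {}

    static Map<String, Object> build(String bookId, String bookTitle, String chapterId,
                                     String sectionId, String figureId, String figureType,
                                     String imagePath, String label, int page) {
        Map<String, Object> metadata = new LinkedHashMap<>();
        metadata.put("type", "FIGURE");
        putIfPresent(metadata, "book_id", bookId);
        putIfPresent(metadata, "book_title", bookTitle);
        putIfPresent(metadata, "chapter_id", chapterId);
        putIfPresent(metadata, "section_id", sectionId);
        putIfPresent(metadata, "figure_id", figureId);
        putIfPresent(metadata, "figure_type", figureType);
        putIfPresent(metadata, "image_path", imagePath);
        putIfPresent(metadata, "label", label);
        metadata.put("page", page);
        return metadata;
    }

    // Assumption: a null field is left out of the map instead of being stored as a placeholder.
    private static void putIfPresent(Map<String, Object> metadata, String key, Object value) {
        if (value != null) {
            metadata.put(key, value);
        }
    }
}
```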

---

## File Store Layout

```
uploads/
└── figures/
    └── {bookId}/
        ├── {figureId}.png
        └── ...
```

- Base path configurable via `app.figure-storage.base-path` (default: `./uploads`)
- Files are served via `GET /api/v1/figures/{bookId}/{filename}` (static resource mapping)
- Excluded from version control (gitignored)
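
Resolving a figure file under the configured base path could look like the sketch below. `FigureStorage` is a hypothetical class, and the path-traversal guard is an added precaution rather than something the layout above requires.

```java
import java.nio.file.Path;

// Hypothetical resolver for files under {basePath}/figures/{bookId}/.
final class FigureStorage {
    private final Path figuresRoot;

    // basePath corresponds to app.figure-storage.base-path (default "./uploads").
    FigureStorage(String basePath) {
        this.figuresRoot = Path.of(basePath, "figures").toAbsolutePath().normalize();
    }

    // Resolves {basePath}/figures/{bookId}/{filename} and rejects paths that leave the figures root.
    Path resolve(String bookId, String filename) {
        Path resolved = figuresRoot.resolve(bookId).resolve(filename).normalize();
        if (!resolved.startsWith(figuresRoot)) {
            throw new IllegalArgumentException("Path escapes figure storage: " + filename);
        }
        return resolved;
    }
}
```

Normalizing before the `startsWith` check means a request for `../../../secret.txt` is rejected instead of being read from outside the upload directory.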

---

## State Transitions

Book processing extends the existing `BookStatus` state machine:

```
PENDING → PROCESSING → READY
                    ↘ FAILED
```

During `PROCESSING`:
1. Parse the PDF structure → extract chapters and sections → persist to Postgres
2. Split sections into text chunks → embed → write to `vector_store`
3. Extract images per page → filter by minimum size → save as PNG → generate vision description → embed caption → write figure to Postgres and `vector_store`
4. Write `chunk_figure_ref` rows for every figure reference detected in the text

If step 3 fails for an individual page, log the error, skip that page's images, and continue.
A failure at any other step sets `BookStatus.FAILED`.
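
The transitions and the per-page failure rule can be sketched as follows. `BookProcessingSketch`, `PageImageStep` and `extractFigures` are hypothetical names, and treating `READY` and `FAILED` as terminal states is an assumption read off the diagram, which shows no outgoing edges from them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the status transitions and the skip-on-failure rule for step 3.
final class BookProcessingSketch {
    enum BookStatus { PENDING, PROCESSING, READY, FAILED }

    // Assumption: READY and FAILED are terminal.
    static boolean canTransition(BookStatus from, BookStatus to) {
        return switch (from) {
            case PENDING -> to == BookStatus.PROCESSING;
            case PROCESSING -> to == BookStatus.READY || to == BookStatus.FAILED;
            case READY, FAILED -> false;
        };
    }

    // Stand-in for the image work of step 3 on a single page.
    interface PageImageStep {
        void process(int page) throws Exception;
    }

    // Runs step 3 page by page. A failing page is logged and skipped; processing continues.
    // Returns the pages that were skipped.
    static List<Integer> extractFigures(int pageCount, PageImageStep step) {
        List<Integer> skippedPages = new ArrayList<>();
        for (int page = 1; page <= pageCount; page++) {
            try {
                step.process(page);
            } catch (Exception e) {
                System.err.println("Skipping images on page " + page + ": " + e.getMessage());
                skippedPages.add(page);
            }
        }
        return skippedPages;
    }
}
```

Exceptions from the other steps are deliberately not caught here, so they would propagate to the caller, which sets `BookStatus.FAILED`.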

---

## Retrieval Result Structure

```java
public record RetrievalResult(
    List<SectionNode> parentSections,    // expanded full-text context
    List<Document> figureVectorHits,     // semantic figure matches
    List<FigureNode> linkedFigures       // figures explicitly referenced in text chunks
) {}
```

The `NeurosurgeryRetriever` service deduplicates figures across `figureVectorHits` and
`linkedFigures` before passing the result to the LLM prompt builder.
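
One way to deduplicate is by figure ID, sketched below with plain collections so the example stands alone. `FigureDeduplicator` is a hypothetical name. Each vector hit is represented here by its metadata map (in Spring AI this would come from `Document.getMetadata()`), and listing vector hits before linked figures is an assumed ordering.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical deduplication of figures found by vector search and by chunk references.
final class FigureDeduplicator {
    private FigureDeduplicator() {}

    // Returns each figure ID once: vector hits first, then linked figures not already seen.
    static List<String> distinctFigureIds(List<Map<String, Object>> figureHitMetadata,
                                          List<String> linkedFigureIds) {
        Set<String> ids = new LinkedHashSet<>();
        for (Map<String, Object> metadata : figureHitMetadata) {
            Object figureId = metadata.get("figure_id"); // key from the figure caption document
            if (figureId != null) {
                ids.add(figureId.toString());
            }
        }
        ids.addAll(linkedFigureIds);
        return new ArrayList<>(ids);
    }
}
```

A `LinkedHashSet` keeps the first occurrence of each ID and preserves insertion order, so the ranking of the vector hits survives deduplication.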