306 lines
9.4 KiB
Markdown
306 lines
9.4 KiB
Markdown
# Data Model: Enhanced Embedding with Image Parsing and Metadata
|
||
|
||
**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-03
|
||
|
||
---
|
||
|
||
## Overview
|
||
|
||
Three storage tiers work in concert:
|
||
|
||
```
|
||
┌──────────────────────────────────────────────────────────────────┐
|
||
│ PDF Upload │
|
||
│ │ │
|
||
│ ▼ │
|
||
│ Parsing Pipeline │
|
||
│ │ │ │
|
||
│ ▼ ▼ │
|
||
│ Postgres (source of truth) pgvector (search index) │
|
||
│ - book - vector_store (text chunks) │
|
||
│ - chapter - vector_store (figure captions) │
|
||
│ - section (+ fullText) File Store (images) │
|
||
│ - figure (metadata) - /uploads/figures/{bookId}/*.png │
|
||
│ - chunk_figure_refs │
|
||
└──────────────────────────────────────────────────────────────────┘
|
||
```
|
||
|
||
---
|
||
|
||
## Postgres Schema
|
||
|
||
### Existing tables (unchanged)
|
||
|
||
- `book` — status, metadata, page count (V1)
|
||
- `chat_session`, `message` — conversation (V1)
|
||
- `vector_store` — managed by Spring AI pgvector starter (V2)
|
||
- `topic` — predefined topics (V3)
|
||
|
||
### New tables (Flyway V4)
|
||
|
||
```sql
|
||
-- V4: Document hierarchy
|
||
|
||
CREATE TABLE chapter (
|
||
id VARCHAR(200) PRIMARY KEY, -- "{bookId}-ch{N}"
|
||
book_id UUID NOT NULL REFERENCES book(id) ON DELETE CASCADE,
|
||
number INT NOT NULL,
|
||
title VARCHAR(500),
|
||
page_start INT,
|
||
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
|
||
);
|
||
|
||
CREATE TABLE section (
|
||
id VARCHAR(200) PRIMARY KEY, -- "{bookId}-ch{N}-s{X}-{Y}"
|
||
chapter_id VARCHAR(200) NOT NULL REFERENCES chapter(id) ON DELETE CASCADE,
|
||
book_id UUID NOT NULL REFERENCES book(id) ON DELETE CASCADE,
|
||
number VARCHAR(50), -- "2.3" or "12.2.3"
|
||
title VARCHAR(500),
|
||
page_start INT NOT NULL,
|
||
page_end INT NOT NULL,
|
||
full_text TEXT NOT NULL, -- NOT in vector store
|
||
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
|
||
);
|
||
|
||
CREATE INDEX idx_section_book ON section(book_id);
|
||
CREATE INDEX idx_section_chapter ON section(chapter_id);
|
||
```
|
||
|
||
### New tables (Flyway V5)
|
||
|
||
```sql
|
||
-- V5: Figures and chunk→figure links
|
||
|
||
CREATE TABLE figure (
|
||
id VARCHAR(200) PRIMARY KEY, -- "{bookId}-fig-{label}"
|
||
book_id UUID NOT NULL REFERENCES book(id) ON DELETE CASCADE,
|
||
section_id VARCHAR(200) REFERENCES section(id) ON DELETE SET NULL,
|
||
chapter_id VARCHAR(200) REFERENCES chapter(id) ON DELETE SET NULL,
|
||
label VARCHAR(100), -- "Fig. 12-4"
|
||
caption TEXT,
|
||
figure_type VARCHAR(50) NOT NULL, -- FigureType enum name
|
||
page INT NOT NULL,
|
||
image_path VARCHAR(1000) NOT NULL, -- relative path on disk
|
||
caption_embedding_id UUID, -- ID in vector_store
|
||
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
|
||
);
|
||
|
||
CREATE TABLE chunk_figure_ref (
|
||
chunk_id UUID NOT NULL, -- vector_store document ID
|
||
figure_id VARCHAR(200) NOT NULL REFERENCES figure(id) ON DELETE CASCADE,
|
||
mention_page INT,
|
||
PRIMARY KEY (chunk_id, figure_id)
|
||
);
|
||
|
||
CREATE INDEX idx_figure_book ON figure(book_id);
|
||
CREATE INDEX idx_cfr_chunk ON chunk_figure_ref(chunk_id);
|
||
```
|
||
|
||
---
|
||
|
||
## Java Domain Records
|
||
|
||
### Document hierarchy (new package `com.aiteacher.document`)
|
||
|
||
```java
|
||
// Root — in-memory only, not a JPA entity
|
||
public record BookNode(
|
||
String bookId,
|
||
String title,
|
||
String isbn,
|
||
String edition,
|
||
List<String> authors,
|
||
List<ChapterNode> chapters
|
||
) {}
|
||
|
||
// Chapter — maps to `chapter` table
|
||
public record ChapterNode(
|
||
String chapterId,
|
||
String bookId,
|
||
int number,
|
||
String title,
|
||
int pageStart,
|
||
List<SectionNode> sections
|
||
) {}
|
||
|
||
// Section — maps to `section` table; fullText stays in Postgres
|
||
public record SectionNode(
|
||
String sectionId,
|
||
String chapterId,
|
||
String bookId,
|
||
String number,
|
||
String title,
|
||
int pageStart,
|
||
int pageEnd,
|
||
String fullText,
|
||
List<TextChunkNode> chunks,
|
||
List<FigureNode> figures
|
||
) {}
|
||
|
||
// Text chunk — embedded into vector_store; references its parent section
|
||
public record TextChunkNode(
|
||
String chunkId, // UUID → becomes vector_store document ID
|
||
String sectionId,
|
||
String chapterId,
|
||
String bookId,
|
||
String text,
|
||
int chunkIndex,
|
||
int totalChunksInSection,
|
||
int pageStart,
|
||
int pageEnd,
|
||
Map<String, Object> metadata // flattened for Spring AI filtering
|
||
) {
|
||
public Map<String, Object> toMetadata() {
|
||
return Map.of(
|
||
"type", "TEXT",
|
||
"book_id", bookId,
|
||
"chapter_id", chapterId,
|
||
"section_id", sectionId,
|
||
"section_title", /* from parent SectionNode */,
|
||
"page_start", pageStart,
|
||
"page_end", pageEnd,
|
||
"chunk_index", chunkIndex,
|
||
"total_chunks", totalChunksInSection
|
||
);
|
||
}
|
||
}
|
||
|
||
// Figure — maps to `figure` table; caption embedded into vector_store
|
||
public record FigureNode(
|
||
String figureId,
|
||
String sectionId,
|
||
String chapterId,
|
||
String bookId,
|
||
String label, // "Fig. 12-4"
|
||
String caption,
|
||
FigureType type,
|
||
int page,
|
||
String imagePath, // relative: "figures/{bookId}/{figureId}.png"
|
||
UUID captionEmbeddingId // ID in vector_store
|
||
) {}
|
||
```
|
||
|
||
### Figure type enum
|
||
|
||
```java
|
||
public enum FigureType {
|
||
ANATOMICAL_DIAGRAM,
|
||
SURGICAL_PHOTOGRAPH,
|
||
MRI_CT_SCAN,
|
||
TABLE,
|
||
CHART,
|
||
INTRAOPERATIVE_IMAGE
|
||
}
|
||
```
|
||
|
||
Classification heuristic (applied to caption + surrounding text):
|
||
|
||
| Keyword(s) | FigureType |
|
||
|-----------|-----------|
|
||
| `MRI`, `CT`, `magnetic`, `resonance`, `tomography` | `MRI_CT_SCAN` |
|
||
| `intraoperative`, `intra-op` | `INTRAOPERATIVE_IMAGE` |
|
||
| `table`, `Table` (at line start) | `TABLE` |
|
||
| `chart`, `graph`, `histogram` | `CHART` |
|
||
| `photograph`, `photo` | `SURGICAL_PHOTOGRAPH` |
|
||
| (default) | `ANATOMICAL_DIAGRAM` |
|
||
|
||
### Chunk–figure join record
|
||
|
||
```java
|
||
// Maps to `chunk_figure_ref` table
|
||
public record ChunkFigureRef(
|
||
UUID chunkId,
|
||
String figureId,
|
||
int mentionPage
|
||
) {}
|
||
```
|
||
|
||
---
|
||
|
||
## Vector Store Documents
|
||
|
||
All documents in `vector_store` carry a `metadata` JSON column with a `type` field for filtering.
|
||
|
||
### Text chunk document
|
||
|
||
| Field | Value |
|
||
|-------|-------|
|
||
| `content` | chunk text (400–600 tokens) |
|
||
| `metadata.type` | `"TEXT"` |
|
||
| `metadata.book_id` | book UUID |
|
||
| `metadata.book_title` | book title string |
|
||
| `metadata.chapter_id` | chapter ID string |
|
||
| `metadata.section_id` | section ID string |
|
||
| `metadata.section_title` | section title string |
|
||
| `metadata.page_start` | int |
|
||
| `metadata.page_end` | int |
|
||
| `metadata.chunk_index` | int (0-based) |
|
||
| `metadata.total_chunks` | int |
|
||
|
||
### Figure caption document
|
||
|
||
| Field | Value |
|
||
|-------|-------|
|
||
| `content` | vision-generated description + caption text |
|
||
| `metadata.type` | `"FIGURE"` |
|
||
| `metadata.book_id` | book UUID |
|
||
| `metadata.book_title` | book title string |
|
||
| `metadata.chapter_id` | chapter ID string |
|
||
| `metadata.section_id` | section ID string |
|
||
| `metadata.figure_id` | figure ID string |
|
||
| `metadata.figure_type` | enum name string |
|
||
| `metadata.image_path` | relative file path |
|
||
| `metadata.label` | caption label e.g. `"Fig. 12-4"` |
|
||
| `metadata.page` | int |
|
||
|
||
---
|
||
|
||
## File Store Layout
|
||
|
||
```
|
||
uploads/
|
||
└── figures/
|
||
└── {bookId}/
|
||
├── {figureId}.png
|
||
└── ...
|
||
```
|
||
|
||
- Base path configurable via `app.figure-storage.base-path` (default: `./uploads`)
|
||
- Files are served via `GET /api/v1/figures/{bookId}/{filename}` (static resource mapping)
|
||
- Gitignored; not version-controlled
|
||
|
||
---
|
||
|
||
## State Transitions
|
||
|
||
Book processing extends the existing `BookStatus` state machine:
|
||
|
||
```
|
||
PENDING → PROCESSING → READY
|
||
↘ FAILED
|
||
```
|
||
|
||
During `PROCESSING`:
|
||
1. Parse PDF structure → extract chapters/sections → persist to Postgres
|
||
2. Split sections into text chunks → embed → write to vector_store
|
||
3. Extract images per page → filter by min size → save PNG → generate vision description → embed caption → write figure to Postgres + vector_store
|
||
4. Write chunk_figure_refs for all detected figure references in text
|
||
|
||
Failure at step 3 (individual page) → log + skip that page's images; continue.
|
||
Failure at any other step → set `BookStatus.FAILED`.
|
||
|
||
---
|
||
|
||
## Retrieval Result Structure
|
||
|
||
```java
|
||
public record RetrievalResult(
|
||
List<SectionNode> parentSections, // expanded full-text context
|
||
List<Document> figureVectorHits, // semantic figure matches
|
||
List<FigureNode> linkedFigures // figures explicitly referenced in text chunks
|
||
) {}
|
||
```
|
||
|
||
The `NeurosurgeryRetriever` service deduplicates figures across both lists before passing
|
||
the result to the LLM prompt builder.
|