giteaadmin/ai-teacher

Fork 0

Files

T

Adrien 5acfdd33c1 first implementation - image/drawing integration

2026-04-04 12:56:56 +02:00

9.4 KiB

Raw Blame History

Data Model: Enhanced Embedding with Image Parsing and Metadata

Branch: 002-image-aware-embedding | Date: 2026-04-03

Overview

Three storage tiers work in concert:

┌──────────────────────────────────────────────────────────────────┐
│  PDF Upload                                                       │
│     │                                                             │
│     ▼                                                             │
│  Parsing Pipeline                                                 │
│     │                          │                                  │
│     ▼                          ▼                                  │
│  Postgres (source of truth)   pgvector (search index)            │
│  - book                       - vector_store (text chunks)        │
│  - chapter                    - vector_store (figure captions)    │
│  - section (+ fullText)       File Store (images)                 │
│  - figure (metadata)          - /uploads/figures/{bookId}/*.png  │
│  - chunk_figure_refs                                              │
└──────────────────────────────────────────────────────────────────┘

Postgres Schema

Existing tables (unchanged)

book — status, metadata, page count (V1)
chat_session, message — conversation (V1)
vector_store — managed by Spring AI pgvector starter (V2)
topic — predefined topics (V3)

New tables (Flyway V4)

-- V4: Document hierarchy

CREATE TABLE chapter (
    id           VARCHAR(200) PRIMARY KEY,  -- "{bookId}-ch{N}"
    book_id      UUID NOT NULL REFERENCES book(id) ON DELETE CASCADE,
    number       INT NOT NULL,
    title        VARCHAR(500),
    page_start   INT,
    created_at   TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE section (
    id           VARCHAR(200) PRIMARY KEY,  -- "{bookId}-ch{N}-s{X}-{Y}"
    chapter_id   VARCHAR(200) NOT NULL REFERENCES chapter(id) ON DELETE CASCADE,
    book_id      UUID NOT NULL REFERENCES book(id) ON DELETE CASCADE,
    number       VARCHAR(50),               -- "2.3" or "12.2.3"
    title        VARCHAR(500),
    page_start   INT NOT NULL,
    page_end     INT NOT NULL,
    full_text    TEXT NOT NULL,             -- NOT in vector store
    created_at   TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_section_book    ON section(book_id);
CREATE INDEX idx_section_chapter ON section(chapter_id);

New tables (Flyway V5)

-- V5: Figures and chunk→figure links

CREATE TABLE figure (
    id                    VARCHAR(200) PRIMARY KEY, -- "{bookId}-fig-{label}"
    book_id               UUID NOT NULL REFERENCES book(id) ON DELETE CASCADE,
    section_id            VARCHAR(200) REFERENCES section(id) ON DELETE SET NULL,
    chapter_id            VARCHAR(200) REFERENCES chapter(id) ON DELETE SET NULL,
    label                 VARCHAR(100),             -- "Fig. 12-4"
    caption               TEXT,
    figure_type           VARCHAR(50) NOT NULL,     -- FigureType enum name
    page                  INT NOT NULL,
    image_path            VARCHAR(1000) NOT NULL,   -- relative path on disk
    caption_embedding_id  UUID,                     -- ID in vector_store
    created_at            TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE chunk_figure_ref (
    chunk_id      UUID NOT NULL,         -- vector_store document ID
    figure_id     VARCHAR(200) NOT NULL REFERENCES figure(id) ON DELETE CASCADE,
    mention_page  INT,
    PRIMARY KEY (chunk_id, figure_id)
);

CREATE INDEX idx_figure_book    ON figure(book_id);
CREATE INDEX idx_cfr_chunk      ON chunk_figure_ref(chunk_id);

Java Domain Records

Document hierarchy (new package `com.aiteacher.document`)

// Root — in-memory only, not a JPA entity
public record BookNode(
    String bookId,
    String title,
    String isbn,
    String edition,
    List<String> authors,
    List<ChapterNode> chapters
) {}

// Chapter — maps to `chapter` table
public record ChapterNode(
    String chapterId,
    String bookId,
    int number,
    String title,
    int pageStart,
    List<SectionNode> sections
) {}

// Section — maps to `section` table; fullText stays in Postgres
public record SectionNode(
    String sectionId,
    String chapterId,
    String bookId,
    String number,
    String title,
    int pageStart,
    int pageEnd,
    String fullText,
    List<TextChunkNode> chunks,
    List<FigureNode> figures
) {}

// Text chunk — embedded into vector_store; references its parent section
public record TextChunkNode(
    String chunkId,          // UUID → becomes vector_store document ID
    String sectionId,
    String chapterId,
    String bookId,
    String text,
    int chunkIndex,
    int totalChunksInSection,
    int pageStart,
    int pageEnd,
    Map<String, Object> metadata   // flattened for Spring AI filtering
) {
    public Map<String, Object> toMetadata() {
        return Map.of(
            "type",          "TEXT",
            "book_id",       bookId,
            "chapter_id",    chapterId,
            "section_id",    sectionId,
            "section_title", /* from parent SectionNode */,
            "page_start",    pageStart,
            "page_end",      pageEnd,
            "chunk_index",   chunkIndex,
            "total_chunks",  totalChunksInSection
        );
    }
}

// Figure — maps to `figure` table; caption embedded into vector_store
public record FigureNode(
    String figureId,
    String sectionId,
    String chapterId,
    String bookId,
    String label,            // "Fig. 12-4"
    String caption,
    FigureType type,
    int page,
    String imagePath,        // relative: "figures/{bookId}/{figureId}.png"
    UUID captionEmbeddingId  // ID in vector_store
) {}

Figure type enum

public enum FigureType {
    ANATOMICAL_DIAGRAM,
    SURGICAL_PHOTOGRAPH,
    MRI_CT_SCAN,
    TABLE,
    CHART,
    INTRAOPERATIVE_IMAGE
}

Classification heuristic (applied to caption + surrounding text):

Keyword(s)	FigureType
`MRI`, `CT`, `magnetic`, `resonance`, `tomography`	`MRI_CT_SCAN`
`intraoperative`, `intra-op`	`INTRAOPERATIVE_IMAGE`
`table`, `Table` (at line start)	`TABLE`
`chart`, `graph`, `histogram`	`CHART`
`photograph`, `photo`	`SURGICAL_PHOTOGRAPH`
(default)	`ANATOMICAL_DIAGRAM`

Chunk–figure join record

// Maps to `chunk_figure_ref` table
public record ChunkFigureRef(
    UUID chunkId,
    String figureId,
    int mentionPage
) {}

Vector Store Documents

All documents in vector_store carry a metadata JSON column with a type field for filtering.

Text chunk document

Field	Value
`content`	chunk text (400–600 tokens)
`metadata.type`	`"TEXT"`
`metadata.book_id`	book UUID
`metadata.book_title`	book title string
`metadata.chapter_id`	chapter ID string
`metadata.section_id`	section ID string
`metadata.section_title`	section title string
`metadata.page_start`	int
`metadata.page_end`	int
`metadata.chunk_index`	int (0-based)
`metadata.total_chunks`	int

Figure caption document

Field	Value
`content`	vision-generated description + caption text
`metadata.type`	`"FIGURE"`
`metadata.book_id`	book UUID
`metadata.book_title`	book title string
`metadata.chapter_id`	chapter ID string
`metadata.section_id`	section ID string
`metadata.figure_id`	figure ID string
`metadata.figure_type`	enum name string
`metadata.image_path`	relative file path
`metadata.label`	caption label e.g. `"Fig. 12-4"`
`metadata.page`	int

File Store Layout

uploads/
└── figures/
    └── {bookId}/
        ├── {figureId}.png
        └── ...

Base path configurable via app.figure-storage.base-path (default: ./uploads)
Files are served via GET /api/v1/figures/{bookId}/{filename} (static resource mapping)
Gitignored; not version-controlled

State Transitions

Book processing extends the existing BookStatus state machine:

PENDING → PROCESSING → READY
                    ↘ FAILED

During PROCESSING:

Parse PDF structure → extract chapters/sections → persist to Postgres
Split sections into text chunks → embed → write to vector_store
Extract images per page → filter by min size → save PNG → generate vision description → embed caption → write figure to Postgres + vector_store
Write chunk_figure_refs for all detected figure references in text

Failure at step 3 (individual page) → log + skip that page's images; continue.
Failure at any other step → set BookStatus.FAILED.

Retrieval Result Structure

public record RetrievalResult(
    List<SectionNode> parentSections,    // expanded full-text context
    List<Document> figureVectorHits,     // semantic figure matches
    List<FigureNode> linkedFigures       // figures explicitly referenced in text chunks
) {}

The NeurosurgeryRetriever service deduplicates figures across both lists before passing the result to the LLM prompt builder.

9.4 KiB Raw Blame History Unescape Escape