Implementation Plan: Enhanced Embedding with Image Parsing and Metadata

Branch: 002-image-aware-embedding | Date: 2026-04-03 | Spec: spec.md
Input: Feature specification from /specs/002-image-aware-embedding/spec.md

Summary

Enhance the book embedding pipeline to extract images from every PDF page, generate descriptive text for each image, and store all content (text chunks + figure captions) with rich, consistent metadata in the vector store. A new document hierarchy (Book → Chapter → Section → TextChunk + Figure) is introduced. Postgres holds the full-text sections and figure metadata; the vector store holds chunk and figure caption embeddings; the local file store holds extracted image files. At query time, both the text-chunk store and figure-caption store are searched in parallel and results are merged before being sent to the LLM.
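The query-time flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the `Hit` record, the `DualSearchSketch` class, and the two search functions are hypothetical stand-ins for the actual vector store calls. Merging by descending similarity score assumes that scores from the two stores are comparable, which the plan does not specify.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Hypothetical result type for illustration; not a Spring AI class.
record Hit(String id, String kind, double score, String content) {}

final class DualSearchSketch {

    /** Runs both searches concurrently, merges by descending score, keeps the top K. */
    static List<Hit> search(String query,
                            Function<String, List<Hit>> textChunkSearch,
                            Function<String, List<Hit>> figureCaptionSearch,
                            int topK) {
        // Start both lookups before waiting on either one.
        CompletableFuture<List<Hit>> chunks =
                CompletableFuture.supplyAsync(() -> textChunkSearch.apply(query));
        CompletableFuture<List<Hit>> figures =
                CompletableFuture.supplyAsync(() -> figureCaptionSearch.apply(query));

        // join() blocks until each search has finished.
        List<Hit> merged = new ArrayList<>(chunks.join());
        merged.addAll(figures.join());

        // Assumption: scores from both stores share a scale, so one sort suffices.
        merged.sort(Comparator.comparingDouble(Hit::score).reversed());
        return List.copyOf(merged.subList(0, Math.min(topK, merged.size())));
    }
}
```

In the real pipeline the two functions would wrap the text-chunk and figure-caption similarity searches. If the scores turn out not to be comparable, a fixed quota per store (for example, top N chunks plus top M figures) may be a safer merge rule.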

Technical Context

Language/Version: Java 25 (backend), TypeScript / Node 20 (frontend)
Primary Dependencies: Spring Boot 4.0.5, Spring AI 2.0.0-M4, OpenAI API (embeddings + chat), PDFBox (via Spring AI PDF reader dependency)
Storage: PostgreSQL (JPA + Flyway), pgvector (Spring AI VectorStore), local file system (extracted images — /uploads/figures/)
Testing: Spring Boot Test, JUnit 5, Mockito
Target Platform: Linux server (Docker Compose)
Project Type: Web application — backend REST API + Vue 3 frontend
Performance Goals: Full book (up to 500 pages with images) processed in ≤ 30 minutes; query response time unchanged from the existing baseline
Constraints: No new deployable units; all changes within the existing backend/ module; image storage on local disk (S3 migration is a future concern, behind an interface)
Scale/Scope: POC — <10 concurrent users; single shared book library
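For orientation, the 30-minute processing goal implies an average per-page budget. The small arithmetic sketch below makes it explicit. Spreading the budget evenly across pages is an assumption; image-heavy pages will likely take longer than text-only ones. `PageBudget` is a hypothetical helper, not part of the planned code.

```java
import java.time.Duration;

final class PageBudget {

    /** Average time available per page if the total budget is spread evenly. */
    static Duration perPage(Duration totalBudget, int pageCount) {
        return totalBudget.dividedBy(pageCount);
    }

    public static void main(String[] args) {
        // 30 minutes = 1800 s; 1800 s / 500 pages = 3.6 s per page.
        Duration budget = perPage(Duration.ofMinutes(30), 500);
        System.out.println(budget.toMillis() + " ms per page"); // prints "3600 ms per page"
    }
}
```

Text extraction, image extraction, caption generation, and embedding for a page would all need to fit within roughly 3.6 seconds on average, or be parallelized across pages.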

Constitution Check

GATE: Must pass before Phase 0 research. Re-check after Phase 1 design.

| Principle | Status | Notes |
| --- | --- | --- |
| I: KISS | ⚠️ Justified violation (see Complexity Tracking) | Hierarchical model + dual search adds complexity; justified by precision requirement |
| II: Easy to Change | ✅ | Figure storage wrapped behind `FigureStorageService` interface; can swap local disk for S3 |
| III: Web-First | ✅ | All new capabilities exposed via existing REST API; no new deployable units |
| IV: Docs as Architecture | ⚠️ Required | README Mermaid diagram MUST be updated in this PR to show new storage tiers |
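To illustrate how Principle II might be satisfied, here is a minimal sketch of a storage interface with a local-disk implementation. The plan names `FigureStorageService` and `LocalFigureStorageService` but does not define their methods, so the `store`/`load` signatures, the `bookId/figureId.png` key layout, and the `.png` extension are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Assumed method shapes; the plan only names this interface.
interface FigureStorageService {
    /** Persists the image and returns an opaque storage key. */
    String store(String bookId, String figureId, byte[] imageBytes);

    /** Loads the image previously stored under the given key. */
    byte[] load(String storageKey);
}

final class LocalFigureStorageService implements FigureStorageService {
    private final Path root;

    LocalFigureStorageService(Path root) {
        this.root = root.toAbsolutePath().normalize();
    }

    @Override
    public String store(String bookId, String figureId, byte[] imageBytes) {
        String key = bookId + "/" + figureId + ".png"; // hypothetical key layout
        Path target = resolve(key);
        try {
            Files.createDirectories(target.getParent());
            Files.write(target, imageBytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return key;
    }

    @Override
    public byte[] load(String storageKey) {
        try {
            return Files.readAllBytes(resolve(storageKey));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Rejects keys that would resolve outside the storage root (e.g. "../..").
    private Path resolve(String key) {
        Path resolved = root.resolve(key).normalize();
        if (!resolved.startsWith(root)) {
            throw new IllegalArgumentException("Key escapes storage root: " + key);
        }
        return resolved;
    }
}
```

Because callers only hold the returned key, a later S3-backed implementation could map the same key to an object name without changing any calling code.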

Project Structure

Documentation (this feature)

specs/002-image-aware-embedding/
├── plan.md              # This file
├── research.md          # Phase 0 output
├── data-model.md        # Phase 1 output
├── quickstart.md        # Phase 1 output
├── contracts/           # Phase 1 output
└── tasks.md             # Phase 2 output (/speckit.tasks)

Source Code (repository root)

backend/
├── src/main/java/com/aiteacher/
│   ├── book/
│   │   ├── Book.java                         (existing)
│   │   ├── BookController.java               (existing)
│   │   ├── BookService.java                  (existing)
│   │   ├── BookRepository.java               (existing)
│   │   ├── BookStatus.java                   (existing)
│   │   ├── BookEmbeddingService.java         (existing — enhanced)
│   │   └── NoKnowledgeSourceException.java   (existing)
│   ├── document/                             (new package)
│   │   ├── BookNode.java
│   │   ├── ChapterNode.java
│   │   ├── SectionNode.java
│   │   ├── SectionRepository.java
│   │   ├── TextChunkNode.java
│   │   ├── FigureNode.java
│   │   ├── FigureRepository.java
│   │   ├── FigureType.java
│   │   ├── ChunkFigureRef.java
│   │   └── ChunkFigureRefRepository.java
│   ├── figure/                               (new package)
│   │   ├── FigureStorageService.java         (interface)
│   │   └── LocalFigureStorageService.java    (implementation)
│   ├── retrieval/                            (new package)
│   │   └── NeurosurgeryRetriever.java
│   ├── chat/
│   │   └── ChatService.java                  (updated — uses NeurosurgeryRetriever)
│   └── config/
│       └── FigureStorageConfig.java          (new — configures upload dir)
└── src/main/resources/
    └── db/migration/
        ├── V4__document_hierarchy.sql        (new)
        └── V5__figures_and_refs.sql          (new)

uploads/
└── figures/                                  (runtime — extracted images; gitignored)

Structure Decision: Option 2 (Web Application) confirmed. All backend changes stay within backend/. Three new packages (document/, figure/, retrieval/) keep concerns separated without adding a deployable unit; figure/ holds the storage interface and its local implementation.

Complexity Tracking

| Violation | Why Needed | Simpler Alternative Rejected Because |
| --- | --- | --- |
| Document hierarchy (BookNode → ChapterNode → SectionNode) | Parent-child retrieval: chunks reference their parent section so the LLM receives full section context, not just the matching fragment. This is a common technique for improving RAG precision. | Flat page-per-doc model (current) loses inter-sentence context; chunk-only retrieval produces incomplete answers for multi-paragraph clinical questions |
| Dual vector search (text chunks + figure captions) | Figure captions must be independently searchable: a query about "cavernous sinus anatomy" must surface the diagram even if no text chunk scores highly | A single vector store search would miss figures whose captions are not among the highest-similarity hits; this is the core deliverable of the feature |
| Third storage tier (local file store for images) | Extracted images should not live in Postgres (large binary blobs bloat tables and backups) or in the vector store (which holds embeddings and their metadata, not image files). A file-per-image approach is standard. | Storing images as base64 in Postgres JSONB would bloat the DB and complicate backup/restore; the `FigureStorageService` interface keeps the implementation swappable |
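The parent-child retrieval idea in the first row can be sketched as follows: matching chunks are mapped to their parent sections, duplicates are collapsed, and the section text is what gets passed to the LLM. This is only an illustration. `ChunkHit` and `ParentSectionExpansion` are hypothetical names, and the in-memory map stands in for a lookup that would presumably go through `SectionRepository`.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical hit type: a matched chunk plus the ID of its parent section.
record ChunkHit(String chunkId, String sectionId, double score) {}

final class ParentSectionExpansion {

    /**
     * Maps chunk hits to the distinct texts of their parent sections.
     * Hits are assumed to arrive sorted by descending score, so each section
     * keeps the position of its best-scoring chunk.
     */
    static List<String> expand(List<ChunkHit> hitsByScoreDesc,
                               Map<String, String> sectionTextById) {
        Map<String, String> ordered = new LinkedHashMap<>();
        for (ChunkHit hit : hitsByScoreDesc) {
            String text = sectionTextById.get(hit.sectionId());
            if (text != null) {
                // putIfAbsent collapses multiple hits from the same section.
                ordered.putIfAbsent(hit.sectionId(), text);
            }
        }
        return List.copyOf(ordered.values());
    }
}
```

Deduplicating by section keeps the prompt from containing the same section twice when several of its chunks match the query.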
