Implementation Plan: Enhanced Embedding with Image Parsing and Metadata
Branch: 002-image-aware-embedding | Date: 2026-04-03 | Spec: spec.md
Input: Feature specification from /specs/002-image-aware-embedding/spec.md
Summary
Enhance the book embedding pipeline to extract images from every PDF page, generate descriptive text for each image, and store all content (text chunks + figure captions) with rich, consistent metadata in the vector store. A new document hierarchy (Book → Chapter → Section → TextChunk + Figure) is introduced. Postgres holds the full-text sections and figure metadata; the vector store holds chunk and figure caption embeddings; the local file store holds extracted image files. At query time, both the text-chunk store and figure-caption store are searched in parallel and results are merged before being sent to the LLM.
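The query-time flow above can be sketched roughly as follows. This is a minimal illustration, not the planned implementation: `Hit`, `DualSearch`, and the two function parameters are hypothetical stand-ins for the real stores (they are not Spring AI types), and the sketch assumes similarity scores from the two stores are comparable enough to sort together.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

/** Hypothetical result type: one scored hit from either store. */
record Hit(String id, String kind, String content, double score) {}

final class DualSearch {
    private DualSearch() {}

    /**
     * Queries both stores concurrently and merges the hits, best score first.
     * Each store is expected to apply its own top-k limit, so no hit is dropped here.
     */
    static List<Hit> search(String query,
                            Function<String, List<Hit>> textChunkStore,
                            Function<String, List<Hit>> figureCaptionStore) {
        // Start both lookups before waiting on either one.
        CompletableFuture<List<Hit>> chunks =
                CompletableFuture.supplyAsync(() -> textChunkStore.apply(query));
        CompletableFuture<List<Hit>> figures =
                CompletableFuture.supplyAsync(() -> figureCaptionStore.apply(query));

        List<Hit> merged = new ArrayList<>(chunks.join());
        merged.addAll(figures.join());
        // Assumption: scores from both stores share a scale (same embedding model and metric).
        merged.sort(Comparator.comparingDouble(Hit::score).reversed());
        return List.copyOf(merged);
    }
}
```

Keeping a separate top-k per store, rather than truncating after the merge, means a figure caption is not crowded out by higher-scoring text chunks.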
Technical Context
Language/Version: Java 25 (backend), TypeScript / Node 20 (frontend)
Primary Dependencies: Spring Boot 4.0.5, Spring AI 2.0.0-M4, OpenAI API (embeddings + chat), PDFBox (via Spring AI PDF reader dependency)
Storage: PostgreSQL (JPA + Flyway), pgvector (Spring AI VectorStore), local file system (extracted images — /uploads/figures/)
Testing: Spring Boot Test, JUnit 5, Mockito
Target Platform: Linux server (Docker Compose)
Project Type: Web application — backend REST API + Vue 3 frontend
Performance Goals: Full book (up to 500 pages with images) processed in ≤ 30 minutes; query latency unchanged from the existing baseline
Constraints: No new deployable units; all changes within the existing backend/ module; image storage on local disk (S3 migration is a future concern, behind an interface)
Scale/Scope: POC — <10 concurrent users; single shared book library
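As a rough planning aid, the processing goal above implies an average time budget per page. The helper below is a hypothetical illustration (`ProcessingBudget` is not part of the planned code); it only divides the total budget by the page count.

```java
import java.time.Duration;

final class ProcessingBudget {
    private ProcessingBudget() {}

    /** Average time available per page if the whole book must finish within {@code total}. */
    static Duration perPage(Duration total, int pages) {
        return total.dividedBy(pages);
    }
}
```

For 500 pages in 30 minutes this gives 3.6 seconds per page on average, which has to cover text extraction, image extraction, caption generation, and embedding calls for that page.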
Constitution Check
GATE: Must pass before Phase 0 research. Re-check after Phase 1 design.
| Principle | Status | Notes |
|---|---|---|
| I — KISS | ⚠️ Justified violation — see Complexity Tracking | Hierarchical model + dual search adds complexity; justified by precision requirement |
| II — Easy to Change | ✅ | Figure storage wrapped behind FigureStorageService interface; can swap local disk for S3 |
| III — Web-First | ✅ | All new capabilities exposed via existing REST API; no new deployable units |
| IV — Docs as Architecture | ⚠️ Required | README Mermaid diagram MUST be updated in this PR to show new storage tiers |
Project Structure
Documentation (this feature)
specs/002-image-aware-embedding/
├── plan.md # This file
├── research.md # Phase 0 output
├── data-model.md # Phase 1 output
├── quickstart.md # Phase 1 output
├── contracts/ # Phase 1 output
└── tasks.md # Phase 2 output (/speckit.tasks)
Source Code (repository root)
backend/
├── src/main/java/com/aiteacher/
│   ├── book/
│   │   ├── Book.java                         (existing)
│   │   ├── BookController.java               (existing)
│   │   ├── BookService.java                  (existing)
│   │   ├── BookRepository.java               (existing)
│   │   ├── BookStatus.java                   (existing)
│   │   ├── BookEmbeddingService.java         (existing, enhanced)
│   │   └── NoKnowledgeSourceException.java   (existing)
│   ├── document/                             (new package)
│   │   ├── BookNode.java
│   │   ├── ChapterNode.java
│   │   ├── SectionNode.java
│   │   ├── SectionRepository.java
│   │   ├── TextChunkNode.java
│   │   ├── FigureNode.java
│   │   ├── FigureRepository.java
│   │   ├── FigureType.java
│   │   ├── ChunkFigureRef.java
│   │   └── ChunkFigureRefRepository.java
│   ├── figure/                               (new package)
│   │   ├── FigureStorageService.java         (interface)
│   │   └── LocalFigureStorageService.java    (implementation)
│   ├── retrieval/                            (new package)
│   │   └── NeurosurgeryRetriever.java
│   ├── chat/
│   │   └── ChatService.java                  (updated, uses NeurosurgeryRetriever)
│   └── config/
│       └── FigureStorageConfig.java          (new, configures upload dir)
└── src/main/resources/
    └── db/migration/
        ├── V4__document_hierarchy.sql        (new)
        └── V5__figures_and_refs.sql          (new)
uploads/
└── figures/                                  (runtime: extracted images; gitignored)
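To make the new document/ package easier to picture, here is a simplified sketch of the hierarchy as plain Java records. The type names come from the tree above; every field, the `FigureType` values, and the `ChunkMetadata` helper are assumptions for illustration. The real classes would presumably be JPA entities with their own mappings.

```java
import java.util.List;
import java.util.Map;

// Enum values are assumptions; the plan only names the type.
enum FigureType { DIAGRAM, PHOTO, CHART, TABLE }

// Field lists below are illustrative guesses, not the planned schema.
record FigureNode(String id, FigureType type, int pageNumber, String caption, String storageKey) {}
record TextChunkNode(String id, int pageNumber, String text, List<String> figureIds) {}
record SectionNode(String id, String title, String fullText,
                   List<TextChunkNode> chunks, List<FigureNode> figures) {}
record ChapterNode(String id, String title, List<SectionNode> sections) {}
record BookNode(String id, String title, List<ChapterNode> chapters) {}

/** Hypothetical helper that builds one consistent metadata map per embedded chunk. */
final class ChunkMetadata {
    private ChunkMetadata() {}

    static Map<String, Object> forChunk(BookNode book, ChapterNode chapter,
                                        SectionNode section, TextChunkNode chunk) {
        // The same keys would be used for figure captions, with a different contentType.
        return Map.of(
                "bookId", book.id(),
                "chapterId", chapter.id(),
                "sectionId", section.id(),
                "pageNumber", chunk.pageNumber(),
                "contentType", "TEXT_CHUNK");
    }
}
```

Carrying `sectionId` in the metadata is what would let a retriever look up the parent section for a matching chunk.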
Structure Decision: Option 2 (Web Application) confirmed. All backend changes stay within
backend/. Two new packages (document/, retrieval/) plus one interface package (figure/)
keep concerns separated without adding a deployable unit.
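A minimal sketch of how the figure/ package could keep storage swappable. The interface and class names come from the structure above, but the method signatures, the key format, and the `.png` extension are assumptions, not the project's actual API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Storage abstraction; an S3-backed implementation could replace the local one later. */
interface FigureStorageService {
    /** Stores the image bytes and returns an opaque key for later retrieval. */
    String store(String bookId, int pageNumber, int figureIndex, byte[] image);

    /** Loads the image bytes previously stored under {@code key}. */
    byte[] load(String key);
}

final class LocalFigureStorageService implements FigureStorageService {
    private final Path root;

    LocalFigureStorageService(Path root) {
        this.root = root.toAbsolutePath().normalize();
    }

    @Override
    public String store(String bookId, int pageNumber, int figureIndex, byte[] image) {
        // Assumed key layout: <bookId>/p<page>-f<index>.png
        String key = bookId + "/p" + pageNumber + "-f" + figureIndex + ".png";
        Path target = resolve(key);
        try {
            Files.createDirectories(target.getParent());
            Files.write(target, image);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return key;
    }

    @Override
    public byte[] load(String key) {
        try {
            return Files.readAllBytes(resolve(key));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Resolves a key under the root and rejects keys that escape it (for example "../x"). */
    private Path resolve(String key) {
        Path path = root.resolve(key).normalize();
        if (!path.startsWith(root)) {
            throw new IllegalArgumentException("Key escapes storage root: " + key);
        }
        return path;
    }
}
```

Callers only ever see the returned key, so Postgres can store that key in the figure metadata without knowing whether the bytes live on local disk or elsewhere.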
Complexity Tracking
| Violation | Why Needed | Simpler Alternative Rejected Because |
|---|---|---|
| Document hierarchy (BookNode → ChapterNode → SectionNode) | Parent-child retrieval: chunks reference their parent section so the LLM receives full section context, not just the matching fragment. This pattern is widely used to improve RAG precision. | Flat page-per-doc model (current) loses inter-sentence context; chunk-only retrieval produces incomplete answers for multi-paragraph clinical questions |
| Dual vector search (text chunks + figure captions) | Figure captions must be independently searchable: a query about "cavernous sinus anatomy" must surface the diagram even if no text chunk scores highly | A single combined search could miss figures whose captions are outranked by text chunks in the top-k results; surfacing those figures is the core deliverable of the feature |
| Third storage tier (local file store for images) | Extracted images should not live in Postgres (binary blobs degrade query performance) or in the vector store (which holds embeddings and metadata, not binary files). A file-per-image approach is standard. | Storing images as base64 in Postgres JSONB would bloat the DB and complicate backup/restore; the FigureStorageService interface keeps the implementation swappable |
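The parent-child retrieval step from the first row can be sketched as follows. `ChunkHit`, `Section`, and `ParentSectionResolver` are hypothetical names, and the in-memory map stands in for a lookup that would presumably go through SectionRepository.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical vector-store hit that carries its parent section id as metadata. */
record ChunkHit(String chunkId, long sectionId, double score) {}

/** Hypothetical view of a stored section with its full text. */
record Section(long id, String title, String fullText) {}

final class ParentSectionResolver {
    private ParentSectionResolver() {}

    /**
     * Maps chunk hits (ordered best first) to their parent sections.
     * Each section appears once, in the order of its best-scoring chunk.
     */
    static List<Section> resolve(List<ChunkHit> hitsBestFirst, Map<Long, Section> sectionsById) {
        Map<Long, Section> seen = new LinkedHashMap<>();
        for (ChunkHit hit : hitsBestFirst) {
            Section section = sectionsById.get(hit.sectionId());
            if (section != null) {
                // putIfAbsent keeps the first (best-ranked) occurrence and drops duplicates.
                seen.putIfAbsent(section.id(), section);
            }
        }
        return List.copyOf(seen.values());
    }
}
```

De-duplicating by section keeps the prompt from repeating the same section text when several of its chunks match the query.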