first implementation - image/drawing integration

2026-04-04 12:56:56 +02:00
parent fc5b22fba1
commit 5acfdd33c1
42 changed files with 2854 additions and 151 deletions
@@ -0,0 +1,105 @@
+# Implementation Plan: Enhanced Embedding with Image Parsing and Metadata
+
+**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-03 | **Spec**: [spec.md](spec.md)  
+**Input**: Feature specification from `/specs/002-image-aware-embedding/spec.md`
+
+## Summary
+
+Enhance the book embedding pipeline to extract images from every PDF page, generate descriptive
+text for each image, and store all content (text chunks + figure captions) with rich, consistent
+metadata in the vector store. A new document hierarchy (Book → Chapter → Section → TextChunk +
+Figure) is introduced. Postgres holds the full-text sections and figure metadata; the vector
+store holds chunk and figure caption embeddings; the local file store holds extracted image files.
+At query time, both the text-chunk store and figure-caption store are searched in parallel and
+results are merged before being sent to the LLM.
+
+## Technical Context
+
+**Language/Version**: Java 25 (backend), TypeScript / Node 20 (frontend)  
+**Primary Dependencies**: Spring Boot 4.0.5, Spring AI 2.0.0-M4, OpenAI API (embeddings + chat), PDFBox (via Spring AI PDF reader dependency)  
+**Storage**: PostgreSQL (JPA + Flyway), pgvector (Spring AI `VectorStore`), local file system (extracted images — `/uploads/figures/`)  
+**Testing**: Spring Boot Test, JUnit 5, Mockito  
+**Target Platform**: Linux server (Docker Compose)  
+**Project Type**: Web application — backend REST API + Vue 3 frontend  
+**Performance Goals**: Full book (up to 500 pages with images) processed in ≤ 30 minutes; query latency unchanged from the existing baseline  
+**Constraints**: No new deployable units; all changes within the existing `backend/` module; image storage on local disk (S3 migration is a future concern, behind an interface)  
+**Scale/Scope**: POC — <10 concurrent users; single shared book library
+
+## Constitution Check
+
+*GATE: Must pass before Phase 0 research. Re-check after Phase 1 design.*
+
+| Principle | Status | Notes |
+|-----------|--------|-------|
+| I — KISS | ⚠️ Justified violation (see Complexity Tracking) | Hierarchical model + dual search add complexity; justified by the precision requirement |
+| II — Easy to Change | ✅ | Figure storage wrapped behind `FigureStorageService` interface; can swap local disk for S3 |
+| III — Web-First | ✅ | All new capabilities exposed via existing REST API; no new deployable units |
+| IV — Docs as Architecture | ⚠️ Required | README Mermaid diagram MUST be updated in this PR to show new storage tiers |
+
+## Project Structure
+
+### Documentation (this feature)
+
+```text
+specs/002-image-aware-embedding/
+├── plan.md              # This file
+├── research.md          # Phase 0 output
+├── data-model.md        # Phase 1 output
+├── quickstart.md        # Phase 1 output
+├── contracts/           # Phase 1 output
+└── tasks.md             # Phase 2 output (/speckit.tasks)
+```
+
+### Source Code (repository root)
+
+```text
+backend/
+├── src/main/java/com/aiteacher/
+│   ├── book/
+│   │   ├── Book.java                         (existing)
+│   │   ├── BookController.java               (existing)
+│   │   ├── BookService.java                  (existing)
+│   │   ├── BookRepository.java               (existing)
+│   │   ├── BookStatus.java                   (existing)
+│   │   ├── BookEmbeddingService.java         (existing — enhanced)
+│   │   └── NoKnowledgeSourceException.java   (existing)
+│   ├── document/                             (new package)
+│   │   ├── BookNode.java
+│   │   ├── ChapterNode.java
+│   │   ├── SectionNode.java
+│   │   ├── SectionRepository.java
+│   │   ├── TextChunkNode.java
+│   │   ├── FigureNode.java
+│   │   ├── FigureRepository.java
+│   │   ├── FigureType.java
+│   │   ├── ChunkFigureRef.java
+│   │   └── ChunkFigureRefRepository.java
+│   ├── figure/                               (new package)
+│   │   ├── FigureStorageService.java         (interface)
+│   │   └── LocalFigureStorageService.java    (implementation)
+│   ├── retrieval/                            (new package)
+│   │   └── NeurosurgeryRetriever.java
+│   ├── chat/
+│   │   └── ChatService.java                  (updated — uses NeurosurgeryRetriever)
+│   └── config/
+│       └── FigureStorageConfig.java          (new — configures upload dir)
+└── src/main/resources/
+    └── db/migration/
+        ├── V4__document_hierarchy.sql        (new)
+        └── V5__figures_and_refs.sql          (new)
+
+uploads/
+└── figures/                                  (runtime — extracted images; gitignored)
+```
+
+**Structure Decision**: Option 2 (Web Application) confirmed. All backend changes stay within
+`backend/`. Three new packages (`document/`, `figure/`, `retrieval/`) keep concerns separated
+without adding a deployable unit; `figure/` holds the storage interface and its local-disk
+implementation.
+
+## Complexity Tracking
+
+| Violation | Why Needed | Simpler Alternative Rejected Because |
+|-----------|------------|-------------------------------------|
+| Document hierarchy (BookNode → ChapterNode → SectionNode) | Parent-child retrieval: chunks reference their parent section so the LLM receives full section context, not just the matching fragment. This is a common pattern for improving RAG precision. | The current flat page-per-doc model loses inter-sentence context; chunk-only retrieval produces incomplete answers for multi-paragraph clinical questions |
+| Dual vector search (text chunks + figure captions) | Figure captions must be independently searchable — a query about "cavernous sinus anatomy" must surface the diagram even if no text chunk scores highly | Single vector store search would miss figures whose captions don't happen to be the highest-similarity hit; this is the core deliverable of the feature |
+| Third storage tier (local file store for images) | Extracted images should not live in Postgres (binary blobs degrade query performance) or in the vector store (which holds embeddings, not image files). A file-per-image approach is standard. | Storing images as base64 in Postgres JSONB would bloat the DB and complicate backup/restore; the `FigureStorageService` interface keeps the implementation swappable |
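
The dual search described in the plan's Summary and Complexity Tracking table could look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the commit's `NeurosurgeryRetriever`: the `Hit` record, the `retrieve` method, and the two search functions are hypothetical stand-ins for the text-chunk and figure-caption queries, and the scores from both stores are assumed to be directly comparable. Each search is assumed to apply its own top-k limit, and the merge does not truncate further, so figure hits are not crowded out by higher-scoring text chunks.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Hypothetical sketch: none of these names are taken from the commit.
final class DualRetrievalSketch {

    /** One search result; kind is "text" or "figure". */
    record Hit(String kind, String id, double score) {}

    /**
     * Runs both searches concurrently, then merges all hits by descending score.
     * The two functions stand in for the text-chunk and figure-caption queries.
     */
    static List<Hit> retrieve(String query,
                              Function<String, List<Hit>> textSearch,
                              Function<String, List<Hit>> figureSearch) {
        CompletableFuture<List<Hit>> text =
                CompletableFuture.supplyAsync(() -> textSearch.apply(query));
        CompletableFuture<List<Hit>> figures =
                CompletableFuture.supplyAsync(() -> figureSearch.apply(query));

        // join() blocks until each search has finished.
        List<Hit> merged = new ArrayList<>(text.join());
        merged.addAll(figures.join());

        // Assumption: scores from both stores are on the same scale.
        merged.sort(Comparator.comparingDouble(Hit::score).reversed());
        return List.copyOf(merged);
    }

    public static void main(String[] args) {
        List<Hit> hits = retrieve("cavernous sinus anatomy",
                q -> List.of(new Hit("text", "chunk-12", 0.71),
                             new Hit("text", "chunk-40", 0.55)),
                q -> List.of(new Hit("figure", "fig-3", 0.83)));
        hits.forEach(System.out::println);
    }
}
```

In a real implementation the two functions would presumably wrap similarity searches against the vector store with different metadata filters, and the merged hits would then be resolved to their parent sections before prompting the LLM.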