first implementation - image/drawing integration

2026-04-04 12:56:56 +02:00
parent fc5b22fba1
commit 5acfdd33c1
42 changed files with 2854 additions and 151 deletions
@@ -0,0 +1,188 @@
+# Research: Enhanced Embedding with Image Parsing and Metadata
+
+**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-03
+
+This document resolves all technical unknowns identified during planning. The primary source for
+decisions is the detailed architecture provided directly by the project owner, supplemented by
+Spring AI 2.0.0-M4 API specifics.
+
+---
+
+## Decision 1: Document Hierarchy Model
+
+**Decision**: Adopt a four-level hierarchy — `BookNode` → `ChapterNode` → `SectionNode` →
+`TextChunkNode` + `FigureNode`. The `SectionNode` is the pivotal unit: it holds the full section
+text in Postgres and is used for parent-child context expansion at retrieval time.
+
+**Rationale**: A flat page-per-document model (current implementation) loses structural context.
+When a user asks a multi-faceted clinical question, the LLM needs the surrounding section text,
+not just the matching fragment. Parent-child retrieval — where chunks point to their parent
+section — is the established pattern for RAG precision. The hierarchy also makes figure-to-section
+association explicit and queryable.
+
+**Alternatives considered**:
+- Keep flat page model, add metadata only → rejected: insufficient for precise citation and
+  context expansion
+- Chapter-level retrieval (coarser than section) → rejected: too much irrelevant context sent
+  to LLM; cost and latency increase
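
As a rough illustration of the parent-child expansion described above, the following Java sketch models the hierarchy as records and maps retrieved chunks back to their parent sections. All field names and the `ParentExpansion` helper are assumptions made for this example, not the project's actual schema.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical shapes for the hierarchy; field names are assumptions.
record BookNode(String bookId, String title) {}
record ChapterNode(String chapterId, String bookId, String title) {}
record SectionNode(String sectionId, String chapterId, String title, String fullText) {}
record TextChunkNode(String chunkId, String sectionId, int chunkIndex, String text) {}
record FigureNode(String figureId, String sectionId, int page, String caption) {}

final class ParentExpansion {
    private ParentExpansion() {}

    /**
     * Maps retrieved chunks to their parent sections, keeping first-hit order
     * and returning each section once even if several of its chunks matched.
     */
    static List<SectionNode> expand(List<TextChunkNode> hits,
                                    Map<String, SectionNode> sectionsById) {
        Map<String, SectionNode> ordered = new LinkedHashMap<>();
        for (TextChunkNode hit : hits) {
            SectionNode parent = sectionsById.get(hit.sectionId());
            if (parent != null) {
                ordered.putIfAbsent(parent.sectionId(), parent);
            }
        }
        return List.copyOf(ordered.values());
    }
}
```

Deduplicating by `sectionId` keeps the prompt from carrying the same section text twice when several chunks of one section match.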
+
+---
+
+## Decision 2: Image Extraction Strategy
+
+**Decision**: Use PDFBox (already on classpath via `spring-ai-pdf-document-reader`) to extract
+images per page. Each image is tagged with `page`, `figure_id` (derived from caption, e.g.
+"Fig. 12-4"), and the parent `sectionId`. Images are saved to local disk under
+`./uploads/figures/{bookId}/` by default (see Decision 6).
+
+**Rationale**: PDFBox is already present (Spring AI bundles it). No new dependency needed.
+Per-page extraction captures every embedded raster image regardless of PDF structure. Figures
+drawn as vector paths are not image objects and would need separate handling.
+
+**Alternatives considered**:
+- iText / iText7 → additional commercial dependency; overkill for extraction
+- Screenshot each page as PNG, then OCR → far slower; loses vector quality
+
+---
+
+## Decision 3: Figure Content Representation
+
+**Decision**: Generate a textual description of each extracted image using the OpenAI vision
+model (GPT-4o). This description becomes the `content` field of the figure's vector store
+document. The figure caption (parsed from the surrounding text) is also included to maximise
+retrieval signal.
+
+**Rationale**: Caption-only embedding would miss figures with no caption or with sparse labels.
+Vision-generated descriptions produce richer semantic content (anatomy terms, structural
+relationships) that matches clinical queries. The OpenAI client already in use supports image
+inputs; no additional dependency is required.
+
+**Alternatives considered**:
+- Caption-only embedding → insufficient when captions are absent or terse (common in textbooks)
+- Local vision model (LLaVA) → requires self-hosting; out of scope for POC
+- OCR only → extracts text visible in image but misses non-text visual content (diagrams, MRI)
+
+---
+
+## Decision 4: Dual Vector Search
+
+**Decision**: At query time, run two parallel similarity searches:
+1. Text chunk search (filtered by `type = "TEXT"` and `book_id`)
+2. Figure search over each figure's description and caption (filtered by `type = "FIGURE"` and `book_id`)
+
+Results are merged and deduplicated. The LLM prompt receives the expanded parent section text
+plus a structured figure reference list.
+
+**Rationale**: A single search would rank text and figures against each other; figures with
+terse captions would systematically lose to text chunks. Separate searches with independent
+`topK` allow tuning each modality independently.
+
+**Alternatives considered**:
+- Single search, filter by relevance score → figure captions score lower than text; figures
+  are systematically under-retrieved
+- Post-process text results to look up linked figures only → misses figures that are relevant
+  to the query but not explicitly referenced in the retrieved text chunks
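
The two-search flow can be sketched in plain Java as below. The `Hit` record and the `Supplier`-based search functions stand in for real vector store calls and are assumptions for illustration. The merge order (text first, then figures) is also one possible choice, not something the decision specifies.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Hypothetical hit type; in the real system this would be a vector store document.
record Hit(String id, String type, double score) {}

final class DualSearch {
    private DualSearch() {}

    /** Runs both searches concurrently, then merges them, dropping repeated ids. */
    static List<Hit> run(Supplier<List<Hit>> textSearch, Supplier<List<Hit>> figureSearch) {
        CompletableFuture<List<Hit>> text = CompletableFuture.supplyAsync(textSearch);
        CompletableFuture<List<Hit>> figures = CompletableFuture.supplyAsync(figureSearch);
        return merge(text.join(), figures.join());
    }

    /** Text hits first, then figure hits; the first occurrence of an id wins. */
    static List<Hit> merge(List<Hit> textHits, List<Hit> figureHits) {
        Map<String, Hit> byId = new LinkedHashMap<>();
        for (Hit h : textHits) {
            byId.putIfAbsent(h.id(), h);
        }
        for (Hit h : figureHits) {
            byId.putIfAbsent(h.id(), h);
        }
        return new ArrayList<>(byId.values());
    }
}
```

Each supplier would apply its own `topK` and `type` filter, so the two modalities never compete for the same result slots.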
+
+---
+
+## Decision 5: Chunk-to-Figure Linking
+
+**Decision**: During text parsing, whenever a pattern matching `Fig\.\s+\d+[\-\.]\d+` or
+`Figure\s+\d+[\-\.]\d+` is found in a chunk, insert a row into the `chunk_figure_refs` table
+linking `chunkId` → `figureId`. At retrieval time, after text chunks are retrieved, their
+associated figures are fetched from this table and added to the LLM prompt.
+
+**Rationale**: Explicit linking ensures that when a text chunk is retrieved, its referenced
+figures are always surfaced — even if the figure's caption did not score highly in the vector
+search. This is the higher-recall path; dual search (Decision 4) is the higher-precision path.
+
+**Alternatives considered**:
+- Rely entirely on dual vector search → may miss figures referenced in retrieved text but
+  scoring below the topK threshold in the figure search
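
A minimal sketch of the reference detection, assuming the two patterns are combined into one regex and that labels are normalised to a `chapter-index` form. The `FigureRefExtractor` name and the normalisation are illustrative choices, and the insert into `chunk_figure_refs` is left out.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FigureRefExtractor {
    // "Fig." or "Figure", whitespace, then chapter and index joined by '-' or '.'.
    // The dot after "Fig" is escaped so that "Figs 1-2" does not match.
    private static final Pattern REF =
            Pattern.compile("\\b(?:Fig\\.|Figure)\\s+(\\d+)[-.](\\d+)");

    private FigureRefExtractor() {}

    /** Returns distinct labels normalised to "chapter-index", in order of first appearance. */
    static Set<String> extract(String chunkText) {
        Set<String> labels = new LinkedHashSet<>();
        Matcher m = REF.matcher(chunkText);
        while (m.find()) {
            labels.add(m.group(1) + "-" + m.group(2));
        }
        return labels;
    }
}
```

Returning a set means a chunk that mentions the same figure several times yields a single link row.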
+
+---
+
+## Decision 6: Image Storage
+
+**Decision**: Extracted images are saved as PNG files to a local directory
+(`${app.figure-storage.base-path}/{bookId}/`, where the base path defaults to
+`./uploads/figures`). The path is stored in `figure.image_path` in Postgres. A
+`FigureStorageService` interface wraps all disk I/O so the implementation can be swapped to S3
+or another object store without changing callers.
+
+**Rationale**: Local disk is the simplest viable option for a POC with fewer than 10 users. The interface
+boundary satisfies Constitution Principle II (Easy to Change).
+
+**Alternatives considered**:
+- S3 from day 1 → operational overhead not justified at POC scale
+- Base64 in Postgres JSONB → bloats DB; complicates backup; query performance degrades
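
One way the storage boundary could look is sketched below. The source names only the `FigureStorageService` interface, so the method signatures and the `LocalDiskFigureStorage` class are assumptions for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Method names are illustrative; the source only names the interface itself.
interface FigureStorageService {
    /** Stores PNG bytes for a figure and returns the path to persist in figure.image_path. */
    String save(String bookId, String figureId, byte[] pngBytes);

    /** Reads back the bytes stored at a path previously returned by save. */
    byte[] load(String imagePath);
}

final class LocalDiskFigureStorage implements FigureStorageService {
    private final Path basePath;

    LocalDiskFigureStorage(Path basePath) {
        this.basePath = basePath;
    }

    @Override
    public String save(String bookId, String figureId, byte[] pngBytes) {
        try {
            // One subdirectory per book, created on first use.
            Path dir = Files.createDirectories(basePath.resolve(bookId));
            Path file = dir.resolve(figureId + ".png");
            Files.write(file, pngBytes);
            return file.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public byte[] load(String imagePath) {
        try {
            return Files.readAllBytes(Path.of(imagePath));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because callers see only the interface, an object-store implementation could later return a key or URL from `save` without touching them. A production version would also need to validate `bookId` and `figureId` before resolving them into a path.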
+
+---
+
+## Decision 7: Figure Type Classification
+
+**Decision**: Use the enum `FigureType { ANATOMICAL_DIAGRAM, SURGICAL_PHOTOGRAPH, MRI_CT_SCAN,
+TABLE, CHART, INTRAOPERATIVE_IMAGE }`. Classification is derived from:
+1. Caption keywords ("MRI", "CT", "Fig.", "Table") — heuristic, no model needed
+2. Fall back to `ANATOMICAL_DIAGRAM` if unclassifiable
+
+**Rationale**: Allows the frontend to render different icon/label per type (e.g., "MRI" badge).
+Heuristic classification avoids a separate model call per image at extraction time.
+
+**Alternatives considered**:
+- Vision model classification → accurate but adds latency and cost per figure; deferrable
+- Single `FIGURE` type → loses citation granularity required by spec FR-004
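
The caption heuristic could be sketched as follows. The enum values come from the decision above, but the keyword lists, their order, and the `FigureTypeClassifier` class are guesses made for this example.

```java
import java.util.Locale;
import java.util.regex.Pattern;

enum FigureType {
    ANATOMICAL_DIAGRAM, SURGICAL_PHOTOGRAPH, MRI_CT_SCAN, TABLE, CHART, INTRAOPERATIVE_IMAGE
}

final class FigureTypeClassifier {
    // Keyword lists are illustrative guesses, not the project's actual rules.
    // Word boundaries keep "ct" from matching inside words such as "section".
    private static final Pattern SCAN = Pattern.compile("\\b(mri|ct)\\b");
    private static final Pattern INTRAOP = Pattern.compile("\\bintraoperative\\b");
    private static final Pattern TABLE = Pattern.compile("^\\s*table\\b");
    private static final Pattern CHART = Pattern.compile("\\b(chart|graph|plot)\\b");
    private static final Pattern PHOTO = Pattern.compile("\\bphotograph\\b");

    private FigureTypeClassifier() {}

    /** Classifies by caption keywords; falls back to ANATOMICAL_DIAGRAM. */
    static FigureType classify(String caption) {
        if (caption == null || caption.isBlank()) {
            return FigureType.ANATOMICAL_DIAGRAM;
        }
        String c = caption.toLowerCase(Locale.ROOT);
        if (SCAN.matcher(c).find()) return FigureType.MRI_CT_SCAN;
        if (INTRAOP.matcher(c).find()) return FigureType.INTRAOPERATIVE_IMAGE;
        if (TABLE.matcher(c).find()) return FigureType.TABLE;
        if (CHART.matcher(c).find()) return FigureType.CHART;
        if (PHOTO.matcher(c).find()) return FigureType.SURGICAL_PHOTOGRAPH;
        return FigureType.ANATOMICAL_DIAGRAM;
    }
}
```

The rule order matters: "Intraoperative photograph" resolves to `INTRAOPERATIVE_IMAGE` because that check runs before the generic photograph check.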
+
+---
+
+## Decision 8: Metadata Schema for Vector Store Documents
+
+**Decision**: All vector store documents carry a flat `Map<String, Object>` metadata for Spring
+AI filtering. Schema:
+
+| Field | Text Chunk | Figure Chunk |
+|-------|-----------|-------------|
+| `type` | `"TEXT"` | `"FIGURE"` |
+| `book_id` | ✓ | ✓ |
+| `book_title` | ✓ | ✓ |
+| `chapter_id` | ✓ | ✓ |
+| `section_id` | ✓ | ✓ |
+| `section_title` | ✓ | ✓ |
+| `page_start` | ✓ | — |
+| `page_end` | ✓ | — |
+| `chunk_index` | ✓ | — |
+| `total_chunks` | ✓ | — |
+| `figure_id` | — | ✓ |
+| `figure_type` | — | ✓ |
+| `image_path` | — | ✓ |
+| `label` | — | ✓ |
+| `page` | — | ✓ |
+
+**Rationale**: Spring AI metadata filters (built with `FilterExpressionBuilder`) address
+metadata fields by key, so the map is kept flat. Separation by `type` allows independent
+filtering in dual search.
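
To make the schema concrete, the sketch below builds both metadata maps in plain Java. The keys follow the table above. The `SectionRef` record and the `ChunkMetadata` helper are hypothetical names introduced for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical carrier for the fields shared by both document kinds.
record SectionRef(String bookId, String bookTitle, String chapterId,
                  String sectionId, String sectionTitle) {}

final class ChunkMetadata {
    private ChunkMetadata() {}

    /** Fields present on every vector store document. */
    private static Map<String, Object> shared(String type, SectionRef ref) {
        Map<String, Object> m = new LinkedHashMap<>();
        m.put("type", type);
        m.put("book_id", ref.bookId());
        m.put("book_title", ref.bookTitle());
        m.put("chapter_id", ref.chapterId());
        m.put("section_id", ref.sectionId());
        m.put("section_title", ref.sectionTitle());
        return m;
    }

    static Map<String, Object> forText(SectionRef ref, int pageStart, int pageEnd,
                                       int chunkIndex, int totalChunks) {
        Map<String, Object> m = shared("TEXT", ref);
        m.put("page_start", pageStart);
        m.put("page_end", pageEnd);
        m.put("chunk_index", chunkIndex);
        m.put("total_chunks", totalChunks);
        return m;
    }

    static Map<String, Object> forFigure(SectionRef ref, String figureId, String figureType,
                                         String imagePath, String label, int page) {
        Map<String, Object> m = shared("FIGURE", ref);
        m.put("figure_id", figureId);
        m.put("figure_type", figureType);
        m.put("image_path", imagePath);
        m.put("label", label);
        m.put("page", page);
        return m;
    }
}
```

Every value is a string or an integer, so the map stays flat and each key can be used directly in a filter.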
+
+---
+
+## Decision 9: Re-embedding Existing Books
+
+**Decision**: Books already processed under feature 001 (text-only) are NOT automatically
+re-embedded. An explicit re-embed action is exposed via `POST /api/v1/books/{id}/reembed`
+(admin-triggered). The existing chunks remain valid for text queries until re-embedding completes.
+
+**Rationale**: Automatic re-embedding on deploy would block the system and risk data loss if
+the process fails mid-way. An explicit, idempotent trigger is safer and more observable.
+
+---
+
+## Decision 10: Minimum Image Size Threshold
+
+**Decision**: Images smaller than 100×100 pixels are discarded and no chunk is created. This
+threshold filters out decorative elements (bullets, dividers, publisher logos) without a
+classification model.
+
+**Rationale**: Neurosurgery textbook diagrams and MRI scans are expected to be at least 100×100 px.
+The threshold is configurable via `app.figure-storage.min-image-size-px` in
+`application.properties`.
+
+**Alternatives considered**:
+- No threshold → decorative icons pollute the figure index
+- ML-based classification → accurate but adds model dependency; not needed at POC scale
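
A minimal sketch of the size filter, assuming the threshold applies to width and height individually. The decision text does not say whether one small dimension is enough to discard an image, so that reading is an assumption, as is the `ImageSizeFilter` class.

```java
final class ImageSizeFilter {
    private final int minSizePx;

    /** minSizePx would come from app.figure-storage.min-image-size-px (100 in the decision above). */
    ImageSizeFilter(int minSizePx) {
        if (minSizePx < 0) {
            throw new IllegalArgumentException("minSizePx must be non-negative");
        }
        this.minSizePx = minSizePx;
    }

    /** Keeps an image only when both dimensions reach the threshold (assumed reading). */
    boolean accept(int widthPx, int heightPx) {
        return widthPx >= minSizePx && heightPx >= minSizePx;
    }
}
```

Under this reading a 99×400 divider is dropped, while a 100×100 image is kept.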