# Research: Enhanced Embedding with Image Parsing and Metadata

**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-03

This document resolves all technical unknowns identified during planning. The primary source for decisions is the detailed architecture provided directly by the project owner, supplemented by Spring AI 2.0.0-M4 API specifics.

---

## Decision 1: Document Hierarchy Model

**Decision**: Adopt a four-level hierarchy: `BookNode` → `ChapterNode` → `SectionNode` → `TextChunkNode` + `FigureNode`. The `SectionNode` is the pivotal unit: it holds the full section text in Postgres and is used for parent-child context expansion at retrieval time.

**Rationale**: A flat page-per-document model (current implementation) loses structural context. When a user asks a multi-faceted clinical question, the LLM needs the surrounding section text, not just the matching fragment. Parent-child retrieval, where chunks point to their parent section, is the established pattern for RAG precision. The hierarchy also makes figure-to-section association explicit and queryable.

**Alternatives considered**:

- Keep flat page model, add metadata only → rejected: insufficient for precise citation and context expansion
- Chapter-level retrieval (coarser than section) → rejected: too much irrelevant context sent to LLM; cost and latency increase

---

## Decision 2: Image Extraction Strategy

**Decision**: Use PDFBox (already on classpath via `spring-ai-pdf-document-reader`) to extract images per page. Each image is tagged with `page`, `figure_id` (derived from caption, e.g. "Fig. 12-4"), and the parent `sectionId`. Images are saved to local disk under `/uploads/figures/{bookId}/`.

**Rationale**: PDFBox is already present (Spring AI bundles it). No new dependency needed. Per-page extraction ensures every image is captured regardless of PDF structure.
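
The tagging step in Decision 2 can be sketched independently of the PDFBox extraction call itself. The following is a minimal sketch, not the project's implementation: `FigureTag` and `FigureTagger` are hypothetical names, and the normalised id format, the fallback id for images without a caption id, and the file-naming scheme are all assumptions made for illustration.

```java
import java.nio.file.Path;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical value object: one extracted image plus the tags named in Decision 2. */
record FigureTag(int page, String figureId, String sectionId, Path imagePath) {}

/** Hypothetical helper; naming scheme and fallback id are assumptions. */
final class FigureTagger {

    // Matches "Fig. 12-4" or "Figure 12.4" and captures both numbers.
    private static final Pattern CAPTION_ID =
            Pattern.compile("\\bFig(?:ure|\\.)\\s+(\\d+)[-.](\\d+)");

    /** Returns an id normalised to the form "Fig. 12-4", or empty if the caption has none. */
    static Optional<String> figureIdFromCaption(String caption) {
        if (caption == null) {
            return Optional.empty();
        }
        Matcher m = CAPTION_ID.matcher(caption);
        return m.find()
                ? Optional.of("Fig. " + m.group(1) + "-" + m.group(2))
                : Optional.empty();
    }

    /** Builds the tag and the target file path under {baseDir}/{bookId}/. */
    static FigureTag tag(Path baseDir, String bookId, String sectionId,
                         int page, int indexOnPage, String caption) {
        // Assumed fallback: images without a recognisable caption id get a positional id.
        String figureId = figureIdFromCaption(caption)
                .orElse("page-" + page + "-img-" + indexOnPage);
        // Collapse anything that is not a letter or digit so the id is a safe file name.
        String fileName = figureId.replaceAll("[^A-Za-z0-9]+", "_") + ".png";
        return new FigureTag(page, figureId, sectionId,
                baseDir.resolve(bookId).resolve(fileName));
    }
}
```

For example, `FigureTagger.tag(Path.of("uploads", "figures"), "book-1", "sec-3", 212, 0, "Fig. 12-4 Sagittal MRI")` yields a tag whose `figureId` is `Fig. 12-4` and whose file name is `Fig_12_4.png` inside the `book-1` directory.
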

**Alternatives considered**:

- iText / iText7 → additional commercial dependency; overkill for extraction
- Screenshot each page as PNG, then OCR → far slower; loses vector quality

---

## Decision 3: Figure Content Representation

**Decision**: Generate a textual description of each extracted image using the OpenAI vision model (GPT-4o). This description becomes the `content` field of the figure's vector store document. The figure caption (parsed from the surrounding text) is also included to maximise retrieval signal.

**Rationale**: Caption-only embedding would miss figures with no caption or with sparse labels. Vision-generated descriptions produce richer semantic content (anatomy terms, structural relationships) that matches clinical queries. The OpenAI client already in use supports image inputs; no additional dependency is required.

**Alternatives considered**:

- Caption-only embedding → insufficient when captions are absent or terse (common in textbooks)
- Local vision model (LLaVA) → requires self-hosting; out of scope for POC
- OCR only → extracts text visible in image but misses non-text visual content (diagrams, MRI)

---

## Decision 4: Dual Vector Search

**Decision**: At query time, run two parallel similarity searches:

1. Text chunk search (filtered by `type = "TEXT"` and `book_id`)
2. Figure caption search (filtered by `type = "FIGURE"` and `book_id`)

Results are merged and deduplicated. The LLM prompt receives the expanded parent section text plus a structured figure reference list.

**Rationale**: A single search would rank text and figures against each other; figures with terse captions would systematically lose to text chunks. Separate searches with independent `topK` allow tuning each modality independently.
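
The parallel search and merge step in Decision 4 can be sketched in plain Java. This is an illustrative sketch under stated assumptions: `Hit`, `Searcher`, and `DualSearch` are hypothetical names that stand in for the real vector store calls, and the merge policy (text hits first, first occurrence of an id wins) is an assumption, not something the design specifies.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

/** Hypothetical search hit; the real system would wrap a vector store document. */
record Hit(String id, String type, double score) {}

/** Hypothetical stand-in for one similarity search already filtered by type and book_id. */
interface Searcher {
    List<Hit> search(String query, String bookId, int topK);
}

final class DualSearch {
    private final Searcher textSearch;
    private final Searcher figureSearch;

    DualSearch(Searcher textSearch, Searcher figureSearch) {
        this.textSearch = textSearch;
        this.figureSearch = figureSearch;
    }

    /** Runs both searches concurrently with independent topK, then merges by id. */
    List<Hit> search(String query, String bookId, int textTopK, int figureTopK) {
        CompletableFuture<List<Hit>> text =
                CompletableFuture.supplyAsync(() -> textSearch.search(query, bookId, textTopK));
        CompletableFuture<List<Hit>> figures =
                CompletableFuture.supplyAsync(() -> figureSearch.search(query, bookId, figureTopK));

        // LinkedHashMap keeps insertion order; putIfAbsent drops later duplicates.
        Map<String, Hit> merged = new LinkedHashMap<>();
        for (Hit h : text.join()) {
            merged.putIfAbsent(h.id(), h);
        }
        for (Hit h : figures.join()) {
            merged.putIfAbsent(h.id(), h);
        }
        return new ArrayList<>(merged.values());
    }
}
```

Keeping the two `topK` values as separate parameters mirrors the rationale above: each modality can be tuned without affecting the other.
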

**Alternatives considered**:

- Single search, filter by relevance score → figure captions score lower than text; figures are systematically under-retrieved
- Post-process text results to look up linked figures only → misses figures that are relevant to the query but not explicitly referenced in the retrieved text chunks

---

## Decision 5: Chunk-to-Figure Linking

**Decision**: During text parsing, whenever a pattern matching `Fig\.\s+\d+[\-\.]\d+` or `Figure\s+\d+[\-\.]\d+` is found in a chunk, insert a row into the `chunk_figure_refs` table linking `chunkId` → `figureId`. At retrieval time, after text chunks are retrieved, their associated figures are fetched from this table and added to the LLM prompt.

**Rationale**: Explicit linking ensures that when a text chunk is retrieved, its referenced figures are always surfaced, even if the figure's caption did not score highly in the vector search. This is the higher-recall path; dual search (Decision 4) is the higher-precision path.

**Alternatives considered**:

- Rely entirely on dual vector search → may miss figures referenced in retrieved text but scoring below the topK threshold in the figure search

---

## Decision 6: Image Storage

**Decision**: Extracted images are saved as PNG files to a local directory (`${app.figure-storage.base-path}`, defaults to `./uploads/figures/{bookId}/`). The path is stored in `figure.image_path` in Postgres. A `FigureStorageService` interface wraps all disk I/O so the implementation can be swapped to S3 or another object store without changing callers.

**Rationale**: Local disk is the simplest viable option for a POC with <10 users. The interface boundary satisfies Constitution Principle II (Easy to Change).
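
The interface boundary in Decision 6 might look like the following sketch. Only the name `FigureStorageService` comes from this document; the method signatures, the `LocalFigureStorageService` class, and the choice to return the stored path as a string are assumptions. Input validation (for example rejecting `..` in `bookId`) is omitted for brevity.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Interface named in Decision 6; the method signatures are assumptions. */
interface FigureStorageService {
    /** Stores PNG bytes and returns the location to persist in figure.image_path. */
    String store(String bookId, String fileName, byte[] pngBytes);

    /** Loads the bytes previously stored at the given location. */
    byte[] load(String imagePath);
}

/** Local-disk implementation; an object-store class could implement the same interface. */
final class LocalFigureStorageService implements FigureStorageService {
    private final Path basePath;

    LocalFigureStorageService(Path basePath) {
        this.basePath = basePath;
    }

    @Override
    public String store(String bookId, String fileName, byte[] pngBytes) {
        try {
            // Creates {basePath}/{bookId}/ on first use; a no-op if it already exists.
            Path dir = Files.createDirectories(basePath.resolve(bookId));
            Path target = dir.resolve(fileName);
            Files.write(target, pngBytes);
            return target.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public byte[] load(String imagePath) {
        try {
            return Files.readAllBytes(Path.of(imagePath));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Callers depend only on `FigureStorageService`, so swapping the backing store would mean adding another implementation rather than touching extraction or retrieval code.
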

**Alternatives considered**:

- S3 from day 1 → operational overhead not justified at POC scale
- Base64 in Postgres JSONB → bloats DB; complicates backup; query performance degrades

---

## Decision 7: Figure Type Classification

**Decision**: Use the enum `FigureType { ANATOMICAL_DIAGRAM, SURGICAL_PHOTOGRAPH, MRI_CT_SCAN, TABLE, CHART, INTRAOPERATIVE_IMAGE }`. Classification proceeds in two steps:

1. Match caption keywords ("MRI", "CT", "Fig.", "Table"); this is a heuristic and needs no model
2. Fall back to `ANATOMICAL_DIAGRAM` if the caption is unclassifiable

**Rationale**: Allows the frontend to render a different icon or label per type (e.g., "MRI" badge). Heuristic classification avoids a separate model call per image at extraction time.

**Alternatives considered**:

- Vision model classification → accurate but adds latency and cost per figure; deferrable
- Single `FIGURE` type → loses citation granularity required by spec FR-004

---

## Decision 8: Metadata Schema for Vector Store Documents

**Decision**: All vector store documents carry a flat `Map<String, Object>` of metadata for Spring AI filtering. Schema:

| Field | Text Chunk | Figure Chunk |
|-------|-----------|-------------|
| `type` | `"TEXT"` | `"FIGURE"` |
| `book_id` | ✓ | ✓ |
| `book_title` | ✓ | ✓ |
| `chapter_id` | ✓ | ✓ |
| `section_id` | ✓ | ✓ |
| `section_title` | ✓ | ✓ |
| `page_start` | ✓ | n/a |
| `page_end` | ✓ | n/a |
| `chunk_index` | ✓ | n/a |
| `total_chunks` | ✓ | n/a |
| `figure_id` | n/a | ✓ |
| `figure_type` | n/a | ✓ |
| `image_path` | n/a | ✓ |
| `label` | n/a | ✓ |
| `page` | n/a | ✓ |

**Rationale**: A flat map is required by Spring AI `FilterExpressionBuilder`. Separation by `type` allows independent filtering in dual search.

---

## Decision 9: Re-embedding Existing Books

**Decision**: Books already processed under feature 001 (text-only) are NOT automatically re-embedded. An explicit re-embed action is exposed via `POST /api/v1/books/{id}/reembed` (admin-triggered). The existing chunks remain valid for text queries until re-embedding completes.
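
One way to keep repeated calls to the re-embed endpoint harmless is to guard the job per book, so a second trigger for the same book is a no-op while the first is still running. The sketch below is purely illustrative: `ReembedGuard` is a hypothetical class that does not appear in the design, and it runs the job synchronously for simplicity, whereas a real endpoint would likely hand the job to a background executor.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical guard behind the re-embed endpoint; not part of the source design. */
final class ReembedGuard {
    // Book ids with a re-embed job currently running.
    private final Set<String> inFlight = ConcurrentHashMap.newKeySet();

    /**
     * Runs the job unless one is already in flight for this book.
     * Returns true if the job ran, false if the call was a no-op.
     */
    boolean runOnce(String bookId, Runnable reembedJob) {
        if (!inFlight.add(bookId)) {
            return false; // duplicate trigger while a job is running
        }
        try {
            reembedJob.run();
            return true;
        } finally {
            // Always release the slot so a failed job can be retried.
            inFlight.remove(bookId);
        }
    }
}
```
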

**Rationale**: Automatic re-embedding on deploy would block the system and risk data loss if the process fails mid-way. An explicit, idempotent trigger is safer and more observable.

---

## Decision 10: Minimum Image Size Threshold

**Decision**: Images smaller than 100×100 pixels are discarded and no chunk is created. This threshold filters out decorative elements (bullets, dividers, publisher logos) without a classification model.

**Rationale**: Neurosurgery textbook diagrams and MRI scans are never smaller than 100×100 px. The threshold is configurable via `app.figure-storage.min-image-size-px` in `application.properties`.

**Alternatives considered**:

- No threshold → decorative icons pollute the figure index
- ML-based classification → accurate but adds model dependency; not needed at POC scale
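
The size check in Decision 10 can be sketched as a small predicate. `ImageSizeFilter` is a hypothetical name, and the sketch assumes "smaller than 100×100" means an image is discarded when either dimension falls below the threshold; the document does not state which interpretation is intended.

```java
/** Hypothetical filter; the threshold would come from app.figure-storage.min-image-size-px. */
final class ImageSizeFilter {
    private final int minSizePx;

    ImageSizeFilter(int minSizePx) {
        if (minSizePx < 0) {
            throw new IllegalArgumentException("minSizePx must be >= 0");
        }
        this.minSizePx = minSizePx;
    }

    /** Assumption: an image is kept only when both width and height reach the threshold. */
    boolean keep(int widthPx, int heightPx) {
        return widthPx >= minSizePx && heightPx >= minSizePx;
    }
}
```

With the default of 100, `new ImageSizeFilter(100).keep(640, 480)` keeps a typical scan, while a 99×400 divider strip is discarded.
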