Research: Enhanced Embedding with Image Parsing and Metadata
Branch: 002-image-aware-embedding | Date: 2026-04-03
This document resolves all technical unknowns identified during planning. The primary source for decisions is the detailed architecture provided directly by the project owner, supplemented by Spring AI 2.0.0-M4 API specifics.
Decision 1: Document Hierarchy Model
Decision: Adopt a four-level hierarchy — BookNode → ChapterNode → SectionNode →
TextChunkNode + FigureNode. The SectionNode is the pivotal unit: it holds the full section
text in Postgres and is used for parent-child context expansion at retrieval time.
Rationale: A flat page-per-document model (current implementation) loses structural context. When a user asks a multi-faceted clinical question, the LLM needs the surrounding section text, not just the matching fragment. Parent-child retrieval — where chunks point to their parent section — is the established pattern for RAG precision. The hierarchy also makes figure-to-section association explicit and queryable.
Alternatives considered:
- Keep flat page model, add metadata only → rejected: insufficient for precise citation and context expansion
- Chapter-level retrieval (coarser than section) → rejected: too much irrelevant context sent to LLM; cost and latency increase
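The parent-child expansion described above can be sketched as follows. This is a minimal illustration, not the project's actual model: only the node names come from the decision, while the record fields and the ParentExpansion helper are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical record shapes; only the node names come from Decision 1.
record TextChunkNode(String chunkId, String sectionId, String text) {}
record FigureNode(String figureId, String sectionId, String caption) {}
record SectionNode(String sectionId, String chapterId, String title, String fullText) {}

final class ParentExpansion {
    private ParentExpansion() {}

    // Swap matched chunks for their parent sections, keeping first-hit order
    // and returning each section only once.
    static List<SectionNode> expand(List<TextChunkNode> hits,
                                    Map<String, SectionNode> sectionsById) {
        Set<String> seen = new LinkedHashSet<>();
        List<SectionNode> sections = new ArrayList<>();
        for (TextChunkNode hit : hits) {
            SectionNode parent = sectionsById.get(hit.sectionId());
            if (parent != null && seen.add(parent.sectionId())) {
                sections.add(parent);
            }
        }
        return sections;
    }
}
```

Two chunks from the same section yield a single SectionNode, so the LLM prompt does not receive the same section text twice.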
Decision 2: Image Extraction Strategy
Decision: Use PDFBox (already on the classpath via spring-ai-pdf-document-reader) to extract
images page by page. Each image is tagged with page, figure_id (derived from the caption, e.g.
"Fig. 12-4"), and the parent sectionId. Images are saved to local disk under
./uploads/figures/{bookId}/ (the default location defined in Decision 6).
Rationale: PDFBox is already present (Spring AI bundles it). No new dependency needed. Per-page extraction ensures every image is captured regardless of PDF structure.
Alternatives considered:
- iText / iText7 → additional commercial dependency; overkill for extraction
- Screenshot each page as PNG, then OCR → far slower; loses vector quality
Decision 3: Figure Content Representation
Decision: Generate a textual description of each extracted image using the OpenAI vision
model (GPT-4o). This description becomes the content field of the figure's vector store
document. The figure caption (parsed from the surrounding text) is also included to maximise
retrieval signal.
Rationale: Caption-only embedding would miss figures with no caption or with sparse labels. Vision-generated descriptions produce richer semantic content (anatomy terms, structural relationships) that matches clinical queries. The OpenAI client already in use supports image inputs; no additional dependency is required.
Alternatives considered:
- Caption-only embedding → insufficient when captions are absent or terse (common in textbooks)
- Local vision model (LLaVA) → requires self-hosting; out of scope for POC
- OCR only → extracts text visible in image but misses non-text visual content (diagrams, MRI)
Decision 4: Dual Vector Search
Decision: At query time, run two parallel similarity searches:
- Text chunk search (filtered by type = "TEXT" and book_id)
- Figure caption search (filtered by type = "FIGURE" and book_id)
Results are merged and deduplicated. The LLM prompt receives the expanded parent section text plus a structured figure reference list.
Rationale: A single search would rank text and figures against each other; figures with
terse captions would systematically lose to text chunks. Separate searches with independent
topK allow tuning each modality independently.
Alternatives considered:
- Single search, filter by relevance score → figure captions score lower than text; figures are systematically under-retrieved
- Post-process text results to look up linked figures only → misses figures that are relevant to the query but not explicitly referenced in the retrieved text chunks
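A minimal sketch of the two parallel searches and the merge step, under stated assumptions: Searcher is a hypothetical stand-in for the vector store call (it is not a Spring AI type), and the Hit record and DualSearch helper are illustrative names only. Each modality keeps its own topK, and the merged list stays grouped by modality rather than being re-ranked across modalities.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Hypothetical hit shape: document id, type ("TEXT" or "FIGURE"), similarity score.
record Hit(String id, String type, double score) {}

// Hypothetical stand-in for a filtered vector store query.
interface Searcher {
    List<Hit> search(String query, String type, String bookId, int topK);
}

final class DualSearch {
    private DualSearch() {}

    static List<Hit> run(Searcher searcher, String query, String bookId,
                         int textTopK, int figureTopK) {
        // Start both searches before waiting on either result.
        CompletableFuture<List<Hit>> text = CompletableFuture.supplyAsync(
            () -> searcher.search(query, "TEXT", bookId, textTopK));
        CompletableFuture<List<Hit>> figures = CompletableFuture.supplyAsync(
            () -> searcher.search(query, "FIGURE", bookId, figureTopK));

        // Merge text hits first, then figure hits; the first copy of each id wins.
        Map<String, Hit> byId = new LinkedHashMap<>();
        for (Hit hit : text.join()) {
            byId.putIfAbsent(hit.id(), hit);
        }
        for (Hit hit : figures.join()) {
            byId.putIfAbsent(hit.id(), hit);
        }
        return new ArrayList<>(byId.values());
    }
}
```

Scores from the two searches are deliberately not compared with each other here, which matches the rationale that figures would otherwise lose to text chunks.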
Decision 5: Chunk-to-Figure Linking
Decision: During text parsing, whenever a pattern matching Fig\.\s+\d+[\-\.]\d+ or
Figure\s+\d+[\-\.]\d+ is found in a chunk, insert a row into the chunk_figure_refs table
linking chunkId → figureId. At retrieval time, after text chunks are retrieved, their
associated figures are fetched from this table and added to the LLM prompt.
Rationale: Explicit linking ensures that when a text chunk is retrieved, its referenced figures are always surfaced — even if the figure's caption did not score highly in the vector search. This is the higher-recall path; dual search (Decision 4) is the higher-precision path.
Alternatives considered:
- Rely entirely on dual vector search → may miss figures referenced in retrieved text but scoring below the topK threshold in the figure search
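The reference detection can be sketched with java.util.regex as below. The pattern escapes the dot after "Fig" so that it matches a literal period. The FigureRefExtractor name and the normalisation of ids to the "12-4" form are assumptions made for illustration; the decision does not specify how figureId values are formatted.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FigureRefExtractor {
    private FigureRefExtractor() {}

    // "Fig." or "Figure", whitespace, then two numbers joined by "-" or ".".
    private static final Pattern FIGURE_REF =
        Pattern.compile("(?:Fig\\.|Figure)\\s+(\\d+)[\\-.](\\d+)");

    // Returns normalised ids (assumed format "12-4") in order of first appearance.
    static Set<String> extract(String chunkText) {
        Set<String> ids = new LinkedHashSet<>();
        Matcher m = FIGURE_REF.matcher(chunkText);
        while (m.find()) {
            ids.add(m.group(1) + "-" + m.group(2));
        }
        return ids;
    }
}
```

Each returned id would become one chunkId → figureId row in chunk_figure_refs; repeated mentions inside the same chunk collapse to a single id.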
Decision 6: Image Storage
Decision: Extracted images are saved as PNG files to a local directory
(${app.figure-storage.base-path}, defaults to ./uploads/figures/{bookId}/). The path is
stored in figure.image_path in Postgres. A FigureStorageService interface wraps all disk
I/O so the implementation can be swapped to S3 or another object store without changing
callers.
Rationale: Local disk is the simplest viable option for a POC with <10 users. The interface boundary satisfies Constitution Principle II (Easy to Change).
Alternatives considered:
- S3 from day 1 → operational overhead not justified at POC scale
- Base64 in Postgres JSONB → bloats DB; complicates backup; query performance degrades
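One possible shape for the storage boundary is sketched below. Only the FigureStorageService name comes from the decision; the store signature, the LocalDiskFigureStorage class, and the {figureId}.png file naming are assumptions.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Interface name is from Decision 6; the method signature is an assumption.
interface FigureStorageService {
    // Persists the PNG bytes and returns the location to record in figure.image_path.
    String store(String bookId, String figureId, byte[] pngBytes);
}

// Local-disk implementation; an S3-backed class could implement the same interface.
final class LocalDiskFigureStorage implements FigureStorageService {
    private final Path basePath;

    LocalDiskFigureStorage(Path basePath) {
        this.basePath = basePath;
    }

    @Override
    public String store(String bookId, String figureId, byte[] pngBytes) {
        try {
            Path dir = basePath.resolve(bookId);
            Files.createDirectories(dir);                    // no-op if it already exists
            Path target = dir.resolve(figureId + ".png");    // assumed naming scheme
            Files.write(target, pngBytes);                   // overwrites on re-extraction
            return target.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Callers depend only on the interface, so replacing local disk with an object store would mean adding another implementation rather than changing the extraction code.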
Decision 7: Figure Type Classification
Decision: Use the enum FigureType { ANATOMICAL_DIAGRAM, SURGICAL_PHOTOGRAPH, MRI_CT_SCAN, TABLE, CHART, INTRAOPERATIVE_IMAGE }. Classification is derived from:
- Caption keywords ("MRI", "CT", "Fig.", "Table") — heuristic, no model needed
- Fall back to ANATOMICAL_DIAGRAM if unclassifiable
Rationale: Allows the frontend to render different icon/label per type (e.g., "MRI" badge). Heuristic classification avoids a separate model call per image at extraction time.
Alternatives considered:
- Vision model classification → accurate but adds latency and cost per figure; deferrable
- Single FIGURE type → loses citation granularity required by spec FR-004
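A minimal sketch of the caption heuristic follows. The enum values and the "MRI", "CT", and "Table" keywords come from the decision; the remaining keywords ("intraoperative", "photograph", "chart", "graph"), the rule order, and the FigureTypeClassifier name are assumptions chosen for illustration.

```java
import java.util.Locale;
import java.util.regex.Pattern;

enum FigureType {
    ANATOMICAL_DIAGRAM, SURGICAL_PHOTOGRAPH, MRI_CT_SCAN, TABLE, CHART, INTRAOPERATIVE_IMAGE
}

final class FigureTypeClassifier {
    private FigureTypeClassifier() {}

    // Whole-word match so that "ct" inside words such as "structure" is ignored.
    private static final Pattern SCAN = Pattern.compile("\\b(mri|ct)\\b");

    static FigureType classify(String caption) {
        if (caption == null || caption.isBlank()) {
            return FigureType.ANATOMICAL_DIAGRAM;            // documented fallback
        }
        String c = caption.toLowerCase(Locale.ROOT).strip();
        if (SCAN.matcher(c).find()) return FigureType.MRI_CT_SCAN;
        if (c.startsWith("table")) return FigureType.TABLE;
        // Keywords below are assumptions, not part of the decision.
        if (c.contains("intraoperative")) return FigureType.INTRAOPERATIVE_IMAGE;
        if (c.contains("photograph")) return FigureType.SURGICAL_PHOTOGRAPH;
        if (c.contains("chart") || c.contains("graph")) return FigureType.CHART;
        return FigureType.ANATOMICAL_DIAGRAM;                // documented fallback
    }
}
```

Because the rules are evaluated in order, a caption such as "Intraoperative photograph ..." resolves to INTRAOPERATIVE_IMAGE; a different priority would only require reordering the checks.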
Decision 8: Metadata Schema for Vector Store Documents
Decision: All vector store documents carry a flat Map<String, Object> metadata for Spring
AI filtering. Schema:
| Field | Text Chunk | Figure Chunk |
|---|---|---|
| type | "TEXT" | "FIGURE" |
| book_id | ✓ | ✓ |
| book_title | ✓ | ✓ |
| chapter_id | ✓ | ✓ |
| section_id | ✓ | ✓ |
| section_title | ✓ | ✓ |
| page_start | ✓ | - |
| page_end | ✓ | - |
| chunk_index | ✓ | - |
| total_chunks | ✓ | - |
| figure_id | - | ✓ |
| figure_type | - | ✓ |
| image_path | - | ✓ |
| label | - | ✓ |
| page | - | ✓ |
Rationale: Flat map is required by Spring AI FilterExpressionBuilder. Separation by type
allows independent filtering in dual search.
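The schema can be sketched as two small builders that return flat maps. The keys mirror the schema above; the ChunkMetadata class, its method names, and the Java types chosen for each value are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ChunkMetadata {
    private ChunkMetadata() {}

    // Fields shared by text and figure documents.
    private static Map<String, Object> common(String type, String bookId, String bookTitle,
                                              String chapterId, String sectionId,
                                              String sectionTitle) {
        Map<String, Object> m = new LinkedHashMap<>();
        m.put("type", type);
        m.put("book_id", bookId);
        m.put("book_title", bookTitle);
        m.put("chapter_id", chapterId);
        m.put("section_id", sectionId);
        m.put("section_title", sectionTitle);
        return m;
    }

    static Map<String, Object> forTextChunk(String bookId, String bookTitle, String chapterId,
                                            String sectionId, String sectionTitle,
                                            int pageStart, int pageEnd,
                                            int chunkIndex, int totalChunks) {
        Map<String, Object> m = common("TEXT", bookId, bookTitle, chapterId, sectionId, sectionTitle);
        m.put("page_start", pageStart);
        m.put("page_end", pageEnd);
        m.put("chunk_index", chunkIndex);
        m.put("total_chunks", totalChunks);
        return m;
    }

    static Map<String, Object> forFigure(String bookId, String bookTitle, String chapterId,
                                         String sectionId, String sectionTitle,
                                         String figureId, String figureType,
                                         String imagePath, String label, int page) {
        Map<String, Object> m = common("FIGURE", bookId, bookTitle, chapterId, sectionId, sectionTitle);
        m.put("figure_id", figureId);
        m.put("figure_type", figureType);   // enum name stored as a plain string
        m.put("image_path", imagePath);
        m.put("label", label);
        m.put("page", page);
        return m;
    }
}
```

Every value is a string or an integer with no nested objects, so each key can be addressed directly in a metadata filter.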
Decision 9: Re-embedding Existing Books
Decision: Books already processed under feature 001 (text-only) are NOT automatically
re-embedded. An explicit re-embed action is exposed via POST /api/v1/books/{id}/reembed
(admin-triggered). The existing chunks remain valid for text queries until re-embedding completes.
Rationale: Automatic re-embedding on deploy would block the system and risk data loss if the process fails mid-way. An explicit, idempotent trigger is safer and more observable.
Decision 10: Minimum Image Size Threshold
Decision: Images smaller than 100×100 pixels are discarded and no chunk is created. This threshold filters out decorative elements (bullets, dividers, publisher logos) without a classification model.
Rationale: Neurosurgery textbook diagrams and MRI scans are never smaller than 100×100 px.
The threshold is configurable via app.figure-storage.min-image-size-px in
application.properties.
Alternatives considered:
- No threshold → decorative icons pollute the figure index
- ML-based classification → accurate but adds model dependency; not needed at POC scale
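The size check itself is small; a sketch is shown below. It assumes an image is discarded when either dimension falls below the configured minimum, which is one reading of "smaller than 100×100 pixels". The ImageSizeFilter class is hypothetical, and its constructor argument would be bound from app.figure-storage.min-image-size-px.

```java
final class ImageSizeFilter {
    private final int minSizePx;

    // minSizePx would come from app.figure-storage.min-image-size-px (100 in Decision 10).
    ImageSizeFilter(int minSizePx) {
        if (minSizePx < 0) {
            throw new IllegalArgumentException("minSizePx must be non-negative");
        }
        this.minSizePx = minSizePx;
    }

    // Assumption: both width and height must reach the minimum for the image to be kept.
    boolean keep(int widthPx, int heightPx) {
        return widthPx >= minSizePx && heightPx >= minSizePx;
    }
}
```

Under this reading a 400×40 divider is discarded even though its area exceeds 100×100; an area-based rule would keep it, so the choice is worth confirming against real extraction output.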