Research: Enhanced Embedding with Image Parsing and Metadata

Branch: 002-image-aware-embedding | Date: 2026-04-03

This document resolves all technical unknowns identified during planning. The primary source for decisions is the detailed architecture provided directly by the project owner, supplemented by Spring AI 2.0.0-M4 API specifics.


Decision 1: Document Hierarchy Model

Decision: Adopt a four-level hierarchy: BookNode → ChapterNode → SectionNode → TextChunkNode + FigureNode. The SectionNode is the pivotal unit: it holds the full section text in Postgres and is used for parent-child context expansion at retrieval time.

Rationale: A flat page-per-document model (current implementation) loses structural context. When a user asks a multi-faceted clinical question, the LLM needs the surrounding section text, not just the matching fragment. Parent-child retrieval — where chunks point to their parent section — is the established pattern for RAG precision. The hierarchy also makes figure-to-section association explicit and queryable.
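A minimal sketch of the parent-child expansion step, assuming chunk hits carry the id of their parent section. The record fields and the ParentExpansion helper are illustrative names, not the project's actual schema:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical record shapes; field names are illustrative only.
record SectionNode(String sectionId, String chapterId, String title, String fullText) {}

record TextChunkNode(String chunkId, String sectionId, int chunkIndex, String text) {}

final class ParentExpansion {
    /**
     * Replaces each matched chunk with its parent section, keeping first-hit
     * order and returning each section at most once.
     */
    static List<SectionNode> expand(List<TextChunkNode> hits,
                                    Map<String, SectionNode> sectionsById) {
        Map<String, SectionNode> ordered = new LinkedHashMap<>();
        for (TextChunkNode hit : hits) {
            SectionNode parent = sectionsById.get(hit.sectionId());
            if (parent != null) {
                // Several chunks of one section collapse into a single entry.
                ordered.putIfAbsent(parent.sectionId(), parent);
            }
        }
        return List.copyOf(ordered.values());
    }
}
```

Collapsing sibling chunks into one section keeps the prompt from repeating the same section text when several of its chunks match.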

Alternatives considered:

  • Keep flat page model, add metadata only → rejected: insufficient for precise citation and context expansion
  • Chapter-level retrieval (coarser than section) → rejected: too much irrelevant context sent to LLM; cost and latency increase

Decision 2: Image Extraction Strategy

Decision: Use PDFBox (already on classpath via spring-ai-pdf-document-reader) to extract images per page. Each image is tagged with page, figure_id (derived from caption, e.g. "Fig. 12-4"), and the parent sectionId. Images are saved to local disk under ./uploads/figures/{bookId}/ by default (see Decision 6).

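Rationale: PDFBox is already present (Spring AI bundles it). No new dependency needed. Per-page extraction captures every embedded raster image regardless of how the PDF is structured; figures drawn purely as vector graphics are not stored as image objects and would not be picked up this way.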

Alternatives considered:

  • iText / iText7 → additional commercial dependency; overkill for extraction
  • Screenshot each page as PNG, then OCR → far slower; loses vector quality

Decision 3: Figure Content Representation

Decision: Generate a textual description of each extracted image using the OpenAI vision model (GPT-4o). This description becomes the content field of the figure's vector store document. The figure caption (parsed from the surrounding text) is also included to maximise retrieval signal.

Rationale: Caption-only embedding would miss figures with no caption or with sparse labels. Vision-generated descriptions produce richer semantic content (anatomy terms, structural relationships) that matches clinical queries. The OpenAI client already in use supports image inputs; no additional dependency is required.

Alternatives considered:

  • Caption-only embedding → insufficient when captions are absent or terse (common in textbooks)
  • Local vision model (LLaVA) → requires self-hosting; out of scope for POC
  • OCR only → extracts text visible in image but misses non-text visual content (diagrams, MRI)

Decision 4: Dual Vector Search

Decision: At query time, run two parallel similarity searches:

  1. Text chunk search (filtered by type = "TEXT" and book_id)
  2. Figure caption search (filtered by type = "FIGURE" and book_id)

Results are merged and deduplicated. The LLM prompt receives the expanded parent section text plus a structured figure reference list.

Rationale: A single search would rank text and figures against each other; figures with terse captions would systematically lose to text chunks. Separate searches with independent topK allow tuning each modality independently.
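The two-search flow could be sketched as below. The Hit record and the two Supplier arguments are hypothetical stand-ins for the vector store calls; each would apply its own type/book_id filter and its own topK:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Hypothetical hit type: id is the vector store document id.
record Hit(String id, String type, double score) {}

final class DualSearch {
    /** Runs both searches concurrently, then merges and deduplicates by id. */
    static List<Hit> run(Supplier<List<Hit>> textSearch,
                         Supplier<List<Hit>> figureSearch) {
        CompletableFuture<List<Hit>> text = CompletableFuture.supplyAsync(textSearch);
        CompletableFuture<List<Hit>> figures = CompletableFuture.supplyAsync(figureSearch);

        // LinkedHashMap keeps text hits first, then figure hits, in rank order.
        Map<String, Hit> merged = new LinkedHashMap<>();
        for (Hit h : text.join()) merged.putIfAbsent(h.id(), h);
        for (Hit h : figures.join()) merged.putIfAbsent(h.id(), h);
        return new ArrayList<>(merged.values());
    }
}
```

The merge deliberately does not re-sort by score, since the point of separate searches is that text and figure scores are not compared against each other.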

Alternatives considered:

  • Single search, filter by relevance score → figure captions score lower than text; figures are systematically under-retrieved
  • Post-process text results to look up linked figures only → misses figures that are relevant to the query but not explicitly referenced in the retrieved text chunks

Decision 5: Chunk-to-Figure Linking

Decision: During text parsing, whenever a chunk contains a match for Fig\.\s+\d+[-.]\d+ or Figure\s+\d+[-.]\d+, insert a row into the chunk_figure_refs table linking chunkId → figureId. At retrieval time, after text chunks are retrieved, their associated figures are fetched from this table and added to the LLM prompt.

Rationale: Explicit linking ensures that when a text chunk is retrieved, its referenced figures are always surfaced — even if the figure's caption did not score highly in the vector search. This is the higher-recall path; dual search (Decision 4) is the higher-precision path.
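A small sketch of the reference scan, assuming both spellings are folded into one pattern and that labels are normalized to the "12-4" form (the normalization and the FigureRefs class name are assumptions). The literal dot after "Fig" is escaped so it matches only a period:

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FigureRefs {
    // "Fig." or "Figure", whitespace, then two numbers joined by '-' or '.'.
    private static final Pattern REF =
            Pattern.compile("\\b(?:Fig\\.|Figure)\\s+(\\d+)[-.](\\d+)");

    /** Returns the distinct figure labels referenced in a chunk, in order of appearance. */
    static Set<String> find(String chunkText) {
        Set<String> labels = new LinkedHashSet<>();
        Matcher m = REF.matcher(chunkText);
        while (m.find()) {
            // Normalize "3.2" and "3-2" to the same label (assumed convention).
            labels.add(m.group(1) + "-" + m.group(2));
        }
        return labels;
    }
}
```

Each returned label would then be resolved to a figureId and written to chunk_figure_refs alongside the chunkId.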

Alternatives considered:

  • Rely entirely on dual vector search → may miss figures referenced in retrieved text but scoring below the topK threshold in the figure search

Decision 6: Image Storage

Decision: Extracted images are saved as PNG files to a local directory (${app.figure-storage.base-path}, defaults to ./uploads/figures/{bookId}/). The path is stored in figure.image_path in Postgres. A FigureStorageService interface wraps all disk I/O so the implementation can be swapped to S3 or another object store without changing callers.

Rationale: Local disk is the simplest viable option for a POC with <10 users. The interface boundary satisfies Constitution Principle II (Easy to Change).
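One possible shape for the storage boundary, as a sketch only. The interface name comes from the decision above; the method names, the LocalDiskFigureStorage class, and the choice to return a path relative to the base directory are assumptions:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Method names are illustrative; only the interface name is from the decision.
interface FigureStorageService {
    /** Stores the PNG bytes and returns the path to record in figure.image_path. */
    String save(String bookId, String fileName, byte[] pngBytes);

    byte[] load(String imagePath);
}

final class LocalDiskFigureStorage implements FigureStorageService {
    private final Path basePath;

    LocalDiskFigureStorage(Path basePath) {
        this.basePath = basePath.toAbsolutePath().normalize();
    }

    @Override
    public String save(String bookId, String fileName, byte[] pngBytes) {
        Path target = resolveInsideBase(Path.of(bookId, fileName).toString());
        try {
            Files.createDirectories(target.getParent());
            Files.write(target, pngBytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return basePath.relativize(target).toString();
    }

    @Override
    public byte[] load(String imagePath) {
        try {
            return Files.readAllBytes(resolveInsideBase(imagePath));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Rejects paths such as "../x.png" that would land outside the base directory.
    private Path resolveInsideBase(String relative) {
        Path resolved = basePath.resolve(relative).normalize();
        if (!resolved.startsWith(basePath)) {
            throw new IllegalArgumentException("Path escapes storage root: " + relative);
        }
        return resolved;
    }
}
```

Callers depend only on FigureStorageService, so an object-store implementation could replace LocalDiskFigureStorage without touching them.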

Alternatives considered:

  • S3 from day 1 → operational overhead not justified at POC scale
  • Base64 in Postgres JSONB → bloats DB; complicates backup; query performance degrades

Decision 7: Figure Type Classification

Decision: Use the enum FigureType { ANATOMICAL_DIAGRAM, SURGICAL_PHOTOGRAPH, MRI_CT_SCAN, TABLE, CHART, INTRAOPERATIVE_IMAGE }. Classification is derived from:

  1. Caption keywords ("MRI", "CT", "Fig.", "Table") — heuristic, no model needed
  2. Fall back to ANATOMICAL_DIAGRAM if unclassifiable

Rationale: Allows the frontend to render different icon/label per type (e.g., "MRI" badge). Heuristic classification avoids a separate model call per image at extraction time.
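A sketch of the keyword heuristic, using only the "MRI", "CT" and "Table" keywords named above. The rule order, the requirement that "Table" opens the caption, and the FigureTypeClassifier name are assumptions; the remaining enum values would need keywords the decision does not list:

```java
import java.util.regex.Pattern;

enum FigureType {
    ANATOMICAL_DIAGRAM, SURGICAL_PHOTOGRAPH, MRI_CT_SCAN, TABLE, CHART, INTRAOPERATIVE_IMAGE
}

final class FigureTypeClassifier {
    // Case-sensitive whole-word match, so "CT" does not fire inside "DIRECT" or "fracture".
    private static final Pattern SCAN_KEYWORDS = Pattern.compile("\\b(MRI|CT)\\b");
    // Assumed rule: a caption that opens with "Table" labels a table.
    private static final Pattern TABLE_PREFIX =
            Pattern.compile("^\\s*Table\\b", Pattern.CASE_INSENSITIVE);

    static FigureType classify(String caption) {
        if (caption == null || caption.isBlank()) {
            return FigureType.ANATOMICAL_DIAGRAM; // fallback for missing captions
        }
        if (TABLE_PREFIX.matcher(caption).find()) {
            return FigureType.TABLE;
        }
        if (SCAN_KEYWORDS.matcher(caption).find()) {
            return FigureType.MRI_CT_SCAN;
        }
        return FigureType.ANATOMICAL_DIAGRAM; // fallback when no keyword matches
    }
}
```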

Alternatives considered:

  • Vision model classification → accurate but adds latency and cost per figure; deferrable
  • Single FIGURE type → loses citation granularity required by spec FR-004

Decision 8: Metadata Schema for Vector Store Documents

Decision: All vector store documents carry a flat Map<String, Object> metadata for Spring AI filtering. Schema:

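| Field | Text Chunk | Figure Chunk |
| --- | --- | --- |
| type | "TEXT" | "FIGURE" |
| book_id | | |
| book_title | | |
| chapter_id | | |
| section_id | | |
| section_title | | |
| page_start | | |
| page_end | | |
| chunk_index | | |
| total_chunks | | |
| figure_id | | |
| figure_type | | |
| image_path | | |
| label | | |
| page | | |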

Rationale: Flat map is required by Spring AI FilterExpressionBuilder. Separation by type allows independent filtering in dual search.


Decision 9: Re-embedding Existing Books

Decision: Books already processed under feature 001 (text-only) are NOT automatically re-embedded. An explicit re-embed action is exposed via POST /api/v1/books/{id}/reembed (admin-triggered). The existing chunks remain valid for text queries until re-embedding completes.

Rationale: Automatic re-embedding on deploy would block the system and risk data loss if the process fails mid-way. An explicit, idempotent trigger is safer and more observable.


Decision 10: Minimum Image Size Threshold

Decision: Images smaller than 100×100 pixels are discarded and no chunk is created. This threshold filters out decorative elements (bullets, dividers, publisher logos) without a classification model.

Rationale: Neurosurgery textbook diagrams and MRI scans are never smaller than 100×100 px. The threshold is configurable via app.figure-storage.min-image-size-px in application.properties.
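The threshold check might look like the sketch below. It assumes "smaller than 100×100" means that either dimension falls below the configured minimum; the ImageSizeFilter name is illustrative:

```java
final class ImageSizeFilter {
    private final int minSizePx;

    // minSizePx would come from app.figure-storage.min-image-size-px.
    ImageSizeFilter(int minSizePx) {
        this.minSizePx = minSizePx;
    }

    /** Keeps an image only if both width and height reach the threshold. */
    boolean keep(int widthPx, int heightPx) {
        return widthPx >= minSizePx && heightPx >= minSizePx;
    }
}
```

Checking the dimensions before decoding or describing an image avoids paying a vision-model call for bullets and logos.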

Alternatives considered:

  • No threshold → decorative icons pollute the figure index
  • ML-based classification → accurate but adds model dependency; not needed at POC scale