ai-teacher/specs/002-image-aware-embedding/research.md

# Research: Enhanced Embedding with Image Parsing and Metadata

**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-03

This document resolves all technical unknowns identified during planning. The primary source for
decisions is the detailed architecture provided directly by the project owner, supplemented by
Spring AI 2.0.0-M4 API specifics.

---

## Decision 1: Document Hierarchy Model

**Decision**: Adopt a four-level hierarchy — `BookNode` → `ChapterNode` → `SectionNode` →
`TextChunkNode` + `FigureNode`. The `SectionNode` is the pivotal unit: it holds the full section
text in Postgres and is used for parent-child context expansion at retrieval time.

**Rationale**: A flat page-per-document model (the current implementation) loses structural context.
When a user asks a multi-faceted clinical question, the LLM needs the surrounding section text,
not just the matching fragment. Parent-child retrieval — where chunks point to their parent
section — is the established pattern for RAG precision. The hierarchy also makes figure-to-section
association explicit and queryable.
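
A minimal sketch of how the hierarchy and parent-child expansion could be modelled. Only the five type names come from this decision; every field name and the `ParentExpansion` helper are hypothetical, and persistence is left out.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical field layout; only the type names are taken from the decision.
record BookNode(String id, String title) {}
record ChapterNode(String id, String bookId, String title) {}
record SectionNode(String id, String chapterId, String title, String fullText) {}
record TextChunkNode(String id, String sectionId, int chunkIndex, String text) {}
record FigureNode(String id, String sectionId, int page, String caption) {}

final class ParentExpansion {
    /** Maps retrieved chunks to their distinct parent sections, keeping first-hit order. */
    static List<SectionNode> expand(List<TextChunkNode> hits,
                                    Map<String, SectionNode> sectionsById) {
        Map<String, SectionNode> parents = new LinkedHashMap<>();
        for (TextChunkNode hit : hits) {
            SectionNode parent = sectionsById.get(hit.sectionId());
            if (parent != null) {
                // Several chunks of one section collapse into a single parent entry.
                parents.putIfAbsent(parent.id(), parent);
            }
        }
        return List.copyOf(parents.values());
    }
}
```

Collapsing chunks onto their section means the LLM receives each section's full text once, even when several of its chunks match the query.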

**Alternatives considered**:
- Keep flat page model, add metadata only → rejected: insufficient for precise citation and
  context expansion
- Chapter-level retrieval (coarser than section) → rejected: too much irrelevant context sent
  to LLM; cost and latency increase

---

## Decision 2: Image Extraction Strategy

**Decision**: Use PDFBox (already on classpath via `spring-ai-pdf-document-reader`) to extract
images per page. Each image is tagged with `page`, `figure_id` (derived from caption, e.g.
"Fig. 12-4"), and the parent `sectionId`. Images are saved to local disk under
`./uploads/figures/{bookId}/` (see Decision 6).

**Rationale**: PDFBox is already on the classpath as a transitive dependency of
`spring-ai-pdf-document-reader`, so no new dependency is needed.
Per-page extraction ensures every image is captured regardless of PDF structure.

**Alternatives considered**:
- iText / iText7 → additional commercial dependency; overkill for extraction
- Screenshot each page as PNG, then OCR → far slower; loses vector quality

---

## Decision 3: Figure Content Representation

**Decision**: Generate a textual description of each extracted image using the OpenAI vision
model (GPT-4o). This description becomes the `content` field of the figure's vector store
document. The figure caption (parsed from the surrounding text) is also included to maximise
retrieval signal.

**Rationale**: Caption-only embedding would miss figures with no caption or with sparse labels.
Vision-generated descriptions produce richer semantic content (anatomy terms, structural
relationships) that matches clinical queries. The OpenAI client already in use supports image
inputs; no additional dependency is required.
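
As an illustration of how the caption and the vision description might be combined into one `content` string, the sketch below hides the vision call behind a hypothetical `ImageDescriber` interface. The `Caption:` / `Description:` layout is an assumption, not something this decision prescribes.

```java
import java.util.Optional;

final class FigureContent {
    /** Hypothetical seam for the vision call; a real implementation would call the vision model. */
    interface ImageDescriber {
        String describe(byte[] pngBytes);
    }

    /** Builds the text to embed: the caption (when present) followed by the generated description. */
    static String build(Optional<String> caption, byte[] pngBytes, ImageDescriber describer) {
        String description = describer.describe(pngBytes).strip();
        return caption
            .map(String::strip)
            .filter(c -> !c.isEmpty())
            .map(c -> "Caption: " + c + "\nDescription: " + description)
            // Figures without a caption are still indexed from the description alone.
            .orElse("Description: " + description);
    }
}
```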

**Alternatives considered**:
- Caption-only embedding → insufficient when captions are absent or terse (common in textbooks)
- Local vision model (LLaVA) → requires self-hosting; out of scope for POC
- OCR only → extracts text visible in image but misses non-text visual content (diagrams, MRI)

---

## Decision 4: Dual Vector Search

**Decision**: At query time, run two parallel similarity searches:
1. Text chunk search (filtered by `type = "TEXT"` and `book_id`)
2. Figure search over the description and caption text from Decision 3 (filtered by `type = "FIGURE"` and `book_id`)

Results are merged and deduplicated. The LLM prompt receives the expanded parent section text
plus a structured figure reference list.

**Rationale**: A single search would rank text and figures against each other; figures with
terse captions would systematically lose to text chunks. Separate searches with independent
`topK` allow tuning each modality independently.
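
The two-search flow could be sketched as follows. The `Searcher` interface is a hypothetical stand-in for the vector store call with its `type` and `book_id` filters, and `Hit` is an illustrative result type; neither is the Spring AI API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

record Hit(String id, String type, double score) {}

final class DualSearch {
    /** Hypothetical seam over the vector store: returns ranked hits for one modality. */
    interface Searcher {
        List<Hit> search(String query, String type, String bookId, int topK);
    }

    /** Runs both searches concurrently, then merges them, dropping duplicate IDs. */
    static List<Hit> run(Searcher searcher, String query, String bookId,
                         int textTopK, int figureTopK) {
        CompletableFuture<List<Hit>> text = CompletableFuture.supplyAsync(
            () -> searcher.search(query, "TEXT", bookId, textTopK));
        CompletableFuture<List<Hit>> figures = CompletableFuture.supplyAsync(
            () -> searcher.search(query, "FIGURE", bookId, figureTopK));

        Map<String, Hit> merged = new LinkedHashMap<>();
        for (Hit hit : text.join()) {
            merged.putIfAbsent(hit.id(), hit);
        }
        for (Hit hit : figures.join()) {
            merged.putIfAbsent(hit.id(), hit);
        }
        return new ArrayList<>(merged.values());
    }
}
```

Because each modality has its own `topK`, figures are never ranked against text chunks; they only compete with other figures.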

**Alternatives considered**:
- Single search, filter by relevance score → figure captions score lower than text; figures
  are systematically under-retrieved
- Post-process text results to look up linked figures only → misses figures that are relevant
  to the query but not explicitly referenced in the retrieved text chunks

---

## Decision 5: Chunk-to-Figure Linking

**Decision**: During text parsing, whenever a pattern matching `Fig\.\s+\d+[\-\.]\d+` or
`Figure\s+\d+[\-\.]\d+` is found in a chunk, insert a row into the `chunk_figure_refs` table
linking `chunkId` → `figureId`. At retrieval time, after text chunks are retrieved, their
associated figures are fetched from this table and added to the LLM prompt.

**Rationale**: Explicit linking ensures that when a text chunk is retrieved, its referenced
figures are always surfaced — even if the figure's caption did not score highly in the vector
search. This is the higher-recall path; dual search (Decision 4) is the higher-precision path.
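
A sketch of the reference scan, assuming figure IDs are normalised to the `Fig. 12-4` form used as the example in Decision 2. `ChunkFigureRef` is a hypothetical in-memory stand-in for a `chunk_figure_refs` row.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for one row of chunk_figure_refs.
record ChunkFigureRef(String chunkId, String figureId) {}

final class FigureRefExtractor {
    // "Fig." or "Figure", whitespace, then two numbers separated by '-' or '.'.
    private static final Pattern REF =
        Pattern.compile("(?:Fig\\.|Figure)\\s+(\\d+)[\\-.](\\d+)");

    /** Returns one link per distinct figure referenced in the chunk text. */
    static Set<ChunkFigureRef> extract(String chunkId, String chunkText) {
        Set<ChunkFigureRef> refs = new LinkedHashSet<>();
        Matcher m = REF.matcher(chunkText);
        while (m.find()) {
            // Assumed canonical ID form, so "Figure 12.5" and "Fig. 12-5" map to the same figure.
            refs.add(new ChunkFigureRef(chunkId, "Fig. " + m.group(1) + "-" + m.group(2)));
        }
        return refs;
    }
}
```

Normalising before insert keeps repeated or differently formatted mentions of one figure from producing duplicate rows.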

**Alternatives considered**:
- Rely entirely on dual vector search → may miss figures referenced in retrieved text but
  scoring below the topK threshold in the figure search

---

## Decision 6: Image Storage

**Decision**: Extracted images are saved as PNG files to a local directory
(`${app.figure-storage.base-path}`, defaults to `./uploads/figures/{bookId}/`). The path is
stored in `figure.image_path` in Postgres. A `FigureStorageService` interface wraps all disk
I/O so the implementation can be swapped to S3 or another object store without changing
callers.

**Rationale**: Local disk is the simplest viable option for a POC with <10 users. The interface
boundary satisfies Constitution Principle II (Easy to Change).
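
One possible shape for the storage boundary is sketched below. The `save` and `load` method names and signatures are assumptions (only the `FigureStorageService` name comes from this decision), and the sketch assumes the base path is the directory that contains the per-book folders.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical method set; an S3-backed class could implement the same interface.
interface FigureStorageService {
    /** Stores the PNG and returns the value to persist in figure.image_path. */
    String save(String bookId, String fileName, byte[] pngBytes);

    byte[] load(String imagePath);
}

final class LocalFigureStorageService implements FigureStorageService {
    private final Path basePath;

    LocalFigureStorageService(Path basePath) {
        this.basePath = basePath;
    }

    @Override
    public String save(String bookId, String fileName, byte[] pngBytes) {
        try {
            // One subdirectory per book, created on first use.
            Path dir = Files.createDirectories(basePath.resolve(bookId));
            Path target = dir.resolve(fileName);
            Files.write(target, pngBytes);
            return target.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public byte[] load(String imagePath) {
        try {
            return Files.readAllBytes(Path.of(imagePath));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Callers depend only on the interface, so swapping the implementation changes what `save` returns (a key or URL instead of a file path) without touching extraction code. A production version would also validate `bookId` and `fileName` against path traversal.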

**Alternatives considered**:
- S3 from day 1 → operational overhead not justified at POC scale
- Base64 in Postgres JSONB → bloats DB; complicates backup; query performance degrades

---

## Decision 7: Figure Type Classification

**Decision**: Use the enum `FigureType { ANATOMICAL_DIAGRAM, SURGICAL_PHOTOGRAPH, MRI_CT_SCAN,
TABLE, CHART, INTRAOPERATIVE_IMAGE }`. Classification is derived from:
1. Caption keywords ("MRI", "CT", "Fig.", "Table") — heuristic, no model needed
2. Fall back to `ANATOMICAL_DIAGRAM` if unclassifiable

**Rationale**: Allows the frontend to render a different icon or label per type (e.g., an "MRI" badge).
Heuristic classification avoids a separate model call per image at extraction time.
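
The heuristic could look roughly like this. Only the keywords that map unambiguously to a type ("MRI", "CT", "Table") are implemented; the word-boundary matching and the rule that "Table" must open the caption are assumptions, and the remaining enum values would need their own keywords.

```java
import java.util.regex.Pattern;

enum FigureType {
    ANATOMICAL_DIAGRAM, SURGICAL_PHOTOGRAPH, MRI_CT_SCAN, TABLE, CHART, INTRAOPERATIVE_IMAGE
}

final class FigureTypeClassifier {
    // Case-sensitive whole words, so "direct" or "CTA" do not count as "CT".
    private static final Pattern SCAN = Pattern.compile("\\b(MRI|CT)\\b");
    // Assumption: table captions start with the word "Table".
    private static final Pattern TABLE =
        Pattern.compile("^\\s*Table\\b", Pattern.CASE_INSENSITIVE);

    /** Classifies by caption keywords; anything unrecognised falls back to ANATOMICAL_DIAGRAM. */
    static FigureType classify(String caption) {
        if (caption == null || caption.isBlank()) {
            return FigureType.ANATOMICAL_DIAGRAM;
        }
        if (TABLE.matcher(caption).find()) {
            return FigureType.TABLE;
        }
        if (SCAN.matcher(caption).find()) {
            return FigureType.MRI_CT_SCAN;
        }
        return FigureType.ANATOMICAL_DIAGRAM;
    }
}
```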

**Alternatives considered**:
- Vision model classification → accurate but adds latency and cost per figure; deferrable
- Single `FIGURE` type → loses citation granularity required by spec FR-004

---

## Decision 8: Metadata Schema for Vector Store Documents

**Decision**: All vector store documents carry a flat `Map<String, Object>` metadata for Spring
AI filtering. Schema:

| Field | Text Chunk | Figure Chunk |
|-------|-----------|-------------|
| `type` | `"TEXT"` | `"FIGURE"` |
| `book_id` | ✓ | ✓ |
| `book_title` | ✓ | ✓ |
| `chapter_id` | ✓ | ✓ |
| `section_id` | ✓ | ✓ |
| `section_title` | ✓ | ✓ |
| `page_start` | ✓ | — |
| `page_end` | ✓ | — |
| `chunk_index` | ✓ | — |
| `total_chunks` | ✓ | — |
| `figure_id` | — | ✓ |
| `figure_type` | — | ✓ |
| `image_path` | — | ✓ |
| `label` | — | ✓ |
| `page` | — | ✓ |

**Rationale**: A flat map is required by Spring AI's `FilterExpressionBuilder`, whose expressions
reference metadata keys by name. Separation by `type` allows independent filtering in dual search.
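
To make the schema concrete, the sketch below builds both metadata maps from the keys in the table. `SectionRef` and `ChunkMetadata` are hypothetical helper names, and the value types (strings and ints) are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical carrier for the fields shared by text and figure documents.
record SectionRef(String bookId, String bookTitle, String chapterId,
                  String sectionId, String sectionTitle) {}

final class ChunkMetadata {
    private static Map<String, Object> shared(String type, SectionRef ref) {
        Map<String, Object> m = new LinkedHashMap<>();
        m.put("type", type);
        m.put("book_id", ref.bookId());
        m.put("book_title", ref.bookTitle());
        m.put("chapter_id", ref.chapterId());
        m.put("section_id", ref.sectionId());
        m.put("section_title", ref.sectionTitle());
        return m;
    }

    static Map<String, Object> text(SectionRef ref, int pageStart, int pageEnd,
                                    int chunkIndex, int totalChunks) {
        Map<String, Object> m = shared("TEXT", ref);
        m.put("page_start", pageStart);
        m.put("page_end", pageEnd);
        m.put("chunk_index", chunkIndex);
        m.put("total_chunks", totalChunks);
        return m;
    }

    static Map<String, Object> figure(SectionRef ref, String figureId, String figureType,
                                      String imagePath, String label, int page) {
        Map<String, Object> m = shared("FIGURE", ref);
        m.put("figure_id", figureId);
        m.put("figure_type", figureType);
        m.put("image_path", imagePath);
        m.put("label", label);
        m.put("page", page);
        return m;
    }
}
```

Every value is a scalar, so each key can be used directly in a metadata filter.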

---

## Decision 9: Re-embedding Existing Books

**Decision**: Books already processed under feature 001 (text-only) are NOT automatically
re-embedded. An explicit re-embed action is exposed via `POST /api/v1/books/{id}/reembed`
(admin-triggered). The existing chunks remain valid for text queries until re-embedding completes.

**Rationale**: Automatic re-embedding on deploy would block the system and risk data loss if
the process fails mid-way. An explicit, idempotent trigger is safer and more observable.

---

## Decision 10: Minimum Image Size Threshold

**Decision**: Images smaller than 100×100 pixels are discarded and no chunk is created. This
threshold filters out decorative elements (bullets, dividers, publisher logos) without a
classification model.

**Rationale**: Neurosurgery textbook diagrams and MRI scans are never smaller than 100×100 px.
The threshold is configurable via `app.figure-storage.min-image-size-px` in
`application.properties`.
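
The threshold check itself is small. This sketch assumes an image is discarded when either dimension falls below the configured minimum, which is one reading of "smaller than 100×100"; `ImageSizeFilter` is a hypothetical name.

```java
final class ImageSizeFilter {
    private final int minSizePx;

    /** @param minSizePx value of app.figure-storage.min-image-size-px, e.g. 100 */
    ImageSizeFilter(int minSizePx) {
        if (minSizePx < 0) {
            throw new IllegalArgumentException("minSizePx must be non-negative");
        }
        this.minSizePx = minSizePx;
    }

    /** Keeps an image only when both width and height reach the minimum. */
    boolean keep(int widthPx, int heightPx) {
        return widthPx >= minSizePx && heightPx >= minSizePx;
    }
}
```

Under this reading a 99×400 divider strip is dropped while a 100×100 image is kept. If the intent is instead to drop an image only when both dimensions are too small, the condition becomes `||`.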

**Alternatives considered**:
- No threshold → decorative icons pollute the figure index
- ML-based classification → accurate but adds model dependency; not needed at POC scale