# Research: Enhanced Embedding with Image Parsing and Metadata
**Branch**: `002-image-aware-embedding` | **Date**: 2026-04-03
This document resolves all technical unknowns identified during planning. The primary source for
decisions is the detailed architecture provided directly by the project owner, supplemented by
Spring AI 2.0.0-M4 API specifics.

---
## Decision 1: Document Hierarchy Model
**Decision**: Adopt a four-level hierarchy: `BookNode` → `ChapterNode` → `SectionNode` →
`TextChunkNode` + `FigureNode`. The `SectionNode` is the pivotal unit: it holds the full section
text in Postgres and is used for parent-child context expansion at retrieval time.
**Rationale**: A flat page-per-document model (current implementation) loses structural context.
When a user asks a multi-faceted clinical question, the LLM needs the surrounding section text,
not just the matching fragment. Parent-child retrieval — where chunks point to their parent
section — is the established pattern for RAG precision. The hierarchy also makes figure-to-section
association explicit and queryable.
**Alternatives considered**:
- Keep flat page model, add metadata only → rejected: insufficient for precise citation and
context expansion
- Chapter-level retrieval (coarser than section) → rejected: too much irrelevant context sent
to LLM; cost and latency increase
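
A minimal sketch of the hierarchy and of parent-child expansion, assuming Java records and an in-memory lookup in place of Postgres. The record fields and the `ParentExpansion` helper are hypothetical names for illustration; only the node type names come from the decision above. Each matched chunk is replaced by its parent section's full text, and a section is emitted once even when several of its chunks match.

```java
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical shapes for the hierarchy; field names are illustrative only.
record BookNode(String id, String title) {}
record ChapterNode(String id, String bookId, String title) {}
record SectionNode(String id, String chapterId, String title, String fullText) {}
record TextChunkNode(String id, String sectionId, int chunkIndex, String text) {}
record FigureNode(String id, String sectionId, String label, int page) {}

final class ParentExpansion {
    // Map retrieved chunks to their parent sections' full text.
    // Sections appear in first-hit order and each section is returned once.
    static List<String> expand(List<TextChunkNode> hits, Map<String, SectionNode> sectionsById) {
        return hits.stream()
                .map(TextChunkNode::sectionId)
                .distinct()
                .map(sectionsById::get)
                .filter(Objects::nonNull) // skip chunks whose section is missing
                .map(SectionNode::fullText)
                .toList();
    }
}
```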
---
## Decision 2: Image Extraction Strategy
**Decision**: Use PDFBox (already on classpath via `spring-ai-pdf-document-reader`) to extract
images per page. Each image is tagged with `page`, `figure_id` (derived from the caption, e.g.
"Fig. 12-4"), and the parent `sectionId`. Images are saved to local disk under
`./uploads/figures/{bookId}/` by default (see Decision 6).
**Rationale**: PDFBox is already present (Spring AI bundles it). No new dependency needed.
Per-page extraction ensures every image is captured regardless of PDF structure.
**Alternatives considered**:
- iText / iText7 → additional commercial dependency; overkill for extraction
- Screenshot each page as PNG, then OCR → far slower; loses vector quality
---
## Decision 3: Figure Content Representation
**Decision**: Generate a textual description of each extracted image using the OpenAI vision
model (GPT-4o). This description becomes the `content` field of the figure's vector store
document. The figure caption (parsed from the surrounding text) is also included to maximise
retrieval signal.
**Rationale**: Caption-only embedding would miss figures with no caption or with sparse labels.
Vision-generated descriptions produce richer semantic content (anatomy terms, structural
relationships) that matches clinical queries. The OpenAI client already in use supports image
inputs; no additional dependency is required.
**Alternatives considered**:
- Caption-only embedding → insufficient when captions are absent or terse (common in textbooks)
- Local vision model (LLaVA) → requires self-hosting; out of scope for POC
- OCR only → extracts text visible in image but misses non-text visual content (diagrams, MRI)
---
## Decision 4: Dual Vector Search
**Decision**: At query time, run two parallel similarity searches:
1. Text chunk search (filtered by `type = "TEXT"` and `book_id`)
2. Figure search over the embedded description and caption (filtered by `type = "FIGURE"` and `book_id`)

Results are merged and deduplicated. The LLM prompt receives the expanded parent section text
plus a structured figure reference list.
**Rationale**: A single search would rank text and figures against each other; figures with
terse captions would systematically lose to text chunks. Separate searches with independent
`topK` allow tuning each modality independently.
**Alternatives considered**:
- Single search, filter by relevance score → figure captions score lower than text; figures
are systematically under-retrieved
- Post-process text results to look up linked figures only → misses figures that are relevant
to the query but not explicitly referenced in the retrieved text chunks
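
The merge-and-deduplicate step can be sketched as follows, assuming each search result has already been reduced to an id, a type and a similarity score. The `Hit` record and `DualSearchMerger` class are hypothetical and do not correspond to Spring AI types. The two lists are concatenated rather than re-sorted by score, so figures are not pushed out by higher-scoring text chunks; when the same id appears twice, the higher-scoring entry is kept.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical hit type: a vector-store result reduced to id, modality and score.
record Hit(String id, String type, double score) {}

final class DualSearchMerger {
    // Merge two independently ranked lists. Text hits come first, then figure hits.
    // A repeated id keeps its first position and the higher of the two scores.
    static List<Hit> merge(List<Hit> textHits, List<Hit> figureHits) {
        Map<String, Hit> byId = new LinkedHashMap<>();
        for (List<Hit> hits : List.of(textHits, figureHits)) {
            for (Hit h : hits) {
                byId.merge(h.id(), h, (a, b) -> a.score() >= b.score() ? a : b);
            }
        }
        return new ArrayList<>(byId.values());
    }
}
```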
---
## Decision 5: Chunk-to-Figure Linking
**Decision**: During text parsing, whenever a pattern matching `Fig\.\s+\d+[\-\.]\d+` or
`Figure\s+\d+[\-\.]\d+` is found in a chunk, insert a row into the `chunk_figure_refs` table
linking `chunkId` → `figureId`. At retrieval time, after text chunks are retrieved, their
associated figures are fetched from this table and added to the LLM prompt.
**Rationale**: Explicit linking ensures that when a text chunk is retrieved, its referenced
figures are always surfaced — even if the figure's caption did not score highly in the vector
search. This is the higher-recall path; dual search (Decision 4) is the higher-precision path.
**Alternatives considered**:
- Rely entirely on dual vector search → may miss figures referenced in retrieved text but
scoring below the topK threshold in the figure search
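
A minimal sketch of the reference scan, using `java.util.regex` with the two patterns combined into one expression. The `FigureRefExtractor` class and the normalised `"12-4"` label form are assumptions made for illustration; how labels map to `figureId` values in `chunk_figure_refs` is left to the ingestion code. The period after `Fig` is escaped so that it matches a literal dot.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FigureRefExtractor {
    // Matches "Fig. 12-4", "Fig. 12.4", "Figure 12-4" and "Figure 12.4".
    private static final Pattern FIGURE_REF =
            Pattern.compile("\\b(?:Fig\\.|Figure)\\s+(\\d+)[-.](\\d+)");

    // Returns normalised labels such as "12-4", deduplicated, in order of first appearance.
    static Set<String> extract(String chunkText) {
        Set<String> labels = new LinkedHashSet<>();
        Matcher m = FIGURE_REF.matcher(chunkText);
        while (m.find()) {
            labels.add(m.group(1) + "-" + m.group(2));
        }
        return labels;
    }
}
```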
---
## Decision 6: Image Storage
**Decision**: Extracted images are saved as PNG files to a local directory
(`${app.figure-storage.base-path}`, defaults to `./uploads/figures/{bookId}/`). The path is
stored in `figure.image_path` in Postgres. A `FigureStorageService` interface wraps all disk
I/O so the implementation can be swapped to S3 or another object store without changing
callers.
**Rationale**: Local disk is the simplest viable option for a POC with <10 users. The interface
boundary satisfies Constitution Principle II (Easy to Change).
**Alternatives considered**:
- S3 from day 1 → operational overhead not justified at POC scale
- Base64 in Postgres JSONB → bloats DB; complicates backup; query performance degrades
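
The interface boundary might look like the sketch below. The source names `FigureStorageService` but not its methods, so the `store` signature and the `LocalDiskFigureStorage` class are assumptions. The implementation uses only `java.nio.file` and returns the path that would be persisted in `figure.image_path`; an S3-backed implementation could return an object key from the same method.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Assumed method signature; only the interface name comes from the design.
interface FigureStorageService {
    String store(String bookId, String figureId, byte[] pngBytes);
}

final class LocalDiskFigureStorage implements FigureStorageService {
    private final Path basePath; // e.g. the value of app.figure-storage.base-path

    LocalDiskFigureStorage(Path basePath) {
        this.basePath = basePath;
    }

    @Override
    public String store(String bookId, String figureId, byte[] pngBytes) {
        try {
            // Create {basePath}/{bookId}/ if needed, then write {figureId}.png into it.
            Path dir = Files.createDirectories(basePath.resolve(bookId));
            Path file = dir.resolve(figureId + ".png");
            Files.write(file, pngBytes);
            return file.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```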
---
## Decision 7: Figure Type Classification
**Decision**: Use the enum `FigureType { ANATOMICAL_DIAGRAM, SURGICAL_PHOTOGRAPH, MRI_CT_SCAN,
TABLE, CHART, INTRAOPERATIVE_IMAGE }`. Classification is derived from:
1. Caption keywords ("MRI", "CT", "Fig.", "Table"): heuristic, no model needed
2. Fall back to `ANATOMICAL_DIAGRAM` if unclassifiable

**Rationale**: Allows the frontend to render different icon/label per type (e.g., "MRI" badge).
Heuristic classification avoids a separate model call per image at extraction time.
**Alternatives considered**:
- Vision model classification → accurate but adds latency and cost per figure; deferrable
- Single `FIGURE` type → loses citation granularity required by spec FR-004
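
A sketch of the keyword heuristic, limited to the mappings the decision states: "MRI" or "CT" selects `MRI_CT_SCAN`, "Table" selects `TABLE`, and everything else falls back to `ANATOMICAL_DIAGRAM`. The `FigureTypeClassifier` class is hypothetical, and the rules for the remaining enum values are not specified in the source, so they are omitted. Word boundaries and case-sensitive matching for the scan keywords are an assumption, chosen so that words such as "section" do not match "CT".

```java
import java.util.regex.Pattern;

enum FigureType {
    ANATOMICAL_DIAGRAM, SURGICAL_PHOTOGRAPH, MRI_CT_SCAN, TABLE, CHART, INTRAOPERATIVE_IMAGE
}

final class FigureTypeClassifier {
    // Case-sensitive whole-word match for the imaging abbreviations.
    private static final Pattern SCAN = Pattern.compile("\\b(MRI|CT)\\b");
    // Case-insensitive whole-word match for "Table".
    private static final Pattern TABLE = Pattern.compile("\\bTable\\b", Pattern.CASE_INSENSITIVE);

    static FigureType classify(String caption) {
        if (caption == null || caption.isBlank()) {
            return FigureType.ANATOMICAL_DIAGRAM; // no caption: use the documented fallback
        }
        if (SCAN.matcher(caption).find()) {
            return FigureType.MRI_CT_SCAN;
        }
        if (TABLE.matcher(caption).find()) {
            return FigureType.TABLE;
        }
        return FigureType.ANATOMICAL_DIAGRAM;
    }
}
```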
---
## Decision 8: Metadata Schema for Vector Store Documents
**Decision**: All vector store documents carry a flat `Map<String, Object>` metadata for Spring
AI filtering. Schema:
| Field | Text Chunk | Figure Chunk |
|-------|-----------|-------------|
| `type` | `"TEXT"` | `"FIGURE"` |
| `book_id` | ✓ | ✓ |
| `book_title` | ✓ | ✓ |
| `chapter_id` | ✓ | ✓ |
| `section_id` | ✓ | ✓ |
| `section_title` | ✓ | ✓ |
| `page_start` | ✓ | — |
| `page_end` | ✓ | — |
| `chunk_index` | ✓ | — |
| `total_chunks` | ✓ | — |
| `figure_id` | — | ✓ |
| `figure_type` | — | ✓ |
| `image_path` | — | ✓ |
| `label` | — | ✓ |
| `page` | — | ✓ |
**Rationale**: Flat map is required by Spring AI `FilterExpressionBuilder`. Separation by `type`
allows independent filtering in dual search.
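
The schema above could be assembled with two small factory methods, sketched here with plain `java.util` maps. `SectionRef` and `ChunkMetadata` are hypothetical helper names; the keys and the `"TEXT"` / `"FIGURE"` values are taken from the table. The resulting maps are flat, with only string and integer values, so every field is available as a top-level filter key.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical carrier for the fields shared by both document types.
record SectionRef(String bookId, String bookTitle, String chapterId,
                  String sectionId, String sectionTitle) {}

final class ChunkMetadata {
    // Fields present on both text and figure documents.
    private static Map<String, Object> common(String type, SectionRef ref) {
        Map<String, Object> m = new LinkedHashMap<>();
        m.put("type", type);
        m.put("book_id", ref.bookId());
        m.put("book_title", ref.bookTitle());
        m.put("chapter_id", ref.chapterId());
        m.put("section_id", ref.sectionId());
        m.put("section_title", ref.sectionTitle());
        return m;
    }

    static Map<String, Object> forText(SectionRef ref, int pageStart, int pageEnd,
                                       int chunkIndex, int totalChunks) {
        Map<String, Object> m = common("TEXT", ref);
        m.put("page_start", pageStart);
        m.put("page_end", pageEnd);
        m.put("chunk_index", chunkIndex);
        m.put("total_chunks", totalChunks);
        return m;
    }

    static Map<String, Object> forFigure(SectionRef ref, String figureId, String figureType,
                                         String imagePath, String label, int page) {
        Map<String, Object> m = common("FIGURE", ref);
        m.put("figure_id", figureId);
        m.put("figure_type", figureType);
        m.put("image_path", imagePath);
        m.put("label", label);
        m.put("page", page);
        return m;
    }
}
```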
---
## Decision 9: Re-embedding Existing Books
**Decision**: Books already processed under feature 001 (text-only) are NOT automatically
re-embedded. An explicit re-embed action is exposed via `POST /api/v1/books/{id}/reembed`
(admin-triggered). The existing chunks remain valid for text queries until re-embedding completes.
**Rationale**: Automatic re-embedding on deploy would block the system and risk data loss if
the process fails mid-way. An explicit, idempotent trigger is safer and more observable.

---
## Decision 10: Minimum Image Size Threshold
**Decision**: Images smaller than 100×100 pixels are discarded and no chunk is created. This
threshold filters out decorative elements (bullets, dividers, publisher logos) without a
classification model.
**Rationale**: Neurosurgery textbook diagrams and MRI scans are never smaller than 100×100 px.
The threshold is configurable via `app.figure-storage.min-image-size-px` in
`application.properties`.
**Alternatives considered**:
- No threshold → decorative icons pollute the figure index
- ML-based classification → accurate but adds model dependency; not needed at POC scale
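
The threshold check itself is a one-line predicate, sketched below. The `ImageSizeFilter` class is hypothetical. The decision does not say whether "smaller than 100×100" applies to each dimension or to the area, so this sketch assumes an image is discarded when either its width or its height is below the threshold, which also drops long, thin dividers.

```java
final class ImageSizeFilter {
    private final int minSizePx; // e.g. the value of app.figure-storage.min-image-size-px

    ImageSizeFilter(int minSizePx) {
        this.minSizePx = minSizePx;
    }

    // Keep an image only when both dimensions reach the threshold (assumed interpretation).
    boolean accept(int widthPx, int heightPx) {
        return widthPx >= minSizePx && heightPx >= minSizePx;
    }
}
```

With the default of 100, `new ImageSizeFilter(100).accept(800, 5)` rejects a page-wide divider, while `accept(100, 100)` keeps an image that is exactly at the threshold.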